Finding the spirit of L-World
Brian Goetz
brian.goetz at oracle.com
Wed Jan 23 17:51:58 UTC 2019
> The key questions are around the mental model of what we're trying to accomplish and how to make it easy (easier?) for users to migrate to use value types or handle when their pre-value code is passed a valuetype. There's a cost for some group of users regardless of how we address each of these issues. Who pays these costs? Those migrating to use the new value types functionality? Those needing to address the performance costs of migrating to a values capable runtime (JDK-N?).
Indeed, this is the question. And, full disclosure, my thoughts have evolved since we started this exercise.
We initially started with the idea that value types were this thing “off to the side” — a special category of classes that only experts would ever use, and so it was OK if they had sharp edges. But this is the sort of wishful thinking one engages in when you are trying to do something that seems impossible; you bargain with the problem.
When we did generics, there was a pervasive belief that the complexity of generics could be contained to where only experts would have to deal with it, and the rest of us could happily use our strongly typed collections without having to understand wildcards and such. This turned out to be pure wishful thinking; generics are part of the language, and in order to be an effective Java programmer, you have to understand them. (And this only gets more true; the typing of lambdas builds on generics.)
The first experiments (Q world) were along the lines of value types being off to the side. While it was possible to build the VM that way, we ran into one problem after another as we tried to use them in Java code. Value types would be useless if you can’t put them in an ArrayList or HashMap, so we were going to have to migrate our existing libraries to be value-aware. And with the myriad distinctions between values and objects (different top types, different bytecodes, different type signatures), it was a migration nightmare.
In the early EG meetings, Kevin frequently stood up and said things like “it’s bad enough that we have a type system split in two; are you really trying to sell me one split in three? You can’t do that to the users.” (Thank you, Kevin.)
The problems of Q-world were in a sense the problems of erased generics — we were trying to minimize the disruption to the VM (a worthy goal), but the cost was that sharp edges were exposed to the users in ways they couldn’t avoid. And the solution of L World is: push more of it into the VM. (Obviously there’s a balance to be struck here.) And I believe that we are finally close to a substrate on which we can build a strong, stable tower, where we can compatibly migrate our existing billions of lines of code with minimal intrusion. So this is encouraging.
The vision of being able to “flatten all the way down”, and having values interact cleanly with all the other language features is hard to argue against. But as you say, the question is, who pays.
> One concern writ large across our response is performance. I know we're looking at user model here but performance is part of that model. Java has a well understood performance model for array access, == (acmp), and it would be unfortunate if we damaged that model significantly when introducing value types.
I agree that this is an expensive place to be making tradeoffs. Surely if the cost were that ACMP got .0000001% slower, it’s a slam dunk “who cares”, and if ACMP got 100000x slower, it’s a slam-dunk the other way. The real numbers (for which we’ll need data) will not be at either of these extremes, and so some hard decisions are in our future.
> Is this a fair statement of the project's goals: to improve memory locality in Java by introducing flattenable data? The rest of where we've gotten to has been working all the threads of that key desire through the rest of the java platform. The L/Q world design has come about from starting from a VM perspective based on what's implementable in ways that allow the JVM to optimize the layout.
It’s a fair summary, but I would like to be more precise.
Value types offer the user the ability to trade away some programming flexibility (mutability, subtyping) for flatter and denser memory layouts. And we want value types to interact cleanly with the other features of the platform, so that when you (say) put value types in an ArrayList, you still get flat and dense representations. So I think a good way to think about it is “enabling flattening all the way down”. (Flattenability also maps fairly cleanly to scalarizability, so the same tradeoffs that give us flattenability on the heap give us scalarization on the stack.)
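To make the tradeoff concrete, here is a minimal sketch in today's Java, which has no value types. The final class `Point` is only a stand-in: it gives up mutability and subtyping, the two things a value type would trade away. The names `Point`, `FlatteningSketch`, and `flatten` are made up for illustration. On a current JVM each `Point` is still a separate heap object, so the hand-built `int[]` only imitates the dense layout a value-aware VM could choose on its own.

```java
import java.util.Arrays;

// Stand-in for a value type (assumption: not real value-type syntax).
// final class => no subtyping; final fields => no mutation.
final class Point {
    final int x;
    final int y;

    Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    // "Changing" a field yields a new instance; the original is untouched.
    Point withX(int newX) {
        return new Point(newX, y);
    }
}

class FlatteningSketch {
    // Imitates a flattened Point[]: the x,y pairs are stored inline,
    // with no per-element header and no pointer to chase.
    static int[] flatten(Point[] points) {
        int[] flat = new int[points.length * 2];
        for (int i = 0; i < points.length; i++) {
            flat[2 * i] = points[i].x;
            flat[2 * i + 1] = points[i].y;
        }
        return flat;
    }

    public static void main(String[] args) {
        Point[] ps = { new Point(1, 2), new Point(3, 4) };
        System.out.println(Arrays.toString(flatten(ps)));
    }
}
```

The point of the exercise is that the user should never have to write `flatten` by hand; giving up mutability and subtyping is what lets the VM do it.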
Those are the performance goals. But there are also some “all the way up” goals I’d like to state. Programming with value types should interact cleanly with the rest of the platform; writing code that is generic over references and values should only be slightly harder than writing code that is generic only over erased references. Users should be able to reason about the properties of Object, which means reasoning about the union of references and values. Otherwise, we may gain performance, but we’ve turned Java into C++ (or worse), and one of the core values of the platform will be gone.
Balancing these things is very tricky, and I think we’re still spiraling in on the right balance. Q World was way too far off in one direction; it gave the experts what they needed but at the cost of making everyone’s language far more complex and hard to code in, and creating intractable migration problems. I think L World is much closer to where we want to be, but I think we’re still a little too much focused on bottom-up decision making, and we need to temper that with some top-down “what language do we get, and is it the one we want” thinking. I am optimistic, but I’m not declaring victory yet.
> One of the other driving factors has been the desire to have valuetypes work with existing collections classes. And a further goal of enabling generic specialization to allow those collections to get the benefits of the flattened data representations (ie: backed by flattened data arrays).
Yes. I think this is “table stakes” for this exercise. Not being able to use HashMap with values, except via boxing, would be terrible; not being able to generify over all the types would be equally terrible. And one of the biggest assets of the Java ecosystem is the rich set of libraries; having to throw them all out and rewrite them (and deal with the migration mess from OldList to NewList) could well be the death sentence.
We don’t have to get there all at once; the intermediate target (L10) is “erased generics over values”, which gives us reuse and reasonable calling conventions but not yet flattening. But that has to lead to a sane generics model where values are first-class type arguments, with flattening all the way down.
> The other goal we discussed in Burlington was that pre-value code should be minimally penalized when values are introduced, especially for code that isn't using them. Otherwise, it will be a hard sell for users to take a new JDK release that regresses their existing code.
Yes, I think the question here is “what is minimal.” And the answer is going to be hard to quantify, because there are slippery slopes and sharp cliffs everywhere. If we have some old dusty code and just run it unchanged on a future JVM, there probably won’t be many value types flying around, so speculation might get us 99% of the way there. But once you start mixing that old legacy code with some new code that uses values, it might be different.
Also, bear in mind that values might provide performance benefits to non-value-using code. For example, say we rewrite HashMap using values as entries. That makes for fewer indirections in everyone’s code, even if they never see a value in the wild. Do we count that when we are counting the “value penalty” for legacy code?
So, we have to balance the cost to existing code (that never asked for values) with the benefits to future code that can do amazing new things with values.
> Does that accurately sum up the goals we've been aiming for?
With some caveats, it’s a good starting point :)
>
> A sensible rationalization of the object model for L-World would be to
> have special subclasses of `Object` for references and values:
>
> ```
> class Object { ... }
> class RefObject extends Object { ... }
> class ValObject extends Object { ... }
> ```
>
> Would the intention here be to retcon existing Object subclasses to instead subclass RefObject? While this is arguably the type hierarchy we'd have if creating Java today, it will require additional speculation from the JIT on all Object references in the bytecode to bias the code one way or the other. Some extra checks plus a potential performance cliff if the speculation is wrong and a single valuetype hits a previously RefObject-only callsite.
That was what I was tossing out, yes. This is one of those nice-to-haves that we might ultimately compromise on because of costs, but we should be aware what the costs are. It has some obvious benefits (clear statement of reality, brings value-ness into the type system.) And the fact that value-ness wasn’t reflected in the type system in Q world was a real problem; it meant we had modifiers on code and type variables like “val T” that might have been decent prototyping moves, but were not the language we wanted to work with.
That said, if the costs are too high, we can revisit.
> ```
> interface Nullable { }
> ```
>
> which is implemented by `RefObject`, and, if we support value classes
> being declared as nullable, would be implemented by those value
> classes as well. Again, this allows us to use `Nullable` as a
> parameter type or field type, or as a type bound (`<T extends
> Nullable>`).
> I'm still unclear on the nullability story.
Me too :) Some recent discussions have brought us to a refined view of this problem, which is: what’s missing from the object model right now is not necessarily nullable values (we already have these with L-types!), but classes which require initialization through their constructor in order to be valid. This is more about “initialization safety” than nullability. Stay tuned for some fresh ideas here.
>
>
> #### Equality
>
> The biggest and most important challenge is assigning sensible total
> semantics to equality on `Object`; the LW1 equality semantics are
> sound, but not intuitive. There's no way we can explain why for
> values, you don't get `v == v` in a way that people will say "oh, that
> makes sense." If everything is an object, `==` should be a reasonable
> equality relation on objects. This leads us to a somewhat painful
> shift in the semantics of equality, but once we accept that pain, I
> think things look a lot better.
>
> Users will expect (100% reasonably) the following to work:
>
> ```
> Point p1, p2;
>
> p1 == p1 // true
>
> p2 = p1
> p1 == p2 // true
>
> Object o1 = p1, o2 = p2;
>
> o1 == o1 // true
> o1 == o2 // true
> ```
> We ran into this problem with PackedObjects which allowed creating multiple "detached" object headers that could refer to the same data. While early users found this painful, it was usually a sign they had deeper problems in their code & understanding. One of the difficulties was that depending on how the PackedObjects code was written, == might be true in some cases. We found a consistent answer was better - and helped to define the user model.
I am deeply concerned that this is wishful thinking based on performance concerns — and validated with a non-representative audience. I’d guess that most of the Packed users were experts who were reaching for packed objects because they had serious performance problems to solve. (What works in a pilot school for gifted students with hand-picked teachers, doesn’t always scale up to LA County Unified.)
I think that we muck with the intuitiveness of `==` at our peril. Of all the concerns I have about totality, equality is bigger than all the rest put together.
> In terms of values, is this really the model we want? Users are already used to needing to call .equals() on equivalent objects. By choosing the answer carefully here, we help to guide the right user mental model for some of the other proposals - locking being a key one.
I think this is probably wishful thinking too. A primary use case for values is numerics. Are we going to tell people they can’t compare numerics with ==? And if we base `==` on the static type, then we’ll get different semantics when you convert to Object. But conversion to Object is not a boxing conversion — it’s a widening conversion. I’m really worried about this.
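Today's primitives and their boxes already show the kind of split that a static-type-based `==` produces. The sketch below uses only existing Java behavior; nothing in it is a value type, and the class and method names are made up for illustration. The difference for values is that going to `Object` would be a widening conversion rather than a boxing conversion, so there would be no box to blame for the change in meaning.

```java
class StaticTypeEquality {
    // Static types are int: == compares numeric state.
    static boolean primitiveEq(int a, int b) {
        return a == b;
    }

    // Static types are Object: == compares identity.
    static boolean referenceEq(Object a, Object b) {
        return a == b;
    }

    public static void main(String[] args) {
        int a = 1000, b = 1000;
        System.out.println("int ==        : " + primitiveEq(a, b));

        // Boxing conversion to Integer, then widening to Object.
        // Outside the small-integer cache the two boxes are typically distinct objects.
        Object oa = a, ob = b;
        System.out.println("Object ==     : " + referenceEq(oa, ob));
        System.out.println("Object equals : " + oa.equals(ob));
    }
}
```

The same two numbers are compared on every line; only the static type at the comparison site changes what `==` means.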
>
> While the conceptual model may be clean, it's also, as you point out, horrifying. Trees and linked structures of values become very very expensive to acmp in ways users wouldn't expect.
I’m not sure about the “expect” part. We’re telling people that values are “just” their state (even if that state is rich.) Wouldn’t you then expect equality to be based on state?
>
> If we do this, users will build the mental model that values are interned and that they are merely fetching the same instances from some pool of values. This kind of model will lead them down rabbit holes - and seems to give values an identity. We've all seen abuses of String.intern() - do we want values to be subject to that kind of code?
That’s not the mental model that comes to mind immediately for me, so let’s talk more about this.
>
> The costs here are likely quite large - all objects that might be values need to be checked, all interfaces that have ever had a value implement them, and of course, all value type fields plus whatever the Nullability model ends up being.
I would say that _in the worst case_ the costs could be large, but in the common cases (e.g., Point), the costs are quite manageable: the cost of a comparison is a bulk bit comparison. That’s more than a single word comparison, but it’s not so bad.
I get that this is where the cost is — I said up front, this is the pill to swallow. Let’s figure out what it really costs.
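As a rough illustration of that cost profile, here is a hand-written sketch of state-based comparison in today's Java. The classes `Point` and `Segment` and the `same` methods are hypothetical stand-ins, not the VM's actual `acmp` implementation, and the sketch assumes all fields are non-null. The flat case is a couple of word comparisons; the nested case recurses, which is where deep structures would get expensive.

```java
// Stand-ins for value types (assumption: final classes with final fields).
final class Point {
    final int x;
    final int y;

    Point(int x, int y) {
        this.x = x;
        this.y = y;
    }
}

final class Segment {
    final Point start;
    final Point end;

    Segment(Point start, Point end) {
        this.start = start;
        this.end = end;
    }
}

class StateEqualitySketch {
    // Flat value: two int comparisons, no pointer chasing.
    static boolean same(Point a, Point b) {
        return a.x == b.x && a.y == b.y;
    }

    // Nested value: the comparison recurses into each value-typed field,
    // so the cost grows with the depth and width of the structure.
    static boolean same(Segment a, Segment b) {
        return same(a.start, b.start) && same(a.end, b.end);
    }

    public static void main(String[] args) {
        Segment s1 = new Segment(new Point(0, 0), new Point(3, 4));
        Segment s2 = new Segment(new Point(0, 0), new Point(3, 4));
        System.out.println(same(s1, s2)); // state-based comparison
        System.out.println(s1 == s2);     // identity comparison on today's JVM
    }
}
```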
>
>
> #### Identity hash code
>
> Because values have no identity, in LW1 `System::identityHashCode`
> throws `UnsupportedOperationException`. However, this is
> unnecessarily harsh; for values, `identityHashCode` could simply
> return `hashCode`. This would enable classes like `IdentityHashMap`
> (used by serialization frameworks) to accept values without
> modification, with reasonable semantics -- two objects would be deemed
> the same if they are `==`. (For serialization, this means that equal
> values would be interned in the stream, which is probably what is
> wanted.)
>
> By returning `hashCode`, do you mean call a user defined hashCode function? Would the VM enforce that all values must implement `hashCode()`? Is the intention they are stored (growing the size of the flattened values) or would calling the hashcode() method each time be sufficient?
I would prefer to call the "built-in” value hashCode — the one that is deterministically derived from state. That way, we preserve the invariant that == values have equal identity hash codes.
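A minimal sketch of that invariant, again using a final class as a stand-in for a value type. The `stateHash` function and its 31-based mixing are assumptions chosen for illustration; the actual built-in hash is not specified here. What matters is that the hash is computed from the fields alone, so two instances with the same state always agree, with nothing stored in the value itself.

```java
// Stand-in for a value type (assumption: not real value-type syntax).
final class Point {
    final int x;
    final int y;

    Point(int x, int y) {
        this.x = x;
        this.y = y;
    }
}

class ValueHashSketch {
    // State-based equality, standing in for == on values.
    static boolean same(Point a, Point b) {
        return a.x == b.x && a.y == b.y;
    }

    // Hypothetical "built-in" hash: derived deterministically from state,
    // recomputed on demand rather than stored in a header or extra field.
    static int stateHash(Point p) {
        return 31 * Integer.hashCode(p.x) + Integer.hashCode(p.y);
    }

    public static void main(String[] args) {
        Point p1 = new Point(1, 2);
        Point p2 = new Point(1, 2);
        System.out.println(same(p1, p2));
        System.out.println(stateHash(p1) == stateHash(p2));
        // On today's JVM these are distinct objects, so the real
        // System.identityHashCode gives no such guarantee for p1 and p2.
    }
}
```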
>
> The only consistent answer here is to throw on lock operations for values. Anything else hides incorrect code, makes it harder for users to debug issues, and leaves a mess for the VM. As values are immutable, the lock isn't protecting anything. Code locking on unknown objects is fundamentally broken - any semantics we give it comes at a cost and doesn't actually serve users.
I don’t disagree. The question is, what are we going to do when Web{Logic,Sphere} turns out to be locking on user objects, and some user passes in a value? Are we going to tell them “go back to Java 8 if you don’t like it”? (Serious question.) If so, then great, sign me up!
>
To be continued…