hg: mlvm/mlvm/hotspot: value-obj: first cut

Fri Oct 19 17:28:25 PDT 2012

On 2012-10-18, at 6:36 PM, Remi Forax <forax at univ-mlv.fr> wrote:
>> Offhand, I don't know of any library code that manipulates Integers as Integers and makes any sort of promises about their reference identity, except for the nonsense about small-value interns (which we probably replicate, because it is easy to do so).  ArrayLists and HashMaps store them in Object-typed containers, so they'll retain their reference identity there. (But there's a lot of library code, and I don't know it now as well as I used to).
> 
> IdentityHashMap<Integer, X>,
> https://duckduckgo.com/?q=IdentityHashMap%3CInteger

But it was my understanding that as yet, and for quite some time, generics are NOT reified, and that even when they are available, we're going to be careful at first about how we use them.

So at the bytecode level, that's all Objects, right?  I'm looking at this all from the point-of-view of bytecodes.
For the case of IdentityHashMap<Integer>, from old or new code, the underlying IdentityHashMap is really handling Objects, and so even though it is recompiled into "new" code, because the static type is Object at the bytecode level, the data will flow as refs, and not as Integers, and the identity will not be lost.
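
A small sketch of the erasure point, using only today's JDK (nothing here comes from the mlvm patch, and the class and method names are made up for illustration). It looks up the erased signature of put reflectively, and shows that the map finds a key by reference:

```java
import java.lang.reflect.Method;
import java.util.IdentityHashMap;

class ErasedIdentityMapSketch {
    // After erasure, put(K, V) is put(Object, Object) at the bytecode level.
    static Class<?>[] erasedPutParameterTypes() {
        try {
            Method put = IdentityHashMap.class.getMethod("put", Object.class, Object.class);
            return put.getParameterTypes();
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException(e);
        }
    }

    // IdentityHashMap compares keys with ==, so a lookup with the very same
    // reference that was stored must succeed.
    static boolean findsSameReference() {
        IdentityHashMap<Integer, String> map = new IdentityHashMap<>();
        Integer key = Integer.valueOf(1000);
        map.put(key, "x");
        return "x".equals(map.get(key));
    }

    public static void main(String[] args) {
        System.out.println(erasedPutParameterTypes()[0]);
        System.out.println(findsSameReference());
    }
}
```

If the Integer were unboxed and reboxed between the put and the get, the second reference could differ and the lookup could miss; that is the identity that has to survive.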

Am I relying on an unshared assumption?
I assume that value-typed values will be boxed whenever they are passed into a place where their static type is a reference type.
This means that many of the hoped-for benefits of value types won't arrive until generics are reified.

(I gather I need to look at Jim Laskey's tagged arrays.  From a superficial look, I'm not sure this is a huge problem; in any case, these are not yet a massive chunk of installed base.)

Maybe we need to think about both transitions (value types and reified generics) at the same time, I am not sure.
It looks to me like value safety has to come first and become widespread, but maybe I have that wrong.

> I disagree, at least for wrapper, if the code uses valueOf(), it means 
> you don't care about the identity.

Right, but I took care not to use valueOf in my example.  I took care to use methods that were declared to either return "the same" thing (toString) or a "new thing" (substring).  That code I wrote up there is designed to pass both assertions under the current ("old") semantics.
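
The example referred to above is in the earlier message and is not reproduced here. As a stand-in, this is a hypothetical sketch of the same two kinds of check: String.toString() is documented to return the receiver itself, while substring(1) on a longer string produces a fresh object on current HotSpot builds (an implementation observation, assumed here, not a guarantee).

```java
class SameOrNewSketch {
    // toString() on a String is specified to return this very object.
    static boolean toStringKeepsIdentity(String s) {
        return s.toString() == s;
    }

    // Two substring(1) calls yield equal contents...
    static boolean substringsAreEqual(String s) {
        return s.substring(1).equals(s.substring(1));
    }

    // ...but whether they are distinct objects depends on the implementation.
    static boolean substringsAreDistinctObjects(String s) {
        return s.substring(1) != s.substring(1);
    }

    public static void main(String[] args) {
        String s = "hello";
        System.out.println(toStringKeepsIdentity(s));
        System.out.println(substringsAreEqual(s));
        System.out.println(substringsAreDistinctObjects(s));
    }
}
```

Old code that asserts on either identity result is the kind of code whose behavior would have to be preserved.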

> Given that because of the overriding you can have a mix of old code and 
> new code in the very same method (with inlining),
> I don't think that the version of the code is something useful here. And 
> as I said earlier, it will not be backward compatible,
> i.e. old code compiled with the new version will behave differently.

I think, if we define value types properly at the language level, that old code that would be sensitive to value-type-identity will fail to recompile with a new javac, so it will "behave differently", but it won't be a silent change, nor will it be one that the VM has to worry about.  Maybe this is too big a leap for one revision; maybe we include some sort of compatibility flag that allows the class to be tagged for old behavior.  If we're ever going to get to well-behaved value types, we're going to need to ban some currently-legal idioms, so one day, there will be some "old source code" that fails to recompile.

I was under the impression that inlining already kept track of the source/flavor of code; for example, strictfp is handled properly in the face of inlining, so this is not new hair.

>> Ugh, a somewhat more annoying question, inspired by trying to find an example.
>> What happens if we have a "volatile Complex" (using either strategy)?
>> Assume, for the sake of amusement, that Complex is implemented with a pair of doubles, hence is 128 bits at minimum in its value representation.  (Possible implementation -- as a value if there's a native CAS of the right size, otherwise as a reference.)
> 
> good question, you can hope that the current CPU understands an 
> instruction like CMPXCHG16 or the JIT will not be able to unbox the 
> Complex value, I suppose.

I think it's worse than this (I've been thinking about this on and off yesterday/today).  Suppose you are CASing a type that happens to be a value type, and happens to unbox to more bits than are supported by your native atomic operations.  So it has to be boxed (and we'll do whatever it takes to make sure it gets boxed).  But, in the intervening code, which is "ordinary" Java that just happens to be bracketed by a LOAD-CAS, that value might get unboxed and reboxed, and that-would-be-bad (the CAS would fail, gratuitously, because the pointer would change).  Maybe we take the approach that in local variables, if a ref is the source for a value-typed value, then it is represented as a ref+value pair (John pointed this out), and if one or the other happens to go dead, that's okay.
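
To make the gratuitous failure concrete, here is a minimal sketch in today's Java. Complex and rebox are hypothetical stand-ins; rebox simulates what an unbox/rebox round trip inside the LOAD-CAS bracket would do to the reference.

```java
import java.util.concurrent.atomic.AtomicReference;

class ReboxedCasSketch {
    // Hypothetical 128-bit value type, modeled as an ordinary final class.
    static final class Complex {
        final double re, im;
        Complex(double re, double im) { this.re = re; this.im = im; }
    }

    // Simulates unboxing to bits and boxing again: same value, new reference.
    static Complex rebox(Complex c) {
        return new Complex(c.re, c.im);
    }

    // LOAD-CAS where the loaded reference is carried through unchanged.
    static boolean casWithOriginalRef() {
        AtomicReference<Complex> cell = new AtomicReference<>(new Complex(1.0, 2.0));
        Complex seen = cell.get();
        return cell.compareAndSet(seen, new Complex(3.0, 4.0));
    }

    // LOAD-CAS where the value was reboxed in between; compareAndSet
    // compares references, so it fails although nobody else wrote the cell.
    static boolean casAfterRebox() {
        AtomicReference<Complex> cell = new AtomicReference<>(new Complex(1.0, 2.0));
        Complex seen = rebox(cell.get());
        return cell.compareAndSet(seen, new Complex(3.0, 4.0));
    }

    public static void main(String[] args) {
        System.out.println(casWithOriginalRef()); // true
        System.out.println(casAfterRebox());      // false
    }
}
```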

Doing this in general would change the wrapper strategy somewhat; rather than blindly calling the preferred-unboxed version in the new code, the ref would also (optionally, perhaps null if missing) be passed in and through.  This adds overhead in the general case; either we really do create two versions of a method (instead of just a wrapping stub), or we impose the overhead of passing both bits and a perhaps-null ref through.  And obviously, if the ref is null (because the caller was "new" code, passing only a value) but is later used, then a new one needs to be allocated.  This is not a perfect solution to the need to maintain ref identity, but I think it works in practice.  It nicely deals with the case of String.toString() without appealing to a more specific hack, and I think it works for the instances of LOAD-CAS that I've seen.
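
A rough source-level sketch of the ref+value idea, under the assumption that the pair can be modeled as a small carrier object (in the VM it would presumably be two locals or arguments, not an allocation). Carrier, Complex and asRef are hypothetical names.

```java
class RefPlusValueSketch {
    // Hypothetical value type, modeled as an ordinary final class.
    static final class Complex {
        final double re, im;
        Complex(double re, double im) { this.re = re; this.im = im; }
    }

    // The unboxed bits plus a perhaps-null originating reference.
    static final class Carrier {
        final double re, im;
        private Complex ref; // null when the caller passed only a value

        // "Old" caller: a ref is the source, so keep it alongside the bits.
        Carrier(Complex ref) {
            this.re = ref.re;
            this.im = ref.im;
            this.ref = ref;
        }

        // "New" caller: only the value is passed.
        Carrier(double re, double im) {
            this.re = re;
            this.im = im;
        }

        // Returns the original ref if there was one; otherwise allocates a box
        // on first use and keeps returning that same box afterwards.
        Complex asRef() {
            if (ref == null) {
                ref = new Complex(re, im);
            }
            return ref;
        }
    }

    public static void main(String[] args) {
        Complex original = new Complex(1.0, 2.0);
        System.out.println(new Carrier(original).asRef() == original); // true
    }
}
```

With this shape, a LOAD-CAS bracket that started from a ref still sees that ref at the CAS, and code that only needs the bits never touches it.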

>> Another screw case is array-of-Complex.  Do we have a single representation for such a type, or two?  If it is just one, then if it uses refs, then it loses performance in a big way, if it uses values, it loses object identity in a big way.
> 
> As you said earlier, you can switch from one representation to another 
> but I admit that this part is fuzzy for me.

I haven't figured a way out of it, and I am starting to wonder if we just need a completely separate kind of array.  If the standard array ever goes to a value representation, then poof, we lose all object identity, and it can't be restored.  Tagged instances don't seem to help here, either; if I store a million tagged-as-value instances into an array, and then store one tagged-as-ref instance into that array, if I have to convert the whole array to a ref format, it would be obnoxious.  In the short run, I think "arrays 1.0" stays at 1.0.

Not sure how we ever get to the new-array-replaces-old-array place; it seems like this is only possible after we've had value types in the language for a while.

Note that even down the road, even once all code is value safe, if arrays are all "2.0", we have to deal with issues like

  ValType[] vta = ...
  Object[] oa = (Object[]) vta;
  Object o = oa[0];
  Object p = oa[1];
  Pair<Object, Object> pair = new Pair<Object, Object>(o, p);

My assumption is that whether reified or erased, the two fields of "pair" are both statically java/lang/Object, and contain refs.
That means that the code to load o and p from oa has to check, before loading, that oa is an array of value type, and obtain boxed representations of the values stored at index 0 and 1.
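
The check-before-load can be sketched at the source level as follows. This is only an illustration under stated assumptions: FlatComplexArray is a hypothetical stand-in for a "2.0" array that stores the bits inline, and loadElement plays the role of the array load seen through an Object[]-typed view.

```java
class ArrayLoadSketch {
    // Hypothetical value type, modeled as an ordinary final class.
    static final class Complex {
        final double re, im;
        Complex(double re, double im) { this.re = re; this.im = im; }
    }

    // Stand-in for an array of values: two doubles per element, no refs.
    static final class FlatComplexArray {
        final double[] bits;
        FlatComplexArray(int length) { bits = new double[2 * length]; }
        void set(int i, double re, double im) {
            bits[2 * i] = re;
            bits[2 * i + 1] = im;
        }
    }

    // Load an element as an Object: box if the array holds values,
    // otherwise do an ordinary reference load.
    static Object loadElement(Object array, int i) {
        if (array instanceof FlatComplexArray) {
            FlatComplexArray flat = (FlatComplexArray) array;
            return new Complex(flat.bits[2 * i], flat.bits[2 * i + 1]);
        }
        return ((Object[]) array)[i];
    }

    public static void main(String[] args) {
        FlatComplexArray vta = new FlatComplexArray(2);
        vta.set(0, 1.0, 2.0);
        // Two loads of the same slot give two different boxes.
        System.out.println(loadElement(vta, 0) == loadElement(vta, 0)); // false
    }
}
```

Note that each load from the flat form allocates a fresh box, so any identity the element had before it was stored is gone.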

>> It seems like this has to wait on either reified generics, or a different, new "array" type.
> 
> Note that you can already use the tagged array interface of Jim Laskey 
> for that.

I don't know the tagged array interface yet; seems like I have more reading to do.

>>>> - we can use different compilation strategies for code depending on its bytecode version number.
>>> No, you can't.
>> Is this a "no, derived from compatibility arguments not sufficiently explained here", or a "no, that is architecturally impossible given the current system"?
> 
> No because it will break backward compatibility exactly like new 
> ArrayList doesn't infer its type even if java version is greater than 5.

I'm not following this; by different compilation strategies, I assume that "old" bytecodes treat value types as boxed, and "new" bytecodes treat them as unboxed.  The only time you notice a difference is when you mess with identity, and we've already declared that in new bytecodes those questions won't get asked, and in return, the representation will be unboxed.  But old bytecodes expect a certain identity semantics.  I'm not seeing how this breaks backwards compatibility, except at the interface between old and new, where recompilation under new semantics effectively declares that newly recompiled code might never return "the same" value object (in the style of String.toString).

And I think you want the behavior-in-old-code preserved regardless of whether the value types were allocated in "new" code or in "old" code, once the values leak out into old code.  That's why I'm not sold on the idea of tagging things with how they were allocated.  If we make old code behave the same, and can then start reasoning about changes at the old/new interface, I think the problem is more tractable.

A second issue is a variant on the "array" issue, which is shape-of-containers.  Suppose a class has a field that is a value type.  If we treat the field as merely a container for the value (and not a container for the bits), stores to the field will lose ref identity.  Here's a proposed set of decisions for how to represent value-typed fields:

(in all cases, "field" static type is a value type):

old class, instance field = ref

old class, static field = ref
new class, static field = ref (was thinking ref+value, but concurrency hair results)

new ref class, instance field = ref or value, not sure yet.
new value class, instance field = value

The reasoning here is that "ref" is preferred for compatibility with old code, so we only use value where it will really help performance.
I'm assuming that statics more often contain objects used as locks and are less often updated, so the overhead of boxing is not so bad, and the compatibility benefit is likely.
For old classes, we have to assume that ref identity will be observed, so value types are nonetheless stored in boxed form.
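
The proposed decisions above can be written down as a small lookup. This is purely a restatement of the list; the enum and method names are hypothetical, and UNDECIDED marks the case left open.

```java
class FieldRepresentationSketch {
    enum Repr { REF, VALUE, UNDECIDED }
    enum Holder { OLD_CLASS, NEW_REF_CLASS, NEW_VALUE_CLASS }

    // Representation of a field whose static type is a value type.
    static Repr choose(Holder holder, boolean isStatic) {
        if (holder == Holder.OLD_CLASS) {
            return Repr.REF; // old code may observe ref identity
        }
        if (isStatic) {
            return Repr.REF; // ref+value was considered, but concurrency hair results
        }
        if (holder == Holder.NEW_VALUE_CLASS) {
            return Repr.VALUE; // where unboxing really helps performance
        }
        return Repr.UNDECIDED; // new ref class, instance field: ref or value, not sure yet
    }

    public static void main(String[] args) {
        for (Holder h : Holder.values()) {
            System.out.println(h + " instance=" + choose(h, false) + " static=" + choose(h, true));
        }
    }
}
```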

For new classes, I'm not sure.  We get better performance (relative to boxing/unboxing and locality) if we store values as values, but that loses reference identity.  Maybe we've done enough passing refs through old/new interfaces in a ref+value form.

David