experiments with fused strings
John Rose
john.r.rose at oracle.com
Mon Jun 1 21:31:09 UTC 2015
This dialog is cut-n-pasted from the Valhalla chat room. I'd like to continue it via Email.
[5/22/15, 4:54:22 PM] John Rose:
Here’s an experiment I’d like to try some day; maybe soon. It has three forks: 1. create a hybrid instance/array object with a variable tail of raw primitive data, 2. adapt var-handles to access data in the tail (the array part), 3. make the var-handles (temporary access capabilities) optimize away using a combination of EA and Heisenbox logic. Then, 4. shake until optimized. The end result will be a credible upgrade for our current string representation (fusing the header and data array).
[5/22/15, 4:55:50 PM] Stuart W. Marks:
Kind of like the old char[1]-at-the-end-of-the-struct trick.
John:
Yes, exactly so. Note that this is orthogonal to the char/byte union. We want both tricks.
I’m thinking that Paul S. has most of the parts for 2. and Roland can make progress on 3/4. Main sticking point is a prototype for 1. My inclination is to start with Stefan K.’s cleaned up mapping logic.
It would be a small-ish demo use case showing the interplay of flattened objects, array-like access methods, and value-like types. (The var-handles have to at least compile to identity-free values, or we cannot replace strings with the fancy logic.)
[5/26/15, 3:30:29 AM] Paul Sandoz:
John, so for say a String instance the character array length and elements will be aligned (perhaps with suitable padding) after that of the hash code field?
Paul:
Oh, and about hashCode, it sucks that the algorithm is explicitly specified, i don't think it's possible to vectorize by viewing as long[]
Re, VarHandles, i suspect it comes down to what object would be used via Unsafe to access the array contents. If it's the hybrid instance, at a particular offset from that (based on the hybrid class layout), then i believe it should be easy to create a VarHandle from the hybrid type (e.g. Unsafe.getEmbeddedArrayOffset(Class<? extends …> ) and stash it as a final static field.
John:
Yes: struct Instance_of_String { headerBits header; jint hash_field; long value_bits[]; }
Vectorizing the hash could be done, awkwardly, as a packed 4x16-bit multiply times 31 exp (iota 4).
In places (like JVM internals) where we have freedom to discard that thing, we should use FarmHash.
Yes, that's about how the VarHandle play would work. For String per se we could use Unsafe directly, but VHs are nicer. Also, the creation of the hybrids can be done via Unsafe and wrapped in a method handle, as if by Lookup.findConstructor. Eventually our bytecodes should learn to do the <new> dance instead of the <init> dance, and then we can push the MH trick back down to the bytecode level. But we don't have to depend on it for prototyping.
Stuart:
Where would the length be stored?
I'd prefer to smash it into the value_bits, in a variable-length form that favors short strings.
string_value := short_bytes | long_bytes | long_chars; short_bytes := length(1..255) byte[length]; long_bytes := length(int32) | byte[length] | long_chars; long_chars := negative_length(int32) | char[length]
(Some extra fiddling required to disambiguate the short guy from the long ones; it's easiest if the long cases can be expressed in 24 bits so the first byte can be zero.)
Stuart:
I think we need a zero-length string.
[5/26/15, 1:35:07 PM] Brian Goetz:
How about a negative length string, which, when concatenated with another string, lops characters off the end?
Stuart:
Is that like an anti-string?
John:
The empty string does not need to be covered by the short case; it can be the shortest long case and still hide in one cache line. (Size distributions of such things always tend to be singular at the zero point.)
Stuart:
Oh! So zero-length is allowed, it's just that it's the "long" form. OK.
John:
Yes. And you only need a small short form, since once you get past a few cache lines, you can afford to drop in a whole 32-bit size field. (Any log-sized length thingy is only a real overhead when its value is very close to zero.)
Open question whether to put in a short form for UTF16 also. I would make the decision by examining code paths for decoding the cases.
Stuart:
Have you been talking to Sherman and Brent?
John:
Not recently. I know about their work. (Actually, Sherman's part of it.) It's moving usefully in this direction, to the point of having two formats for the string body. I've mainly talked to Vladimir about his part in it.
Stuart:
I know they're working on multiple (well, two) representations, but not much more than that.
I'd be uncomfortable tying the object structure/layout to the actual string encoding though.
The current model for storing strings in bytes is ASCII or maybe ISO 8859-1, but I keep hoping for UTF-8.
On another note, I guess you've talked to the GC guys about the additional complexity determining the object's size, based on information not in the object header. Or is there additional stuff in the header?
John:
Good point; we want the object structure to be just a bag of bits (probably vectorized in longs). The interpretation of that structure would not be wired into the library but rather defined by Java library code. The GC will not know about this code. There will have to be a lower-level length indicator for the GC, probably just an array length field like today. (It might contribute to the library level sizing logic, or not.)
I have been talking to the GC folks about this sort of thing for a long time. Stefan K's refactoring of GC object mapping code may help clear the ground for it.
Stuart:
(Sorry, I keep coming up with issues.) There is also the object identity issue that might arise when switching encodings.
If a string is re-encoded its size might change; that implies a different object, right? Unfortunately == will need to stay the same.
I happen to know that the permissions libraries depend on == on interned Strings, for instance....
OTOH maybe encodings don't change on the fly, and a String has whatever encoding it has at creation time. I'm not sure of the current state of the design.
Paul:
John, it sounds like we need a few fundamental things to get this working:
- Methods on unsafe to 1) instantiate an object with a certain sized tail of bytes (length packing rules could apply) 2) obtain the offset to the tail; and 3) obtain the tail length, which should be treated as an unsigned value regarding bounds checks (perhaps via the Array check range methods Roland is working on to intrinsify)
- GC to recognise this structure
Once that is in place we can build on top, like you say with MHs and VHs, repeat and rinse until the generated code is good. (I dunno if there are issues for class redefinition and debugging, probably…)
Paul: Re: hash code, yes for chars i think it might be possible since they are unsigned IIRC h = +[31^3, 31^2, 31^1, 31^0] * [a[0], a[1], a[2], a[3]] + 31^4 * h. I was struggling in Java code with byte[], the signed nature really seems to mess things up. Perhaps it might be easier in machine code but i could not find a way in Java code. Also, perhaps we should include new hash code methods in Arrays?
John, what happens if one invokes a MemberName with fewer arguments than it's arity?
For VarHandles it's very convenient to link to a single method that accepts no arguments and throws an exception for cases where an operation is unsupported (e.g. update operations for final fields or numeric operations for refs).
It works (with fast debug and -XX:+VerifyMethodHandles, plus i cannot find any assertions/checks on the arity) but i don't trust it, as i suspect it could leave the stack in an undefined state.
Here is a simple Scala example:
class A {
object AO {
val x: Int = Test.getY()
}
def getAOX(): Int = {
AO.x
}
}
object Test {
var y: Int = 0;
def getY(): Int = {
y = y + 1
y
}
def main(args: Array[String]) {
print(new A().getAOX())
print(new A().getAOX())
print(new A().getAOX())
}
}
[I restarted Skype and it has lost all messages from this hallway from the 1st of May up to today :-( ]
John:
I sent [via email] a super-quick sketch about static bundles.
Paul, I'll convert our previous exchange to Email.
More information about the valhalla-dev
mailing list