JEP 254: Compact Strings - length limits

John Rose john.r.rose at oracle.com
Tue Sep 6 21:11:35 UTC 2016


On Sep 6, 2016, at 12:58 PM, Charles Oliver Nutter <headius at headius.com> wrote:
> 
> On Tue, Sep 6, 2016 at 1:04 PM, Xueming Shen <xueming.shen at oracle.com>
> wrote:
> 
>> Yes, it's a known "limit" given the nature of the approach. It is not
>> considered to be an "incompatible change", because the max length the
>> String class and the corresponding buffer/builder classes can support is
>> really an implementation detail, not a spec requirement. The conclusion
>> from the discussion back then was that this is something we can trade off
>> for the benefits we gain from the approach.
>> Do we have a real use case that is impacted by this change?
>> 
> 
> Well, doesn't this mean that any code out there consuming String data
> that's longer than Integer.MAX_VALUE / 2 will suddenly start failing on
> OpenJDK 9?
> 
> Not that such a case is a particularly good pattern, but I'm sure there's
> code out there doing it. On JRuby we routinely get bug reports complaining
> that we can't support strings larger than 2GB (and we have used byte[] for
> strings since 2006).
> 
> - Charlie

The most basic scale requirement for strings is that they support class-file
constants, which top out at a UTF8-length of 2**16 - 1 bytes.  Lengths beyond that,
to fill up the 'int' return value of String::length, are less well specified.
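
To put numbers on the limit under discussion, here is a minimal sketch of the
arithmetic, assuming a byte[] backing store with a one-bit "coder" that selects
one or two bytes per char.  The class and constant names are illustrative only,
and real VMs cap array sizes slightly below Integer.MAX_VALUE, so these are
upper bounds.

```java
public class CompactLengthLimits {
    // Assumed coder values for the two formats (illustrative names).
    static final int LATIN1 = 0; // one byte per char
    static final int UTF16 = 1;  // two bytes per char

    // Upper bound on the char count that fits in a byte[] whose own
    // length cannot exceed Integer.MAX_VALUE.
    static int maxChars(int coder) {
        return Integer.MAX_VALUE >> coder;
    }

    public static void main(String[] args) {
        System.out.println(maxChars(LATIN1)); // 2147483647
        System.out.println(maxChars(UTF16));  // 1073741823, i.e. Integer.MAX_VALUE / 2
    }
}
```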

FTR, we could have chosen char[], int[], or long[] (not byte[]) as the backing
store for string data.  With long[] we could have strings above 4G-chars.

But it would have come with a perf. tax, since the T[].length field would need
to be combined with an extra bit or two (from a flag byte) to complete the length.
That's 2-3 extra instructions for loading a string length, or else a redundant
length field.  So it's a trade-off.
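
As a rough sketch of that trade-off, compare the two length computations below.
The first is the byte[]-plus-coder shape; the second is a hypothetical long[]
layout (not something the JDK ships) in which Latin-1 chars are packed eight
per long and the low three bits of a flag byte record how many slots in the
last long are unused.  All names here are assumptions for illustration.

```java
public class LengthLoadSketch {
    // byte[] design: one array-length load and one shift by the coder (0 or 1).
    static int lengthCompact(byte[] value, byte coder) {
        return value.length >> coder;
    }

    // Hypothetical long[] design: the array length alone is too coarse, so it
    // must be scaled and then corrected with bits taken from the flag byte.
    // That is the extra mask, shift, and subtract on every length query.
    static long lengthPacked(long[] value, byte flags) {
        int unusedSlots = flags & 0x7;          // extra: mask the flag byte
        long capacity = (long) value.length << 3; // extra: scale to char slots
        return capacity - unusedSlots;          // extra: combine the two
    }

    public static void main(String[] args) {
        System.out.println(lengthCompact(new byte[10], (byte) 1)); // 5
        System.out.println(lengthPacked(new long[2], (byte) 3));   // 13
    }
}
```

Note that the second variant also has to return a long, which String::length
cannot do today.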

Likewise, choosing a third format deepens branch depth in order to get to payload.

Likewise, making the second format (of two) have a length field embedded in the
payload section requires a conditional load or branch, in order to load the string
length.  Again, more instructions.
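
A minimal sketch of that embedded-length variant, purely to show where the
branch comes from.  The layout is hypothetical: format 0 uses the array length
directly, while format 1 stores its char count big-endian in the first four
bytes of the payload.

```java
public class EmbeddedLengthSketch {
    static final int HEADER_BYTES = 4; // assumed size of the embedded length field

    // The length query must first test the format, then either use the array
    // length or perform a second, dependent load from the payload itself.
    static int length(byte[] value, byte coder) {
        if (coder == 0) {
            return value.length;
        }
        return ((value[0] & 0xFF) << 24)
             | ((value[1] & 0xFF) << 16)
             | ((value[2] & 0xFF) << 8)
             |  (value[3] & 0xFF);
    }

    public static void main(String[] args) {
        byte[] second = new byte[HEADER_BYTES + 6];
        second[3] = 3; // embedded char count of 3 (two bytes per char follow)
        System.out.println(length(new byte[7], (byte) 0)); // 7
        System.out.println(length(second, (byte) 1));      // 3
    }
}
```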

The team has looked at 20 possibilities like these.  The current design is fastest.
I hope it flies.

— John

