The store for byte strings

Sun Jun 10 01:48:59 UTC 2018

On Jun 9, 2018, at 12:18 PM, Xueming Shen <xueming.shen at oracle.com> wrote:
> 
> Ideally I would assume we would want to have a utf-8 internal storage for
> String, even in theory utf8 is supposed to be used externally and utf16
> to be the internal one.

Separately from my point about ByteSequence, I agree that "doubling down"
on Utf8 as a standard format for packed strings is a good idea.  A reasonable
way to prototype right now would be an implementation of CharSequence
that is backed by a byte[] (eventually ByteSequence) and has some sort of
fast access (probably streaming) to Utf16 code units.  To make it pay for
itself the Utf8 encoding should be applicable as an overlay in as many places
as possible, including slices of byte[] and ByteBuffer objects, and later
ByteSequences.
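
As a rough illustration of such a prototype, here is a minimal sketch of a
CharSequence overlaid on a slice of a byte[].  The class name
Utf8CharSequence and its constructor are hypothetical, the input is assumed
to be well-formed Utf8 (no validation is done), and the array is shared
rather than copied, so the caller must not mutate it.  Access is by
sequential scan, so charAt is linear in the index; a real version would
want a streaming cursor or a cached position.

```java
import java.nio.charset.StandardCharsets;
import java.util.Objects;

/** Hypothetical sketch: a read-only CharSequence overlay on UTF-8 bytes. */
final class Utf8CharSequence implements CharSequence {
    private final byte[] bytes; // shared, not copied: caller must not mutate
    private final int start;    // first byte of the slice
    private final int end;      // one past the last byte of the slice

    Utf8CharSequence(byte[] bytes, int offset, int byteCount) {
        Objects.checkFromIndexSize(offset, byteCount, bytes.length);
        this.bytes = bytes;
        this.start = offset;
        this.end = offset + byteCount;
    }

    // Number of bytes in the encoded form that begins with this lead byte.
    // Assumes well-formed input, so anything else is treated as 4 bytes.
    private static int width(byte lead) {
        if (lead >= 0) return 1;
        if ((lead & 0xE0) == 0xC0) return 2;
        if ((lead & 0xF0) == 0xE0) return 3;
        return 4;
    }

    // Decode the code point whose encoding starts at byte position pos.
    private int codePointAtByte(int pos) {
        int b0 = bytes[pos];
        switch (width(bytes[pos])) {
            case 1:
                return b0;
            case 2:
                return ((b0 & 0x1F) << 6) | (bytes[pos + 1] & 0x3F);
            case 3:
                return ((b0 & 0x0F) << 12) | ((bytes[pos + 1] & 0x3F) << 6)
                        | (bytes[pos + 2] & 0x3F);
            default:
                return ((b0 & 0x07) << 18) | ((bytes[pos + 1] & 0x3F) << 12)
                        | ((bytes[pos + 2] & 0x3F) << 6) | (bytes[pos + 3] & 0x3F);
        }
    }

    // Length in UTF-16 code units: a 4-byte sequence becomes a surrogate pair.
    @Override public int length() {
        int n = 0;
        for (int pos = start; pos < end; pos += width(bytes[pos])) {
            n += (width(bytes[pos]) == 4) ? 2 : 1;
        }
        return n;
    }

    // Linear scan from the start of the slice to the requested code unit.
    @Override public char charAt(int index) {
        if (index < 0) throw new IndexOutOfBoundsException("index " + index);
        int unit = 0; // UTF-16 index of the code point at pos
        for (int pos = start; pos < end; pos += width(bytes[pos])) {
            int cp = codePointAtByte(pos);
            int n = Character.charCount(cp);
            if (index < unit + n) {
                if (n == 1) return (char) cp;
                return index == unit ? Character.highSurrogate(cp)
                                     : Character.lowSurrogate(cp);
            }
            unit += n;
        }
        throw new IndexOutOfBoundsException("index " + index);
    }

    // Simplest correct choice for a sketch: materialize, then slice.
    @Override public CharSequence subSequence(int from, int to) {
        return toString().subSequence(from, to);
    }

    @Override public String toString() {
        return new String(bytes, start, end - start, StandardCharsets.UTF_8);
    }
}
```

The same decoding loop could in principle be pointed at a ByteBuffer or a
ByteSequence instead of a byte[]; only the byte-fetch would change.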

> Defensive copy when getting byte[] in & out of String object seems still
> inevitable for now, before we can have something like "read-only" byte[],
> given the nature of its immutability commitment.

We didn't need frozen char[] arrays to avoid defensive copying of String
objects, only an immutability invariant on the class.  We could pull a similar
trick with Utf8 by supplying a ByteSequence view of a String's underlying
bytes.  If the String has underlying chars (Utf16) a view is also possible,
although it is more difficult to get right (as you described).
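
To make the "immutability invariant" point concrete, here is a small sketch
under stated assumptions.  ByteSequence is the name used in this thread but
is not a JDK type, so the two-method interface below is invented for
illustration, and Utf8Text is a hypothetical stand-in for a String with
Utf8 storage.  The backing array never escapes the class, so the view can be
handed out without a defensive copy and without a frozen byte[].

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical read-only byte view; not an existing JDK interface. */
interface ByteSequence {
    int length();
    byte byteAt(int index);
}

/** Hypothetical stand-in for a String whose internal storage is UTF-8. */
final class Utf8Text {
    // The array is private and never handed out, so the class invariant
    // alone keeps it immutable.
    private final byte[] utf8;

    Utf8Text(String s) {
        // One private copy at construction; no copies are made afterwards.
        this.utf8 = s.getBytes(StandardCharsets.UTF_8);
    }

    /** Returns a view onto the internal bytes without copying them. */
    ByteSequence bytes() {
        return new ByteSequence() {
            @Override public int length() { return utf8.length; }
            @Override public byte byteAt(int index) { return utf8[index]; }
        };
    }
}
```

A caller can read every byte through the view but has no way to write to
the array, which is the same trick String already plays with its chars.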

— John