The store for byte strings

Sat Jun 9 19:18:05 UTC 2018

On 6/9/18, 3:27 AM, Florian Weimer wrote:
> Lately I've been thinking about string representation.  The world
> turned out not to be UCS-2 or UTF-16, after all, and we often have to
> deal with strings generally encoded as ASCII or UTF-8, but they aren't
> always encoded this way (and there might not even be a charset
> declaration, see the ELF spec).
>
> (a) byte[] with defensive copies.
>      Internal storage is byte[], copy is made before returning it to
>      the caller.  Quite common across the JDK.
>
> (b) byte[] without defensive copies.
>      Internal storage is byte[], and a reference is returned.  In the
>      past, this could be a security bug, and usually, it was adjusted
>      to (a) when noticed.  Without security requirements, this can be
>      quite efficient, but there is ample potential for API misuse.
>
> (c) java.lang.String with ISO-8859-1 decoding/encoding.
>      Sometimes done by reconfiguring the entire JVM to run with
>      ISO-8859-1, usually so that it is possible to process malformed
>      UTF-8.  The advantage is that there is rich API support, including
>      regular expressions, and good optimization.  There is also
>      language support for string literals.
>
> (d) java.lang.String with UTF-8 decoding/encoding and replacement.
>      This seems to be very common, but is not completely accurate
>      and can lead to subtle bugs (or completely non-processible
>      data).  Otherwise has the same advantages as (c).
>
> (e) Various variants of ByteBuffer.
>      Have not seen this much in practice (outside binary file format
>      parsers).  In the past, it needed deep defensive copies on input
>      for security (because there isn't an immutably backed ByteBuffer),
>      and shallow copies for access.  The ByteBuffer objects themselves
>      are also quite heavy when they can't be optimized away.  For that
>      reason, probably most useful on interfaces, and not for storage.
>
> (f) Custom, immutable ByteString class.
>      Quite common, but has cross-library interoperability issues,
>      and a full complement of support (matching java.lang.String)
>      is quite hard.
>
> (g) Something based on VarHandle.
>      Haven't seen this yet.  Probably not useful for storage.
>
> Anything that I have missed?
>
> Considering these choices, what is the expected direction on the JDK
> side for new code?  Option (d) for things generally ASCII/UTF-8, and
> (b) for things of a more binary nature?  What to do if the choice is
> difficult?

Hi Florian,

Some comments about the j.l.String storage.

Ideally, I think we would want UTF-8 as the internal storage for String,
even though in theory UTF-8 is supposed to be the external encoding and
UTF-16 the internal one. I did have a byte[]/UTF-8 prototype
implementation when we did compact strings for JDK 9, but it was
eventually dropped because of the potential performance regression for
index-based access such as the basic String.charAt(int): you have to
count from the beginning to locate the target character every time. But
I think we might want to try it again later, especially for use cases
where index-based access performance is not that important and
throughput, meaning input from and output to the external UTF-8/byte[]
world, matters more. Given that we are heading toward UTF-8 as the
default encoding for the JVM [1], I think we might want to at least
provide an alternative that lets you "optionally" do that for a String
object. The idea might go further (wild, just an idea, not necessarily
something we really want to do for Java String :-) to other charsets,
so that you could simply store the byte[] (verified to contain nothing
malformed/unmappable) plus a charsetId directly when creating a String
object. This might be useful and efficient in use cases where the String
object is simply a vehicle to carry a sequence of characters back and
forth between a front-end server and a back-end server, and the JVM is
simply passing them through.
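
To make the index-based access cost concrete, here is a minimal sketch of
what a charAt lookup over UTF-8 bytes has to do. It is not the dropped
JDK 9 prototype; the class name Utf8IndexSketch and its method are
hypothetical, and the input is assumed to be well-formed UTF-8.

```java
import java.nio.charset.StandardCharsets;

final class Utf8IndexSketch {

    /**
     * Returns the UTF-16 code unit at the given index, mimicking
     * String.charAt, by scanning the UTF-8 bytes from the start.
     * Assumption: utf8 holds well-formed UTF-8 (no validation here).
     */
    static char charAt(byte[] utf8, int index) {
        if (index < 0) {
            throw new StringIndexOutOfBoundsException(index);
        }
        int unit = 0; // UTF-16 code units passed so far
        int i = 0;    // current byte offset
        while (i < utf8.length) {
            int b = utf8[i] & 0xFF;
            int len;
            int cp;
            if (b < 0x80) {        // 1-byte sequence (ASCII)
                len = 1; cp = b;
            } else if (b < 0xE0) { // 2-byte sequence
                len = 2; cp = b & 0x1F;
            } else if (b < 0xF0) { // 3-byte sequence
                len = 3; cp = b & 0x0F;
            } else {               // 4-byte sequence
                len = 4; cp = b & 0x07;
            }
            for (int k = 1; k < len; k++) {
                cp = (cp << 6) | (utf8[i + k] & 0x3F);
            }
            int units = Character.charCount(cp); // 1, or 2 for a surrogate pair
            if (index < unit + units) {
                if (units == 1) {
                    return (char) cp;
                }
                return index == unit
                        ? Character.highSurrogate(cp)
                        : Character.lowSurrogate(cp);
            }
            unit += units;
            i += len;
        }
        throw new StringIndexOutOfBoundsException(index);
    }

    public static void main(String[] args) {
        String s = "a\u00e9\u20ac\uD83D\uDE00z";
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < s.length(); i++) {
            // Each lookup restarts from byte 0, so the loop is O(n^2) overall.
            System.out.println(charAt(utf8, i) == s.charAt(i));
        }
    }
}
```

Every call walks the array from byte 0, so a loop that calls charAt for
each index is quadratic, whereas it is linear with a fixed-width
internal encoding.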

A defensive copy when getting a byte[] into or out of a String object
still seems inevitable for now, at least until we have something like a
"read-only" byte[], given String's immutability commitment.
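
As a rough illustration of the byte[] + charset idea together with the
copies it forces, here is a sketch of a small immutable carrier. The
class EncodedText and its methods are hypothetical and not a proposed
JDK API; it only uses the existing java.nio.charset decoder to verify
the bytes once, on the way in.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class EncodedText {
    private final byte[] bytes;    // never shared with callers
    private final Charset charset;

    private EncodedText(byte[] bytes, Charset charset) {
        this.bytes = bytes;
        this.charset = charset;
    }

    /** Copies the input, then verifies the copy contains nothing malformed/unmappable. */
    static EncodedText of(byte[] input, Charset charset)
            throws CharacterCodingException {
        byte[] copy = input.clone(); // defensive copy in; validate the copy, not the input
        charset.newDecoder()
               .onMalformedInput(CodingErrorAction.REPORT)
               .onUnmappableCharacter(CodingErrorAction.REPORT)
               .decode(ByteBuffer.wrap(copy)); // throws on bad input
        return new EncodedText(copy, charset);
    }

    /** Defensive copy out: callers cannot mutate the internal array. */
    byte[] toBytes() {
        return bytes.clone();
    }

    /** Decoding is deferred until somebody actually needs a String. */
    @Override
    public String toString() {
        return new String(bytes, charset);
    }

    public static void main(String[] args) throws CharacterCodingException {
        byte[] wire = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        EncodedText text = EncodedText.of(wire, StandardCharsets.UTF_8);
        wire[0] = 'X';                        // mutating the caller's array...
        System.out.println(text.toString());  // ...does not affect the carrier
    }
}
```

With a "read-only" byte[] both clone() calls could go away; until then
they are the price of keeping the object immutable.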

Regards,
Sherman

[1] https://bugs.openjdk.java.net/browse/JDK-8187041