The store for byte strings
Lately I've been thinking about string representation. The world turned out not to be UCS-2 or UTF-16, after all, and we often have to deal with strings generally encoded as ASCII or UTF-8, but we aren't always encoded this way (and there might not even be a charset declaration, see the ELF spec). (a) byte[] with defensive copies. Internal storage is byte[], copy is made before returning it to the caller. Quite common across the JDK. (b) byte[] without defensive copies. Internal storage is byte[], and a reference is returned. In the past, this could be a security bug, and usually, it was adjusted to (a) when noticed. Without security requirements, this can be quite efficient, but there is ample potential for API misuse. (c) java.lang.String with ISO-8859-1 decoding/encoding. Sometimes done by reconfiguring the entire JVM to run with ISO-8859-1, usually so that it is possible to process malformed UTF-8. The advantage is that there is rich API support, including regular expressions, and good optimization. There is also language support for string literals. (d) java.lang.String with UTF-8 decoding/encoding and replacement. This seems to be very common, but is not completely accurate and can lead to subtle bugs (or completely non-processible data). Otherwise has the same advantages as (c). (e) Various variants of ByteBuffer. Have not seen this much in practice (outside binary file format parsers). In the past, it needed deep defensive copies on input for security (because there isn't an immutably backed ByteBuffer), and shallow copies for access. The ByteBuffer objects themselves are also quite heavy when they can't be optimized away. For that reason, probably most useful on interfaces, and not for storage. (f) Custom, immutable ByteString class. Quite common, but has cross-library interoperability issues, and a full complement of support (matching java.lang.String) is quite hard. (g) Something based on VarHandle. Haven't seen this yet. Probably not useful for storage. Anything that I have missed? Considering these choices, what is the expected direction on the JDK side for new code? Option (d) for things generally ASCII/UTF-8, and (b) for things of a more binary nature? What to do if the choice is difficult?
On 6/9/18, 3:27 AM, Florian Weimer wrote:
Lately I've been thinking about string representation. The world turned out not to be UCS-2 or UTF-16, after all, and we often have to deal with strings generally encoded as ASCII or UTF-8, but we aren't always encoded this way (and there might not even be a charset declaration, see the ELF spec).
(a) byte[] with defensive copies. Internal storage is byte[], copy is made before returning it to the caller. Quite common across the JDK.
(b) byte[] without defensive copies. Internal storage is byte[], and a reference is returned. In the past, this could be a security bug, and usually, it was adjusted to (a) when noticed. Without security requirements, this can be quite efficient, but there is ample potential for API misuse.
(c) java.lang.String with ISO-8859-1 decoding/encoding. Sometimes done by reconfiguring the entire JVM to run with ISO-8859-1, usually so that it is possible to process malformed UTF-8. The advantage is that there is rich API support, including regular expressions, and good optimization. There is also language support for string literals.
(d) java.lang.String with UTF-8 decoding/encoding and replacement. This seems to be very common, but is not completely accurate and can lead to subtle bugs (or completely non-processible data). Otherwise has the same advantages as (c).
(e) Various variants of ByteBuffer. Have not seen this much in practice (outside binary file format parsers). In the past, it needed deep defensive copies on input for security (because there isn't an immutably backed ByteBuffer), and shallow copies for access. The ByteBuffer objects themselves are also quite heavy when they can't be optimized away. For that reason, probably most useful on interfaces, and not for storage.
(f) Custom, immutable ByteString class. Quite common, but has cross-library interoperability issues, and a full complement of support (matching java.lang.String) is quite hard.
(g) Something based on VarHandle. Haven't seen this yet. Probably not useful for storage.
Anything that I have missed?
Considering these choices, what is the expected direction on the JDK side for new code? Option (d) for things generally ASCII/UTF-8, and (b) for things of a more binary nature? What to do if the choice is difficult?
Hi Florian, Some comments about the j.l.String storage. Ideally I would assume we would want to have a utf-8 internal storage for String, even in theory utf8 is supposed to be used externally and utf16 to be the internal one. I did have a byte[]/utf-8 prototype implementation when we did the compact string for jdk9 but that was finally dropped because of the potential performance regression for index base access, such as the basic String.charAt(int), as you have to count from the beginning to locate the target character each every time. But I think we might want to try it again later, especially for use scenario that index base access performance is not that important/critical and the throughput operation of the String, means input from /output to the external utf-8/byte[] world, is more desired. Given we are heading utf-8 as the default encoding for jvm [1], I think we might want to at least provide some alternative that you can "optionally" do that for String object. The idea might go further (wild, just an idea, not necessary something thing we really want to do :-) for Java String) to other charsets, so you can simply store the byte[] (verified no malformed/unmappable) + charsetId directly when creating a String object. This might be useful and efficient in use scenario that the String object is simply a vehicle to carry a sequence of characters back and forth between a front end server and back end server, the jvm is simply passing them around/through. Defensive copy when getting byte[] in & out of String object seems still inevitable for now, before we can have something like "read-only" byte[], given the nature of its immutability commitment. Regards, Sherman [1] https://bugs.openjdk.java.net/browse/JDK-8187041
On Jun 9, 2018, at 12:18 PM, Xueming Shen <xueming.shen@oracle.com> wrote:
Ideally I would assume we would want to have a utf-8 internal storage for String, even in theory utf8 is supposed to be used externally and utf16 to be the internal one.
Separately from my point about ByteSequence, I agree that "doubling down" on Utf8 as a standard format for packed strings is a good idea. A reasonable way to prototype right now would be an implementation of CharSequence that is backed by a byte[] (eventually ByteSequence) and has some sort of fast access (probably streaming) to Utf16 code points. To make it pay for itself the Utf8 encoding should be applicable as an overlay in as many places as possible, including slices of byte[] and ByteBuffer objects, and later ByteSequences.
Defensive copy when getting byte[] in & out of String object seems still inevitable for now, before we can have something like "read-only" byte[], given the nature of its immutability commitment.
We didn't need frozen char[] arrays to avoid defensive copying of String objects, only an immutability invariant on the class. We could pull a similar trick with Utf8 by supplying a ByteSequence view of a String's underlying bytes. If the String has underlying chars (Utf16) a view is also possible, although it is more difficult to get right (as you described). — John
I'm glad to see you are thinking about this, Florian. You appear to be aiming at a way to compactly store and manipulate series of octets (in an arbitrary encoding) with an emphasis on using those octets to represent strings, in the usual sense of character sequences. Would you agree that this design problem factors well into a generic problem of storing and manipulating octet sequences, plus a detachable upper layer that allows strings (in various encodings) to be extracted from those sequences? I think the sweet spot here is to introduce a "stringy but char-free" API which commits to dealing with chunks of memory (viewed as octet sequences), regardless of how those chunks will be interpreted. In https://bugs.openjdk.java.net/browse/JDK-8161256 I discuss this nascent API under the name "ByteSequence", which is analogous to CharSequence, but doesn't mention the types 'char' or 'String'. By "stringy" I mean that there are natural ways to index or partition an existing sequence of octets, or concatenate multiple sequences into one. I also mean that immutability plays a strong role, enabling algorithms to work without defensive copies. Making it an interface like CharSequence means we can use backing stores like ByteBuffer or byte[], or more exotic things like Panama native memory, interoperably. Here are some uses for an immutable octet sequence type: - manipulation of non-UTF16 character data (which you mention) - zero copy views (slices, modifiable or not) into existing types (ByteBuffer, byte[], etc.) - zero copy views into file images (N.B. requires a 'long' size property, not 'int') - zero copy views to intra-classfile resources (CONSTANT_Bytes) - backing stores for Panama data structures and smart pointers - copy-reduced scatter and gather nodes associated with packet processing - octet-level cursors for parsers, scanners, packet decoders, etc. If the ByteSequence views are value instances, they can be created at a very high rate with little or no GC impact. Generic algorithms would still operate on them A mutable octet sequence class, analogous to StringBuilder, would allow immutable sequences to be built with fewer intermediate copies, just like with StringBuilder. If the API is properly defined it can be inserted directly into existing types like ByteBuffer. Doing this will probably require us to polish ByteBuffer a little, adding immutability as an option and lifting the 32-bit limits. It should be possible to "freeze" a ByteBuffer or array and use it as a backing store that is reliably immutable, so it can be handed to zero-copy algorithms that work with ByteSequences. For some of this see https://bugs.openjdk.java.net/browse/JDK-8180628 . Independently, I want to eventually add frozen arrays, including frozen byte[] arrays, to the JVM, but that doesn't cover zero-copy use cases; it has to be an interface like CharSequence. So the option I prefer is not on your list; it would be: (h) ByteSequence interface with retrofits to ByteBuffer, byte[], etc. This is more flexible than (f) the concrete ByteString class. I think the ByteString you are thinking of would appear as a non-public class created by a ByteSequence factory, analogous to List::of. — John On Jun 9, 2018, at 3:27 AM, Florian Weimer <fw@deneb.enyo.de> wrote:
Lately I've been thinking about string representation. The world turned out not to be UCS-2 or UTF-16, after all, and we often have to deal with strings generally encoded as ASCII or UTF-8, but we aren't always encoded this way (and there might not even be a charset declaration, see the ELF spec).
(a) byte[] with defensive copies. Internal storage is byte[], copy is made before returning it to the caller. Quite common across the JDK.
(b) byte[] without defensive copies. Internal storage is byte[], and a reference is returned. In the past, this could be a security bug, and usually, it was adjusted to (a) when noticed. Without security requirements, this can be quite efficient, but there is ample potential for API misuse.
(c) java.lang.String with ISO-8859-1 decoding/encoding. Sometimes done by reconfiguring the entire JVM to run with ISO-8859-1, usually so that it is possible to process malformed UTF-8. The advantage is that there is rich API support, including regular expressions, and good optimization. There is also language support for string literals.
(d) java.lang.String with UTF-8 decoding/encoding and replacement. This seems to be very common, but is not completely accurate and can lead to subtle bugs (or completely non-processible data). Otherwise has the same advantages as (c).
(e) Various variants of ByteBuffer. Have not seen this much in practice (outside binary file format parsers). In the past, it needed deep defensive copies on input for security (because there isn't an immutably backed ByteBuffer), and shallow copies for access. The ByteBuffer objects themselves are also quite heavy when they can't be optimized away. For that reason, probably most useful on interfaces, and not for storage.
(f) Custom, immutable ByteString class. Quite common, but has cross-library interoperability issues, and a full complement of support (matching java.lang.String) is quite hard.
(g) Something based on VarHandle. Haven't seen this yet. Probably not useful for storage.
Anything that I have missed?
Considering these choices, what is the expected direction on the JDK side for new code? Option (d) for things generally ASCII/UTF-8, and (b) for things of a more binary nature? What to do if the choice is difficult?
* John Rose:
In https://bugs.openjdk.java.net/browse/JDK-8161256 I discuss this nascent API under the name "ByteSequence", which is analogous to CharSequence, but doesn't mention the types 'char' or 'String'.
Very interesting. What's the specification for toString() and hashCode()? One problem of retrofitting a custom ByteString into a CharSequence is that CharSequence reuses toString() in a fairly central fashion, and it's hard to reconcile this with byte-based length() and charAt() methods unless ISO-8859-1 encoding is used. If this feature is supposed to land before JEP 218? If not, how does ByteSequence differ from List<byte>?
If the ByteSequence views are value instances, they can be created at a very high rate with little or no GC impact. Generic algorithms would still operate on them
I'm not up-to-date with those upcoming changes. Would the nature as value instances be expressed as part of the ByteSequence interface type?
If the API is properly defined it can be inserted directly into existing types like ByteBuffer. Doing this will probably require us to polish ByteBuffer a little, adding immutability as an option and lifting the 32-bit limits. It should be possible to "freeze" a ByteBuffer or array and use it as a backing store that is reliably immutable, so it can be handed to zero-copy algorithms that work with ByteSequences.
Such freezing is incompatible with mapped byte buffers, right? Even if the implementation prevents changes at the VM/process level, changes on the file system could well become visible. Do you expect to make freezing an optional operation (probably not a good idea), or copy the contents of the mapping to the heap (which is probably not too bad, considering that a shared byte[] could also result in arbitrarily large copies).
Independently, I want to eventually add frozen arrays, including frozen byte[] arrays, to the JVM, but that doesn't cover zero-copy use cases; it has to be an interface like CharSequence.
Well, there is already the VarHandle approach for that. But it's not a particularly rich interface and very far away from strings.
So the option I prefer is not on your list; it would be:
(h) ByteSequence interface with retrofits to ByteBuffer, byte[], etc.
This is more flexible than (f) the concrete ByteString class. I think the ByteString you are thinking of would appear as a non-public class created by a ByteSequence factory, analogous to List::of.
Yes, this seems reasonable. It's a bit of a drawback that the immutable nature of a value cannot be expressed in the type system (so that you have to remember to use ByteSequence::of to get a view to an immutable object), but at least it's consistent with collections.
participants (3)
-
Florian Weimer
-
John Rose
-
Xueming Shen