Compact Strings and APIs for fast decoding of string data

Chris Vest mr.chrisvest at gmail.com
Tue Feb 9 15:20:27 UTC 2016


Hi,

Aleksey Shipilev did a talk on his journey to implement compact strings and indified string concat at the JVM Tech Summit yesterday, and this reminded me that we (Neo4j) have a need for turning segments of DirectByteBuffers into Strings as fast as possible. If we already store the string data in Latin1, which is one of the two special encodings for compact strings, we’d ideally like to produce the String object with just the two necessary object allocations and a single, vectorised memory copy.

Our use case is that we are a database and we do our own file paging, effectively having file data in a large set of DirectByteBuffers. We have string data in our files in a number of different encodings, a popular one being Latin1. Occasionally these String values span multiple buffers. We often need to expose this data as String objects, and decoding the bytes and turning them into a String is often very performance-sensitive - to the point of being one of our top bottlenecks for the affected queries. Part of the story is that in the case of Latin1, I’ll know up front exactly how many bytes my string data takes up, though I might not know how many buffers are going to be involved.
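
For reference, the fastest portable approach I know of today looks roughly like the sketch below: gather the bytes into a scratch array, then decode the whole thing in one go. It pays for a zeroed temporary array, a copy out of each buffer, and a full decoder pass on top.

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Minimal sketch of the status quo, assuming the total byte length is
// known up front: gather the string's bytes from one or more buffers
// into a scratch array, then decode.
static String decodeToday(int totalBytes, ByteBuffer... buffers) {
    byte[] tmp = new byte[totalBytes]; // zeroed, only to be overwritten
    int written = 0;
    for (ByteBuffer buf : buffers) {
        int n = buf.remaining();
        buf.get(tmp, written, n); // bulk copy out of the (direct) buffer
        written += n;
    }
    return new String(tmp, 0, written, StandardCharsets.ISO_8859_1);
}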

As far as I can tell, that ideal is currently not possible using public APIs. It may be possible using private APIs, but that would rely on the JIT to vectorise the memory copy.

From an API standpoint, CharsetDecoder is close to home, but not quite there. It’s stateful and not thread-safe, so I either have to allocate a new one every time or cache them in thread-locals. I’m also required to allocate the receiving CharBuffer. Since I may need to decode from multiple buffers, I realise that I might not be able to get away from allocating at least one extra object to keep track of intermediate decoding state. CharsetDecoder also has no method where I can specify the offset and length of the part of the ByteBuffer I want to decode, which forces me to allocate views instead.
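
Concretely, decoding a slice of a buffer with CharsetDecoder today looks something like this sketch (error handling elided):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

// The decoder is stateful and not thread-safe, so cache one per thread.
static final ThreadLocal<CharsetDecoder> DECODER = ThreadLocal.withInitial(
        () -> StandardCharsets.ISO_8859_1.newDecoder());

static String decodeSlice(ByteBuffer buf, int offset, int length) {
    // No decode(buf, offset, length) overload exists, so allocate a view.
    ByteBuffer view = buf.duplicate();
    view.position(offset).limit(offset + length);
    CharsetDecoder decoder = DECODER.get();
    decoder.reset();
    CharBuffer chars = CharBuffer.allocate(length); // backed by a char[]
    decoder.decode(view, chars, true); // CoderResult checks elided
    decoder.flush(chars);
    chars.flip();
    return chars.toString(); // String copies the chars defensively
}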

The CharBuffers are allocated with a length up front, which is nice, but I can’t restrict their encoding, so they have to allocate a char array instead of the byte array that I really want. Even if they did allocate a byte array, the CharBuffer is mutable, which would force String to make a defensive copy anyway.

One way I imagine this could be solved would be with a less dynamic kind of decoder, where the target length is given upfront to the decoder. Buffers are then consumed one by one, and a terminal method performs finishing sanity checks (did we get all the bytes we were promised?) and returns the result.

StringDecoder decoder = Charset.forName("latin1").newStringDecoder(lengthInCharactersOrBytesImNotSureWhichIsBest);
String result = decoder.decode(buf1, off1, len1).decode(buf2, off2, len2).done();
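
To make the shape concrete, such an API could look something like the sketch below. StringDecoder and newStringDecoder don’t exist anywhere; the names are placeholders for illustration.

import java.nio.ByteBuffer;

// Hypothetical API sketch - none of these names exist in the JDK.
public interface StringDecoder {
    // Consume the next 'length' bytes starting at 'offset' in 'buffer'.
    // Returns this, so calls can be chained across buffers.
    StringDecoder decode(ByteBuffer buffer, int offset, int length);

    // Verify that exactly the promised number of bytes arrived,
    // then hand back the finished String.
    String done();
}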

This would in principle allow the string decoding to cost two small object allocations, one array allocation without zeroing, and a sequence of potentially vectorised memcpys. I don’t see any potentially troubling interactions with fused Strings either, since all the knowledge (except for the string data itself) needed to allocate the String objects is available from the get-go.

What do you guys think?

Btw, Richard Warburton has already done some work in this area, and made a patch that adds a constructor to String that takes a buffer, offset, length, and charset. This work now at least needs rebasing: http://cr.openjdk.java.net/~rwarburton/string-patch-webrev/
It doesn’t solve the case where multiple buffers are used to build the string, but does remove the need for a separate intermediate state-holding object when a single buffer is enough. It’d be a nice addition if possible, but I (for one) can tolerate a small object allocation otherwise.
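
With a constructor like that, the single-buffer case would collapse to something like the following. This is hypothetical - the signature is inferred from the patch description, not a released API, and acquirePage() is a placeholder for however one of our page buffers is obtained.

// Hypothetical: String(ByteBuffer, int, int, Charset) is what the
// patch adds; it is not in any released JDK.
ByteBuffer page = acquirePage(); // placeholder for our page buffer
String value = new String(page, offset, length, StandardCharsets.ISO_8859_1);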

Cheers,
Chris



