Compact Strings and APIs for fast decoding of string data

Wed Feb 10 04:38:16 UTC 2016

Hi Chris,

I think you are basically asking for a String constructor that takes a
ByteBuffer. StringCoding can then take advantage of the current
CompactString design to optimize the decoding operation into a single
byte[]/vectorized memory copy from the ByteBuffer to the String's
internal byte[], WHEN the charset is ISO-8859-1.

String(ByteBuffer src, String charset);

Further we will need a "buffer gathering" style constructor

String(ByteBuffer[] srcs, String charset);
(or more generally,
String(ByteBuffer[] srcs, int off, int len, String charset))

to create a String object from a sequence of ByteBuffers, if it's
really desired.
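
To make the gathering idea concrete, here is a rough sketch of what can
be done today with public APIs only. The name latin1FromBuffers and the
class around it are made up for illustration and are not part of any
JDK proposal. It assumes every buffer holds ISO-8859-1 bytes between
its position and limit. Note that it still pays for an intermediate
byte[] plus the copy inside the String constructor, which is exactly
the overhead a built-in constructor could avoid.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class Latin1Gather {
    // Hypothetical helper, not a JDK API: builds one String from the
    // remaining bytes of srcs[off] .. srcs[off + len - 1], read as ISO-8859-1.
    static String latin1FromBuffers(ByteBuffer[] srcs, int off, int len) {
        int total = 0;
        for (int i = off; i < off + len; i++) {
            total = Math.addExact(total, srcs[i].remaining());
        }
        byte[] bytes = new byte[total];
        int pos = 0;
        for (int i = off; i < off + len; i++) {
            // Work on a duplicate so the caller's position is left untouched.
            ByteBuffer view = srcs[i].duplicate();
            int n = view.remaining();
            view.get(bytes, pos, n); // bulk copy out of the (possibly direct) buffer
            pos += n;
        }
        // The public constructor copies the array once more.
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        ByteBuffer a = ByteBuffer.allocateDirect(4);
        a.put(new byte[] {'c', 'a', 'f', (byte) 0xE9}).flip();
        ByteBuffer b = ByteBuffer.wrap(new byte[] {' ', 'o', 'k'});
        System.out.println(latin1FromBuffers(new ByteBuffer[] {a, b}, 0, 2));
    }
}
```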

I would also assume it is desirable to extend the current
CharsetDecoder/CharsetEncoder classes with a pair of "gathering" style
coding methods

CharBuffer CharsetDecoder.decode(ByteBuffer... ins);
ByteBuffer CharsetEncoder.encode(CharBuffer... ins);

Though the implementation might have to deal with the tricky "split
byte/char" issue, in which part of a byte/char sequence is at the end
of the previous buffer and the continuing bytes/chars are in the
following buffer ...
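
For illustration only, the split problem can already be handled with the
existing CharsetDecoder by carrying the undecoded tail of one buffer
over to the next. The helper below (decodeUtf8 is a made-up name, and
UTF-8 is just picked as an example of a multi-byte charset) is a sketch
under those assumptions, not a proposal for how the JDK should
implement a gathering decode().

```java
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

class SplitDecode {
    // Hypothetical helper: decodes the buffers as one continuous UTF-8 stream.
    static String decodeUtf8(ByteBuffer... ins) {
        try {
            CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder();
            int total = 0;
            for (ByteBuffer b : ins) total += b.remaining();
            // UTF-8 never produces more chars than it consumes bytes.
            CharBuffer out = CharBuffer.allocate(total);
            // Holds the bytes of a character cut off at a buffer boundary.
            ByteBuffer carry = ByteBuffer.allocate(8);
            for (ByteBuffer in : ins) {
                ByteBuffer src = in.duplicate(); // keep the caller's position
                // Complete the pending character one byte at a time.
                while (carry.position() > 0 && src.hasRemaining()) {
                    carry.put(src.get());
                    carry.flip();
                    check(dec.decode(carry, out, false));
                    carry.compact();
                }
                check(dec.decode(src, out, false));
                carry.put(src); // undecoded tail, at most 3 bytes for UTF-8
            }
            carry.flip();
            check(dec.decode(carry, out, true)); // a dangling tail is malformed
            check(dec.flush(out));
            out.flip();
            return out.toString();
        } catch (CharacterCodingException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static void check(CoderResult r) throws CharacterCodingException {
        if (r.isError() || r.isOverflow()) r.throwException();
    }

    public static void main(String[] args) {
        byte[] b = "h\u00e9llo \u20ac".getBytes(StandardCharsets.UTF_8);
        // The first boundary falls inside the 2-byte sequence for U+00E9.
        System.out.println(decodeUtf8(ByteBuffer.wrap(b, 0, 2),
                                      ByteBuffer.wrap(b, 2, b.length - 2)));
    }
}
```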

-Sherman

On 2/9/16 7:20 AM, Chris Vest wrote:
> Hi,
>
> Aleksey Shipilev did a talk on his journey to implement compact strings and indified string concat at the JVM Tech Summit yesterday, and this reminded me that we (Neo4j) have a need for turning segments of DirectByteBuffers into Strings as fast as possible. If we already store the string data in Latin1, which is one of the two special encodings for compact strings, we’d ideally like to produce the String object with just the two necessary object allocations and a single, vectorised memory copy.
>
> Our use case is that we are a database and we do our own file paging, effectively having file data in a large set of DirectByteBuffers. We have string data in our files in a number of different encodings, a popular one being Latin1. Occasionally these String values span multiple buffers. We often need to expose this data as String objects, in which case decoding the bytes and turning them into a String is often very performance sensitive - to the point of being one of our top bottlenecks for the given queries. Part of the story is that in the case of Latin1, I’ll know up front exactly how many bytes my string data takes up, though I might not know how many buffers are going to be involved.
>
> As far as I can tell, this is currently not possible using public APIs. Using private APIs it may be possible, but will be relying on the JIT for vectorising the memory copying.
>
> From an API standpoint, CharsetDecoder is close to home, but is not quite there. It’s stateful and not thread-safe, so I either have to allocate new ones every time or cache them in thread-locals. I’m also required to allocate the receiving CharBuffer. Since I may need to decode from multiple buffers, I realise that I might not be able to get away from allocating at least one extra object to keep track of intermediate decoding state. The CharsetDecoder does not have a method where I can specify the offset and length for the desired part of the ByteBuffer I want to decode, which forces me to allocate views instead.
>
> The CharBuffers are allocated with a length up front, which is nice, but I can’t restrict their encoding, so they have to allocate a char array instead of the byte array that I really want. Even if one did allocate a byte array, the CharBuffer is mutable, which would force String to do a defensive copy anyway.
>
> One way I imagine this could be solved would be with a less dynamic kind of decoder, where the target length is given upfront to the decoder. Buffers are then consumed one by one, and a terminal method performs finishing sanity checks (did we get all the bytes we were promised?) and returns the result.
>
> StringDecoder decoder = Charset.forName("latin1").newStringDecoder(lengthInCharactersOrBytesImNotSureWhichIsBest);
> String result = decoder.decode(buf1, off1, len1).decode(buf2, off2, len2).done();
>
> This will in principle allow the string decoding to be 2 small allocations, an array allocation without zeroing, and a sequence of potentially vectorised memcpys. I don’t see any potentially troubling interactions with fused Strings either, since all the knowledge (except for the string data itself) needed to allocate the String objects is available from the get-go.
>
> What do you guys think?
>
> Btw, Richard Warburton has already done some work in this area, and made a patch that adds a constructor to String that takes a buffer, offset, length, and charset. This work now at least needs rebasing: http://cr.openjdk.java.net/~rwarburton/string-patch-webrev/
> It doesn’t solve the case where multiple buffers are used to build the string, but does remove the need for a separate intermediate state-holding object when a single buffer is enough. It’d be a nice addition if possible, but I (for one) can tolerate a small object allocation otherwise.
>
> Cheers,
> Chris
>
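
As a rough illustration of the StringDecoder shape quoted above, here is
a sketch restricted to Latin1 and built on public APIs only. The class
name Latin1StringDecoder is hypothetical, and the length is assumed to
be given in bytes (which equals the length in chars for Latin1). It
cannot avoid array zeroing, the duplicate() view, or the final copy in
the String constructor; it only shows the fluent decode/done flow and
the finishing sanity check.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical class, not a JDK API: collects Latin1 bytes from several
// buffer segments into one String of a length known up front.
final class Latin1StringDecoder {
    private final byte[] bytes;
    private int filled;

    Latin1StringDecoder(int lengthInBytes) {
        bytes = new byte[lengthInBytes];
    }

    // Copies buf[off .. off + len) without changing buf's position or limit.
    Latin1StringDecoder decode(ByteBuffer buf, int off, int len) {
        if (len < 0 || len > bytes.length - filled) {
            throw new IllegalArgumentException("segment does not fit: " + len);
        }
        ByteBuffer view = buf.duplicate(); // small view object, shares content
        view.limit(off + len).position(off);
        view.get(bytes, filled, len);      // bulk copy
        filled += len;
        return this;
    }

    // Terminal step: did we get all the bytes we were promised?
    String done() {
        if (filled != bytes.length) {
            throw new IllegalStateException(
                "expected " + bytes.length + " bytes, got " + filled);
        }
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        ByteBuffer buf1 = ByteBuffer.wrap("xxhel".getBytes(StandardCharsets.ISO_8859_1));
        ByteBuffer buf2 = ByteBuffer.wrap("loyy".getBytes(StandardCharsets.ISO_8859_1));
        String result = new Latin1StringDecoder(5)
                .decode(buf1, 2, 3)
                .decode(buf2, 0, 2)
                .done();
        System.out.println(result);
    }
}
```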