Compact Strings and APIs for fast decoding of string data

Peter Levart peter.levart at gmail.com
Wed Feb 10 15:50:13 UTC 2016


Hi Chris and others,

On 02/10/2016 02:16 PM, Chris Vest wrote:
> It’d be a shame to lose the no-zeroing optimisation, so if that requires a constructor, then let's find a constructor.
>
> To recap, I have possibly multiple buffers and I only want to use parts of each. So with the composed buffer I’d have to first extract view buffers and then compose those, so I end up with 3 levels of buffers. Not ideal. A value type as a tuple of a buffer reference, an offset and a length would be better. We could make an array of those and pass that in. It could also find uses elsewhere in JDK where the buffer+offset+length pattern occurs. The array, along with an encoding, could be passed up front to a String constructor (or a static method – I don’t care) and my use case would be equally satisfied.

From your description the most suitable API seems to be a method on
StringBuilder:

    public StringBuilder append(ByteBuffer bb, int offset, int length, Charset cs)

The only problem with this approach is that even if you guess the 
correct capacity of StringBuilder, a defensive copy of the byte[] has to 
be performed in toString(), because in the absence of synchronization, a 
data race could allow the String's byte[] to be modified after it is 
constructed.
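For contrast, here is what extracting a String from a sub-range of a ByteBuffer looks like with today's public API (the class and method names below are mine, for illustration only) -- note the two intermediate copies that a builder-based overload could avoid:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class DecodeSlice {
    // Decode [offset, offset + length) of bb without disturbing its
    // position/limit; duplicate() gives an independent view of the same bytes.
    static String decodeRange(ByteBuffer bb, int offset, int length) {
        ByteBuffer view = bb.duplicate();
        view.position(offset);
        view.limit(offset + length);
        // Charset.decode copies the bytes into a fresh CharBuffer, and
        // toString() copies again into the String -- exactly the
        // intermediate copies we would like to eliminate.
        return StandardCharsets.ISO_8859_1.decode(view).toString();
    }

    public static void main(String[] args) {
        ByteBuffer bb = ByteBuffer.wrap(
                "xxhelloyy".getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(decodeRange(bb, 2, 5)); // prints "hello"
    }
}
```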

But there is a trick that doesn't need synchronization and prevents data 
races: a OneShotStringBuilder:

http://cr.openjdk.java.net/~plevart/misc/OneShotStringBuilder/OneShotStringBuilder.java
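The core of the trick can be sketched minimally as follows (hypothetical code, not the linked class itself; the real version would live in java.lang, where build() could hand the exactly-sized array to the String constructor without the defensive copy made here):

```java
// One-shot builder: at most one build() call is ever allowed, so after it
// returns no append() can mutate the array -- which is what would let a
// java.lang implementation adopt the array into the String copy-free.
final class OneShotStringBuilder {
    private final char[] buf;
    private int count;
    private boolean built; // set by the single permitted build() call

    OneShotStringBuilder(int capacity) { buf = new char[capacity]; }

    OneShotStringBuilder append(String s) {
        if (built) throw new IllegalStateException("already built");
        s.getChars(0, s.length(), buf, count);
        count += s.length();
        return this;
    }

    String build() {
        if (built) throw new IllegalStateException("already built");
        built = true;
        // Outside java.lang we must still copy here; inside, the array
        // could be handed to String directly.
        return new String(buf, 0, count);
    }
}
```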

We now have optimized strategies for string concatenation (indyfied
String concat), but for programmatic building of Strings we are still
bound by a sub-optimal StringBuilder that does a defensive copy at the
end, even when the capacity is estimated exactly and the builder is
abandoned afterwards.

So perhaps, if such a OneShotStringBuilder has enough weight on its own
to be added to the standard API, it could also be equipped with ByteBuffer
appending/decoding operation(s)...
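Such an operation could be layered on the existing CharsetDecoder machinery; here is a rough sketch (method name and shape are mine) of decoding a buffer sub-range straight into a caller-owned array, which is what a builder's append-and-decode would do internally:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class DecodeInto {
    // Decode [off, off + len) of src directly into dst starting at dstOff,
    // returning the number of chars produced. No intermediate CharBuffer
    // allocation: the wrap() just points at the caller's array.
    static int decodeInto(ByteBuffer src, int off, int len, Charset cs,
                          char[] dst, int dstOff)
            throws CharacterCodingException {
        ByteBuffer in = src.duplicate();   // independent position/limit
        in.position(off);
        in.limit(off + len);
        CharBuffer out = CharBuffer.wrap(dst, dstOff, dst.length - dstOff);
        CharsetDecoder dec = cs.newDecoder();
        CoderResult r = dec.decode(in, out, true); // sketch: overflow
        if (r.isError()) r.throwException();       // (dst too small) ignored
        r = dec.flush(out);
        if (r.isError()) r.throwException();
        return out.position() - dstOff;
    }

    public static void main(String[] args) throws CharacterCodingException {
        char[] dst = new char[10];
        ByteBuffer bb = ByteBuffer.wrap(
                "xxhelloyy".getBytes(StandardCharsets.ISO_8859_1));
        int n = decodeInto(bb, 2, 5, StandardCharsets.ISO_8859_1, dst, 1);
        System.out.println(n + " " + new String(dst, 1, n));
    }
}
```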

As its use is constrained to a single thread and to constructing a
single String, there's a good chance that the JIT could optimize away
its allocation entirely. In that case such a builder is comparable to a
String constructor or factory method, although much more flexible in
expression.


What do you think?

Regards, Peter

>
> If I’ve understood correctly, there’s a plan (or at least a great desire) to turn Optional into a value type once they become available. These “buffer slice” objects could follow a similar path, and thus be made available independently of value types. Even if they can’t, the overhead they add would be small, I imagine.
>
> I’m slightly more worried about their array and what it implies, though. This API requires of me that I’m able to present all of the relevant buffers up front. In my case, these buffers are shared between multiple threads, so they are guarded by locks. I must take all the locks for all the buffers and hold them for the duration of the copy. _Maybe_ I have to do that anyway, but if there’s a way for me to only lock one buffer at a time, I’d like to be able to exploit this. The builder approach holds this door open for me.
>
> Cheers,
> Chris
>
>> On 10 Feb 2016, at 09:54, Paul Sandoz <paul.sandoz at oracle.com> wrote:
>>
>> Hi,
>>
>> A more functional approach would be to compose a sequence of buffers into one view, perhaps read-only. Then there would be no need to accept arrays of buffers. That should work well for bulk operations. That’s a non-trivial but not very difficult amount of work, and possibly simplified if restricted to read-only views.
>>
>> Thus i think we should focus Richard’s work with:
>>
>>   String(ByteBuffer src, String charset)
>>
>> and perhaps a sub-range variant, if perturbing the position/limit of an existing buffer and/or slicing is too problematic.
>>
>> Zeroing memory and possibly avoiding it can be tricky. Any such optimisations have to be carefully performed otherwise uninitialised regions might leak and be accessed, nefariously or otherwise. I imagine it’s easier to contain/control within a constructor than say a builder.
>>
>> Paul.
>>
>>> On 10 Feb 2016, at 05:38, Xueming Shen <xueming.shen at oracle.com> wrote:
>>>
>>> Hi Chris,
>>>
>>> I think basically you are asking a String constructor that takes a ByteBuffer. StringCoding
>>> then can take advantage of the current CompactString design to optimize the decoding
>>> operation by just a single byte[]/vectorized memory copy from the ByteBuffer to the String's
>>> internal byte[], WHEN the charset is 8859-1.
>>>
>>> String(ByteBuffer src, String charset);
>>>
>>> Further we will need a "buffer gathering" style constructor
>>>
>>> String(ByteBuffer[] srcs, String charset);
>>> (or more generally, String(ByteBuffer[] srcs, int off, int len, String charset))
>>>
>>> to create a String object from a sequence of ByteBuffers, if it's really desired.
>>>
>>> And then I would assume it will also be desired to extend the current
>>> CharsetDecoder/Encoder classes to add a pair of "gathering" style coding
>>> methods
>>>
>>> CharBuffer CharsetDecoder.decode(ByteBuffer... ins);
>>> ByteBuffer CharsetEncoder.encode(CharBuffer... ins);
>>>
>>> Though the implementation might have to deal with the tricky "split
>>> byte/char" issue, in which part of a byte/char sequence is at the end of one
>>> buffer and the continuing bytes/chars are at the start of the next ...
>>>
>>> -Sherman
>>>
>>>
>>> On 2/9/16 7:20 AM, Chris Vest wrote:
>>>> Hi,
>>>>
>>>> Aleksey Shipilev did a talk on his journey to implement compact strings and indified string concat at the JVM Tech Summit yesterday, and this reminded me that we (Neo4j) have a need for turning segments of DirectByteBuffers into Strings as fast as possible. If we already store the string data in Latin1, which is one of the two special encodings for compact strings, we’d ideally like to produce the String object with just the two necessary object allocations and a single, vectorised memory copy.
>>>>
>>>> Our use case is that we are a database and we do our own file paging, effectively having file data in a large set of DirectByteBuffers. We have string data in our files in a number of different encodings, a popular one being Latin1. Occasionally these String values span multiple buffers. We often need to expose this data as String objects, in which case decoding the bytes and turning them into a String is often very performance sensitive - to the point of being one of our top bottlenecks for the given queries. Part of the story is that in the case of Latin1, I’ll know up front exactly how many bytes my string data takes up, though I might not know how many buffers are going to be involved.
>>>>
>>>> As far as I can tell, this is currently not possible using public APIs. Using private APIs it may be possible, but will be relying on the JIT for vectorising the memory copying.
>>>>
>>>> From an API standpoint, CharsetDecoder is close to home, but is not quite there. It’s stateful and not thread-safe, so I either have to allocate new ones every time or cache them in thread-locals. I’m also required to allocate the receiving CharBuffer. Since I may need to decode from multiple buffers, I realise that I might not be able to get away from allocating at least one extra object to keep track of intermediate decoding state. The CharsetDecoder does not have a method where I can specify the offset and length for the desired part of the ByteBuffer I want to decode, which forces me to allocate views instead.
>>>>
>>>> The CharBuffers are allocated with a length up front, which is nice, but I can’t restrict its encoding so it has to allocate a char array instead of the byte array that I really want. Even if it did allocate a byte array, the CharBuffer is mutable, which would force String do a defensive copy anyway.
>>>>
>>>> One way I imagine this could be solved would be with a less dynamic kind of decoder, where the target length is given upfront to the decoder. Buffers are then consumed one by one, and a terminal method performs finishing sanity checks (did we get all the bytes we were promised?) and returns the result.
>>>>
>>>> StringDecoder decoder = Charset.forName("latin1").newStringDecoder(lengthInCharactersOrBytesImNotSureWhichIsBest);
>>>> String result = decoder.decode(buf1, off1, len1).decode(buf2, off2, len2).done();
>>>>
>>>> This will in principle allow the string decoding to be 2 small allocations, an array allocation without zeroing, and a sequence of potentially vectorised memcpys. I don’t see any potentially troubling interactions with fused Strings either, since all the knowledge (except for the string data itself) needed to allocate the String objects are available from the get-go.
>>>>
>>>> What do you guys think?
>>>>
>>>> Btw, Richard Warburton has already done some work in this area, and made a patch that adds a constructor to String that takes a buffer, offset, length, and charset. This work now at least needs rebasing: http://cr.openjdk.java.net/~rwarburton/string-patch-webrev/
>>>> It doesn’t solve the case where multiple buffers are used to build the string, but does remove the need for a separate intermediate state-holding object when a single buffer is enough. It’d be a nice addition if possible, but I (for one) can tolerate a small object allocation otherwise.
>>>>
>>>> Cheers,
>>>> Chris
>>>>



