Compact Strings and APIs for fast decoding of string data

Wed Feb 10 11:13:41 UTC 2016

Or, as Chris explains, a string that spans more than one buffer is a corner case for this software,
so for most strings the constructor that takes a buffer is fine, and for the corner case
a String constructor that takes a CharSequence seems easier to implement than creating a new kind of buffer that represents several buffers.
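
A rough sketch of the CharSequence idea, using only today's public API.
The class name Latin1Buffers is made up for illustration, and the sketch
assumes the bytes are ISO-8859-1 (one byte per char). String has no
CharSequence constructor today, so toString() goes through a StringBuilder
here, which does not avoid the intermediate copy; it only shows the shape.

```java
import java.nio.ByteBuffer;

// Hypothetical helper (not a JDK class): a read-only CharSequence view over
// the remaining bytes of several ByteBuffers, assuming ISO-8859-1 content.
final class Latin1Buffers implements CharSequence {
    private final ByteBuffer[] buffers; // slices, so the callers' positions are untouched
    private final int length;

    Latin1Buffers(ByteBuffer... srcs) {
        buffers = new ByteBuffer[srcs.length];
        int total = 0;
        for (int i = 0; i < srcs.length; i++) {
            buffers[i] = srcs[i].slice();   // covers position..limit of the source
            total += buffers[i].limit();
        }
        length = total;
    }

    @Override public int length() { return length; }

    @Override public char charAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index: " + index);
        }
        for (ByteBuffer b : buffers) {
            int size = b.limit();
            if (index < size) {
                return (char) (b.get(index) & 0xFF); // Latin-1: byte value == code point
            }
            index -= size;
        }
        throw new AssertionError("unreachable");
    }

    @Override public CharSequence subSequence(int start, int end) {
        return toString().subSequence(start, end);
    }

    @Override public String toString() {
        StringBuilder sb = new StringBuilder(length);
        for (ByteBuffer b : buffers) {
            for (int i = 0; i < b.limit(); i++) {
                sb.append((char) (b.get(i) & 0xFF));
            }
        }
        return sb.toString();
    }
}
```

A caller would write new Latin1Buffers(buf1, buf2).toString() for the rare
multi-buffer string and keep the single-buffer path for everything else.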

And by the way, I would prefer static factory methods over new constructors in String; there are already too many constructors.
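
For illustration only, this is roughly how a named factory could look from
the caller's side. StringFactories and fromBuffer are hypothetical names,
not proposed JDK API, and the body uses the existing Charset.decode, so it
still pays for the intermediate CharBuffer. It does show the sub-range
variant without perturbing the position/limit of the caller's buffer.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

// Hypothetical utility class; the names are assumptions for this sketch.
final class StringFactories {
    private StringFactories() {}

    // Decodes src[offset, offset + length) without touching src's own
    // position or limit, by working on a duplicate that shares the content.
    static String fromBuffer(ByteBuffer src, int offset, int length, Charset charset) {
        ByteBuffer view = src.duplicate();
        view.clear();                  // position = 0, limit = capacity
        view.limit(offset + length);   // throws IllegalArgumentException if out of range
        view.position(offset);
        return charset.decode(view).toString();
    }
}
```

Usage would be StringFactories.fromBuffer(buf, off, len, charset), which
reads better at the call site than yet another String constructor overload.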

regards,
Rémi

----- Original Message -----
> From: "Paul Sandoz" <paul.sandoz at oracle.com>
> Cc: core-libs-dev at openjdk.java.net
> Sent: Wednesday, 10 February 2016 09:54:17
> Subject: Re: Compact Strings and APIs for fast decoding of string data
> 
> Hi,
> 
> A more functional approach would be to compose a sequence of buffers into one
> view, perhaps read-only. Then there would be no need to accept arrays of
> buffers. That should work well for bulk operations. That’s a non-trivial but
> not very difficult amount of work, and possibly simplified if restricted to
> read-only views.
> 
> Thus I think we should focus Richard’s work on:
> 
>   String(ByteBuffer src, String charset)
> 
> and perhaps a sub-range variant, if perturbing the position/limit of an
> existing buffer and/or slicing is too problematic.
> 
> —
> 
> Zeroing memory and possibly avoiding it can be tricky. Any such optimisations
> have to be carefully performed otherwise uninitialised regions might leak
> and be accessed, nefariously or otherwise. I imagine it’s easier to
> contain/control within a constructor than say a builder.
> 
> Paul.
> 
> > On 10 Feb 2016, at 05:38, Xueming Shen <xueming.shen at oracle.com> wrote:
> > 
> > Hi Chris,
> > 
> > I think basically you are asking for a String constructor that takes a
> > ByteBuffer. StringCoding
> > then can take advantage of the current CompactString design to optimize the
> > decoding
> > operation by just a single byte[]/vectorized memory copy from the
> > ByteBuffer to the String's
> > internal byte[], WHEN the charset is 8859-1.
> > 
> > String(ByteBuffer src, String charset);
> > 
> > Further we will need a "buffer gathering" style constructor
> > 
> > String(ByteBuffer[] srcs, String charset);
> > (or more generally, String(ByteBuffer[] srcs, int off, int len, String
> > charset))
> > 
> > to create a String object from a sequence of ByteBuffers, if it's really
> > desired.
> > 
> > And then I would assume it will also be desired to extend the current
> > CharsetDecoder/Encoder classes to add a pair of "gathering" style
> > coding methods
> > 
> > CharBuffer CharsetDecoder.decode(ByteBuffer... ins);
> > ByteBuffer CharsetEncoder.encode(CharBuffer... ins);
> > 
> > Though the implementation might have to deal with the tricky "splitting
> > byte/char" issue, in which part of the "byte/char sequence" is in the
> > previous
> > buffer and the continuing byte/chars are in the next following buffer ...
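
The "splitting" issue can already be handled with the existing
CharsetDecoder by carrying the undecoded tail of one buffer over to the
next. Below is a minimal sketch under some assumptions: GatheringDecode is
a made-up name, the 8-byte carry buffer is assumed to be large enough for
one incomplete sequence of the charsets of interest (it is for UTF-8), and
malformed input is reported as an unchecked exception for brevity.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

// Hypothetical helper, not a JDK class.
final class GatheringDecode {
    private GatheringDecode() {}

    static String decode(Charset charset, ByteBuffer... ins) {
        CharsetDecoder decoder = charset.newDecoder(); // reports malformed input by default
        int total = 0;
        for (ByteBuffer in : ins) total += in.remaining();
        CharBuffer out = CharBuffer.allocate(
                (int) Math.ceil(total * (double) decoder.maxCharsPerByte()));
        ByteBuffer carry = ByteBuffer.allocate(8); // kept in "write mode" between uses

        for (ByteBuffer in : ins) {
            ByteBuffer src = in.duplicate(); // leave the caller's position alone
            // Complete a sequence left over from the previous buffer, one byte at a time.
            while (carry.position() > 0 && src.hasRemaining()) {
                carry.put(src.get());
                carry.flip();
                check(decoder.decode(carry, out, false));
                carry.compact(); // keeps whatever is still undecoded
            }
            check(decoder.decode(src, out, false));
            carry.put(src); // stash the incomplete tail, if any
        }
        carry.flip();
        check(decoder.decode(carry, out, true)); // leftover bytes here are malformed input
        check(decoder.flush(out));
        out.flip();
        return out.toString();
    }

    private static void check(CoderResult r) {
        if (r.isUnderflow()) return;
        if (r.isOverflow()) throw new IllegalStateException("output buffer too small");
        throw new IllegalArgumentException("malformed or unmappable input: " + r);
    }
}
```

Only the few bytes of a split sequence are copied; everything else is
decoded directly from the source buffers.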
> > 
> > -Sherman
> > 
> > 
> > On 2/9/16 7:20 AM, Chris Vest wrote:
> >> Hi,
> >> 
> >> Aleksey Shipilev did a talk on his journey to implement compact strings
> >> and indified string concat at the JVM Tech Summit yesterday, and this
> >> reminded me that we (Neo4j) have a need for turning segments of
> >> DirectByteBuffers into Strings as fast as possible. If we already store
> >> the string data in Latin1, which is one of the two special encodings for
> >> compact strings, we’d ideally like to produce the String object with just
> >> the two necessary object allocations and a single, vectorised memory
> >> copy.
> >> 
> >> Our use case is that we are a database and we do our own file paging,
> >> effectively having file data in a large set of DirectByteBuffers. We have
> >> string data in our files in a number of different encodings, a popular
> >> one being Latin1. Occasionally these String values span multiple buffers.
> >> We often need to expose this data as String objects, in which case
> >> decoding the bytes and turning them into a String is often very
> >> performance sensitive - to the point of being one of our top bottlenecks
> >> for the given queries. Part of the story is that in the case of Latin1,
> >> I’ll know up front exactly how many bytes my string data takes up, though
> >> I might not know how many buffers are going to be involved.
> >> 
> >> As far as I can tell, this is currently not possible using public APIs.
> >> Using private APIs it may be possible, but will be relying on the JIT for
> >> vectorising the memory copying.
> >> 
> >> From an API standpoint, CharsetDecoder is close to home, but is not quite
> >> there. It’s stateful and not thread-safe, so I either have to allocate
> >> new ones every time or cache them in thread-locals. I’m also required to
> >> allocate the receiving CharBuffer. Since I may need to decode from
> >> multiple buffers, I realise that I might not be able to get away from
> >> allocating at least one extra object to keep track of intermediate
> >> decoding state. The CharsetDecoder does not have a method where I can
> >> specify the offset and length for the desired part of the ByteBuffer I
> > >> want to decode, which forces me to allocate views instead.
> >> 
> >> The CharBuffers are allocated with a length up front, which is nice, but I
> >> can’t restrict its encoding so it has to allocate a char array instead of
> >> the byte array that I really want. Even if it did allocate a byte array,
> > >> the CharBuffer is mutable, which would force String to do a defensive copy
> >> anyway.
> >> 
> >> One way I imagine this could be solved would be with a less dynamic kind
> >> of decoder, where the target length is given upfront to the decoder.
> >> Buffers are then consumed one by one, and a terminal method performs
> >> finishing sanity checks (did we get all the bytes we were promised?) and
> >> returns the result.
> >> 
> >> StringDecoder decoder =
> > >> Charset.forName("latin1").newStringDecoder(lengthInCharactersOrBytesImNotSureWhichIsBest);
> >> String result = decoder.decode(buf1, off1, len1).decode(buf2, off2,
> >> len2).done();
> >> 
> >> This will in principle allow the string decoding to be 2 small
> >> allocations, an array allocation without zeroing, and a sequence of
> >> potentially vectorised memcpys. I don’t see any potentially troubling
> >> interactions with fused Strings either, since all the knowledge (except
> > >> for the string data itself) needed to allocate the String objects is
> > >> available from the get-go.
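
The shape of that API can be approximated on top of public APIs today, for
the Latin1 case only. In the sketch below, Latin1StringDecoder is a
hypothetical class, the length is taken in bytes (which equals chars for
Latin1), and the final new String(byte[], Charset) still copies, so this
illustrates the call pattern and the sanity checks, not the allocation
savings.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical class mirroring the proposed call pattern; not a JDK API.
final class Latin1StringDecoder {
    private final byte[] bytes; // target storage, sized up front
    private int filled;         // number of bytes received so far

    Latin1StringDecoder(int lengthInBytes) {
        bytes = new byte[lengthInBytes];
    }

    // Consumes buf[off, off + len) without changing buf's position or limit.
    Latin1StringDecoder decode(ByteBuffer buf, int off, int len) {
        if (len < 0 || len > bytes.length - filled) {
            throw new IllegalArgumentException("more bytes than promised");
        }
        ByteBuffer view = buf.duplicate();
        view.clear();
        view.limit(off + len);
        view.position(off);
        view.get(bytes, filled, len); // bulk copy into the target array
        filled += len;
        return this;
    }

    // Terminal step: checks that all promised bytes arrived, then builds the String.
    String done() {
        if (filled != bytes.length) {
            throw new IllegalStateException(
                    "expected " + bytes.length + " bytes, got " + filled);
        }
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }
}
```

A call then reads much like the proposal:
new Latin1StringDecoder(len1 + len2).decode(buf1, off1, len1).decode(buf2, off2, len2).done()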
> >> 
> >> What do you guys think?
> >> 
> >> Btw, Richard Warburton has already done some work in this area, and made a
> >> patch that adds a constructor to String that takes a buffer, offset,
> >> length, and charset. This work now at least needs rebasing:
> >> http://cr.openjdk.java.net/~rwarburton/string-patch-webrev/
> >> It doesn’t solve the case where multiple buffers are used to build the
> >> string, but does remove the need for a separate intermediate
> >> state-holding object when a single buffer is enough. It’d be a nice
> >> addition if possible, but I (for one) can tolerate a small object
> >> allocation otherwise.
> >> 
> >> Cheers,
> >> Chris
> >> 
> > 
> 
>