string indexing (was: Java needs an immutable byte array wrapper)

Per Bothner per at bothner.com
Sun Nov 13 04:06:55 UTC 2016


On 11/12/2016 09:53 AM, Peter Lawrey wrote:
> Java 9 String has a byte [] at its core. I suspect it's not appropriate but
> worth thinking about.

Interesting.  I would be even more interested if they could make codePointAt
and codePointCount be constant-time: A number of programming languages define
a string as a sequence of code-points, and the indexing operator that their standard library
provide is basically codePointAt.  Example languages include Python3, Scheme, and the
XQuery/XPath/XSLT family.

Implementing string indexing for such a language on the JVM gives you the unpalatable choice
of either having indexing take linear time, or not using java.lang.String and thus hurting
Java interoperability.

Note it would be easy to change the Java9 String implementation such that codePointAt
was constant-time in the case of BMP-only (no-surrogate) strings.  Just use a bit in
the 'coder' field to indicate that the string is BMP-only.  Doing so would be a
big and easy win for the common BMP-only case, though it doesn't give us
guaranteed constant-time indexing - a single non-BMP character breaks that.

As a compromise I recently implemented an IString class, which gives you O(1)
codepoint indexing while still being compact and implementing CharSequence efficiently:

http://sourceware.org/viewvc/kawa/branches/invoke/gnu/lists/IString.java?view=markup
[Warning: this has not been tested much.]

Still, it would be much nicer if we could use java.lang.String directly.  It wouldn't
be very expensive.  Note that the offsets array in my IString class only adds 0.24 bytes
per 2-byte char, so roughly 12%.  It is possible to encode the Java9 'coder' field using
the IString 'offsets' field (by using a static flag array for the LATIN1 case).
-- 
	--Per Bothner
per at bothner.com   http://per.bothner.com/


More information about the discuss mailing list