string indexing (was: Java needs an immutable byte array wrapper)

Sun Nov 13 12:21:48 UTC 2016

On Sat, Nov 12, 2016 at 08:06:55PM -0800, Per Bothner wrote:
> On 11/12/2016 09:53 AM, Peter Lawrey wrote:
> >Java 9 String has a byte [] at its core. I suspect it's not
> >appropriate but worth thinking about.

Time to read up on that, thanks.

> Interesting.  I would be even more interested if they could make
> codePointAt and codePointCount be constant-time: A number of
> programming languages define a string as a sequence of code-points,
> and the indexing operator that their standard library provide is
> basically codePointAt.  Example languages include Python3, Scheme, and
> the XQuery/XPath/XSLT family.

Ack.

Although grapheme indexing is probably more generally useful for
multi-lingual UI. Swift basically gets "String" right as far as my
reading of Swift's docs goes - not only code-points, but graphemes, the
next layer of indexing above code-points.

I cannott speak to Swift's implementation as to storage / time
tradeoffs made.

Trying to create a simple string formatter (left, right, centered) that
was also "multi lingual" lead me into the deep dark past of Java's (pre
v1.0) decision to go with UTF-16 (sensible at the time), which for 20
years has been known to be deficient (prior to Java 1.1 it was when
Unicode ascertained they needed more than 16 bits) and yet
java.lang.String never got updated, at least until recently with Java 9,
which now lays the foundation for a sane string class.

Took me two full working weeks to sort out the mess in my head, so I
wrote up the details of that exploration here:
https://zenaan.github.io/zen/javadoc/zen/lang/string.html
(Note, this was pre-Java 9)

Hopefully by Java 10, 11 or 12, we might see full grapheme support in
Java (as is the case in Swift), now that String is implemented with byte
array storage.

> Implementing string indexing for such a language on the JVM gives you
> the unpalatable choice of either having indexing take linear time, or
> not using java.lang.String and thus hurting Java interoperability.

Can class finality be bypassed at the JVM level?

With byte[] underlying Java 9's String class, code-point and grapheme
indexing could be in a subclass?

The trade off then is between the storage (and construction time) cost
for the extra layers of indexing (code-points, then graphemes on top of
that), vs the run time performance hit for dynamically finding these
index points every time needed. There is no universal "best" option of
course... depends always on the application.

> Note it would be easy to change the Java9 String implementation such
> that codePointAt was constant-time in the case of BMP-only
> (no-surrogate) strings.

I.e. without increasing storage cost. I don't think code-points really
solve the significant problem though (discovery of grapheme boundaries
when one truly needs to handle multiple languages).

> Just use a bit in the 'coder' field to indicate that the string is
> BMP-only.  Doing so would be a big and easy win for the common
> BMP-only case, though it doesn't give us guaranteed constant-time
> indexing - a single non-BMP character breaks that.

Again, my write up highlights the issues with code-points - we have
combining "characters", non displayed "characters" and plenty more
besides - it is graphemes (and non-graphemes) that, at the UI layer at
least, we really need to know about.

> As a compromise I recently implemented an IString class, which gives
> you O(1) codepoint indexing while still being compact and implementing
> CharSequence efficiently:
> 
> http://sourceware.org/viewvc/kawa/branches/invoke/gnu/lists/IString.java?view=markup
> [Warning: this has not been tested much.]

Thanks.

"CharSequence" is deceptive. Should be called CodePointSequence or
something else again... "char" is -so- overloaded in Java in particular.

> Still, it would be much nicer if we could use java.lang.String
> directly.  It wouldn't be very expensive.  Note that the offsets array
> in my IString class only adds 0.24 bytes per 2-byte char, so roughly
> 12%.  It is possible to encode the Java9 'coder' field using the
> IString 'offsets' field (by using a static flag array for the LATIN1
> case).

I strongly believe the that immutability of byte arrays would provide the
safety that java.lang.String otherwise provides, and that as long as
removing String finality did not significantly impact performance of
code in the wild, the new byte[] String would be entirely sufficient for
one or two additional, and optional indexing layers - one for
code-points, and the top layer for graphemes.

Regards,
Zenaan