string indexing

Mon Nov 14 02:07:43 UTC 2016

On Sun, Nov 13, 2016 at 05:28:36PM -0800, Per Bothner wrote:
> On 11/13/2016 04:21 AM, Zenaan Harkness wrote:
> >Although grapheme indexing is probably more generally useful for
> >multi-lingual UI.
> 
> Quite possibly.  However, a code-point can be represented as an unboxed
> int.  A grapheme requires memory allocation. You cannot store it in a
> register or even a fixed number of registers, unless you use an indirect
> substring representation (base string, start offset, end offset), which
> has its own problems.
> 
> You can always build a grapheme-based API on top of a codepoint API,
> but not vice versa. You can of course do the same on top of a UTF16
> code-unit API, but it's more error-prone and unnatural: At least
> code-points have some natural semantic meaning; code-units do not.

Ack.
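
To make the layering concrete, here is a minimal Java sketch of a
grapheme-style API built purely on top of code points. The cluster rule
is my own simplifying assumption (a base code point plus any combining
marks that follow); it is not the full Unicode segmentation algorithm,
and the names GraphemeSketch, isCombiningMark and graphemes are
hypothetical. Note that each cluster comes back as an allocated String,
which is the allocation cost mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

class GraphemeSketch {
    // Simplified assumption: only these three general categories extend a cluster.
    static boolean isCombiningMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
            || type == Character.ENCLOSING_MARK
            || type == Character.COMBINING_SPACING_MARK;
    }

    // Groups the code points of s into clusters: one code point followed by
    // zero or more combining marks. Each cluster is returned as a new String.
    static List<String> graphemes(String s) {
        int[] cps = s.codePoints().toArray();   // the code-point layer
        List<String> clusters = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= cps.length; i++) {
            // Close the current cluster at end of input or at the next non-mark.
            if (i == cps.length || !isCombiningMark(cps[i])) {
                clusters.add(new String(cps, start, i - start));
                start = i;
            }
        }
        return clusters;
    }

    public static void main(String[] args) {
        String s = "e\u0301x";   // 'e', combining acute accent (U+0301), 'x'
        System.out.println(s.codePointCount(0, s.length())); // 3 code points
        System.out.println(graphemes(s).size());             // 2 clusters
    }
}
```

Going the other way (recovering code points from clusters) needs no
extra rules at all, which is why the code-point API is the natural
lower layer.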

I would only point, of course, to:
http://utf8everywhere.org/

Java is what it is, and String is particularly unfortunate. Java 9's
byte[]-backed implementation is a performance improvement in some
situations, but it is still messy:

http://stackoverflow.com/questions/38213239/what-is-java-9s-new-string-implementaion
"
Because most usages of Strings are Latin-1 and only require one byte,
Java-9's String will be updated to be implemented under the hood as a
byte array with an encoding flag field to note if it is a byte array. If
the characters are not Latin-1 and require more than one byte it will be
stored as a UTF-16 char array (2 bytes per char) and the flag.
"
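
As a toy illustration of that idea (not the JDK source), a compact
string can be sketched as a byte array plus a flag recording which
encoding the bytes use. The class CompactText and its fields are
hypothetical names. Storing the two-byte form big-endian in the same
byte[] is my own choice for the sketch; as far as I understand JEP 254,
the real implementation also keeps the UTF-16 form in a byte[] rather
than a char[].

```java
final class CompactText {
    static final byte LATIN1 = 0;   // one byte per char
    static final byte UTF16 = 1;    // two bytes per char (big-endian in this sketch)

    final byte[] value;
    final byte coder;

    CompactText(String s) {
        int n = s.length();
        // The one-byte form is usable only if every code unit fits in 8 bits.
        boolean fitsLatin1 = true;
        for (int i = 0; i < n; i++) {
            if (s.charAt(i) > 0xFF) { fitsLatin1 = false; break; }
        }
        if (fitsLatin1) {
            coder = LATIN1;
            value = new byte[n];
            for (int i = 0; i < n; i++) {
                value[i] = (byte) s.charAt(i);
            }
        } else {
            coder = UTF16;
            value = new byte[2 * n];
            for (int i = 0; i < n; i++) {
                char c = s.charAt(i);
                value[2 * i] = (byte) (c >> 8);   // high byte
                value[2 * i + 1] = (byte) c;      // low byte
            }
        }
    }

    // Length in UTF-16 code units, whichever form is stored.
    int length() {
        return coder == LATIN1 ? value.length : value.length / 2;
    }

    // Every access has to branch on the flag: part of why this is messy.
    char charAt(int i) {
        if (coder == LATIN1) {
            return (char) (value[i] & 0xFF);
        }
        return (char) (((value[2 * i] & 0xFF) << 8) | (value[2 * i + 1] & 0xFF));
    }

    public static void main(String[] args) {
        System.out.println(new CompactText("caf\u00e9").value.length); // 4 bytes, LATIN1
        System.out.println(new CompactText("\u03a9x").value.length);   // 4 bytes, UTF16
    }
}
```

The indexing unit is unchanged in both forms: it is still the UTF-16
code unit, so the compact form saves memory without making indexing any
more meaningful.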

> >"CharSequence" is deceptive. Should be called CodePointSequence or
> >something else again... "char" is -so- overloaded in Java in particular.
> 
> java.lang.CharSequence is *not* a sequence of code-points.
> It's a sequence of UTF-16 code-units, just like java.lang.String.

All the more reason its name is problematic.
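
A small Java sketch of the mismatch, using only standard java.lang
calls; the class name CodeUnitDemo and the helper codePointLength are
made up for the example. One code point outside the BMP occupies two
positions in a CharSequence, and charAt() hands back half of it.

```java
class CodeUnitDemo {
    // Counts code points rather than UTF-16 code units.
    static int codePointLength(CharSequence cs) {
        return Character.codePointCount(cs, 0, cs.length());
    }

    public static void main(String[] args) {
        // 'a' followed by U+1F600, which needs a surrogate pair in UTF-16.
        CharSequence cs = "a\uD83D\uDE00";

        System.out.println(cs.length());           // 3 code units
        System.out.println(codePointLength(cs));   // 2 code points

        // Index 1 is only the first half of the pair, not a character.
        System.out.println(Character.isHighSurrogate(cs.charAt(1))); // true

        // Code-point access exists, but the index is still in code units.
        System.out.println(Integer.toHexString(Character.codePointAt(cs, 1))); // 1f600
    }
}
```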