string indexing

Per Bothner per at bothner.com
Mon Nov 14 01:28:36 UTC 2016



On 11/13/2016 04:21 AM, Zenaan Harkness wrote:
> Although grapheme indexing is probably more generally useful for
> multi-lingual UI.

Quite possibly.  However, a code-point can be represented as an unboxed
int.  A grapheme is a variable-length sequence of code-points, so in
general it requires memory allocation. You cannot store it in a register
or even a fixed number of registers, unless you use an indirect
substring representation (base string, start offset, end offset), which
has its own problems.
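
A minimal sketch of that difference in Java. The class and method names
are hypothetical, chosen only for illustration. The loop reads each
code-point into a plain int and allocates nothing, while a grapheme such
as 'e' followed by a combining acute accent already spans two code-points
and so does not fit in a single int.

```java
class CodePointDemo {
    /** Returns the largest code-point in s; each code-point is held in a plain int. */
    static int maxCodePoint(String s) {
        int max = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);       // unboxed int, no allocation
            max = Math.max(max, cp);
            i += Character.charCount(cp);    // advance by 1 or 2 UTF-16 code-units
        }
        return max;
    }

    public static void main(String[] args) {
        // 'a' followed by U+1F600, written as a surrogate pair.
        System.out.println(Integer.toHexString(maxCodePoint("a\uD83D\uDE00"))); // prints 1f600

        // 'e' + U+0301 (combining acute accent): one grapheme, two code-points.
        String grapheme = "e\u0301";
        System.out.println(grapheme.codePointCount(0, grapheme.length())); // prints 2
    }
}
```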

You can always build a grapheme-based API on top of a code-point API,
but not vice versa. You can of course do the same on top of a UTF-16
code-unit API, but it's more error-prone and unnatural: at least
code-points have some natural semantic meaning; code-units do not.
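
A rough sketch of that layering, under a loud simplifying assumption: a
"grapheme" here is just a base code-point plus any combining marks that
follow it. Real grapheme segmentation (Unicode UAX #29) has many more
rules. SimpleGraphemes, isCombiningMark and clusters are hypothetical
names, not an existing API. The input is an int[] of code-points, and
each cluster comes back as a freshly allocated array.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SimpleGraphemes {
    /** True for the general categories Mn, Mc and Me (combining marks). */
    static boolean isCombiningMark(int cp) {
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    /** Groups code-points into simplified clusters: a base plus trailing marks. */
    static List<int[]> clusters(int[] codePoints) {
        List<int[]> out = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= codePoints.length; i++) {
            // A cluster ends at the end of input or just before the next non-mark.
            if (i == codePoints.length || !isCombiningMark(codePoints[i])) {
                out.add(Arrays.copyOfRange(codePoints, start, i)); // allocation per cluster
                start = i;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // 'e', U+0301 (combining acute accent), 'a'
        int[] cps = "e\u0301a".codePoints().toArray();
        System.out.println(clusters(cps).size()); // prints 2
    }
}
```

Doing the same over UTF-16 code-units would also require pairing up
surrogates before any of these checks could be made.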

> "CharSequence" is deceptive. Should be called CodePointSequence or
> something else again... "char" is -so- overloaded in Java in particular.

java.lang.CharSequence is *not* a sequence of code-points.
It's a sequence of UTF-16 code-units, just like java.lang.String.
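
A small illustration, using only standard java.lang methods; the class
and helper names are hypothetical. For a single code-point outside the
Basic Multilingual Plane, length() and charAt() see two UTF-16
code-units, while the codePoints() default method (Java 8 and later)
sees one code-point.

```java
class CharSequenceUnits {
    /** Number of UTF-16 code-units, which is what length() reports. */
    static int codeUnitCount(CharSequence cs) {
        return cs.length();
    }

    /** Number of code-points, obtained by decoding surrogate pairs. */
    static long codePointCount(CharSequence cs) {
        return cs.codePoints().count();
    }

    public static void main(String[] args) {
        CharSequence cs = "\uD83D\uDE00"; // U+1F600 as a surrogate pair
        System.out.println(codeUnitCount(cs));                     // prints 2
        System.out.println(Character.isHighSurrogate(cs.charAt(0))); // prints true
        System.out.println(codePointCount(cs));                    // prints 1
    }
}
```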
-- 
	--Per Bothner
per at bothner.com   http://per.bothner.com/

