JEP 254: Compact Strings thoughts: character ranges outside ASCII + EASCII blocks

Thu Sep 24 22:01:57 UTC 2015

[Some of this may be simple or prohibitively tricky depending on alignment
constraints (even though it's restricted to the Basic Multilingual Plane :-) ]

For some not unrealistic use cases, every character in a string has the
same most significant byte, even if the string is non-Latin. For example,
all the characters may be in the range U+0400--U+04FF, or U+0500--U+05FF.
In these cases, it may be feasible to store the upper byte once, then splat
it into place when reconstituting the UTF-16 chars.
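
A minimal sketch of that idea, in plain Java rather than anything
resembling the actual JEP 254 implementation. The class and method names
(HighBytePacking, sharedHighByte, pack, unpack) are made up for
illustration, and the sketch ignores headers, padding and alignment
entirely:

```java
// Hypothetical illustration only; not part of JEP 254 or java.lang.String.
final class HighBytePacking {

    /**
     * Returns the common high byte (0..255) if every char in s has the
     * same most significant byte, or -1 if they differ or s is empty.
     */
    static int sharedHighByte(String s) {
        if (s.isEmpty()) {
            return -1;
        }
        int high = s.charAt(0) >>> 8;
        for (int i = 1; i < s.length(); i++) {
            if ((s.charAt(i) >>> 8) != high) {
                return -1;
            }
        }
        return high;
    }

    /** Keeps only the low byte of each char; the high byte is stored once elsewhere. */
    static byte[] pack(String s) {
        byte[] low = new byte[s.length()];
        for (int i = 0; i < low.length; i++) {
            low[i] = (byte) s.charAt(i);
        }
        return low;
    }

    /** Rebuilds the UTF-16 chars by splatting the saved high byte over each low byte. */
    static String unpack(int high, byte[] low) {
        char[] chars = new char[low.length];
        for (int i = 0; i < low.length; i++) {
            chars[i] = (char) ((high << 8) | (low[i] & 0xFF));
        }
        return new String(chars);
    }
}
```

For a string such as "\u041C\u043E\u0441\u043A\u0432\u0430", sharedHighByte
returns 0x04 and unpack(0x04, pack(s)) gives the original string back. As
soon as a space or digit is appended, sharedHighByte returns -1 and the
string would have to fall back to UTF-16.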

Because of the way Unicode code points are assigned, this technique is not
as big a win as it might have been. Unlike (e.g.) ISO 8859-5 or 8859-8,
these blocks contain no punctuation marks, digits, or whitespace
characters, which restricts the use cases to very short strings (the lack
of whitespace is the biggest problem). For the 254-like coding system I was
experimenting with, in the cases where I didn't fall back to UTF-16, the
savings were overwhelmed by the cost of header words and padding.

It is possible to handle some of these mixtures (e.g. letters from one
block plus ASCII whitespace and digits), on some architectures, without
resorting to LUTs or branches, but that's well into non-goal territory for
JEP 254. There might be some useful win just from being able to add an
offset to the packed value depending on whether the high bit is set.
Anyone here from Москва?
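
To make the offset idea concrete, here is a rough sketch under my own
assumptions: packed bytes 0x00..0x7F stand for ASCII, and bytes 0x80..0xFF
stand for a 128-character window starting at a chosen code point (U+0400
in the usage below). The names (OffsetPacking, encode, decode,
windowStart) are hypothetical, and the decode step uses a mask instead of
a branch or table:

```java
// Hypothetical illustration only; not part of JEP 254 or java.lang.String.
final class OffsetPacking {

    /**
     * Packs s into one byte per char, or returns null if some char is
     * neither ASCII nor inside [windowStart, windowStart + 0x7F].
     */
    static byte[] encode(String s, int windowStart) {
        int offset = windowStart - 0x80;
        byte[] packed = new byte[s.length()];
        for (int i = 0; i < packed.length; i++) {
            int c = s.charAt(i);
            if (c < 0x80) {
                packed[i] = (byte) c;          // ASCII: high bit clear
            } else {
                int d = c - offset;            // window chars map to 0x80..0xFF
                if (d < 0x80 || d > 0xFF) {
                    return null;               // caller falls back to UTF-16
                }
                packed[i] = (byte) d;
            }
        }
        return packed;
    }

    /** Adds the offset only when the high bit is set, with no branch and no lookup table. */
    static String decode(byte[] packed, int windowStart) {
        int offset = windowStart - 0x80;
        char[] chars = new char[packed.length];
        for (int i = 0; i < packed.length; i++) {
            int v = packed[i] & 0xFF;
            int mask = -(v >>> 7);             // 0 if high bit clear, all ones if set
            chars[i] = (char) (v + (offset & mask));
        }
        return new String(chars);
    }
}
```

With windowStart = 0x0400, a mixed string such as
"\u041C\u043E\u0441\u043A\u0432\u0430 2015" round-trips through encode and
decode at one byte per character, whereas text from a different block
(Armenian, say) makes encode return null.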

Simon
p.s.
   As part of the replacement for sun.misc.Unsafe, could we get a
jdk.infernal/...ABitDodgy, which would allow the full set of SIMD
instructions to be generated in an architecture independent fashion? (By
architecture independent I mean if you ask for a NEON instruction on an
amd64, or an SSE 4.2 string primitive on SPARC, that's what gets emitted).