JEP 254: Compact Strings thoughts: character ranges outside ASCII + EASCII blocks

Fri Sep 25 12:29:26 UTC 2015

Hi Simon,

On 09/25/2015 01:01 AM, Simon Spero wrote:
> [Some of this may be simple or prohibitively tricksy, depending on alignment
> constraints (even though it's restricted to the Basic Multilingual Plane :-) ]
> 
> For some not unrealistic use cases, the most significant bytes of all the
> characters in a string are identical, even if the string is non-Latin. For
> example, all the characters may be in the range U+0400--U+04FF, or
> U+0500--U+05FF.
> In these cases, it may be feasible to save the upper byte, then splat it
> into place when reconstituting the UTF-16 chars.
> 
> Because of the way Unicode code points are assigned, this technique is not
> as big a win as it might have been. Unlike (e.g.) 8859-5 or 8859-8, these
> blocks contain no punctuation marks, digits, or whitespace characters, which
> restricts the use cases to very short strings (the lack of whitespace is the
> biggest problem). For the 254-like coding system I was experimenting with,
> in the cases where I didn't fall back to UTF-16, the savings were
> overwhelmed by the cost of header words and padding.
> 
> It is possible to handle some of these mixtures, on some architectures,
> without resorting to LUTs or branches, but that's well into non-goal
> territory for JEP-254. There might be some useful win just from having an
> offset that is added to the packed value depending on whether the high bit
> is set. Anyone here from Москва?

Sure, many theoretical constructions may be devised. Not many of them
are practical.

JEP-254 wins big time exactly because many strings *are* storable as
single bytes in ASCII/8859-1, *especially* the long ones. So, the
very first thing you have to do is prove that an alternative scheme
successfully encodes a fair number of real strings. Otherwise, it is
not worth exploring any further. As you say, the lack of "usual"
characters like whitespace may be the deal breaker.
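
One way such a proof could start is by tallying, over a corpus of real
strings, how many fit each representation. The sketch below is only an
illustration under that assumption; CoderCoverage, Kind, classify and tally
are hypothetical names, and the corpus is whatever sample of application
strings one chooses to feed in.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

/** Hypothetical coverage counter for comparing candidate string coders. */
final class CoderCoverage {
    enum Kind { LATIN1, SHARED_HIGH_BYTE, UTF16_ONLY }

    /** Classifies one string by the cheapest representation that can hold it. */
    static Kind classify(String s) {
        boolean latin1 = true;          // every char below U+0100
        boolean shared = !s.isEmpty();  // every char has the same high byte
        int high = s.isEmpty() ? 0 : s.charAt(0) >>> 8;
        for (int i = 0; i < s.length(); i++) {
            int h = s.charAt(i) >>> 8;
            if (h != 0) {
                latin1 = false;
            }
            if (h != high) {
                shared = false;
            }
        }
        if (latin1) {
            return Kind.LATIN1;
        }
        return shared ? Kind.SHARED_HIGH_BYTE : Kind.UTF16_ONLY;
    }

    /** Counts how many strings of the corpus fall into each kind. */
    static Map<Kind, Integer> tally(List<String> corpus) {
        Map<Kind, Integer> counts = new EnumMap<>(Kind.class);
        for (Kind k : Kind.values()) {
            counts.put(k, 0);
        }
        for (String s : corpus) {
            counts.merge(classify(s), 1, Integer::sum);
        }
        return counts;
    }
}
```

If the SHARED_HIGH_BYTE bucket stays small on real data, or holds only very
short strings, the alternative coder has little to offer.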

Adding an alternative coder is easy, but making sure it does not regress
the prevailing cases of 8859-1/UTF-16 strings is much harder. Think about
the branching costs, losing the bit tricks that are currently employed with
the binary 0/1 coder, etc.
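
To illustrate the kind of bit trick at stake: with a 0/1 coder, the coder
value can double as a shift amount, so the string length falls out of the
backing array length with no branch. The sketch below is an assumption-laden
illustration, not the actual implementation; CoderShiftSketch and the
HIGH_BYTE coder are hypothetical names.

```java
/** Hypothetical illustration of a 0/1 coder used directly as a shift amount. */
final class CoderShiftSketch {
    static final byte LATIN1 = 0; // one byte per char
    static final byte UTF16 = 1;  // two bytes per char

    /** Branch-free: the coder value is itself the log2 of the bytes per char. */
    static int length(byte[] value, byte coder) {
        return value.length >> coder;
    }

    /** Hypothetical third coder: one byte per char, so shifting by it would be wrong. */
    static final byte HIGH_BYTE = 2;

    /** With three coders the shift trick no longer applies and a decision is needed. */
    static int lengthWithThreeCoders(byte[] value, byte coder) {
        return coder == UTF16 ? value.length >> 1 : value.length;
    }
}
```

The second method is still cheap in isolation, but the same extra decision
would appear in every operation that is currently a two-way choice, which
is where the regression risk for the common 8859-1/UTF-16 cases comes from.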

Thanks,
-Aleksey