RFR: 8197594 - String and character repeat

Stuart Marks stuart.marks at oracle.com
Tue Feb 27 00:57:30 UTC 2018


On 2/18/18 1:37 AM, James Laskey wrote:
> Didn’t I hear someone mentioning “\U1D11A” at some point?

On 2/19/18 7:55 AM, Martin Buchholz wrote:
> Oops, I already got it wrong - it's already at 6 hex digits because there are 17 
> planes, not 16.  MAX_CODE_POINT is U+10FFFF.
> Yes, we need a variable width syntax like regex \x{h...h}

Yeah, there are a bunch of syntactic alternatives to consider. An "obvious" 
alternative to "\uxxxx" is "\Uxxxxxx", which works if you're always willing to 
specify six digits. Another is some weird non-delimited but variable-length 
sequence, such as the existing octal escapes for chars (see JLS 3.10.6; does 
anybody use those?). The difference between \u and \U is rather subtle, though. 
Or a delimited sequence such as the one used by regex might be an alternative.
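
For reference, here is a minimal sketch of how a supplementary character such as 
U+1D11A has to be spelled in Java source today, next to the delimited form that 
java.util.regex already accepts. The class and method names are made up for 
illustration; only Character.toChars and Pattern.matches are real JDK calls.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names here are illustrative only.
class SupplementaryEscapeDemo {

    // The four-digit Unicode escape cannot express U+1D11A directly,
    // so the source literal has to be a UTF-16 surrogate pair.
    static final String SURROGATE_PAIR = "\uD834\uDD1A";

    // Workaround that keeps the code point readable in source.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    // The regex engine already understands the delimited, variable-width form.
    static boolean matchesDelimitedEscape(String s) {
        return Pattern.matches("\\x{1D11A}", s);
    }

    public static void main(String[] args) {
        String staff = fromCodePoint(0x1D11A);
        System.out.println(staff.equals(SURROGATE_PAIR));            // same string
        System.out.println(staff.length());                          // UTF-16 code units
        System.out.println(staff.codePointCount(0, staff.length())); // code points
        System.out.println(matchesDelimitedEscape(staff));
    }
}
```

The surrogate-pair literal and the Character.toChars call produce the same 
two-char string; a language-level "\Uxxxxxx" or delimited escape would just be a 
more direct spelling of it.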

> And java regex also supports
>    \N{name}  The character with Unicode character name 'name'
> so we could do the same for the java language.
> Although it would be a little weird to have every Unicode update make some 
> previously invalid source files valid.
> 
> We could also say "It's 2018 and UTF-8 has won" and simply use UTF-8 in source 
> files directly.   No Unicode escapes needed.

Even if one is willing to have a source file in UTF-8 (as opposed to, say, ASCII), 
there are things in Unicode that are really hard to edit. For example, there are 
zero-width spaces, joiners, non-joiners, directionality markers, etc. I think 
escapes are the bare minimum. Some kind of name-based interpolation would be 
useful, but the actual Unicode names are rather unwieldy. Maybe something like 
HTML entities would be worthwhile to investigate, though probably with a 
different syntax.
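
As a rough illustration of the name-based idea, the library can already map 
between code points and Unicode names at run time. This sketch assumes Java 9 
or later (Character.codePointOf was added in 9; Character.getName in 7); the 
class and helper names are hypothetical, and this is not a proposal for 
language syntax.

```java
// Hypothetical demo class; the names here are illustrative only.
class InvisibleCharacterDemo {

    // Resolve an official Unicode character name to a one-code-point string.
    static String byName(String unicodeName) {
        return new String(Character.toChars(Character.codePointOf(unicodeName)));
    }

    public static void main(String[] args) {
        // Characters that are hard to see or edit when typed directly:
        // a zero-width space, a non-joiner, a joiner, a directionality mark.
        int[] invisible = {0x200B, 0x200C, 0x200D, 0x200E};
        for (int cp : invisible) {
            System.out.printf("U+%04X %s%n", cp, Character.getName(cp));
        }

        // Name-based lookup at run time instead of a numeric escape.
        String zwj = byName("ZERO WIDTH JOINER");
        System.out.println(Integer.toHexString(zwj.codePointAt(0)));
    }
}
```

The names it prints show the unwieldiness mentioned above: they are precise but 
long, which is why a shorter entity-like alias scheme might be more practical.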

s'marks

