RFR: 8197594 - String and character repeat
Stuart Marks
stuart.marks at oracle.com
Tue Feb 27 00:57:30 UTC 2018
On 2/18/18 1:37 AM, James Laskey wrote:
> Didn’t I hear someone mentioning “\U1D11A” at some point?
On 2/19/18 7:55 AM, Martin Buchholz wrote:
> Oops, I already got it wrong - it's already at 6 hex digits because there are 17
> planes, not 16. MAX_CODE_POINT is U+10FFFF.
> Yes, we need a variable width syntax like regex \x{h...h}
Yeah, there are a bunch of syntactic alternatives to consider. An "obvious"
alternative to "\uxxxx" is "\Uxxxxxx", which works if you're always willing to
specify six digits (or to have some weird non-delimited but variable-length
sequence, such as the existing octal escapes for chars; does anybody use
those? See JLS 3.10.6). The difference between \u and \U is rather subtle,
though. Or a delimited sequence, such as the one used by regex, might be an
alternative.
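
For reference, here is a minimal sketch of what is available today for a
supplementary code point such as U+1D11A: a surrogate pair of \u escapes in the
source, a numeric code point at run time, and the delimited \x{h...h} form
inside a regex. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

public class SupplementaryEscapes {
    // U+1D11A spelled as a surrogate pair, the only escape form the
    // language currently offers for code points above U+FFFF.
    public static final String STAFF = "\uD834\uDD1A";

    // Builds the same string from the numeric code point at run time.
    public static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    // The regex syntax already has a delimited, variable-width form.
    public static boolean regexMatches(String s) {
        return Pattern.matches("\\x{1D11A}", s);
    }

    public static void main(String[] args) {
        System.out.println(STAFF.equals(fromCodePoint(0x1D11A)));    // true
        System.out.println(STAFF.length());                          // 2 chars
        System.out.println(STAFF.codePointCount(0, STAFF.length())); // 1 code point
        System.out.println(regexMatches(STAFF));                     // true
        // Octal escapes are non-delimited and variable-length (JLS 3.10.6).
        System.out.println("\101".equals("A"));                      // true
        System.out.println(Integer.toHexString(Character.MAX_CODE_POINT)); // 10ffff
    }
}
```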
> And java regex also supports
> \N{name}  The character with Unicode character name 'name'
> so we could do the same for the java language.
> Although it would be a little weird to have every Unicode update make some
> previously invalid source files valid.
>
> We could also say "It's 2018 and UTF-8 has won" and simply use UTF-8 in source
> files directly. No Unicode escapes needed.
Even if one is willing to have a source file in UTF-8 (as opposed to, say,
ASCII), there are things in Unicode that are really hard to edit. For example, there are
zero-width spaces, joiners, non-joiners, directionality markers, etc. I think
escapes are the bare minimum. Some kind of name-based interpolation would be
useful, but the actual Unicode names are rather unwieldy. Maybe something like
HTML entities would be worthwhile to investigate, though probably with a
different syntax.
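
To illustrate the editing problem and the name-based lookups that already
exist in the library, here is a small sketch. It assumes JDK 9 or later for
Character.codePointOf and for \N{name} in regex; the class and method names
are hypothetical.

```java
import java.util.regex.Pattern;

public class InvisibleCharacters {
    // A zero-width space (U+200B) between 'a' and 'b'. Written as an escape
    // it is visible in the source; typed literally it would not be.
    public static final String WITH_ZWSP = "a\u200Bb";

    // Code point to Unicode name.
    public static String nameOf(int codePoint) {
        return Character.getName(codePoint);
    }

    // Unicode name to code point (JDK 9+).
    public static int lookup(String name) {
        return Character.codePointOf(name);
    }

    // Name-based escape in a regex (JDK 9+).
    public static boolean containsZwsp(String s) {
        return Pattern.compile("\\N{ZERO WIDTH SPACE}").matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(WITH_ZWSP.length());             // 3, though it displays like "ab"
        System.out.println(nameOf(0x200B));                 // ZERO WIDTH SPACE
        System.out.println(lookup("ZERO WIDTH SPACE") == 0x200B); // true
        System.out.println(containsZwsp(WITH_ZWSP));        // true
        System.out.println(containsZwsp("ab"));             // false
    }
}
```

The names are exact but long, which is the unwieldiness mentioned above.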
s'marks