<i18n dev> Codereview Request: 7039066 j.u.regex does not match TR#18 RL1.4 Simple Word Boundaries and RL1.2 Properties

Sat Apr 23 10:48:35 PDT 2011

Mark Davis ☕ <mark at macchiato.com> wrote
   on Sat, 23 Apr 2011 09:09:55 PDT: 

> The changes sound good. 

They sure do, don't they?  I'm quite happy about this.  I think it is more
important to get this in the queue than that it (necessarily) be done for
JDK7.  That said, having a good tr18 RL1 story for JDK7's Unicode 6.0 debut
makes it attractive now.  But if not now, then soon is good enough.

> The flag UNICODE_CHARSET will be misleading, since
> all of Java uses the Unicode Charset (= encoding). How about:

>       UNICODE_SPEC

> or something that gives that flavor.

I hadn't thought of that, but I do see what you mean.  The idea is 
that the semantics of \w etc change to match the Unicode spec in tr18.
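
To make the gap concrete, here is a minimal sketch of what \w does today
next to a hand-written class that only roughly approximates the tr18
definition.  The class name, the method names, and the choice of
[\p{L}\p{M}\p{Nd}\p{Pc}] as a stand-in are my own assumptions for
illustration, not anything from the proposed patch.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; all names here are illustrative only.
final class WordCharDemo {
    // Default java.util.regex semantics: \w is [a-zA-Z_0-9].
    static final Pattern ASCII_WORD = Pattern.compile("\\w+");

    // Rough stand-in for a Unicode-aware \w (an assumption, not the exact
    // tr18 set): letters, marks, decimal digits, connector punctuation.
    static final Pattern UNICODE_WORD =
        Pattern.compile("[\\p{L}\\p{M}\\p{Nd}\\p{Pc}]+");

    static boolean isAsciiWord(String s) {
        return ASCII_WORD.matcher(s).matches();
    }

    static boolean isUnicodeWord(String s) {
        return UNICODE_WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        String word = "na\u00efve"; // "naive" with U+00EF in the middle
        System.out.println(isAsciiWord(word));   // default \w rejects U+00EF
        System.out.println(isUnicodeWord(word)); // the explicit class accepts it
    }
}
```

The point of the new flag is that nobody should have to spell out the
second pattern by hand.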

I worry that UNICODE_SPEC, or even UNICODE_SEMANTICS, might be too
broad a brush.  What then happens when, as I imagine it someday shall,
Java gets full support for RL2.3 boundaries, the way one uses (?w)
or UREGEX_UWORD for in ICU?

Wouldn't calling something UNICODE_SPEC be too broad? Or should
UNICODE_SPEC automatically include not just existing Unicode flags
like UNICODE_CASE, but also any UREGEX_UWORD that comes along?  
If it does, you have back-compat issue, and if it doesn't, you 
have a misnaming issue.  Seems like a bit of a Catch22.

The reason I'd suggested UNICODE_CHARSET was because of my own background
with the names we use for this within the Perl regex source code (which is
itself written in C).  I believe that Java doesn't have the same situation
as gave rise to it in Perl, and perhaps something else would be clearer.

Here's some background for why we felt we had to go that way. To control
the behavior of \w and such, when a regex is compiled, the compiled Perl
regex gets exactly one of these states:

    REGEX_UNICODE_CHARSET
    REGEX_LOCALE_CHARSET
    REGEX_ASCII_RESTRICTED_CHARSET
    REGEX_DEPENDS_CHARSET 

That state it normally inherits from the surrounding lexical scope,
although this can be overridden with /u and /a, or (?u) and (?a),
either within the pattern or as a separate pattern-compilation flag.

REGEX_UNICODE_CHARSET corresponds to our (?u), so \w and such all get the
full RL1.2a definitions.  Because Perl always does Unicode casemapping --
and full casemapping, too, not just simple -- we didn't need (?u) for what
Java uses it for, which is just as an extra flavor of (?i); it doesn't
do all that much.
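
For anyone who hasn't bumped into this, a small sketch of what Java's
existing (?u) / UNICODE_CASE actually buys you.  The class and helper
names are made up for the example; the flags themselves are the current
java.util.regex ones.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
final class UnicodeCaseDemo {
    // Compile with the given flags and test for a whole-input match.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String lower = "\u00e9"; // e with acute, lowercase
        String upper = "\u00c9"; // E with acute, uppercase

        // (?i) alone only folds case for US-ASCII characters.
        System.out.println(matches(lower, Pattern.CASE_INSENSITIVE, upper));

        // Adding UNICODE_CASE makes the case folding Unicode-aware.
        System.out.println(matches(lower,
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE, upper));

        // UNICODE_CASE by itself does nothing: it only modifies (?i).
        System.out.println(matches(lower, Pattern.UNICODE_CASE, upper));
    }
}
```

Note that none of these combinations changes what \w matches, which is
the whole reason a new flag is needed.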

    (BTW, the old default is *not* some sort of non-Unicode charset
    semantics, it's the ugly REGEX_DEPENDS_CHARSET, which is Unicode for
    code points > 255 and "maybe" so in the 128-255 range.)

What we did certainly isn't perfect, but it allows for both backwards
compat and future growth.  This was because people want(ed) to be able to
use regexes on byte arrays as well as on character strings.  Me, I think
it's nuts to support this at all: if you want an input stream in (say)
CP1251 or ISO 8859-2, you simply set that stream's encoding and are
done with it, and everything turns into characters internally.  But there's old
byte and locale code out there whose semantics we are loth to change out
from under people.  Java has the same kind of issue.

The reason we ever support anything else is because we got (IMHO nasty)
POSIX locales before we got Unicode support, which didn't happen till
toward the end of the last millennium.  So we're stuck supporting code
well more than a decade old, perhaps indefinitely.  It's messy, but it
is very hard to do anything about that.  I think Java shares in that
perspective.

This corresponds, I think, to Java needing to support pre-Unicode
regex semantics on \w and related escapes.  If they had started out
with it always meaning the real thing, the way ICU did, they wouldn't
need both.

I wish there were a pragma to control this on a per-lexical-scope basis,
but I don't know enough about the Java compiler's internals to begin to
know how to go about implementing something like that, even as a
-XX:+UseUnicodeSemantics CLI switch for that compilation unit.

One reason you want this is because the Java String class has these
"convenience" methods like matches, replaceAll, etc, that take regexes
but do not provide an API that admits Pattern compile flags.  Without a
way to embed a (?U) directive or some such, there is no way to get the
effect of a Pattern.UNICODE_something flag through them.  The Java String
API could also be broadened through method signature overloading, but for
now, you can't do that.
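
Here is a sketch of that limitation using flags that already have an
embedded form.  The class and method names are hypothetical; the String
methods and the (?i) and (?u) directives are the existing ones.

```java
// Hypothetical demo class; names are illustrative only.
final class StringRegexDemo {
    // String.matches takes only the regex text: there is no flags argument.
    static boolean matchesPlain(String input) {
        return input.matches("\u00e9t\u00e9");
    }

    // The only way to change the matching mode is inside the regex itself.
    static boolean matchesEmbedded(String input) {
        return input.matches("(?iu)\u00e9t\u00e9");
    }

    // Same story for replaceAll.
    static String replaceEmbedded(String input) {
        return input.replaceAll("(?i)ab", "x");
    }

    public static void main(String[] args) {
        String shout = "\u00c9T\u00c9"; // uppercase form of the pattern text
        System.out.println(matchesPlain(shout));     // case differs, no flags
        System.out.println(matchesEmbedded(shout));  // (?iu) folds the case
        System.out.println(replaceEmbedded("Ab aB"));
    }
}
```

Whatever the new flag ends up being called, it needs the same kind of
embedded spelling to be reachable from these methods.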

No matter what the UNICODE_something gets called, I think there needs to be
a corresponding embeddable (?X)-style flag as well.  Even if String were
broadened, you'd want people to be able to specify *within the regex* that
that regex should have full Unicode semantics.  After all, they might read
the pattern in from a file.  That's why (most) Pattern.compile flags need
to be embeddable, too.  But you knew that already. :)
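
A sketch of the read-it-from-a-file case, with a StringReader standing in
for the file.  The class name, the helper, and the one-pattern-per-line
format are assumptions made up for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.regex.Pattern;

// Hypothetical demo class; names and file format are illustrative only.
final class PatternFromTextDemo {
    // Compiles the first line of the given text with no compile flags at
    // all, so any mode switches have to travel inside the pattern text.
    static Pattern compileFirstLine(String fileText) {
        try {
            BufferedReader in = new BufferedReader(new StringReader(fileText));
            String line = in.readLine();
            if (line == null) {
                throw new IllegalArgumentException("no pattern found");
            }
            return Pattern.compile(line);
        } catch (IOException e) {
            // Not expected for an in-memory reader; rethrow unchecked.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for a line of a config file written by someone else.
        Pattern p = compileFirstLine("(?iu)caf\u00e9\n");
        System.out.println(p.matcher("CAF\u00c9").matches());
    }
}
```

The code that calls Pattern.compile here has no idea which flags the
pattern's author wanted, so the pattern has to be able to say so itself.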

--tom