Fwd: Re: <i18n dev> Codereview Request: 7039066 j.u.rgex does not match TR#18 RL1.4 Simple Word Boundaries and RL1.2 Properties
Xueming Shen
xueming.shen at oracle.com
Sun Apr 24 18:22:31 UTC 2011
Two more names, UNICODE_PROPERTIES and UNICODE_CLASSES, have been suggested.
Any opinions?
-Sherman
On 4/23/2011 6:50 PM, Xueming Shen wrote:
> Forwarding...forgot to include the list.
>
> -------- Original Message --------
> Subject: Re: Codereview Request: 7039066 j.u.rgex does not match
> TR#18 RL1.4 Simple Word Boundaries and RL1.2 Properties
> Date: Sat, 23 Apr 2011 17:53:42 -0700
> From: Xueming Shen <xueming.shen at oracle.com>
> To: Tom Christiansen <tchrist at perl.com>
>
>
>
> Mark, Tom,
>
> I agree with Mark that UNICODE_SPEC is a better name than
> UNICODE_CHARSET. We will have to deal with
> the "compatibility" issue Tom mentioned anyway, should Java move to a
> higher level of Unicode regex support
> someday. A new option/flag will have to be introduced to give the developer
> the choice, just as we
> are trying to do with the ASCII-only or Unicode versions of those classes.
>
> I also agree we should have an embedded flag. I was thinking we could add it
> later, for example in JDK8, if we
> can get this one into JDK7, but the Pattern usage in the String class is
> persuasive.
>
> The webrev, specdiff and Pattern doc have been updated to use
> UNICODE_SPEC as the flag and (?U) as the
> embedded flag. It might be a little confusing, given that we use (?u)
> for UNICODE_CASE, but I feel it might
> seem "natural" to have an uppercase "U" for broader Unicode support.
>
> The webrev is at
> http://cr.openjdk.java.net/~sherman/7039066/webrev/
>
> j.u.regex.Pattern API:
> http://cr.openjdk.java.net/~sherman/7039066/Pattern.html
>
> Specdiff:
> http://cr.openjdk.java.net/~sherman/7039066/specdiff/diff.html
>
> Tom, it would be appreciated if you could at least give the doc update a
> quick scan to see if I missed anything.
> And thanks for the suggestions for the Perl-related doc update. I will
> need to go through them a little later and address
> them in a separate CR.
>
> Thanks,
> -Sherman
>
>
> On 4/23/2011 10:48 AM, Tom Christiansen wrote:
> > Mark Davis ☕<mark at macchiato.com> wrote
> > on Sat, 23 Apr 2011 09:09:55 PDT:
> >
> >> The changes sound good.
> > They sure do, don't they? I'm quite happy about this. I think it is more
> > important to get this in the queue than that it (necessarily) be done for
> > JDK7. That said, having a good tr18 RL1 story for JDK7's Unicode 6.0 debut
> > makes it attractive now. But if not now, then soon is good enough.
> >
> >> The flag UNICODE_CHARSET will be misleading, since
> >> all of Java uses the Unicode Charset (= encoding). How about:
> >> UNICODE_SPEC
> >> or something that gives that flavor.
> > I hadn't thought of that, but I do see what you mean. The idea is
> > that the semantics of \w etc change to match the Unicode spec in tr18.
> >
> > I worry that UNICODE_SPEC, or even UNICODE_SEMANTICS, might be too
> > broad a brush. What then happens when, as I imagine it someday shall,
> > Java gets full support for RL2.3 boundaries, the thing that with ICU
> > one uses (?w) or UREGEX_UWORD for?
> >
> > Wouldn't calling something UNICODE_SPEC be too broad? Or should
> > UNICODE_SPEC automatically include not just existing Unicode flags
> > like UNICODE_CASE, but also any UREGEX_UWORD that comes along?
> > If it does, you have a back-compat issue, and if it doesn't, you
> > have a misnaming issue. Seems like a bit of a Catch22.
> >
> > The reason I'd suggested UNICODE_CHARSET was because of my own background
> > with the names we use for this within the Perl regex source code (which is
> > itself written in C). I believe that Java doesn't have the same situation
> > as gave rise to it in Perl, and perhaps something else would be clearer.
> >
> > Here's some background for why we felt we had to go that way. To control
> > the behavior of \w and such, when a regex is compiled, the compiled Perl
> > regex gets exactly one of these states:
> >
> > REGEX_UNICODE_CHARSET
> > REGEX_LOCALE_CHARSET
> > REGEX_ASCII_RESTRICTED_CHARSET
> > REGEX_DEPENDS_CHARSET
> >
> > That state it normally inherits from the surrounding lexical scope,
> > although this can be overridden with /u and /a, or (?u) and (?a),
> > either within the pattern or as a separate pattern-compilation flag.
> >
> > REGEX_UNICODE_CHARSET corresponds to our (?u), so \w and such all get the
> > full RL1.2a definitions. Because Perl always does Unicode casemapping --
> > and full casemapping, too, not just simple -- we didn't need (?u) for what
> > Java uses it for, which is just as an extra flavor of (?i); it doesn't
> > do all that much.
> >
> > (BTW, the old default is *not* some sort of non-Unicode charset
> > semantics, it's the ugly REGEX_DEPENDS_CHARSET, which is Unicode for
> > code points > 255 and "maybe" so in the 128-255 range.)
> >
> > What we did certainly isn't perfect, but it allows for both backwards
> > compat and future growth. This was because people want(ed) to be able to
> > use regexes on both byte arrays yet also on character strings. Me, I think
> > it's nuts to support this at all, that if you want an input stream in (say)
> > CP1251 or ISO 8859-2, that you simply set that stream's encoding and be
> > done with it: everything turns into characters internally. But there's old
> > byte and locale code out there whose semantics we are loth to change out
> > from under people. Java has the same kind of issue.
> >
> > The reason we ever support anything else is because we got (IMHO nasty)
> > POSIX locales before we got Unicode support, which didn't happen till
> > toward the end of the last millennium. So we're stuck supporting code
> > well more than a decade old, perhaps indefinitely. It's messy, but it
> > is very hard to do anything about that. I think Java shares in that
> > perspective.
> >
> > This corresponds, I think, to Java needing to support pre-Unicode
> > regex semantics on \w and related escapes. If they had started out
> > with it always meaning the real thing, the way ICU did, they wouldn't
> > need both.
> >
> > I wish there were a pragma to control this on a per-lexical-scope basis,
> > but I don't know enough about the Java compiler's internals to begin to know
> > how to go about implementing something like that, even as a
> > -XX:+UseUnicodeSemantics CLI switch for that compilation unit.
> >
> > One reason you want this is because the Java String class has these
> > "convenience" methods like matches, replaceAll, etc, that take regexes
> > but do not provide an API that admits Pattern compile flags. There
> > is no way to embed a (?U) directive or some such, nor any way to pass
> > in a Pattern.UNICODE_something flag. The Java String API could also
> > be broadened through method signature overloading, but for now, you
> > can't do that.
> >
> > No matter what the UNICODE_something gets called, I think there needs to be
> > a corresponding embeddable (?X)-style flag as well. Even if String were
> > broadened, you'd want people to be able to specify *within the regex* that
> > that regex should have full Unicode semantics. After all, they might read
> > the pattern in from a file. That's why (most) Pattern.compile flags need
> > to be embeddable, too. But you knew that already. :)
> >
> > --tom
>
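The point about the String convenience methods can be illustrated with a
small sketch. The flag name was still being debated in this thread
(UNICODE_SPEC, UNICODE_PROPERTIES, UNICODE_CLASSES), so the code below
assumes the name that java.util.regex.Pattern exposes in JDK 7 and later,
Pattern.UNICODE_CHARACTER_CLASS, together with the embedded (?U) form
mentioned above. The class name, method names, and sample string are
hypothetical and chosen only for illustration.

```java
import java.util.regex.Pattern;

public class UnicodeFlagSketch {
    // "caf" followed by U+00E9, a letter outside the ASCII range.
    static final String WORD = "caf\u00e9";

    // Default semantics: \w covers only [a-zA-Z_0-9], so the whole
    // string is not a run of word characters.
    static boolean defaultMatch() {
        return Pattern.compile("\\w+").matcher(WORD).matches();
    }

    // Compile-time flag: \w follows the Unicode definition instead.
    // (Assumes the JDK 7+ constant name, not a name from this thread.)
    static boolean flagMatch() {
        return Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS)
                      .matcher(WORD)
                      .matches();
    }

    // Embedded flag: String.matches accepts no flags argument, so the
    // only way to request Unicode semantics is inside the regex itself.
    static boolean embeddedMatch() {
        return WORD.matches("(?U)\\w+");
    }

    public static void main(String[] args) {
        System.out.println("default:  " + defaultMatch());
        System.out.println("flag:     " + flagMatch());
        System.out.println("embedded: " + embeddedMatch());
    }
}
```

On a JDK 7 or later runtime, the first call should report false and the
other two true, which is the difference in \w behavior the thread is
discussing.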