Fwd: Re: <i18n dev> Codereview Request: 7039066 j.u.regex does not match TR#18 RL1.4 Simple Word Boundaries and RL1.2 Properties

Sun Apr 24 18:22:31 UTC 2011

Two more names have been suggested: UNICODE_PROPERTIES and UNICODE_CLASSES.

Any opinions?

-Sherman

On 4/23/2011 6:50 PM, Xueming Shen wrote:
> Forwarding...forgot to include the list.
>
> -------- Original Message --------
> Subject: 	Re: Codereview Request: 7039066 j.u.regex does not match 
> TR#18 RL1.4 Simple Word Boundaries and RL1.2 Properties
> Date: 	Sat, 23 Apr 2011 17:53:42 -0700
> From: 	Xueming Shen <xueming.shen at oracle.com>
> To: 	Tom Christiansen <tchrist at perl.com>
>
>
>
>   Mark, Tom,
>
> I agree with Mark that UNICODE_SPEC is a better name than
> UNICODE_CHARSET. We will have to deal with
> the "compatibility" issue Tom mentioned anyway, should Java move to a
> higher level of Unicode regex support
> someday. A new option/flag will have to be introduced to give the developer
> the choice, just as we
> are trying to do with the ASCII-only or Unicode versions of those classes.
>
> I also agree we should have an embedded flag. I was thinking we could add it
> later, for example in JDK8, if we
> can get this one into JDK7, but the Pattern usage in the String class is
> persuasive.
>
> The webrev, specdiff and Pattern doc have been updated to use
> UNICODE_SPEC as the flag and (?U) as the
> embedded flag. It might be a little confusing, given that we use (?u)
> for UNICODE_CASE, but it might
> feel "natural" to have an uppercase "U" for broader Unicode support.
>
> The webrev is at
> http://cr.openjdk.java.net/~sherman/7039066/webrev/
>
>   j.u.regex.Pattern API:
> http://cr.openjdk.java.net/~sherman/7039066/Pattern.html
>
> Specdiff:
> http://cr.openjdk.java.net/~sherman/7039066/specdiff/diff.html
>
> Tom, it would be appreciated if you could at least give the doc update a
> quick scan to see if I missed anything.
> And thanks for the suggestions for the Perl-related doc update. I will
> need to go through them a little later and address
> them in a separate CR.
>
> Thanks,
> -Sherman
>
>
> On 4/23/2011 10:48 AM, Tom Christiansen wrote:
> >  Mark Davis ☕<mark at macchiato.com>   wrote
> >      on Sat, 23 Apr 2011 09:09:55 PDT:
> >
> >>  The changes sound good.
> >  They sure do, don't they?  I'm quite happy about this.  I think it is more
> >  important to get this in the queue than that it (necessarily) be done for
> >  JDK7.  That said, having a good tr18 RL1 story for JDK7's Unicode 6.0 debut
> >  makes it attractive now.  But if not now, then soon is good enough.
> >
> >>  The flag UNICODE_CHARSET will be misleading, since
> >>  all of Java uses the Unicode Charset (= encoding). How about:
> >>         UNICODE_SPEC
> >>  or something that gives that flavor.
> >  I hadn't thought of that, but I do see what you mean.  The idea is
> >  that the semantics of \w etc change to match the Unicode spec in tr18.
> >
> >  I worry that UNICODE_SPEC, or even UNICODE_SEMANTICS, might be too
> >  broad a brush.  What then happens when, as I imagine it someday shall,
> >  Java gets full support for RL2.3 boundaries, the kind that with ICU one
> >  uses (?w) or UREGEX_UWORD for?
> >
> >  Wouldn't calling something UNICODE_SPEC be too broad? Or should
> >  UNICODE_SPEC automatically include not just existing Unicode flags
> >  like UNICODE_CASE, but also any UREGEX_UWORD that comes along?
> >  If it does, you have a back-compat issue, and if it doesn't, you
> >  have a misnaming issue.  Seems like a bit of a Catch-22.
> >
> >  The reason I'd suggested UNICODE_CHARSET was because of my own background
> >  with the names we use for this within the Perl regex source code (which is
> >  itself written in C).  I believe that Java doesn't have the same situation
> >  as gave rise to it in Perl, and perhaps something else would be clearer.
> >
> >  Here's some background for why we felt we had to go that way. To control
> >  the behavior of \w and such, when a regex is compiled, a compiled Perl
> >  regex gets exactly one of these states:
> >
> >       REGEX_UNICODE_CHARSET
> >       REGEX_LOCALE_CHARSET
> >       REGEX_ASCII_RESTRICTED_CHARSET
> >       REGEX_DEPENDS_CHARSET
> >
> >  That state it normally inherits from the surrounding lexical scope,
> >  although this can be overridden with /u and /a, or (?u) and (?a),
> >  either within the pattern or as a separate pattern-compilation flag.
> >
> >  REGEX_UNICODE_CHARSET corresponds to our (?u), so \w and such all get the
> >  full RL1.2a definitions.  Because Perl always does Unicode casemapping --
> >  and full casemapping, too, not just simple -- we didn't need (?u) for what
> >  Java uses it for, which is just as an extra flavor of (?i); it doesn't
> >  do all that much.
> >
> >       (BTW, the old default is *not* some sort of non-Unicode charset
> >       semantics, it's the ugly REGEX_DEPENDS_CHARSET, which is Unicode for
> >       code points > 255 and "maybe" so in the 128-255 range.)
> >
> >  What we did certainly isn't perfect, but it allows for both backwards
> >  compat and future growth.  This was because people want(ed) to be able to
> >  use regexes on byte arrays and also on character strings.  Me, I think
> >  it's nuts to support this at all, that if you want an input stream in (say)
> >  CP1251 or ISO 8859-2, that you simply set that stream's encoding and be
> >  done with it: everything turns into characters internally.  But there's old
> >  byte and locale code out there whose semantics we are loth to change out
> >  from under people.  Java has the same kind of issue.
> >
> >  The reason we ever support anything else is because we got (IMHO nasty)
> >  POSIX locales before we got Unicode support, which didn't happen till
> >  toward the end of the last millennium.  So we're stuck supporting code
> >  well more than a decade old, perhaps indefinitely.  It's messy, but it
> >  is very hard to do anything about that.  I think Java shares in that
> >  perspective.
> >
> >  This corresponds, I think, to Java needing to support pre-Unicode
> >  regex semantics on \w and related escapes.  If they had started out
> >  with it always meaning the real thing, the way ICU did, they wouldn't
> >  need both.
> >
> >  I wish there were a pragma to control this on a per-lexical-scope basis,
> >  but I don't know enough about the Java compiler's internals to begin to
> >  know how to go about implementing something like that, even as a
> >  -XX:+UseUnicodeSemantics CLI switch for that compilation unit.
> >
> >  One reason you want this is because the Java String class has these
> >  "convenience" methods like matches, replaceAll, etc, that take regexes
> >  but do not provide an API that admits Pattern compile flags.  Unless
> >  there is a way to embed a (?U) directive or some such, there is no way
> >  to get a Pattern.UNICODE_something flag to them.  The Java String API could
> >  also be broadened through method signature overloading, but for now, you
> >  can't do that.
> >
> >  No matter what the UNICODE_something gets called, I think there needs to be
> >  a corresponding embeddable (?X)-style flag as well.  Even if String were
> >  broadened, you'd want people to be able to specify *within the regex* that
> >  that regex should have full Unicode semantics.  After all, they might read
> >  the pattern in from a file.  That's why (most) Pattern.compile flags need
> >  to be embeddable, too.  But you knew that already. :)
> >
> >  --tom
>
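
For anyone reading this thread later, here is a minimal sketch of the two
points discussed above: the String convenience methods can only pick up the
new semantics through an embedded flag, and (?u) on its own is just an extra
flavor of (?i). It does not use the UNICODE_SPEC name from the webrev. It
assumes the flag as it ended up in JDK 7, Pattern.UNICODE_CHARACTER_CLASS,
with (?U) as the embedded form. The class and method names are made up for
illustration.

```java
import java.util.regex.Pattern;

public class UnicodeFlagSketch {
    // U+00E9 (small e with acute) and U+00C9 (capital E with acute),
    // written as escapes so the source file encoding does not matter.
    static final String SMALL_E_ACUTE = "\u00e9";
    static final String CAPITAL_E_ACUTE = "\u00c9";

    // String.matches takes no flags argument, so \w keeps its
    // default ASCII-only meaning here.
    static boolean wordByDefault(String s) {
        return s.matches("\\w");
    }

    // The embedded (?U) flag is the only way to ask String.matches
    // for the Unicode definition of \w.
    static boolean wordWithEmbeddedFlag(String s) {
        return s.matches("(?U)\\w");
    }

    // The same request made through Pattern.compile with the flag constant.
    static boolean wordWithCompileFlag(String s) {
        return Pattern.compile("\\w", Pattern.UNICODE_CHARACTER_CLASS)
                      .matcher(s)
                      .matches();
    }

    // UNICODE_CASE, i.e. (?u), only changes how CASE_INSENSITIVE folds case.
    static boolean eAcuteIgnoringCase(boolean unicodeCase) {
        int flags = Pattern.CASE_INSENSITIVE
                  | (unicodeCase ? Pattern.UNICODE_CASE : 0);
        return Pattern.compile(SMALL_E_ACUTE, flags)
                      .matcher(CAPITAL_E_ACUTE)
                      .matches();
    }

    public static void main(String[] args) {
        System.out.println(wordByDefault(SMALL_E_ACUTE));        // false
        System.out.println(wordWithEmbeddedFlag(SMALL_E_ACUTE)); // true
        System.out.println(wordWithCompileFlag(SMALL_E_ACUTE));  // true
        System.out.println(eAcuteIgnoringCase(false));           // false
        System.out.println(eAcuteIgnoringCase(true));            // true
    }
}
```

A pattern read in from a file can carry (?U) itself, which is the case Tom
raises for why compile flags should have embedded forms.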