<i18n dev> Fwd: Re: Codereview Request: 7039066 j.u.rgex does not match TR#18 RL1.4 Simple Word Boundaries and RL1.2 Properties

Mon Apr 25 05:38:48 UTC 2011

  Thanks Mark!

Let's go with UNICODE_PROPERTY, if there is no objection.

-Sherman

On 4/24/2011 9:00 PM, Mark Davis ☕ wrote:
> There are pluses and minuses to any of them: UNICODE_SPEC, 
> UNICODE_PROPERTY, UNICODE_CLASS, UNICODE_PROPERTIES, 
> or UNICODE_CLASSES, although any would work in a pinch.
>
> I'd favor a bit the singular over the plural, given the usage.
>
> The term 'class' is not used much in Unicode, just for two properties 
> (see below). So someone could possibly think it just meant those two 
> properties, and it could cause a bit of confusion with 'class' meaning 
> OO. So for that reason I don't think CLASS(ES) would be optimal.
> bc        ; Bidi_Class
> ccc       ; Canonical_Combining_Class
> http://unicode.org/Public/UNIDATA/PropertyAliases.txt
>
> Mark
>
> /— Il meglio è l’inimico del bene —/
>
>
> On Sun, Apr 24, 2011 at 11:22, Xueming Shen <xueming.shen at oracle.com> wrote:
>
>
>     Two more names, UNICODE_PROPERTIES and UNICODE_CLASSES, are suggested.
>
>     any opinion?
>
>     -Sherman
>
>
>     On 4/23/2011 6:50 PM, Xueming Shen wrote:
>>     Forwarding...forgot to include the list.
>>
>>     -------- Original Message --------
>>     Subject: 	Re: Codereview Request: 7039066 j.u.rgex does not match
>>     TR#18 RL1.4 Simple Word Boundaries and RL1.2 Properties
>>     Date: 	Sat, 23 Apr 2011 17:53:42 -0700
>>     From: 	Xueming Shen <xueming.shen at oracle.com>
>>     To: 	Tom Christiansen <tchrist at perl.com>
>>
>>
>>
>>       Mark, Tom,
>>
>>     I agree with Mark that UNICODE_SPEC is a better name than
>>     UNICODE_CHARSET. We will have to deal with
>>     the "compatibility" issue Tom mentioned anyway, should Java move to a
>>     higher level of Unicode Regex support
>>     someday. A new option/flag will have to be introduced to give the developer
>>     the choice, just like what we
>>     are trying to do with the ASCII-only or Unicode version for those classes.
>>
>>     I also agree we should have an embedded flag. I was thinking we could add it
>>     later, for example in JDK8, if we
>>     can get this one into JDK7, but the Pattern usage in the String class is
>>     persuasive.
>>
>>     The webrev, specdiff and Pattern doc have been updated to use
>>     UNICODE_SPEC as the flag and (?U) as the
>>     embedded flag. It might be a little confusing, given that we use (?u)
>>     for UNICODE_CASE, but it might
>>     feel "natural" to have an uppercase "U" for broader Unicode support.
>>
>>     The webrev is at
>>     http://cr.openjdk.java.net/~sherman/7039066/webrev/
>>
>>       j.u.regex.Pattern API:
>>     http://cr.openjdk.java.net/~sherman/7039066/Pattern.html
>>
>>     Specdiff:
>>     http://cr.openjdk.java.net/~sherman/7039066/specdiff/diff.html
>>
>>     Tom, it would be appreciated if you could at least give the doc update a
>>     quick scan to see if I missed anything.
>>     And thanks for the suggestions for the Perl-related doc update. I will
>>     need to go through them a little later and address
>>     them in a separate CR.
>>
>>     Thanks,
>>     -Sherman
>>
>>
>>     On 4/23/2011 10:48 AM, Tom Christiansen wrote:
>>     >  Mark Davis ☕ <mark at macchiato.com> wrote
>>     >      on Sat, 23 Apr 2011 09:09:55 PDT:
>>     >
>>     >>  The changes sound good.
>>     >  They sure do, don't they?  I'm quite happy about this.  I think it is more
>>     >  important to get this in the queue than that it (necessarily) be done for
>>     >  JDK7.  That said, having a good tr18 RL1 story for JDK7's Unicode 6.0 debut
>>     >  makes it attractive now.  But if not now, then soon is good enough.
>>     >
>>     >>  The flag UNICODE_CHARSET will be misleading, since
>>     >>  all of Java uses the Unicode Charset (= encoding). How about:
>>     >>         UNICODE_SPEC
>>     >>  or something that gives that flavor.
>>     >  I hadn't thought of that, but I do see what you mean.  The idea is
>>     >  that the semantics of \w etc change to match the Unicode spec in tr18.
>>     >
>>     >  I worry that UNICODE_SPEC, or even UNICODE_SEMANTICS, might be too
>>     >  broad a brush.  What then happens when, as I imagine it someday shall,
>>     >  Java gets full support for RL2.3 boundaries, the way one uses
>>     >  (?w) or UREGEX_UWORD for with ICU?
>>     >
>>     >  Wouldn't calling something UNICODE_SPEC be too broad? Or should
>>     >  UNICODE_SPEC automatically include not just existing Unicode flags
>>     >  like UNICODE_CASE, but also any UREGEX_UWORD that comes along?
>>     >  If it does, you have back-compat issue, and if it doesn't, you
>>     >  have a misnaming issue.  Seems like a bit of a Catch22.
>>     >
>>     >  The reason I'd suggested UNICODE_CHARSET was because of my own background
>>     >  with the names we use for this within the Perl regex source code (which is
>>     >  itself written in C).  I believe that Java doesn't have the same situation
>>     >  as gave rise to it in Perl, and perhaps something else would be clearer.
>>     >
>>     >  Here's some background for why we felt we had to go that way. To control
>>     >  the behavior of \w and such, when a regex is compiled, a compiled Perl
>>     >  regex gets exactly one of these states:
>>     >
>>     >       REGEX_UNICODE_CHARSET
>>     >       REGEX_LOCALE_CHARSET
>>     >       REGEX_ASCII_RESTRICTED_CHARSET
>>     >       REGEX_DEPENDS_CHARSET
>>     >
>>     >  That state it normally inherits from the surrounding lexical scope,
>>     >  although this can be overridden with /u and /a, or (?u) and (?a),
>>     >  either within the pattern or as a separate pattern-compilation flag.
>>     >
>>     >  REGEX_UNICODE_CHARSET corresponds to our (?u), so \w and such all get the
>>     >  full RL1.2a definitions.  Because Perl always does Unicode casemapping --
>>     >  and full casemapping, too, not just simple -- we didn't need (?u) for what
>>     >  Java uses it for, which is just as an extra flavor of (?i); it doesn't
>>     >  do all that much.
>>     >
>>     >       (BTW, the old default is *not* some sort of non-Unicode charset
>>     >       semantics, it's the ugly REGEX_DEPENDS_CHARSET, which is Unicode for
>>     >       code points > 255 and "maybe" so in the 128-255 range.)
>>     >
>>     >  What we did certainly isn't perfect, but it allows for both backwards
>>     >  compat and future growth.  This was because people want(ed) to be able to
>>     >  use regexes on both byte arrays yet also on character strings.  Me, I think
>>     >  it's nuts to support this at all, that if you want an input stream in (say)
>>     >  CP1251 or ISO 8859-2, that you simply set that stream's encoding and be
>>     >  done with it: everything turns into characters internally.  But there's old
>>     >  byte and locale code out there whose semantics we are loth to change out
>>     >  from under people.  Java has the same kind of issue.
>>     >
>>     >  The reason we ever support anything else is because we got (IMHO nasty)
>>     >  POSIX locales before we got Unicode support, which didn't happen till
>>     >  toward the end of the last millennium.  So we're stuck supporting code
>>     >  well more than a decade old, perhaps indefinitely.  It's messy, but it
>>     >  is very hard to do anything about that.  I think Java shares in that
>>     >  perspective.
>>     >
>>     >  This corresponds, I think, to Java needing to support pre-Unicode
>>     >  regex semantics on \w and related escapes.  If they had started out
>>     >  with it always meaning the real thing, the way ICU did, they wouldn't
>>     >  need both.
>>     >
>>     >  I wish there were a pragma to control this on a per-lexical-scope basis,
>>     >  but I don't know enough about the Java compiler's internals to begin to know
>>     >  how to go about implementing something like that, even as a
>>     >  -XX:+UseUnicodeSemantics CLI switch for that compilation unit.
>>     >
>>     >  One reason you want this is because the Java String class has these
>>     >  "convenience" methods like matches, replaceAll, etc, that take regexes
>>     >  but do not provide an API that admits Pattern compile flags.  If there
>>     >  is no way to embed a (?U) directive or some such, then those methods
>>     >  cannot get the new semantics at all, since there is no way to pass
>>     >  in a Pattern.UNICODE_something flag.  The Java String API could also
>>     >  be broadened through method signature overloading, but for now, you
>>     >  can't do that.
>>     >
>>     >  No matter what the UNICODE_something gets called, I think there needs to be
>>     >  a corresponding embeddable (?X)-style flag as well.  Even if String were
>>     >  broadened, you'd want people to be able to specify *within the regex* that
>>     >  that regex should have full Unicode semantics.  After all, they might read
>>     >  the pattern in from a file.  That's why (most) Pattern.compile flags need
>>     >  to be embeddable, too.  But you knew that already. :)
>>     >
>>     >  --tom
>>
>
>
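
The point made in the thread about String convenience methods and an
embeddable flag can be sketched in Java. This is only an illustration, not
code from the webrev. It assumes the API as it appears in released JDK 7 and
later, where the compile flag is named Pattern.UNICODE_CHARACTER_CLASS (none
of the names proposed above) and the embedded form is (?U). The class name
UnicodeFlagSketch and its method names are made up for the example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names here are illustrative only.
class UnicodeFlagSketch {
    // "café": the last letter (U+00E9) is outside ASCII.
    static final String WORD = "caf\u00e9";

    // Default semantics: \w is ASCII-only, so the whole word does not match.
    static boolean defaultMatch() {
        return Pattern.compile("\\w+").matcher(WORD).matches();
    }

    // Compile-time flag (name as released in JDK 7, an assumption relative
    // to this thread): \w follows the Unicode definition.
    static boolean flagMatch() {
        return Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS)
                      .matcher(WORD).matches();
    }

    // Embedded flag: the only route for String.matches, which takes no flags.
    static boolean embeddedMatch() {
        return WORD.matches("(?U)\\w+");
    }

    public static void main(String[] args) {
        System.out.println(defaultMatch());   // false
        System.out.println(flagMatch());      // true
        System.out.println(embeddedMatch());  // true
    }
}
```

The third method is the case Tom describes: String.matches has no overload
that accepts Pattern flags, so the embedded (?U) is what lets a pattern read
from a file, or passed to a convenience method, opt in to Unicode semantics.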