<i18n dev> RL1.1 Hex Notation (part 2 of 3)

Sat Jan 22 19:54:36 PST 2011

Sherman,

In part 1, I outlined why I think that making end-users think about
representation issues in regexes goes against, if not perhaps the letter
of the law, then certainly to my mind the spirit of UTS(tr)#18 when it
says that "the regular expression engine provides support for Unicode
characters as basic logical units."

Please understand that I don't think that is much of a big deal -- it's a
rather low priority bug at worst -- because when you look at it from a
particular perspective, it appears to be a surface-level matter only.

(Also because it is easily addressed just by adding \x{XXX}, which is 
both simple and safe.)  

It's not that big of a deal because as you yourself point out, Sherman, you
can still specify any code point, although you have to bend over sideways
to do it.  But it doesn't at all affect behaviour, which is by far the more
important matter.

These next two serialization concerns, however, are different. This time
they are not just surface issues.  They are actual behavioral problems in
regexes that derive from the internal implementation of characters
in Java:

  **  Surrogate Bugs in Regexes

  **  CANON_EQ Bugs in Regexes with \\uXXXX

I don't think users should have to know about those implementation
details, but if they don't, they will get several sorts of anomalous
behaviour.  I therefore believe those two are both genuine bugs.

I know exactly what is causing the second one (code included), but 
fixing it is going to require some code rearrangement and reworking.

===========================
 Surrogate Bugs in Regexes
===========================

Here is one of them:

    Unicode       UTF-16
   Code Point     String            Pattern     Result
   =========   ==============      =========    ======
    U+1F47E    "\uD83D\uDC7E"       /^.$/       true
     n/a       "\uD83D"             /^.$/       TRUE!

I do not understand how that same pattern--which says to match
strings containing a single Unicode code point only--can match
both those strings.  That's why I believe the TRUE! result is an error.

Don't you?

I understand that it brings up some tricky stuff.  Consider:

    If you have a string "HL" where H is a high surrogate and L a low
    surrogate, Java's regex engine correctly concludes that that string
    "HL" exactly matches the pattern "^.$" in its entirety; it has just one
    logical character in it.   This is correct.  It fails to match "^..$",
    which is also correct and for the same reason.

    However, if you flip those around to get string "LH", it now exactly
    matches the pattern "^..$" in its entirety, thus claiming it holds
    exactly two characters even though there are no legal
    code points there!

    If you have just one of the two surrogates, either "H" or "L", both of
    those will also match "^.$" just as "HL" does.  That says that a lone
    surrogate is just as much a single logical character as a proper pair
    of them together is.

But that makes no sense at all.  How can both be correct?  Surely that
*must* be a bug?  What am I not understanding here?
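
For anyone who wants to reproduce the four cases above, here is a minimal
sketch.  The class name SurrogateDotDemo and the helper fullMatch are made
up for illustration; only java.util.regex.Pattern is real API.  H and L are
the two halves of U+1F47E from the table, and the results may of course
vary with the JDK release being tested.

```java
import java.util.regex.Pattern;

// Illustrative only: the class and method names are hypothetical.
class SurrogateDotDemo {
    static final String H = "\uD83D";   // high surrogate of U+1F47E
    static final String L = "\uDC7E";   // low surrogate of U+1F47E

    // True if the entire input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Label first, then the actual UTF-16 string under test.
        String[][] cases = {
            {"HL", H + L}, {"LH", L + H}, {"H", H}, {"L", L},
        };
        for (String[] c : cases) {
            System.out.printf("%-2s  ^.$ -> %-5b  ^..$ -> %b%n",
                c[0], fullMatch("^.$", c[1]), fullMatch("^..$", c[1]));
        }
    }
}
```

Only the "HL" row contains a legal code point, so it is the only row where
a true result is unsurprising.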

I really think that rather than returning true for something that
isn't even a legal Unicode code point, it should instead either

    1: raise an exception

and/or

    2: admit some pattern flag to deal with such cases

I say this because you are not supposed to have to deal with representation
and serialization issues in regexes, and this makes you think about them.
It also gives you bizarre answers even when you do think about them.
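
Short of a change in the engine, a caller could approximate option 1 by
checking the input before matching.  This is only a sketch under that
assumption: SurrogateCheck, hasUnpairedSurrogate and strictMatches are
hypothetical names, built on the real Character.isHighSurrogate and
Character.isLowSurrogate methods.

```java
import java.util.regex.Pattern;

// Illustrative only: a caller-side guard, not part of java.util.regex.
class SurrogateCheck {
    // True if s contains a high surrogate not followed by a low one,
    // or a low surrogate not preceded by a high one.
    static boolean hasUnpairedSurrogate(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++;            // well-formed pair, skip its second half
                } else {
                    return true;    // high surrogate with no partner
                }
            } else if (Character.isLowSurrogate(c)) {
                return true;        // low surrogate with no preceding high
            }
        }
        return false;
    }

    // Refuses to match ill-formed input instead of returning an answer.
    static boolean strictMatches(String regex, CharSequence input) {
        if (hasUnpairedSurrogate(input)) {
            throw new IllegalArgumentException("input contains an unpaired surrogate");
        }
        return Pattern.compile(regex).matcher(input).matches();
    }
}
```

With this guard, "HL" is matched as usual, while "LH", "H" and "L" are all
rejected before the regex engine ever sees them.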

=======================================
 CANON_EQ Bugs in Regexes with \\uXXXX
=======================================

Another place where you are forced to think about the internal
representation in Java regexes is that they can behave differently if
you pass things in as "\\uXXXX" instead of as "\uXXXX".  I don't think
that can be correct behaviour, either.

The problem is that CANON_EQ can no longer be trusted.  If you compile
these patterns with CANON_EQ, then it makes a difference whether you've
used a literal or a \\uXXXX form.  Please consider these, as I believe that
the FALSE! results below are all in error:

        String          Pattern
                       w/CANON_EQ           Result
        =========     ============        =========
     A : "\u00E9"       "^\u00E9$"          true
     B : "\u00E9"       "^e\u0301$"         true
     A': "\u00E9"       "^\\u00E9$"         true
     B': "\u00E9"       "^e\\u0301$"        FALSE!

     C : "e\u0301"      "^\u00E9$"          true
     D : "e\u0301"      "^e\u0301$"         true
     C': "e\u0301"      "^\\u00E9$"         FALSE!
     D': "e\u0301"      "^e\\u0301$"        true

The ABCD versions all use literals converted by the Java compiler during
its lexical substitution phase, whereas the prime versions use escapes
for UTF-16 code units that get passed through to the regex compiler for
it to interpret.

(This second mechanism is indispensable to meet the requirement
of being able to code up any code point, and to facilitate reading
patterns written in ASCII but specifying trans-ASCII code points.)
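
A minimal sketch for rerunning the eight cases in the table follows.  The
class name CanonEqEscapeDemo and the helper canonEqMatch are invented for
illustration; Pattern.CANON_EQ is the real flag.  No expected output is
given here, since the FALSE! rows are tied to the implementation under
discussion and another JDK release may behave differently.

```java
import java.util.regex.Pattern;

// Illustrative only: class and method names are hypothetical.
class CanonEqEscapeDemo {
    // Full-string match with canonical equivalence switched on.
    static boolean canonEqMatch(String regex, String input) {
        return Pattern.compile(regex, Pattern.CANON_EQ).matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] inputNames = {"composed", "decomposed"};
        String[] inputs = {"\u00E9", "e\u0301"};

        // The first two patterns hold real characters by the time the
        // regex compiler runs; the last two hold a backslash escape
        // that the regex compiler must decode itself.
        String[] patternNames = {
            "literal composed", "literal decomposed",
            "escaped composed", "escaped decomposed",
        };
        String[] patterns = {"^\u00E9$", "^e\u0301$", "^\\u00E9$", "^e\\u0301$"};

        for (int i = 0; i < inputs.length; i++) {
            for (int p = 0; p < patterns.length; p++) {
                System.out.printf("%-10s vs %-18s -> %b%n",
                    inputNames[i], patternNames[p],
                    canonEqMatch(patterns[p], inputs[i]));
            }
        }
    }
}
```

If canonical equivalence were applied uniformly, all eight lines would
print true.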

You get the same problem with octal notation: you can specify U+E9 as
"\351" for the prepass literal (which works), or as "\\0351" for the
regex engine to see (which fails just as \\u did):

        String          Pattern
                       w/CANON_EQ           Result
        =========     ============        =========
     a : "\u00E9"       "^\351$"            true
     a': "\u00E9"       "^\\0351$"          true
     c : "e\u0301"      "^\351$"            true
     c': "e\u0301"      "^\\0351$"          FALSE!
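
The difference between the two octal spellings can be made visible before
any matching happens.  In this sketch (the class name OctalEscapeDemo is
made up), "\351" is already the single character U+00E9 by the time the
program runs, whereas "\\0351" reaches the regex engine as five
characters that it has to decode on its own.

```java
import java.util.regex.Pattern;

// Illustrative only: the class name and constants are hypothetical.
class OctalEscapeDemo {
    static final String COMPILER_OCTAL = "\351";   // one char once compiled
    static final String ENGINE_OCTAL = "\\0351";   // backslash, 0, 3, 5, 1

    // Full-string match with canonical equivalence switched on.
    static boolean canonEqMatch(String regex, String input) {
        return Pattern.compile(regex, Pattern.CANON_EQ).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println("compiler octal length: " + COMPILER_OCTAL.length());
        System.out.println("engine octal length:   " + ENGINE_OCTAL.length());

        String[] inputs = {"\u00E9", "e\u0301"};
        for (String input : inputs) {
            System.out.printf("input of %d char(s): compiler octal -> %b, engine octal -> %b%n",
                input.length(),
                canonEqMatch("^" + COMPILER_OCTAL + "$", input),
                canonEqMatch("^" + ENGINE_OCTAL + "$", input));
        }
    }
}
```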

As you might predict, using UTF-8 directly in your code and compiling with
"javac -encoding UTF-8" behaves exactly as the non-prime "\uXXXX" versions
do, which can differ from how the prime "\\uXXXX" versions behave.

From looking at the code, I am sure I can reproduce this with \xXX escapes
as well.  That's because you do the normalization reshuffle before you
actually compile the pattern, so you won't see the octal or hex escapes
when you're doing the normalization.  The bug is right here in this code,
from around line 1500 of jdk1.7.0/java/util/regex/Pattern.java:

    /**
     * Copies regular expression to an int array and invokes the parsing
     * of the expression which will create the object tree.
     */
    private void compile() {
        // Handle canonical equivalences
        if (has(CANON_EQ) && !has(LITERAL)) {
            normalize();
        } else {
            normalizedPattern = pattern;
        }
        patternLength = normalizedPattern.length();

        // Copy pattern to int array for convenience
        // Use double zero to terminate pattern
        temp = new int[patternLength + 2];

Because things like \cC and \0XXX and \xXX and \uXXXX all get handled
*after* that point in the code, they are *not* the same as literals with
those values.  This is a genuine problem.
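
One conceivable caller-side workaround, given that ordering, is to expand
the escapes into literal characters before the pattern ever reaches
Pattern.compile, so that normalize() gets to see them.  The sketch below
is only that: EscapeExpander, expandUnicodeEscapes and compileCanonEq are
hypothetical names, not JDK API.  It handles only four-digit backslash-u
escapes, leaves ASCII escapes alone so that an escaped metacharacter does
not turn into a live one, and ignores \Q...\E quoting entirely.

```java
import java.util.regex.Pattern;

// Illustrative only: a pre-pass run before Pattern.compile.
class EscapeExpander {
    // Replaces each four-digit backslash-u escape naming a non-ASCII
    // code unit with the character itself; everything else is copied.
    static String expandUnicodeEscapes(String regex) {
        StringBuilder out = new StringBuilder(regex.length());
        int i = 0;
        while (i < regex.length()) {
            char c = regex.charAt(i);
            if (c != '\\' || i + 1 >= regex.length()) {
                out.append(c);
                i++;
                continue;
            }
            char next = regex.charAt(i + 1);
            if (next == 'u' && i + 6 <= regex.length()) {
                int value = parseHex4(regex, i + 2);
                if (value >= 0x80) {        // skip ASCII and malformed escapes
                    out.append((char) value);
                    i += 6;
                    continue;
                }
            }
            // Any other escape, including an escaped backslash: copy both
            // characters so the second is never re-read as an escape start.
            out.append(c).append(next);
            i += 2;
        }
        return out.toString();
    }

    // Parses four hex digits starting at 'start'; returns -1 if any is invalid.
    static int parseHex4(String s, int start) {
        int value = 0;
        for (int k = start; k < start + 4; k++) {
            int digit = Character.digit(s.charAt(k), 16);
            if (digit < 0) {
                return -1;
            }
            value = value * 16 + digit;
        }
        return value;
    }

    static Pattern compileCanonEq(String regex) {
        return Pattern.compile(expandUnicodeEscapes(regex), Pattern.CANON_EQ);
    }

    public static void main(String[] args) {
        String expanded = expandUnicodeEscapes("^e\\u0301$");
        System.out.println("same as literal form: " + expanded.equals("^e\u0301$"));
    }
}
```

After the pre-pass, the prime patterns are character-for-character the
same strings as their non-prime counterparts, so whatever CANON_EQ does
for one it must do for the other.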

So again we have to think about how things are stored.  It means that
you cannot just read in patterns that have had their non-ASCII characters
converted into \uXXXX escapes and have them work the same as having the
literals in there.  Those are supposed to be the same as the literals,
but they're not.

This is quite apart from the--um, "syntactic infelicity"?--of the mismatch
between how octal escapes are specified in the lexical substitution pass
versus how they're specified in the regex engine.  That, I wouldn't quite
call a bug so much as an unexpected wrinkle.  I do fix this in my regex
rewriter, BTW.

    (There are "syntactic infelicities" with \cC, too.  It is a bit too
     undiscerning, producing things that aren't guaranteed to be control
     characters because it blindly xors whatever follows it with 64.  For
     example, \c} is = and \c= is }, \cé is © and \c© is é, etc. )
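
The xor arithmetic in that aside is easy to check by hand or in code.  In
this small sketch the names ControlEscapeDemo, controlOf and engineAgrees
are hypothetical; controlOf simply applies the xor with 64 described
above, and engineAgrees asks the regex engine whether its \c escape
produces the same character.

```java
import java.util.regex.Pattern;

// Illustrative only: class and method names are hypothetical.
class ControlEscapeDemo {
    // The character obtained by xoring c with 64.
    static char controlOf(char c) {
        return (char) (c ^ 64);
    }

    // True if the regex control escape for c matches controlOf(c).
    static boolean engineAgrees(char c) {
        return Pattern.matches("\\c" + c, String.valueOf(controlOf(c)));
    }

    public static void main(String[] args) {
        char[] samples = {'C', '}', '=', '\u00E9', '\u00A9'};
        for (char c : samples) {
            System.out.printf("U+%04X xor 64 = U+%04X, engine agrees: %b%n",
                (int) c, (int) controlOf(c), engineAgrees(c));
        }
    }
}
```

Only the first sample yields an actual control character (U+0003); the
others just swap one printable character for another.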

This message is far too long again, so I will discuss your comments
regarding the j.l.Character class in part 3 of 3, to be sent later on.

Thanks again!

--tom