<i18n dev> regex rewriting code (part 2 of 3)

Tue Jan 25 11:12:52 PST 2011

When I set about resolving the Unicode troubles in Java regular 
expressions through rewriting them into something Java understood, 
I found it convenient to divide that functionality into two different 
rewriting functions, one to handle string escapes like \uXXXX and the 
other to handle charclass escapes like \w.  I will discuss the first of
these two functions here in part 2, and the second of them in part 3.

Even though I consider it only an alpha prototype, this is fully working
code that fulfills all the requirements I set.  It is in production use,
though only internally; it has not been released to outside parties.

I am absolutely *not* advocating that this code be taken up as is by 
the JDK.  Even if you do care to use some aspects of it--which you are
perfectly welcome to do, BTW; we're 100% open source--I entertain no 
notion of it remaining recognizable.  

I discuss it here because it does manage to resolve almost all of the
Level 1 compliance issues I have raised.  

Again, my rewrite code has two different functions: one 
for character escapes like \uXXXX, and one for charclass
escapes like \w.

CHARACTER ESCAPES
=================

The one for character escapes translates *symbolically* specified
character escapes found in the pattern into corresponding code
points.  It works on the following character escapes, of which
only the last one, \x{...}, is entirely new to Java.

    --  \a \e \f \n \r \t [but *not* \b due to next function]
    --  \cX
    --  \0 \0N \0NN \N \NN \NNN (where N is any octal digit)
    --  \xXX (X=2)
    --  \uXXXX (X=4)
    --  \x{XXXXXX} (X = 1-8)

This function serves at least four different purposes, which
I shall elaborate on below:

  1. Lets you unescape strings read in with embedded char escapes.
  2. Makes strings and patterns accept the same escape syntax.
  3. Resolves the RL1.1 niggle on hex notation.
  4. Resolves the CANON_EQ bug interfering with satisfying RL2.1 
     Canonical Equivalents, which the JDK7 pattern docs claim is met.

There is to my knowledge no function in the core Java library that takes as
input a string with character escapes and produces as output a new string
with all the character escapes replaced by literals.  That's what this one
does.

(Purpose 1) You need to do this so that you can read in strings from
somewhere other than program literals and have them treated the same as
though they were program literals.  Examples include command line
arguments, configuration files, environment variables, and user input.
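
To make that concrete, here is a minimal sketch of such an unescaping
function.  It is *not* my actual rewriting code: the class and method names
are made up for illustration, it handles only a subset of the escapes listed
above (no \cX and no octal), and it assumes that anything it does not
recognize, including \b, should be copied through untouched.

```java
// Hypothetical sketch, not the production code discussed in this letter.
final class EscapeSketch {
    // Expands a subset of symbolic escapes into the characters they name.
    // Unrecognized escapes (for example the word-boundary escape) and
    // malformed 2- or 4-digit hex escapes are copied through unchanged.
    static String unescape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c != '\\' || i + 1 == s.length()) {
                out.append(c);
                i++;
                continue;
            }
            char e = s.charAt(i + 1);
            int simple = "aefnrt".indexOf(e);
            if (simple >= 0) {
                // Same order as "aefnrt": BEL, ESC, FF, LF, CR, TAB.
                out.append("\u0007\u001B\f\n\r\t".charAt(simple));
                i += 2;
            } else if (e == 'u' && isHex(s, i + 2, i + 6)) {
                // Exactly four hex digits naming one 16-bit code unit.
                out.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                i += 6;
            } else if (e == 'x' && i + 2 < s.length() && s.charAt(i + 2) == '{') {
                // Braced form: one to eight hex digits naming a code point.
                int close = s.indexOf('}', i + 3);
                if (close < 0 || close - (i + 3) > 8 || !isHex(s, i + 3, close)) {
                    throw new IllegalArgumentException("Malformed \\x{...} at index " + i);
                }
                long cp = Long.parseLong(s.substring(i + 3, close), 16);
                if (cp > Character.MAX_CODE_POINT) {
                    throw new IllegalArgumentException("Code point out of range at index " + i);
                }
                out.appendCodePoint((int) cp);
                i = close + 1;
            } else if (e == 'x' && isHex(s, i + 2, i + 4)) {
                // Exactly two hex digits.
                out.append((char) Integer.parseInt(s.substring(i + 2, i + 4), 16));
                i += 4;
            } else {
                // Leave the backslash and the following character intact.
                out.append(c).append(e);
                i += 2;
            }
        }
        return out.toString();
    }

    // True if s[from, to) is non-empty, within bounds, and all hex digits.
    private static boolean isHex(String s, int from, int to) {
        if (from >= to || to > s.length()) return false;
        for (int k = from; k < to; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) return false;
        }
        return true;
    }
}
```

With something like this, a command line argument typed as the six
characters \u00E9 comes out as the single character U+00E9, exactly as if
it had been written in a program literal.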

(Purpose 2) Right now there are several subtle mismatches between which
character escapes work in Java's general string literals and which ones
work only within strings that eventually make their way to Pattern.compile().
This function handles both sorts so you don't have to remember which is which.
If need be, I can discuss what these mismatches are with precision.
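
A few of those mismatches can be seen with nothing but the stock JDK.  The
sketch below is only an illustration, not an exhaustive list, and the class
and method names are invented for the purpose.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only Pattern itself comes from the JDK.
final class EscapeMismatchDemo {
    // In a string literal, "\b" is a backspace (U+0008).  The compiler
    // resolves it long before the regex engine sees the string.
    static final String LITERAL_BACKSPACE = "\b";

    // With a doubled backslash, Pattern receives backslash + b, which it
    // treats as a word boundary rather than a backspace.
    static boolean wordBoundaryFinds(String text) {
        return Pattern.compile("\\bcat\\b").matcher(text).find();
    }

    // \a, \e, \cX and \xHH are understood by Pattern but are not legal
    // escapes in a Java string literal, hence the doubled backslashes.
    static boolean patternOnlyEscapes() {
        return Pattern.matches("\\a", "\u0007")
            && Pattern.matches("\\e", "\u001B")
            && Pattern.matches("\\cA", "\u0001")
            && Pattern.matches("\\x41", "A");
    }

    // Octal differs too: a string literal takes "\101" for 'A', whereas
    // Pattern requires a leading zero, as in \0101.
    static boolean octalForms() {
        return "\101".equals("A") && Pattern.matches("\\0101", "A");
    }
}
```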

(Purpose 3) The new \x{...} borrowed from Perl lets you specify logical
code points, not physical 16-bit code units.  That way you can look
directly at the code and know immediately what code point is meant without
having to run a pair of them through a function to combine high and low
surrogates.  This thereby satisfies RL1.1 even to my satisfaction.  
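
For a sense of what the reader is spared, take U+1D11E MUSICAL SYMBOL G
CLEF.  Written as \x{1D11E} it is self-explanatory; written as the pair
\uD834\uDD1E it is not.  The short demo below uses only standard
java.lang.Character calls; the class and helper names are mine and purely
illustrative.

```java
// Hypothetical demo class showing code points versus UTF-16 code units.
final class CodePointDemo {
    // U+1D11E lies outside the BMP, so Java stores it as two code units.
    static final int G_CLEF = 0x1D11E;

    // Builds a String holding exactly one code point.
    static String fromCodePoint(int cp) {
        return new String(Character.toChars(cp));
    }

    public static void main(String[] args) {
        String clef = fromCodePoint(G_CLEF);
        // Two 16-bit code units...
        System.out.println(clef.length());
        // ...but only one logical code point.
        System.out.println(clef.codePointCount(0, clef.length()));
        // The surrogates a reader would otherwise have to decode by hand.
        System.out.printf("%04X %04X%n", (int) clef.charAt(0), (int) clef.charAt(1));
        // Combining the pair recovers the code point.
        System.out.println(Character.toCodePoint('\uD834', '\uDD1E') == G_CLEF);
    }
}
```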

(Purpose 4) There is a bug in the Pattern.compile() code in its handling of
the CANON_EQ flag.  It normalizes the input string before it parses it.
That means that character literals are correctly normalized but character
escapes are not.  This function can be used as a workaround because if you
first pass the string through this one before you send it to compile, the
character escapes will have been turned into the needed literals already.
This allows the Java Pattern class to meet RL2.1.
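
The underlying reason is easy to show without touching Pattern at all:
normalization operates on the characters actually present in the string, and
an unexpanded escape is just a backslash followed by ASCII.  The sketch
below demonstrates only that point with java.text.Normalizer; it does not
try to reproduce the CANON_EQ bug itself, whose visible symptoms may depend
on the JDK version.  The class and method names are made up.

```java
import java.text.Normalizer;

// Hypothetical demo class; Normalizer is the standard JDK class.
final class CanonEqDemo {
    // Canonical decomposition (NFD) of the given string.
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String literal = "\u00E9";   // one char: e with acute accent
        String escaped = "\\u00E9";  // six chars: backslash, u, 0, 0, E, 9

        // The literal decomposes into 'e' plus a combining acute accent.
        System.out.println(nfd(literal).length());

        // The escaped form is plain ASCII, so normalization cannot see
        // the character it stands for and leaves it exactly as it was.
        System.out.println(nfd(escaped).equals(escaped));
    }
}
```

Expanding the escapes first means the string handed to compile already
holds the literal, which normalization can then decompose.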

That BTW is why this function does not translate \b into backspace as one
would normally expect if it were only unescaping strings: you have to leave
it intact so that it can be a word boundary if the string is used as a regex.
Optimally the API could be designed so both would be possible.

In the final part 3 of this letter, I will discuss the function 
I wrote that rewrites a Java regex's character-class escapes to make
them work right on Unicode strings.

--tom