j.u.regex: Negated Character Classes

Fri Jun 3 21:55:28 UTC 2011

I'm sure everybody understands what "negated character classes" [^...] 
in j.u.regex means.
You would never have doubt about

[^c] does NOT match "c"
[^0-9] does NOT match "8"
[^a-z] does NOT match "b"
[^a-bc-d] does NOT match 'c"

But how about

does [^[c]] match "c"?
does [^[0-9]] match "8"?
does [^[a-z]] match "b"?
does [^a-b[c-d]] match "c"?

I was wrong on all of them when was asked first time and it took me a while
to figure out what is going on behind it. Oh, btw, the answer is "yes" 
for all 4, yes,
the

[^[c]] matches "c"
[^[0-9]] matches "8"
[^[a-z]] matches "b".
[^a-b[c-d]] matches "c"  (while [^a-bc-d] does NOT match "c")

Another interesting sample is

[^a-b[c-d]e-f] matches "c" but does NOT match "e" (so the "e-f" part after
the nested character class [c-d] does back to "normal").

It appears the "negation" of the "outer" character class does not affect 
its nested
character class, so [^X] is always opposite from [^[X]], "X" to be any 
character
class.

Same "strange" thing seems to be true for "intersection operation &&" as 
well,
so both [a-d&&c-f] and [^a-d&&c-f] do NOT match "a".

This does not sound correct, at least for me.

The source code suggests that we are treating the nested/embedded [...] 
character
class and the "intersection &&" specially, so

[^[X]  is interpreted as [^] union [X]
[^X[Y]] is interpreted as [^X] union [Y]
[^X[Y]Z] is interpreted as [^XZ] union [Y]
[^X&&Y] is interpreted as [^X] && Y

What I meant "treating...specially" is that we do NOT do the same thing 
for other
"embedded character classes", so while [^[a-z]] does match "c",
[^\p{Lower}] actually does NOT match "c", which I would expect.

The j.u.regex.Pattern APIs do NOT help. All the samples given for 
"Character classes"[1]
section are "simple" negation, no "nested" sample is given. And neither 
"^" nor
"[^...]" appear in the operator precedence table. The behaviors in other 
regex engines,
such as Perl and POSIX, don't help, as "nested" character class is not 
supported there.

I did check with the original author who wrote this part of the code. It 
appears the current
implementation is indeed what he intended to do back then, so this 
behavior is NOT an
implementation bug but by design.

Personally I don't feel this design is not correct. Ideally, I would 
assume the spec either
specifies [^...] as a separate "group operator" to be the "complement" 
of [...], or "^" as the
"negation operator" with the lowest precedence, such as (from lowest to 
highest)

(1) Negation  ^        (only at the beginning of the [...])
(2) Intersection &&
(3) Range -
(4) nested class []

So
[^X[Y]] would be the "complement" of [X<union>[Y]]

[^X[Y]Z] would be the "complement" of  [X<union>[Y]<union>Z]

[^X&&Y] would be the "complement" of [X&&Y]

for example, if I dump the regex internal logic node tree for the sample 
regex
[^a-b[c-d]e-f], the jdk7 and jdk8 results would look like

/home/sherman/TL/regex$ 
/export/sherman/Workspace/jdk7/jdk/build/linux-i586/bin/java RegEx -flag 
"1000" "[^a-b[c-d]e-f]" "c"
Pattern=<[^a-b[c-d]e-f]>
Input  =<c>
      1: <Difference> (7)
      2: <Union> (0)
      3: <Complement> (0)
      4: <Range [a-b]> (0)
      5: <Range [c-d]> (0)
      6: <Range [e-f]> (0)
      7: <END> (0)
     -------------------------------
match:true
     groupCount=0

/home/sherman/TL/regex$ 
/export/sherman/Workspace/jdk8/jdk/build/linux-i586/bin/java RegEx -flag 
"1000" "[^a-b[c-d]e-f]" "c"
Pattern=<[^a-b[c-d]e-f]>
Input  =<c>
      1: <Complement> (7)
      2: <Union> (0)
      3: <Union> (0)
      4: <Range [a-b]> (0)
      5: <Range [c-d]> (0)
      6: <Range [e-f]> (0)
      7: <END> (0)
     -------------------------------
match:false

I know, most of people might not be interested, but if you have read 
this far, means you are
interested:-) and might have some opinions. It is definitely an 
incompatible change, given this
has been the behavior from the very beginning of Java regex and has been 
there for almost a
decade, I would appreciate any comment/opinion, especially if you agree 
that the existing
behavior is NOT "correct" and therefor we need to fix, OR you think the 
existing one is just fine
(the fact I only heard one complain the past 5 -6 years:-) so far), OR 
even the existing behavior
is not "ideal", but given the compatibility concern, we might just leave 
it alone.

The fix/change itself is relatively easy, as showed at

http://cr.openjdk.java.net/~sherman/pattern_cc/ 
<http://cr.openjdk.java.net/%7Esherman/pattern_cc/>

Thanks
-Sherman

[1] http://download.java.net/jdk7/docs/api/java/util/regex/Pattern.html#cc