j.u.regex: Negated Character Classes

Wed Jun 8 15:27:30 UTC 2011

Hi Sherman, ok so I'll admit to reading through to the end of your note
and finding it interesting ;-)

Some comments in-lined.

On 03/Jun/2011 22:55, Xueming Shen wrote:
> I'm sure everybody understands what "negated character classes" [^...]
> in j.u.regex means.
> You would never have doubt about
> 
> [^c] does NOT match "c"
> [^0-9] does NOT match "8"
> [^a-z] does NOT match "b"
> [^a-bc-d] does NOT match 'c"
> 
> But how about
> 
> does [^[c]] match "c"?
> does [^[0-9]] match "8"?
> does [^[a-z]] match "b"?
> does [^a-b[c-d]] match "c"?
> 
> I was wrong on all of them when was asked first time and it took me
> a while to figure out what is going on behind it. Oh, btw, the answer
> is "yes" for all 4, yes, the
> 
> [^[c]] matches "c"
> [^[0-9]] matches "8"
> [^[a-z]] matches "b".
> [^a-b[c-d]] matches "c"  (while [^a-bc-d] does NOT match "c")

I would not have known the right answer to this quiz either; it seems
that the use of nested character sets is sufficiently rare that we've
not had to learn how these behave.

> Another interesting sample is
> 
> [^a-b[c-d]e-f] matches "c" but does NOT match "e" (so the "e-f" part after
> the nested character class [c-d] does back to "normal").
> 
> It appears the "negation" of the "outer" character class does not
> affect its nested character class,

I think the easiest way to explain the situation is not to consider the
negation separately, but that [^ is the syntax of a negated character
set.  Having a normal character set inside a negated character set then
seems ok to me.

> so [^X] is always opposite from
> [^[X]], "X" to be any character class.
> 
> Same "strange" thing seems to be true for "intersection operation &&"
> as well, so both [a-d&&c-f] and [^a-d&&c-f] do NOT match "a".
> 
> This does not sound correct, at least for me.

This case is, I agree, a gotcha which is hard to justify through the
syntax rather than the implementation.

> The source code suggests that we are treating the nested/embedded
> [...] character class and the "intersection &&" specially, so
> 
> [^[X]  is interpreted as [^] union [X]
> [^X[Y]] is interpreted as [^X] union [Y]
> [^X[Y]Z] is interpreted as [^XZ] union [Y]
> [^X&&Y] is interpreted as [^X] && Y
> 
> What I meant "treating...specially" is that we do NOT do the same
> thing for other "embedded character classes", so while [^[a-z]] does
> match "c", [^\p{Lower}] actually does NOT match "c", which I would
> expect.
>
> The j.u.regex.Pattern APIs do NOT help. All the samples given for 
> "Character classes"[1] section are "simple" negation, no "nested"
> sample is given. And neither "^" nor "[^...]" appear in the operator
> precedence table. The behaviors in other regex engines, such as Perl
> and POSIX, don't help, as "nested" character class is not supported
> there.
> 
> I did check with the original author who wrote this part of the code.
> It appears the current implementation is indeed what he intended to
> do back then, so this behavior is NOT an implementation bug but by
> design.
> 
> Personally I don't feel this design is not correct.

More mind tricks ;-) ?  You think the design *is* wrong?

> Ideally, I would assume the spec either specifies [^...] as a
> separate "group operator" to be the "complement" of [...], or "^" as
> the "negation operator" with the lowest precedence, such as (from
> lowest to highest)
> 
> (1) Negation  ^        (only at the beginning of the [...])
> (2) Intersection &&
> (3) Range -
> (4) nested class []

or as I suggested above
  (5) negated class [^ ... ]

and the understanding that nested classes do not 'inherit' the negation
property of their parent.

> 
> So
> [^X[Y]] would be the "complement" of [X<union>[Y]]
> 
> [^X[Y]Z] would be the "complement" of  [X<union>[Y]<union>Z]
> 
> [^X&&Y] would be the "complement" of [X&&Y]
> 
> for example, if I dump the regex internal logic node tree for the sample
> regex
> [^a-b[c-d]e-f], the jdk7 and jdk8 results would look like
> 
> /home/sherman/TL/regex$
> /export/sherman/Workspace/jdk7/jdk/build/linux-i586/bin/java RegEx -flag
> "1000" "[^a-b[c-d]e-f]" "c"
> Pattern=<[^a-b[c-d]e-f]>
> Input  =<c>
>      1: <Difference> (7)
>      2: <Union> (0)
>      3: <Complement> (0)
>      4: <Range [a-b]> (0)
>      5: <Range [c-d]> (0)
>      6: <Range [e-f]> (0)
>      7: <END> (0)
>     -------------------------------
> match:true
>     groupCount=0
> 
> /home/sherman/TL/regex$
> /export/sherman/Workspace/jdk8/jdk/build/linux-i586/bin/java RegEx -flag
> "1000" "[^a-b[c-d]e-f]" "c"
> Pattern=<[^a-b[c-d]e-f]>
> Input  =<c>
>      1: <Complement> (7)
>      2: <Union> (0)
>      3: <Union> (0)
>      4: <Range [a-b]> (0)
>      5: <Range [c-d]> (0)
>      6: <Range [e-f]> (0)
>      7: <END> (0)
>     -------------------------------
> match:false
> 
> I know, most of people might not be interested, but if you have read 
> this far, means you are interested:-) and might have some opinions.
> It is definitely an incompatible change, given this has been the
> behavior from the very beginning of Java regex and has been there for
> almost a decade, I would appreciate any comment/opinion, especially
> if you agree that the existing behavior is NOT "correct" and therefor
> we need to fix, OR you think the existing one is just fine (the fact
> I only heard one complain the past 5 -6 years:-) so far), OR even the
> existing behavior is not "ideal", but given the compatibility
> concern, we might just leave it alone.

Since it does not seem to be causing a great problem, I would be
inclined to document it and leave it alone.

Should the future include subtractive character classes then writing,
e.g. [^a-z-[safe]] would seem more natural than [^a-z-[^safe]].

Regards,
Tim

> The fix/change itself is relatively easy, as showed at
> 
> http://cr.openjdk.java.net/~sherman/pattern_cc/
> <http://cr.openjdk.java.net/%7Esherman/pattern_cc/>
> 
> Thanks
> -Sherman
> 
> [1] http://download.java.net/jdk7/docs/api/java/util/regex/Pattern.html#cc