j.u.regex: Negated Character Classes
Tim Ellison
t.p.ellison at gmail.com
Wed Jun 8 15:27:30 UTC 2011
Hi Sherman, ok so I'll admit to reading through to the end of your note
and finding it interesting ;-)
Some comments in-lined.
On 03/Jun/2011 22:55, Xueming Shen wrote:
> I'm sure everybody understands what "negated character classes" [^...]
> in j.u.regex means.
> You would never have doubt about
>
> [^c] does NOT match "c"
> [^0-9] does NOT match "8"
> [^a-z] does NOT match "b"
> [^a-bc-d] does NOT match 'c"
>
> But how about
>
> does [^[c]] match "c"?
> does [^[0-9]] match "8"?
> does [^[a-z]] match "b"?
> does [^a-b[c-d]] match "c"?
>
> I was wrong on all of them when was asked first time and it took me
> a while to figure out what is going on behind it. Oh, btw, the answer
> is "yes" for all 4, yes, the
>
> [^[c]] matches "c"
> [^[0-9]] matches "8"
> [^[a-z]] matches "b".
> [^a-b[c-d]] matches "c" (while [^a-bc-d] does NOT match "c")
I would not have known the right answer to this quiz either; it seems
that the use of nested character sets is sufficiently rare that we've
not had to learn how these behave.
> Another interesting sample is
>
> [^a-b[c-d]e-f] matches "c" but does NOT match "e" (so the "e-f" part after
> the nested character class [c-d] does back to "normal").
>
> It appears the "negation" of the "outer" character class does not
> affect its nested character class,
I think the easiest way to explain the situation is not to consider the
negation separately, but that [^ is the syntax of a negated character
set. Having a normal character set inside a negated character set then
seems ok to me.
> so [^X] is always opposite from
> [^[X]], "X" to be any character class.
>
> Same "strange" thing seems to be true for "intersection operation &&"
> as well, so both [a-d&&c-f] and [^a-d&&c-f] do NOT match "a".
>
> This does not sound correct, at least for me.
This case is, I agree, a gotcha which is hard to justify through the
syntax rather than the implementation.
> The source code suggests that we are treating the nested/embedded
> [...] character class and the "intersection &&" specially, so
>
> [^[X] is interpreted as [^] union [X]
> [^X[Y]] is interpreted as [^X] union [Y]
> [^X[Y]Z] is interpreted as [^XZ] union [Y]
> [^X&&Y] is interpreted as [^X] && Y
>
> What I meant "treating...specially" is that we do NOT do the same
> thing for other "embedded character classes", so while [^[a-z]] does
> match "c", [^\p{Lower}] actually does NOT match "c", which I would
> expect.
>
> The j.u.regex.Pattern APIs do NOT help. All the samples given for
> "Character classes"[1] section are "simple" negation, no "nested"
> sample is given. And neither "^" nor "[^...]" appear in the operator
> precedence table. The behaviors in other regex engines, such as Perl
> and POSIX, don't help, as "nested" character class is not supported
> there.
>
> I did check with the original author who wrote this part of the code.
> It appears the current implementation is indeed what he intended to
> do back then, so this behavior is NOT an implementation bug but by
> design.
>
> Personally I don't feel this design is not correct.
More mind tricks ;-) ? You think the design *is* wrong?
> Ideally, I would assume the spec either specifies [^...] as a
> separate "group operator" to be the "complement" of [...], or "^" as
> the "negation operator" with the lowest precedence, such as (from
> lowest to highest)
>
> (1) Negation ^ (only at the beginning of the [...])
> (2) Intersection &&
> (3) Range -
> (4) nested class []
or as I suggested above
(5) negated class [^ ... ]
and the understanding that nested classes do not 'inherit' the negation
property of their parent.
>
> So
> [^X[Y]] would be the "complement" of [X<union>[Y]]
>
> [^X[Y]Z] would be the "complement" of [X<union>[Y]<union>Z]
>
> [^X&&Y] would be the "complement" of [X&&Y]
>
> for example, if I dump the regex internal logic node tree for the sample
> regex
> [^a-b[c-d]e-f], the jdk7 and jdk8 results would look like
>
> /home/sherman/TL/regex$
> /export/sherman/Workspace/jdk7/jdk/build/linux-i586/bin/java RegEx -flag
> "1000" "[^a-b[c-d]e-f]" "c"
> Pattern=<[^a-b[c-d]e-f]>
> Input =<c>
> 1: <Difference> (7)
> 2: <Union> (0)
> 3: <Complement> (0)
> 4: <Range [a-b]> (0)
> 5: <Range [c-d]> (0)
> 6: <Range [e-f]> (0)
> 7: <END> (0)
> -------------------------------
> match:true
> groupCount=0
>
> /home/sherman/TL/regex$
> /export/sherman/Workspace/jdk8/jdk/build/linux-i586/bin/java RegEx -flag
> "1000" "[^a-b[c-d]e-f]" "c"
> Pattern=<[^a-b[c-d]e-f]>
> Input =<c>
> 1: <Complement> (7)
> 2: <Union> (0)
> 3: <Union> (0)
> 4: <Range [a-b]> (0)
> 5: <Range [c-d]> (0)
> 6: <Range [e-f]> (0)
> 7: <END> (0)
> -------------------------------
> match:false
>
> I know, most of people might not be interested, but if you have read
> this far, means you are interested:-) and might have some opinions.
> It is definitely an incompatible change, given this has been the
> behavior from the very beginning of Java regex and has been there for
> almost a decade, I would appreciate any comment/opinion, especially
> if you agree that the existing behavior is NOT "correct" and therefor
> we need to fix, OR you think the existing one is just fine (the fact
> I only heard one complain the past 5 -6 years:-) so far), OR even the
> existing behavior is not "ideal", but given the compatibility
> concern, we might just leave it alone.
Since it does not seem to be causing a great problem, I would be
inclined to document it and leave it alone.
Should the future include subtractive character classes then writing,
e.g. [^a-z-[safe]] would seem more natural than [^a-z-[^safe]].
Regards,
Tim
> The fix/change itself is relatively easy, as showed at
>
> http://cr.openjdk.java.net/~sherman/pattern_cc/
> <http://cr.openjdk.java.net/%7Esherman/pattern_cc/>
>
> Thanks
> -Sherman
>
> [1] http://download.java.net/jdk7/docs/api/java/util/regex/Pattern.html#cc
More information about the core-libs-dev
mailing list