From xueming.shen at oracle.com Wed Jun 8 11:47:45 2011 From: xueming.shen at oracle.com (Xueming Shen) Date: Wed, 08 Jun 2011 11:47:45 -0700 Subject: j.u.regex: Negated Character Classes In-Reply-To: <4DEF94E2.2050105@gmail.com> References: <4DE95850.6040208@oracle.com> <4DEF94E2.2050105@gmail.com> Message-ID: <4DEFC3D1.2060808@oracle.com> Hi Tim, Semantically I don't see too much difference between to consider [^ syntax as a "negated character set" or to separate the ^ out and treat it as an unary operator with the lowest precedence (only at the very beginning, of course). The issue here is whether or not to consider the nested class character (the character classes grouped by [...]) as a basic "element" of its enclosing class, equivalent to other elements, such as the literals", the predefined, the range and the union. While POSIX's bracket expression does not support nested/sub "bracket expression", it does define a set of so called "POSIX character classes" in form of [:...:] that can only be used inside the bracket expression, as a basic element, equivalent to other elements when nested/embedded in [...] or [^...]. So [a-z[:digit:] matches any of lowercase ascii or 0-9 digits and [^a-z[:digit"] matches any character that is NOT lowercase ascii AND 0-9 digits, in which the [:digit'] is an equivalence of 0-9, or [0-9] if nested bracket expression is supported. In Java, however the [0-9] (grouping), 0-9 (range) and \p{digit} are NOT treated as equivalent when inside [^..], but the same when inside [...]. My apology for the confusion. Yes, I meant to say "The design decision IS wrong":-) I agree that it appears not to be a "great" problem for most people, it might be OK just leave it alone (and document it somewhere), consider it has been the behavior for decade. But on the other side, it also makes the "compatibility" issue less severe:-) especially it's hard to "explain" the && case. -Sherman On 6/8/2011 8:27 AM, Tim Ellison wrote: > Hi Sherman, ok so I'll admit to reading through to the end of your note > and finding it interesting ;-) > > Some comments in-lined. > > On 03/Jun/2011 22:55, Xueming Shen wrote: >> I'm sure everybody understands what "negated character classes" [^...] >> in j.u.regex means. >> You would never have doubt about >> >> [^c] does NOT match "c" >> [^0-9] does NOT match "8" >> [^a-z] does NOT match "b" >> [^a-bc-d] does NOT match 'c" >> >> But how about >> >> does [^[c]] match "c"? >> does [^[0-9]] match "8"? >> does [^[a-z]] match "b"? >> does [^a-b[c-d]] match "c"? >> >> I was wrong on all of them when was asked first time and it took me >> a while to figure out what is going on behind it. Oh, btw, the answer >> is "yes" for all 4, yes, the >> >> [^[c]] matches "c" >> [^[0-9]] matches "8" >> [^[a-z]] matches "b". >> [^a-b[c-d]] matches "c" (while [^a-bc-d] does NOT match "c") > I would not have known the right answer to this quiz either; it seems > that the use of nested character sets is sufficiently rare that we've > not had to learn how these behave. > >> Another interesting sample is >> >> [^a-b[c-d]e-f] matches "c" but does NOT match "e" (so the "e-f" part after >> the nested character class [c-d] does back to "normal"). >> >> It appears the "negation" of the "outer" character class does not >> affect its nested character class, > I think the easiest way to explain the situation is not to consider the > negation separately, but that [^ is the syntax of a negated character > set. Having a normal character set inside a negated character set then > seems ok to me. > >> so [^X] is always opposite from >> [^[X]], "X" to be any character class. >> >> Same "strange" thing seems to be true for "intersection operation&&" >> as well, so both [a-d&&c-f] and [^a-d&&c-f] do NOT match "a". >> >> This does not sound correct, at least for me. > This case is, I agree, a gotcha which is hard to justify through the > syntax rather than the implementation. > >> The source code suggests that we are treating the nested/embedded >> [...] character class and the "intersection&&" specially, so >> >> [^[X] is interpreted as [^] union [X] >> [^X[Y]] is interpreted as [^X] union [Y] >> [^X[Y]Z] is interpreted as [^XZ] union [Y] >> [^X&&Y] is interpreted as [^X]&& Y >> >> What I meant "treating...specially" is that we do NOT do the same >> thing for other "embedded character classes", so while [^[a-z]] does >> match "c", [^\p{Lower}] actually does NOT match "c", which I would >> expect. >> >> The j.u.regex.Pattern APIs do NOT help. All the samples given for >> "Character classes"[1] section are "simple" negation, no "nested" >> sample is given. And neither "^" nor "[^...]" appear in the operator >> precedence table. The behaviors in other regex engines, such as Perl >> and POSIX, don't help, as "nested" character class is not supported >> there. >> >> I did check with the original author who wrote this part of the code. >> It appears the current implementation is indeed what he intended to >> do back then, so this behavior is NOT an implementation bug but by >> design. >> >> Personally I don't feel this design is not correct. > More mind tricks ;-) ? You think the design *is* wrong? > >> Ideally, I would assume the spec either specifies [^...] as a >> separate "group operator" to be the "complement" of [...], or "^" as >> the "negation operator" with the lowest precedence, such as (from >> lowest to highest) >> >> (1) Negation ^ (only at the beginning of the [...]) >> (2) Intersection&& >> (3) Range - >> (4) nested class [] > or as I suggested above > (5) negated class [^ ... ] > > and the understanding that nested classes do not 'inherit' the negation > property of their parent. > >> So >> [^X[Y]] would be the "complement" of [X[Y]] >> >> [^X[Y]Z] would be the "complement" of [X[Y]Z] >> >> [^X&&Y] would be the "complement" of [X&&Y] >> >> for example, if I dump the regex internal logic node tree for the sample >> regex >> [^a-b[c-d]e-f], the jdk7 and jdk8 results would look like >> >> /home/sherman/TL/regex$ >> /export/sherman/Workspace/jdk7/jdk/build/linux-i586/bin/java RegEx -flag >> "1000" "[^a-b[c-d]e-f]" "c" >> Pattern=<[^a-b[c-d]e-f]> >> Input = >> 1: (7) >> 2: (0) >> 3: (0) >> 4: (0) >> 5: (0) >> 6: (0) >> 7: (0) >> ------------------------------- >> match:true >> groupCount=0 >> >> /home/sherman/TL/regex$ >> /export/sherman/Workspace/jdk8/jdk/build/linux-i586/bin/java RegEx -flag >> "1000" "[^a-b[c-d]e-f]" "c" >> Pattern=<[^a-b[c-d]e-f]> >> Input = >> 1: (7) >> 2: (0) >> 3: (0) >> 4: (0) >> 5: (0) >> 6: (0) >> 7: (0) >> ------------------------------- >> match:false >> >> I know, most of people might not be interested, but if you have read >> this far, means you are interested:-) and might have some opinions. >> It is definitely an incompatible change, given this has been the >> behavior from the very beginning of Java regex and has been there for >> almost a decade, I would appreciate any comment/opinion, especially >> if you agree that the existing behavior is NOT "correct" and therefor >> we need to fix, OR you think the existing one is just fine (the fact >> I only heard one complain the past 5 -6 years:-) so far), OR even the >> existing behavior is not "ideal", but given the compatibility >> concern, we might just leave it alone. > Since it does not seem to be causing a great problem, I would be > inclined to document it and leave it alone. > > Should the future include subtractive character classes then writing, > e.g. [^a-z-[safe]] would seem more natural than [^a-z-[^safe]]. > > Regards, > Tim > >> The fix/change itself is relatively easy, as showed at >> >> http://cr.openjdk.java.net/~sherman/pattern_cc/ >> >> >> Thanks >> -Sherman >> >> [1] http://download.java.net/jdk7/docs/api/java/util/regex/Pattern.html#cc