RFR: Regex canonical equivalents

Fri Mar 18 05:18:40 UTC 2016

Hi,

While still waiting patiently for the review for

RFR: Regex exponential backtracking issue [1]
RFR: Regex exponential backtracking issue --- more cleanup/tuning [2]

Here is the third round of change to address the "broken canonical 
equivalent support"
issue listed in JEP-111 [3] .

Canonical Equivalents:

Unicode Standard defines "canonical equivalents" in its tr18# [4]. The 
"canonical equivalents"
related item has been updated lot of times. It appears the standard 
itself has kinda given up
on defining a "general approach" for the regex engine :-) with the 
wordings "In practice, regex
APIs are not setup to match parts of .... it is feasible, however, to 
construct patterns that will
match against NFD ..." The "canonical equivalents" is just too 
complicated and too many
edge cases for a "normal" regex engine and its APIs.

We probably should have optioned to not support it if had the choice 
again. Unfortunately
Java regex has spec-ed to support such "canonical equivalents" from the 
very beginning
and has been "supporting" this feature with something that least works 
reasonably in some
common cases for releases.

The current regex CANON_EQ support is kinda of broken, especially the 
implementation for
the character class construct. It simply does not work as expected, as 
reported in various
issues listed below.

https://bugs.openjdk.java.net/browse/JDK-4916384
https://bugs.openjdk.java.net/browse/JDK-4867170
https://bugs.openjdk.java.net/browse/JDK-6995635
https://bugs.openjdk.java.net/browse/JDK-6728861
https://bugs.openjdk.java.net/browse/JDK-6736245
https://bugs.openjdk.java.net/browse/JDK-7080302

How does the java regex "work" now when the CANON_EQ flag is set? the 
current
implementation first

(1) converts the whole pattern into Normalizer.Form.NFD form, for 
example for
the greek extended character "\u1f80", we convert it to its nfd form as

\u1f80 -> \u03b1\u0313\u0345

which has a base character \u03b1 (small alpha) followed by two 
non-spacing_mark
characters \u0313 (combining comma above) and \u0345 (greek iota below)

(2) then we generate all the possible "permutations" of the characters 
inside the nfd
string (based on the unicode nfd/nfc normalization rules, the base 
character stays
where it is, those non-space-mark characters can be in any order for NOT 
normalized
text), which includes the possible new "combination" of individual each 
characters.

\u3b1\u313\u345
\u3b1\u345\u313
\u1f00\u345 (new combination \u3b1\u0313 -> \u1f00)
\u1fb3\u313 (new combination \u3b1\u0345 -> \u1fb3)
\u1f80

(3) finally a pure group is constructed with the permutations, which 
will match
any canonical equivalences, to replace the original single \u1f80

(?:\u1f80|\u3b1\u313\u345|\u1f00\u345|\u3b1\u345\u313|\u1fb3\u313)

The resulting pure group, though looks tedious, can match all the canonical
equivalences (are literally listed as the "alternation" inside the pure 
group construct)
of the greek character \u1f80, especially in "literal" case (slice of 
characters). For
example, pattern "A\u1f80B" matches successfully for input like

"A\u3b1\u313\u345B"
"A\u3b1\u345\u313B"
"A\u1f00\u345B"
"A\u1fb3\u313B"
"A\u1f80B"

And it works fine even you put it inside a character class construct 
[...] The pattern
"[\u1f80]" can successfully find its corresponding canonical 
equivalences from the
above input strings.

But it starts to fail apart when you try a little more "complicated" 
character class, for
example, the negation [^\u1f80A] does not match A but matches \u1f80, 
and match
all of its canonical equivalences.

Range [\u1f80-\u1f82] matches \u1f80, \u1f82 and their canonical 
equivalences, as
expected but it doesn't match \u1f81 (and its canonical equivalences), 
as people would
normally expected (as \u1f81 is perfectly inside that range) .

The reason behind these failures is because the current implementation 
converts/
normalizes the "character class" pattern the same way as it does for the 
"slice of
characters" pattern. Take a look at the "normalized" pattern from the 
character class
samples,

[^\u1f80A]
--> 
(?:[^A]|\u3b1\u313\u345|\u1f00\u345|\u1f80|\u3b1\u345\u313|\u1fb3\u313|\u1f80)

[\u1f80-\u1f82]
--> 
(?:[-]|\u3b1\u313\u345|\u1f00\u345|\u1f80|\u3b1\u345\u313|\u1fb3\u313|\u1f80|\u3b1\u313\u300\u345|...)

The current implementation simply takes those "composed" characters out 
of the
character class body, converts them into alternations (as it does for 
the slice)
and append them at the end of the character class construct. It just 
does not work,
the resulting pattern does not really match what the original regex 
pattern means
to be.

[^\u1f80A]  -> any character neither is 'A' nor \u1f8a", including their 
canonical
                         equivalences.
[[\u1f80-\u1f82] ]
                   -> any character (including their canonical 
equivalences) with the range
                        of \u1f80 andu1f82

The proposed changes here are

(1) instead of normalizing everything of the pattern into nfd, 
normalizing the character
class part into nfc, as the "character class" really needs to match a 
"character" or the
"canonical equivalence" of this "character (composed). It can be 
interpreted as to match
a "grapheme" that can be normalized into a "nfc" that matches this 
"character". For
example if you have a  luster of \u03b1\u0314\u0345" inside a character 
class, it is
reasonable to assume you really mean to match character \u1f80 and/its 
canonical
equivalences, when the CANON_EQ when the flag is on.

(2) instead of trying to generate the permutations, create the 
alternation and then put
it appropriately into the character class (logically), we now use a 
special "Node", the
NFCCharProperty to do the matching work. The NFCCharProperty tries to 
match a
grapheme cluster at a time (nfc greedly, then backtrack) against the 
character class.

For example for character class [\u1f80] or [\u03b1\u0313\u0345] the 
resulting
"normalized "pattern is [\u1f80]

and for the negation the "normalized" pattern is a normal [^\u1f80] for 
both [^\u1f80]
and [^\u03b1\u0313\u0345]

When matching,  we try to match the input against the targeting 
character class
grapheme by grapheme. The engine picks a grapheme cluster first, 
normalize it
to its NFC form, and then match it against the character class.

For example,

input text:  "ab\u1f81cd"
                   "ab\u03b1\u0314\u0345cd"
                   "ab\u03b1\u0345\u0314cd"
                   "ab\u1f01\u0345cd"
                   "ab\u1fb3\u0314cd"

character class: [\u1f80-\u1f82],

The engine finds the grapheme cluster "\u03b1\u0314\u0345" (or any of 
the other combination)
first, normalize it into \u1f81, match it against the range 
\u1f80-\u1f82, and move on.

In case there is(are) extra non-spacing-mark character in the grapheme 
cluster, for example

"ab\u03b1\u0314\u0345\u0313cd"    (there is any extra \u0313")

The "find" will still have a "find" at \u03b1\u0314\u0345. But the input 
will not "match"
"ab[\u1f80-\u1f82]cd" (because there is extra), but it will "match" 
"ab[\u1f80-\u1f82]\u0313cd".

(as an implementation detail, the engine will first try 
"\u03b1\u0314\u0345\u0313",
which ends up with a nfc string"\u1f81\u0313", can NOT match the single 
character, it backtracks
to "\u03b1\u0314\u0345" ...)

The approach seems to work and fix most of the troubles we have in 
character class.

Here is the webrev (on top of the changes for JDK-8151481 [1])

http://cr.openjdk.java.net/~sherman/regexCE/webrev/

While the character class CANON_EQ support is the main issue we want to 
address
this time, there are also couple other problems reported in the past, 
they are fixed
together here.

(1) the current implement only supports base+non_spacing_marks CANON_EQ, 
the
the canonical equivalence of decomposed hangul (jamos) and composed hangl
syllables is NOT supported, for example "\u1100\u1161" vs "\uac00".

fixed.

(2) Character in Unicode composition exclusion table does not match 
itself, as
reported in JDK-6736245. (special composition sample,
nfd(\u2adc) -> \u2add\u0338
nfc(\u2add\u0338) -> \u2add\u0338 (NOT back to \u2adc)

fixed.

(3) regex compiling syntax error/exception when compile certain regex, 
for example
"(\u00e9)" triggers

Exception in thread "main" java.util.regex.PatternSyntaxException: 
Unmatched closing ')' near index 11
((?:eÌ)|Ã©)|e)Ì)
^
fixed.

(4) the canonical equivalence does not work for the property class
"\\p{IsGreek}" matches "\u1f80"
"\\p{IsGreek}" but does match "\u1f00\u0345\u0300"

works as expected now.

thanks,
Sherman

[1] 
http://mail.openjdk.java.net/pipermail/core-libs-dev/2016-March/039269.html
[2] 
http://mail.openjdk.java.net/pipermail/core-libs-dev/2016-March/039404.html
[3] http://openjdk.java.net/jeps/111
[4] http://unicode.org/reports/tr18/#Canonical_Equivalents