RFR: JDK-6609854: Regex does not match correctly for negative nested character classes
Hong Dai Thanh
nhahtdh at gmail.com
Tue Feb 25 09:30:28 UTC 2014
*Pretext*
The reference implementation has strange behavior when there are nested
character classes involved.
This is described briefly in the bug report:
https://bugs.openjdk.java.net/browse/JDK-6609854
And also brought up again with more details on stackoverflow.com
http://stackoverflow.com/questions/21934168/double-negation-of-regex-character-classes
*Proposed patch*
I have looked at the current implementation and I'd like to propose the
following patch.
The patch is *not yet tested against the test cases in test package*.
However, *the structure has been verified via reflection*.
*(Commented lines are lines to be removed. Lines with trailing comment are
those to be added.)*
In method private CharProperty clazz(boolean consume)
case ']':
firstInClass = false;
if (prev != null) {
if (consume)
next();
if (!include) { //
prev = prev.complement(); //
} //
return prev;
}
break;
default:
firstInClass = false;
break;
}
node = range(bits);
// if (include) {
if (prev == null) {
prev = node;
} else {
if (prev != node)
prev = union(prev, node);
}
// } else {
// if (prev == null) {
// prev = node.complement();
// } else {
// if (prev != node)
// prev = setDifference(prev, node);
// }
// }
ch = peek();
*Fix Explanation*
In the current code, a nested character class that generates a single Node
(for example, [^[^5-6]], [^[^c]], [^[^456]]) will never reach the check if
(include) outside the switch statement. As a result, the negation is not
applied to the nested character class.
It also causes redundancy with patterns such as [^45[^67]] or [^4589[^67]],
since the first character 4 triggers complement() on the BitClass bits and
every subsequence character (5, 8, 9) triggers setDifference() on the
BitClass bits and the complement of the same BitClass bits.
*Current structure*
[^45[^67]]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by
either character classes below:
Pattern.setDifference (character class subtraction). Match any character
matched by the 1st character class, but NOT the 2nd character class:
CharProperty.complement (character class negation). Match any character
NOT matched by the following character class:
BitClass. Optimized character class with boolean[] to match
characters in Latin-1 (code point <= 255). Match the following 2
character(s):
[U+0034][U+0035]
45
BitClass. Optimized character class with boolean[] to match characters
in Latin-1 (code point <= 255). Match the following 2 character(s):
[U+0034][U+0035]
45
Pattern.setDifference (character class subtraction). Match any character
matched by the 1st character class, but NOT the 2nd character class:
CharProperty.complement (character class negation). Match any character
NOT matched by the following character class:
BitClass. Optimized character class with boolean[] to match
characters in Latin-1 (code point <= 255). Match the following 2
character(s):
[U+0036][U+0037]
67
BitClass. Optimized character class with boolean[] to match characters
in Latin-1 (code point <= 255). Match the following 2 character(s):
[U+0036][U+0037]
67
LastNode
Node. Accept match
*Structure after patch*
[^45[^67]]
Start. Start unanchored match (minLength=1)
CharProperty.complement (character class negation). Match any character NOT
matched by the following character class:
Pattern.union (character class union). Match any character matched by
either character classes below:
BitClass. Optimized character class with boolean[] to match characters
in Latin-1 (code point <= 255). Match the following 2 character(s):
[U+0034][U+0035]
45
CharProperty.complement (character class negation). Match any character
NOT matched by the following character class:
BitClass. Optimized character class with boolean[] to match
characters in Latin-1 (code point <= 255). Match the following 2
character(s):
[U+0036][U+0037]
67
LastNode
Node. Accept match
More information about the core-libs-dev
mailing list