RFR: JDK-6609854: Regex does not match correctly for negative nested character classes

Hong Dai Thanh nhahtdh at gmail.com
Tue Feb 25 09:30:28 UTC 2014


*Pretext*

The reference implementation has strange behavior when there are nested
character classes involved.

This is described briefly in the bug report:
https://bugs.openjdk.java.net/browse/JDK-6609854

And also brought up again with more details on stackoverflow.com

http://stackoverflow.com/questions/21934168/double-negation-of-regex-character-classes

*Proposed patch*

I have looked at the current implementation and I'd like to propose the
following patch.

The patch is *not yet tested against the test cases in test package*.

However, *the structure has been verified via reflection*.

*(Commented lines are lines to be removed. Lines with trailing comment are
those to be added.)*

In method private CharProperty clazz(boolean consume)

                case ']':
                    firstInClass = false;
                    if (prev != null) {
                        if (consume)
                            next();
                        if (!include) { //
                            prev = prev.complement(); //
                        } //
                        return prev;
                    }
                    break;
                default:
                    firstInClass = false;
                    break;
            }
            node = range(bits);
//            if (include) {
                if (prev == null) {
                    prev = node;
                } else {
                    if (prev != node)
                        prev = union(prev, node);
                }
//            } else {
//                if (prev == null) {
//                    prev = node.complement();
//                } else {
//                    if (prev != node)
//                        prev = setDifference(prev, node);
//                }
//            }
            ch = peek();

*Fix Explanation*

In the current code, a nested character class that generates a single Node
(for example, [^[^5-6]], [^[^c]], [^[^456]]) will never reach the check if
(include) outside the switch statement. As a result, the negation is not
applied to the nested character class.

It also causes redundancy with patterns such as [^45[^67]] or [^4589[^67]],
since the first character 4 triggers complement() on the BitClass bits and
every subsequence character (5, 8, 9) triggers setDifference() on the
BitClass bits and the complement of the same BitClass bits.

*Current structure*

[^45[^67]]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by
either character classes below:
  Pattern.setDifference (character class subtraction). Match any character
matched by the 1st character class, but NOT the 2nd character class:
    CharProperty.complement (character class negation). Match any character
NOT matched by the following character class:
      BitClass. Optimized character class with boolean[] to match
characters in Latin-1 (code point <= 255). Match the following 2
character(s):
        [U+0034][U+0035]
        45
    BitClass. Optimized character class with boolean[] to match characters
in Latin-1 (code point <= 255). Match the following 2 character(s):
      [U+0034][U+0035]
      45
  Pattern.setDifference (character class subtraction). Match any character
matched by the 1st character class, but NOT the 2nd character class:
    CharProperty.complement (character class negation). Match any character
NOT matched by the following character class:
      BitClass. Optimized character class with boolean[] to match
characters in Latin-1 (code point <= 255). Match the following 2
character(s):
        [U+0036][U+0037]
        67
    BitClass. Optimized character class with boolean[] to match characters
in Latin-1 (code point <= 255). Match the following 2 character(s):
      [U+0036][U+0037]
      67
LastNode
Node. Accept match

*Structure after patch*

[^45[^67]]
Start. Start unanchored match (minLength=1)
CharProperty.complement (character class negation). Match any character NOT
matched by the following character class:
  Pattern.union (character class union). Match any character matched by
either character classes below:
    BitClass. Optimized character class with boolean[] to match characters
in Latin-1 (code point <= 255). Match the following 2 character(s):
      [U+0034][U+0035]
      45
    CharProperty.complement (character class negation). Match any character
NOT matched by the following character class:
      BitClass. Optimized character class with boolean[] to match
characters in Latin-1 (code point <= 255). Match the following 2
character(s):
        [U+0036][U+0037]
        67
LastNode
Node. Accept match



More information about the core-libs-dev mailing list