RFR: JDK-6609854: Regex does not match correctly for negative nested character classes

Tue Feb 25 16:07:02 UTC 2014

Couple rounds of emails had been exchanged internally regarding this 
issue years
ago and a proposal had been briefly discussed at this list back to 2011 [1].

The only reason that hold us to just fix it is the "compatibility" 
concern, since the
described behavior has been there for decade.

It appears we can try it again, as if we are getting more support:-)

Thanks!
-Sherman

[1] 
http://mail.openjdk.java.net/pipermail/core-libs-dev/2011-June/006957.html

On 2/25/14 1:30 AM, Hong Dai Thanh wrote:
> *Pretext*
>
> The reference implementation has strange behavior when there are 
> nested character classes involved.
>
> This is described briefly in the bug report: 
> https://bugs.openjdk.java.net/browse/JDK-6609854
>
> And also brought up again with more details on stackoverflow.com 
> <http://stackoverflow.com>
>
> http://stackoverflow.com/questions/21934168/double-negation-of-regex-character-classes
>
> *Proposed patch*
>
> I have looked at the current implementation and I'd like to propose 
> the following patch.
>
> The patch is *not yet tested against the test cases in test package*.
>
> However, *the structure has been verified via reflection*.
>
> /(Commented lines are lines to be removed. Lines with trailing comment 
> are those to be added.)/
>
> In method private CharProperty clazz(boolean consume)
>
>                 case ']':
>                     firstInClass = false;
>                     if (prev != null) {
>                         if (consume)
>                             next();
>                         if (!include) { //
>                             prev = prev.complement(); //
>                         } //
>                         return prev;
>                     }
>                     break;
>                 default:
>                     firstInClass = false;
>                     break;
>             }
>             node = range(bits);
> //            if (include) {
>                 if (prev == null) {
>                     prev = node;
>                 } else {
>                     if (prev != node)
>                         prev = union(prev, node);
>                 }
> //            } else {
> //                if (prev == null) {
> //                    prev = node.complement();
> //                } else {
> //                    if (prev != node)
> //                        prev = setDifference(prev, node);
> //                }
> //            }
>             ch = peek();
>
> *Fix Explanation*
>
> In the current code, a nested character class that generates a single 
> Node (for example, [^[^5-6]], [^[^c]], [^[^456]]) will never reach the 
> check if (include) outside the switch statement. As a result, the 
> negation is not applied to the nested character class.
>
> It also causes redundancy with patterns such as [^45[^67]] or 
> [^4589[^67]], since the first character 4 triggers complement() on the 
> BitClass bits and every subsequence character (5, 8, 9) triggers 
> setDifference() on the BitClass bits and the complement of the same 
> BitClass bits.
>
> *Current structure*
>
> [^45[^67]]
> Start. Start unanchored match (minLength=1)
> Pattern.union (character class union). Match any character matched by 
> either character classes below:
>   Pattern.setDifference (character class subtraction). Match any 
> character matched by the 1st character class, but NOT the 2nd 
> character class:
>     CharProperty.complement (character class negation). Match any 
> character NOT matched by the following character class:
>       BitClass. Optimized character class with boolean[] to match 
> characters in Latin-1 (code point <= 255). Match the following 2 
> character(s):
>         [U+0034][U+0035]
>         45
>     BitClass. Optimized character class with boolean[] to match 
> characters in Latin-1 (code point <= 255). Match the following 2 
> character(s):
>       [U+0034][U+0035]
>       45
>   Pattern.setDifference (character class subtraction). Match any 
> character matched by the 1st character class, but NOT the 2nd 
> character class:
>     CharProperty.complement (character class negation). Match any 
> character NOT matched by the following character class:
>       BitClass. Optimized character class with boolean[] to match 
> characters in Latin-1 (code point <= 255). Match the following 2 
> character(s):
>         [U+0036][U+0037]
>         67
>     BitClass. Optimized character class with boolean[] to match 
> characters in Latin-1 (code point <= 255). Match the following 2 
> character(s):
>       [U+0036][U+0037]
>       67
> LastNode
> Node. Accept match
>
> *Structure after patch*
>
> [^45[^67]]
> Start. Start unanchored match (minLength=1)
> CharProperty.complement (character class negation). Match any 
> character NOT matched by the following character class:
>   Pattern.union (character class union). Match any character matched 
> by either character classes below:
>     BitClass. Optimized character class with boolean[] to match 
> characters in Latin-1 (code point <= 255). Match the following 2 
> character(s):
>       [U+0034][U+0035]
>       45
>     CharProperty.complement (character class negation). Match any 
> character NOT matched by the following character class:
>       BitClass. Optimized character class with boolean[] to match 
> characters in Latin-1 (code point <= 255). Match the following 2 
> character(s):
>         [U+0036][U+0037]
>         67
> LastNode
> Node. Accept match