JDK-8215626 : Correct [^..&&..] intersection negation behaviour JDK8 vs JDK11 ??

Andrew Leonard andrew_m_leonard at uk.ibm.com
Mon Jan 7 13:50:34 UTC 2019


Does anyone have a view on which "regex" behaviour is correct, JDK8 or JDK11?
Thanks
Andrew

Andrew Leonard
Java Runtimes Development
IBM Hursley
IBM United Kingdom Ltd
Phone internal: 245913, external: 01962 815913
internet email: andrew_m_leonard at uk.ibm.com 




From:   Andrew Leonard/UK/IBM
To:     "OpenJDK Core Libs Developers" <core-libs-dev at openjdk.java.net>
Date:   03/01/2019 11:20
Subject:        JDK-8215626 : Correct [^..&&..] intersection negation 
behaviour JDK8 vs JDK11 ??


Hi,
I'm currently investigating bug JDK-8215626 and have found that the 
problem lies in how Pattern interprets the [^..&&..] negation when it is 
applied to "intersected" expressions. I have reduced the bug report to a 
more extreme and obvious example:
    Input string: "1234 ABCDEFG !$%^& abcdefg"
    pattern RegEx: "[^[A-B]&&[^ef]]"
    Operation: pattern.matcher(input).replaceAll("");

JDK8 output: 
      1234 CDEFG !$%^& abcdefg
JDK11 output:
      AB
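
For anyone who wants to try this locally, here is a minimal reproducer sketch. The class and method names are mine, not from the bug report; the input, pattern and operation are exactly the ones above. It prints the running Java version next to the result, since the result depends on which JDK runs it.

```java
import java.util.regex.Pattern;

public class IntersectionNegationRepro {
    // Input and pattern taken verbatim from the example above.
    static final String INPUT = "1234 ABCDEFG !$%^& abcdefg";
    static final String REGEX = "[^[A-B]&&[^ef]]";

    // Removes every character of the input that the character class matches.
    static String strip(String regex, String input) {
        return Pattern.compile(regex).matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("output       = " + strip(REGEX, INPUT));
    }
}
```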

So, from the "spec":
A character class is a set of characters enclosed within square brackets. 
It specifies the characters that will successfully match a single 
character from a given input string.
Intersection:
To create a single character class matching only the characters common to 
all of its nested classes, use &&, as in [0-9&&[345]].
Negation:
To match all characters except those listed, insert the "^" metacharacter 
at the beginning of the character class.
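
As a quick sanity check of the two quoted rules taken separately, something like the sketch below can be used. The class and helper names are hypothetical; [0-9&&[345]] is the spec's own example and [^abc] is a plain negation I picked for illustration. Neither case combines "^" with "&&", so I would not expect these to differ between JDK8 and JDK11.

```java
import java.util.regex.Pattern;

public class SpecExamples {
    // True if the whole string s matches the given character class.
    static boolean matchesOne(String regex, String s) {
        return Pattern.matches(regex, s);
    }

    public static void main(String[] args) {
        // Intersection: only characters common to [0-9] and [345] match.
        System.out.println(matchesOne("[0-9&&[345]]", "4")); // true
        System.out.println(matchesOne("[0-9&&[345]]", "7")); // false
        // Negation: everything except the listed characters matches.
        System.out.println(matchesOne("[^abc]", "d"));       // true
        System.out.println(matchesOne("[^abc]", "a"));       // false
    }
}
```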

The way I read the "spec", the "^" negation negates the whole character 
class within the outer square brackets. Thus, in this example:
    "[^[A-B]&&[^ef]]" is equivalent to the negation of "[[A-B]&&[^ef]]",
    i.e. the negation of the intersection of the chars A,B with everything 
    other than e,f,
    which is thus the negation of A,B.
    Hence the operation above will remove any character in the input 
    string other than A,B.
Hence JDK11, in my opinion, meets the "spec". It looks as though JDK8 is 
applying the ^ negation to just [A-B] and then intersecting it with [^ef], 
which to me is the wrong interpretation of the "spec".
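
To make the two readings concrete without depending on any particular JDK's Pattern implementation, here is a sketch that models each one as a plain set predicate and applies the same "remove everything that matches" operation. All names are hypothetical and this is only a model of the two readings, not of what either JDK does internally.

```java
import java.util.function.IntPredicate;

public class NegationReadings {
    static final String INPUT = "1234 ABCDEFG !$%^& abcdefg";

    // The two nested classes from the pattern, as plain predicates.
    static final IntPredicate A_TO_B = c -> c >= 'A' && c <= 'B';
    static final IntPredicate NOT_EF = c -> c != 'e' && c != 'f';

    // Reading 1: "^" negates the whole intersection, NOT([A-B] && [^ef]).
    static final IntPredicate WHOLE_CLASS = A_TO_B.and(NOT_EF).negate();

    // Reading 2: "^" binds to [A-B] only, (NOT [A-B]) && [^ef].
    static final IntPredicate FIRST_OPERAND_ONLY = A_TO_B.negate().and(NOT_EF);

    // Equivalent of replaceAll(""): keep only the characters that do NOT match.
    static String removeMatching(String input, IntPredicate matches) {
        StringBuilder kept = new StringBuilder();
        input.chars().filter(matches.negate()).forEach(kept::appendCodePoint);
        return kept.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeMatching(INPUT, WHOLE_CLASS));        // AB
        System.out.println(removeMatching(INPUT, FIRST_OPERAND_ONLY)); // ABef
    }
}
```

Reading 1 gives "AB", the same as the JDK11 output. A literal model of reading 2 gives "ABef", which is not the JDK8 output quoted earlier, so the JDK8 behaviour may need a different explanation.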

Your thoughts please?

If JDK11 is correct and JDK8 is wrong, the next question is whether we 
fix JDK8, as there are obviously potential "behavioural" impacts on 
existing applications.

Thanks
Andrew

Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU





