RFR: Regex exponential backtracking issue --- more cleanup/tuning

Fri Mar 18 17:03:50 UTC 2016

Hi Sherman,

A few comments:

src/java.base/share/classes/java/util/regex/CharPredicates.java:

- The static fields like ALPHABETIC should be final
- line: 139/140 the posix and uprops HashMaps could have an initializer 
for size to avoid resizing (12 and 18):
- 224: typo: "categoreis"
- 232: double space: "return  props.get"

- Remove debugging prints to "PrintPattern.pmap..." before commit
- 288: can you resort "Pf" and "Pi" to be with the other P<x> entries; 
maybe sort the whole list.

- 305: Is the definition of "C" missing the equivalent of "Cn" 
unassigned or is that an intentional difference?

-323:  "LD" the definition of LD is not in the reference  (is it 
specified somewhere?)
-333: the referenced spec defines the names of the character classes but 
not the contents; are they elsewhere?
-374: add "final" to static declarations

Pattern.java:
-2651: stray trailing "{"  after comment
-2746: indentation of arguments
-5493: javadoc the description of Predicate interface and implementations
-5637: if you want it to be the first initializer put it at the top of 
the file, not buried at the end.
    and make them final.

That's a start, I need to look a bit more closely at the algorithmic 
change that prompted the fix.

Thanks, Roger

On 3/10/2016 1:55 AM, Xueming Shen wrote:
> Now we have a issue id
>
> issue: https://bugs.openjdk.java.net/browse/JDK-8151481
> webrev: http://cr.openjdk.java.net/~sherman/8151481/webrev
>
> thanks,
> sherman
>
> btw, the patter inside [...] no longer be "printable", they are 
> "function" now.
>
> On 3/8/16 9:45 AM, Xueming Shen wrote:
>> Hi,
>>
>> While waiting patiently for someone to help review the proposal for 
>> the exponential
>> backtracking issue [1] I went ahead replacing those "CharProperty 
>> constant nodes" with
>> the IntPredicate. We were hoping having closure back then when 
>> working on those
>> CharProperty classes, which ended up with those make()/clone(). Now 
>> it might be the
>> time to replace it with what we wanted at the beginning.
>>
>> http://cr.openjdk.java.net/~sherman/regexClosure/webrev.02/
>>
>> Here are the notes about the changes
>>
>> (1) pulled out the "broken" printNodeTree (for debugging) from the 
>> Pattern. This one does
>>      not work as expected for a while . I do have a working copy and 
>> have to put it in every
>>      time I need debug the engine. So now I replaced the 
>> printNoteTree with working one
>>      and putting it at a separate class j.u.regex.PrintPattern, which 
>> now can print out the
>>      clean and complete node tree of the pattern. For example,
>>
>>    Pattern: [a-z0-9]+|ABCDEFG
>>      0:  <Start>
>>      1:  <Branch>
>>      2:    <CharPropertyGreedy +>
>>      3:      <Union>
>>      4:        <Range[a-z]>
>>      5:        <Range[0-9]>
>>          <-branch.separator->
>>      6:    <Slice  "ABCDEFG">
>>      7:  </Branch>
>>      8:  <END>
>>
>> (2) the optimization for the greedy repetition of a "CharProperty", 
>> which parse the
>>      greedy repetition on a single "CharProperty", such as 
>> \p{IsGreek}+, or the most
>>      commonly used .* into a single/smooth loop node.
>>
>> from
>>
>>     Pattern: \p{IsGreek}+
>>      0:  <Start>
>>      1:  <Curly GREEDY  + >
>>      2:    <Script GREEK>
>>          </Curly>
>>      3:  <END>
>>
>> to
>>
>>      Pattern: \p{IsGreek}+
>>      0:  <Start>
>>      1:  <CharPropertyGreedy Script GREEK+>
>>      2:  <END>
>>
>>    The simple jmh benchmark [2] indicates it is about 50%+, 
>> especially for those no-match
>>    case.
>>
>> (3) the optimization for the "union" of various individual "char" 
>> inside a chracter class
>>      [...], usch as. [ABCDEF]. For a regex like [a-zABCDEF], now the 
>> engine generates
>>      the nodes like
>>
>>    Pattern: [a-zABCDEF]
>>      0:  <Start>
>>      1:  <Union>
>>      2:    <Union>
>>      3:      <Union>
>>      4:        <Union>
>>      5:          <Union>
>>      6:            <Union>
>>      7:              <Range[a-z]>
>>      8:              <Bits [ A B C D E F]>
>>      8:            <Bits [ A B C D E F]>
>>      8:          <Bits [ A B C D E F]>
>>      8:        <Bits [ A B C D E F]>
>>      8:      <Bits [ A B C D E F]>
>>      8:    <Bits [ A B C D E F]>
>>      9:  <END>
>>
>> with the optimization it generate (which it should)
>>
>>    Pattern: [a-zABCDEF]
>>      0:  <Start>
>>      1:  <Union>
>>      2:    <Range[a-z]>
>>      3:    <Bits [ A B C D E F]>
>>      4:  <END>
>>
>>    The jmh benchmark [2] also indicates it is much faster, especially 
>> for those no-match
>>    case.
>>
>> (4) replaced those "constant" CharProperty nodes with IntPredicate 
>> (the main change :-)).
>>
>>      if have time I might go further up to replace the 
>> ""CharProperty.isSatisfiedBy" with a function
>>      "predicate" directly ... in "class" case, we actually don't 
>> really need a "node", the only thing
>>      we really care about is the "predicate" and their combination, 
>> only one node for each "class".
>>
>> oh, there is another one
>> (5) I just threw in the change for the "j.u.regex: Negated Character 
>> Classes" [3]
>>      (I can take it out, if anyone has trouble with this)
>>
>> Thanks,
>> Sherman
>>
>> [1] 
>> http://mail.openjdk.java.net/pipermail/core-libs-dev/2016-March/039269.html
>> [2] http://cr.openjdk.java.net/~sherman/regexClosure/MyBenchmark.java
>> [3] 
>> http://mail.openjdk.java.net/pipermail/core-libs-dev/2011-June/006957.html
>