RFR: Regex exponential backtracking issue --- more cleanup/tuning

Tue Mar 8 17:45:41 UTC 2016

Hi,

While waiting patiently for someone to help review the proposal for the 
exponential
backtracking issue [1] I went ahead replacing those "CharProperty 
constant nodes" with
the IntPredicate. We were hoping having closure back then when working 
on those
CharProperty classes, which ended up with those make()/clone(). Now it 
might be the
time to replace it with what we wanted at the beginning.

http://cr.openjdk.java.net/~sherman/regexClosure/webrev.02/

Here are the notes about the changes

(1) pulled out the "broken" printNodeTree (for debugging) from the 
Pattern. This one does
      not work as expected for a while . I do have a working copy and 
have to put it in every
      time I need debug the engine. So now I replaced the printNoteTree 
with working one
      and putting it at a separate class j.u.regex.PrintPattern, which 
now can print out the
      clean and complete node tree of the pattern. For example,

    Pattern: [a-z0-9]+|ABCDEFG
      0:  <Start>
      1:  <Branch>
      2:    <CharPropertyGreedy +>
      3:      <Union>
      4:        <Range[a-z]>
      5:        <Range[0-9]>
          <-branch.separator->
      6:    <Slice  "ABCDEFG">
      7:  </Branch>
      8:  <END>

(2) the optimization for the greedy repetition of a "CharProperty", 
which parse the
      greedy repetition on a single "CharProperty", such as 
\p{IsGreek}+, or the most
      commonly used .* into a single/smooth loop node.

from

     Pattern: \p{IsGreek}+
      0:  <Start>
      1:  <Curly GREEDY  + >
      2:    <Script GREEK>
          </Curly>
      3:  <END>

to

      Pattern: \p{IsGreek}+
      0:  <Start>
      1:  <CharPropertyGreedy Script GREEK+>
      2:  <END>

    The simple jmh benchmark [2] indicates it is about 50%+, especially 
for those no-match
    case.

(3) the optimization for the "union" of various individual "char" inside 
a chracter class
      [...], usch as. [ABCDEF]. For a regex like [a-zABCDEF], now the 
engine generates
      the nodes like

    Pattern: [a-zABCDEF]
      0:  <Start>
      1:  <Union>
      2:    <Union>
      3:      <Union>
      4:        <Union>
      5:          <Union>
      6:            <Union>
      7:              <Range[a-z]>
      8:              <Bits [ A B C D E F]>
      8:            <Bits [ A B C D E F]>
      8:          <Bits [ A B C D E F]>
      8:        <Bits [ A B C D E F]>
      8:      <Bits [ A B C D E F]>
      8:    <Bits [ A B C D E F]>
      9:  <END>

with the optimization it generate (which it should)

    Pattern: [a-zABCDEF]
      0:  <Start>
      1:  <Union>
      2:    <Range[a-z]>
      3:    <Bits [ A B C D E F]>
      4:  <END>

    The jmh benchmark [2] also indicates it is much faster, especially 
for those no-match
    case.

(4) replaced those "constant" CharProperty nodes with IntPredicate (the 
main change :-)).

      if have time I might go further up to replace the 
""CharProperty.isSatisfiedBy" with a function
      "predicate" directly ... in "class" case, we actually don't really 
need a "node", the only thing
      we really care about is the "predicate" and their combination, 
only one node for each "class".

oh, there is another one
(5) I just threw in the change for the "j.u.regex: Negated Character 
Classes" [3]
      (I can take it out, if anyone has trouble with this)

Thanks,
Sherman

[1] 
http://mail.openjdk.java.net/pipermail/core-libs-dev/2016-March/039269.html
[2] http://cr.openjdk.java.net/~sherman/regexClosure/MyBenchmark.java
[3] 
http://mail.openjdk.java.net/pipermail/core-libs-dev/2011-June/006957.html