RFR: Regex exponential backtracking issue --- more cleanup/tuning

Xueming Shen xueming.shen at oracle.com
Tue Mar 8 17:45:41 UTC 2016


While waiting patiently for someone to help review the proposal for the 
backtracking issue [1] I went ahead replacing those "CharProperty 
constant nodes" with
the IntPredicate. We were hoping having closure back then when working 
on those
CharProperty classes, which ended up with those make()/clone(). Now it 
might be the
time to replace it with what we wanted at the beginning.


Here are the notes about the changes

(1) pulled out the "broken" printNodeTree (for debugging) from the 
Pattern. This one does
      not work as expected for a while . I do have a working copy and 
have to put it in every
      time I need debug the engine. So now I replaced the printNoteTree 
with working one
      and putting it at a separate class j.u.regex.PrintPattern, which 
now can print out the
      clean and complete node tree of the pattern. For example,

    Pattern: [a-z0-9]+|ABCDEFG
      0:  <Start>
      1:  <Branch>
      2:    <CharPropertyGreedy +>
      3:      <Union>
      4:        <Range[a-z]>
      5:        <Range[0-9]>
      6:    <Slice  "ABCDEFG">
      7:  </Branch>
      8:  <END>

(2) the optimization for the greedy repetition of a "CharProperty", 
which parse the
      greedy repetition on a single "CharProperty", such as 
\p{IsGreek}+, or the most
      commonly used .* into a single/smooth loop node.


     Pattern: \p{IsGreek}+
      0:  <Start>
      1:  <Curly GREEDY  + >
      2:    <Script GREEK>
      3:  <END>


      Pattern: \p{IsGreek}+
      0:  <Start>
      1:  <CharPropertyGreedy Script GREEK+>
      2:  <END>

    The simple jmh benchmark [2] indicates it is about 50%+, especially 
for those no-match

(3) the optimization for the "union" of various individual "char" inside 
a chracter class
      [...], usch as. [ABCDEF]. For a regex like [a-zABCDEF], now the 
engine generates
      the nodes like

    Pattern: [a-zABCDEF]
      0:  <Start>
      1:  <Union>
      2:    <Union>
      3:      <Union>
      4:        <Union>
      5:          <Union>
      6:            <Union>
      7:              <Range[a-z]>
      8:              <Bits [ A B C D E F]>
      8:            <Bits [ A B C D E F]>
      8:          <Bits [ A B C D E F]>
      8:        <Bits [ A B C D E F]>
      8:      <Bits [ A B C D E F]>
      8:    <Bits [ A B C D E F]>
      9:  <END>

with the optimization it generate (which it should)

    Pattern: [a-zABCDEF]
      0:  <Start>
      1:  <Union>
      2:    <Range[a-z]>
      3:    <Bits [ A B C D E F]>
      4:  <END>

    The jmh benchmark [2] also indicates it is much faster, especially 
for those no-match

(4) replaced those "constant" CharProperty nodes with IntPredicate (the 
main change :-)).

      if have time I might go further up to replace the 
""CharProperty.isSatisfiedBy" with a function
      "predicate" directly ... in "class" case, we actually don't really 
need a "node", the only thing
      we really care about is the "predicate" and their combination, 
only one node for each "class".

oh, there is another one
(5) I just threw in the change for the "j.u.regex: Negated Character 
Classes" [3]
      (I can take it out, if anyone has trouble with this)


[2] http://cr.openjdk.java.net/~sherman/regexClosure/MyBenchmark.java

More information about the core-libs-dev mailing list