RFR: Regex exponential backtracking issue --- more cleanup/tuning

Xueming Shen xueming.shen at oracle.com
Thu Mar 10 06:55:26 UTC 2016


Now we have a issue id

issue: https://bugs.openjdk.java.net/browse/JDK-8151481
webrev: http://cr.openjdk.java.net/~sherman/8151481/webrev

thanks,
sherman

btw, the patter inside [...] no longer be "printable", they are 
"function" now.

On 3/8/16 9:45 AM, Xueming Shen wrote:
> Hi,
>
> While waiting patiently for someone to help review the proposal for 
> the exponential
> backtracking issue [1] I went ahead replacing those "CharProperty 
> constant nodes" with
> the IntPredicate. We were hoping having closure back then when working 
> on those
> CharProperty classes, which ended up with those make()/clone(). Now it 
> might be the
> time to replace it with what we wanted at the beginning.
>
> http://cr.openjdk.java.net/~sherman/regexClosure/webrev.02/
>
> Here are the notes about the changes
>
> (1) pulled out the "broken" printNodeTree (for debugging) from the 
> Pattern. This one does
>      not work as expected for a while . I do have a working copy and 
> have to put it in every
>      time I need debug the engine. So now I replaced the printNoteTree 
> with working one
>      and putting it at a separate class j.u.regex.PrintPattern, which 
> now can print out the
>      clean and complete node tree of the pattern. For example,
>
>    Pattern: [a-z0-9]+|ABCDEFG
>      0:  <Start>
>      1:  <Branch>
>      2:    <CharPropertyGreedy +>
>      3:      <Union>
>      4:        <Range[a-z]>
>      5:        <Range[0-9]>
>          <-branch.separator->
>      6:    <Slice  "ABCDEFG">
>      7:  </Branch>
>      8:  <END>
>
> (2) the optimization for the greedy repetition of a "CharProperty", 
> which parse the
>      greedy repetition on a single "CharProperty", such as 
> \p{IsGreek}+, or the most
>      commonly used .* into a single/smooth loop node.
>
> from
>
>     Pattern: \p{IsGreek}+
>      0:  <Start>
>      1:  <Curly GREEDY  + >
>      2:    <Script GREEK>
>          </Curly>
>      3:  <END>
>
> to
>
>      Pattern: \p{IsGreek}+
>      0:  <Start>
>      1:  <CharPropertyGreedy Script GREEK+>
>      2:  <END>
>
>    The simple jmh benchmark [2] indicates it is about 50%+, especially 
> for those no-match
>    case.
>
> (3) the optimization for the "union" of various individual "char" 
> inside a chracter class
>      [...], usch as. [ABCDEF]. For a regex like [a-zABCDEF], now the 
> engine generates
>      the nodes like
>
>    Pattern: [a-zABCDEF]
>      0:  <Start>
>      1:  <Union>
>      2:    <Union>
>      3:      <Union>
>      4:        <Union>
>      5:          <Union>
>      6:            <Union>
>      7:              <Range[a-z]>
>      8:              <Bits [ A B C D E F]>
>      8:            <Bits [ A B C D E F]>
>      8:          <Bits [ A B C D E F]>
>      8:        <Bits [ A B C D E F]>
>      8:      <Bits [ A B C D E F]>
>      8:    <Bits [ A B C D E F]>
>      9:  <END>
>
> with the optimization it generate (which it should)
>
>    Pattern: [a-zABCDEF]
>      0:  <Start>
>      1:  <Union>
>      2:    <Range[a-z]>
>      3:    <Bits [ A B C D E F]>
>      4:  <END>
>
>    The jmh benchmark [2] also indicates it is much faster, especially 
> for those no-match
>    case.
>
> (4) replaced those "constant" CharProperty nodes with IntPredicate 
> (the main change :-)).
>
>      if have time I might go further up to replace the 
> ""CharProperty.isSatisfiedBy" with a function
>      "predicate" directly ... in "class" case, we actually don't 
> really need a "node", the only thing
>      we really care about is the "predicate" and their combination, 
> only one node for each "class".
>
> oh, there is another one
> (5) I just threw in the change for the "j.u.regex: Negated Character 
> Classes" [3]
>      (I can take it out, if anyone has trouble with this)
>
> Thanks,
> Sherman
>
> [1] 
> http://mail.openjdk.java.net/pipermail/core-libs-dev/2016-March/039269.html
> [2] http://cr.openjdk.java.net/~sherman/regexClosure/MyBenchmark.java
> [3] 
> http://mail.openjdk.java.net/pipermail/core-libs-dev/2011-June/006957.html




More information about the core-libs-dev mailing list