RFR: Regex exponential backtracking issue --- more cleanup/tuning
Xueming Shen
xueming.shen at oracle.com
Thu Mar 10 06:55:26 UTC 2016
Now we have a issue id
issue: https://bugs.openjdk.java.net/browse/JDK-8151481
webrev: http://cr.openjdk.java.net/~sherman/8151481/webrev
thanks,
sherman
btw, the patter inside [...] no longer be "printable", they are
"function" now.
On 3/8/16 9:45 AM, Xueming Shen wrote:
> Hi,
>
> While waiting patiently for someone to help review the proposal for
> the exponential
> backtracking issue [1] I went ahead replacing those "CharProperty
> constant nodes" with
> the IntPredicate. We were hoping having closure back then when working
> on those
> CharProperty classes, which ended up with those make()/clone(). Now it
> might be the
> time to replace it with what we wanted at the beginning.
>
> http://cr.openjdk.java.net/~sherman/regexClosure/webrev.02/
>
> Here are the notes about the changes
>
> (1) pulled out the "broken" printNodeTree (for debugging) from the
> Pattern. This one does
> not work as expected for a while . I do have a working copy and
> have to put it in every
> time I need debug the engine. So now I replaced the printNoteTree
> with working one
> and putting it at a separate class j.u.regex.PrintPattern, which
> now can print out the
> clean and complete node tree of the pattern. For example,
>
> Pattern: [a-z0-9]+|ABCDEFG
> 0: <Start>
> 1: <Branch>
> 2: <CharPropertyGreedy +>
> 3: <Union>
> 4: <Range[a-z]>
> 5: <Range[0-9]>
> <-branch.separator->
> 6: <Slice "ABCDEFG">
> 7: </Branch>
> 8: <END>
>
> (2) the optimization for the greedy repetition of a "CharProperty",
> which parse the
> greedy repetition on a single "CharProperty", such as
> \p{IsGreek}+, or the most
> commonly used .* into a single/smooth loop node.
>
> from
>
> Pattern: \p{IsGreek}+
> 0: <Start>
> 1: <Curly GREEDY + >
> 2: <Script GREEK>
> </Curly>
> 3: <END>
>
> to
>
> Pattern: \p{IsGreek}+
> 0: <Start>
> 1: <CharPropertyGreedy Script GREEK+>
> 2: <END>
>
> The simple jmh benchmark [2] indicates it is about 50%+, especially
> for those no-match
> case.
>
> (3) the optimization for the "union" of various individual "char"
> inside a chracter class
> [...], usch as. [ABCDEF]. For a regex like [a-zABCDEF], now the
> engine generates
> the nodes like
>
> Pattern: [a-zABCDEF]
> 0: <Start>
> 1: <Union>
> 2: <Union>
> 3: <Union>
> 4: <Union>
> 5: <Union>
> 6: <Union>
> 7: <Range[a-z]>
> 8: <Bits [ A B C D E F]>
> 8: <Bits [ A B C D E F]>
> 8: <Bits [ A B C D E F]>
> 8: <Bits [ A B C D E F]>
> 8: <Bits [ A B C D E F]>
> 8: <Bits [ A B C D E F]>
> 9: <END>
>
> with the optimization it generate (which it should)
>
> Pattern: [a-zABCDEF]
> 0: <Start>
> 1: <Union>
> 2: <Range[a-z]>
> 3: <Bits [ A B C D E F]>
> 4: <END>
>
> The jmh benchmark [2] also indicates it is much faster, especially
> for those no-match
> case.
>
> (4) replaced those "constant" CharProperty nodes with IntPredicate
> (the main change :-)).
>
> if have time I might go further up to replace the
> ""CharProperty.isSatisfiedBy" with a function
> "predicate" directly ... in "class" case, we actually don't
> really need a "node", the only thing
> we really care about is the "predicate" and their combination,
> only one node for each "class".
>
> oh, there is another one
> (5) I just threw in the change for the "j.u.regex: Negated Character
> Classes" [3]
> (I can take it out, if anyone has trouble with this)
>
> Thanks,
> Sherman
>
> [1]
> http://mail.openjdk.java.net/pipermail/core-libs-dev/2016-March/039269.html
> [2] http://cr.openjdk.java.net/~sherman/regexClosure/MyBenchmark.java
> [3]
> http://mail.openjdk.java.net/pipermail/core-libs-dev/2011-June/006957.html
More information about the core-libs-dev
mailing list