RFR: Regex exponential backtracking issue

Fri Mar 4 06:43:27 UTC 2016

Hi,

webrev: http://cr.openjdk.java.net/~sherman/regexBacktracking/webrev/

Backtracking is the powerful and popular feature provided by Java regex 
engine
when the regex pattern contains optional quantifiers or alternations, in 
which the
regex engine can return to the previous saved state (backtracking) to 
continue
matching/finding other alternative when the current try fails. However 
it is also
a "well known" issue that some not-well-written regex might trigger 
"exponential
backtraking" in situation like there is no possible match can be found 
after exhaustive
search, in the worst case the regex engine will just keep going forever, 
hangup the
vm. We have various reports in the past as showed in RegexFun.java 
<http://cr.openjdk.java.net/%7Esherman/regexBacktracking/RegexFun.java> 
[1] for this
issue. For example following is the one reported in  JDK-6693451.

     regex:  "^(\\s*foo\\s*)*$"
     input:  "foo foo foo foo foo foo foo foo foo foo foo foo foo foo 
foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo 
foo foo foo foo foo foo foo fo",

In which the "group" (\\s*foo\\s*)*$" can alternatively match any one of 
the "foo",
"foo ", " foo" and " foo ". The regex works fine and runs normally if it 
can find a
match (replace the last "fo" in the input to "foo" for example). But in 
this case
because the input text ends with a "fo", the regex actually can never 
match the
input, it always fails at last one, nothing in the regex can match the 
"fo". Given the
nature of the "greedy quantifier", the engine will be 
backtracking/bouncing among
these 4 choices at each/every iteration/level and keep searching to the 
end of the
input exhaustively, fail, backtrack, and try again recursively. Based on 
the number
of "foo" in the input, it can run forever until the end of the world.

There are lots of suggestions on how to avoid this kind of runaway 
regex. One is to
use the possessive quantifier, if possible, to replace the greedy 
quantifier ( * -> *+
for example ) to workaround this kind of catastrophic backtracking, and 
it normally
meets the real need and makes the exponential backtracking go away.

The observation of these exponential backtracking cases indicates that 
most of these
"exponential backtracking" however are actually not necessary in most 
cases. Even we
can't solve all of these catastrophic backtracking, it might be 
possible, with a small
cost, we can actually workaround most of the "cases", in particular

(1) when the "exponential backtracking" happens at the top level group + 
greedy
quantifier, and
(2) there is no group ref in the pattern.

The "group + greedy repetition" in j.u.regex is implemented via three 
nodes, "Prolog +
Loop + body node", in wihch the body takes care of (\\s*foo\\s*), the 
Loop takes care
of the looping for "*", and the Prolog, as its name,  just a starter to 
kick of the looping
(read source for more details).

The normal matching sequence works as

     Prolog.match() ->  Prolog.loop.matchInit()

         -> body.match() -> loop.match()
         -> body.match() -> loop.match()
         -> body.match() -> loop.match()
         ...
         -> body.match() -> loop.match()  -> loop.next.match()

The "body.match() -> loop.match -> body.match() -> loop.match() ..." is 
looping on the
runtime call stack "recursively" (the body.next is pointing to loop).

If we take the "recursive looping" part that we are interested (we only 
care about the
"count >= cmin" situation, as it is where the exponential backtracking 
happens) out of
Loop.match(), the loop.match is as simple as

     boolean match(Matcher matcher, int i, CharSequence seq) {

---> Before
                     boolean b = body.match(matcher, i, seq);
                     // If match failed we must backtrack, so
                     // the loop count should NOT be incremented
                     if (b)
                         return true;
                     matcher.locals[countIndex] = count;
---> After
                     return next.match(matcher, i, seq);
     }

It appears that each of the repetitive "body.match() + loop.match()" 
looping (though actually
recursive on the runtime stack) actually is matching  the input text 
kind of "linearly" in the
order of loop counter, if this is a "top level" group repetition (not an 
inner group inside
another repetitive group)

         ... foo foo foo foo foo foo foo...
             |_ i = n,
          (body-loop)(body-loop)(body-loop).... ->next()

given the nature of its "greediness", the engine will exhausts all the 
possible match of
"body+loop+ loop.next" from any given starting position "i". We can 
actually assume that
if we have failed to match A "body + loop + its down-stream matching" 
last time at
position "i", next time if we come back here again (after backtracking 
the "up stream"
and come down here again), and if we are at exactly the same position 
"i", we no longer
need to try it again, the result will not change (failed). With the 
assumption that we DO
NOT have a group ref in the pattern (the result/content of the group ref 
can change, if
the down-stream contains the group ref, we can't assume the result going 
forward will
be the same, even start at the same position "i").

for example for the above sample, when we backtrack to loop counter "n", 
we can dance
among 4 choices

     "foo", " foo", "foo " and " foo ",

but when we pick one and more on to the next iteration/loop at n + 1, 
the only possible
choice is either "foo..." or " foo" (with a leading space character), if 
we have tried either of
them (or both) last time and it failed, we no longer need to try the 
same "down stream"
again . And the good thing is that this "position-lookup-table" can be 
shared by all
cmin <= n <= cmax iterations, again, because the nature of greediness. 
In last failed
try, the engine has tried/exhausted all possible combinations of

     body->loop->body->loop...body->loop->next

at "i", including the possilble exhaustive backtracking done by the 
embedded inner nodes
of this group. So we are here at the same position 'i" again, the result 
will be the same.

So here is my propoased workaround solution.

We introduce in a "hashmap" (home-made hashmap for "int", supposed to be 
cheap and faster)
to memorize all the tried-and-failed position "i", and check this 
hashmap before starting a new
round of loop.

     boolean match(Matcher matcher, int i, CharSequence seq) {

----------------
         if (posIndex != -1 &&
             matcher.localsPos[posIndex].contains(i)) {
             return next.match(matcher, i, seq);
         }
----------------
         boolean b = body.match(matcher, i, seq);
         // If match failed we must backtrack, so
         // the loop count should NOT be incremented
         if (b)
             return true;
         matcher.locals[countIndex] = count;

----------------
          if (posIndex != -1) {
              matcher.localsPos[posIndex].add(i);
          }
----------------
          return next.match(matcher, i, seq);
     }

This makes most of the exponential backtracking failures in 
RegexFun.java [1] go away,
except the two issues.

(1) the exponential backtracking starts/happens at the second inner 
group level, in
which the proposed change might help, but does not make it go away

     regex: "(h|h|ih(((i|a|c|c|a|i|i|j|b|a|i|b|a|a|j))+h)ahbfhba|c|i)*"
     input: 
"hchcchicihcchciiicichhcichcihcchiihichiciiiihhcchicchhcihchcihiihciichhccciccichcichiihcchcihhicchcciicchcccihiiihhihihihichicihhcciccchihhhcchichchciihiicihciihcccciciccicciiiiiiiiicihhhiiiihchccchchhhhiiihchihcccchhhiiiiiiiicicichicihcciciihichhhhchihciiihhiccccccciciihhichiccchhicchicihihccichicciihcichccihhiciccccccccichhhhihihhcchchihihiihhihihihicichihiiiihhhhihhhchhichiicihhiiiiihchccccchichci"

(2) it's a {n, m} greedy case. This one probably can be handled the same 
way as the */+
  but need a little more testing.

     regex: "(\\w|_){1,64}@"
     input: "______________________________"

There are another kind of "abusing" case that overly and repeatitively 
uses ".*" or ".+", such as
reported in jdk-5014450
http://cr.openjdk.java.net/~sherman/regexBacktracking/RegexFun3.java

     regex: 
"^\\s*sites\\[sites\\.length\\+\\+\\]\\s*=\\s*new\\s+Array\\(.*" +
            "\\s*//\\s*(\\d+).*" +
            "\\s*//\\s*([^-]+).*" +
            "\\s*//\\s*([^-]+).*" +
            "\\s*//\\s*([^-]+).*" +
            "/(?sui).*$"
     text   "\tsites[sites.length++] = new Array(\n" +
            "// 1079193366647 - creation time\n" +
            "// 1078902678663 1078852539723 1078753482632 0 0 0 0 0 0 0 
0 0 0 0 - creation time last 14 days\n" +
            "// 0 1 0 0 0 0 0 0 0 0 0 0 0 0 - bad\n" +
            "// 0.0030 0.0080 0.01 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 
0.0 0.0 -\n\n",

The performance can be much/much better if we can apply the similar 
optimization for top-level
dot-greedy-curly nodes as well, as showed at

http://cr.openjdk.java.net/~sherman/regexBacktracking/webrev2/

But I'm a little concerned the possibility that the extra checks might 
slowdown the normal case
and just wonder if it is worth the cost. So leave it out for now.

Another optimization included is for the "CharProperty + Curly" 
combination, normally this
is what you get for the for regex construct .*, \w* \s*. The new 
GreedyCharProperty node now
takes advantage of the fact that we are iterating char/codepoint by 
char/codepoint, the matching
implementation is more smooth and much faster.

Anyway, please help review and comment.

thanks,
Sherman

[1] http://cr.openjdk.java.net/~sherman/regexBacktracking/RegexFun.java