RFR: Regex exponential backtracking issue
Xueming Shen
xueming.shen at oracle.com
Fri Mar 4 06:43:27 UTC 2016
Hi,
webrev: http://cr.openjdk.java.net/~sherman/regexBacktracking/webrev/
Backtracking is the powerful and popular feature provided by Java regex
engine
when the regex pattern contains optional quantifiers or alternations, in
which the
regex engine can return to the previous saved state (backtracking) to
continue
matching/finding other alternative when the current try fails. However
it is also
a "well known" issue that some not-well-written regex might trigger
"exponential
backtraking" in situation like there is no possible match can be found
after exhaustive
search, in the worst case the regex engine will just keep going forever,
hangup the
vm. We have various reports in the past as showed in RegexFun.java
<http://cr.openjdk.java.net/%7Esherman/regexBacktracking/RegexFun.java>
[1] for this
issue. For example following is the one reported in JDK-6693451.
regex: "^(\\s*foo\\s*)*$"
input: "foo foo foo foo foo foo foo foo foo foo foo foo foo foo
foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo
foo foo foo foo foo foo foo fo",
In which the "group" (\\s*foo\\s*)*$" can alternatively match any one of
the "foo",
"foo ", " foo" and " foo ". The regex works fine and runs normally if it
can find a
match (replace the last "fo" in the input to "foo" for example). But in
this case
because the input text ends with a "fo", the regex actually can never
match the
input, it always fails at last one, nothing in the regex can match the
"fo". Given the
nature of the "greedy quantifier", the engine will be
backtracking/bouncing among
these 4 choices at each/every iteration/level and keep searching to the
end of the
input exhaustively, fail, backtrack, and try again recursively. Based on
the number
of "foo" in the input, it can run forever until the end of the world.
There are lots of suggestions on how to avoid this kind of runaway
regex. One is to
use the possessive quantifier, if possible, to replace the greedy
quantifier ( * -> *+
for example ) to workaround this kind of catastrophic backtracking, and
it normally
meets the real need and makes the exponential backtracking go away.
The observation of these exponential backtracking cases indicates that
most of these
"exponential backtracking" however are actually not necessary in most
cases. Even we
can't solve all of these catastrophic backtracking, it might be
possible, with a small
cost, we can actually workaround most of the "cases", in particular
(1) when the "exponential backtracking" happens at the top level group +
greedy
quantifier, and
(2) there is no group ref in the pattern.
The "group + greedy repetition" in j.u.regex is implemented via three
nodes, "Prolog +
Loop + body node", in wihch the body takes care of (\\s*foo\\s*), the
Loop takes care
of the looping for "*", and the Prolog, as its name, just a starter to
kick of the looping
(read source for more details).
The normal matching sequence works as
Prolog.match() -> Prolog.loop.matchInit()
-> body.match() -> loop.match()
-> body.match() -> loop.match()
-> body.match() -> loop.match()
...
-> body.match() -> loop.match() -> loop.next.match()
The "body.match() -> loop.match -> body.match() -> loop.match() ..." is
looping on the
runtime call stack "recursively" (the body.next is pointing to loop).
If we take the "recursive looping" part that we are interested (we only
care about the
"count >= cmin" situation, as it is where the exponential backtracking
happens) out of
Loop.match(), the loop.match is as simple as
boolean match(Matcher matcher, int i, CharSequence seq) {
---> Before
boolean b = body.match(matcher, i, seq);
// If match failed we must backtrack, so
// the loop count should NOT be incremented
if (b)
return true;
matcher.locals[countIndex] = count;
---> After
return next.match(matcher, i, seq);
}
It appears that each of the repetitive "body.match() + loop.match()"
looping (though actually
recursive on the runtime stack) actually is matching the input text
kind of "linearly" in the
order of loop counter, if this is a "top level" group repetition (not an
inner group inside
another repetitive group)
... foo foo foo foo foo foo foo...
|_ i = n,
(body-loop)(body-loop)(body-loop).... ->next()
given the nature of its "greediness", the engine will exhausts all the
possible match of
"body+loop+ loop.next" from any given starting position "i". We can
actually assume that
if we have failed to match A "body + loop + its down-stream matching"
last time at
position "i", next time if we come back here again (after backtracking
the "up stream"
and come down here again), and if we are at exactly the same position
"i", we no longer
need to try it again, the result will not change (failed). With the
assumption that we DO
NOT have a group ref in the pattern (the result/content of the group ref
can change, if
the down-stream contains the group ref, we can't assume the result going
forward will
be the same, even start at the same position "i").
for example for the above sample, when we backtrack to loop counter "n",
we can dance
among 4 choices
"foo", " foo", "foo " and " foo ",
but when we pick one and more on to the next iteration/loop at n + 1,
the only possible
choice is either "foo..." or " foo" (with a leading space character), if
we have tried either of
them (or both) last time and it failed, we no longer need to try the
same "down stream"
again . And the good thing is that this "position-lookup-table" can be
shared by all
cmin <= n <= cmax iterations, again, because the nature of greediness.
In last failed
try, the engine has tried/exhausted all possible combinations of
body->loop->body->loop...body->loop->next
at "i", including the possilble exhaustive backtracking done by the
embedded inner nodes
of this group. So we are here at the same position 'i" again, the result
will be the same.
So here is my propoased workaround solution.
We introduce in a "hashmap" (home-made hashmap for "int", supposed to be
cheap and faster)
to memorize all the tried-and-failed position "i", and check this
hashmap before starting a new
round of loop.
boolean match(Matcher matcher, int i, CharSequence seq) {
----------------
if (posIndex != -1 &&
matcher.localsPos[posIndex].contains(i)) {
return next.match(matcher, i, seq);
}
----------------
boolean b = body.match(matcher, i, seq);
// If match failed we must backtrack, so
// the loop count should NOT be incremented
if (b)
return true;
matcher.locals[countIndex] = count;
----------------
if (posIndex != -1) {
matcher.localsPos[posIndex].add(i);
}
----------------
return next.match(matcher, i, seq);
}
This makes most of the exponential backtracking failures in
RegexFun.java [1] go away,
except the two issues.
(1) the exponential backtracking starts/happens at the second inner
group level, in
which the proposed change might help, but does not make it go away
regex: "(h|h|ih(((i|a|c|c|a|i|i|j|b|a|i|b|a|a|j))+h)ahbfhba|c|i)*"
input:
"hchcchicihcchciiicichhcichcihcchiihichiciiiihhcchicchhcihchcihiihciichhccciccichcichiihcchcihhicchcciicchcccihiiihhihihihichicihhcciccchihhhcchichchciihiicihciihcccciciccicciiiiiiiiicihhhiiiihchccchchhhhiiihchihcccchhhiiiiiiiicicichicihcciciihichhhhchihciiihhiccccccciciihhichiccchhicchicihihccichicciihcichccihhiciccccccccichhhhihihhcchchihihiihhihihihicichihiiiihhhhihhhchhichiicihhiiiiihchccccchichci"
(2) it's a {n, m} greedy case. This one probably can be handled the same
way as the */+
but need a little more testing.
regex: "(\\w|_){1,64}@"
input: "______________________________"
There are another kind of "abusing" case that overly and repeatitively
uses ".*" or ".+", such as
reported in jdk-5014450
http://cr.openjdk.java.net/~sherman/regexBacktracking/RegexFun3.java
regex:
"^\\s*sites\\[sites\\.length\\+\\+\\]\\s*=\\s*new\\s+Array\\(.*" +
"\\s*//\\s*(\\d+).*" +
"\\s*//\\s*([^-]+).*" +
"\\s*//\\s*([^-]+).*" +
"\\s*//\\s*([^-]+).*" +
"/(?sui).*$"
text "\tsites[sites.length++] = new Array(\n" +
"// 1079193366647 - creation time\n" +
"// 1078902678663 1078852539723 1078753482632 0 0 0 0 0 0 0
0 0 0 0 - creation time last 14 days\n" +
"// 0 1 0 0 0 0 0 0 0 0 0 0 0 0 - bad\n" +
"// 0.0030 0.0080 0.01 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 -\n\n",
The performance can be much/much better if we can apply the similar
optimization for top-level
dot-greedy-curly nodes as well, as showed at
http://cr.openjdk.java.net/~sherman/regexBacktracking/webrev2/
But I'm a little concerned the possibility that the extra checks might
slowdown the normal case
and just wonder if it is worth the cost. So leave it out for now.
Another optimization included is for the "CharProperty + Curly"
combination, normally this
is what you get for the for regex construct .*, \w* \s*. The new
GreedyCharProperty node now
takes advantage of the fact that we are iterating char/codepoint by
char/codepoint, the matching
implementation is more smooth and much faster.
Anyway, please help review and comment.
thanks,
Sherman
[1] http://cr.openjdk.java.net/~sherman/regexBacktracking/RegexFun.java
More information about the core-libs-dev
mailing list