<i18n dev> RFR: 8225061: Performance regression in Regex

Sat Jun 1 00:13:35 UTC 2019

Hi,

recent Unicode 12.1 updates caused a noticeable regression to Mac OS X
build times.

Quoting Naoto:
"The regression was caused by the call to Grapheme.nextBoundary() in
NFCCharProperty.match() method, which got slower with the fix to
JDK-8221431 / JDK-8222978 (Unicode 12.1 / Grapheme 12.0 support). The
purpose of issuing nextBoundary() is to detect whether to call (much
heavy weight) Normalizer.normalize() call or not. Since this fast check
does not require fully fledged boundary detection, including stateful
segmentation check such as Emoji sequence, simply checking the break
possibility between two code points as before should suffice. Suggested
fix is to bring back the isBoundary(cp1, cp2) method from the previous
revision in Grapheme.java, and issue it only from
NFCCharProperty.match() method for the fast check."

Bug:    https://bugs.openjdk.java.net/browse/JDK-8225061
Webrev: http://cr.openjdk.java.net/~redestad/8225061/open.01/

While narrowing this down, I created a couple of microbenchmarks and
experimented with a sequence of optimizations that brought the cost of
the heavier nextBoundary() check down from about 300x to about 2x that
of the code before JDK-8221431. Reverting to isBoundary bypasses these
improvements in some micros, but they still help a lot in other cases
that have taken a hit from making the grapheme logic more
complete/correct, so I'd like to leave them in.

Testing: tier1-3, verified a 300x speedup in the complex
Pattern.CANON_EQ micro, and a 2x speedup on the simpler Grapheme/\\b{g}
micro.
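
For anyone who wants to poke at the two code paths without the micros,
here is a minimal sketch using only the public java.util.regex API. The
class and method names (CanonEqSketch, canonMatches, graphemeCount) are
made up for illustration, and this is not the benchmark code from the
webrev; it only exercises Pattern.CANON_EQ matching and the \b{g}
boundary matcher.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the webrev.
public class CanonEqSketch {

    // Full match with canonical equivalence enabled. Per the quote above,
    // this kind of match is where the fast boundary check matters.
    static boolean canonMatches(String regex, String input) {
        return Pattern.compile(regex, Pattern.CANON_EQ)
                      .matcher(input)
                      .matches();
    }

    // Same match without the flag, for comparison.
    static boolean plainMatches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // Counts grapheme clusters by splitting on the grapheme boundary
    // matcher (available since JDK 9).
    static int graphemeCount(String input) {
        return Pattern.compile("\\b{g}").split(input).length;
    }

    public static void main(String[] args) {
        String decomposed = "a\u030A"; // 'a' followed by combining ring above
        String composed = "\u00E5";    // precomposed a-with-ring

        System.out.println(canonMatches(decomposed, composed));
        System.out.println(plainMatches(decomposed, composed));
        System.out.println(graphemeCount("e\u0301x"));
    }
}
```

Timing these calls in a loop is no substitute for a proper JMH run, but
it should be enough to see which inputs go through the normalization
path.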

Thanks!

/Claes