RFR: 8281315: Unicode, (?i) flag and backreference throwing IndexOutOfBounds Exception
Ian Graves
igraves at openjdk.java.net
Wed Feb 16 22:02:06 UTC 2022
On Wed, 16 Feb 2022 21:00:00 GMT, Naoto Sato <naoto at openjdk.org> wrote:
>> This fixes a bug in the way CIBackRef traverses Unicode characters, which can be variable-length. Originally it followed the approach that BackRef uses, but failed to account for Unicode characters that are two chars long (surrogate pairs). The upper bound (groupSize) of the traversing loop is set to the difference between the group's start and end indexes. This works for single-char characters, and it also works for case-sensitive comparisons because char-by-char comparison is acceptable there, but it doesn't work for a comparison that requires some kind of normalization (such as case folding). This fix adjusts the loop's upper bound whenever a two-char character is encountered during the traversal.
>>
>> An alternative was to compute the group's length in code points by scanning the group in advance, but this could result in multiple scans and code point conversions of the same matcher group, which could be long. Adjusting the loop bound on the fly avoids this.
>
> src/java.base/share/classes/java/util/regex/Pattern.java line 5104:
>
>> 5102: j += Character.charCount(c2);
>> 5103:
>> 5104: if(xIncr > 1) {
>
> You can eliminate `xIncr` by comparing `c1 >= Character.MIN_SUPPLEMENTARY_CODE_POINT` here.
Nice! Thanks, will do.
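
For readers following along, here is a minimal sketch of the two ideas discussed above. It is not the actual patch to `Pattern.java`. The class and method names (`CaseInsensitiveBackRefSketch`, `isSupplementary`, `regionMatchesIgnoreCase`) are hypothetical, and instead of adjusting a counter bound it simply walks both regions by code point, advancing each index by `Character.charCount`, until the end of the group is reached. The case comparison is a simplified upper/lower fold chosen for illustration, not necessarily what the regex engine does.

```java
public class CaseInsensitiveBackRefSketch {

    // The reviewer's suggestion: a code point needs two chars (a surrogate
    // pair) exactly when it is at or above MIN_SUPPLEMENTARY_CODE_POINT,
    // which is the same condition as Character.charCount(cp) > 1.
    static boolean isSupplementary(int cp) {
        return cp >= Character.MIN_SUPPLEMENTARY_CODE_POINT;
    }

    // Hypothetical helper: returns true if the text starting at index i
    // repeats, ignoring case, the group captured in seq[groupStart, groupEnd).
    // Both indexes are char indexes and advance by the width of the code
    // point just read, so surrogate pairs are never split.
    static boolean regionMatchesIgnoreCase(CharSequence seq,
                                           int groupStart, int groupEnd, int i) {
        int j = groupStart;
        while (j < groupEnd) {
            if (i >= seq.length()) {
                return false; // input ended before the whole group was matched
            }
            int c1 = Character.codePointAt(seq, i);
            int c2 = Character.codePointAt(seq, j);
            if (c1 != c2) {
                // Simplified case folding (an assumption for this sketch).
                int u1 = Character.toUpperCase(c1);
                int u2 = Character.toUpperCase(c2);
                if (u1 != u2
                        && Character.toLowerCase(u1) != Character.toLowerCase(u2)) {
                    return false;
                }
            }
            i += Character.charCount(c1);
            j += Character.charCount(c2);
        }
        return true;
    }
}
```

With a group consisting of U+10400 (two chars) followed by its lowercase form U+10428, the loop above consumes two chars per iteration on both sides, so it never reads past the end of the input.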
-------------
PR: https://git.openjdk.java.net/jdk/pull/7501
More information about the core-libs-dev mailing list