RFR 8214245 : (regex) Case insensitive matching doesn't work correctly for some character classes
Roger Riggs
Roger.Riggs at oracle.com
Tue Feb 11 20:49:20 UTC 2020
Hi Ivan,
Will this have enough of a compatibility concern to warrant a CSR?
It will change the behavior in these cases.
In RegExTest, the failures should print which case is failing (lines
4961, 4990).
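A minimal sketch of what such a failure report might look like. The class name FailureReportSketch and the helper check are hypothetical and are not taken from RegExTest; the sketch only assumes that each case can be described by a regex, its flags, an input and an expected result.

```java
import java.util.regex.Pattern;

public class FailureReportSketch {
    // Hypothetical helper, not part of RegExTest: runs one case and returns
    // null on success, or a message naming the failing case on mismatch.
    static String check(String regex, int flags, String input, boolean expected) {
        boolean actual = Pattern.compile(regex, flags).matcher(input).matches();
        if (actual == expected) {
            return null; // case passed, nothing to report
        }
        return String.format(
                "FAILED: regex=%s flags=0x%x input=U+%04X expected=%b actual=%b",
                regex, flags, input.codePointAt(0), expected, actual);
    }

    public static void main(String[] args) {
        // Deliberately wrong expectation, to show the shape of the message.
        String msg = check("\\p{IsLowerCase}", 0, "A", true);
        if (msg != null) {
            System.out.println(msg);
        }
    }
}
```

Returning the message instead of throwing lets a test loop collect every failing case before reporting, which may or may not suit the existing test structure.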
Regards, Roger
On 2/7/20 3:05 PM, Ivan Gerasimov wrote:
> Gentle ping.
>
> I had to rebase the fix, as the code has diverged since the RFR was
> sent out 10 months ago.
>
> Also, the test was slightly modified to cover more cases.
>
> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8214245
> WEBREV: http://cr.openjdk.java.net/~igerasim/8214245/01/webrev/
>
> Thanks in advance to whoever volunteers to review the fix!
>
> With kind regards,
>
> Ivan
>
> On 4/21/19 7:50 PM, Ivan Gerasimov wrote:
>> Hello!
>>
>> It turns out that the case-insensitive j.u.regex.Pattern still pays
>> attention to character case when certain char classes are used.
>> For example, \p{IsLowerCase}, \p{IsUpperCase} and \p{IsTitleCase}
>> continue to recognize only lower, upper and title case characters,
>> even in a case-insensitive context.
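The behavior described above can be probed with a small sketch like the following. The class name CaseInsensitiveClassDemo and the helper matches are illustrative assumptions; only java.util.regex.Pattern and its CASE_INSENSITIVE flag come from the JDK. The result printed in the CASE_INSENSITIVE column depends on whether the JDK in use includes the fix, so no expected output is given for it.

```java
import java.util.regex.Pattern;

public class CaseInsensitiveClassDemo {
    // Illustrative helper: true if the whole input matches the regex
    // when compiled with the given flags.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] classes = {"\\p{IsLowerCase}", "\\p{IsUpperCase}", "\\p{IsTitleCase}"};
        // 'a' is lower case, 'A' is upper case, U+01C8 is a title-case digraph.
        String[] inputs = {"a", "A", "\u01C8"};
        for (String cls : classes) {
            for (String in : inputs) {
                System.out.printf(
                        "%-16s U+%04X  case-sensitive=%-5b  CASE_INSENSITIVE=%b%n",
                        cls, in.codePointAt(0),
                        matches(cls, 0, in),
                        matches(cls, Pattern.CASE_INSENSITIVE, in));
            }
        }
    }
}
```

Comparing the two columns for each class shows whether the case-insensitive flag has any effect on that class in a given JDK build.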
>>
>> For example, for POSIX char classes this behavior contradicts this
>> paragraph:
>> """
>> 9.2 Regular Expression General Requirements
>> ...
>> When a standard utility or function that uses regular expressions
>> specifies that pattern matching shall be performed without regard to
>> the case (uppercase or lowercase) of either data or patterns, then
>> when each character in the string is matched against the pattern, not
>> only the character, but also its case counterpart (if any), shall be
>> matched. This definition of case-insensitive processing is intended
>> to allow matching of multi-character collating elements as well as
>> characters, as each character in the string is matched using both its
>> cases.
>> ...
>> """
>>
>> I also checked how Perl deals with this situation, and yes,
>> it ignores case for various \p{} classes when they are used in a
>> case-insensitive context, so all these tests run fine:
>> 'A' =~ /\p{Lower}/i or die;
>> 'a' =~ /\p{Upper}/i or die;
>> 'A' =~ /\p{gc=Lt}/i or die; # title case
>> 'a' =~ /\p{IsTitlecase}/i or die;
>> 'Lj' =~ /\p{Lower}/i or die; # title-cased digraph
>> 'lj' =~ /\p{Upper}/i or die;
>> 'LJ' =~ /\p{Lt}/i or die;
>>
>> For reference, here's a lengthy document describing the precise rules
>> used by Perl to deal with \p{} char classes:
>> https://perldoc.perl.org/perluniprops.html#Properties-accessible-through-%5cp%7b%7d-and-%5cP%7b%7d
>>
>>
>> So, for any Lower, Upper or Title case chars in a case-insensitive
>> context, Perl uses the set of "Cased Letters", which is just a
>> combination of these three categories (aka the "LC" general category).
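As a rough illustration of the "LC" union mentioned above, here is a sketch that tests membership by general category using Character.getType. The class name CasedLetterSketch and the method isCasedLetter are hypothetical, and this is not claimed to be how the patch implements it. It covers only the general categories Lu, Ll and Lt, not the broader binary Lowercase and Uppercase properties.

```java
public class CasedLetterSketch {
    // Hypothetical helper: true if the code point belongs to general
    // category Lu, Ll or Lt, i.e. the union abbreviated as "LC".
    static boolean isCasedLetter(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.UPPERCASE_LETTER:
            case Character.LOWERCASE_LETTER:
            case Character.TITLECASE_LETTER:
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        // 'A' (Lu), 'a' (Ll), U+01C8 (Lt) and '1' (Nd, not a cased letter).
        int[] samples = {'A', 'a', 0x01C8, '1'};
        for (int cp : samples) {
            System.out.printf("U+%04X cased=%b%n", cp, isCasedLetter(cp));
        }
    }
}
```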
>>
>> Would you please help review the patch?
>>
>> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8214245
>> WEBREV: http://cr.openjdk.java.net/~igerasim/8214245/00/webrev/
>>