RFR 8214245 : (regex) Case insensitive matching doesn't work correctly for some character classes

Tue Feb 25 20:18:58 UTC 2020

Thank you, Roger, for reviewing the CSR and the release note!

On 2/11/20 12:49 PM, Roger Riggs wrote:
> Hi Ivan,
>
> Will this have enough of a compatibility concern to warrant a CSR?
> It will change the behavior of these cases.
>
> In the RegExTest, the failures should print which case is failing. 
> (Line 4961, 4990).
>
I agree that many testcases in RegExTest could provide better 
diagnostics in case of a failure.

I think it may be done as a separate cleanup.

In the added testcase I made sure that both the input string and the 
pattern are printed upon failure.
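
A minimal sketch of that kind of diagnostic follows. It is not the code 
from the webrev: the class name CaseInsensitiveClassCheck and the helper 
check() are hypothetical, chosen only for illustration. The helper 
returns a message naming both the pattern and the input whenever the 
match result differs from the expectation.

```java
import java.util.regex.Pattern;

public class CaseInsensitiveClassCheck {

    // Hypothetical helper: returns null when the match result is as expected,
    // otherwise a diagnostic that names both the pattern and the input.
    public static String check(String regex, int flags, String input,
                               boolean expected) {
        boolean actual = Pattern.compile(regex, flags).matcher(input).matches();
        if (actual == expected) {
            return null;
        }
        return "pattern=\"" + regex + "\" flags=" + flags
                + " input=\"" + input + "\""
                + " expected=" + expected + " actual=" + actual;
    }

    public static void main(String[] args) {
        // Without CASE_INSENSITIVE, \p{Lower} matches only lower-case input.
        System.out.println(check("\\p{Lower}", 0, "a", true));

        // The case-insensitive results depend on whether the JDK in use
        // contains the fix, so they are only reported here, not asserted.
        String[] regexes = { "\\p{Lower}", "\\p{Upper}", "\\p{IsTitleCase}" };
        String[] inputs = { "a", "A" };
        for (String regex : regexes) {
            for (String input : inputs) {
                String failure =
                        check(regex, Pattern.CASE_INSENSITIVE, input, true);
                System.out.println(failure == null
                        ? "ok: " + regex + " matches \"" + input + "\""
                        : "FAILED: " + failure);
            }
        }
    }
}
```

On a JDK without the fix, the FAILED lines show which pattern and input 
disagree, which is the information the older testcases do not print.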

With kind regards,

Ivan

> Regards, Roger
>
>
> On 2/7/20 3:05 PM, Ivan Gerasimov wrote:
>> Gentle ping.
>>
>> I had to rebase the fix, as the code has diverged since the RFR was 
>> sent out 10 months ago.
>>
>> Also, the test was slightly modified to cover more cases.
>>
>> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8214245
>> WEBREV: http://cr.openjdk.java.net/~igerasim/8214245/01/webrev/
>>
>> Thanks in advance to whoever volunteers to review the fix!
>>
>> With kind regards,
>>
>> Ivan
>>
>> On 4/21/19 7:50 PM, Ivan Gerasimov wrote:
>>> Hello!
>>>
>>> It turns out that a case-insensitive j.u.regex.Pattern still pays 
>>> attention to character case when certain char classes are used.
>>> For example \p{IsLowerCase}, \p{IsUpperCase} and \p{IsTitleCase} 
>>> continue to recognize only lower, upper and title case characters, 
>>> even in case-insensitive context.
>>>
>>> For example, for POSIX char classes this behavior contradicts this 
>>> paragraph:
>>> """
>>> 9.2 Regular Expression General Requirements
>>> ...
>>> When a standard utility or function that uses regular expressions 
>>> specifies that pattern matching shall be performed without regard to 
>>> the case (uppercase or lowercase) of either data or patterns, then 
>>> when each character in the string is matched against the pattern, 
>>> not only the character, but also its case counterpart (if any), 
>>> shall be matched. This definition of case-insensitive processing is 
>>> intended to allow matching of multi-character collating elements as 
>>> well as characters, as each character in the string is matched using 
>>> both its cases.
>>> ...
>>> """
>>>
>>> I also checked how Perl deals with this situation, and yes, 
>>> it ignores the case with various \p{} classes when they are used in 
>>> a case-insensitive context, so all these tests run fine:
>>> 'A' =~ /\p{Lower}/i or die;
>>> 'a' =~ /\p{Upper}/i or die;
>>> 'A' =~ /\p{gc=Lt}/i or die; # title case
>>> 'a' =~ /\p{IsTitlecase}/i or die;
>>> 'ǈ' =~ /\p{Lower}/i or die; # title-cased digraph
>>> 'ǉ' =~ /\p{Upper}/i or die;
>>> 'Ǉ' =~ /\p{Lt}/i or die;
>>>
>>> For reference, here's a lengthy document, describing precise rules 
>>> used by Perl to deal with \p{} char classes:
>>> https://perldoc.perl.org/perluniprops.html#Properties-accessible-through-%5cp%7b%7d-and-%5cP%7b%7d 
>>>
>>>
>>> So, for any Lower, Upper or Title case chars in a case-insensitive 
>>> context, Perl uses the set of "Cased Letters", which is just a 
>>> combination of these three categories (aka the "LC" general category).
>>>
>>> Would you please help review the patch?
>>>
>>> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8214245
>>> WEBREV: http://cr.openjdk.java.net/~igerasim/8214245/00/webrev/
>>>
>
-- 
With kind regards,
Ivan Gerasimov