RFR 8214245 : (regex) Case insensitive matching doesn't work correctly for some character classes

Ivan Gerasimov ivan.gerasimov at oracle.com
Tue Feb 25 01:54:59 UTC 2020


Thank you Roger and Joe for the feedback!

May I please ask you to review the CSR draft [1] and the Release Notes 
[2] for this issue:

[1] https://bugs.openjdk.java.net/browse/JDK-8238984

[2] https://bugs.openjdk.java.net/browse/JDK-8239887

Thanks in advance!

Ivan


On 2/11/20 3:10 PM, Joe Darcy wrote:
> Hello,
>
> Yes, I believe this change should have a CSR, and most likely a 
> release note.
>
> Thanks,
>
> -Joe
>
> On 2/11/2020 12:49 PM, Roger Riggs wrote:
>> Hi Ivan,
>>
>> Will this have enough of a compatibility concern to warrant a CSR?
>> It will change the behavior of these cases.
>>
>> In the RegExTest, the failures should print which case is failing. 
>> (Line 4961, 4990).
>>
>> Regards, Roger
>>
>>
>> On 2/7/20 3:05 PM, Ivan Gerasimov wrote:
>>> Gentle ping.
>>>
>>> I had to rebase the fix, as the code has diverged since the RFR was 
>>> sent out 10 months ago.
>>>
>>> Also, the test was slightly modified to cover more cases.
>>>
>>> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8214245
>>> WEBREV: http://cr.openjdk.java.net/~igerasim/8214245/01/webrev/
>>>
>>> Thanks in advance to the volunteer to review the fix!
>>>
>>> With kind regards,
>>>
>>> Ivan
>>>
>>> On 4/21/19 7:50 PM, Ivan Gerasimov wrote:
>>>> Hello!
>>>>
>>>> It turns out that the case-insensitive j.u.regex.Pattern still 
>>>> pays attention to character case when certain char classes are 
>>>> used.
>>>> For example, \p{IsLowerCase}, \p{IsUpperCase} and \p{IsTitleCase} 
>>>> continue to recognize only lower, upper and title case characters, 
>>>> even in a case-insensitive context.
>>>>
>>>> For POSIX char classes, for instance, this behavior contradicts 
>>>> the following paragraph:
>>>> """
>>>> 9.2 Regular Expression General Requirements
>>>> ...
>>>> When a standard utility or function that uses regular expressions 
>>>> specifies that pattern matching shall be performed without regard 
>>>> to the case (uppercase or lowercase) of either data or patterns, 
>>>> then when each character in the string is matched against the 
>>>> pattern, not only the character, but also its case counterpart (if 
>>>> any), shall be matched. This definition of case-insensitive 
>>>> processing is intended to allow matching of multi-character 
>>>> collating elements as well as characters, as each character in the 
>>>> string is matched using both its cases.
>>>> ...
>>>> """
>>>>
>>>> I also checked how Perl deals with this situation, and yes, it 
>>>> ignores case with the various \p{} classes when they are used in a 
>>>> case-insensitive context, so all these tests run fine:
>>>> 'A' =~ /\p{Lower}/i or die;
>>>> 'a' =~ /\p{Upper}/i or die;
>>>> 'A' =~ /\p{gc=Lt}/i or die; # title case
>>>> 'a' =~ /\p{IsTitlecase}/i or die;
>>>> 'Lj' =~ /\p{Lower}/i or die; # title-cased digraph
>>>> 'lj' =~ /\p{Upper}/i or die;
>>>> 'LJ' =~ /\p{Lt}/i or die;
>>>>
>>>> For reference, here's a lengthy document describing the precise 
>>>> rules used by Perl to deal with \p{} char classes:
>>>> https://perldoc.perl.org/perluniprops.html#Properties-accessible-through-%5cp%7b%7d-and-%5cP%7b%7d 
>>>>
>>>>
>>>> So, for any Lower, Upper or Title case chars in a case-insensitive 
>>>> context, Perl uses the set of "Cased Letters", which is just a 
>>>> combination of these three categories (aka the "LC" general category).
>>>>
>>>> Would you please help review the patch?
>>>>
>>>> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8214245
>>>> WEBREV: http://cr.openjdk.java.net/~igerasim/8214245/00/webrev/
>>>>
>>
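
For anyone who wants to observe the behavior described in the original
report quoted above, here is a minimal sketch in Java. The class name,
the matches() helper and the CASED_LETTER constant are made up for
illustration and are not part of the patch or of the RegExTest. The
program only prints what the running JDK does with and without
CASE_INSENSITIVE, so no particular result is assumed for the flagged
case. CASED_LETTER spells out the union of the three general categories
by hand, as one way to match any cased letter independently of the flag.

```java
import java.util.regex.Pattern;

public class CaseInsensitiveClasses {
    // Explicit union of the lower, upper and title case general categories.
    // It matches any cased letter whether or not CASE_INSENSITIVE is set.
    static final String CASED_LETTER = "[\\p{Ll}\\p{Lu}\\p{Lt}]";

    // Hypothetical helper: true if the entire input matches the regex
    // compiled with the given flags.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] classes = {
            "\\p{IsLowerCase}", "\\p{IsUpperCase}", "\\p{IsTitleCase}", CASED_LETTER
        };
        // 'a', 'A', and U+01C8, a title case digraph
        String[] inputs = {"a", "A", "\u01C8"};
        for (String cls : classes) {
            for (String in : inputs) {
                boolean sensitive = matches(cls, 0, in);
                boolean insensitive = matches(cls, Pattern.CASE_INSENSITIVE, in);
                // Print both results side by side; the second column is the
                // one affected by the behavior discussed in this thread.
                System.out.printf("%-24s U+%04X  case-sensitive=%b  CASE_INSENSITIVE=%b%n",
                        cls, in.codePointAt(0), sensitive, insensitive);
            }
        }
    }
}
```

If the two columns are identical for the \p{Is...} rows, the JDK in use
still treats those classes as case-sensitive under the flag.
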
-- 
With kind regards,
Ivan Gerasimov
