RFR 8214245 : (regex) Case insensitive matching doesn't work correctly for some character classes
Ivan Gerasimov
ivan.gerasimov at oracle.com
Wed Mar 18 08:06:36 UTC 2020
Thank you Roger!
Pushed.
With kind regards,
Ivan
On 3/17/20 5:20 PM, Roger Riggs wrote:
> Hi Ivan,
>
> I see the CSR is approved.
>
> I'm fine with the change.
>
> Regards, Roger
>
>
> On 2/25/20 3:18 PM, Ivan Gerasimov wrote:
>> Thank you Roger for reviewing CSR and the release note!
>>
>>
>> On 2/11/20 12:49 PM, Roger Riggs wrote:
>>> Hi Ivan,
>>>
>>> Will this have enough of a compatibility concern to warrant a CSR?
>>> It will change the behavior of these cases.
>>>
>>> In the RegExTest, the failures should print which case is failing.
>>> (Line 4961, 4990).
>>>
>> I agree that many testcases in RegExTest could provide better
>> diagnostics in case of a failure.
>>
>> I think that may be done as a separate cleanup.
>>
>> In the added testcase I made sure that both the input string and the
>> pattern are printed upon failure.
>>
>> With kind regards,
>>
>> Ivan
>>
>>
>>> Regards, Roger
>>>
>>>
>>> On 2/7/20 3:05 PM, Ivan Gerasimov wrote:
>>>> Gentle ping.
>>>>
>>>> I had to rebase the fix, as the code has diverged since the RFR was
>>>> sent out 10 months ago.
>>>>
>>>> Also, the test was slightly modified to cover more cases.
>>>>
>>>> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8214245
>>>> WEBREV: http://cr.openjdk.java.net/~igerasim/8214245/01/webrev/
>>>>
>>>> Thanks in advance to the volunteer to review the fix!
>>>>
>>>> With kind regards,
>>>>
>>>> Ivan
>>>>
>>>> On 4/21/19 7:50 PM, Ivan Gerasimov wrote:
>>>>> Hello!
>>>>>
>>>>> It turns out that a case-insensitive j.u.regex.Pattern still pays
>>>>> attention to character case when certain char classes are used.
>>>>> For example, \p{IsLowerCase}, \p{IsUpperCase} and \p{IsTitleCase}
>>>>> continue to recognize only lower, upper and title case characters,
>>>>> even in a case-insensitive context.
>>>>>
>>>>> For POSIX char classes, this behavior contradicts the following
>>>>> paragraph of the POSIX specification:
>>>>> """
>>>>> 9.2 Regular Expression General Requirements
>>>>> ...
>>>>> When a standard utility or function that uses regular expressions
>>>>> specifies that pattern matching shall be performed without regard
>>>>> to the case (uppercase or lowercase) of either data or patterns,
>>>>> then when each character in the string is matched against the
>>>>> pattern, not only the character, but also its case counterpart (if
>>>>> any), shall be matched. This definition of case-insensitive
>>>>> processing is intended to allow matching of multi-character
>>>>> collating elements as well as characters, as each character in the
>>>>> string is matched using both its cases.
>>>>> ...
>>>>> """
>>>>>
>>>>> I also checked how Perl deals with such a situation, and yes, it
>>>>> ignores case for various \p{} classes when they are used in a
>>>>> case-insensitive context, so all these tests run fine:
>>>>> 'A' =~ /\p{Lower}/i or die;
>>>>> 'a' =~ /\p{Upper}/i or die;
>>>>> 'A' =~ /\p{gc=Lt}/i or die; # title case
>>>>> 'a' =~ /\p{IsTitlecase}/i or die;
>>>>> 'Lj' =~ /\p{Lower}/i or die; # title-cased digraph
>>>>> 'lj' =~ /\p{Upper}/i or die;
>>>>> 'LJ' =~ /\p{Lt}/i or die;
>>>>>
>>>>> For reference, here's a lengthy document, describing precise rules
>>>>> used by Perl to deal with \p{} char classes:
>>>>> https://perldoc.perl.org/perluniprops.html#Properties-accessible-through-%5cp%7b%7d-and-%5cP%7b%7d
>>>>>
>>>>>
>>>>> So, for any of the Lower, Upper or Title case classes in a
>>>>> case-insensitive context, Perl uses the set of "Cased Letters",
>>>>> which is just a combination of these three categories (aka the
>>>>> "LC" general category).
>>>>>
>>>>> Would you please help review the patch?
>>>>>
>>>>> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8214245
>>>>> WEBREV: http://cr.openjdk.java.net/~igerasim/8214245/00/webrev/
>>>>>
>>>
>
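The behavior discussed in the quoted RFR can be checked with a small
Java program. This is only a sketch, not code from the webrev: the class
CaseInsensitiveClassDemo and its helpers isCasedLetter and matches are
hypothetical names, and combining CASE_INSENSITIVE with UNICODE_CASE is
an assumption made for the demo. It compares case-sensitive and
case-insensitive matching for the three classes named in the RFR, and
shows the "cased letters" set (Lu, Ll, Lt) that Perl is described as
using.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the JDK or of the webrev.
class CaseInsensitiveClassDemo {

    // True if the code point is in general category Lu, Ll or Lt,
    // i.e. the "LC" (cased letter) set mentioned in the RFR.
    static boolean isCasedLetter(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.UPPERCASE_LETTER
            || type == Character.LOWERCASE_LETTER
            || type == Character.TITLECASE_LETTER;
    }

    // Matches the whole input against the regex, optionally with
    // case-insensitive flags (flag choice is an assumption of this demo).
    static boolean matches(String regex, String input, boolean ignoreCase) {
        int flags = ignoreCase
            ? (Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE)
            : 0;
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] classes = {
            "\\p{IsLowerCase}", "\\p{IsUpperCase}", "\\p{IsTitleCase}"
        };
        // "a", "A", and the title-case digraph U+01C8.
        String[] inputs = { "a", "A", "\u01C8" };
        for (String cls : classes) {
            for (String in : inputs) {
                System.out.printf(
                    "%s U+%04X sensitive=%b insensitive=%b cased=%b%n",
                    cls, in.codePointAt(0),
                    matches(cls, in, false),
                    matches(cls, in, true),
                    isCasedLetter(in.codePointAt(0)));
            }
        }
    }
}
```

On a JDK without the fix, the "insensitive" column is expected to repeat
the "sensitive" column for these classes. With the fix, it should instead
follow the "cased" column, which is true for all three inputs.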
--
With kind regards,
Ivan Gerasimov