RFR 8214245 : (regex) Case insensitive matching doesn't work correctly for some character classes

Ivan Gerasimov ivan.gerasimov at oracle.com
Fri Feb 7 20:05:00 UTC 2020


Gentle ping.

I had to rebase the fix, as the code has diverged since the RFR was sent 
out 10 months ago.

Also, the test was slightly modified to cover more cases.

BUGURL: https://bugs.openjdk.java.net/browse/JDK-8214245
WEBREV: http://cr.openjdk.java.net/~igerasim/8214245/01/webrev/

Thanks in advance to whoever volunteers to review the fix!

With kind regards,

Ivan

On 4/21/19 7:50 PM, Ivan Gerasimov wrote:
> Hello!
>
> It turns out that a case-insensitive j.u.regex.Pattern still pays 
> attention to character case when certain char classes are used.
> For example, \p{IsLowerCase}, \p{IsUpperCase} and \p{IsTitleCase} 
> continue to recognize only lower, upper and title case characters, 
> even in a case-insensitive context.
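
A minimal sketch of how the behavior described above could be probed from Java. The class name CaseInsensitiveClassProbe and the helper matches() are hypothetical names chosen for illustration; only java.util.regex.Pattern and its CASE_INSENSITIVE flag come from the JDK. No expected output is given for the case-insensitive rows, because the result depends on whether the JDK in use contains the fix.

```java
import java.util.regex.Pattern;

// Hypothetical probe class, not part of the patch or of its test.
class CaseInsensitiveClassProbe {

    // Returns true if the whole input matches the regex under the given flags.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] classes = {"\\p{IsLowerCase}", "\\p{IsUpperCase}", "\\p{IsTitleCase}"};
        // 'a' is lower case, 'A' is upper case, U+01C8 is a title-case digraph.
        String[] inputs = {"a", "A", "\u01C8"};

        for (String cls : classes) {
            for (String in : inputs) {
                boolean sensitive = matches(cls, 0, in);
                boolean insensitive = matches(cls, Pattern.CASE_INSENSITIVE, in);
                System.out.printf("%-16s U+%04X  case-sensitive=%b  case-insensitive=%b%n",
                        cls, in.codePointAt(0), sensitive, insensitive);
            }
        }
    }
}
```

If the case-insensitive column differs between the three inputs for the same class, the JDK in use still shows the behavior reported here.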
>
> For example, for POSIX char classes this behavior contradicts this 
> paragraph:
> """
> 9.2 Regular Expression General Requirements
> ...
> When a standard utility or function that uses regular expressions 
> specifies that pattern matching shall be performed without regard to 
> the case (uppercase or lowercase) of either data or patterns, then 
> when each character in the string is matched against the pattern, not 
> only the character, but also its case counterpart (if any), shall be 
> matched. This definition of case-insensitive processing is intended to 
> allow matching of multi-character collating elements as well as 
> characters, as each character in the string is matched using both its 
> cases.
> ...
> """
>
> I also checked how Perl deals with this situation, and yes, it 
> ignores case for the various \p{} classes when they are used in a 
> case-insensitive context, so all these tests run fine:
> 'A' =~ /\p{Lower}/i or die;
> 'a' =~ /\p{Upper}/i or die;
> 'A' =~ /\p{gc=Lt}/i or die; # title case
> 'a' =~ /\p{IsTitlecase}/i or die;
> 'Lj' =~ /\p{Lower}/i or die; # title-cased digraph
> 'lj' =~ /\p{Upper}/i or die;
> 'LJ' =~ /\p{Lt}/i or die;
>
> For reference, here's a lengthy document describing the precise 
> rules Perl uses to deal with \p{} char classes:
> https://perldoc.perl.org/perluniprops.html#Properties-accessible-through-%5cp%7b%7d-and-%5cP%7b%7d 
>
>
> So, for any Lower, Upper or Title case chars in a case-insensitive 
> context, Perl uses the set of "Cased Letters", which is just the 
> combination of these three categories (aka the "LC" general category).
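
The "Cased Letters" set mentioned above can be sketched in Java in terms of general categories. This is an illustration only, not the code from the webrev: the class name CasedLetters and the method isCasedLetter() are hypothetical, and the sketch uses general categories (Lu, Ll, Lt) via Character.getType, which is an approximation of the binary Lowercase/Uppercase properties that \p{IsLowerCase} and \p{IsUpperCase} refer to.

```java
// Hypothetical helper illustrating the "LC" (cased letter) set.
class CasedLetters {

    // True if the code point's general category is Lu, Ll or Lt.
    static boolean isCasedLetter(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.UPPERCASE_LETTER
            || type == Character.LOWERCASE_LETTER
            || type == Character.TITLECASE_LETTER;
    }

    public static void main(String[] args) {
        // 'A' (Lu), 'a' (Ll), U+01C8 (Lt) and '1' (a decimal digit, not cased).
        int[] samples = {'A', 'a', 0x01C8, '1'};
        for (int cp : samples) {
            System.out.printf("U+%04X cased=%b%n", cp, isCasedLetter(cp));
        }
    }
}
```

Under this reading, a case-insensitive match against any one of the three case classes would accept every code point for which isCasedLetter() returns true.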
>
> Would you please help review the patch?
>
> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8214245
> WEBREV: http://cr.openjdk.java.net/~igerasim/8214245/00/webrev/
>
-- 
With kind regards,
Ivan Gerasimov


