RFR: 8248655: Support supplementary characters in String case insensitive operations
naoto.sato at oracle.com
naoto.sato at oracle.com
Wed Jul 15 18:27:28 UTC 2020
Hi Roger,
Thank you for your review and suggestions.
On 7/15/20 10:56 AM, Roger Riggs wrote:
> Hi Naoto,
>
> Given the extra tests in the body of the loop, I think its worth finding
> or creating
> a JMH test for this and checking the performance.
Good point.
>
> With performance in mind, I would try to fall back to the UC/LC
> conversions only
> when the bytes don't match. See java.util.Arrays.mismatch(byte[], byte[]).
>
> It might even be worth finding the mismatch in the byte arrays before even
> starting to look at the characters.
Agree.
>
> There's also an option to assemble 4 bytes at a time and compare the int's.
> If they are equal you are ahead of the game. If not, back off to comparing
> the characters and checking for surrogates. The backoff code will be a bit
> messier though.
Sure int comparison would be faster, I am not quite sure that would be
worth compared to more complicated recovery code though.
>
> Also, compareToCI and regionMatchesCI could share the implementation of
> the inner loop.
Yes.
>
> If k1 and k2 ever get out of sync, isn't that failed assertion, so why
> have two indexes.
This is for preventing possible case maps where one in BMP and the other
in supplementary, thus the indices could be different.
>
> The loop will have fewer checks against the length of it processes len-1
> chars
> and then have a check if there is a final char to be checked.
> it can always know there is another char and can blindly get it.
Maybe, but could be complicated if the last char is a low surrogate.
Anyway, I will come up with a JMH test case and revise the webrev.
Naoto
>
> Regards, Roger
>
>
> On 7/15/20 12:00 PM, naoto.sato at oracle.com wrote:
>> Hello,
>>
>> Please review the fix to the following issues:
>>
>> https://bugs.openjdk.java.net/browse/JDK-8248655
>> https://bugs.openjdk.java.net/browse/JDK-8248434
>>
>> The proposed changeset and its CSR are located at:
>>
>> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
>> https://bugs.openjdk.java.net/browse/JDK-8248664
>>
>> A bug was filed against SimpleDateFormat (8248434) where
>> case-insensitive date format/parse failed in some of the new locales
>> in JDK15. The root cause was that case-insensitive
>> String.regionMatches() method did not work with supplementary
>> characters. The problem is that the method's spec does not expect case
>> mappings of supplementary characters, possibly because it was
>> overlooked in the first place, JSR 204 - "Unicode Supplementary
>> Character support". Similar behavior is observed in other two
>> case-insensitive methods, i.e., compareToIgnoreCase() and
>> equalsIgnoreCase().
>>
>> The fix is straightforward to compare strings by code point basis,
>> instead of code unit (16bit "char") basis. Technically this change
>> will introduce a backward incompatibility, but I believe it is an
>> incompatibility to wrong behavior, not true to the meaning of those
>> methods' expectations.
>>
>> Naoto
>
More information about the core-libs-dev
mailing list