<i18n dev> RFR: 8248655: Support supplementary characters in String case insensitive operations
Joe Wang
huizhe.wang at oracle.com
Wed Jul 15 19:32:36 UTC 2020
Jim: I was referring to the rest of the process (the calls to
toCodePoint/toUpperCase/toLowerCase), not isHighSurrogate itself. But
Roger has a more comprehensive review on performance, and Naoto is
planning on preparing a JMH test.
-Joe
On 7/15/2020 11:46 AM, Jim Laskey wrote:
> Joe: This is a defensive approach that I believe has minimal cost.
>
> public static boolean isHighSurrogate(char ch) {
> // Help VM constant-fold; MAX_HIGH_SURROGATE + 1 == MIN_LOW_SURROGATE
> return ch >= MIN_HIGH_SURROGATE && ch < (MAX_HIGH_SURROGATE + 1);
> }
>
>
>> On Jul 15, 2020, at 3:32 PM, naoto.sato at oracle.com wrote:
>>
>> Hi Joe,
>>
>> Thank you for your review.
>>
>> On 7/15/20 10:57 AM, Joe Wang wrote:
>>> Hi Naoto,
>>> In StringUTF16.java, if one is isHighSurrogate and the other not, you may quickly return without going through the rest of the process, probably not significant as cp1 and cp2 and/or u1 and u2 won't be equal anyways. But it could skip a couple of toCodePoint/toUpperCase/toLowerCase calls.
>> Yes, that is correct as of now, which is based on the assumption that case mappings do not cross BMP and supplementary planes boundary. I could not find any description where that's given or not. So I just took it to be safe.
>>
>> Naoto
>>
>>> -Joe
>>> On 7/15/20 9:00 AM, naoto.sato at oracle.com wrote:
>>>> Hello,
>>>>
>>>> Please review the fix to the following issues:
>>>>
>>>> https://bugs.openjdk.java.net/browse/JDK-8248655
>>>> https://bugs.openjdk.java.net/browse/JDK-8248434
>>>>
>>>> The proposed changeset and its CSR are located at:
>>>>
>>>> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
>>>> https://bugs.openjdk.java.net/browse/JDK-8248664
>>>>
>>>> A bug was filed against SimpleDateFormat (8248434) where case-insensitive date format/parse failed in some of the new locales in JDK15. The root cause was that case-insensitive String.regionMatches() method did not work with supplementary characters. The problem is that the method's spec does not expect case mappings of supplementary characters, possibly because it was overlooked in the first place, JSR 204 - "Unicode Supplementary Character support". Similar behavior is observed in other two case-insensitive methods, i.e., compareToIgnoreCase() and equalsIgnoreCase().
>>>>
>>>> The fix is straightforward to compare strings by code point basis, instead of code unit (16bit "char") basis. Technically this change will introduce a backward incompatibility, but I believe it is an incompatibility to wrong behavior, not true to the meaning of those methods' expectations.
>>>>
>>>> Naoto
More information about the i18n-dev
mailing list