<i18n dev> RFR: 8248655: Support supplementary characters in String case insensitive operations

Wed Jul 22 20:37:21 UTC 2020

Hi Joe,

Thank you for the consecutive reviews!

On 7/22/20 1:20 PM, Joe Wang wrote:
> Hi Naoto,
> 
> The change looks good to me. "supLower" is indeed super slow :-)
> 
> The only minor comment I have is that the compareCodePointCI method 
> performs toUpperCase unconditionally. That's not a problem for the 
> regular case, where a check on cp1 == cp2 (line 337) is done prior to 
> the method call. But for the sup case (starting at line 341), the method 
> is called unconditionally while in webrev.04 there was a check "cp1 != 
> cp2".  One option to fix it is to include the "cp1 != cp2" check in the 
> method compareCodePointCI, then cp1 == cp2 at line 337 can be omitted.

That was intentional, as at the point when it calls compareCodePointCI() 
for the second time, it is guaranteed that the supplementary code points 
differ because either their high surrogates or low surrogates differ.

Naoto

> 
> Regards,
> Joe
> 
> On 7/22/20 10:23 AM, naoto.sato at oracle.com wrote:
>> Hi,
>>
>> I revised the fix again, based on further suggestions:
>>
>> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.05/
>>
>> Changes from v.04 are (all in StringUTF16.java):
>>
>> - The short cut now does case insensitive comparison that makes the 
>> fix closer to the previous implementation (for BMP characters).
>> - Changed the bit operation to negating for detecting needed index 
>> increment.
>> - Method name is changed to better reflect what it is doing, with more 
>> descriptive comments.
>>
>> Here is the benchmark results:
>>
>> before:
>> Benchmark                                Mode  Cnt   Score Error  Units
>> StringCompareToIgnoreCase.lower          avgt   25  49.960 ? 1.923  ns/op
>> StringCompareToIgnoreCase.supLower       avgt   25  21.003 ? 0.354  ns/op
>> StringCompareToIgnoreCase.supUpperLower  avgt   25  30.863 ? 4.529  ns/op
>> StringCompareToIgnoreCase.upperLower     avgt   25  15.417 ? 1.046  ns/op
>>
>> after:
>> Benchmark                                Mode  Cnt    Score Error  Units
>> StringCompareToIgnoreCase.lower          avgt   25   46.857 ? 0.524 ns/op
>> StringCompareToIgnoreCase.supLower       avgt   25  148.688 ? 6.546 ns/op
>> StringCompareToIgnoreCase.supUpperLower  avgt   25   37.160 ? 0.259 ns/op
>> StringCompareToIgnoreCase.upperLower     avgt   25   15.126 ? 0.338 ns/op
>>
>> Now non-supplementary operations ("lower" and "upperLower") are on par 
>> with the "before" result (I am not quite sure why the "after" results 
>> are somewhat faster though). For supplementary test cases, "supLower" 
>> is very slow. The reason is two fold; one is because "before" one 
>> exits at the very first character (which I am addressing here) while 
>> "after" continues to compare to the last characters, the other reason 
>> is the test suffers from the change where supplementary cases double 
>> the case insensitivity checks (compared to the "after" result just 
>> below). Also "supUpperLower" gets slower for the same reason. These 
>> are expected results for supplementary comparisons (as we discussed).
>>
>> Naoto
>>
>> On 7/17/20 4:36 PM, naoto.sato at oracle.com wrote:
>>> Hi,
>>>
>>> Based on the suggestions, I modified the fix as follows:
>>>
>>> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/
>>>
>>> Changes from the initial revision are:
>>>
>>> - Shared the implementation between compareToCI() and regionMatchesCI()
>>> - Enabled immediate short cut if two code points match.
>>> - Created a simple JMH benchmark. Here is the scores before and after 
>>> the change:
>>>
>>> before:
>>> Benchmark                                Mode  Cnt   Score Error  Units
>>> StringCompareToIgnoreCase.lower          avgt   25  53.764 ? 2.811 ns/op
>>> StringCompareToIgnoreCase.supLower       avgt   25  24.211 ? 1.135 ns/op
>>> StringCompareToIgnoreCase.supUpperLower  avgt   25  30.595 ? 1.344 ns/op
>>> StringCompareToIgnoreCase.upperLower     avgt   25  18.859 ? 1.499 ns/op
>>>
>>> after:
>>> Benchmark                                Mode  Cnt   Score Error  Units
>>> StringCompareToIgnoreCase.lower          avgt   25  58.354 ? 4.603 ns/op
>>> StringCompareToIgnoreCase.supLower       avgt   25  57.975 ? 5.672 ns/op
>>> StringCompareToIgnoreCase.supUpperLower  avgt   25  23.912 ? 0.965 ns/op
>>> StringCompareToIgnoreCase.upperLower     avgt   25  17.744 ? 0.272 ns/op
>>>
>>> Here, "sup" means all supplementary characters, BMP otherwise. 
>>> "lower" means each character requires upper->lower casemap. 
>>> "upperLower" means all characters are the same, except the last 
>>> character which requires casemap.
>>>
>>> I think the result is reasonable, considering surrogates check are 
>>> now mandatory. I have tried Roger's suggestion to use 
>>> Arrays.mismatch() but it did not seem to benefit here. In fact, the 
>>> performance degraded partly because I implemented the short cut, and 
>>> possibly for the overhead of extra checks.
>>>
>>> Naoto
>>>
>>> On 7/15/20 9:00 AM, naoto.sato at oracle.com wrote:
>>>> Hello,
>>>>
>>>> Please review the fix to the following issues:
>>>>
>>>> https://bugs.openjdk.java.net/browse/JDK-8248655
>>>> https://bugs.openjdk.java.net/browse/JDK-8248434
>>>>
>>>> The proposed changeset and its CSR are located at:
>>>>
>>>> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
>>>> https://bugs.openjdk.java.net/browse/JDK-8248664
>>>>
>>>> A bug was filed against SimpleDateFormat (8248434) where 
>>>> case-insensitive date format/parse failed in some of the new locales 
>>>> in JDK15. The root cause was that case-insensitive 
>>>> String.regionMatches() method did not work with supplementary 
>>>> characters. The problem is that the method's spec does not expect 
>>>> case mappings of supplementary characters, possibly because it was 
>>>> overlooked in the first place, JSR 204 - "Unicode Supplementary 
>>>> Character support". Similar behavior is observed in other two 
>>>> case-insensitive methods, i.e., compareToIgnoreCase() and 
>>>> equalsIgnoreCase().
>>>>
>>>> The fix is straightforward to compare strings by code point basis, 
>>>> instead of code unit (16bit "char") basis. Technically this change 
>>>> will introduce a backward incompatibility, but I believe it is an 
>>>> incompatibility to wrong behavior, not true to the meaning of those 
>>>> methods' expectations.
>>>>
>>>> Naoto
>