RFR: 8248655: Support supplementary characters in String case insensitive operations

Wed Jul 22 21:32:51 UTC 2020

On 7/22/20 1:43 PM, naoto.sato at oracle.com wrote:
> Thanks Roger,
>
> Ah, I just saw your email just after I sent mine!

They probably saw each other crossing and said hi on the way to inboxes ;-)
>
> On 7/22/20 1:38 PM, Roger Riggs wrote:
>> Hi Naoto,
>>
>> Looks fine. (with Joe's suggestion)
>>
>> On 7/22/20 4:20 PM, Joe Wang wrote:
>>> Hi Naoto,
>>>
>>> The change looks good to me. "supLower" is indeed super slow :-)
>>>
>>> The only minor comment I have is that the compareCodePointCI method 
>>> performs toUpperCase unconditionally. That's not a problem for the 
>>> regular case, where a check on cp1 == cp2 (line 337) is done prior 
>>> to the method call. But for the sup case (starting at line 341), the 
>>> method is called unconditionally while in webrev.04 there was a 
>>> check "cp1 != cp2".  One option to fix it is to include the "cp1 != 
>>> cp2" check in the method compareCodePointCI, then cp1 == cp2 at line 
>>> 337 can be omitted.
>> I would have added to line 353 the same cp1 == cp2 check as 337 to 
>> avoid the method call
>> unless it was needed.
>
> As in the previous email, cp1 != cp2 at that point, since either high 
> or low surrogates differ.

Make sense. The webrev looks good to me.

-Joe

>
> Naoto
>
>>
>> Roger
>>
>>>
>>> Regards,
>>> Joe
>>>
>>> On 7/22/20 10:23 AM, naoto.sato at oracle.com wrote:
>>>> Hi,
>>>>
>>>> I revised the fix again, based on further suggestions:
>>>>
>>>> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.05/
>>>>
>>>> Changes from v.04 are (all in StringUTF16.java):
>>>>
>>>> - The short cut now does case insensitive comparison that makes the 
>>>> fix closer to the previous implementation (for BMP characters).
>>>> - Changed the bit operation to negating for detecting needed index 
>>>> increment.
>>>> - Method name is changed to better reflect what it is doing, with 
>>>> more descriptive comments.
>>>>
>>>> Here is the benchmark results:
>>>>
>>>> before:
>>>> Benchmark                                Mode  Cnt   Score Error  
>>>> Units
>>>> StringCompareToIgnoreCase.lower          avgt   25  49.960 ? 1.923 
>>>> ns/op
>>>> StringCompareToIgnoreCase.supLower       avgt   25  21.003 ? 0.354 
>>>> ns/op
>>>> StringCompareToIgnoreCase.supUpperLower  avgt   25  30.863 ? 4.529 
>>>> ns/op
>>>> StringCompareToIgnoreCase.upperLower     avgt   25  15.417 ? 1.046 
>>>> ns/op
>>>>
>>>> after:
>>>> Benchmark                                Mode  Cnt    Score Error  
>>>> Units
>>>> StringCompareToIgnoreCase.lower          avgt   25   46.857 ? 0.524 
>>>> ns/op
>>>> StringCompareToIgnoreCase.supLower       avgt   25  148.688 ? 6.546 
>>>> ns/op
>>>> StringCompareToIgnoreCase.supUpperLower  avgt   25   37.160 ? 0.259 
>>>> ns/op
>>>> StringCompareToIgnoreCase.upperLower     avgt   25   15.126 ? 0.338 
>>>> ns/op
>>>>
>>>> Now non-supplementary operations ("lower" and "upperLower") are on 
>>>> par with the "before" result (I am not quite sure why the "after" 
>>>> results are somewhat faster though). For supplementary test cases, 
>>>> "supLower" is very slow. The reason is two fold; one is because 
>>>> "before" one exits at the very first character (which I am 
>>>> addressing here) while "after" continues to compare to the last 
>>>> characters, the other reason is the test suffers from the change 
>>>> where supplementary cases double the case insensitivity checks 
>>>> (compared to the "after" result just below). Also "supUpperLower" 
>>>> gets slower for the same reason. These are expected results for 
>>>> supplementary comparisons (as we discussed).
>>>>
>>>> Naoto
>>>>
>>>> On 7/17/20 4:36 PM, naoto.sato at oracle.com wrote:
>>>>> Hi,
>>>>>
>>>>> Based on the suggestions, I modified the fix as follows:
>>>>>
>>>>> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/
>>>>>
>>>>> Changes from the initial revision are:
>>>>>
>>>>> - Shared the implementation between compareToCI() and 
>>>>> regionMatchesCI()
>>>>> - Enabled immediate short cut if two code points match.
>>>>> - Created a simple JMH benchmark. Here is the scores before and 
>>>>> after the change:
>>>>>
>>>>> before:
>>>>> Benchmark                                Mode  Cnt   Score Error  
>>>>> Units
>>>>> StringCompareToIgnoreCase.lower          avgt   25  53.764 ? 2.811 
>>>>> ns/op
>>>>> StringCompareToIgnoreCase.supLower       avgt   25  24.211 ? 1.135 
>>>>> ns/op
>>>>> StringCompareToIgnoreCase.supUpperLower  avgt   25  30.595 ? 1.344 
>>>>> ns/op
>>>>> StringCompareToIgnoreCase.upperLower     avgt   25  18.859 ? 1.499 
>>>>> ns/op
>>>>>
>>>>> after:
>>>>> Benchmark                                Mode  Cnt   Score Error  
>>>>> Units
>>>>> StringCompareToIgnoreCase.lower          avgt   25  58.354 ? 4.603 
>>>>> ns/op
>>>>> StringCompareToIgnoreCase.supLower       avgt   25  57.975 ? 5.672 
>>>>> ns/op
>>>>> StringCompareToIgnoreCase.supUpperLower  avgt   25  23.912 ? 0.965 
>>>>> ns/op
>>>>> StringCompareToIgnoreCase.upperLower     avgt   25  17.744 ? 0.272 
>>>>> ns/op
>>>>>
>>>>> Here, "sup" means all supplementary characters, BMP otherwise. 
>>>>> "lower" means each character requires upper->lower casemap. 
>>>>> "upperLower" means all characters are the same, except the last 
>>>>> character which requires casemap.
>>>>>
>>>>> I think the result is reasonable, considering surrogates check are 
>>>>> now mandatory. I have tried Roger's suggestion to use 
>>>>> Arrays.mismatch() but it did not seem to benefit here. In fact, 
>>>>> the performance degraded partly because I implemented the short 
>>>>> cut, and possibly for the overhead of extra checks.
>>>>>
>>>>> Naoto
>>>>>
>>>>> On 7/15/20 9:00 AM, naoto.sato at oracle.com wrote:
>>>>>> Hello,
>>>>>>
>>>>>> Please review the fix to the following issues:
>>>>>>
>>>>>> https://bugs.openjdk.java.net/browse/JDK-8248655
>>>>>> https://bugs.openjdk.java.net/browse/JDK-8248434
>>>>>>
>>>>>> The proposed changeset and its CSR are located at:
>>>>>>
>>>>>> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
>>>>>> https://bugs.openjdk.java.net/browse/JDK-8248664
>>>>>>
>>>>>> A bug was filed against SimpleDateFormat (8248434) where 
>>>>>> case-insensitive date format/parse failed in some of the new 
>>>>>> locales in JDK15. The root cause was that case-insensitive 
>>>>>> String.regionMatches() method did not work with supplementary 
>>>>>> characters. The problem is that the method's spec does not expect 
>>>>>> case mappings of supplementary characters, possibly because it 
>>>>>> was overlooked in the first place, JSR 204 - "Unicode 
>>>>>> Supplementary Character support". Similar behavior is observed in 
>>>>>> other two case-insensitive methods, i.e., compareToIgnoreCase() 
>>>>>> and equalsIgnoreCase().
>>>>>>
>>>>>> The fix is straightforward to compare strings by code point 
>>>>>> basis, instead of code unit (16bit "char") basis. Technically 
>>>>>> this change will introduce a backward incompatibility, but I 
>>>>>> believe it is an incompatibility to wrong behavior, not true to 
>>>>>> the meaning of those methods' expectations.
>>>>>>
>>>>>> Naoto
>>>
>>