RFR: 8248655: Support supplementary characters in String case insensitive operations

Wed Jul 22 20:20:20 UTC 2020

Hi Naoto,

The change looks good to me. "supLower" is indeed super slow :-)

The only minor comment I have is that the compareCodePointCI method 
performs toUpperCase unconditionally. That's not a problem for the 
regular case, where a check on cp1 == cp2 (line 337) is done prior to 
the method call. But for the sup case (starting at line 341), the method 
is called unconditionally while in webrev.04 there was a check "cp1 != 
cp2".  One option to fix it is to include the "cp1 != cp2" check in the 
method compareCodePointCI, then cp1 == cp2 at line 337 can be omitted.

Regards,
Joe

On 7/22/20 10:23 AM, naoto.sato at oracle.com wrote:
> Hi,
>
> I revised the fix again, based on further suggestions:
>
> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.05/
>
> Changes from v.04 are (all in StringUTF16.java):
>
> - The short cut now does case insensitive comparison that makes the 
> fix closer to the previous implementation (for BMP characters).
> - Changed the bit operation to negating for detecting needed index 
> increment.
> - Method name is changed to better reflect what it is doing, with more 
> descriptive comments.
>
> Here is the benchmark results:
>
> before:
> Benchmark                                Mode  Cnt   Score Error  Units
> StringCompareToIgnoreCase.lower          avgt   25  49.960 ? 1.923  ns/op
> StringCompareToIgnoreCase.supLower       avgt   25  21.003 ? 0.354  ns/op
> StringCompareToIgnoreCase.supUpperLower  avgt   25  30.863 ? 4.529  ns/op
> StringCompareToIgnoreCase.upperLower     avgt   25  15.417 ? 1.046  ns/op
>
> after:
> Benchmark                                Mode  Cnt    Score Error  Units
> StringCompareToIgnoreCase.lower          avgt   25   46.857 ? 0.524  
> ns/op
> StringCompareToIgnoreCase.supLower       avgt   25  148.688 ? 6.546  
> ns/op
> StringCompareToIgnoreCase.supUpperLower  avgt   25   37.160 ? 0.259  
> ns/op
> StringCompareToIgnoreCase.upperLower     avgt   25   15.126 ? 0.338  
> ns/op
>
> Now non-supplementary operations ("lower" and "upperLower") are on par 
> with the "before" result (I am not quite sure why the "after" results 
> are somewhat faster though). For supplementary test cases, "supLower" 
> is very slow. The reason is two fold; one is because "before" one 
> exits at the very first character (which I am addressing here) while 
> "after" continues to compare to the last characters, the other reason 
> is the test suffers from the change where supplementary cases double 
> the case insensitivity checks (compared to the "after" result just 
> below). Also "supUpperLower" gets slower for the same reason. These 
> are expected results for supplementary comparisons (as we discussed).
>
> Naoto
>
> On 7/17/20 4:36 PM, naoto.sato at oracle.com wrote:
>> Hi,
>>
>> Based on the suggestions, I modified the fix as follows:
>>
>> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/
>>
>> Changes from the initial revision are:
>>
>> - Shared the implementation between compareToCI() and regionMatchesCI()
>> - Enabled immediate short cut if two code points match.
>> - Created a simple JMH benchmark. Here is the scores before and after 
>> the change:
>>
>> before:
>> Benchmark                                Mode  Cnt   Score Error  Units
>> StringCompareToIgnoreCase.lower          avgt   25  53.764 ? 2.811  
>> ns/op
>> StringCompareToIgnoreCase.supLower       avgt   25  24.211 ? 1.135  
>> ns/op
>> StringCompareToIgnoreCase.supUpperLower  avgt   25  30.595 ? 1.344  
>> ns/op
>> StringCompareToIgnoreCase.upperLower     avgt   25  18.859 ? 1.499  
>> ns/op
>>
>> after:
>> Benchmark                                Mode  Cnt   Score Error  Units
>> StringCompareToIgnoreCase.lower          avgt   25  58.354 ? 4.603  
>> ns/op
>> StringCompareToIgnoreCase.supLower       avgt   25  57.975 ? 5.672  
>> ns/op
>> StringCompareToIgnoreCase.supUpperLower  avgt   25  23.912 ? 0.965  
>> ns/op
>> StringCompareToIgnoreCase.upperLower     avgt   25  17.744 ? 0.272  
>> ns/op
>>
>> Here, "sup" means all supplementary characters, BMP otherwise. 
>> "lower" means each character requires upper->lower casemap. 
>> "upperLower" means all characters are the same, except the last 
>> character which requires casemap.
>>
>> I think the result is reasonable, considering surrogates check are 
>> now mandatory. I have tried Roger's suggestion to use 
>> Arrays.mismatch() but it did not seem to benefit here. In fact, the 
>> performance degraded partly because I implemented the short cut, and 
>> possibly for the overhead of extra checks.
>>
>> Naoto
>>
>> On 7/15/20 9:00 AM, naoto.sato at oracle.com wrote:
>>> Hello,
>>>
>>> Please review the fix to the following issues:
>>>
>>> https://bugs.openjdk.java.net/browse/JDK-8248655
>>> https://bugs.openjdk.java.net/browse/JDK-8248434
>>>
>>> The proposed changeset and its CSR are located at:
>>>
>>> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
>>> https://bugs.openjdk.java.net/browse/JDK-8248664
>>>
>>> A bug was filed against SimpleDateFormat (8248434) where 
>>> case-insensitive date format/parse failed in some of the new locales 
>>> in JDK15. The root cause was that case-insensitive 
>>> String.regionMatches() method did not work with supplementary 
>>> characters. The problem is that the method's spec does not expect 
>>> case mappings of supplementary characters, possibly because it was 
>>> overlooked in the first place, JSR 204 - "Unicode Supplementary 
>>> Character support". Similar behavior is observed in other two 
>>> case-insensitive methods, i.e., compareToIgnoreCase() and 
>>> equalsIgnoreCase().
>>>
>>> The fix is straightforward to compare strings by code point basis, 
>>> instead of code unit (16bit "char") basis. Technically this change 
>>> will introduce a backward incompatibility, but I believe it is an 
>>> incompatibility to wrong behavior, not true to the meaning of those 
>>> methods' expectations.
>>>
>>> Naoto