<i18n dev> RFR: 8248655: Support supplementary characters in String case insensitive operations

Wed Jul 22 17:23:16 UTC 2020

Hi,

I revised the fix again, based on further suggestions:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.05/

Changes from v.04 are (all in StringUTF16.java):

- The short cut now does case insensitive comparison that makes the fix 
closer to the previous implementation (for BMP characters).
- Changed the bit operation to negating for detecting needed index 
increment.
- Method name is changed to better reflect what it is doing, with more 
descriptive comments.

Here is the benchmark results:

before:
Benchmark                                Mode  Cnt   Score   Error  Units
StringCompareToIgnoreCase.lower          avgt   25  49.960 ? 1.923  ns/op
StringCompareToIgnoreCase.supLower       avgt   25  21.003 ? 0.354  ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25  30.863 ? 4.529  ns/op
StringCompareToIgnoreCase.upperLower     avgt   25  15.417 ? 1.046  ns/op

after:
Benchmark                                Mode  Cnt    Score   Error  Units
StringCompareToIgnoreCase.lower          avgt   25   46.857 ? 0.524  ns/op
StringCompareToIgnoreCase.supLower       avgt   25  148.688 ? 6.546  ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25   37.160 ? 0.259  ns/op
StringCompareToIgnoreCase.upperLower     avgt   25   15.126 ? 0.338  ns/op

Now non-supplementary operations ("lower" and "upperLower") are on par 
with the "before" result (I am not quite sure why the "after" results 
are somewhat faster though). For supplementary test cases, "supLower" is 
very slow. The reason is two fold; one is because "before" one exits at 
the very first character (which I am addressing here) while "after" 
continues to compare to the last characters, the other reason is the 
test suffers from the change where supplementary cases double the case 
insensitivity checks (compared to the "after" result just below). Also 
"supUpperLower" gets slower for the same reason. These are expected 
results for supplementary comparisons (as we discussed).

Naoto

On 7/17/20 4:36 PM, naoto.sato at oracle.com wrote:
> Hi,
> 
> Based on the suggestions, I modified the fix as follows:
> 
> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/
> 
> Changes from the initial revision are:
> 
> - Shared the implementation between compareToCI() and regionMatchesCI()
> - Enabled immediate short cut if two code points match.
> - Created a simple JMH benchmark. Here is the scores before and after 
> the change:
> 
> before:
> Benchmark                                Mode  Cnt   Score   Error  Units
> StringCompareToIgnoreCase.lower          avgt   25  53.764 ? 2.811  ns/op
> StringCompareToIgnoreCase.supLower       avgt   25  24.211 ? 1.135  ns/op
> StringCompareToIgnoreCase.supUpperLower  avgt   25  30.595 ? 1.344  ns/op
> StringCompareToIgnoreCase.upperLower     avgt   25  18.859 ? 1.499  ns/op
> 
> after:
> Benchmark                                Mode  Cnt   Score   Error  Units
> StringCompareToIgnoreCase.lower          avgt   25  58.354 ? 4.603  ns/op
> StringCompareToIgnoreCase.supLower       avgt   25  57.975 ? 5.672  ns/op
> StringCompareToIgnoreCase.supUpperLower  avgt   25  23.912 ? 0.965  ns/op
> StringCompareToIgnoreCase.upperLower     avgt   25  17.744 ? 0.272  ns/op
> 
> Here, "sup" means all supplementary characters, BMP otherwise. "lower" 
> means each character requires upper->lower casemap. "upperLower" means 
> all characters are the same, except the last character which requires 
> casemap.
> 
> I think the result is reasonable, considering surrogates check are now 
> mandatory. I have tried Roger's suggestion to use Arrays.mismatch() but 
> it did not seem to benefit here. In fact, the performance degraded 
> partly because I implemented the short cut, and possibly for the 
> overhead of extra checks.
> 
> Naoto
> 
> On 7/15/20 9:00 AM, naoto.sato at oracle.com wrote:
>> Hello,
>>
>> Please review the fix to the following issues:
>>
>> https://bugs.openjdk.java.net/browse/JDK-8248655
>> https://bugs.openjdk.java.net/browse/JDK-8248434
>>
>> The proposed changeset and its CSR are located at:
>>
>> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
>> https://bugs.openjdk.java.net/browse/JDK-8248664
>>
>> A bug was filed against SimpleDateFormat (8248434) where 
>> case-insensitive date format/parse failed in some of the new locales 
>> in JDK15. The root cause was that case-insensitive 
>> String.regionMatches() method did not work with supplementary 
>> characters. The problem is that the method's spec does not expect case 
>> mappings of supplementary characters, possibly because it was 
>> overlooked in the first place, JSR 204 - "Unicode Supplementary 
>> Character support". Similar behavior is observed in other two 
>> case-insensitive methods, i.e., compareToIgnoreCase() and 
>> equalsIgnoreCase().
>>
>> The fix is straightforward to compare strings by code point basis, 
>> instead of code unit (16bit "char") basis. Technically this change 
>> will introduce a backward incompatibility, but I believe it is an 
>> incompatibility to wrong behavior, not true to the meaning of those 
>> methods' expectations.
>>
>> Naoto