<i18n dev> RFR: 8248655: Support supplementary characters in String case insensitive operations
naoto.sato at oracle.com
naoto.sato at oracle.com
Wed Jul 22 17:23:16 UTC 2020
Hi,
I revised the fix again, based on further suggestions:
https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.05/
Changes from v.04 are (all in StringUTF16.java):
- The short cut now does case insensitive comparison that makes the fix
closer to the previous implementation (for BMP characters).
- Changed the bit operation to negating for detecting needed index
increment.
- Method name is changed to better reflect what it is doing, with more
descriptive comments.
Here is the benchmark results:
before:
Benchmark Mode Cnt Score Error Units
StringCompareToIgnoreCase.lower avgt 25 49.960 ? 1.923 ns/op
StringCompareToIgnoreCase.supLower avgt 25 21.003 ? 0.354 ns/op
StringCompareToIgnoreCase.supUpperLower avgt 25 30.863 ? 4.529 ns/op
StringCompareToIgnoreCase.upperLower avgt 25 15.417 ? 1.046 ns/op
after:
Benchmark Mode Cnt Score Error Units
StringCompareToIgnoreCase.lower avgt 25 46.857 ? 0.524 ns/op
StringCompareToIgnoreCase.supLower avgt 25 148.688 ? 6.546 ns/op
StringCompareToIgnoreCase.supUpperLower avgt 25 37.160 ? 0.259 ns/op
StringCompareToIgnoreCase.upperLower avgt 25 15.126 ? 0.338 ns/op
Now non-supplementary operations ("lower" and "upperLower") are on par
with the "before" result (I am not quite sure why the "after" results
are somewhat faster though). For supplementary test cases, "supLower" is
very slow. The reason is two fold; one is because "before" one exits at
the very first character (which I am addressing here) while "after"
continues to compare to the last characters, the other reason is the
test suffers from the change where supplementary cases double the case
insensitivity checks (compared to the "after" result just below). Also
"supUpperLower" gets slower for the same reason. These are expected
results for supplementary comparisons (as we discussed).
Naoto
On 7/17/20 4:36 PM, naoto.sato at oracle.com wrote:
> Hi,
>
> Based on the suggestions, I modified the fix as follows:
>
> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/
>
> Changes from the initial revision are:
>
> - Shared the implementation between compareToCI() and regionMatchesCI()
> - Enabled immediate short cut if two code points match.
> - Created a simple JMH benchmark. Here is the scores before and after
> the change:
>
> before:
> Benchmark Mode Cnt Score Error Units
> StringCompareToIgnoreCase.lower avgt 25 53.764 ? 2.811 ns/op
> StringCompareToIgnoreCase.supLower avgt 25 24.211 ? 1.135 ns/op
> StringCompareToIgnoreCase.supUpperLower avgt 25 30.595 ? 1.344 ns/op
> StringCompareToIgnoreCase.upperLower avgt 25 18.859 ? 1.499 ns/op
>
> after:
> Benchmark Mode Cnt Score Error Units
> StringCompareToIgnoreCase.lower avgt 25 58.354 ? 4.603 ns/op
> StringCompareToIgnoreCase.supLower avgt 25 57.975 ? 5.672 ns/op
> StringCompareToIgnoreCase.supUpperLower avgt 25 23.912 ? 0.965 ns/op
> StringCompareToIgnoreCase.upperLower avgt 25 17.744 ? 0.272 ns/op
>
> Here, "sup" means all supplementary characters, BMP otherwise. "lower"
> means each character requires upper->lower casemap. "upperLower" means
> all characters are the same, except the last character which requires
> casemap.
>
> I think the result is reasonable, considering surrogates check are now
> mandatory. I have tried Roger's suggestion to use Arrays.mismatch() but
> it did not seem to benefit here. In fact, the performance degraded
> partly because I implemented the short cut, and possibly for the
> overhead of extra checks.
>
> Naoto
>
> On 7/15/20 9:00 AM, naoto.sato at oracle.com wrote:
>> Hello,
>>
>> Please review the fix to the following issues:
>>
>> https://bugs.openjdk.java.net/browse/JDK-8248655
>> https://bugs.openjdk.java.net/browse/JDK-8248434
>>
>> The proposed changeset and its CSR are located at:
>>
>> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
>> https://bugs.openjdk.java.net/browse/JDK-8248664
>>
>> A bug was filed against SimpleDateFormat (8248434) where
>> case-insensitive date format/parse failed in some of the new locales
>> in JDK15. The root cause was that case-insensitive
>> String.regionMatches() method did not work with supplementary
>> characters. The problem is that the method's spec does not expect case
>> mappings of supplementary characters, possibly because it was
>> overlooked in the first place, JSR 204 - "Unicode Supplementary
>> Character support". Similar behavior is observed in other two
>> case-insensitive methods, i.e., compareToIgnoreCase() and
>> equalsIgnoreCase().
>>
>> The fix is straightforward to compare strings by code point basis,
>> instead of code unit (16bit "char") basis. Technically this change
>> will introduce a backward incompatibility, but I believe it is an
>> incompatibility to wrong behavior, not true to the meaning of those
>> methods' expectations.
>>
>> Naoto
More information about the i18n-dev
mailing list