RFR: 8248655: Support supplementary characters in String case insensitive operations
naoto.sato at oracle.com
naoto.sato at oracle.com
Wed Jul 22 20:43:54 UTC 2020
Thanks Roger,
Ah, I just saw your email just after I sent mine!
On 7/22/20 1:38 PM, Roger Riggs wrote:
> Hi Naoto,
>
> Looks fine. (with Joe's suggestion)
>
> On 7/22/20 4:20 PM, Joe Wang wrote:
>> Hi Naoto,
>>
>> The change looks good to me. "supLower" is indeed super slow :-)
>>
>> The only minor comment I have is that the compareCodePointCI method
>> performs toUpperCase unconditionally. That's not a problem for the
>> regular case, where a check on cp1 == cp2 (line 337) is done prior to
>> the method call. But for the sup case (starting at line 341), the
>> method is called unconditionally while in webrev.04 there was a check
>> "cp1 != cp2". One option to fix it is to include the "cp1 != cp2"
>> check in the method compareCodePointCI, then cp1 == cp2 at line 337
>> can be omitted.
> I would have added to line 353 the same cp1 == cp2 check as 337 to avoid
> the method call
> unless it was needed.
As in the previous email, cp1 != cp2 at that point, since either high or
low surrogates differ.
Naoto
>
> Roger
>
>>
>> Regards,
>> Joe
>>
>> On 7/22/20 10:23 AM, naoto.sato at oracle.com wrote:
>>> Hi,
>>>
>>> I revised the fix again, based on further suggestions:
>>>
>>> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.05/
>>>
>>> Changes from v.04 are (all in StringUTF16.java):
>>>
>>> - The short cut now does case insensitive comparison that makes the
>>> fix closer to the previous implementation (for BMP characters).
>>> - Changed the bit operation to negating for detecting needed index
>>> increment.
>>> - Method name is changed to better reflect what it is doing, with
>>> more descriptive comments.
>>>
>>> Here is the benchmark results:
>>>
>>> before:
>>> Benchmark Mode Cnt Score Error Units
>>> StringCompareToIgnoreCase.lower avgt 25 49.960 ? 1.923 ns/op
>>> StringCompareToIgnoreCase.supLower avgt 25 21.003 ? 0.354 ns/op
>>> StringCompareToIgnoreCase.supUpperLower avgt 25 30.863 ? 4.529 ns/op
>>> StringCompareToIgnoreCase.upperLower avgt 25 15.417 ? 1.046 ns/op
>>>
>>> after:
>>> Benchmark Mode Cnt Score Error Units
>>> StringCompareToIgnoreCase.lower avgt 25 46.857 ? 0.524
>>> ns/op
>>> StringCompareToIgnoreCase.supLower avgt 25 148.688 ? 6.546
>>> ns/op
>>> StringCompareToIgnoreCase.supUpperLower avgt 25 37.160 ? 0.259
>>> ns/op
>>> StringCompareToIgnoreCase.upperLower avgt 25 15.126 ? 0.338
>>> ns/op
>>>
>>> Now non-supplementary operations ("lower" and "upperLower") are on
>>> par with the "before" result (I am not quite sure why the "after"
>>> results are somewhat faster though). For supplementary test cases,
>>> "supLower" is very slow. The reason is two fold; one is because
>>> "before" one exits at the very first character (which I am addressing
>>> here) while "after" continues to compare to the last characters, the
>>> other reason is the test suffers from the change where supplementary
>>> cases double the case insensitivity checks (compared to the "after"
>>> result just below). Also "supUpperLower" gets slower for the same
>>> reason. These are expected results for supplementary comparisons (as
>>> we discussed).
>>>
>>> Naoto
>>>
>>> On 7/17/20 4:36 PM, naoto.sato at oracle.com wrote:
>>>> Hi,
>>>>
>>>> Based on the suggestions, I modified the fix as follows:
>>>>
>>>> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/
>>>>
>>>> Changes from the initial revision are:
>>>>
>>>> - Shared the implementation between compareToCI() and regionMatchesCI()
>>>> - Enabled immediate short cut if two code points match.
>>>> - Created a simple JMH benchmark. Here is the scores before and
>>>> after the change:
>>>>
>>>> before:
>>>> Benchmark Mode Cnt Score Error Units
>>>> StringCompareToIgnoreCase.lower avgt 25 53.764 ? 2.811
>>>> ns/op
>>>> StringCompareToIgnoreCase.supLower avgt 25 24.211 ? 1.135
>>>> ns/op
>>>> StringCompareToIgnoreCase.supUpperLower avgt 25 30.595 ? 1.344
>>>> ns/op
>>>> StringCompareToIgnoreCase.upperLower avgt 25 18.859 ? 1.499
>>>> ns/op
>>>>
>>>> after:
>>>> Benchmark Mode Cnt Score Error Units
>>>> StringCompareToIgnoreCase.lower avgt 25 58.354 ? 4.603
>>>> ns/op
>>>> StringCompareToIgnoreCase.supLower avgt 25 57.975 ? 5.672
>>>> ns/op
>>>> StringCompareToIgnoreCase.supUpperLower avgt 25 23.912 ? 0.965
>>>> ns/op
>>>> StringCompareToIgnoreCase.upperLower avgt 25 17.744 ? 0.272
>>>> ns/op
>>>>
>>>> Here, "sup" means all supplementary characters, BMP otherwise.
>>>> "lower" means each character requires upper->lower casemap.
>>>> "upperLower" means all characters are the same, except the last
>>>> character which requires casemap.
>>>>
>>>> I think the result is reasonable, considering surrogates check are
>>>> now mandatory. I have tried Roger's suggestion to use
>>>> Arrays.mismatch() but it did not seem to benefit here. In fact, the
>>>> performance degraded partly because I implemented the short cut, and
>>>> possibly for the overhead of extra checks.
>>>>
>>>> Naoto
>>>>
>>>> On 7/15/20 9:00 AM, naoto.sato at oracle.com wrote:
>>>>> Hello,
>>>>>
>>>>> Please review the fix to the following issues:
>>>>>
>>>>> https://bugs.openjdk.java.net/browse/JDK-8248655
>>>>> https://bugs.openjdk.java.net/browse/JDK-8248434
>>>>>
>>>>> The proposed changeset and its CSR are located at:
>>>>>
>>>>> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
>>>>> https://bugs.openjdk.java.net/browse/JDK-8248664
>>>>>
>>>>> A bug was filed against SimpleDateFormat (8248434) where
>>>>> case-insensitive date format/parse failed in some of the new
>>>>> locales in JDK15. The root cause was that case-insensitive
>>>>> String.regionMatches() method did not work with supplementary
>>>>> characters. The problem is that the method's spec does not expect
>>>>> case mappings of supplementary characters, possibly because it was
>>>>> overlooked in the first place, JSR 204 - "Unicode Supplementary
>>>>> Character support". Similar behavior is observed in other two
>>>>> case-insensitive methods, i.e., compareToIgnoreCase() and
>>>>> equalsIgnoreCase().
>>>>>
>>>>> The fix is straightforward to compare strings by code point basis,
>>>>> instead of code unit (16bit "char") basis. Technically this change
>>>>> will introduce a backward incompatibility, but I believe it is an
>>>>> incompatibility to wrong behavior, not true to the meaning of those
>>>>> methods' expectations.
>>>>>
>>>>> Naoto
>>
>
More information about the core-libs-dev
mailing list