<i18n dev> RFR: 8248655: Support supplementary characters in String case insensitive operations

Joe Wang huizhe.wang at oracle.com
Tue Jul 21 02:14:47 UTC 2020


Hi Naoto,

"Unless it showed 100% slower", wow, your tolerance is quite high :-). 
On the other hand, I do agree it's unlikely to show in specJVM (that's a 
speculation though).

The short-cut worked well. There's maybe a further optimization we could 
do to rid us of the performance concern (or the need to run additional 
performance tests). Consider the cases where strings in comparison don't 
contain any sup characters, if we make the toLower/UpperCase() block a 
method and call it before and after the surrogate-check block, the 
routine would be effectively the same as prior to the introduction of 
the surrogate-check block, and regular comparisons would suffer the 
surrogate-check only once (the last check). For strings that do contain 
sup characters then, the toLower/UpperCase() method would have been 
called twice, but then we don't care about the performance in that 
situation. You may call the existing codePointAt method too when an 
extra getChar and performance is not a concern (but that's your call.

Regards,
Joe

On 7/20/20 3:20 PM, naoto.sato at oracle.com wrote:
> Small correction in the updated part:
>
> http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.04/
>
> Naoto
>
> On 7/20/20 2:39 PM, naoto.sato at oracle.com wrote:
>> Hi Joe,
>>
>> Thank you for your comments.
>>
>> On 7/20/20 11:20 AM, Joe Wang wrote:
>>> Hi Naoto,
>>>
>>> StringUTF16: line 384 - 388 seem unnecessary since you'd only get 
>>> there if 389:isHighSurrogate is not true. 
>>
>> Good point.
>>
>> But more importantly, StringUTF16
>>> has existing method "codePointAt" you may want to consider instead 
>>> of adding a new method.
>>
>> If we call codePointAt/Before, it would call an extra getChar(). 
>> Since we know one codepoint as an input, I would avoid the extra calls.
>>
>>>
>>> Comparing to the base benchmark:
>>> StringCompareToIgnoreCase.lower          8.5%
>>> StringCompareToIgnoreCase.supLower      139%
>>> StringCompareToIgnoreCase.supUpperLower  -21.8%
>>> StringCompareToIgnoreCase.upperLower     avgt   -5.9%
>>>
>>>
>>> "lower" was 8.5% slower, if such test exists in the specJVM, it 
>>> would be considered a regression. I would suggest you run the 
>>> specJVM. I agree with you on surrogate check being a requirement, 
>>> thus supLower being 139% slower is ok since it won't otherwise be 
>>> correct anyways.
>>
>> Yes, it would be a regression if SPECjvm produces 8+% degradation, 
>> but the test suite is aimed at the entire application performance. 
>> But for this one, it is a micro benchmark for relatively rarely 
>> issued methods (I would think normal cases fall into Latin1 
>> equivalents), I would consider it is acceptable.
>>
>>> But after introducing additional operations supUpperLower and 
>>> upperLower ran faster? That may indicate irregularity in the tests. 
>>> Maybe we should consider running tests with short, long and very 
>>> long sample strings to see if we can reduce the noise level and also 
>>> see how it fares for a longer string. I assume the machine you're 
>>> running the test on was isolated.
>>
>> This result pretty much depends on the data it is testing for. As I 
>> wrote in the previous email, (sup)UpperLower tests use the data that 
>> are almost identical, but one last character is case insensitively 
>> equal. So in these cases, the new short cut works really well and not 
>> call toLower/UpperCase() at all for most of the characters. Thus the 
>> new results are faster. Again the test result is very dependent on 
>> the input data, Unless the result showed 100% slower or something 
>> (except supLower case), I think it is OK.
>>
>> Anyways, here is the updated webrev addressing your first suggestion:
>>
>> http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.03/
>>
>> Naoto
>>
>>>
>>> Regards,
>>> Joe
>>>
>>> On 7/19/2020 11:05 AM, naoto.sato at oracle.com wrote:
>>>> Hi Mark,
>>>>
>>>> Thank you for your comments.
>>>>
>>>> On 7/17/20 8:03 PM, Mark Davis ☕ wrote:
>>>>> One option is to have a fast path that uses char functions, up to 
>>>>> the point where you hit a high surrogate, then drop into the 
>>>>> slower codepoint path. That saves time for the high-runner cases.
>>>>>
>>>>> On the other hand, if the times are good enough, you might not 
>>>>> need the complication.
>>>>
>>>> The implementation is dealing with bare byte arrays. Only methods 
>>>> that it uses from Character class are toLowerCase(int) and 
>>>> toUpperCase(int) (sans surrogate check, which is needed at each 
>>>> iteration anyways), and their "char" equivalents are merely casting 
>>>> (char) to the int result. So it might not be so beneficial to 
>>>> differentiate char and int paths.
>>>>
>>>> Having said that, I found that there was an unnecessary surrogate 
>>>> check (always checks high *and* low surrogate on each iteration), 
>>>> so I revised the fix (added line 380-383 in StringUTF16.java):
>>>>
>>>> http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.02/
>>>>
>>>> Naoto
>>>>
>>>>>
>>>>> Mark
>>>>> //////
>>>>>
>>>>>
>>>>> On Fri, Jul 17, 2020 at 4:39 PM <naoto.sato at oracle.com 
>>>>> <mailto:naoto.sato at oracle.com>> wrote:
>>>>>
>>>>>     Hi,
>>>>>
>>>>>     Based on the suggestions, I modified the fix as follows:
>>>>>
>>>>> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/
>>>>>
>>>>>     Changes from the initial revision are:
>>>>>
>>>>>     - Shared the implementation between compareToCI() and 
>>>>> regionMatchesCI()
>>>>>     - Enabled immediate short cut if two code points match.
>>>>>     - Created a simple JMH benchmark. Here is the scores before 
>>>>> and after
>>>>>     the change:
>>>>>
>>>>>     before:
>>>>>     Benchmark                                Mode  Cnt  Score 
>>>>>  Error     Units
>>>>>     StringCompareToIgnoreCase.lower          avgt   25 53.764 ? 
>>>>> 2.811     ns/op
>>>>>     StringCompareToIgnoreCase.supLower       avgt   25 24.211 ? 
>>>>> 1.135     ns/op
>>>>>     StringCompareToIgnoreCase.supUpperLower  avgt   25 30.595 ? 
>>>>> 1.344     ns/op
>>>>>     StringCompareToIgnoreCase.upperLower     avgt   25 18.859 ? 
>>>>> 1.499     ns/op
>>>>>
>>>>>     after:
>>>>>     Benchmark                                Mode  Cnt  Score 
>>>>>  Error     Units
>>>>>     StringCompareToIgnoreCase.lower          avgt   25 58.354 ? 
>>>>> 4.603     ns/op
>>>>>     StringCompareToIgnoreCase.supLower       avgt   25 57.975 ? 
>>>>> 5.672     ns/op
>>>>>     StringCompareToIgnoreCase.supUpperLower  avgt   25 23.912 ? 
>>>>> 0.965     ns/op
>>>>>     StringCompareToIgnoreCase.upperLower     avgt   25 17.744 ? 
>>>>> 0.272     ns/op
>>>>>
>>>>>     Here, "sup" means all supplementary characters, BMP otherwise. 
>>>>> "lower"
>>>>>     means each character requires upper->lower casemap. 
>>>>> "upperLower" means
>>>>>     all characters are the same, except the last character which 
>>>>> requires
>>>>>     casemap.
>>>>>
>>>>>     I think the result is reasonable, considering surrogates check 
>>>>> are now
>>>>>     mandatory. I have tried Roger's suggestion to use 
>>>>> Arrays.mismatch() but
>>>>>     it did not seem to benefit here. In fact, the performance 
>>>>> degraded
>>>>>     partly because I implemented the short cut, and possibly for the
>>>>>     overhead of extra checks.
>>>>>
>>>>>     Naoto
>>>>>
>>>>>     On 7/15/20 9:00 AM, naoto.sato at oracle.com
>>>>>     <mailto:naoto.sato at oracle.com> wrote:
>>>>>      > Hello,
>>>>>      >
>>>>>      > Please review the fix to the following issues:
>>>>>      >
>>>>>      > https://bugs.openjdk.java.net/browse/JDK-8248655
>>>>>      > https://bugs.openjdk.java.net/browse/JDK-8248434
>>>>>      >
>>>>>      > The proposed changeset and its CSR are located at:
>>>>>      >
>>>>>      > https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
>>>>>      > https://bugs.openjdk.java.net/browse/JDK-8248664
>>>>>      >
>>>>>      > A bug was filed against SimpleDateFormat (8248434) where
>>>>>      > case-insensitive date format/parse failed in some of the new
>>>>>     locales in
>>>>>      > JDK15. The root cause was that case-insensitive
>>>>>     String.regionMatches()
>>>>>      > method did not work with supplementary characters. The 
>>>>> problem is
>>>>>     that
>>>>>      > the method's spec does not expect case mappings of 
>>>>> supplementary
>>>>>      > characters, possibly because it was overlooked in the first
>>>>>     place, JSR
>>>>>      > 204 - "Unicode Supplementary Character support". Similar 
>>>>> behavior is
>>>>>      > observed in other two case-insensitive methods, i.e.,
>>>>>      > compareToIgnoreCase() and equalsIgnoreCase().
>>>>>      >
>>>>>      > The fix is straightforward to compare strings by code point 
>>>>> basis,
>>>>>      > instead of code unit (16bit "char") basis. Technically this
>>>>>     change will
>>>>>      > introduce a backward incompatibility, but I believe it is an
>>>>>      > incompatibility to wrong behavior, not true to the meaning 
>>>>> of those
>>>>>      > methods' expectations.
>>>>>      >
>>>>>      > Naoto
>>>>>
>>>



More information about the i18n-dev mailing list