<i18n dev> RFR: 8248655: Support supplementary characters in String case insensitive operations
naoto.sato at oracle.com
naoto.sato at oracle.com
Tue Jul 21 03:58:08 UTC 2020
Hi Joe,
On 7/20/20 7:14 PM, Joe Wang wrote:
> Hi Naoto,
>
> "Unless it showed 100% slower", wow, your tolerance is quite high :-).
> On the other hand, I do agree it's unlikely to show in specJVM (that's a
> speculation though).
I am not saying 100% slowing is permissible :-) That's an example of
definite no.
>
> The short-cut worked well. There's maybe a further optimization we could
> do to rid us of the performance concern (or the need to run additional
> performance tests). Consider the cases where strings in comparison don't
> contain any sup characters, if we make the toLower/UpperCase() block a
> method and call it before and after the surrogate-check block, the
> routine would be effectively the same as prior to the introduction of
> the surrogate-check block, and regular comparisons would suffer the
> surrogate-check only once (the last check). For strings that do contain
> sup characters then, the toLower/UpperCase() method would have been
> called twice, but then we don't care about the performance in that
> situation. You may call the existing codePointAt method too when an
> extra getChar and performance is not a concern (but that's your call.
Can you please elaborate this more? What's "the last check" here?
Naoto
>
> Regards,
> Joe
>
> On 7/20/20 3:20 PM, naoto.sato at oracle.com wrote:
>> Small correction in the updated part:
>>
>> http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.04/
>>
>> Naoto
>>
>> On 7/20/20 2:39 PM, naoto.sato at oracle.com wrote:
>>> Hi Joe,
>>>
>>> Thank you for your comments.
>>>
>>> On 7/20/20 11:20 AM, Joe Wang wrote:
>>>> Hi Naoto,
>>>>
>>>> StringUTF16: line 384 - 388 seem unnecessary since you'd only get
>>>> there if 389:isHighSurrogate is not true.
>>>
>>> Good point.
>>>
>>> But more importantly, StringUTF16
>>>> has existing method "codePointAt" you may want to consider instead
>>>> of adding a new method.
>>>
>>> If we call codePointAt/Before, it would call an extra getChar().
>>> Since we know one codepoint as an input, I would avoid the extra calls.
>>>
>>>>
>>>> Comparing to the base benchmark:
>>>> StringCompareToIgnoreCase.lower 8.5%
>>>> StringCompareToIgnoreCase.supLower 139%
>>>> StringCompareToIgnoreCase.supUpperLower -21.8%
>>>> StringCompareToIgnoreCase.upperLower avgt -5.9%
>>>>
>>>>
>>>> "lower" was 8.5% slower, if such test exists in the specJVM, it
>>>> would be considered a regression. I would suggest you run the
>>>> specJVM. I agree with you on surrogate check being a requirement,
>>>> thus supLower being 139% slower is ok since it won't otherwise be
>>>> correct anyways.
>>>
>>> Yes, it would be a regression if SPECjvm produces 8+% degradation,
>>> but the test suite is aimed at the entire application performance.
>>> But for this one, it is a micro benchmark for relatively rarely
>>> issued methods (I would think normal cases fall into Latin1
>>> equivalents), I would consider it is acceptable.
>>>
>>>> But after introducing additional operations supUpperLower and
>>>> upperLower ran faster? That may indicate irregularity in the tests.
>>>> Maybe we should consider running tests with short, long and very
>>>> long sample strings to see if we can reduce the noise level and also
>>>> see how it fares for a longer string. I assume the machine you're
>>>> running the test on was isolated.
>>>
>>> This result pretty much depends on the data it is testing for. As I
>>> wrote in the previous email, (sup)UpperLower tests use the data that
>>> are almost identical, but one last character is case insensitively
>>> equal. So in these cases, the new short cut works really well and not
>>> call toLower/UpperCase() at all for most of the characters. Thus the
>>> new results are faster. Again the test result is very dependent on
>>> the input data, Unless the result showed 100% slower or something
>>> (except supLower case), I think it is OK.
>>>
>>> Anyways, here is the updated webrev addressing your first suggestion:
>>>
>>> http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.03/
>>>
>>> Naoto
>>>
>>>>
>>>> Regards,
>>>> Joe
>>>>
>>>> On 7/19/2020 11:05 AM, naoto.sato at oracle.com wrote:
>>>>> Hi Mark,
>>>>>
>>>>> Thank you for your comments.
>>>>>
>>>>> On 7/17/20 8:03 PM, Mark Davis ☕ wrote:
>>>>>> One option is to have a fast path that uses char functions, up to
>>>>>> the point where you hit a high surrogate, then drop into the
>>>>>> slower codepoint path. That saves time for the high-runner cases.
>>>>>>
>>>>>> On the other hand, if the times are good enough, you might not
>>>>>> need the complication.
>>>>>
>>>>> The implementation is dealing with bare byte arrays. Only methods
>>>>> that it uses from Character class are toLowerCase(int) and
>>>>> toUpperCase(int) (sans surrogate check, which is needed at each
>>>>> iteration anyways), and their "char" equivalents are merely casting
>>>>> (char) to the int result. So it might not be so beneficial to
>>>>> differentiate char and int paths.
>>>>>
>>>>> Having said that, I found that there was an unnecessary surrogate
>>>>> check (always checks high *and* low surrogate on each iteration),
>>>>> so I revised the fix (added line 380-383 in StringUTF16.java):
>>>>>
>>>>> http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.02/
>>>>>
>>>>> Naoto
>>>>>
>>>>>>
>>>>>> Mark
>>>>>> //////
>>>>>>
>>>>>>
>>>>>> On Fri, Jul 17, 2020 at 4:39 PM <naoto.sato at oracle.com
>>>>>> <mailto:naoto.sato at oracle.com>> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Based on the suggestions, I modified the fix as follows:
>>>>>>
>>>>>> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/
>>>>>>
>>>>>> Changes from the initial revision are:
>>>>>>
>>>>>> - Shared the implementation between compareToCI() and
>>>>>> regionMatchesCI()
>>>>>> - Enabled immediate short cut if two code points match.
>>>>>> - Created a simple JMH benchmark. Here is the scores before
>>>>>> and after
>>>>>> the change:
>>>>>>
>>>>>> before:
>>>>>> Benchmark Mode Cnt Score
>>>>>> Error Units
>>>>>> StringCompareToIgnoreCase.lower avgt 25 53.764 ?
>>>>>> 2.811 ns/op
>>>>>> StringCompareToIgnoreCase.supLower avgt 25 24.211 ?
>>>>>> 1.135 ns/op
>>>>>> StringCompareToIgnoreCase.supUpperLower avgt 25 30.595 ?
>>>>>> 1.344 ns/op
>>>>>> StringCompareToIgnoreCase.upperLower avgt 25 18.859 ?
>>>>>> 1.499 ns/op
>>>>>>
>>>>>> after:
>>>>>> Benchmark Mode Cnt Score
>>>>>> Error Units
>>>>>> StringCompareToIgnoreCase.lower avgt 25 58.354 ?
>>>>>> 4.603 ns/op
>>>>>> StringCompareToIgnoreCase.supLower avgt 25 57.975 ?
>>>>>> 5.672 ns/op
>>>>>> StringCompareToIgnoreCase.supUpperLower avgt 25 23.912 ?
>>>>>> 0.965 ns/op
>>>>>> StringCompareToIgnoreCase.upperLower avgt 25 17.744 ?
>>>>>> 0.272 ns/op
>>>>>>
>>>>>> Here, "sup" means all supplementary characters, BMP otherwise.
>>>>>> "lower"
>>>>>> means each character requires upper->lower casemap.
>>>>>> "upperLower" means
>>>>>> all characters are the same, except the last character which
>>>>>> requires
>>>>>> casemap.
>>>>>>
>>>>>> I think the result is reasonable, considering surrogates check
>>>>>> are now
>>>>>> mandatory. I have tried Roger's suggestion to use
>>>>>> Arrays.mismatch() but
>>>>>> it did not seem to benefit here. In fact, the performance
>>>>>> degraded
>>>>>> partly because I implemented the short cut, and possibly for the
>>>>>> overhead of extra checks.
>>>>>>
>>>>>> Naoto
>>>>>>
>>>>>> On 7/15/20 9:00 AM, naoto.sato at oracle.com
>>>>>> <mailto:naoto.sato at oracle.com> wrote:
>>>>>> > Hello,
>>>>>> >
>>>>>> > Please review the fix to the following issues:
>>>>>> >
>>>>>> > https://bugs.openjdk.java.net/browse/JDK-8248655
>>>>>> > https://bugs.openjdk.java.net/browse/JDK-8248434
>>>>>> >
>>>>>> > The proposed changeset and its CSR are located at:
>>>>>> >
>>>>>> > https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
>>>>>> > https://bugs.openjdk.java.net/browse/JDK-8248664
>>>>>> >
>>>>>> > A bug was filed against SimpleDateFormat (8248434) where
>>>>>> > case-insensitive date format/parse failed in some of the new
>>>>>> locales in
>>>>>> > JDK15. The root cause was that case-insensitive
>>>>>> String.regionMatches()
>>>>>> > method did not work with supplementary characters. The
>>>>>> problem is
>>>>>> that
>>>>>> > the method's spec does not expect case mappings of
>>>>>> supplementary
>>>>>> > characters, possibly because it was overlooked in the first
>>>>>> place, JSR
>>>>>> > 204 - "Unicode Supplementary Character support". Similar
>>>>>> behavior is
>>>>>> > observed in other two case-insensitive methods, i.e.,
>>>>>> > compareToIgnoreCase() and equalsIgnoreCase().
>>>>>> >
>>>>>> > The fix is straightforward to compare strings by code point
>>>>>> basis,
>>>>>> > instead of code unit (16bit "char") basis. Technically this
>>>>>> change will
>>>>>> > introduce a backward incompatibility, but I believe it is an
>>>>>> > incompatibility to wrong behavior, not true to the meaning
>>>>>> of those
>>>>>> > methods' expectations.
>>>>>> >
>>>>>> > Naoto
>>>>>>
>>>>
>
More information about the i18n-dev
mailing list