<i18n dev> RFR: 8248655: Support supplementary characters in String case insensitive operations
Roger Riggs
Roger.Riggs at oracle.com
Tue Jul 21 14:50:38 UTC 2020
Hi Naoto,
StringUTF16.java:
343, 348: The masking and comparisons seem awkward.
I'd suggest just negating the value and testing for < 0 to flag the less
usual case.
If you continue with the flag bit, define your own static final constant
for that bit;
It looks odd to be using Integer.MIN_VALUE with bit operators.
Performance wise, I'm a bit cautious about all the if's and branches;
And only calling getSupplementaryCodePoint if cp was a suggorate might
have some
benefit, but its hard to tell what will be inlined and when.
Thanks, Roger
On 7/20/20 6:20 PM, naoto.sato at oracle.com wrote:
> Small correction in the updated part:
>
> http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.04/
>
> Naoto
>
> On 7/20/20 2:39 PM, naoto.sato at oracle.com wrote:
>> Hi Joe,
>>
>> Thank you for your comments.
>>
>> On 7/20/20 11:20 AM, Joe Wang wrote:
>>> Hi Naoto,
>>>
>>> StringUTF16: line 384 - 388 seem unnecessary since you'd only get
>>> there if 389:isHighSurrogate is not true.
>>
>> Good point.
>>
>> But more importantly, StringUTF16
>>> has existing method "codePointAt" you may want to consider instead
>>> of adding a new method.
>>
>> If we call codePointAt/Before, it would call an extra getChar().
>> Since we know one codepoint as an input, I would avoid the extra calls.
>>
>>>
>>> Comparing to the base benchmark:
>>> StringCompareToIgnoreCase.lower 8.5%
>>> StringCompareToIgnoreCase.supLower 139%
>>> StringCompareToIgnoreCase.supUpperLower -21.8%
>>> StringCompareToIgnoreCase.upperLower avgt -5.9%
>>>
>>>
>>> "lower" was 8.5% slower, if such test exists in the specJVM, it
>>> would be considered a regression. I would suggest you run the
>>> specJVM. I agree with you on surrogate check being a requirement,
>>> thus supLower being 139% slower is ok since it won't otherwise be
>>> correct anyways.
>>
>> Yes, it would be a regression if SPECjvm produces 8+% degradation,
>> but the test suite is aimed at the entire application performance.
>> But for this one, it is a micro benchmark for relatively rarely
>> issued methods (I would think normal cases fall into Latin1
>> equivalents), I would consider it is acceptable.
>>
>>> But after introducing additional operations supUpperLower and
>>> upperLower ran faster? That may indicate irregularity in the tests.
>>> Maybe we should consider running tests with short, long and very
>>> long sample strings to see if we can reduce the noise level and also
>>> see how it fares for a longer string. I assume the machine you're
>>> running the test on was isolated.
>>
>> This result pretty much depends on the data it is testing for. As I
>> wrote in the previous email, (sup)UpperLower tests use the data that
>> are almost identical, but one last character is case insensitively
>> equal. So in these cases, the new short cut works really well and not
>> call toLower/UpperCase() at all for most of the characters. Thus the
>> new results are faster. Again the test result is very dependent on
>> the input data, Unless the result showed 100% slower or something
>> (except supLower case), I think it is OK.
>>
>> Anyways, here is the updated webrev addressing your first suggestion:
>>
>> http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.03/
>>
>> Naoto
>>
>>>
>>> Regards,
>>> Joe
>>>
>>> On 7/19/2020 11:05 AM, naoto.sato at oracle.com wrote:
>>>> Hi Mark,
>>>>
>>>> Thank you for your comments.
>>>>
>>>> On 7/17/20 8:03 PM, Mark Davis ☕ wrote:
>>>>> One option is to have a fast path that uses char functions, up to
>>>>> the point where you hit a high surrogate, then drop into the
>>>>> slower codepoint path. That saves time for the high-runner cases.
>>>>>
>>>>> On the other hand, if the times are good enough, you might not
>>>>> need the complication.
>>>>
>>>> The implementation is dealing with bare byte arrays. Only methods
>>>> that it uses from Character class are toLowerCase(int) and
>>>> toUpperCase(int) (sans surrogate check, which is needed at each
>>>> iteration anyways), and their "char" equivalents are merely casting
>>>> (char) to the int result. So it might not be so beneficial to
>>>> differentiate char and int paths.
>>>>
>>>> Having said that, I found that there was an unnecessary surrogate
>>>> check (always checks high *and* low surrogate on each iteration),
>>>> so I revised the fix (added line 380-383 in StringUTF16.java):
>>>>
>>>> http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.02/
>>>>
>>>> Naoto
>>>>
>>>>>
>>>>> Mark
>>>>> //////
>>>>>
>>>>>
>>>>> On Fri, Jul 17, 2020 at 4:39 PM <naoto.sato at oracle.com
>>>>> <mailto:naoto.sato at oracle.com>> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> Based on the suggestions, I modified the fix as follows:
>>>>>
>>>>> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/
>>>>>
>>>>> Changes from the initial revision are:
>>>>>
>>>>> - Shared the implementation between compareToCI() and
>>>>> regionMatchesCI()
>>>>> - Enabled immediate short cut if two code points match.
>>>>> - Created a simple JMH benchmark. Here is the scores before
>>>>> and after
>>>>> the change:
>>>>>
>>>>> before:
>>>>> Benchmark Mode Cnt Score
>>>>> Error Units
>>>>> StringCompareToIgnoreCase.lower avgt 25 53.764 ?
>>>>> 2.811 ns/op
>>>>> StringCompareToIgnoreCase.supLower avgt 25 24.211 ?
>>>>> 1.135 ns/op
>>>>> StringCompareToIgnoreCase.supUpperLower avgt 25 30.595 ?
>>>>> 1.344 ns/op
>>>>> StringCompareToIgnoreCase.upperLower avgt 25 18.859 ?
>>>>> 1.499 ns/op
>>>>>
>>>>> after:
>>>>> Benchmark Mode Cnt Score
>>>>> Error Units
>>>>> StringCompareToIgnoreCase.lower avgt 25 58.354 ?
>>>>> 4.603 ns/op
>>>>> StringCompareToIgnoreCase.supLower avgt 25 57.975 ?
>>>>> 5.672 ns/op
>>>>> StringCompareToIgnoreCase.supUpperLower avgt 25 23.912 ?
>>>>> 0.965 ns/op
>>>>> StringCompareToIgnoreCase.upperLower avgt 25 17.744 ?
>>>>> 0.272 ns/op
>>>>>
>>>>> Here, "sup" means all supplementary characters, BMP otherwise.
>>>>> "lower"
>>>>> means each character requires upper->lower casemap.
>>>>> "upperLower" means
>>>>> all characters are the same, except the last character which
>>>>> requires
>>>>> casemap.
>>>>>
>>>>> I think the result is reasonable, considering surrogates check
>>>>> are now
>>>>> mandatory. I have tried Roger's suggestion to use
>>>>> Arrays.mismatch() but
>>>>> it did not seem to benefit here. In fact, the performance
>>>>> degraded
>>>>> partly because I implemented the short cut, and possibly for the
>>>>> overhead of extra checks.
>>>>>
>>>>> Naoto
>>>>>
>>>>> On 7/15/20 9:00 AM, naoto.sato at oracle.com
>>>>> <mailto:naoto.sato at oracle.com> wrote:
>>>>> > Hello,
>>>>> >
>>>>> > Please review the fix to the following issues:
>>>>> >
>>>>> > https://bugs.openjdk.java.net/browse/JDK-8248655
>>>>> > https://bugs.openjdk.java.net/browse/JDK-8248434
>>>>> >
>>>>> > The proposed changeset and its CSR are located at:
>>>>> >
>>>>> > https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
>>>>> > https://bugs.openjdk.java.net/browse/JDK-8248664
>>>>> >
>>>>> > A bug was filed against SimpleDateFormat (8248434) where
>>>>> > case-insensitive date format/parse failed in some of the new
>>>>> locales in
>>>>> > JDK15. The root cause was that case-insensitive
>>>>> String.regionMatches()
>>>>> > method did not work with supplementary characters. The
>>>>> problem is
>>>>> that
>>>>> > the method's spec does not expect case mappings of
>>>>> supplementary
>>>>> > characters, possibly because it was overlooked in the first
>>>>> place, JSR
>>>>> > 204 - "Unicode Supplementary Character support". Similar
>>>>> behavior is
>>>>> > observed in other two case-insensitive methods, i.e.,
>>>>> > compareToIgnoreCase() and equalsIgnoreCase().
>>>>> >
>>>>> > The fix is straightforward to compare strings by code point
>>>>> basis,
>>>>> > instead of code unit (16bit "char") basis. Technically this
>>>>> change will
>>>>> > introduce a backward incompatibility, but I believe it is an
>>>>> > incompatibility to wrong behavior, not true to the meaning
>>>>> of those
>>>>> > methods' expectations.
>>>>> >
>>>>> > Naoto
>>>>>
>>>
More information about the core-libs-dev
mailing list