<i18n dev> RFR: 8248655: Support supplementary characters in String case insensitive operations
naoto.sato at oracle.com
naoto.sato at oracle.com
Mon Jul 20 22:20:18 UTC 2020
Small correction in the updated part:
http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.04/
Naoto
On 7/20/20 2:39 PM, naoto.sato at oracle.com wrote:
> Hi Joe,
>
> Thank you for your comments.
>
> On 7/20/20 11:20 AM, Joe Wang wrote:
>> Hi Naoto,
>>
>> StringUTF16: line 384 - 388 seem unnecessary since you'd only get
>> there if 389:isHighSurrogate is not true.
>
> Good point.
>
> But more importantly, StringUTF16
>> has existing method "codePointAt" you may want to consider instead of
>> adding a new method.
>
> If we call codePointAt/Before, it would call an extra getChar(). Since
> we know one codepoint as an input, I would avoid the extra calls.
>
>>
>> Comparing to the base benchmark:
>> StringCompareToIgnoreCase.lower 8.5%
>> StringCompareToIgnoreCase.supLower 139%
>> StringCompareToIgnoreCase.supUpperLower -21.8%
>> StringCompareToIgnoreCase.upperLower avgt -5.9%
>>
>>
>> "lower" was 8.5% slower, if such test exists in the specJVM, it would
>> be considered a regression. I would suggest you run the specJVM. I
>> agree with you on surrogate check being a requirement, thus supLower
>> being 139% slower is ok since it won't otherwise be correct anyways.
>
> Yes, it would be a regression if SPECjvm produces 8+% degradation, but
> the test suite is aimed at the entire application performance. But for
> this one, it is a micro benchmark for relatively rarely issued methods
> (I would think normal cases fall into Latin1 equivalents), I would
> consider it is acceptable.
>
>> But after introducing additional operations supUpperLower and
>> upperLower ran faster? That may indicate irregularity in the tests.
>> Maybe we should consider running tests with short, long and very long
>> sample strings to see if we can reduce the noise level and also see
>> how it fares for a longer string. I assume the machine you're running
>> the test on was isolated.
>
> This result pretty much depends on the data it is testing for. As I
> wrote in the previous email, (sup)UpperLower tests use the data that are
> almost identical, but one last character is case insensitively equal. So
> in these cases, the new short cut works really well and not call
> toLower/UpperCase() at all for most of the characters. Thus the new
> results are faster. Again the test result is very dependent on the input
> data, Unless the result showed 100% slower or something (except supLower
> case), I think it is OK.
>
> Anyways, here is the updated webrev addressing your first suggestion:
>
> http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.03/
>
> Naoto
>
>>
>> Regards,
>> Joe
>>
>> On 7/19/2020 11:05 AM, naoto.sato at oracle.com wrote:
>>> Hi Mark,
>>>
>>> Thank you for your comments.
>>>
>>> On 7/17/20 8:03 PM, Mark Davis ☕ wrote:
>>>> One option is to have a fast path that uses char functions, up to
>>>> the point where you hit a high surrogate, then drop into the slower
>>>> codepoint path. That saves time for the high-runner cases.
>>>>
>>>> On the other hand, if the times are good enough, you might not need
>>>> the complication.
>>>
>>> The implementation is dealing with bare byte arrays. Only methods
>>> that it uses from Character class are toLowerCase(int) and
>>> toUpperCase(int) (sans surrogate check, which is needed at each
>>> iteration anyways), and their "char" equivalents are merely casting
>>> (char) to the int result. So it might not be so beneficial to
>>> differentiate char and int paths.
>>>
>>> Having said that, I found that there was an unnecessary surrogate
>>> check (always checks high *and* low surrogate on each iteration), so
>>> I revised the fix (added line 380-383 in StringUTF16.java):
>>>
>>> http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.02/
>>>
>>> Naoto
>>>
>>>>
>>>> Mark
>>>> //////
>>>>
>>>>
>>>> On Fri, Jul 17, 2020 at 4:39 PM <naoto.sato at oracle.com
>>>> <mailto:naoto.sato at oracle.com>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Based on the suggestions, I modified the fix as follows:
>>>>
>>>> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/
>>>>
>>>> Changes from the initial revision are:
>>>>
>>>> - Shared the implementation between compareToCI() and
>>>> regionMatchesCI()
>>>> - Enabled immediate short cut if two code points match.
>>>> - Created a simple JMH benchmark. Here is the scores before and
>>>> after
>>>> the change:
>>>>
>>>> before:
>>>> Benchmark Mode Cnt Score
>>>> Error Units
>>>> StringCompareToIgnoreCase.lower avgt 25 53.764 ?
>>>> 2.811 ns/op
>>>> StringCompareToIgnoreCase.supLower avgt 25 24.211 ?
>>>> 1.135 ns/op
>>>> StringCompareToIgnoreCase.supUpperLower avgt 25 30.595 ?
>>>> 1.344 ns/op
>>>> StringCompareToIgnoreCase.upperLower avgt 25 18.859 ?
>>>> 1.499 ns/op
>>>>
>>>> after:
>>>> Benchmark Mode Cnt Score
>>>> Error Units
>>>> StringCompareToIgnoreCase.lower avgt 25 58.354 ?
>>>> 4.603 ns/op
>>>> StringCompareToIgnoreCase.supLower avgt 25 57.975 ?
>>>> 5.672 ns/op
>>>> StringCompareToIgnoreCase.supUpperLower avgt 25 23.912 ?
>>>> 0.965 ns/op
>>>> StringCompareToIgnoreCase.upperLower avgt 25 17.744 ?
>>>> 0.272 ns/op
>>>>
>>>> Here, "sup" means all supplementary characters, BMP otherwise.
>>>> "lower"
>>>> means each character requires upper->lower casemap. "upperLower"
>>>> means
>>>> all characters are the same, except the last character which
>>>> requires
>>>> casemap.
>>>>
>>>> I think the result is reasonable, considering surrogates check
>>>> are now
>>>> mandatory. I have tried Roger's suggestion to use
>>>> Arrays.mismatch() but
>>>> it did not seem to benefit here. In fact, the performance degraded
>>>> partly because I implemented the short cut, and possibly for the
>>>> overhead of extra checks.
>>>>
>>>> Naoto
>>>>
>>>> On 7/15/20 9:00 AM, naoto.sato at oracle.com
>>>> <mailto:naoto.sato at oracle.com> wrote:
>>>> > Hello,
>>>> >
>>>> > Please review the fix to the following issues:
>>>> >
>>>> > https://bugs.openjdk.java.net/browse/JDK-8248655
>>>> > https://bugs.openjdk.java.net/browse/JDK-8248434
>>>> >
>>>> > The proposed changeset and its CSR are located at:
>>>> >
>>>> > https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
>>>> > https://bugs.openjdk.java.net/browse/JDK-8248664
>>>> >
>>>> > A bug was filed against SimpleDateFormat (8248434) where
>>>> > case-insensitive date format/parse failed in some of the new
>>>> locales in
>>>> > JDK15. The root cause was that case-insensitive
>>>> String.regionMatches()
>>>> > method did not work with supplementary characters. The
>>>> problem is
>>>> that
>>>> > the method's spec does not expect case mappings of supplementary
>>>> > characters, possibly because it was overlooked in the first
>>>> place, JSR
>>>> > 204 - "Unicode Supplementary Character support". Similar
>>>> behavior is
>>>> > observed in other two case-insensitive methods, i.e.,
>>>> > compareToIgnoreCase() and equalsIgnoreCase().
>>>> >
>>>> > The fix is straightforward to compare strings by code point
>>>> basis,
>>>> > instead of code unit (16bit "char") basis. Technically this
>>>> change will
>>>> > introduce a backward incompatibility, but I believe it is an
>>>> > incompatibility to wrong behavior, not true to the meaning of
>>>> those
>>>> > methods' expectations.
>>>> >
>>>> > Naoto
>>>>
>>
More information about the core-libs-dev
mailing list