<i18n dev> RFR: 8248655: Support supplementary characters in String case insensitive operations
Brent Christian
brent.christian at oracle.com
Tue Jul 21 21:56:01 UTC 2020
Hi, Naoto
I have a few comments:
src/java.base/share/classes/java/lang/StringUTF16.java
379 private static int getSupplementaryCodePoint(byte[] ba, int cp,
int index, int start, int end)
I think it would be worth a small addition to the comment to reflect
that non-surrogate chars are returned as-is.
--
I thought about the scenario of an unpaired low or high surrogate at the
beginning or end of the string, respectively:
384 if (Character.isLowSurrogate((char)cp)) {
385 if (index > start) {
...
391 } else if (index + 1 < end) { // cp == high surrogate
392 char c = getChar(ba, index + 1);
...
397 return cp;
It looks like the cp itself would be returned from
getSupplementaryCodePoint(). And then back in compareToCIImpl(), it's
converted using Character.to[Upper|Lower]Case(int), which will also
return the cp itself. I imagine that's the best we could do, so seems fine.
Is there a test case for unmatched surrogates at the beginning and end
of the string ? Should there be ?
--
I see there are no changes to StringLatin1.regionMatchesCI_UTF16(). I
presume there are no cases in which toUpperCase(toLowerCase()) of a
supplementary character could yield a Latin-1 character, yes?
Also, thanks for adding the benchmark!
-Brent
On 7/20/20 3:20 PM, naoto.sato at oracle.com wrote:
> Small correction in the updated part:
>
> http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.04/
>
> Naoto
>
> On 7/20/20 2:39 PM, naoto.sato at oracle.com wrote:
>> Hi Joe,
>>
>> Thank you for your comments.
>>
>> On 7/20/20 11:20 AM, Joe Wang wrote:
>>> Hi Naoto,
>>>
>>> StringUTF16: line 384 - 388 seem unnecessary since you'd only get
>>> there if 389:isHighSurrogate is not true.
>>
>> Good point.
>>
>> But more importantly, StringUTF16
>>> has existing method "codePointAt" you may want to consider instead of
>>> adding a new method.
>>
>> If we call codePointAt/Before, it would call an extra getChar(). Since
>> we know one codepoint as an input, I would avoid the extra calls.
>>
>>>
>>> Comparing to the base benchmark:
>>> StringCompareToIgnoreCase.lower 8.5%
>>> StringCompareToIgnoreCase.supLower 139%
>>> StringCompareToIgnoreCase.supUpperLower -21.8%
>>> StringCompareToIgnoreCase.upperLower avgt -5.9%
>>>
>>>
>>> "lower" was 8.5% slower, if such test exists in the specJVM, it would
>>> be considered a regression. I would suggest you run the specJVM. I
>>> agree with you on surrogate check being a requirement, thus supLower
>>> being 139% slower is ok since it won't otherwise be correct anyways.
>>
>> Yes, it would be a regression if SPECjvm produces 8+% degradation, but
>> the test suite is aimed at the entire application performance. But for
>> this one, it is a micro benchmark for relatively rarely issued methods
>> (I would think normal cases fall into Latin1 equivalents), I would
>> consider it is acceptable.
>>
>>> But after introducing additional operations supUpperLower and
>>> upperLower ran faster? That may indicate irregularity in the tests.
>>> Maybe we should consider running tests with short, long and very long
>>> sample strings to see if we can reduce the noise level and also see
>>> how it fares for a longer string. I assume the machine you're running
>>> the test on was isolated.
>>
>> This result pretty much depends on the data it is testing for. As I
>> wrote in the previous email, (sup)UpperLower tests use the data that
>> are almost identical, but one last character is case insensitively
>> equal. So in these cases, the new short cut works really well and not
>> call toLower/UpperCase() at all for most of the characters. Thus the
>> new results are faster. Again the test result is very dependent on the
>> input data, Unless the result showed 100% slower or something (except
>> supLower case), I think it is OK.
>>
>> Anyways, here is the updated webrev addressing your first suggestion:
>>
>> http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.03/
>>
>> Naoto
>>
>>>
>>> Regards,
>>> Joe
>>>
>>> On 7/19/2020 11:05 AM, naoto.sato at oracle.com wrote:
>>>> Hi Mark,
>>>>
>>>> Thank you for your comments.
>>>>
>>>> On 7/17/20 8:03 PM, Mark Davis ☕ wrote:
>>>>> One option is to have a fast path that uses char functions, up to
>>>>> the point where you hit a high surrogate, then drop into the slower
>>>>> codepoint path. That saves time for the high-runner cases.
>>>>>
>>>>> On the other hand, if the times are good enough, you might not need
>>>>> the complication.
>>>>
>>>> The implementation is dealing with bare byte arrays. Only methods
>>>> that it uses from Character class are toLowerCase(int) and
>>>> toUpperCase(int) (sans surrogate check, which is needed at each
>>>> iteration anyways), and their "char" equivalents are merely casting
>>>> (char) to the int result. So it might not be so beneficial to
>>>> differentiate char and int paths.
>>>>
>>>> Having said that, I found that there was an unnecessary surrogate
>>>> check (always checks high *and* low surrogate on each iteration), so
>>>> I revised the fix (added line 380-383 in StringUTF16.java):
>>>>
>>>> http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.02/
>>>>
>>>> Naoto
>>>>
>>>>>
>>>>> Mark
>>>>> //////
>>>>>
>>>>>
>>>>> On Fri, Jul 17, 2020 at 4:39 PM <naoto.sato at oracle.com
>>>>> <mailto:naoto.sato at oracle.com>> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> Based on the suggestions, I modified the fix as follows:
>>>>>
>>>>> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/
>>>>>
>>>>> Changes from the initial revision are:
>>>>>
>>>>> - Shared the implementation between compareToCI() and
>>>>> regionMatchesCI()
>>>>> - Enabled immediate short cut if two code points match.
>>>>> - Created a simple JMH benchmark. Here is the scores before and
>>>>> after
>>>>> the change:
>>>>>
>>>>> before:
>>>>> Benchmark Mode Cnt Score
>>>>> Error Units
>>>>> StringCompareToIgnoreCase.lower avgt 25 53.764 ?
>>>>> 2.811 ns/op
>>>>> StringCompareToIgnoreCase.supLower avgt 25 24.211 ?
>>>>> 1.135 ns/op
>>>>> StringCompareToIgnoreCase.supUpperLower avgt 25 30.595 ?
>>>>> 1.344 ns/op
>>>>> StringCompareToIgnoreCase.upperLower avgt 25 18.859 ?
>>>>> 1.499 ns/op
>>>>>
>>>>> after:
>>>>> Benchmark Mode Cnt Score
>>>>> Error Units
>>>>> StringCompareToIgnoreCase.lower avgt 25 58.354 ?
>>>>> 4.603 ns/op
>>>>> StringCompareToIgnoreCase.supLower avgt 25 57.975 ?
>>>>> 5.672 ns/op
>>>>> StringCompareToIgnoreCase.supUpperLower avgt 25 23.912 ?
>>>>> 0.965 ns/op
>>>>> StringCompareToIgnoreCase.upperLower avgt 25 17.744 ?
>>>>> 0.272 ns/op
>>>>>
>>>>> Here, "sup" means all supplementary characters, BMP otherwise.
>>>>> "lower"
>>>>> means each character requires upper->lower casemap.
>>>>> "upperLower" means
>>>>> all characters are the same, except the last character which
>>>>> requires
>>>>> casemap.
>>>>>
>>>>> I think the result is reasonable, considering surrogates check
>>>>> are now
>>>>> mandatory. I have tried Roger's suggestion to use
>>>>> Arrays.mismatch() but
>>>>> it did not seem to benefit here. In fact, the performance degraded
>>>>> partly because I implemented the short cut, and possibly for the
>>>>> overhead of extra checks.
>>>>>
>>>>> Naoto
>>>>>
>>>>> On 7/15/20 9:00 AM, naoto.sato at oracle.com
>>>>> <mailto:naoto.sato at oracle.com> wrote:
>>>>> > Hello,
>>>>> >
>>>>> > Please review the fix to the following issues:
>>>>> >
>>>>> > https://bugs.openjdk.java.net/browse/JDK-8248655
>>>>> > https://bugs.openjdk.java.net/browse/JDK-8248434
>>>>> >
>>>>> > The proposed changeset and its CSR are located at:
>>>>> >
>>>>> > https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
>>>>> > https://bugs.openjdk.java.net/browse/JDK-8248664
>>>>> >
>>>>> > A bug was filed against SimpleDateFormat (8248434) where
>>>>> > case-insensitive date format/parse failed in some of the new
>>>>> locales in
>>>>> > JDK15. The root cause was that case-insensitive
>>>>> String.regionMatches()
>>>>> > method did not work with supplementary characters. The
>>>>> problem is
>>>>> that
>>>>> > the method's spec does not expect case mappings of
>>>>> supplementary
>>>>> > characters, possibly because it was overlooked in the first
>>>>> place, JSR
>>>>> > 204 - "Unicode Supplementary Character support". Similar
>>>>> behavior is
>>>>> > observed in other two case-insensitive methods, i.e.,
>>>>> > compareToIgnoreCase() and equalsIgnoreCase().
>>>>> >
>>>>> > The fix is straightforward to compare strings by code point
>>>>> basis,
>>>>> > instead of code unit (16bit "char") basis. Technically this
>>>>> change will
>>>>> > introduce a backward incompatibility, but I believe it is an
>>>>> > incompatibility to wrong behavior, not true to the meaning
>>>>> of those
>>>>> > methods' expectations.
>>>>> >
>>>>> > Naoto
>>>>>
>>>
More information about the core-libs-dev
mailing list