From naoto.sato at oracle.com Tue Jul 7 22:55:32 2020 From: naoto.sato at oracle.com (naoto.sato at oracle.com) Date: Tue, 7 Jul 2020 15:55:32 -0700 Subject: RFR: 8248695: HostLocaleProviderAdapterImpl provides invalid date-only Message-ID: <04830fb4-9a7b-aded-099a-d635f5754857@oracle.com> Hello, Please review the fix to the following issue: https://bugs.openjdk.java.net/browse/JDK-8248695 The proposed changeset is located at: http://cr.openjdk.java.net/~naoto/8248695/webrev.00/ There were two causes that resulted in throwing exceptions. One was that the Host adapter for Windows always produced Date and Time combined patterns, so formatting a LocalDate ended up with unsupported temporal field for HourOfDay (reported in the bug), and the other cause was the pattern for am/pm was "aa", which was not valid as a DateTimeFormatter pattern. Besides these issues, localized DayOfWeek/AM_PM names have not been correctly implemented in the host adapter. Now those names are correctly returned from Windows. Naoto From naoto.sato at oracle.com Mon Jul 13 12:54:22 2020 From: naoto.sato at oracle.com (naoto.sato at oracle.com) Date: Mon, 13 Jul 2020 05:54:22 -0700 Subject: RFR: 8248695: HostLocaleProviderAdapterImpl provides invalid date-only In-Reply-To: <04830fb4-9a7b-aded-099a-d635f5754857@oracle.com> References: <04830fb4-9a7b-aded-099a-d635f5754857@oracle.com> Message-ID: Ping. On 7/7/20 3:55 PM, naoto.sato at oracle.com wrote: > Hello, > > Please review the fix to the following issue: > > https://bugs.openjdk.java.net/browse/JDK-8248695 > > The proposed changeset is located at: > > http://cr.openjdk.java.net/~naoto/8248695/webrev.00/ > > There were two causes that resulted in throwing exceptions. 
One was that > the Host adapter for Windows always produced Date and Time combined > patterns, so formatting a LocalDate ended up with unsupported temporal > field for HourOfDay (reported in the bug), and the other cause was the > pattern for am/pm was "aa", which was not valid as a DateTimeFormatter > pattern. > > Besides these issues, localized DayOfWeek/AM_PM names have not been > correctly implemented in the host adapter. Now those names are correctly > returned from Windows. > > Naoto From huizhe.wang at oracle.com Mon Jul 13 22:55:42 2020 From: huizhe.wang at oracle.com (Joe Wang) Date: Mon, 13 Jul 2020 15:55:42 -0700 Subject: RFR: 8248695: HostLocaleProviderAdapterImpl provides invalid date-only In-Reply-To: References: <04830fb4-9a7b-aded-099a-d635f5754857@oracle.com> Message-ID: <1bb42b4b-d2e5-36a7-45c7-37aedbb0e751@oracle.com> Hi Naoto, Would it make sense to provide an additional test using the public APIs similar to the one provided in the bug report? I'm sure yours is correct and covers more cases than the original, but it would be nice to have an actual use case and use the public APIs. The report showed it was failed somewhere down the stream than when it is run against the current build, which produces IAE "Too many pattern letters: a" instead of what's reported. HostLocaleProviderAdapter_md:849 - 865: may be compacted into one if statement, (bCal && getCalendarInfoWrapper(...) || getLocaleInfoWrapper(...)). Regards, Joe On 7/13/2020 5:54 AM, naoto.sato at oracle.com wrote: > Ping. > > On 7/7/20 3:55 PM, naoto.sato at oracle.com wrote: >> Hello, >> >> Please review the fix to the following issue: >> >> https://bugs.openjdk.java.net/browse/JDK-8248695 >> >> The proposed changeset is located at: >> >> http://cr.openjdk.java.net/~naoto/8248695/webrev.00/ >> >> There were two causes that resulted in throwing exceptions. 
One was >> that the Host adapter for Windows always produced Date and Time >> combined patterns, so formatting a LocalDate ended up with >> unsupported temporal field for HourOfDay (reported in the bug), and >> the other cause was the pattern for am/pm was "aa", which was not >> valid as a DateTimeFormatter pattern. >> >> Besides these issues, localized DayOfWeek/AM_PM names have not been >> correctly implemented in the host adapter. Now those names are >> correctly returned from Windows. >> >> Naoto From naoto.sato at oracle.com Tue Jul 14 02:01:06 2020 From: naoto.sato at oracle.com (naoto.sato at oracle.com) Date: Mon, 13 Jul 2020 19:01:06 -0700 Subject: RFR: 8248695: HostLocaleProviderAdapterImpl provides invalid date-only In-Reply-To: <1bb42b4b-d2e5-36a7-45c7-37aedbb0e751@oracle.com> References: <04830fb4-9a7b-aded-099a-d635f5754857@oracle.com> <1bb42b4b-d2e5-36a7-45c7-37aedbb0e751@oracle.com> Message-ID: Hi Joe, Thank you for your review. On 7/13/20 3:55 PM, Joe Wang wrote: > Hi Naoto, > > Would it make sense to provide an additional test using the public APIs > similar to the one provided in the bug report? I'm sure yours is correct > and covers more cases than the original, but it would be nice to have an > actual use case and use the public APIs. The report showed it was failed > somewhere down the stream than when it is run against the current build, > which produces IAE "Too many pattern letters: a" instead of what's > reported. I am not quite sure I got your suggestion, but the test case uses the same API as in the bug report (LocaleProviders.java, line 421-428) that are supposed to catch two cases, i.e., ofLocalizedDate()/ofLocalizedTime() tests catch "unsupported temporal field" exception, such as HourOfDay, and ofLocalizedDateTime() catches "Too many pattern letters: aa" which you saw in (possibly) en_US locale. To make it clearer, I added some comments to it. 
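The two failure modes discussed here can be reproduced with public java.time APIs alone, without involving the HOST adapter. The following is a minimal sketch, not the test in the webrev; the class and method names are made up for illustration, and the patterns are hand-written stand-ins for what the Windows host adapter was reported to produce.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.temporal.UnsupportedTemporalTypeException;

// Hypothetical demo class; not part of the JDK or of the webrev.
final class PatternFailureDemo {

    // True if formatting a LocalDate with this pattern fails because the
    // pattern asks for a field (e.g. HourOfDay) that a LocalDate lacks.
    static boolean failsOnLocalDate(String pattern) {
        try {
            DateTimeFormatter.ofPattern(pattern).format(LocalDate.of(2020, 7, 7));
            return false;
        } catch (UnsupportedTemporalTypeException e) {
            return true;
        }
    }

    // True if DateTimeFormatter rejects the pattern itself.
    static boolean isRejected(String pattern) {
        try {
            DateTimeFormatter.ofPattern(pattern);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // A combined date and time pattern applied to a date-only value.
        System.out.println(failsOnLocalDate("yyyy-MM-dd HH:mm")); // true
        System.out.println(failsOnLocalDate("yyyy-MM-dd"));       // false
        // "aa" is not a valid am/pm pattern for DateTimeFormatter; "a" is.
        System.out.println(isRejected("h:mm aa"));                // true
        System.out.println(isRejected("h:mm a"));                 // false
    }
}
```

The first pair corresponds to the "unsupported temporal field" case, the second pair to the "Too many pattern letters" case.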
> > HostLocaleProviderAdapter_md:849 - 865: may be compacted into one if > statement, (bCal && getCalendarInfoWrapper(...) || > getLocaleInfoWrapper(...)). Quite right. Modified. The updated webrev is located at: https://cr.openjdk.java.net/~naoto/8248695/webrev.01/ Naoto > > Regards, > Joe > > On 7/13/2020 5:54 AM, naoto.sato at oracle.com wrote: >> Ping. >> >> On 7/7/20 3:55 PM, naoto.sato at oracle.com wrote: >>> Hello, >>> >>> Please review the fix to the following issue: >>> >>> https://bugs.openjdk.java.net/browse/JDK-8248695 >>> >>> The proposed changeset is located at: >>> >>> http://cr.openjdk.java.net/~naoto/8248695/webrev.00/ >>> >>> There were two causes that resulted in throwing exceptions. One was >>> that the Host adapter for Windows always produced Date and Time >>> combined patterns, so formatting a LocalDate ended up with >>> unsupported temporal field for HourOfDay (reported in the bug), and >>> the other cause was the pattern for am/pm was "aa", which was not >>> valid as a DateTimeFormatter pattern. >>> >>> Besides these issues, localized DayOfWeek/AM_PM names have not been >>> correctly implemented in the host adapter. Now those names are >>> correctly returned from Windows. >>> >>> Naoto > From huizhe.wang at oracle.com Tue Jul 14 02:28:11 2020 From: huizhe.wang at oracle.com (Joe Wang) Date: Mon, 13 Jul 2020 19:28:11 -0700 Subject: RFR: 8248695: HostLocaleProviderAdapterImpl provides invalid date-only In-Reply-To: References: <04830fb4-9a7b-aded-099a-d635f5754857@oracle.com> <1bb42b4b-d2e5-36a7-45c7-37aedbb0e751@oracle.com> Message-ID: <2fabb235-aff1-eb5f-c919-2637a70ee5dd@oracle.com> On 7/13/2020 7:01 PM, naoto.sato at oracle.com wrote: > Hi Joe, > > Thank you for your review. > > On 7/13/20 3:55 PM, Joe Wang wrote: >> Hi Naoto, >> >> Would it make sense to provide an additional test using the public >> APIs similar to the one provided in the bug report? 
I'm sure yours is >> correct and covers more cases than the original, but it would be nice >> to have an actual use case and use the public APIs. The report showed >> it was failed somewhere down the stream than when it is run against >> the current build, which produces IAE "Too many pattern letters: a" >> instead of what's reported. > > I am not quite sure I got your suggestion, but the test case uses the > same API as in the bug report (LocaleProviders.java, line 421-428) > that are supposed to catch two cases, i.e., > ofLocalizedDate()/ofLocalizedTime() tests catch "unsupported temporal > field" exception, such as HourOfDay, and ofLocalizedDateTime() catches > "Too many pattern letters: aa" which you saw in (possibly) en_US > locale. To make it clearer, I added some comments to it. Yes, the test covered the cases. What I was suggesting was an additional test that uses only public APIs similar to that in the bug report, that is what users would do and how they may encounter this issue, e.g.
    System.setProperty("java.locale.providers", "HOST");
    DateTimeFormatter formatter = DateTimeFormatter.ofLocalizedDate(FormatStyle.FULL);
It's unlikely for a user application to refer to a private package/class like sun.util.locale.provider.LocaleProviderAdapter. I understand it's been that way for these tests. But to me, a real world use case is always nice to have. Your call whether you want to add it or not. There's no missing coverage. > >> >> HostLocaleProviderAdapter_md:849 - 865: may be compacted into one if >> statement, (bCal && getCalendarInfoWrapper(...) || >> getLocaleInfoWrapper(...)). > > Quite right. Modified. > > The updated webrev is located at: > > https://cr.openjdk.java.net/~naoto/8248695/webrev.01/ Looks good to me. Best, Joe > > Naoto > >> >> Regards, >> Joe >> >> On 7/13/2020 5:54 AM, naoto.sato at oracle.com wrote: >>> Ping. 
>>> >>> On 7/7/20 3:55 PM, naoto.sato at oracle.com wrote: >>>> Hello, >>>> >>>> Please review the fix to the following issue: >>>> >>>> https://bugs.openjdk.java.net/browse/JDK-8248695 >>>> >>>> The proposed changeset is located at: >>>> >>>> http://cr.openjdk.java.net/~naoto/8248695/webrev.00/ >>>> >>>> There were two causes that resulted in throwing exceptions. One was >>>> that the Host adapter for Windows always produced Date and Time >>>> combined patterns, so formatting a LocalDate ended up with >>>> unsupported temporal field for HourOfDay (reported in the bug), and >>>> the other cause was the pattern for am/pm was "aa", which was not >>>> valid as a DateTimeFormatter pattern. >>>> >>>> Besides these issues, localized DayOfWeek/AM_PM names have not been >>>> correctly implemented in the host adapter. Now those names are >>>> correctly returned from Windows. >>>> >>>> Naoto >> From naoto.sato at oracle.com Tue Jul 14 04:04:15 2020 From: naoto.sato at oracle.com (naoto.sato at oracle.com) Date: Mon, 13 Jul 2020 21:04:15 -0700 Subject: RFR: 8248695: HostLocaleProviderAdapterImpl provides invalid date-only In-Reply-To: <2fabb235-aff1-eb5f-c919-2637a70ee5dd@oracle.com> References: <04830fb4-9a7b-aded-099a-d635f5754857@oracle.com> <1bb42b4b-d2e5-36a7-45c7-37aedbb0e751@oracle.com> <2fabb235-aff1-eb5f-c919-2637a70ee5dd@oracle.com> Message-ID: Hi Joe, On 7/13/20 7:28 PM, Joe Wang wrote: > > > On 7/13/2020 7:01 PM, naoto.sato at oracle.com wrote: >> Hi Joe, >> >> Thank you for your review. >> >> On 7/13/20 3:55 PM, Joe Wang wrote: >>> Hi Naoto, >>> >>> Would it make sense to provide an additional test using the public >>> APIs similar to the one provided in the bug report? I'm sure yours is >>> correct and covers more cases than the original, but it would be nice >>> to have an actual use case and use the public APIs. 
The report showed >>> it was failed somewhere down the stream than when it is run against >>> the current build, which produces IAE "Too many pattern letters: a" >>> instead of what's reported. >> >> I am not quite sure I got your suggestion, but the test case uses the >> same API as in the bug report (LocaleProviders.java, line 421-428) >> that are supposed to catch two cases, i.e., >> ofLocalizedDate()/ofLocalizedTime() tests catch "unsupported temporal >> field" exception, such as HourOfDay, and ofLocalizedDateTime() catches >> "Too many pattern letters: aa" which you saw in (possibly) en_US >> locale. To make it clearer, I added some comments to it. > > Yes, the test covered the cases. What I was suggesting was an additional > test that uses only public APIs similar to that in the bug report, that > is what users would do and how they may encounter this issue, e.g. >     System.setProperty("java.locale.providers", "HOST"); >     DateTimeFormatter formatter = > DateTimeFormatter.ofLocalizedDate(FormatStyle.FULL); > > It's unlikely for a user application to refer to a private > package/class like sun.util.locale.provider.LocaleProviderAdapter. > > I understand it's been that way for these tests. But to me, a real world > use case is always nice to have. Your call whether you want to add it or > not. There's no missing coverage. Now I see what you mean. The approach used in the bug report is in fact discouraged: once the property is read at startup, it won't be read again after a later setProperty() call. Apps are supposed to specify the property as a -D launcher parameter. (And the test case does so, at line 184 of LocaleProvidersRun.java.) The internal interface is used only to tell whether the HOST provider is actually in use, so apps should not depend on it. > >> >>> >>> HostLocaleProviderAdapter_md:849 - 865: may be compacted into one if >>> statement, (bCal && getCalendarInfoWrapper(...) || >>> getLocaleInfoWrapper(...)). >> >> Quite right. Modified. 
>> >> The updated webrev is located at: >> >> https://cr.openjdk.java.net/~naoto/8248695/webrev.01/ > > Looks good to me. Thanks! Naoto > > Best, > Joe > >> >> Naoto >> >>> >>> Regards, >>> Joe >>> >>> On 7/13/2020 5:54 AM, naoto.sato at oracle.com wrote: >>>> Ping. >>>> >>>> On 7/7/20 3:55 PM, naoto.sato at oracle.com wrote: >>>>> Hello, >>>>> >>>>> Please review the fix to the following issue: >>>>> >>>>> https://bugs.openjdk.java.net/browse/JDK-8248695 >>>>> >>>>> The proposed changeset is located at: >>>>> >>>>> http://cr.openjdk.java.net/~naoto/8248695/webrev.00/ >>>>> >>>>> There were two causes that resulted in throwing exceptions. One was >>>>> that the Host adapter for Windows always produced Date and Time >>>>> combined patterns, so formatting a LocalDate ended up with >>>>> unsupported temporal field for HourOfDay (reported in the bug), and >>>>> the other cause was the pattern for am/pm was "aa", which was not >>>>> valid as a DateTimeFormatter pattern. >>>>> >>>>> Besides these issues, localized DayOfWeek/AM_PM names have not been >>>>> correctly implemented in the host adapter. Now those names are >>>>> correctly returned from Windows. >>>>> >>>>> Naoto >>> > From huizhe.wang at oracle.com Tue Jul 14 05:20:31 2020 From: huizhe.wang at oracle.com (Joe Wang) Date: Mon, 13 Jul 2020 22:20:31 -0700 Subject: RFR: 8248695: HostLocaleProviderAdapterImpl provides invalid date-only In-Reply-To: References: <04830fb4-9a7b-aded-099a-d635f5754857@oracle.com> <1bb42b4b-d2e5-36a7-45c7-37aedbb0e751@oracle.com> <2fabb235-aff1-eb5f-c919-2637a70ee5dd@oracle.com> Message-ID: On 7/13/2020 9:04 PM, naoto.sato at oracle.com wrote: > Hi Joe, > > On 7/13/20 7:28 PM, Joe Wang wrote: >> >> >> On 7/13/2020 7:01 PM, naoto.sato at oracle.com wrote: >>> Hi Joe, >>> >>> Thank you for your review. 
>>> >>> On 7/13/20 3:55 PM, Joe Wang wrote: >>>> Hi Naoto, >>>> >>>> Would it make sense to provide an additional test using the public >>>> APIs similar to the one provided in the bug report? I'm sure yours >>>> is correct and covers more cases than the original, but it would be >>>> nice to have an actual use case and use the public APIs. The report >>>> showed it was failed somewhere down the stream than when it is run >>>> against the current build, which produces IAE "Too many pattern >>>> letters: a" instead of what's reported. >>> >>> I am not quite sure I got your suggestion, but the test case uses >>> the same API as in the bug report (LocaleProviders.java, line >>> 421-428) that are supposed to catch two cases, i.e., >>> ofLocalizedDate()/ofLocalizedTime() tests catch "unsupported >>> temporal field" exception, such as HourOfDay, and >>> ofLocalizedDateTime() catches "Too many pattern letters: aa" which >>> you saw in (possibly) en_US locale. To make it clearer, I added some >>> comments to it. >> >> Yes, the test covered the cases. What I was suggesting was an >> additional test that uses only public APIs similar to that in the bug >> report, that is what users would do and how they may encounter this >> issue, e.g. >>     System.setProperty("java.locale.providers", "HOST"); >>     DateTimeFormatter formatter = >> DateTimeFormatter.ofLocalizedDate(FormatStyle.FULL); >> >> It's unlikely for a user application to refer to a private >> package/class like sun.util.locale.provider.LocaleProviderAdapter. >> >> I understand it's been that way for these tests. But to me, a real >> world use case is always nice to have. Your call whether you want to >> add it or not. There's no missing coverage. > > Now I got what you are talking about. The way the bug report is doing > is in fact discouraged, as once it is read at the startup, it won't be > read again with later setProperty() call. Apps are supposed to specify > the property as -D launcher parameter. 
(And the test case is doing it, > at line 184 of LocaleProvidersRun.java) Internal interface is used > only to tell whether HOST provider is actually used or not, so apps > should not depend on it. I see. Thanks for the clarification. Also, nice comments to the test, helpful for understanding the details if ever it's read again. Best, Joe > >> >>> >>>> >>>> HostLocaleProviderAdapter_md:849 - 865: may be compacted into one >>>> if statement, (bCal && getCalendarInfoWrapper(...) || >>>> getLocaleInfoWrapper(...)). >>> >>> Quite right. Modified. >>> >>> The updated webrev is located at: >>> >>> https://cr.openjdk.java.net/~naoto/8248695/webrev.01/ >> >> Looks good to me. > > Thanks! > > Naoto > >> >> Best, >> Joe >> >>> >>> Naoto >>> >>>> >>>> Regards, >>>> Joe >>>> >>>> On 7/13/2020 5:54 AM, naoto.sato at oracle.com wrote: >>>>> Ping. >>>>> >>>>> On 7/7/20 3:55 PM, naoto.sato at oracle.com wrote: >>>>>> Hello, >>>>>> >>>>>> Please review the fix to the following issue: >>>>>> >>>>>> https://bugs.openjdk.java.net/browse/JDK-8248695 >>>>>> >>>>>> The proposed changeset is located at: >>>>>> >>>>>> http://cr.openjdk.java.net/~naoto/8248695/webrev.00/ >>>>>> >>>>>> There were two causes that resulted in throwing exceptions. One >>>>>> was that the Host adapter for Windows always produced Date and >>>>>> Time combined patterns, so formatting a LocalDate ended up with >>>>>> unsupported temporal field for HourOfDay (reported in the bug), >>>>>> and the other cause was the pattern for am/pm was "aa", which was >>>>>> not valid as a DateTimeFormatter pattern. >>>>>> >>>>>> Besides these issues, localized DayOfWeek/AM_PM names have not >>>>>> been correctly implemented in the host adapter. Now those names >>>>>> are correctly returned from Windows. 
>>>>>> >>>>>> Naoto >>>> >> From naoto.sato at oracle.com Wed Jul 15 16:00:45 2020 From: naoto.sato at oracle.com (naoto.sato at oracle.com) Date: Wed, 15 Jul 2020 09:00:45 -0700 Subject: RFR: 8248655: Support supplementary characters in String case insensitive operations Message-ID: Hello, Please review the fix to the following issues: https://bugs.openjdk.java.net/browse/JDK-8248655 https://bugs.openjdk.java.net/browse/JDK-8248434 The proposed changeset and its CSR are located at: https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/ https://bugs.openjdk.java.net/browse/JDK-8248664 A bug was filed against SimpleDateFormat (8248434) where case-insensitive date format/parse failed in some of the new locales in JDK15. The root cause was that case-insensitive String.regionMatches() method did not work with supplementary characters. The problem is that the method's spec does not expect case mappings of supplementary characters, possibly because it was overlooked in the first place, JSR 204 - "Unicode Supplementary Character support". Similar behavior is observed in other two case-insensitive methods, i.e., compareToIgnoreCase() and equalsIgnoreCase(). The fix is straightforward to compare strings by code point basis, instead of code unit (16bit "char") basis. Technically this change will introduce a backward incompatibility, but I believe it is an incompatibility to wrong behavior, not true to the meaning of those methods' expectations. Naoto From naoto.sato at oracle.com Wed Jul 15 16:39:26 2020 From: naoto.sato at oracle.com (naoto.sato at oracle.com) Date: Wed, 15 Jul 2020 09:39:26 -0700 Subject: RFR: 8248655: Support supplementary characters in String case insensitive operations In-Reply-To: References:

Message-ID: <2cfcaf44-322b-eb9c-597f-2a5745230e5b@oracle.com> Thank you, Jim, for the quick review! On 7/15/20 9:26 AM, Jim Laskey wrote: > I think I'm good with this. +1 > > Asides: > > 325 int cp1 = (int)getChar(value, k1); > 326 int cp2 = (int)getChar(other, k2); > > I would be tempted to short cut by exiting when not equal, but I think we agreed we need to allow for upper/lowers on different planes. > > In the UTF-16 code I was trying to think of how your could exhaust the first string and not the second, and still have their lengths the same. I think I have convinced myself that it's not possible as long as surrogates always map upper/lowers to surrogates (two chars each.) Right. All code points as of JDK15/6 is in the same plane, thus the lengths won't change. I was trying to create a test case for that hypothetical situation, but gave up because each character case map is embedded in Unicode Character Database, which cannot be modified. Naoto > > Cheers, > > -- Jim > > > > > >> On Jul 15, 2020, at 1:00 PM, naoto.sato at oracle.com wrote: >> >> Hello, >> >> Please review the fix to the following issues: >> >> https://bugs.openjdk.java.net/browse/JDK-8248655 >> https://bugs.openjdk.java.net/browse/JDK-8248434 >> >> The proposed changeset and its CSR are located at: >> >> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/ >> https://bugs.openjdk.java.net/browse/JDK-8248664 >> >> A bug was filed against SimpleDateFormat (8248434) where case-insensitive date format/parse failed in some of the new locales in JDK15. The root cause was that case-insensitive String.regionMatches() method did not work with supplementary characters. The problem is that the method's spec does not expect case mappings of supplementary characters, possibly because it was overlooked in the first place, JSR 204 - "Unicode Supplementary Character support". Similar behavior is observed in other two case-insensitive methods, i.e., compareToIgnoreCase() and equalsIgnoreCase(). 
>> >> The fix is straightforward to compare strings by code point basis, instead of code unit (16bit "char") basis. Technically this change will introduce a backward incompatibility, but I believe it is an incompatibility to wrong behavior, not true to the meaning of those methods' expectations. >> >> Naoto > From huizhe.wang at oracle.com Wed Jul 15 17:57:17 2020 From: huizhe.wang at oracle.com (Joe Wang) Date: Wed, 15 Jul 2020 10:57:17 -0700 Subject: RFR: 8248655: Support supplementary characters in String case insensitive operations In-Reply-To: References: Message-ID: <4899dc65-c851-f6f6-94ee-bd4743e71997@oracle.com> Hi Naoto, In StringUTF16.java, if one is isHighSurrogate and the other not, you may quickly return without going through the rest of the process, probably not significant as cp1 and cp2 and/or u1 and u2 won't be equal anyways. But it could skip a couple of toCodePoint/toUpperCase/toLowerCase calls. -Joe On 7/15/20 9:00 AM, naoto.sato at oracle.com wrote: > Hello, > > Please review the fix to the following issues: > > https://bugs.openjdk.java.net/browse/JDK-8248655 > https://bugs.openjdk.java.net/browse/JDK-8248434 > > The proposed changeset and its CSR are located at: > > https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/ > https://bugs.openjdk.java.net/browse/JDK-8248664 > > A bug was filed against SimpleDateFormat (8248434) where > case-insensitive date format/parse failed in some of the new locales > in JDK15. The root cause was that case-insensitive > String.regionMatches() method did not work with supplementary > characters. The problem is that the method's spec does not expect case > mappings of supplementary characters, possibly because it was > overlooked in the first place, JSR 204 - "Unicode Supplementary > Character support". Similar behavior is observed in other two > case-insensitive methods, i.e., compareToIgnoreCase() and > equalsIgnoreCase(). 
> > The fix is straightforward to compare strings by code point basis, > instead of code unit (16bit "char") basis. Technically this change > will introduce a backward incompatibility, but I believe it is an > incompatibility to wrong behavior, not true to the meaning of those > methods' expectations. > > Naoto From naoto.sato at oracle.com Wed Jul 15 18:32:40 2020 From: naoto.sato at oracle.com (naoto.sato at oracle.com) Date: Wed, 15 Jul 2020 11:32:40 -0700 Subject: RFR: 8248655: Support supplementary characters in String case insensitive operations In-Reply-To: <4899dc65-c851-f6f6-94ee-bd4743e71997@oracle.com> References: <4899dc65-c851-f6f6-94ee-bd4743e71997@oracle.com> Message-ID: <9730d3e2-93bf-bf12-748f-9e2851743d80@oracle.com> Hi Joe, Thank you for your review. On 7/15/20 10:57 AM, Joe Wang wrote: > Hi Naoto, > > In StringUTF16.java, if one is isHighSurrogate and the other not, you > may quickly return without going through the rest of the process, > probably not significant as cp1 and cp2 and/or u1 and u2 won't be equal > anyways. But it could skip a couple of > toCodePoint/toUpperCase/toLowerCase calls. Yes, that is correct as of now, which is based on the assumption that case mappings do not cross BMP and supplementary planes boundary. I could not find any description where that's given or not. So I just took it to be safe. Naoto > > -Joe > > On 7/15/20 9:00 AM, naoto.sato at oracle.com wrote: >> Hello, >> >> Please review the fix to the following issues: >> >> https://bugs.openjdk.java.net/browse/JDK-8248655 >> https://bugs.openjdk.java.net/browse/JDK-8248434 >> >> The proposed changeset and its CSR are located at: >> >> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/ >> https://bugs.openjdk.java.net/browse/JDK-8248664 >> >> A bug was filed against SimpleDateFormat (8248434) where >> case-insensitive date format/parse failed in some of the new locales >> in JDK15. 
The root cause was that case-insensitive >> String.regionMatches() method did not work with supplementary >> characters. The problem is that the method's spec does not expect case >> mappings of supplementary characters, possibly because it was >> overlooked in the first place, JSR 204 - "Unicode Supplementary >> Character support". Similar behavior is observed in other two >> case-insensitive methods, i.e., compareToIgnoreCase() and >> equalsIgnoreCase(). >> >> The fix is straightforward to compare strings by code point basis, >> instead of code unit (16bit "char") basis. Technically this change >> will introduce a backward incompatibility, but I believe it is an >> incompatibility to wrong behavior, not true to the meaning of those >> methods' expectations. >> >> Naoto > From huizhe.wang at oracle.com Wed Jul 15 19:32:36 2020 From: huizhe.wang at oracle.com (Joe Wang) Date: Wed, 15 Jul 2020 12:32:36 -0700 Subject: RFR: 8248655: Support supplementary characters in String case insensitive operations In-Reply-To: References: <4899dc65-c851-f6f6-94ee-bd4743e71997@oracle.com> <9730d3e2-93bf-bf12-748f-9e2851743d80@oracle.com> Message-ID: Jim: I was referring to the rest of the process (the calls to toCodePoint/toUpperCase/toLowerCase), not isHighSurrogate itself. But Roger has a more comprehensive review on performance, and Naoto is planning on preparing a JMH test. -Joe On 7/15/2020 11:46 AM, Jim Laskey wrote: > Joe: This is a defensive approach that I believe has minimal cost. > > public static boolean isHighSurrogate(char ch) { > // Help VM constant-fold; MAX_HIGH_SURROGATE + 1 == MIN_LOW_SURROGATE > return ch >= MIN_HIGH_SURROGATE && ch < (MAX_HIGH_SURROGATE + 1); > } > > >> On Jul 15, 2020, at 3:32 PM, naoto.sato at oracle.com wrote: >> >> Hi Joe, >> >> Thank you for your review. 
>> >> On 7/15/20 10:57 AM, Joe Wang wrote: >>> Hi Naoto, >>> In StringUTF16.java, if one is isHighSurrogate and the other not, you may quickly return without going through the rest of the process, probably not significant as cp1 and cp2 and/or u1 and u2 won't be equal anyways. But it could skip a couple of toCodePoint/toUpperCase/toLowerCase calls. >> Yes, that is correct as of now, which is based on the assumption that case mappings do not cross BMP and supplementary planes boundary. I could not find any description where that's given or not. So I just took it to be safe. >> >> Naoto >> >>> -Joe >>> On 7/15/20 9:00 AM, naoto.sato at oracle.com wrote: >>>> Hello, >>>> >>>> Please review the fix to the following issues: >>>> >>>> https://bugs.openjdk.java.net/browse/JDK-8248655 >>>> https://bugs.openjdk.java.net/browse/JDK-8248434 >>>> >>>> The proposed changeset and its CSR are located at: >>>> >>>> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/ >>>> https://bugs.openjdk.java.net/browse/JDK-8248664 >>>> >>>> A bug was filed against SimpleDateFormat (8248434) where case-insensitive date format/parse failed in some of the new locales in JDK15. The root cause was that case-insensitive String.regionMatches() method did not work with supplementary characters. The problem is that the method's spec does not expect case mappings of supplementary characters, possibly because it was overlooked in the first place, JSR 204 - "Unicode Supplementary Character support". Similar behavior is observed in other two case-insensitive methods, i.e., compareToIgnoreCase() and equalsIgnoreCase(). >>>> >>>> The fix is straightforward to compare strings by code point basis, instead of code unit (16bit "char") basis. Technically this change will introduce a backward incompatibility, but I believe it is an incompatibility to wrong behavior, not true to the meaning of those methods' expectations. 
>>>> >>>> Naoto From naoto.sato at oracle.com Fri Jul 17 23:36:00 2020 From: naoto.sato at oracle.com (naoto.sato at oracle.com) Date: Fri, 17 Jul 2020 16:36:00 -0700 Subject: RFR: 8248655: Support supplementary characters in String case insensitive operations In-Reply-To: References: Message-ID: <1c9dad7c-bda7-f060-0c97-0bb5f848d0ef@oracle.com>

Hi,

Based on the suggestions, I modified the fix as follows:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/

Changes from the initial revision are:

- Shared the implementation between compareToCI() and regionMatchesCI()
- Enabled immediate short cut if two code points match.
- Created a simple JMH benchmark. Here are the scores before and after the change:

before:
Benchmark                                Mode  Cnt   Score   Error  Units
StringCompareToIgnoreCase.lower          avgt   25  53.764 ± 2.811  ns/op
StringCompareToIgnoreCase.supLower       avgt   25  24.211 ± 1.135  ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25  30.595 ± 1.344  ns/op
StringCompareToIgnoreCase.upperLower     avgt   25  18.859 ± 1.499  ns/op

after:
Benchmark                                Mode  Cnt   Score   Error  Units
StringCompareToIgnoreCase.lower          avgt   25  58.354 ± 4.603  ns/op
StringCompareToIgnoreCase.supLower       avgt   25  57.975 ± 5.672  ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25  23.912 ± 0.965  ns/op
StringCompareToIgnoreCase.upperLower     avgt   25  17.744 ± 0.272  ns/op

Here, "sup" means all supplementary characters, BMP otherwise. "lower" means each character requires upper->lower casemap. "upperLower" means all characters are the same, except the last character which requires casemap.

I think the result is reasonable, considering surrogate checks are now mandatory. I have tried Roger's suggestion to use Arrays.mismatch() but it did not seem to benefit here. In fact, the performance degraded partly because I implemented the short cut, and possibly for the overhead of extra checks.
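The idea of comparing by code point rather than by 16-bit code unit, with an immediate short cut when two code points match, can be illustrated with public APIs. This is only a sketch under assumptions, not the StringUTF16 code in the webrev: the class and method names are hypothetical, and it operates on String rather than on the internal byte arrays.

```java
// Hypothetical illustration class; not the JDK implementation.
final class CodePointCaseCompare {

    // Case-insensitive equality that walks both strings one code point at
    // a time, so surrogate pairs are case-mapped as single characters.
    static boolean equalsIgnoreCaseByCodePoint(String a, String b) {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            int cp1 = a.codePointAt(i);
            int cp2 = b.codePointAt(j);
            i += Character.charCount(cp1);
            j += Character.charCount(cp2);
            if (cp1 == cp2) {
                continue; // short cut: identical code points need no case mapping
            }
            int u1 = Character.toUpperCase(cp1);
            int u2 = Character.toUpperCase(cp2);
            if (u1 == u2) {
                continue;
            }
            // Same upper-then-lower check that the char-based methods document.
            if (Character.toLowerCase(u1) == Character.toLowerCase(u2)) {
                continue;
            }
            return false;
        }
        // Equal only if both strings were consumed completely.
        return i == a.length() && j == b.length();
    }

    public static void main(String[] args) {
        String upper = "\uD801\uDC00"; // U+10400 DESERET CAPITAL LETTER LONG I
        String lower = "\uD801\uDC28"; // U+10428 DESERET SMALL LETTER LONG I
        System.out.println(equalsIgnoreCaseByCodePoint(upper, lower));
        System.out.println(equalsIgnoreCaseByCodePoint("JDK", "jdk"));
        System.out.println(equalsIgnoreCaseByCodePoint("JDK", "jdks"));
    }
}
```

A char-by-char comparison sees the Deseret pair above as two strings that differ in their low surrogate, which is the kind of mismatch the bug describes.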
Naoto On 7/15/20 9:00 AM, naoto.sato at oracle.com wrote: > Hello, > > Please review the fix to the following issues: > > https://bugs.openjdk.java.net/browse/JDK-8248655 > https://bugs.openjdk.java.net/browse/JDK-8248434 > > The proposed changeset and its CSR are located at: > > https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/ > https://bugs.openjdk.java.net/browse/JDK-8248664 > > A bug was filed against SimpleDateFormat (8248434) where > case-insensitive date format/parse failed in some of the new locales in > JDK15. The root cause was that case-insensitive String.regionMatches() > method did not work with supplementary characters. The problem is that > the method's spec does not expect case mappings of supplementary > characters, possibly because it was overlooked in the first place, JSR > 204 - "Unicode Supplementary Character support". Similar behavior is > observed in other two case-insensitive methods, i.e., > compareToIgnoreCase() and equalsIgnoreCase(). > > The fix is straightforward to compare strings by code point basis, > instead of code unit (16bit "char") basis. Technically this change will > introduce a backward incompatibility, but I believe it is an > incompatibility to wrong behavior, not true to the meaning of those > methods' expectations. > > Naoto From mark at macchiato.com Sat Jul 18 03:03:01 2020 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Fri, 17 Jul 2020 20:03:01 -0700 Subject: RFR: 8248655: Support supplementary characters in String case insensitive operations In-Reply-To: <1c9dad7c-bda7-f060-0c97-0bb5f848d0ef@oracle.com> References: <1c9dad7c-bda7-f060-0c97-0bb5f848d0ef@oracle.com> Message-ID: One option is to have a fast path that uses char functions, up to the point where you hit a high surrogate, then drop into the slower codepoint path. That saves time for the high-runner cases. On the other hand, if the times are good enough, you might not need the complication. 
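The fast-path option can be sketched roughly as follows. This is an illustration under assumptions, not code from the webrev: the names are hypothetical, it works on String rather than the internal byte arrays, and it leaves the fast path on any surrogate (high or low) in either string so that unpaired surrogates are also handled by the code point loop.

```java
// Hypothetical illustration class; not the JDK implementation.
final class FastPathCaseCompare {

    static boolean equalsIgnoreCase(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int k = 0;
        // Fast path: plain char comparison while both sides are non-surrogate.
        while (k < n) {
            char c1 = a.charAt(k);
            char c2 = b.charAt(k);
            if (Character.isSurrogate(c1) || Character.isSurrogate(c2)) {
                break; // hand the rest over to the code point path
            }
            if (c1 != c2) {
                char u1 = Character.toUpperCase(c1);
                char u2 = Character.toUpperCase(c2);
                if (u1 != u2 && Character.toLowerCase(u1) != Character.toLowerCase(u2)) {
                    return false;
                }
            }
            k++;
        }
        // Everything before k was one char per character on both sides,
        // so both strings can resume at the same index.
        return codePointsMatch(a, b, k);
    }

    // Slow path: compare the remainder one code point at a time.
    private static boolean codePointsMatch(String a, String b, int from) {
        int i = from;
        int j = from;
        while (i < a.length() && j < b.length()) {
            int cp1 = a.codePointAt(i);
            int cp2 = b.codePointAt(j);
            i += Character.charCount(cp1);
            j += Character.charCount(cp2);
            if (cp1 != cp2) {
                int u1 = Character.toUpperCase(cp1);
                int u2 = Character.toUpperCase(cp2);
                if (u1 != u2 && Character.toLowerCase(u1) != Character.toLowerCase(u2)) {
                    return false;
                }
            }
        }
        return i == a.length() && j == b.length();
    }
}
```

For strings that contain no surrogates the slow path does no per-character work; whether that saving justifies the extra code is the trade-off raised above.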
Mark

From naoto.sato at oracle.com Sun Jul 19 18:05:11 2020
From: naoto.sato at oracle.com (naoto.sato at oracle.com)
Date: Sun, 19 Jul 2020 11:05:11 -0700
Subject: RFR: 8248655: Support supplementary characters in String case insensitive operations
In-Reply-To:
References: <1c9dad7c-bda7-f060-0c97-0bb5f848d0ef@oracle.com>
Message-ID: <5de5993a-766c-3e95-f6f7-4cea3ad82834@oracle.com>

Hi Mark,

Thank you for your comments.

On 7/17/20 8:03 PM, Mark Davis ☕️ wrote:
> One option is to have a fast path that uses char functions, up to the
> point where you hit a high surrogate, then drop into the slower
> codepoint path.
> That saves time for the high-runner cases.
>
> On the other hand, if the times are good enough, you might not need the
> complication.

The implementation deals with bare byte arrays. The only methods it uses from the Character class are toLowerCase(int) and toUpperCase(int) (sans the surrogate check, which is needed at each iteration anyway), and their "char" equivalents merely cast the int result to (char). So it might not be so beneficial to differentiate the char and int paths.

Having said that, I found that there was an unnecessary surrogate check (it always checked for high *and* low surrogates on each iteration), so I revised the fix (added lines 380-383 in StringUTF16.java):

http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.02/

Naoto

From huizhe.wang at oracle.com Mon Jul 20 18:20:36 2020
From: huizhe.wang at oracle.com (Joe Wang)
Date: Mon, 20 Jul 2020 11:20:36 -0700
Subject: RFR: 8248655: Support supplementary characters in String case insensitive operations
In-Reply-To: <5de5993a-766c-3e95-f6f7-4cea3ad82834@oracle.com>
References: <1c9dad7c-bda7-f060-0c97-0bb5f848d0ef@oracle.com> <5de5993a-766c-3e95-f6f7-4cea3ad82834@oracle.com>
Message-ID:

Hi Naoto,

StringUTF16: lines 384 - 388 seem unnecessary since you'd only get there if 389:isHighSurrogate is not true. But more importantly, StringUTF16 has an existing method "codePointAt" that you may want to consider instead of adding a new method.

Comparing to the base benchmark:
StringCompareToIgnoreCase.lower            8.5%
StringCompareToIgnoreCase.supLower         139%
StringCompareToIgnoreCase.supUpperLower  -21.8%
StringCompareToIgnoreCase.upperLower      -5.9%

"lower" was 8.5% slower; if such a test existed in SPECjvm, it would be considered a regression. I would suggest you run SPECjvm. I agree with you on the surrogate check being a requirement, thus supLower being 139% slower is ok since it won't otherwise be correct anyway. But after introducing additional operations, supUpperLower and upperLower ran faster? That may indicate irregularity in the tests. Maybe we should consider running tests with short, long and very long sample strings to see if we can reduce the noise level and also see how it fares for a longer string. I assume the machine you're running the test on was isolated.

Regards,
Joe
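The percentages in the comparison against the base benchmark follow from the posted before/after scores as (after - before) / before. A small sketch of that arithmetic, with a hypothetical class name and the scores copied from the JMH output earlier in the thread:

```java
// Hypothetical helper reproducing the relative changes from the posted scores.
final class BenchmarkDelta {

    /** Relative change in percent; positive means slower (more ns/op). */
    static double percentChange(double before, double after) {
        return (after - before) / before * 100.0;
    }

    public static void main(String[] args) {
        String[] names = {"lower", "supLower", "supUpperLower", "upperLower"};
        double[] before = {53.764, 24.211, 30.595, 18.859};
        double[] after = {58.354, 57.975, 23.912, 17.744};
        for (int i = 0; i < names.length; i++) {
            System.out.printf("%-14s %+.1f%%%n", names[i], percentChange(before[i], after[i]));
        }
    }
}
```

Note that the error margins reported by JMH (up to about ±5.7 ns/op) are not reflected in these point estimates.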

From naoto.sato at oracle.com Mon Jul 20 21:39:09 2020
From: naoto.sato at oracle.com (naoto.sato at oracle.com)
Date: Mon, 20 Jul 2020 14:39:09 -0700
Subject: RFR: 8248655: Support supplementary characters in String case insensitive operations
In-Reply-To:
References: <1c9dad7c-bda7-f060-0c97-0bb5f848d0ef@oracle.com> <5de5993a-766c-3e95-f6f7-4cea3ad82834@oracle.com>
Message-ID: <9f6030cb-0099-bf7d-e581-636ba1f791ca@oracle.com>

Hi Joe,

Thank you for your comments.

On 7/20/20 11:20 AM, Joe Wang wrote:
> Hi Naoto,
>
> StringUTF16: line 384 - 388 seem unnecessary since you'd only get there
> if 389:isHighSurrogate is not true.

Good point.

> But more importantly, StringUTF16
> has existing method "codePointAt" you may want to consider instead of
> adding a new method.

If we call codePointAt/Before, it would call an extra getChar(). Since we already know one code point as an input, I would avoid the extra calls.

> Comparing to the base benchmark:
> StringCompareToIgnoreCase.lower            8.5%
> StringCompareToIgnoreCase.supLower         139%
> StringCompareToIgnoreCase.supUpperLower  -21.8%
> StringCompareToIgnoreCase.upperLower      -5.9%
>
> "lower" was 8.5% slower, if such test exists in the specJVM, it would be
> considered a regression. I would suggest you run the specJVM. I agree
> with you on surrogate check being a requirement, thus supLower being
> 139% slower is ok since it won't otherwise be correct anyways.

Yes, it would be a regression if SPECjvm showed an 8+% degradation, but that test suite is aimed at whole-application performance. This one is a micro benchmark for relatively rarely used methods (I would think normal cases fall into the Latin1 equivalents), so I would consider it acceptable.

> But after
> introducing additional operations supUpperLower and upperLower ran
> faster? That may indicate irregularity in the tests. Maybe we should
> consider running tests with short, long and very long sample strings to
> see if we can reduce the noise level and also see how it fares for a
> longer string. I assume the machine you're running the test on was
> isolated.

This result pretty much depends on the data being tested. As I wrote in the previous email, the (sup)UpperLower tests use data that are almost identical, except that the last character is only case-insensitively equal. So in these cases, the new short cut works really well and does not call toLower/UpperCase() at all for most of the characters. Thus the new results are faster. Again, the test result is very dependent on the input data. Unless the result showed 100% slower or something (except for the supLower case), I think it is OK.

Anyway, here is the updated webrev addressing your first suggestion:

http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.03/

Naoto

From naoto.sato at oracle.com Mon Jul 20 22:20:18 2020
From: naoto.sato at oracle.com (naoto.sato at oracle.com)
Date: Mon, 20 Jul 2020 15:20:18 -0700
Subject: RFR: 8248655: Support supplementary characters in String case insensitive operations
In-Reply-To: <9f6030cb-0099-bf7d-e581-636ba1f791ca@oracle.com>
References: <1c9dad7c-bda7-f060-0c97-0bb5f848d0ef@oracle.com> <5de5993a-766c-3e95-f6f7-4cea3ad82834@oracle.com> <9f6030cb-0099-bf7d-e581-636ba1f791ca@oracle.com>
Message-ID: <74d5bf34-8c44-e000-5161-a030c8d59593@oracle.com>

Small correction in the updated part:

http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.04/

Naoto

From huizhe.wang at oracle.com Tue Jul 21 02:14:47 2020
From: huizhe.wang at oracle.com (Joe Wang)
Date: Mon, 20 Jul 2020 19:14:47 -0700
Subject: RFR: 8248655: Support supplementary characters in String case insensitive operations
In-Reply-To: <74d5bf34-8c44-e000-5161-a030c8d59593@oracle.com>
References: <1c9dad7c-bda7-f060-0c97-0bb5f848d0ef@oracle.com> <5de5993a-766c-3e95-f6f7-4cea3ad82834@oracle.com> <9f6030cb-0099-bf7d-e581-636ba1f791ca@oracle.com> <74d5bf34-8c44-e000-5161-a030c8d59593@oracle.com>
Message-ID: <186dfcff-5c57-bc5f-7e0f-a29d1ba65446@oracle.com>

Hi Naoto,

"Unless it showed 100% slower", wow, your tolerance is quite high :-). On the other hand, I do agree it's unlikely to show in SPECjvm (that's speculation, though).

The short cut worked well. There may be a further optimization we could do to rid us of the performance concern (or the need to run additional performance tests). Consider the cases where the strings being compared don't contain any sup characters: if we make the toLower/UpperCase() block a method and call it before and after the surrogate-check block, the routine would be effectively the same as it was prior to the introduction of the surrogate-check block, and regular comparisons would suffer the surrogate check only once (the last check). For strings that do contain sup characters, the toLower/UpperCase() method would then be called twice, but we don't care about the performance in that situation.
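One way to read this suggestion, sketched with the public String API and hypothetical names (foldDiff stands in for the "toLower/UpperCase() block" made into a method; this is not the code from any webrev): the case-folding helper runs first on the raw chars, and the surrogate check is reached only when that first attempt reports a difference.

```java
// Hypothetical sketch: case-fold first, check surrogates only on a mismatch.
final class FoldBeforeSurrogateCheck {

    /** Returns 0 if the two values are equal ignoring case, else their difference. */
    static int foldDiff(int cp1, int cp2) {
        cp1 = Character.toUpperCase(cp1);
        cp2 = Character.toUpperCase(cp2);
        if (cp1 != cp2) {
            cp1 = Character.toLowerCase(cp1);
            cp2 = Character.toLowerCase(cp2);
            if (cp1 != cp2) {
                return cp1 - cp2;
            }
        }
        return 0;
    }

    static int compareToIgnoreCase(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int k = 0; k < n; k++) {
            char c1 = a.charAt(k);
            char c2 = b.charAt(k);
            if (c1 == c2) {
                continue; // short cut, no case mapping needed
            }
            int diff = foldDiff(c1, c2); // first call: treat both as BMP chars
            if (diff == 0) {
                continue;
            }
            // Surrogate check: reached only after a char-level mismatch
            if (Character.isHighSurrogate(c1) && Character.isHighSurrogate(c2)) {
                diff = foldDiff(a.codePointAt(k), b.codePointAt(k)); // second call
                if (diff == 0) {
                    k++; // both are full pairs; skip their low surrogates
                    continue;
                }
            } else if (Character.isLowSurrogate(c1) && Character.isLowSurrogate(c2)) {
                diff = foldDiff(a.codePointBefore(k + 1), b.codePointBefore(k + 1));
                if (diff == 0) {
                    continue;
                }
            }
            return diff;
        }
        return a.length() - b.length();
    }
}
```

For BMP-only input the surrogate branch runs at most once, at the first genuinely different character, which matches the "only once (the last check)" observation above.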
You may call the existing codePointAt method too when an extra getChar and performance are not a concern (but that's your call).

Regards,
Joe
>>> Maybe we should consider running tests with short, long and very >>> long sample strings to see if we can reduce the noise level and also >>> see how it fares for a longer string. I assume the machine you're >>> running the test on was isolated. >> >> This result pretty much depends on the data it is testing for. As I >> wrote in the previous email, (sup)UpperLower tests use the data that >> are almost identical, but one last character is case insensitively >> equal. So in these cases, the new short cut works really well and not >> call toLower/UpperCase() at all for most of the characters. Thus the >> new results are faster. Again the test result is very dependent on >> the input data, Unless the result showed 100% slower or something >> (except supLower case), I think it is OK. >> >> Anyways, here is the updated webrev addressing your first suggestion: >> >> http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.03/ >> >> Naoto >> >>> >>> Regards, >>> Joe >>> >>> On 7/19/2020 11:05 AM, naoto.sato at oracle.com wrote: >>>> Hi Mark, >>>> >>>> Thank you for your comments. >>>> >>>> On 7/17/20 8:03 PM, Mark Davis ? wrote: >>>>> One option is to have a fast path that uses char functions, up to >>>>> the point where you hit a high surrogate, then drop into the >>>>> slower codepoint path. That saves time for the high-runner cases. >>>>> >>>>> On the other hand, if the times are good enough, you might not >>>>> need the complication. >>>> >>>> The implementation is dealing with bare byte arrays. Only methods >>>> that it uses from Character class are toLowerCase(int) and >>>> toUpperCase(int) (sans surrogate check, which is needed at each >>>> iteration anyways), and their "char" equivalents are merely casting >>>> (char) to the int result. So it might not be so beneficial to >>>> differentiate char and int paths. 
>>>> >>>> Having said that, I found that there was an unnecessary surrogate >>>> check (always checks high *and* low surrogate on each iteration), >>>> so I revised the fix (added line 380-383 in StringUTF16.java): >>>> >>>> http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.02/ >>>> >>>> Naoto >>>> >>>>> >>>>> Mark >>>>> ////// >>>>> >>>>> >>>>> On Fri, Jul 17, 2020 at 4:39 PM >>>> > wrote: >>>>> >>>>> ??? Hi, >>>>> >>>>> ??? Based on the suggestions, I modified the fix as follows: >>>>> >>>>> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/ >>>>> >>>>> ??? Changes from the initial revision are: >>>>> >>>>> ??? - Shared the implementation between compareToCI() and >>>>> regionMatchesCI() >>>>> ??? - Enabled immediate short cut if two code points match. >>>>> ??? - Created a simple JMH benchmark. Here is the scores before >>>>> and after >>>>> ??? the change: >>>>> >>>>> ??? before: >>>>> ??? Benchmark? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? Mode? Cnt ?Score >>>>> ?Error ??? Units >>>>> ??? StringCompareToIgnoreCase.lower? ? ? ? ? avgt? ?25 53.764 ? >>>>> 2.811 ??? ns/op >>>>> ??? StringCompareToIgnoreCase.supLower? ? ? ?avgt? ?25 24.211 ? >>>>> 1.135 ??? ns/op >>>>> ??? StringCompareToIgnoreCase.supUpperLower? avgt? ?25 30.595 ? >>>>> 1.344 ??? ns/op >>>>> ??? StringCompareToIgnoreCase.upperLower? ? ?avgt? ?25 18.859 ? >>>>> 1.499 ??? ns/op >>>>> >>>>> ??? after: >>>>> ??? Benchmark? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? Mode? Cnt ?Score >>>>> ?Error ??? Units >>>>> ??? StringCompareToIgnoreCase.lower? ? ? ? ? avgt? ?25 58.354 ? >>>>> 4.603 ??? ns/op >>>>> ??? StringCompareToIgnoreCase.supLower? ? ? ?avgt? ?25 57.975 ? >>>>> 5.672 ??? ns/op >>>>> ??? StringCompareToIgnoreCase.supUpperLower? avgt? ?25 23.912 ? >>>>> 0.965 ??? ns/op >>>>> ??? StringCompareToIgnoreCase.upperLower? ? ?avgt? ?25 17.744 ? >>>>> 0.272 ??? ns/op >>>>> >>>>> ??? Here, "sup" means all supplementary characters, BMP otherwise. >>>>> "lower" >>>>> ??? 
means each character requires upper->lower casemap. >>>>> "upperLower" means >>>>> ??? all characters are the same, except the last character which >>>>> requires >>>>> ??? casemap. >>>>> >>>>> ??? I think the result is reasonable, considering surrogates check >>>>> are now >>>>> ??? mandatory. I have tried Roger's suggestion to use >>>>> Arrays.mismatch() but >>>>> ??? it did not seem to benefit here. In fact, the performance >>>>> degraded >>>>> ??? partly because I implemented the short cut, and possibly for the >>>>> ??? overhead of extra checks. >>>>> >>>>> ??? Naoto >>>>> >>>>> ??? On 7/15/20 9:00 AM, naoto.sato at oracle.com >>>>> ??? wrote: >>>>> ???? > Hello, >>>>> ???? > >>>>> ???? > Please review the fix to the following issues: >>>>> ???? > >>>>> ???? > https://bugs.openjdk.java.net/browse/JDK-8248655 >>>>> ???? > https://bugs.openjdk.java.net/browse/JDK-8248434 >>>>> ???? > >>>>> ???? > The proposed changeset and its CSR are located at: >>>>> ???? > >>>>> ???? > https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/ >>>>> ???? > https://bugs.openjdk.java.net/browse/JDK-8248664 >>>>> ???? > >>>>> ???? > A bug was filed against SimpleDateFormat (8248434) where >>>>> ???? > case-insensitive date format/parse failed in some of the new >>>>> ??? locales in >>>>> ???? > JDK15. The root cause was that case-insensitive >>>>> ??? String.regionMatches() >>>>> ???? > method did not work with supplementary characters. The >>>>> problem is >>>>> ??? that >>>>> ???? > the method's spec does not expect case mappings of >>>>> supplementary >>>>> ???? > characters, possibly because it was overlooked in the first >>>>> ??? place, JSR >>>>> ???? > 204 - "Unicode Supplementary Character support". Similar >>>>> behavior is >>>>> ???? > observed in other two case-insensitive methods, i.e., >>>>> ???? > compareToIgnoreCase() and equalsIgnoreCase(). >>>>> ???? > >>>>> ???? > The fix is straightforward to compare strings by code point >>>>> basis, >>>>> ???? 
> instead of code unit (16bit "char") basis. Technically this >>>>> ??? change will >>>>> ???? > introduce a backward incompatibility, but I believe it is an >>>>> ???? > incompatibility to wrong behavior, not true to the meaning >>>>> of those >>>>> ???? > methods' expectations. >>>>> ???? > >>>>> ???? > Naoto >>>>> >>> From naoto.sato at oracle.com Tue Jul 21 03:58:08 2020 From: naoto.sato at oracle.com (naoto.sato at oracle.com) Date: Mon, 20 Jul 2020 20:58:08 -0700 Subject: RFR: 8248655: Support supplementary characters in String case insensitive operations In-Reply-To: <186dfcff-5c57-bc5f-7e0f-a29d1ba65446@oracle.com> References: <1c9dad7c-bda7-f060-0c97-0bb5f848d0ef@oracle.com> <5de5993a-766c-3e95-f6f7-4cea3ad82834@oracle.com> <9f6030cb-0099-bf7d-e581-636ba1f791ca@oracle.com> <74d5bf34-8c44-e000-5161-a030c8d59593@oracle.com> <186dfcff-5c57-bc5f-7e0f-a29d1ba65446@oracle.com> Message-ID: <36cb163f-9902-9d34-9c5a-c31f3b905eb9@oracle.com> Hi Joe, On 7/20/20 7:14 PM, Joe Wang wrote: > Hi Naoto, > > "Unless it showed 100% slower", wow, your tolerance is quite high :-). > On the other hand, I do agree it's unlikely to show in specJVM (that's a > speculation though). I am not saying 100% slowing is permissible :-) That's an example of definite no. > > The short-cut worked well. There's maybe a further optimization we could > do to rid us of the performance concern (or the need to run additional > performance tests). Consider the cases where strings in comparison don't > contain any sup characters, if we make the toLower/UpperCase() block a > method and call it before and after the surrogate-check block, the > routine would be effectively the same as prior to the introduction of > the surrogate-check block, and regular comparisons would suffer the > surrogate-check only once (the last check). 
> For strings that do contain sup characters then, the
> toLower/UpperCase() method would have been called twice, but then we
> don't care about the performance in that situation. You may call the
> existing codePointAt method too when an extra getChar and performance
> is not a concern (but that's your call).

Can you please elaborate on this more? What's "the last check" here?

Naoto

From huizhe.wang at oracle.com Tue Jul 21 17:05:25 2020
From: huizhe.wang at oracle.com (Joe Wang)
Date: Tue, 21 Jul 2020 10:05:25 -0700
Subject: RFR: 8248655: Support supplementary characters in String case insensitive operations
In-Reply-To: <36cb163f-9902-9d34-9c5a-c31f3b905eb9@oracle.com>
References: <1c9dad7c-bda7-f060-0c97-0bb5f848d0ef@oracle.com>
 <5de5993a-766c-3e95-f6f7-4cea3ad82834@oracle.com>
 <9f6030cb-0099-bf7d-e581-636ba1f791ca@oracle.com>
 <74d5bf34-8c44-e000-5161-a030c8d59593@oracle.com>
 <186dfcff-5c57-bc5f-7e0f-a29d1ba65446@oracle.com>
 <36cb163f-9902-9d34-9c5a-c31f3b905eb9@oracle.com>
Message-ID: <6d7a2b63-de1c-698a-85d8-7b8be46e57ec@oracle.com>

On 7/20/2020 8:58 PM, naoto.sato at oracle.com wrote:
>> The short-cut worked well. There's maybe a further optimization we
>> could do to rid us of the performance concern (or the need to run
>> additional performance tests).
>> Consider the cases where strings in comparison don't contain any sup
>> characters, if we make the toLower/UpperCase() block a method and call
>> it before and after the surrogate-check block, the routine would be
>> effectively the same as prior to the introduction of the
>> surrogate-check block, and regular comparisons would suffer the
>> surrogate-check only once (the last check). For strings that do
>> contain sup characters then, the toLower/UpperCase() method would have
>> been called twice, but then we don't care about the performance in
>> that situation. You may call the existing codePointAt method too when
>> an extra getChar and performance is not a concern (but that's your
>> call).
>
> Can you please elaborate on this more? What's "the last check" here?

What I meant was that we could expand the 'short-cut' from case
sensitive to case insensitive; that is, in addition to line 337, do the
case-insensitive check at lines 353 - 370 as well.

I guess it can be explained better with code. I added inline comments:

        for (int k1 = toffset, k2 = ooffset; k1 < tlast && k2 < olast; k1++, k2++) {
            int cp1 = (int)getChar(value, k1);
            int cp2 = (int)getChar(other, k2);

            // does a case-insensitive check:
            if (checkEqual(cp1, cp2) == 0) {
                continue;
            }

            // this block will be run once for strings that don't contain
            // any supplementary characters

            // Check for supplementary characters case
            cp1 = getSupplementaryCodePoint(value, cp1, k1, toffset, tlast);
            if ((cp1 & Integer.MIN_VALUE) != 0) {
                k1++;
                cp1 ^= Integer.MIN_VALUE;
            }
            cp2 = getSupplementaryCodePoint(other, cp2, k2, ooffset, olast);
            if ((cp2 & Integer.MIN_VALUE) != 0) {
                k2++;
                cp2 ^= Integer.MIN_VALUE;
            }

            // this check will have been called twice for strings that
            // contain supplementary characters, but only once more for
            // strings that don't
            int diff = checkEqual(cp1, cp2);
            if (diff != 0) {
                return diff;
            }
        }
        return tlen - olen;
    }

    // the code block between line 353 - 370 in webrev.04 except the last
    // line (return 0):
    private static int checkEqual(int cp1, int cp2) {
        if (cp1 != cp2) {
            // try converting both characters to uppercase.
            // If the results match, then the comparison scan should
            // continue.
            cp1 = Character.toUpperCase(cp1);
            cp2 = Character.toUpperCase(cp2);
            if (cp1 != cp2) {
                // Unfortunately, conversion to uppercase does not work properly
                // for the Georgian alphabet, which has strange rules about case
                // conversion. So we need to make one last check before
                // exiting.
                cp1 = Character.toLowerCase(cp1);
                cp2 = Character.toLowerCase(cp2);
                if (cp1 != cp2) {
                    return cp1 - cp2;
                }
            }
        }
        return 0;
    }

>
> Naoto

From naoto.sato at oracle.com Tue Jul 21 21:01:35 2020
From: naoto.sato at oracle.com (naoto.sato at oracle.com)
Date: Tue, 21 Jul 2020 14:01:35 -0700
Subject: RFR: 8248655: Support supplementary characters in String case insensitive operations
In-Reply-To: <6d7a2b63-de1c-698a-85d8-7b8be46e57ec@oracle.com>
References: <1c9dad7c-bda7-f060-0c97-0bb5f848d0ef@oracle.com>
 <5de5993a-766c-3e95-f6f7-4cea3ad82834@oracle.com>
 <9f6030cb-0099-bf7d-e581-636ba1f791ca@oracle.com>
 <74d5bf34-8c44-e000-5161-a030c8d59593@oracle.com>
 <186dfcff-5c57-bc5f-7e0f-a29d1ba65446@oracle.com>
 <36cb163f-9902-9d34-9c5a-c31f3b905eb9@oracle.com>
 <6d7a2b63-de1c-698a-85d8-7b8be46e57ec@oracle.com>
Message-ID:

Thank you, Joe. I got it now. Will try out and benchmark.
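
For anyone who wants to try the two-stage idea from Joe's mail outside
the JDK sources (a cheap per-char case-insensitive check first, and
surrogate handling only when that check fails), here is a minimal
sketch against the public String and Character APIs. It is not the
webrev code: the class CodePointCompareSketch and the helper
codePointAround() are hypothetical names, it works on String instead
of the internal byte[], and it keeps a single shared index, which
assumes both strings stay aligned as long as every earlier char
compared equal.

```java
final class CodePointCompareSketch {

    // Case-insensitive comparison of two code points (or of two chars
    // widened to int); returns 0 when they are considered equal.
    static int checkEqual(int cp1, int cp2) {
        if (cp1 != cp2) {
            cp1 = Character.toUpperCase(cp1);
            cp2 = Character.toUpperCase(cp2);
            if (cp1 != cp2) {
                // second attempt in lower case, as in the original code
                cp1 = Character.toLowerCase(cp1);
                cp2 = Character.toLowerCase(cp2);
                if (cp1 != cp2) {
                    return cp1 - cp2;
                }
            }
        }
        return 0;
    }

    // If the char at index is one half of a well-formed surrogate pair,
    // return the whole code point; otherwise return the char as-is
    // (this includes unpaired surrogates).
    static int codePointAround(String s, int index) {
        char c = s.charAt(index);
        if (Character.isLowSurrogate(c) && index > 0
                && Character.isHighSurrogate(s.charAt(index - 1))) {
            return Character.toCodePoint(s.charAt(index - 1), c);
        }
        if (Character.isHighSurrogate(c) && index + 1 < s.length()
                && Character.isLowSurrogate(s.charAt(index + 1))) {
            return Character.toCodePoint(c, s.charAt(index + 1));
        }
        return c;
    }

    static int compareCI(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            int c1 = a.charAt(i);
            int c2 = b.charAt(i);
            // fast path: plain chars, no surrogate handling at all
            if (checkEqual(c1, c2) == 0) {
                continue;
            }
            // slow path: only reached when the char-level check failed
            int diff = checkEqual(codePointAround(a, i), codePointAround(b, i));
            if (diff != 0) {
                return diff;
            }
        }
        return a.length() - b.length();
    }

    public static void main(String[] args) {
        // U+10400 and U+10428 are a Deseret upper/lower case pair
        System.out.println(compareCI("\uD801\uDC00", "\uD801\uDC28"));
        System.out.println(compareCI("abc", "ABC"));
    }
}
```

Strings made only of BMP characters that match case-insensitively
never reach codePointAround(), which is the point of the suggestion.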
Naoto

From brent.christian at oracle.com Tue Jul 21 21:56:01 2020
From: brent.christian at oracle.com (Brent Christian)
Date: Tue, 21 Jul 2020 14:56:01 -0700
Subject: RFR: 8248655: Support supplementary characters in String case insensitive operations
In-Reply-To: <74d5bf34-8c44-e000-5161-a030c8d59593@oracle.com>
References: <1c9dad7c-bda7-f060-0c97-0bb5f848d0ef@oracle.com>
 <5de5993a-766c-3e95-f6f7-4cea3ad82834@oracle.com>
 <9f6030cb-0099-bf7d-e581-636ba1f791ca@oracle.com>
 <74d5bf34-8c44-e000-5161-a030c8d59593@oracle.com>
Message-ID:

Hi, Naoto

I have a few comments:

src/java.base/share/classes/java/lang/StringUTF16.java

379     private static int getSupplementaryCodePoint(byte[] ba, int cp, int index, int start, int end)

I think it would be worth a small addition to the comment to reflect
that non-surrogate chars are returned as-is.

--

I thought about the scenario of an unpaired low or high surrogate at the
beginning or end of the string, respectively:

384         if (Character.isLowSurrogate((char)cp)) {
385             if (index > start) {
                ...
391         } else if (index + 1 < end) { // cp == high surrogate
392             char c = getChar(ba, index + 1);
                ...
397         return cp;

It looks like the cp itself would be returned from
getSupplementaryCodePoint(). And then back in compareToCIImpl(), it's
converted using Character.to[Upper|Lower]Case(int), which will also
return the cp itself. I imagine that's the best we could do, so seems
fine.

Is there a test case for unmatched surrogates at the beginning and end
of the string? Should there be?

--

I see there are no changes to StringLatin1.regionMatchesCI_UTF16(). I
presume there are no cases in which toUpperCase(toLowerCase()) of a
supplementary character could yield a Latin-1 character, yes?

Also, thanks for adding the benchmark!
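
Both observations above (a surrogate code unit passed to
Character.to[Upper|Lower]Case(int) comes back unchanged, and no
supplementary character case-maps into Latin-1) can be spot-checked
with the public Character API. The following is only a sketch: the
class name SurrogateCaseCheck and its method names are hypothetical,
and the outcome reflects the Unicode data of whichever JDK runs it, not
a guarantee taken from the webrev.

```java
final class SurrogateCaseCheck {

    // Returns true if upper- and lower-casing leave every surrogate
    // code unit (U+D800..U+DFFF) unchanged.
    static boolean surrogatesMapToThemselves() {
        for (int cp = Character.MIN_SURROGATE; cp <= Character.MAX_SURROGATE; cp++) {
            if (Character.toUpperCase(cp) != cp || Character.toLowerCase(cp) != cp) {
                return false;
            }
        }
        return true;
    }

    // Counts supplementary code points whose case mapping, in either
    // order of application, lands in the Latin-1 range (<= U+00FF).
    static int supplementaryMappingsIntoLatin1() {
        int count = 0;
        for (int cp = Character.MIN_SUPPLEMENTARY_CODE_POINT;
                 cp <= Character.MAX_CODE_POINT; cp++) {
            int upperOfLower = Character.toUpperCase(Character.toLowerCase(cp));
            int lowerOfUpper = Character.toLowerCase(Character.toUpperCase(cp));
            if (upperOfLower <= 0xFF || lowerOfUpper <= 0xFF) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("surrogates unchanged: " + surrogatesMapToThemselves());
        System.out.println("mappings into Latin-1: " + supplementaryMappingsIntoLatin1());
    }
}
```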
-Brent

From naoto.sato at oracle.com Tue Jul 21 22:26:43 2020
From: naoto.sato at oracle.com (naoto.sato at oracle.com)
Date: Tue, 21 Jul 2020 15:26:43 -0700
Subject: RFR: 8248655: Support supplementary characters in String case insensitive operations
In-Reply-To:
References: <1c9dad7c-bda7-f060-0c97-0bb5f848d0ef@oracle.com>
 <5de5993a-766c-3e95-f6f7-4cea3ad82834@oracle.com>
 <9f6030cb-0099-bf7d-e581-636ba1f791ca@oracle.com>
 <74d5bf34-8c44-e000-5161-a030c8d59593@oracle.com>
Message-ID: <5c8f19e8-a58d-b6d8-64ab-772838e3d356@oracle.com>

Hi Brent,

On 7/21/20 2:56 PM, Brent Christian wrote:
> Hi, Naoto
>
> I have a few comments:
>
> src/java.base/share/classes/java/lang/StringUTF16.java
>
> 379     private static int getSupplementaryCodePoint(byte[] ba, int cp,
> int index, int start, int end)
>
> I think it would be worth a small addition to the comment to reflect
> that non-surrogate chars are returned as-is.

Sure, I will add some more comments to the method.

> --
>
> I thought about the scenario of an unpaired low or high surrogate at the
> beginning or end of the string, respectively:
>
> 384         if (Character.isLowSurrogate((char)cp)) {
> 385             if (index > start) {
>                 ...
> 391         } else if (index + 1 < end) { // cp == high surrogate
> 392             char c = getChar(ba, index + 1);
>                 ...
> 397         return cp;
>
> It looks like the cp itself would be returned from
> getSupplementaryCodePoint(). And then back in compareToCIImpl(), it's
> converted using Character.to[Upper|Lower]Case(int), which will also
> return the cp itself. I imagine that's the best we could do, so seems
> fine.

Yes, that is exactly what is intended.

>
> Is there a test case for unmatched surrogates at the beginning and end
> of the string ? Should there be ?

Interestingly, there has been a test case for supplementary characters
before this change, where it intentionally begins from a low surrogate,
and ends with a high surrogate, so that it would succeed in the previous
*exact match* logic. Line 82 in RegionMatches.java tests:

"\uD801\uDC28\uD801\uDC29\uFF41a".regionMatches(true, 1, "\uDC28\uD801", 0, 2) == true

And the proposed change is compatible with this test case.

>
> --
>
> I see there are no changes to StringLatin1.regionMatchesCI_UTF16().  I
> presume there are no cases in which toUpperCase(toLowerCase()) of a
> supplementary character could yield a Latin-1 character, yes?

Yes, that is correct.

Naoto

>
> Also, thanks for adding the benchmark!
>
> -Brent
>
> On 7/20/20 3:20 PM, naoto.sato at oracle.com wrote:
>> Small correction in the updated part:
>>
>> http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.04/
>>
>> Naoto
>>
>> On 7/20/20 2:39 PM, naoto.sato at oracle.com wrote:
>>> Hi Joe,
>>>
>>> Thank you for your comments.
>>>
>>> On 7/20/20 11:20 AM, Joe Wang wrote:
>>>> Hi Naoto,
>>>>
>>>> StringUTF16: line 384 - 388 seem unnecessary since you'd only get
>>>> there if 389:isHighSurrogate is not true.
>>>
>>> Good point.
>>>
>>> But more importantly, StringUTF16
>>>> has existing method "codePointAt" you may want to consider instead
>>>> of adding a new method.
>>>
>>> If we call codePointAt/Before, it would call an extra getChar().
>>> Since we know one codepoint as an input, I would avoid the extra calls.
>>>
>>>>
>>>> Comparing to the base benchmark:
>>>> StringCompareToIgnoreCase.lower          8.5%
>>>> StringCompareToIgnoreCase.supLower      139%
>>>> StringCompareToIgnoreCase.supUpperLower  -21.8%
>>>> StringCompareToIgnoreCase.upperLower     avgt   -5.9%
>>>>
>>>>
>>>> "lower" was 8.5% slower, if such test exists in the specJVM, it
>>>> would be considered a regression. I would suggest you run the
>>>> specJVM. I agree with you on surrogate check being a requirement,
>>>> thus supLower being 139% slower is ok since it won't otherwise be
>>>> correct anyways.
>>>
>>> Yes, it would be a regression if SPECjvm produces 8+% degradation,
>>> but the test suite is aimed at the entire application performance.
>>> But for this one, it is a micro benchmark for relatively rarely
>>> issued methods (I would think normal cases fall into Latin1
>>> equivalents), I would consider it is acceptable.
>>>
>>>> But after introducing additional operations supUpperLower and
>>>> upperLower ran faster? That may indicate irregularity in the tests.
>>>> Maybe we should consider running tests with short, long and very
>>>> long sample strings to see if we can reduce the noise level and also
>>>> see how it fares for a longer string. I assume the machine you're
>>>> running the test on was isolated.
>>>
>>> This result pretty much depends on the data it is testing for. As I
>>> wrote in the previous email, (sup)UpperLower tests use the data that
>>> are almost identical, but one last character is case insensitively
>>> equal. So in these cases, the new short cut works really well and not
>>> call toLower/UpperCase() at all for most of the characters. Thus the
>>> new results are faster. Again the test result is very dependent on
>>> the input data, Unless the result showed 100% slower or something
>>> (except supLower case), I think it is OK.
>>>
>>> Anyways, here is the updated webrev addressing your first suggestion:
>>>
>>> http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.03/
>>>
>>> Naoto
>>>
>>>>
>>>> Regards,
>>>> Joe

From naoto.sato at oracle.com Wed Jul 22 17:23:16 2020
From: naoto.sato at oracle.com (naoto.sato at oracle.com)
Date: Wed, 22 Jul 2020 10:23:16 -0700
Subject: RFR: 8248655: Support supplementary characters in String case insensitive operations
In-Reply-To: <1c9dad7c-bda7-f060-0c97-0bb5f848d0ef@oracle.com>
References: <1c9dad7c-bda7-f060-0c97-0bb5f848d0ef@oracle.com>
Message-ID: <439b2c80-72d1-92fb-7691-c2a1e59f1aad@oracle.com>

Hi,

I revised the fix again, based on further suggestions:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.05/

Changes from v.04 are (all in StringUTF16.java):

- The short cut now does case insensitive comparison that makes the fix
  closer to the previous implementation (for BMP characters).
- Changed the bit operation to negating for detecting needed index
  increment.
- Method name is changed to better reflect what it is doing, with more
  descriptive comments.

Here is the benchmark results:

before:
Benchmark                                Mode  Cnt   Score   Error  Units
StringCompareToIgnoreCase.lower          avgt   25  49.960 ± 1.923  ns/op
StringCompareToIgnoreCase.supLower       avgt   25  21.003 ± 0.354  ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25  30.863 ± 4.529  ns/op
StringCompareToIgnoreCase.upperLower     avgt   25  15.417 ± 1.046  ns/op

after:
Benchmark                                Mode  Cnt    Score   Error  Units
StringCompareToIgnoreCase.lower          avgt   25   46.857 ± 0.524  ns/op
StringCompareToIgnoreCase.supLower       avgt   25  148.688 ± 6.546  ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25   37.160 ± 0.259  ns/op
StringCompareToIgnoreCase.upperLower     avgt   25   15.126 ± 0.338  ns/op

Now non-supplementary operations ("lower" and "upperLower") are on par
with the "before" result (I am not quite sure why the "after" results
are somewhat faster though). For supplementary test cases, "supLower" is
very slow. The reason is two fold; one is because "before" one exits at
the very first character (which I am addressing here) while "after"
continues to compare to the last characters, the other reason is the
test suffers from the change where supplementary cases double the case
insensitivity checks (compared to the "after" result just below). Also
"supUpperLower" gets slower for the same reason. These are expected
results for supplementary comparisons (as we discussed).

Naoto

On 7/17/20 4:36 PM, naoto.sato at oracle.com wrote:
> Hi,
>
> Based on the suggestions, I modified the fix as follows:
>
> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/
>
> Changes from the initial revision are:
>
> - Shared the implementation between compareToCI() and regionMatchesCI()
> - Enabled immediate short cut if two code points match.
> - Created a simple JMH benchmark. Here is the scores before and after
> the change:
>
> before:
> Benchmark                                Mode  Cnt   Score   Error  Units
> StringCompareToIgnoreCase.lower          avgt   25  53.764 ± 2.811  ns/op
> StringCompareToIgnoreCase.supLower       avgt   25  24.211 ± 1.135  ns/op
> StringCompareToIgnoreCase.supUpperLower  avgt   25  30.595 ± 1.344  ns/op
> StringCompareToIgnoreCase.upperLower     avgt   25  18.859 ± 1.499  ns/op
>
> after:
> Benchmark                                Mode  Cnt   Score   Error  Units
> StringCompareToIgnoreCase.lower          avgt   25  58.354 ± 4.603  ns/op
> StringCompareToIgnoreCase.supLower       avgt   25  57.975 ± 5.672  ns/op
> StringCompareToIgnoreCase.supUpperLower  avgt   25  23.912 ± 0.965  ns/op
> StringCompareToIgnoreCase.upperLower     avgt   25  17.744 ± 0.272  ns/op
>
> Here, "sup" means all supplementary characters, BMP otherwise. "lower"
> means each character requires upper->lower casemap. "upperLower" means
> all characters are the same, except the last character which requires
> casemap.
>
> I think the result is reasonable, considering surrogates check are now
> mandatory. I have tried Roger's suggestion to use Arrays.mismatch() but
> it did not seem to benefit here. In fact, the performance degraded
> partly because I implemented the short cut, and possibly for the
> overhead of extra checks.
>
> Naoto

From huizhe.wang at oracle.com Wed Jul 22 20:20:20 2020
From: huizhe.wang at oracle.com (Joe Wang)
Date: Wed, 22 Jul 2020 13:20:20 -0700
Subject: RFR: 8248655: Support supplementary characters in String case insensitive operations
In-Reply-To: <439b2c80-72d1-92fb-7691-c2a1e59f1aad@oracle.com>
References: <1c9dad7c-bda7-f060-0c97-0bb5f848d0ef@oracle.com> <439b2c80-72d1-92fb-7691-c2a1e59f1aad@oracle.com>
Message-ID:

Hi Naoto,

The change looks good to me. "supLower" is indeed super slow :-)

The only minor comment I have is that the compareCodePointCI method
performs toUpperCase unconditionally. That's not a problem for the
regular case, where a check on cp1 == cp2 (line 337) is done prior to
the method call. But for the sup case (starting at line 341), the method
is called unconditionally while in webrev.04 there was a check "cp1 !=
cp2".  One option to fix it is to include the "cp1 != cp2" check in the
method compareCodePointCI, then cp1 == cp2 at line 337 can be omitted.
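
For reference, the option described in the paragraph above can be
sketched with public APIs only. This is a minimal illustration under
stated assumptions, not the code in webrev.05: the class name
CodePointCompareSketch and the method compareIgnoreCase are hypothetical,
compareCodePointCI here is only a stand-in that borrows the name used in
the review, and the real implementation works on byte arrays inside
StringUTF16 rather than on String. The upper-then-lower mapping follows
the behavior documented for String.compareToIgnoreCase().

```java
// Illustrative sketch only; not the webrev code. All names are hypothetical.
class CodePointCompareSketch {

    // Compares two code points case-insensitively. The equality check is
    // folded into the method, as suggested in the review comment, so the
    // case mappings are skipped when the code points already match.
    static int compareCodePointCI(int cp1, int cp2) {
        if (cp1 == cp2) {
            return 0;
        }
        // Map to upper case first, then to lower case if still different.
        cp1 = Character.toUpperCase(cp1);
        cp2 = Character.toUpperCase(cp2);
        if (cp1 != cp2) {
            cp1 = Character.toLowerCase(cp1);
            cp2 = Character.toLowerCase(cp2);
            if (cp1 != cp2) {
                return cp1 - cp2;
            }
        }
        return 0;
    }

    // Walks both strings by code point instead of by 16-bit char.
    // An unpaired surrogate is returned as-is by codePointAt().
    static int compareIgnoreCase(String a, String b) {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            int cp1 = a.codePointAt(i);
            int cp2 = b.codePointAt(j);
            int result = compareCodePointCI(cp1, cp2);
            if (result != 0) {
                return result;
            }
            i += Character.charCount(cp1);
            j += Character.charCount(cp2);
        }
        // Whichever string has chars left over sorts after the other.
        return (a.length() - i) - (b.length() - j);
    }

    public static void main(String[] args) {
        String upper = "\uD801\uDC00"; // U+10400, Deseret capital letter
        String lower = "\uD801\uDC28"; // U+10428, its lower-case form
        System.out.println(compareIgnoreCase(upper, lower));       // prints 0
        System.out.println(compareIgnoreCase("\uD801", "\uD801")); // prints 0
    }
}
```

With the check inside the method, a caller no longer needs its own
cp1 == cp2 test, at the cost of one extra comparison on paths where the
code points are already known to differ.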

Regards,
Joe

From naoto.sato at oracle.com Wed Jul 22 20:37:21 2020
From: naoto.sato at oracle.com (naoto.sato at oracle.com)
Date: Wed, 22 Jul 2020 13:37:21 -0700
Subject: RFR: 8248655: Support supplementary characters in String case insensitive operations
In-Reply-To:
References: <1c9dad7c-bda7-f060-0c97-0bb5f848d0ef@oracle.com> <439b2c80-72d1-92fb-7691-c2a1e59f1aad@oracle.com>
Message-ID:

Hi Joe,

Thank you for the consecutive reviews!

On 7/22/20 1:20 PM, Joe Wang wrote:
> Hi Naoto,
>
> The change looks good to me. "supLower" is indeed super slow :-)
>
> The only minor comment I have is that the compareCodePointCI method
> performs toUpperCase unconditionally. That's not a problem for the
> regular case, where a check on cp1 == cp2 (line 337) is done prior to
> the method call. But for the sup case (starting at line 341), the method
> is called unconditionally while in webrev.04 there was a check "cp1 !=
> cp2".  One option to fix it is to include the "cp1 != cp2" check in the
> method compareCodePointCI, then cp1 == cp2 at line 337 can be omitted.

That was intentional, as at the point when it calls compareCodePointCI()
for the second time, it is guaranteed that the supplementary code points
differ because either their high surrogates or low surrogates differ.

Naoto

>
> Regards,
> Joe

From brent.christian at oracle.com Wed Jul 22 22:13:39 2020
From: brent.christian at oracle.com (Brent Christian)
Date: Wed, 22 Jul 2020 15:13:39 -0700
Subject: RFR: 8248655: Support supplementary characters in String case insensitive operations
In-Reply-To: <439b2c80-72d1-92fb-7691-c2a1e59f1aad@oracle.com>
References: <1c9dad7c-bda7-f060-0c97-0bb5f848d0ef@oracle.com> <439b2c80-72d1-92fb-7691-c2a1e59f1aad@oracle.com>
Message-ID: <2a1dedcb-28a5-a14c-bb7f-52da57c1a381@oracle.com>

Hi, Naoto

The latest changes look good to me.

-Brent

On 7/22/20 10:23 AM, naoto.sato at oracle.com wrote:
> Hi,
>
> I revised the fix again, based on further suggestions:
>
> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.05/
>
> Changes from v.04 are (all in StringUTF16.java):
>
> - The short cut now does case insensitive comparison that makes the fix
> closer to the previous implementation (for BMP characters).
> - Changed the bit operation to negating for detecting needed index
> increment.
> - Method name is changed to better reflect what it is doing, with more
> descriptive comments.
>