<i18n dev> RFR: 8365675: Add String Unicode Case-Folding Support [v6]

Wed Oct 29 21:09:33 UTC 2025

On Wed, 29 Oct 2025 20:41:06 GMT, Roger Riggs <rriggs at openjdk.org> wrote:

>> ### Long:  packing  1:M-count + 1-3 folding codepoints
>> 
>> https://cr.openjdk.org/~sherman/casefolding_long/
>> 
>> The performance is slightly better, but not as good as I would have expected. The access to the codepoint from the long looks a little clumsy, but the logic looks smooth. It needs more work. Opinions?
>> 
>> 
>> Benchmark                                    Mode  Cnt   Score   Error  Units
>> StringCompareToFoldCase.asciiLower           avgt   15  15.487 ± 0.298  ns/op
>> StringCompareToFoldCase.asciiLowerEQ         avgt   15  10.005 ± 0.368  ns/op
>> StringCompareToFoldCase.asciiLowerEQFC       avgt   15  10.755 ± 0.160  ns/op
>> StringCompareToFoldCase.asciiLowerFC         avgt   15  10.349 ± 0.155  ns/op
>> StringCompareToFoldCase.asciiUpperLower      avgt   15  12.188 ± 0.278  ns/op
>> StringCompareToFoldCase.asciiUpperLowerEQ    avgt   15  10.901 ± 0.551  ns/op
>> StringCompareToFoldCase.asciiUpperLowerEQFC  avgt   15   9.218 ± 0.165  ns/op
>> StringCompareToFoldCase.asciiUpperLowerFC    avgt   15   9.335 ± 0.404  ns/op
>> StringCompareToFoldCase.asciiWithDFFC        avgt   15  37.010 ± 0.518  ns/op
>> StringCompareToFoldCase.greekLower           avgt   15  39.572 ± 0.098  ns/op
>> StringCompareToFoldCase.greekLowerEQ         avgt   15  39.317 ± 0.104  ns/op
>> StringCompareToFoldCase.greekLowerEQFC       avgt   15  20.428 ± 0.243  ns/op
>> StringCompareToFoldCase.greekLowerFC         avgt   15  19.623 ± 0.141  ns/op
>> StringCompareToFoldCase.greekUpperLower      avgt   15   7.105 ± 0.048  ns/op
>> StringCompareToFoldCase.greekUpperLowerEQ    avgt   15   7.462 ± 0.092  ns/op
>> StringCompareToFoldCase.greekUpperLowerEQFC  avgt   15   6.518 ± 0.128  ns/op
>> StringCompareToFoldCase.greekUpperLowerFC    avgt   15   6.593 ± 0.240  ns/op
>> StringCompareToFoldCase.latin1UTF16          avgt   15  23.130 ± 0.152  ns/op
>> StringCompareToFoldCase.latin1UTF16EQ        avgt   15  22.606 ± 0.089  ns/op
>> StringCompareToFoldCase.latin1UTF16EQFC      avgt   15  29.574 ± 0.348  ns/op
>> StringCompareToFoldCase.latin1UTF16FC        avgt   15  29.691 ± 0.445  ns/op
>> StringCompareToFoldCase.supLower             avgt   15  55.027 ± 0.676  ns/op
>> StringCompareToFoldCase.supLowerEQ           avgt   15  55.784 ± 0.368  ns/op
>> StringCompareToFoldCase.supLowerEQFC         avgt   15  24.984 ± 0.157  ns/op
>> StringCompareToFoldCase.supLowerFC           avgt   15  24.865 ± 0.139  ns/op
>> StringCompareToFoldCase.supUpperLower        avgt   15  14.538 ± 0.144  ns/op
>> StringCompareToFoldCas...
>
>> Experimenting with Arrays.mismatch at the beginning of the array iteration as
>> ...
>> The benchmark results suggest that it helps 'dramatically' when the compared strings share the same prefix, for example in the "UpperLower" test cases (which share the same upper-case text prefix). However, it is also relatively expensive, with a 20%-ish overhead when the strings do not share the same text but are case-insensitively equal. I would suggest we leave it out for now?
> 
> Ok to leave it out for now.  In similar contexts where System.arraycopy or Arrays.mismatch has some overhead I've suggested doing a simple check (like `size < 8`) to avoid the overhead when the strings/byte arrays are short.
> Thanks for checking.
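
A minimal sketch of the size guard suggested above, not code from the PR. The class name, the method name, and the threshold constant are hypothetical; the value 8 is simply the example figure from the comment. Short inputs are scanned with a plain loop, and only longer ones pay the call overhead of `Arrays.mismatch`.

```java
import java.util.Arrays;

public class GuardedMismatch {
    // Hypothetical threshold; the review comment only gives `size < 8` as an example.
    static final int MISMATCH_THRESHOLD = 8;

    /**
     * Returns the length of the common prefix of a and b, i.e. the index
     * of the first differing byte, or the shorter length if one array is
     * a prefix of the other.
     */
    static int commonPrefixLength(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        if (len < MISMATCH_THRESHOLD) {
            // Short input: a simple loop avoids the call overhead.
            for (int i = 0; i < len; i++) {
                if (a[i] != b[i]) {
                    return i;
                }
            }
            return len;
        }
        // Longer input: let Arrays.mismatch scan the common range.
        int k = Arrays.mismatch(a, 0, len, b, 0, len);
        return k < 0 ? len : k;
    }
}
```

A case-folding comparison could start its per-character loop at the returned index. For UTF-16 encoded strings the caller would still need to align that byte index to a character boundary.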

> The performance is slightly better, but not as good as I would have expected. The access to the codepoint from the long looks a little clumsy, but the logic looks smooth. It needs more work. Opinions?
It does look cleaner without the array indexing in the loops.
Can the counting of characters (fcnt1, fcnt2) be eliminated by encoding 3 20-bit characters into the long and then checking `f1 != 0` to indicate there are more characters? It's a bit of an odd mix of 16-bit characters vs a single 20-bit char. Are there any 20-bit chars from or to folded replacements in the folding mappings?
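
To make the question concrete, here is a rough sketch of the count-free encoding, not the layout used in the webrev. The class and method names are hypothetical. It uses 21 bits per slot rather than 20, since the largest code point, U+10FFFF, needs 21 bits, and three 21-bit slots still fit in the 63 non-sign bits of a long. It also assumes that no folded code point is U+0000, so a zero value can mean "no more characters".

```java
public class PackedFolding {
    static final int BITS = 21;                  // U+10FFFF needs 21 bits
    static final long MASK = (1L << BITS) - 1;   // 0x1FFFFF

    /** Packs 1 to 3 non-zero code points, first code point in the low bits. */
    static long pack(int... cps) {
        if (cps.length < 1 || cps.length > 3) {
            throw new IllegalArgumentException("expected 1 to 3 code points");
        }
        long packed = 0;
        for (int i = cps.length - 1; i >= 0; i--) {
            int cp = cps[i];
            if (cp <= 0 || cp > Character.MAX_CODE_POINT) {
                throw new IllegalArgumentException("invalid code point: " + cp);
            }
            packed = (packed << BITS) | cp;
        }
        return packed;
    }

    /** Reads the code points back without a separate count field. */
    static String unpack(long f) {
        StringBuilder sb = new StringBuilder();
        while (f != 0) {                         // zero means no more code points
            sb.appendCodePoint((int) (f & MASK));
            f >>>= BITS;
        }
        return sb.toString();
    }
}
```

For example, `pack(0x66, 0x66, 0x69)` holds the three-character folding "ffi" of U+FB03, and `pack(0x10428)` holds the supplementary folding target of U+10400. A comparison loop would consume one slot per step and stop when the remaining value is zero.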

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2475481372