RFR: 8365675: Add String Unicode Case-Folding Support [v4]

Thu Oct 23 18:46:38 UTC 2025

On Sun, 19 Oct 2025 09:12:42 GMT, Xueming Shen <sherman at openjdk.org> wrote:

>> ### Summary
>> 
>> Case folding is a key operation for case-insensitive matching (e.g., string equality, regex matching), where the goal is to eliminate case distinctions without applying locale or language specific conversions.
>> 
>> Currently, the JDK does not expose a direct API for Unicode-compliant case folding. Developers instead rely on methods such as:
>> 
>> **String.equalsIgnoreCase(String)**
>> 
>> - Unicode-aware, locale-independent.
>> - Implementation uses Character.toLowerCase(Character.toUpperCase(int)) per code point.
>> - Limited: does not support 1:M mapping defined in Unicode case folding.
>> 
>> **Character.toLowerCase(int) / Character.toUpperCase(int)**
>> 
>> - Locale-independent, single code point only.
>> - No support for 1:M mappings.
>> 
>> **String.toLowerCase(Locale.ROOT) / String.toUpperCase(Locale.ROOT)**
>> 
>> - Based on Unicode SpecialCasing.txt, supports 1:M mappings.
>> - Intended primarily for presentation/display, not structural case-insensitive matching.
>> - Requires full string conversion before comparison, which is less efficient and not intended for structural matching.
>> 
>> **1:M mapping example, U+00DF (ß)**
>> 
>> - "ß".toUpperCase(Locale.ROOT) → "SS"
>> - Case folding produces "ss", matching Unicode caseless comparison rules.
>> 
>> 
>> jshell> "\u00df".equalsIgnoreCase("ss")
>> $22 ==> false
>> 
>> jshell> "\u00df".toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT).equals("ss")
>> $24 ==> true
>> 
>> 
>> ### Motivation & Direction
>> 
>> Add Unicode standard-compliant caseless comparison methods to the String class, enabling reliable and efficient Unicode-compliant case-insensitive matching.
>> 
>> - Unicode-compliant **full** case folding.
>> - Simpler, stable and more efficient case-less matching without workarounds.
>> - Brings Java's string comparison handling in line with other programming languages/libraries.
>> 
>> This PR proposes to introduce the following comparison methods in the `String` class:
>> 
>> - boolean equalsFoldCase(String anotherString)
>> - int compareToFoldCase(String anotherString)
>> - Comparator<String> UNICODE_CASEFOLD_ORDER
>> 
>> These methods are intended to be the preferred choice when Unicode-compliant case-less matching is required.
>> 
>> *Note: An early draft also proposed a String.toCaseFold() method returning a new case-folded string.
>> However, during review this was considered error-prone, as the resulting string could easily be mistaken for a general transformation like toLowerCase() and then pass...
>
> Xueming Shen has updated the pull request incrementally with one additional commit since the last revision:
> 
>   test case update

make/jdk/src/classes/build/tools/generatecharacter/CaseFolding.java line 1:

> 1: /*

Please rename this build tool to avoid ambiguity in the naming of CaseFolding.java.

make/jdk/src/classes/build/tools/generatecharacter/CaseFolding.java line 60:

> 58:                             cp,
> 59:                             Arrays.stream(folding)
> 60:                                     .mapToObj(f -> String.format("0x%04x", f))

Do the parsing and formatting here for each string in fields, skipping the extra array.
Or just use the string from fields[i]; the parse and format seem to be a no-op except for catching file format errors.
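
A rough sketch of the first option, not the build tool's actual code. The class and method names are hypothetical, and it assumes the mapping column holds space-separated hex code points as in CaseFolding.txt (e.g. "0073 0073"):

```java
import java.util.Arrays;
import java.util.stream.Collectors;

class FoldingFieldSketch {
    // Hypothetical helper: turns the raw mapping column of one data line
    // into "0x%04x" literals without building an intermediate int[].
    static String toHexLiterals(String mappingField) {
        return Arrays.stream(mappingField.trim().split("\\s+"))
                // parseInt doubles as the format check: malformed hex throws
                .map(s -> String.format("0x%04x", Integer.parseInt(s, 16)))
                .collect(Collectors.joining(", "));
    }

    public static void main(String[] args) {
        System.out.println(toHexLiterals("0073 0073")); // 0x0073, 0x0073
    }
}
```

The parse step only survives here as validation; if that check is not wanted, the field strings can be emitted with a plain "0x" prefix.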

make/jdk/src/classes/build/tools/generatecharacter/CaseFolding.java line 71:

> 69:                 .map(line -> {
> 70:                     String[] cols = line.split("; ");
> 71:                     return new String[]{cols[0], cols[1], cols[2]};

Seems unnecessary to create a new String array with the same strings.

src/java.base/share/classes/java/lang/String.java line 2237:

> 2235:             return true;
> 2236:         }
> 2237:         if (anotherString == null) {

These checks would be better placed in UNICODE_CASEFOLD_ORDER, where they would cover more cases, including direct use of the comparator.
Or omit the extra null checks (to be more like the existing CaseInsensitiveComparator).
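
For reference, a small sketch of how the existing case-insensitive API treats null today. It uses only current public `String` methods; the class and helper names are made up for the example:

```java
import java.util.Comparator;

class NullHandlingSketch {
    // Returns true if the comparator throws NullPointerException
    // when its second argument is null.
    static boolean throwsOnNull(Comparator<String> cmp) {
        try {
            cmp.compare("a", null);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // equalsIgnoreCase tolerates null and simply reports "not equal"
        System.out.println("a".equalsIgnoreCase(null));                  // false
        // CASE_INSENSITIVE_ORDER has no null check and throws
        System.out.println(throwsOnNull(String.CASE_INSENSITIVE_ORDER)); // true
    }
}
```

Whichever way the new methods go, keeping them consistent with this existing split would be the least surprising.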

src/java.base/share/classes/java/lang/StringLatin1.java line 70:

> 68:         return value[index] & 0xff;
> 69:     }
> 70: 

Seems to be unused and is identical to `getChar`.

src/java.base/share/classes/java/lang/StringLatin1.java line 192:

> 190:         int[] folded2 = null;
> 191:         int k1 = off, k2 = ooff, fk1 = 0, fk2 = 0;
> 192:         while ((k1 < last || folded1 != null && fk1 < folded1.length) &&

The loop needs one fewer comparison if it checks the common length using `int limit = Math.min(last, olast)`.

src/java.base/share/classes/java/lang/StringLatin1.java line 193:

> 191:         int k1 = off, k2 = ooff, fk1 = 0, fk2 = 0;
> 192:         while ((k1 < last || folded1 != null && fk1 < folded1.length) &&
> 193:                (k2 < olast || folded2 != null && fk2 < folded2.length)) {

Suggest starting the slow path after any matching prefix using `ArraySupport.mismatch(byte[], off1, byte[], off2, length)`.  That should have good/best performance for matching sequences.
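
Roughly what I have in mind, as a sketch only. It uses the public `Arrays.mismatch` as a stand-in for the internal `ArraySupport.mismatch`, and the class and method names are hypothetical:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class PrefixSkipSketch {
    // Returns how many leading bytes of the two ranges are identical, i.e.
    // the offset (relative to each range start) where the slow path begins.
    static int commonPrefix(byte[] a, int aOff, int aLast,
                            byte[] b, int bOff, int bLast) {
        // Only the common length needs to be scanned.
        int limit = Math.min(aLast - aOff, bLast - bOff);
        // Range mismatch returns a relative index, or -1 if the ranges match.
        int m = Arrays.mismatch(a, aOff, aOff + limit, b, bOff, bOff + limit);
        return m < 0 ? limit : m;
    }

    public static void main(String[] args) {
        byte[] x = "case folding".getBytes(StandardCharsets.ISO_8859_1);
        byte[] y = "case Folding".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(commonPrefix(x, 0, x.length, y, 0, y.length)); // 5
    }
}
```

Identical bytes fold identically, so the folding loop can start at `off + prefix` and `ooff + prefix` with both pending-fold arrays still null.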

src/java.base/share/classes/java/lang/StringLatin1.java line 217:

> 215:             if (c1 != c2) {
> 216:                 return c1 - c2;
> 217:             }

Compute the difference only once.
Suggestion:

            int diff = c1 - c2;
            if (diff != 0) {
                return diff;
            }

src/java.base/share/classes/java/lang/StringLatin1.java line 228:

> 226:     }
> 227: 
> 228:     static int compareToFC(byte[] value, byte[] other) {

Quirky special cases (like U+00DF) might be worth a description in a single place since there are multiple paths (latin1, UTF16, etc.).

src/java.base/share/classes/java/lang/StringLatin1.java line 229:

> 227: 
> 228:     static int compareToFC(byte[] value, byte[] other) {
> 229:         int len = value.length;

This might also be a good place to use ArraySupport.mismatch to find the first non-matching char.

src/java.base/share/classes/java/lang/StringLatin1.java line 244:

> 242:             }
> 243:             return Character.toLowerCase(c1) - Character.toLowerCase(c2);
> 244: 

Extra blank line.

src/java.base/share/classes/java/lang/StringUTF16.java line 605:

> 603:         int k1 = off, k2 = ooff, fk1 = 0, fk2 = 0;
> 604:         while ((k1 < last || folded1 != null && fk1 < folded1.length) &&
> 605:                (k2 < olast || folded2 != null && fk2 < folded2.length)) {

Use ArraySupport.mismatch to quickly scan past identical sequences. (The byte index will need to be converted to a char index.)
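
A sketch of the index conversion, with several assumptions: it uses the public `Arrays.mismatch` instead of the internal `ArraySupport.mismatch`, it reads the bytes as big-endian UTF-16 for the demo (the real `StringUTF16` layout is platform dependent), and all names are hypothetical:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf16PrefixSkipSketch {
    // Reads the char at char index k from big-endian UTF-16 bytes.
    static char charAt(byte[] v, int k) {
        return (char) (((v[2 * k] & 0xff) << 8) | (v[2 * k + 1] & 0xff));
    }

    // Returns the char index at which a per-code-point slow path should start.
    static int slowPathStart(byte[] a, byte[] b) {
        int limit = Math.min(a.length, b.length) & ~1;   // whole chars only
        int m = Arrays.mismatch(a, 0, limit, b, 0, limit);
        // A byte mismatch may fall in the second byte of a char; >> 1
        // rounds down to the char that contains it.
        int k = (m < 0 ? limit : m) >> 1;
        // Do not start in the middle of a surrogate pair.
        if (k > 0 && Character.isHighSurrogate(charAt(a, k - 1))) {
            k--;
        }
        return k;
    }

    public static void main(String[] args) {
        byte[] x = "abcd".getBytes(StandardCharsets.UTF_16BE);
        byte[] y = "abcD".getBytes(StandardCharsets.UTF_16BE);
        System.out.println(slowPathStart(x, y)); // 3
    }
}
```

The step back over a high surrogate matters because the two strings can share the high surrogate and differ only in the low one.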

src/java.base/share/classes/java/lang/StringUTF16.java line 645:

> 643:         int tlast = length(value);
> 644:         int olast = length(other);
> 645:         int k = 0;

Arrays.mismatch can be used to skip a leading or complete match before starting the per-character code path.

src/java.base/share/classes/jdk/internal/lang/CaseFolding.java.template line 1:

> 1: /*

Significant performance improvements could be had by handling single (simple) char -> char folding separately, avoiding the overhead of iterating over single-character arrays.

src/java.base/share/classes/jdk/internal/lang/CaseFolding.java.template line 26:

> 24:  */
> 25: 
> 26: package jdk.internal.java.lang;

Please use the jdk.internal.lang package. (And adjust GensrcCharacterData.gmk).

src/java.base/share/classes/jdk/internal/lang/CaseFolding.java.template line 57:

> 55:      * @see #fold(int)
> 56:      */
> 57:  	public static boolean isFolded(int cp) {

The name `isFolded` can be confusing: it implies that a mapping is needed, but it means the opposite.
I'd suggest keeping only `isDefined` and perhaps renaming it to `hasFold` or similar.

src/java.base/share/classes/jdk/internal/lang/CaseFolding.java.template line 93:

> 91:         if (entry != null)
> 92:           return entry.folding;
> 93:         return new int[] { cp };

Creating a bunch of small arrays is very wasteful.  Single char to single char should not need an allocation.
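
One possible shape for the split, as a sketch and not a proposal for the exact API. The names are hypothetical, the 1:M table is reduced to U+00DF, and `Character.toLowerCase(Character.toUpperCase(cp))` is only a stand-in for the real simple-folding data:

```java
class FoldNoAllocSketch {
    // Shared, preallocated result for the one 1:M entry in this toy table.
    // Callers must treat it as read-only.
    private static final int[] SHARP_S = {'s', 's'};

    // Returns the shared multi-code-point folding, or null if cp folds 1:1.
    static int[] foldMulti(int cp) {
        return cp == 0x00DF ? SHARP_S : null;
    }

    // Returns the folded code point for the 1:1 case without allocating,
    // or -1 to tell the caller to consult foldMulti.
    static int foldSimple(int cp) {
        if (foldMulti(cp) != null) {
            return -1;
        }
        // Stand-in for a real simple case-folding lookup.
        return Character.toLowerCase(Character.toUpperCase(cp));
    }

    public static void main(String[] args) {
        System.out.println((char) foldSimple('A'));  // a
        System.out.println(foldSimple(0x00DF));      // -1
        System.out.println(foldMulti(0x00DF).length); // 2
    }
}
```

With this shape the common path stays on plain ints, and an array is only touched for the few code points that actually expand.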

src/java.base/share/classes/jdk/internal/lang/CaseFolding.java.template line 131:

> 129:      * @return a {@code String} containing the case-folded form of the input string
> 130:      */
> 131:     public static String fold(String s) {

Save this unused method (and the corresponding tests) for another PR.

test/micro/org/openjdk/bench/java/lang/StringCompareToFoldCase.java line 57:

> 55:     public String supLower = "\ud801\udc28\ud801\udc29\ud801\udc2a\ud801\udc2b\ud801\udc2c";
> 56: 
> 57:     @Benchmark

Also add cases that compare strings to themselves (latin1, utf16) to see how the fastest path compares.

Is there any body of more representative text with more typical intermixing of normal and folded sequences?  Strings with multiple ßß are not typical.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2453279238
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2453205161
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2453214070
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2453295048
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2453302601
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2453347556
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2453321434
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2453338377
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2453369663
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2453366657
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2456338704
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2456403245
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2456423333
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2456634925
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2453283972
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2456506906
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2456514193
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2456529794
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2456658965