RFR: 8365675: Add String Unicode Case-Folding Support [v7]

Fri Nov 7 22:36:11 UTC 2025

On Thu, 30 Oct 2025 02:59:45 GMT, Xueming Shen <sherman at openjdk.org> wrote:

>> ### Summary
>> 
>> Case folding is a key operation for case-insensitive matching (e.g., string equality, regex matching), where the goal is to eliminate case distinctions without applying locale or language specific conversions.
>> 
>> Currently, the JDK does not expose a direct API for Unicode-compliant case folding. Developers now rely on methods such as:
>> 
>> **String.equalsIgnoreCase(String)**
>> 
>> - Unicode-aware, locale-independent.
>> - Implementation uses Character.toLowerCase(Character.toUpperCase(int)) per code point.
>> - Limited: does not support 1:M mapping defined in Unicode case folding.
>> 
>> **Character.toLowerCase(int) / Character.toUpperCase(int)**
>> 
>> - Locale-independent, single code point only.
>> - No support for 1:M mappings.
>> 
>> **String.toLowerCase(Locale.ROOT) / String.toUpperCase(Locale.ROOT)**
>> 
>> - Based on Unicode SpecialCasing.txt, supports 1:M mappings.
>> - Intended primarily for presentation/display, not structural case-insensitive matching.
>> - Requires full string conversion before comparison, which is less efficient.
>> 
>> **1:M mapping example, U+00DF (ß)**
>> 
>> - "ß".toUpperCase(Locale.ROOT) → "SS"
>> - Case folding produces "ss", matching Unicode caseless comparison rules.
>> 
>> 
>> jshell> "\u00df".equalsIgnoreCase("ss")
>> $22 ==> false
>> 
>> jshell> "\u00df".toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT).equals("ss")
>> $24 ==> true
>> 
>> 
>> ### Motivation & Direction
>> 
>> Add Unicode standard-compliant case-less comparison methods to the String class, enabling reliable and efficient Unicode-compliant case-insensitive matching.
>> 
>> - Unicode-compliant **full** case folding.
>> - Simpler, stable and more efficient case-less matching without workarounds.
>> - Brings Java's string comparison handling in line with other programming languages/libraries.
>> 
>> This PR proposes to introduce the following comparison methods in the `String` class:
>> 
>> - boolean equalsFoldCase(String anotherString)
>> - int compareToFoldCase(String anotherString)
>> - Comparator<String> UNICODE_CASEFOLD_ORDER
>> 
>> These methods are intended to be the preferred choice when Unicode-compliant case-less matching is required.
>> 
>> *Note: An early draft also proposed a String.toCaseFold() method returning a new case-folded string.
>> However, during review this was considered error-prone, as the resulting string could easily be mistaken for a general transformation like toLowerCase() and then pass...
>
> Xueming Shen has updated the pull request incrementally with one additional commit since the last revision:
> 
>   update to use value long for folding

Looking good.
I'll look at the javadoc again when the CSR comments are addressed.

src/java.base/share/classes/jdk/internal/lang/CaseFolding.java.template line 53:

> 51:     public static boolean isDefined(int cp) {
> 52:          return getDefined(cp) != -1;
> 53:      }

Extra space in the indentation of lines 52 and 53.

src/java.base/share/classes/jdk/internal/lang/CaseFolding.java.template line 107:

> 105:     * family may appears independently or within a class.
> 106:     * <p>
> 107:     * For loose/case-insensitive matching, the back-refs, slices and singles apply {code toUpperCase} and

Missing at-sign in markup:
Suggestion:

    * For loose/case-insensitive matching, the back-refs, slices and singles apply {@code toUpperCase} and

src/java.base/share/classes/jdk/internal/lang/CaseFolding.java.template line 136:

> 134:     *
> 135:     * <p>
> 136:     * @spec https://www.unicode.org/reports/tr18/#Simple_Loose_Matches

I'd put @spec after @return.

src/java.base/share/classes/jdk/internal/lang/CaseFolding.java.template line 152:

> 150:                 }
> 151:             }
> 152:         }

If expanded_case_cps were sorted, Arrays.binarySearch could be used to find the index of the first character in the range,
and the loop could break as soon as cp passes the end of the range.
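
To illustrate the idea, here is a minimal sketch, not the code in the PR. The class name, the sample array, and the inRange helper are hypothetical; it assumes the code points are sorted ascending and distinct. Arrays.binarySearch returns (-(insertion point) - 1) when the key is absent, so the insertion point gives the first entry at or after the start of the range.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SortedRangeScan {
    // Hypothetical stand-in for expanded_case_cps: sorted ascending, no duplicates.
    static final int[] SORTED_CPS = {0x00DF, 0x0130, 0x0149, 0x01F0, 0x0390, 0x1E96, 0xFB00};

    // Collects the entries of sortedCps that lie in the inclusive range [start, end].
    static List<Integer> inRange(int[] sortedCps, int start, int end) {
        int i = Arrays.binarySearch(sortedCps, start);
        if (i < 0) {
            i = -i - 1; // key absent: insertion point is the first entry greater than start
        }
        List<Integer> hits = new ArrayList<>();
        // Stop at the first entry past the end of the range instead of scanning the whole array.
        for (; i < sortedCps.length && sortedCps[i] <= end; i++) {
            hits.add(sortedCps[i]);
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(inRange(SORTED_CPS, 0x0100, 0x0200)); // prints [304, 329, 496]
    }
}
```

This makes the cost per range O(log n) plus the number of hits, rather than a full pass over the array.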

src/java.base/share/classes/jdk/internal/lang/CaseFolding.java.template line 163:

> 161:       .stream()
> 162:       .mapToInt(Integer::intValue)
> 163:       .toArray();

It might be worthwhile to sort these to enable a quicker break when the last one in the range is seen.

src/java.base/share/classes/jdk/internal/lang/CaseFolding.java.template line 169:

> 167:     private static final int HASH_NEXT = 2;
> 168: 
> 169:     private static int[][] hashKeys(int[] keys) {

It may be worthwhile to round up the hash modulus to a prime number to avoid unfortunate hash collisions.
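
A rough sketch of what I have in mind, under the assumption that the table index is computed as key modulo table size; the class and method names below are hypothetical and not taken from the template. With a prime modulus, keys that share a common stride (for example, code points spaced at a power of two) are less likely to pile into the same few buckets.

```java
final class PrimeTableSize {
    // Trial-division primality test; adequate for the small table sizes involved here.
    static boolean isPrime(int n) {
        if (n < 2) return false;
        if (n % 2 == 0) return n == 2;
        for (int d = 3; (long) d * d <= n; d += 2) {
            if (n % d == 0) return false;
        }
        return true;
    }

    // Returns the smallest prime greater than or equal to n.
    static int nextPrime(int n) {
        int p = Math.max(n, 2);
        while (!isPrime(p)) {
            p++;
        }
        return p;
    }

    // Hypothetical bucket computation: non-negative remainder of the key by the table size.
    static int bucket(int key, int tableSize) {
        return Math.floorMod(key, tableSize);
    }

    public static void main(String[] args) {
        int size = nextPrime(1024);
        System.out.println(size); // prints 1031
        System.out.println(bucket(0x00DF, size));
    }
}
```

The table grows only slightly (1024 to 1031 in this example), so the memory cost is negligible.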

test/jdk/java/lang/String/UnicodeCaseFoldingTest.java line 31:

> 29:  * @compile --add-exports java.base/jdk.internal.lang=ALL-UNNAMED
> 30:  * UnicodeCaseFoldingTest.java
> 31:  * @run junit/othervm --add-exports java.base/jdk.internal.lang=ALL-UNNAMED

The @modules directive can replace the explicit --add-exports options, and the explicit @compile may then be unnecessary.

* @modules java.base/jdk.internal.lang:+open

-------------

PR Review: https://git.openjdk.org/jdk/pull/27628#pullrequestreview-3436511645
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2505610221
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2505623056
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2505629880
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2505705459
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2505699712
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2505714395
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2505728277