RFR: 8365675: Add String Unicode Case-Folding Support

Xueming Shen sherman at openjdk.org
Fri Oct 3 19:10:20 UTC 2025


### Summary

Case folding is a key operation for case-insensitive matching (e.g., string equality, regex matching), where the goal is to eliminate case distinctions without applying locale- or language-specific conversions.

Currently, the JDK does not expose a direct API for Unicode-compliant case folding. Developers instead rely on methods such as:

**String.equalsIgnoreCase(String)**

- Unicode-aware, locale-independent.
- Implementation uses Character.toLowerCase(Character.toUpperCase(int)) per code point.
- Limited: does not support 1:M mapping defined in Unicode case folding.

**Character.toLowerCase(int) / Character.toUpperCase(int)**

- Locale-independent, single code point only.
- No support for 1:M mappings.

**String.toLowerCase(Locale.ROOT) / String.toUpperCase(Locale.ROOT)**

- Based on Unicode SpecialCasing.txt, supports 1:M mappings.
- Intended primarily for presentation/display, not structural case-insensitive matching.
- Requires converting both strings in full before comparison, which is less efficient than comparing in place.

**1:M mapping example, U+00DF (ß)**

- "ß".toUpperCase(Locale.ROOT) → "SS"
- Case folding produces "ss", matching Unicode caseless comparison rules.


jshell> "\u00df".equalsIgnoreCase("ss")
$22 ==> false

jshell> "\u00df".toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT).equals("ss")
$24 ==> true
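
The behavior of the existing APIs described above can be reproduced outside jshell with a small program. This is only a sketch using methods that already exist in the JDK; the class name `ExistingCaseApis` and the helpers `normalize` and `looseEquals` are hypothetical names introduced for illustration, and the upper-then-lower round trip is a workaround that approximates case folding, not an implementation of it.

```java
import java.util.Locale;

class ExistingCaseApis {
    // Hypothetical helper: upper-case, then lower-case, the whole string.
    // Approximates caseless normalization with existing APIs only.
    static String normalize(String s) {
        return s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }

    // Hypothetical helper: caseless equality via two full string conversions.
    static boolean looseEquals(String a, String b) {
        return normalize(a).equals(normalize(b));
    }

    public static void main(String[] args) {
        String sharpS = "\u00df"; // ß

        // Single code point mapping cannot expand ß, so it is returned unchanged.
        System.out.println(Character.toUpperCase(0x00DF) == 0x00DF); // true

        // Per-code-point comparison never sees the 1:M mapping.
        System.out.println(sharpS.equalsIgnoreCase("ss"));           // false

        // Full string conversion applies the 1:M mapping.
        System.out.println(sharpS.toUpperCase(Locale.ROOT));         // SS

        // The workaround matches, at the cost of allocating converted copies.
        System.out.println(looseEquals("Ma\u00dfe", "MASSE"));       // true
    }
}
```

Each call to `looseEquals` allocates up to four intermediate strings, which is the overhead a direct comparison method would avoid.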


### Motivation & Direction

Add Unicode Standard-compliant caseless comparison methods to the String class, enabling reliable and efficient Unicode-compliant case-insensitive matching.

- Unicode-compliant **full** case folding.
- Simpler, stable and more efficient case-less matching without workarounds.
- Brings Java's string comparison handling in line with other programming languages/libraries.

This PR proposes to introduce the following comparison methods in the `String` class:

- boolean equalsFoldCase(String anotherString)
- int compareToFoldCase(String anotherString)
- Comparator<String> UNICODE_CASEFOLD_ORDER

These methods are intended to be the preferred choice when Unicode-compliant case-less matching is required.

*Note: An early draft also proposed a String.toCaseFold() method returning a new case-folded string. During review this was considered error-prone, because the resulting string could easily be mistaken for the result of a general transformation such as toLowerCase() and then passed to APIs where case-folding semantics are not appropriate.*

### The New API


    /**
     * Compares this {@code String} to another {@code String} for equality,
     * using <em>Unicode case folding</em>. Two strings are considered equal
     * by this method if their case-folded forms are identical.
     * <p>
     * Case folding is defined by the Unicode Standard in
     * <a href="https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt">CaseFolding.txt</a>,
     * including 1:M mappings. For example, {@code "Maße".equalsFoldCase("MASSE")}
     * returns {@code true}, since the character {@code U+00DF} (sharp s) folds
     * to {@code "ss"}.
     * <p>
     * Case folding is locale-independent and language-neutral, unlike
     * locale-sensitive transformations such as {@link #toLowerCase()} or
     * {@link #toUpperCase()}. It is intended for caseless matching,
     * searching, and indexing.
     *
     * @apiNote
     * This method is the Unicode-compliant alternative to
     * {@link #equalsIgnoreCase(String)}. It implements full case folding as
     * defined by the Unicode Standard, which may differ from the simpler
     * per-character mapping performed by {@code equalsIgnoreCase}.
     * For example:
     * <pre>{@snippet lang=java :
     * String a = "Maße";
     * String b = "MASSE";
     * boolean equalsFoldCase = a.equalsFoldCase(b);       // returns true
     * boolean equalsIgnoreCase = a.equalsIgnoreCase(b);   // returns false
     * }</pre>
     *
     * @param  anotherString
     *         The {@code String} to compare this {@code String} against
     *
     * @return  {@code true} if the given object is not {@code null} and represents
     *          the same sequence of characters as this string under Unicode case
     *          folding; {@code false} otherwise.
     *
     * @see     #compareToFoldCase(String)
     * @see     #equalsIgnoreCase(String)
     * @since   26
     */
    public boolean equalsFoldCase(String anotherString)

    /**
     * Compares two strings lexicographically using <em>Unicode case folding</em>.
     * This method returns an integer whose sign is that of calling {@code compareTo}
     * on the Unicode case-folded version of the strings. Unicode case folding
     * eliminates differences in case according to the Unicode Standard, using the
     * mappings defined in
     * <a href="https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt">CaseFolding.txt</a>,
     * including 1:M mappings, such as {@code "ß"} → {@code "ss"}.
     * <p>
     * Case folding is a locale-independent, language-neutral form of case mapping,
     * primarily intended for caseless matching. Unlike {@link #compareToIgnoreCase(String)},
     * which applies a simpler locale-insensitive uppercase mapping, this method
     * follows Unicode <em>full</em> case folding, providing stable and
     * consistent results across all environments.
     * <p>
     * Note that this method does <em>not</em> take locale into account, and may
     * produce results that differ from locale-sensitive ordering. Use
     * {@link java.text.Collator} for locale-sensitive comparison.
     *
     * @apiNote
     * This method is the Unicode-compliant alternative to
     * {@link #compareToIgnoreCase(String)}. It implements the <em>full</em> case folding
     * as defined by the Unicode Standard, which may differ from the simpler
     * per-character mapping performed by {@code compareToIgnoreCase}.
     * For example:
     * <pre>{@snippet lang=java :
     * String a = "Maße";
     * String b = "MASSE";
     * int cmpFoldCase = a.compareToFoldCase(b);     // returns 0
     * int cmpIgnoreCase = a.compareToIgnoreCase(b); // returns > 0
     * }</pre>
     *
     * @param   str   the {@code String} to be compared.
     * @return  a negative integer, zero, or a positive integer as the specified
     *          String is greater than, equal to, or less than this String,
     *          ignoring case considerations by case folding.
     * @see     #equalsFoldCase(String)
     * @see     #compareToIgnoreCase(String)
     * @see     java.text.Collator
     * @since   26
     */
    public int compareToFoldCase(String str) 

    /**
     * A Comparator that orders {@code String} objects as by
     * {@link #compareToFoldCase(String) compareToFoldCase()}.
     *
     * @see     #compareToFoldCase(String)
     * @since   26
     */
    public static final Comparator<String> UNICODE_CASEFOLD_ORDER;
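
To make the documented semantics concrete ("equal if their case-folded forms are identical"), here is a minimal sketch that can be run on current JDKs. It is not the implementation in this PR: the class `FoldSketch` and its methods are hypothetical, the fold table holds only four of the full (F) mappings from CaseFolding.txt, and all other code points fall back to `Character.toLowerCase(Character.toUpperCase(cp))`, which is assumed here as a stand-in for the simple folds and is not exact for every code point.

```java
import java.util.Map;

class FoldSketch {
    // Tiny, incomplete excerpt of the full (F) mappings in CaseFolding.txt.
    private static final Map<Integer, String> FULL_FOLDS = Map.of(
        0x00DF, "ss",        // LATIN SMALL LETTER SHARP S
        0x1E9E, "ss",        // LATIN CAPITAL LETTER SHARP S
        0xFB01, "fi",        // LATIN SMALL LIGATURE FI
        0x0149, "\u02BCn");  // LATIN SMALL LETTER N PRECEDED BY APOSTROPHE

    // Builds the folded form code point by code point (may grow in length).
    static String fold(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> {
            String full = FULL_FOLDS.get(cp);
            if (full != null) {
                sb.append(full); // 1:M mapping
            } else {
                // Assumed stand-in for the simple folds.
                sb.appendCodePoint(Character.toLowerCase(Character.toUpperCase(cp)));
            }
        });
        return sb.toString();
    }

    // Mirrors the documented contract: false for null, else compare folded forms.
    static boolean equalsFold(String a, String b) {
        return b != null && fold(a).equals(fold(b));
    }

    // Sign follows compareTo on the folded forms.
    static int compareFold(String a, String b) {
        return fold(a).compareTo(fold(b));
    }

    public static void main(String[] args) {
        System.out.println(equalsFold("Ma\u00dfe", "MASSE"));  // true
        System.out.println(compareFold("Ma\u00dfe", "MASSE")); // 0
        System.out.println(fold("\uFB01sh"));                  // fish
    }
}
```

The sketch materializes both folded strings for clarity; a production implementation would more likely fold and compare incrementally to avoid those allocations.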



### Usage Examples

Sharp s (U+00DF) case-folds to "ss":


    "straße".equalsIgnoreCase("strasse");             // false
    "straße".compareToIgnoreCase("strasse");          // != 0
    "straße".equalsFoldCase("strasse");               // true


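The comparator form matters most for sorted collections and de-duplication. Since `UNICODE_CASEFOLD_ORDER` is not available on released JDKs, the sketch below uses a hypothetical stand-in comparator, `FOLD_LIKE_ORDER`, built from the upper-then-lower workaround; it only approximates the proposed ordering and is contrasted with the existing `String.CASE_INSENSITIVE_ORDER`. The class and method names are illustrative assumptions.

```java
import java.util.Comparator;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

class CaselessSetDemo {
    // Hypothetical stand-in for the proposed String.UNICODE_CASEFOLD_ORDER;
    // an approximation built from existing APIs, not the PR's implementation.
    static final Comparator<String> FOLD_LIKE_ORDER =
        Comparator.comparing(s -> s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT));

    // Counts how many entries a TreeSet keeps under the given ordering.
    static int distinctCount(Comparator<String> order, String... words) {
        Set<String> set = new TreeSet<>(order);
        for (String w : words) {
            set.add(w);
        }
        return set.size();
    }

    public static void main(String[] args) {
        String[] words = {"stra\u00dfe", "STRASSE", "Strasse"};

        // Per-character comparison treats "straße" as distinct from "STRASSE".
        System.out.println(distinctCount(String.CASE_INSENSITIVE_ORDER, words)); // 2

        // With 1:M mappings applied, all three spellings collapse to one key.
        System.out.println(distinctCount(FOLD_LIKE_ORDER, words));               // 1
    }
}
```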

### Performance

The JMH microbenchmark StringCompareToIgnoreCase has been updated to compare the performance of compareToFoldCase() with the existing compareToIgnoreCase().


Benchmark                                         Mode  Cnt   Score   Error  Units
StringCompareToIgnoreCase.asciiGreekLower         avgt   15  20.195 ± 0.300  ns/op
StringCompareToIgnoreCase.asciiGreekLowerCF       avgt   15  11.051 ± 0.254  ns/op
StringCompareToIgnoreCase.asciiGreekUpperLower    avgt   15   6.035 ± 0.047  ns/op
StringCompareToIgnoreCase.asciiGreekUpperLowerCF  avgt   15  14.786 ± 0.382  ns/op
StringCompareToIgnoreCase.asciiLower              avgt   15  17.688 ± 1.396  ns/op
StringCompareToIgnoreCase.asciiLowerCF            avgt   15  44.552 ± 0.155  ns/op
StringCompareToIgnoreCase.asciiUpperLower         avgt   15  13.069 ± 0.487  ns/op
StringCompareToIgnoreCase.asciiUpperLowerCF       avgt   15  58.684 ± 0.274  ns/op
StringCompareToIgnoreCase.greekLower              avgt   15  20.642 ± 0.082  ns/op
StringCompareToIgnoreCase.greekLowerCF            avgt   15   7.255 ± 0.271  ns/op
StringCompareToIgnoreCase.greekUpperLower         avgt   15   5.737 ± 0.013  ns/op
StringCompareToIgnoreCase.greekUpperLowerCF       avgt   15  11.100 ± 1.147  ns/op
StringCompareToIgnoreCase.lower                   avgt   15  20.192 ± 0.044  ns/op
StringCompareToIgnoreCase.lowerrCF                avgt   15  11.257 ± 0.259  ns/op
StringCompareToIgnoreCase.supLower                avgt   15  54.801 ± 0.415  ns/op
StringCompareToIgnoreCase.supLowerCF              avgt   15  15.207 ± 0.418  ns/op
StringCompareToIgnoreCase.supUpperLower           avgt   15  14.431 ± 0.188  ns/op
StringCompareToIgnoreCase.supUpperLowerCF         avgt   15  19.149 ± 0.985  ns/op
StringCompareToIgnoreCase.upperLower              avgt   15   5.650 ± 0.051  ns/op
StringCompareToIgnoreCase.upperLowerCF            avgt   15  14.338 ± 0.352  ns/op
StringCompareToIgnoreCase.utf16SubLower           avgt   15  14.774 ± 0.200  ns/op
StringCompareToIgnoreCase.utf16SubLowerCF         avgt   15   2.669 ± 0.041  ns/op
StringCompareToIgnoreCase.utf16SupUpperLower      avgt   15  16.250 ± 0.099  ns/op
StringCompareToIgnoreCase.utf16SupUpperLowerCF    avgt   15  11.524 ± 0.327  ns/op



### Refs

- [Unicode Standard 5.18.4 Caseless Matching](https://www.unicode.org/versions/latest/core-spec/chapter-5/#G21790)
- [Unicode® Standard Annex #44: 5.6 Case and Case Mapping](https://www.unicode.org/reports/tr44/#Casemapping)
- [Unicode Technical Standard #18: Unicode Regular Expressions RL1.5: Simple Loose Matches](https://www.unicode.org/reports/tr18/#Simple_Loose_Matches)
- [Unicode SpecialCasing.txt](https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt)
- [Unicode CaseFolding.txt](https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt)

### Other Languages

**Python str.casefold()**

The str.casefold() method in Python returns a casefolded version of a string. Casefolding is a more aggressive form of lowercasing, designed to remove all case distinctions in a string, particularly for the purpose of caseless string comparisons.

**Perl’s fc()**

Returns the casefolded version of EXPR. This is the internal function implementing the \F escape in double-quoted strings.
Casefolding is the process of mapping strings to a form where case differences are erased; comparing two strings in their casefolded form is effectively a way of asking if two strings are equal, regardless of case.
Perl only implements the full form of casefolding, but you can access the simple folds using "casefold()" in Unicode::UCD and "prop_invmap()" in Unicode::UCD.

**ICU4J UCharacter.foldCase (Java)** 

- Purpose: Provides extensions to the standard Java Character class, including support for more Unicode properties and handling of supplementary characters (code points beyond U+FFFF).
- Method Signature (String based): public static String foldCase(String str, int options)
- Method Signature (CharSequence & Appendable based): public static <A extends Appendable> A foldCase(CharSequence src, A dest, int options, Edits edits)
- Key Features:
  - Case Folding: Converts a string to its case-folded equivalent.
  - Locale Independent: Case folding in UCharacter.foldCase is generally not dependent on locale settings.
  - Context Insensitive: The mapping of a character is not affected by surrounding characters.
  - Turkic Option: An option exists to include or exclude special mappings for Turkish/Azerbaijani text.
  - Result Length: The resulting string can be longer or shorter than the original.
  - Edits Recording: Allows for recording of edits for index mapping, styled text, and getting only changes.

**u_strFoldCase (C/C++)**

- A lower-level C API function for case folding a string.
- Case Folding Options: Similar options as UCharacter.foldCase for controlling case folding behavior.
- Availability: Found in the ustring.h and unistr.h headers in the ICU4C library.

-------------

Commit messages:
 - 8365675: Add String Unicode Case-Folding Support

Changes: https://git.openjdk.org/jdk/pull/26892/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26892&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8365675
  Stats: 1279 lines in 12 files changed: 1116 ins; 137 del; 26 mod
  Patch: https://git.openjdk.org/jdk/pull/26892.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/26892/head:pull/26892

PR: https://git.openjdk.org/jdk/pull/26892

