RFR: 8360459: UNICODE_CASE and character class with non-ASCII range does not match ASCII char
Chen Liang
liach at openjdk.org
Mon Jul 14 05:12:40 UTC 2025
On Mon, 14 Jul 2025 04:53:13 GMT, Xueming Shen <sherman at openjdk.org> wrote:
> Regex class should conform to **_Level 1_** of [Unicode Technical Standard #18: Unicode Regular Expressions](http://www.unicode.org/reports/tr18/), plus RL2.1 Canonical Equivalents and RL2.2 Extended Grapheme Clusters.
>
> This PR primarily addresses conformance with RL1.5: Simple Loose Matches, which requires that simple case folding be applied to literals and (optionally) to character classes. When applied to character classes, each class is expected to **be closed under simple case folding**. See the standard for a detailed explanation and example of what it means for a class to be "closed".
>
> **RL1.5 states**:
>
> To meet this requirement, an implementation that supports case-insensitive matching should
>
> 1. Provide at least the simple, default Unicode case-insensitive matching, and
> 2. Specify which character properties or constructs are closed under the matching.
>
> **In the Pattern implementation**, 5 types of constructs may be affected by case sensitivity:
>
> 1. back-refs,
> 2. string slices (sequences),
> 3. single characters,
> 4. character families (Unicode Properties ...), and
> 5. character class ranges
>
> **Note**: Single characters and families may appear independently or within a character class.
>
> For case-insensitive (loose) matching, the implementation already applies Character.toUpperCase() and Character.toLowerCase() to **both the pattern and the input string** for back-refs, slices, and single characters. This effectively makes these constructs closed under case folding.
>
> This has been verified in the newly added test case **_test/jdk/java/util/regex/CaseFoldingTest.java_**.
>
> For example:
>
> Pattern.compile("(?ui)\u017f").matcher("S").matches() => true
> Pattern.compile("(?ui)[\u017f]").matcher("S").matches() => true
>
> The character properties (families) are not "closed" and should remain unchanged. This is acceptable per RL1.5, if the behavior is clearly specified (TBD: update javadoc to reflect this).
>
> **Current Non-Conformance: Character Class Ranges**, as reported in the original bug report.
>
> Pattern.compile("(?ui)[\u017f-\u017f]").matcher("S").matches() => false
> vs
> Pattern.compile("(?ui)[S-S]")....
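The quoted description says that back-refs, slices, and single characters are already compared after applying `Character.toUpperCase()` and `Character.toLowerCase()` to both sides. A minimal sketch of that idea follows; the class and the helpers `fold` and `looseEquals` are hypothetical names for illustration, not the actual `Pattern` internals.

```java
// Illustrative sketch only: not the java.util.regex.Pattern implementation.
final class LooseMatchSketch {

    // Hypothetical helper: normalize a code point by upper-casing, then lower-casing.
    static int fold(int cp) {
        return Character.toLowerCase(Character.toUpperCase(cp));
    }

    // Two code points match loosely when they normalize to the same code point.
    static boolean looseEquals(int patternCp, int inputCp) {
        return fold(patternCp) == fold(inputCp);
    }

    public static void main(String[] args) {
        // U+017F LATIN SMALL LETTER LONG S against ASCII 'S'
        System.out.println(looseEquals(0x017F, 'S'));
        // ASCII 'k' against U+212A KELVIN SIGN
        System.out.println(looseEquals('k', 0x212A));
    }
}
```

Applying the same normalization to both the pattern side and the input side is what makes the comparison symmetric, which is the property the description calls "closed".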
make/jdk/src/classes/build/tools/generatecharacter/CaseFolding.java line 45:
> 43: var caseFoldingTxt = Paths.get(args[1]);
> 44: var genSrcFile = Paths.get(args[2]);
> 45: var supportedTypes = "^.*; [CTS]; .*$";
Do we still need T here given you already have a hardcoded special case?
make/jdk/src/classes/build/tools/generatecharacter/CaseFolding.java line 60:
> 58: .map(cols -> String.format(" entry(0x%s, 0x%s),", cols[0], cols[2]))
> 59: .collect(Collectors.joining("\n"))
> 60: .replaceFirst(",$", ""); // remove the last ','
Suggestion:
.map(cols -> String.format(" entry(0x%s, 0x%s)", cols[0], cols[2]))
.collect(Collectors.joining(",\n", "", "\n")); // the delimiter form leaves no trailing ','
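As a rough illustration of why the delimiter form needs no cleanup step, here is a self-contained sketch. The class name, the `joinEntries` helper, and the sample rows are assumptions made for the example; only the `map`/`collect` calls mirror the suggestion.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch only: sample rows stand in for parsed CaseFolding.txt columns.
final class JoinEntriesSketch {

    // Hypothetical helper: cols[0] is the code point, cols[2] is its folding.
    static String joinEntries(List<String[]> rows) {
        return rows.stream()
            .map(cols -> String.format("        entry(0x%s, 0x%s)", cols[0], cols[2]))
            // The delimiter is placed only between elements, so no trailing ',' is produced.
            .collect(Collectors.joining(",\n", "", "\n"));
    }

    public static void main(String[] args) {
        List<String[]> rows = List.of(
            new String[] {"0041", "C", "0061"},
            new String[] {"017F", "C", "0073"});
        System.out.print(joinEntries(rows));
    }
}
```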
make/jdk/src/classes/build/tools/generatecharacter/CaseFolding.java line 74:
> 72: StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING);
> 73: } catch (IOException e) {
> 74: e.printStackTrace();
I recommend removing this catch and adding `throws Throwable` to the signature of `main`.
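A minimal sketch of that shape, assuming a hypothetical `generate` helper and an output path passed as the first argument (neither is taken from the actual tool): with no catch block, an `IOException` propagates out of `main`, so the JVM exits with a non-zero status and the build step fails instead of printing a stack trace and continuing.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Illustrative sketch only: not the actual build tool.
final class GeneratorMainSketch {

    // Hypothetical helper: writes the generated lines, letting any IOException propagate.
    static void generate(Path target, List<String> lines) throws IOException {
        Files.write(target, lines,
            StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING);
    }

    // No try/catch here: an uncaught exception terminates the tool with a failure status.
    public static void main(String[] args) throws Throwable {
        generate(Paths.get(args[0]), List.of("// generated content"));
    }
}
```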
src/java.base/share/classes/jdk/internal/util/regex/CaseFolding.java.template line 36:
> 34: public final class CaseFolding {
> 35:
> 36: private static Map<Integer, Integer> expanded_casefolding = Map.ofEntries(
Suggestion:
private static final Map<Integer, Integer> expanded_casefolding = Map.ofEntries(
src/java.base/share/classes/jdk/internal/util/regex/CaseFolding.java.template line 99:
> 97: */
> 98: public static int[] getClassRangeClosingCharacters(int start, int end) {
> 99: int[] expanded = new int[expanded_casefolding.size()];
This can be `Math.min(expanded_casefolding.size(), end - start)` in case the table grows large; the `off < expanded.length` check below should be updated too.
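A rough sketch of the bounded buffer, under stated assumptions: the table contents, the class name, and the loop body are hypothetical stand-ins rather than the template's code, and `end` is treated as inclusive, so the bound used here is `end - start + 1` (with an exclusive `end` it would be `end - start`).

```java
import java.util.Arrays;
import java.util.Map;

// Illustrative sketch only: a tiny stand-in table, not the generated one.
final class RangeClosingSketch {

    private static final Map<Integer, Integer> TABLE = Map.of(
        0x017F, 0x0073,   // LATIN SMALL LETTER LONG S -> s
        0x212A, 0x006B,   // KELVIN SIGN -> k
        0x00B5, 0x03BC);  // MICRO SIGN -> GREEK SMALL LETTER MU

    // Assumes 0 <= start <= end, with end inclusive.
    static int[] closingCharacters(int start, int end) {
        // At most one result per table entry and at most one per code point in the range.
        int capacity = (int) Math.min((long) TABLE.size(), (long) end - start + 1);
        int[] expanded = new int[capacity];
        int off = 0;
        for (var e : TABLE.entrySet()) {
            int cp = e.getKey();
            if (cp >= start && cp <= end && off < expanded.length) {
                expanded[off++] = e.getValue();
            }
        }
        // Trim to the number of entries actually found.
        return Arrays.copyOf(expanded, off);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(closingCharacters(0x017F, 0x017F)));
    }
}
```

The iteration order of `Map.of` is unspecified, so callers of a method like this should not rely on the order of the returned code points.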
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/26285#discussion_r2203858280
PR Review Comment: https://git.openjdk.org/jdk/pull/26285#discussion_r2203854636
PR Review Comment: https://git.openjdk.org/jdk/pull/26285#discussion_r2203852720
PR Review Comment: https://git.openjdk.org/jdk/pull/26285#discussion_r2203850027
PR Review Comment: https://git.openjdk.org/jdk/pull/26285#discussion_r2203851719