<i18n dev> RFR: 8363972: Loose matching of dash/minusSign in number parsing [v6]

Fri Aug 1 23:25:57 UTC 2025

On Fri, 1 Aug 2025 22:19:22 GMT, Justin Lu <jlu at openjdk.org> wrote:

>> Naoto Sato has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 16 commits:
>> 
>>  - Merge branch 'master' into JDK-8363972-Loose-matching-dash
>>  - Spec update
>>  - Supplementary/CanonEq tests
>>  - flipped again, which was correct
>>  - flipped the size check
>>  - Address review comments
>>  - Merge branch 'master' into JDK-8363972-Loose-matching-dash
>>  - tidying up
>>  - test location
>>  - spec update
>>  - ... and 6 more: https://git.openjdk.org/jdk/compare/8e921aee...3682484d
>
> src/java.base/share/classes/java/text/DecimalFormat.java line 3543:
> 
>> 3541:             var lmsp = Pattern.compile("[" + lms + "]", Pattern.CANON_EQ);
>> 3542:             var a = lmsp.matcher(affix).replaceAll("-");
>> 3543:             var t = lmsp.matcher(text.substring(position, Math.min(tlen, position + alen)))
> 
> I don't think this works. If you see my other comment regarding the test case and swap it such that `lenientMinusSign` is "-\u00E4" and the pattern (affix) includes "a\u0308" and we parse some text such as "\u00E41.5".
> 
> Then we have the following,
> 
> 
> var lmsp = Pattern.compile("[" + lms + "]", Pattern.CANON_EQ); // Returns pattern [-ä]
> var a = lmsp.matcher(affix).replaceAll("-"); // Gets correctly normalized to "-"
> // Incorrectly matches against "ä1" and normalized to "-1" since the substring returned is indexed from 0 to 2.
> var t = lmsp.matcher(text.substring(position, Math.min(tlen, position + alen)))
> 
> 
> That is, I think we need to substring after we have performed the normalization. Something such as,
> 
> 
> var lmsp = Pattern.compile("[" + lms + "]", Pattern.CANON_EQ);
> var a = lmsp.matcher(affix).replaceAll("-");
> var t = lmsp.matcher(text).replaceAll("-");
> return t.startsWith(a, position);
> 
> 
> However, we will still run into issues later in DecimalFormat, as the position is incremented by the original prefix length.
> 
> 
>         } else if (gotNegative) {
>             position += negativePrefix.length();
> 
> 
> So for "ä1.5" we would start parsing at position = 2, erroneously returning "0.5". So further thought may be needed.

The current implementations assume the prefix/suffix lengths in a lot of places (no wonder), should be re-examined.

> test/jdk/java/text/Format/NumberFormat/LenientMinusSignTest.java line 128:
> 
>> 126:             .set(dfs, "-\u00E5");
>> 127:         var df = new DecimalFormat("#.#;a\u0308#.#", dfs);
>> 128:         assertEquals(df.parse("a\u03081.5"), -1.5);
> 
> This one confuses me, should it not be parsing "\u00E51.5" such that the test can check if canonical equivalence occurs when matching the text "\u00E5" to the affix "a\u0308"? Otherwise, we are just parsing the exact pattern we set? Also I think it should be "\u00E4", not "\u00E5".

That's correct. I think the simple case is not that simple. Need to rethink.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/26580#discussion_r2249005477
PR Review Comment: https://git.openjdk.org/jdk/pull/26580#discussion_r2249005036