RFR: 8323699: MessageFormat.toPattern() generates non-equivalent MessageFormat pattern [v3]
Justin Lu
jlu at openjdk.org
Tue Jan 23 22:23:34 UTC 2024
On Fri, 19 Jan 2024 23:30:43 GMT, Archie Cobbs <acobbs at openjdk.org> wrote:
>> Please consider this fix to ensure that going from `MessageFormat` to pattern string via `toPattern()` and then back via `new MessageFormat()` results in a format that is equivalent to the original.
>>
>> The quoting and escaping rules for `MessageFormat` pattern strings are really tricky. I admit not completely understanding them. At a high level, they work like this: The normal way one would "nest" strings containing special characters is with straightforward recursive escaping like with the `bash` command line. For example, if you want to echo `a "quoted string" example` then you enter `echo "a "quoted string" example"`. With this scheme it's always the "outer" layer's job to (un)escape special characters as needed. That is, the echo command never sees the backslash characters.
>>
>> In contrast, with `MessageFormat` and friends, nested subformat pattern strings are always provided "pre-escaped". So to build an "outer" string (e.g., for `ChoiceFormat`) the "inner" subformat pattern strings are more or less just concatenated, and then only the `ChoiceFormat` option separator characters (e.g., `<`, `#`, `|`, etc.) are escaped.
>>
>> The "pre-escape" escaping algorithm escapes `{` characters, because `{` indicates the beginning of a format argument. However, it doesn't escape `}` characters. This is OK because the format string parser treats any "extra" closing braces (where "extra" means not matching an opening brace) as plain characters.
>>
>> So far, so good... at least, until a format string containing an extra closing brace is nested inside a larger format string, where the extra closing brace, which was previously "extra", can now suddenly match an opening brace in the outer pattern containing it, thus truncating it by "stealing" the match from some subsequent closing brace.
>>
>> An example is the `MessageFormat` string `"{0,choice,0.0#option A: {1}|1.0#option B: {1}'}'}"`. Note the second option format string has a trailing closing brace in plain text. If you create a `MessageFormat` with this string, you see a trailing `}` only with the second option.
>>
>> However, if you then invoke `toPattern()`, the result is `"{0,choice,0.0#option A: {1}|1.0#option B: {1}}}"`. Oops, now because the "extra" closing brace is no longer quoted, it matches the opening brace at the beginning of the string, and the following closing brace, which was the previous match, is now just plain text in the outer `MessageFormat` string.
>>
>> As a result, invoking `f.format(new ...
>
> Archie Cobbs has updated the pull request incrementally with one additional commit since the last revision:
>
> Add @implNote to Javadoc for toPattern().
Hi @archiecobbs, left some initial comments.
But I think the logic is sound so far. Quote all quotable characters that need to be quoted, at the MessageFormat level this is simply opening and closing bracket, which you accomplished with the 2 pass copyAndQuoteBraces method. The goal is to create a resultant String that can create a MessageFormat that formats equivalently as the original, (even if calling toPattern() on both MessageFormats returns different String values).
Let's also try and get the CSR approved first.
src/java.base/share/classes/java/text/MessageFormat.java line 1:
> 1: /*
For this file, please also bump the latter copyright year to 2024.
src/java.base/share/classes/java/text/MessageFormat.java line 556:
> 554: * does not necessarily equal the previously applied pattern.
> 555: *
> 556: * @implNote The string returned by this method can be used to create
Hmm, I'm not sure about the current note, because its not true in all cases (for example, some unknown subclass of Format). Maybe @naotoj has thoughts?
src/java.base/share/classes/java/text/MessageFormat.java line 585:
> 583: if (fmt instanceof DecimalFormat) {
> 584: result.append(",number");
> 585: subformatPattern = ((DecimalFormat)fmt).toPattern();
Could use pattern matching `instanceof` here and in the other instances, but I understand if you're trying to stay consistent with the existing code.
test/jdk/java/text/Format/MessageFormat/MessageFormatToPatternTest.java line 56:
> 54:
> 55: private static final int NUM_RANDOM_TEST_CASES = 1000;
> 56: private static final int MAX_FORMAT_NESTING = 3;
Can you add a comment defining `MAX_FORMAT_TESTING`, might not be immediately understood without further investigation
test/jdk/java/text/Format/MessageFormat/MessageFormatToPatternTest.java line 90:
> 88: // A few more test cases from the PR#17416 discussion
> 89: testRoundTrip(new MessageFormat("Test: {0,number,foo'{'#.00}"));
> 90: testRoundTrip(new MessageFormat("Test: {0,number,foo'}'#.00}"));
Can we add some more concrete test cases, for example, escaped quotes within a MessageFormat pattern (represented by doubled single quotes).
Another funny case is something like a MessageFormat created by `"{0,number,' abc }'' ' 0.00}"` which becomes `"{0,number, abc '}''' #0.00}"` after created from the value returned by toPattern() from the first MessageFormat. The Strings appear really different but format equivalently, it would be nice to have suspicious cases like these explicitly defined.
test/jdk/java/text/Format/MessageFormat/MessageFormatToPatternTest.java line 96:
> 94: @ParameterizedTest
> 95: @MethodSource("testCases")
> 96: public void testRoundTrip(MessageFormat format1) {
Can we also include the original String pattern that created format1, to help with debugging. I find myself wondering what the original String was.
Since technically, the full round trip is _pattern string -> MessageFormat1 -> pattern string -> MessageFormat2_
test/jdk/java/text/Format/MessageFormat/MessageFormatToPatternTest.java line 154:
> 152: // Generate a "random" MessageFormat. We do this by creating a MessageFormat with "{0}" placeholders
> 153: // and then substituting in random DecimalFormat, DateFormat, and ChoiceFormat subformats. The goal here
> 154: // is to avoid using pattern strings to construct formats, because they're what we're trying to check.
Can you add a general example of a randomly generated MessageFormat in the comments, (I know they vary by quite a lot), but it would be hard for someone to piece if together without digging into the code. And unfortunately the current toString() of MessageFormat is not implemented, so although JUnit prints the MessageFormat out, it's just gibberish.
-------------
PR Review: https://git.openjdk.org/jdk/pull/17416#pullrequestreview-1840037706
PR Review Comment: https://git.openjdk.org/jdk/pull/17416#discussion_r1464059010
PR Review Comment: https://git.openjdk.org/jdk/pull/17416#discussion_r1464059073
PR Review Comment: https://git.openjdk.org/jdk/pull/17416#discussion_r1464059360
PR Review Comment: https://git.openjdk.org/jdk/pull/17416#discussion_r1464060206
PR Review Comment: https://git.openjdk.org/jdk/pull/17416#discussion_r1464061279
PR Review Comment: https://git.openjdk.org/jdk/pull/17416#discussion_r1464062599
PR Review Comment: https://git.openjdk.org/jdk/pull/17416#discussion_r1464063513
More information about the core-libs-dev
mailing list