RFR: 8323699: MessageFormat.toPattern() generates non-equivalent MessageFormat pattern [v2]
Justin Lu
jlu at openjdk.org
Thu Jan 18 23:05:26 UTC 2024
On Wed, 17 Jan 2024 21:31:22 GMT, Archie Cobbs <acobbs at openjdk.org> wrote:
>> Please consider this fix to ensure that going from `MessageFormat` to pattern string via `toPattern()` and then back via `new MessageFormat()` results in a format that is equivalent to the original.
>>
>> The quoting and escaping rules for `MessageFormat` pattern strings are really tricky. I admit not completely understanding them. At a high level, they work like this: The normal way one would "nest" strings containing special characters is with straightforward recursive escaping like with the `bash` command line. For example, if you want to echo `a "quoted string" example` then you enter `echo "a "quoted string" example"`. With this scheme it's always the "outer" layer's job to (un)escape special characters as needed. That is, the echo command never sees the backslash characters.
>>
>> In contrast, with `MessageFormat` and friends, nested subformat pattern strings are always provided "pre-escaped". So to build an "outer" string (e.g., for `ChoiceFormat`) the "inner" subformat pattern strings are more or less just concatenated, and then only the `ChoiceFormat` option separator characters (e.g., `<`, `#`, `|`, etc.) are escaped.
>>
>> The "pre-escape" escaping algorithm escapes `{` characters, because `{` indicates the beginning of a format argument. However, it doesn't escape `}` characters. This is OK because the format string parser treats any "extra" closing braces (where "extra" means not matching an opening brace) as plain characters.
>>
>> So far, so good... at least, until a format string containing an extra closing brace is nested inside a larger format string, where the extra closing brace, which was previously "extra", can now suddenly match an opening brace in the outer pattern containing it, thus truncating it by "stealing" the match from some subsequent closing brace.
>>
>> An example is the `MessageFormat` string `"{0,choice,0.0#option A: {1}|1.0#option B: {1}'}'}"`. Note the second option format string has a trailing closing brace in plain text. If you create a `MessageFormat` with this string, you see a trailing `}` only with the second option.
>>
>> However, if you then invoke `toPattern()`, the result is `"{0,choice,0.0#option A: {1}|1.0#option B: {1}}}"`. Oops, now because the "extra" closing brace is no longer quoted, it matches the opening brace at the beginning of the string, and the following closing brace, which was the previous match, is now just plain text in the outer `MessageFormat` string.
>>
>> As a result, invoking `f.format(new ...
>
> Archie Cobbs has updated the pull request incrementally with one additional commit since the last revision:
>
> Quote '{' and '}' in subformat patterns, but only it not already quoted.
Hi Archie,
> The XXX that you see inside a MessageFormat pattern string as part of a {0,choice,XXX} subformat.
I think we both agree that XXX may not be the same as the Subformat pattern string.
Sorry if I'm going over stuff you already know, but to clarify my previous statement, MessageFormat escapes brackets so that they are syntactically not evaluated and simply injected into the pattern as a literal bracket. Why I brought up ChoiceFormat, is that it is unique in the sense that you can insert a `{ FormatElement }` into a subFormat pattern, which causes recursion.
Take the example from the class specification,
`new MessageFormat(There {0,choice,0#are no files|1#is one file|1<are {0,number,integer} files}.”);`
Formatting the value `5`, MessageFormat will first evaluate it as `are {0,number,integer} files}` and then recursively create another MessageFormat to evaluate it as `5 files`.
This is because `1<are {0,number,integer} files.` has unquoted brackets, signaling that MessageFormat should interpret them as syntactically significant. Had they been quoted such as, `1<are '{'0,number,integer'}' files.}`, **I** would expect that MessageFormat will first evaluate it as `are ‘{'0,number,integer’}' files.`, and as the brackets are quoted, interpret them as syntactically meaningless and simply return the value `are {0,number,integer} files.` with no recursive MessageFormat created.
So bringing it back to my example in the previous comment, if we had `new MessageFormat("{0,choice,0.0#foo {1} {2} {3} }")` I we would expect it to format `5 6 7 8` as `foo 6 7 8`.
But if we had `new MessageFormat("{0,choice,0.0#foo '{'1'}' '{'2'}' '{'3'}' }")` which implies the brackets are syntactically meaningless, I would expect the class to treat them not as nested `FormatElement`s and expect the returned value `foo {0} {1} {2}`
But this is not the case, since by the time the recursive MessageFormat is created the quotes are lost, so it has no means of knowing to treat them as syntactically insignificant, and thus parses the quote and treats them as injected `FormatElement`s which is why in reality the most recent example returns the value `foo 6 7 8`.
So with this change, although calling `toPattern()` on `new MessageFormat("{0,choice,0.0#foo {1} {2} {3} }")` would return the String `"{0,choice,0.0#foo '{'1'}' '{'2'}' '{'3'}' }"` which **I** would think, formats differently, **in reality**, they format equivalently, so I think it is okay. I brought it up not necessarily as an issue, but just something to make apparent. Although I suppose something like this is up for interpretation, so perhaps its only wrong to me.
I'm taking a look at the CSR you filed, but you may also want someone with Reviewer status to take a look as well.
I'll also test your changes against our internal tests. Hope this helped.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/17416#issuecomment-1899355910
More information about the core-libs-dev
mailing list