RFR: 8323699: MessageFormat.toPattern() generates non-equivalent MessageFormat pattern

Archie Cobbs acobbs at openjdk.org
Wed Jan 17 21:34:49 UTC 2024


On Tue, 16 Jan 2024 22:05:03 GMT, Justin Lu <jlu at openjdk.org> wrote:

>> Please consider this fix to ensure that going from `MessageFormat` to pattern string via `toPattern()` and then back via `new MessageFormat()` results in a format that is equivalent to the original.
>> 
>> The quoting and escaping rules for `MessageFormat` pattern strings are really tricky. I admit not completely understanding them. At a high level, they work like this: The normal way one would "nest" strings containing special characters is with straightforward recursive escaping like with the `bash` command line. For example, if you want to echo `a "quoted string" example` then you enter `echo "a "quoted string" example"`. With this scheme it's always the "outer" layer's job to (un)escape special characters as needed. That is, the echo command never sees the backslash characters.
>> 
>> In contrast, with `MessageFormat` and friends, nested subformat pattern strings are always provided "pre-escaped". So to build an "outer" string (e.g., for `ChoiceFormat`) the "inner" subformat pattern strings are more or less just concatenated, and then only the `ChoiceFormat` option separator characters (e.g., `<`, `#`, `|`, etc.) are escaped.
>> 
>> The "pre-escape" escaping algorithm escapes `{` characters, because `{` indicates the beginning of a format argument. However, it doesn't escape `}` characters. This is OK because the format string parser treats any "extra" closing braces (where "extra" means not matching an opening brace) as plain characters.
>> 
>> So far, so good... at least, until a format string containing an extra closing brace is nested inside a larger format string, where the extra closing brace, which was previously "extra", can now suddenly match an opening brace in the outer pattern containing it, thus truncating it by "stealing" the match from some subsequent closing brace.
>> 
>> An example is the `MessageFormat` string `"{0,choice,0.0#option A: {1}|1.0#option B: {1}'}'}"`. Note the second option format string has a trailing closing brace in plain text. If you create a `MessageFormat` with this string, you see a trailing `}` only with the second option.
>> 
>> However, if you then invoke `toPattern()`, the result is `"{0,choice,0.0#option A: {1}|1.0#option B: {1}}}"`. Oops, now because the "extra" closing brace is no longer quoted, it matches the opening brace at the beginning of the string, and the following closing  brace, which was the previous match, is now just plain text in the outer `MessageFormat` string.
>> 
>> As a result, invoking `f.format(new ...
>
> Hi Archie, thanks for the proposed fix. I am still taking a look, but I wanted to demonstrate a current issue,
> 
> (Jshell with your patch)
> 
> 
> var pattIn = "Test: {0,number,foo'{'#.00}";
> MessageFormat mFmt = new MessageFormat(pattIn);
> var pattOut  = mFmt.toPattern(); // returns "Test: {0,number,foo{#.00}";
> 
> 
> 
> var pattIn = "Test: {0,number,foo'}'#.00}";
> MessageFormat mFmt = new MessageFormat(pattIn);
> var pattOut  = mFmt.toPattern(); // returns "Test: {0,number,foo'}'#.00}";
> 
> 
> As it stands, it would be inconsistent to have the closing bracket quoted and the opening bracket not quoted. 
> 
> Also in response to your earlier question on core-libs-dev, ideally invoking toPattern() can roundtrip, but there are known issues, such as a custom user defined Format subclass, or one of the newer Format subclasses that do not implement the toPattern() method. I am working on making this apparent in the specification of the method in a separate issue.

Hi @justin-curtis-lu,

Thanks a lot for taking a look, I am glad for any other set of eyes on this tricky stuff!

> As it stands, it would be inconsistent to have the closing bracket quoted and the opening bracket not quoted.

You're right - the problem exists with both `{` and `}`, as is shown here (unmodified jshell):


$ jshell
|  Welcome to JShell -- Version 17.0.9
|  For an introduction type: /help intro

jshell> import java.text.*;

jshell> new MessageFormat("Test: {0,number,foo'{'#.00}");
$2 ==> java.text.MessageFormat at 951c5b58

jshell> $2.toPattern()
$3 ==> "Test: {0,number,foo{#.00}"

jshell> new MessageFormat($3)
|  Exception java.lang.IllegalArgumentException: Unmatched braces in the pattern.
|        at MessageFormat.applyPattern (MessageFormat.java:521)
|        at MessageFormat.<init> (MessageFormat.java:371)
|        at (#4:1)


I've been missing the forest for the trees a bit and I think the fix can be simpler now.

For the record, here is my latest understanding of what's going on...

1. When `MessageFormat.toPattern()` constructs the combined pattern string, it concatenates the plain text bits, suitably escaped, and the pattern strings from each subformat.
1. The subformat strings are already escaped, in the sense that you can take (for example) a `ChoiceFormat` format string and use it as-is to recreate that same `ChoiceFormat`, but they are escaped _only for their own purposes_.
1. The original example is an example of where this matters - `ChoiceFormat` needs to escape `#`, `|`, etc., but doesn't need to escape `{` or `}` - but a `MessageFormat` format string does need to escape those.
1. Therefore, the correct fix is for `MessageFormat.toPattern()` to modify the subformat strings by quoting any _unquoted_ `{` or `}` characters while leaving any already-quoted characters alone.

Updated in a9d78c76f5f. I've also updated the test so it does more thorough round-trip checks.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/17416#issuecomment-1896840300


More information about the core-libs-dev mailing list