RFR: 8323699: MessageFormat.toPattern() generates non-equivalent MessageFormat pattern
Justin Lu
jlu at openjdk.org
Thu Jan 18 20:11:10 UTC 2024
On Wed, 17 Jan 2024 21:31:49 GMT, Archie Cobbs <acobbs at openjdk.org> wrote:
>> Hi Archie, thanks for the proposed fix. I am still taking a look, but I wanted to demonstrate a current issue,
>>
>> (Jshell with your patch)
>>
>>
>> var pattIn = "Test: {0,number,foo'{'#.00}";
>> MessageFormat mFmt = new MessageFormat(pattIn);
>> var pattOut = mFmt.toPattern(); // returns "Test: {0,number,foo{#.00}";
>>
>>
>>
>> var pattIn = "Test: {0,number,foo'}'#.00}";
>> MessageFormat mFmt = new MessageFormat(pattIn);
>> var pattOut = mFmt.toPattern(); // returns "Test: {0,number,foo'}'#.00}";
>>
>>
>> As it stands, it would be inconsistent to have the closing bracket quoted and the opening bracket not quoted.
>>
>> Also in response to your earlier question on core-libs-dev, ideally invoking toPattern() can roundtrip, but there are known issues, such as a custom user defined Format subclass, or one of the newer Format subclasses that do not implement the toPattern() method. I am working on making this apparent in the specification of the method in a separate issue.
>
> Hi @justin-curtis-lu,
>
> Thanks a lot for taking a look, I am glad for any other set of eyes on this tricky stuff!
>
>> As it stands, it would be inconsistent to have the closing bracket quoted and the opening bracket not quoted.
>
> You're right - the problem exists with both `{` and `}`, as is shown here (unmodified jshell):
>
>
> $ jshell
> | Welcome to JShell -- Version 17.0.9
> | For an introduction type: /help intro
>
> jshell> import java.text.*;
>
> jshell> new MessageFormat("Test: {0,number,foo'{'#.00}");
> $2 ==> java.text.MessageFormat at 951c5b58
>
> jshell> $2.toPattern()
> $3 ==> "Test: {0,number,foo{#.00}"
>
> jshell> new MessageFormat($3)
> | Exception java.lang.IllegalArgumentException: Unmatched braces in the pattern.
> | at MessageFormat.applyPattern (MessageFormat.java:521)
> | at MessageFormat.<init> (MessageFormat.java:371)
> | at (#4:1)
>
>
> I've been missing the forest for the trees a bit and I think the fix can be simpler now.
>
> For the record, here is my latest understanding of what's going on...
>
> 1. When `MessageFormat.toPattern()` constructs the combined pattern string, it concatenates the plain text bits, suitably escaped, and the pattern strings from each subformat.
> 1. The subformat strings are already escaped, in the sense that you can take (for example) a `ChoiceFormat` format string and use it as-is to recreate that same `ChoiceFormat`, but they are escaped _only for their own purposes_.
> 1. The original example is an example of where this matters - `ChoiceFormat` needs to escape `#`, `|`, etc., but doesn't need to escape `{` or `}` - but a `MessageFormat` format string does need to escape those.
> 1. Therefore, the correct fix is for `MessageFormat.toPattern()` to modify the subformat strings by quoting any _unquoted_ `{` or `}` characters while leaving any already-quoted characters alone.
>
> Updated in a9d78c76f5f. I've also updated the test so it does more thorough round-trip checks.
Hi @archiecobbs , I think the recent commit is good progress.
First off I think this change will need a CSR as there are behavioral concerns, although minimal. Please let me know if you have access on JBS to file one, otherwise I can file one for you.
> Therefore, the correct fix is for MessageFormat.toPattern() to modify the subformat strings by quoting any unquoted {or } characters while leaving any already-quoted characters alone.
Yes, I think this behavior allows for the String returned by `toPattern()` to create a MessageFormat that can format equivalently to the original MessageFormat. Although to clarify, the original String pattern will not be guaranteed to be equivalent to the one returned by `toPattern()` as we are adding quotes to all brackets in the subformatPattern, regardless if they were quoted in the original String. I think the former is more important here, so the latter is okay.
For DecimalFormat and SimpleDateFormat subformats, adding quotes to brackets found in the subformatPattern is always right, those subformats can not be used to create recursive format behavior. Thus you would **never** except a nested ArgumentIndex in the subformatPattern (ex: `"{0,number,{1}foo0.0"}`), and can confidently escape all brackets found for these subFormats (which you do).
To me, ChoiceFormat is where there is some concern. Consider the following,
`new MessageFormat("{0,choice,0# {1} {2} {3} }”)`
With this change, invoking `toPattern()` on the created MessageFormat would return
`"{0,choice,0.0# '{'1'}' '{'2'}' '{'3'}' }"`.
This would technically be incorrect. One would think instead of allowing 3 nested elements, we are now printing the literal `{1} {2} {3}` since the brackets are escaped. But, this is not actually the case. Escaping brackets with a choice subFormat does not function properly. This is due to the fact that a recursive messageFormat is created, but the pattern passed has already lost the quotes.
This means that `new MessageFormat("{0,choice,0.0# '{'1'}' '{'2'}' '{'3'}' }")`.
is still equivalent to `new MessageFormat("{0,choice,0# {1} {2} {3} }”).`
So the behavior of quoting all brackets would still guarantee _the String returned by `toPattern()` to create a MessageFormat that can format equivalently to the original MessageFormat_ but only because the current behavior of formatting with a choice subFormat is technically wrong. I think this is okay, since this incorrect behavior is long-standing, but a CSR would be good to address this.
Please let me know if these points made sense, I’m happy to discuss further.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/17416#issuecomment-1899133247
More information about the core-libs-dev
mailing list