RFR: 8323699: MessageFormat.toPattern() generates non-equivalent MessageFormat pattern
Archie Cobbs
acobbs at openjdk.org
Thu Jan 18 21:11:11 UTC 2024
On Thu, 18 Jan 2024 20:08:27 GMT, Justin Lu <jlu at openjdk.org> wrote:
>> Hi @justin-curtis-lu,
>>
>> Thanks a lot for taking a look, I am glad for any other set of eyes on this tricky stuff!
>>
>>> As it stands, it would be inconsistent to have the closing bracket quoted and the opening bracket not quoted.
>>
>> You're right - the problem exists with both `{` and `}`, as is shown here (unmodified jshell):
>>
>>
>> $ jshell
>> | Welcome to JShell -- Version 17.0.9
>> | For an introduction type: /help intro
>>
>> jshell> import java.text.*;
>>
>> jshell> new MessageFormat("Test: {0,number,foo'{'#.00}");
>> $2 ==> java.text.MessageFormat at 951c5b58
>>
>> jshell> $2.toPattern()
>> $3 ==> "Test: {0,number,foo{#.00}"
>>
>> jshell> new MessageFormat($3)
>> | Exception java.lang.IllegalArgumentException: Unmatched braces in the pattern.
>> | at MessageFormat.applyPattern (MessageFormat.java:521)
>> | at MessageFormat.<init> (MessageFormat.java:371)
>> | at (#4:1)
>>
>>
>> I've been missing the forest for the trees a bit and I think the fix can be simpler now.
>>
>> For the record, here is my latest understanding of what's going on...
>>
>> 1. When `MessageFormat.toPattern()` constructs the combined pattern string, it concatenates the plain text bits, suitably escaped, and the pattern strings from each subformat.
>> 1. The subformat strings are already escaped, in the sense that you can take (for example) a `ChoiceFormat` format string and use it as-is to recreate that same `ChoiceFormat`, but they are escaped _only for their own purposes_.
>> 1. The original example is an example of where this matters - `ChoiceFormat` needs to escape `#`, `|`, etc., but doesn't need to escape `{` or `}` - but a `MessageFormat` format string does need to escape those.
>> 1. Therefore, the correct fix is for `MessageFormat.toPattern()` to modify the subformat strings by quoting any _unquoted_ `{` or `}` characters while leaving any already-quoted characters alone.
>>
>> Updated in a9d78c76f5f. I've also updated the test so it does more thorough round-trip checks.
>
> Hi @archiecobbs , I think the recent commit is good progress.
>
> First off I think this change will need a CSR as there are behavioral concerns, although minimal. Please let me know if you have access on JBS to file one, otherwise I can file one for you.
>
>> Therefore, the correct fix is for MessageFormat.toPattern() to modify the subformat strings by quoting any unquoted {or } characters while leaving any already-quoted characters alone.
>
> Yes, I think this behavior allows for the String returned by `toPattern()` to create a MessageFormat that can format equivalently to the original MessageFormat. Although to clarify, the original String pattern will not be guaranteed to be equivalent to the one returned by `toPattern()` as we are adding quotes to all brackets in the subformatPattern, regardless if they were quoted in the original String. I think the former is more important here, so the latter is okay.
>
> For DecimalFormat and SimpleDateFormat subformats, adding quotes to brackets found in the subformatPattern is always right, those subformats can not be used to create recursive format behavior. Thus you would **never** except a nested ArgumentIndex in the subformatPattern (ex: `"{0,number,{1}foo0.0"}`), and can confidently escape all brackets found for these subFormats (which you do).
>
> To me, ChoiceFormat is where there is some concern. Consider the following,
>
> `new MessageFormat("{0,choice,0# {1} {2} {3} }”)`
>
> With this change, invoking `toPattern()` on the created MessageFormat would return
>
> `"{0,choice,0.0# '{'1'}' '{'2'}' '{'3'}' }"`.
>
> This would technically be incorrect. One would think instead of allowing 3 nested elements, we are now printing the literal `{1} {2} {3}` since the brackets are escaped. But, this is not actually the case. Escaping brackets with a choice subFormat does not function properly. This is due to the fact that a recursive messageFormat is created, but the pattern passed has already lost the quotes.
>
> This means that `new MessageFormat("{0,choice,0.0# '{'1'}' '{'2'}' '{'3'}' }")`.
>
> is still equivalent to `new MessageFormat("{0,choice,0# {1} {2} {3} }”).`
>
> So the behavior of quoting all brackets would still guarantee _the String returned by `toPattern()` to create a MessageFormat that can format equivalently to the original MessageFormat_ but only because the current behavior of formatting with a choice subFormat is technically wrong. I think this is okay, since this incorrect behavior is long-standing, but a CSR would be good to ad...
Hi @justin-curtis-lu,
Thanks for your comments.
> First off I think this change will need a CSR as there are behavioral concerns, although minimal. Please let me know if you have access on JBS to file one, otherwise I can file one for you.
No problem, I'll add one.
> To me, ChoiceFormat is where there is some concern. Consider the following,
OK I think I understand what you're saying.
The first goal here is to ensure the round trip operation `MessageFormat` → Pattern string from `toPattern()` → `MessageFormat` takes you back to where you started.
The patch achieves this so we're good so far. I think we agree on this.
> This would technically be incorrect.
I'm not quite understanding what you meant there (and therefore slightly worried)...
In my mind, a key observation here is that the following two things are not equal and therefore shouldn't be conflated:
* A `ChoiceFormat` pattern string (e.g., what you get from invoking `ChoiceFormat.toPattern()`)
* The `XXX` that you see inside a `MessageFormat` pattern string as part of a `{0,choice,XXX}` subformat.
Those are not the same, even modulo quoting. As you point out, `MessageFormat` has this quirky logic ([here](https://github.com/openjdk/jdk/blob/81df265e41d393cdde87729e091dd465934071fd/src/java.base/share/classes/java/text/MessageFormat.java#L1327-L1333)) in which, after it evaluates a `ChoiceFormat` subformat, if the returned string contains a `{` character, then that string is magically reinterpreted as a `MessageFormat` pattern string and evaluated again, otherwise it's left alone.
So the `XXX` is not a "`ChoiceFormat` pattern string" but something else, so it can't be "incorrect" for not looking like a normal `ChoiceFormat` pattern string.
Maybe it's just a manner of speaking? In any case I want to be careful I'm not missing something in what you're saying.
Thanks.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/17416#issuecomment-1899206551
More information about the core-libs-dev
mailing list