RFR: 8323699: MessageFormat.toPattern() generates non-equivalent MessageFormat pattern

Archie Cobbs acobbs at openjdk.org
Thu Jan 18 21:11:11 UTC 2024


On Thu, 18 Jan 2024 20:08:27 GMT, Justin Lu <jlu at openjdk.org> wrote:

>> Hi @justin-curtis-lu,
>> 
>> Thanks a lot for taking a look, I am glad for any other set of eyes on this tricky stuff!
>> 
>>> As it stands, it would be inconsistent to have the closing bracket quoted and the opening bracket not quoted.
>> 
>> You're right - the problem exists with both `{` and `}`, as is shown here (unmodified jshell):
>> 
>> 
>> $ jshell
>> |  Welcome to JShell -- Version 17.0.9
>> |  For an introduction type: /help intro
>> 
>> jshell> import java.text.*;
>> 
>> jshell> new MessageFormat("Test: {0,number,foo'{'#.00}");
>> $2 ==> java.text.MessageFormat at 951c5b58
>> 
>> jshell> $2.toPattern()
>> $3 ==> "Test: {0,number,foo{#.00}"
>> 
>> jshell> new MessageFormat($3)
>> |  Exception java.lang.IllegalArgumentException: Unmatched braces in the pattern.
>> |        at MessageFormat.applyPattern (MessageFormat.java:521)
>> |        at MessageFormat.<init> (MessageFormat.java:371)
>> |        at (#4:1)
>> 
>> 
>> I've been missing the forest for the trees a bit and I think the fix can be simpler now.
>> 
>> For the record, here is my latest understanding of what's going on...
>> 
>> 1. When `MessageFormat.toPattern()` constructs the combined pattern string, it concatenates the plain text bits, suitably escaped, and the pattern strings from each subformat.
>> 1. The subformat strings are already escaped, in the sense that you can take (for example) a `ChoiceFormat` format string and use it as-is to recreate that same `ChoiceFormat`, but they are escaped _only for their own purposes_.
>> 1. The original example is an example of where this matters - `ChoiceFormat` needs to escape `#`, `|`, etc., but doesn't need to escape `{` or `}` - but a `MessageFormat` format string does need to escape those.
>> 1. Therefore, the correct fix is for `MessageFormat.toPattern()` to modify the subformat strings by quoting any _unquoted_ `{` or `}` characters while leaving any already-quoted characters alone.
>> 
>> Updated in a9d78c76f5f. I've also updated the test so it does more thorough round-trip checks.
>
> Hi @archiecobbs , I think the recent commit is good progress.
> 
> First off I think this change will need a CSR as there are behavioral concerns, although minimal. Please let me know if you have access on JBS to file one, otherwise I can file one for you.
> 
>> Therefore, the correct fix is for MessageFormat.toPattern() to modify the subformat strings by quoting any unquoted {or } characters while leaving any already-quoted characters alone.
> 
> Yes, I think this behavior allows for the String returned by `toPattern()` to create a MessageFormat that can format equivalently to the original MessageFormat. Although to clarify, the original String pattern will not be guaranteed to be equivalent to the one returned by `toPattern()` as we are adding quotes to all brackets in the subformatPattern, regardless if they were quoted in the original String. I think the former is more important here, so the latter is okay.
> 
> For DecimalFormat and SimpleDateFormat subformats, adding quotes to brackets found in the subformatPattern is always right, those subformats can not be used to create recursive format behavior. Thus you would **never** except a nested ArgumentIndex in the subformatPattern (ex: `"{0,number,{1}foo0.0"}`), and can confidently escape all brackets found for these subFormats (which you do).
> 
> To me, ChoiceFormat is where there is some concern. Consider the following,
> 
> `new MessageFormat("{0,choice,0# {1} {2} {3} }”)`
> 
> With this change, invoking `toPattern()` on the created MessageFormat would return 
> 
> `"{0,choice,0.0# '{'1'}' '{'2'}' '{'3'}' }"`.
> 
> This would technically be incorrect. One would think instead of allowing 3 nested elements, we are now printing the literal `{1} {2} {3}` since the brackets are escaped. But, this is not actually the case. Escaping brackets with a choice subFormat does not function properly. This is due to the fact that a recursive messageFormat is created, but the pattern passed has already lost the quotes.
> 
> This means that `new MessageFormat("{0,choice,0.0# '{'1'}' '{'2'}' '{'3'}' }")`.
> 
> is still equivalent to `new MessageFormat("{0,choice,0# {1} {2} {3} }”).`
> 
> So the behavior of quoting all brackets would still guarantee _the String returned by `toPattern()` to create a MessageFormat that can format equivalently to the original MessageFormat_ but only because the current behavior of formatting with a choice subFormat is technically wrong. I think this is okay, since this incorrect behavior is long-standing, but a CSR would be good to ad...

Hi @justin-curtis-lu,

Thanks for your comments.

> First off I think this change will need a CSR as there are behavioral concerns, although minimal. Please let me know if you have access on JBS to file one, otherwise I can file one for you.

No problem, I'll add one.

> To me, ChoiceFormat is where there is some concern. Consider the following,

OK I think I understand what you're saying.

The first goal here is to ensure the round trip operation `MessageFormat` → Pattern string from `toPattern()` → `MessageFormat` takes you back to where you started.

The patch achieves this so we're good so far. I think we agree on this.

> This would technically be incorrect.

I'm not quite understanding what you meant there (and therefore slightly worried)...

In my mind, a key observation here is that the following two things are not equal and therefore shouldn't be conflated:
* A `ChoiceFormat` pattern string (e.g., what you get from invoking `ChoiceFormat.toPattern()`)
* The `XXX` that you see inside a `MessageFormat` pattern string as part of a `{0,choice,XXX}` subformat.

Those are not the same, even modulo quoting. As you point out, `MessageFormat` has this quirky logic ([here](https://github.com/openjdk/jdk/blob/81df265e41d393cdde87729e091dd465934071fd/src/java.base/share/classes/java/text/MessageFormat.java#L1327-L1333)) in which, after it evaluates a `ChoiceFormat` subformat, if the returned string contains a `{` character, then that string is magically reinterpreted as a `MessageFormat` pattern string and evaluated again, otherwise it's left alone.

So the `XXX` is not a "`ChoiceFormat` pattern string" but something else, so it can't be "incorrect" for not looking like a normal `ChoiceFormat` pattern string.

Maybe it's just a manner of speaking? In any case I want to be careful I'm not missing something in what you're saying.

Thanks.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/17416#issuecomment-1899206551


More information about the core-libs-dev mailing list