Code Review Request, JDK-8146600 AVA Normalizer.Form issue

Tue Sep 20 00:58:34 UTC 2016

On 9/20/2016 5:53 AM, Wang Weijun wrote:
> Sorry. Whenever I wrote NFC, I meant NFD. Typo.
>
> On Sep 19, 2016, at 23:16, Xuelei Fan <xuelei.fan at oracle.com
> <mailto:xuelei.fan at oracle.com>> wrote:
>
>> On 9/19/2016 11:03 PM, Wang Weijun wrote:
>>> After some thinking, my current opinion is:
>>>
>>> 1. Maybe NFC is better than NFKD, but I am not a Unicode expert.
>>>
>> It is updated from NFKD to NFD.  I did not get the point.  Do you mean
>> NFC is better than NFD?
>>
>>> 2. I think the real bug is the order of escaping and normalization.
>>> The normalization (if it is a must) should be performed earlier, right after
>>> valStr is created and only performed on valStr. Otherwise the NFKD
>>> normalization would generate new chars that need to be escaped. Again
>>> I am not a Unicode expert and I don't know if NFC will also do the same.
>>>
>> I don't get the point.  The update is moving from NFKD to NFD.  No
>> NFKD normalization any more.
>>
>>> If 2) is fixed, whatever is correct in 1) does not matter much.
>>>
>> If we continue to use NFKD, normalization before escaping would result
>> in an unexpected string, as we discussed for the hello-world example.
>
> In this case, a comma appears but then it is escaped. You might say it is
> unexpected, but at least after escaping, it becomes a legal string.
>
I did not get the point.  A comma (",") should be escaped, and it does
get escaped, so the string is legal.  Do you mean "，" (the fullwidth,
double-byte comma) should be converted to ","?  Can you give more details?

>> It is something I want to avoid, so the fix is to use NFD instead.  I
>> think that once we move to NFD, it does not matter whether escaping or
>> normalization comes first, if I understand UTF-8 correctly.
>
> Maybe, but IMO this is not the correct fix. The root cause of the
> bug is not the form chosen, but the order.
>
I don't agree with you on this bug. The bug report complains about an
escaping issue, but actually the character should not be escaped.  So it
is not an escaping issue, and this fix is not trying to fix escaping; it
is trying to fix the normalization issue.
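
To make the two positions concrete, here is a minimal sketch using
java.text.Normalizer. It is not the actual AVA code: the class name and
the escapeCommas helper are hypothetical, and the helper handles only the
comma, whereas the real code escapes more special characters. It shows
that NFKD folds the fullwidth comma (U+FF0C) into an ASCII comma while
NFD leaves it alone, and that the order of escaping and normalization
only changes the result under NFKD.

```java
import java.text.Normalizer;

public class AvaNormalizationSketch {

    // Hypothetical, simplified escaping: backslash-escapes ASCII commas only.
    static String escapeCommas(String s) {
        return s.replace(",", "\\,");
    }

    // Canonical decomposition: does not touch compatibility characters.
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Compatibility decomposition: maps U+FF0C to the ASCII comma.
    static String nfkd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD);
    }

    public static void main(String[] args) {
        // "Hello" + FULLWIDTH COMMA (U+FF0C) + " world!"
        String value = "Hello\uFF0C world!";

        // Escape first, then NFKD: a new, unescaped ASCII comma appears.
        System.out.println(nfkd(escapeCommas(value)));

        // NFKD first, then escape: the new comma gets escaped.
        System.out.println(escapeCommas(nfkd(value)));

        // With NFD the fullwidth comma survives in either order.
        System.out.println(nfd(escapeCommas(value)).equals(value));
        System.out.println(escapeCommas(nfd(value)).equals(value));
    }
}
```

Under NFKD the value is treated the same as the ASCII "Hello, world!" in
either order, which is the parsing concern raised earlier in the thread;
under NFD the two strings stay distinct.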

Thanks,
Xuelei

> --Max
>
>>
>> Thanks,
>> Xuelei
>>
>>> Thanks
>>> Max
>>>
>>>> On Sep 19, 2016, at 10:32 AM, Xuelei Fan <xuelei.fan at oracle.com
>>>> <mailto:xuelei.fan at oracle.com>> wrote:
>>>>
>>>>> 4. Is it possible to perform normalization before escaping special
>>>>> characters?
>>>>>
>>>> Yes.  I thought about this case.  The current fix comes from the fact
>>>> that UTF-8 "Hello, world!" and "Hello， world!" should be different.
>>>> Parsing them as the same thing may result in unexpected serious issues.
>>>