<Swing Dev> <Swind Dev> [7] Review request for 6836089: Swing HTML parser can't properly decode codepoints outside the Unicode Plane 0 into a surrogate pair

Tue Sep 4 17:31:01 UTC 2012

Hi Vladislav,

Looks good for me

Regards, Pavel
> Hello,
>
> could anyone review this please?
>
> Thanks in advance,
> - Vlad
>
> On 8/27/2012 6:51 PM, Vladislav Karnaukhov wrote:
>> Hello,
>>
>> could you please review a new version?
>> Please find webrev here: 
>> http://cr.openjdk.java.net/~vkarnauk/6836089/webrev.04/
>>
>> Regards,
>> - Vlad
>>
>> On 6/29/2012 2:09 AM, Phil Race wrote:
>>> That would work but I think its cleaner to just move it all into 
>>> mapNumericReference
>>> as below.
>>>
>>> -phil.
>>>
>>> diff --git 
>>> a/src/share/classes/javax/swing/text/html/parser/Parser.java 
>>> b/src/share/classes/javax/swing/text/html/parser/Parser.java
>>> --- a/src/share/classes/javax/swing/text/html/parser/Parser.java
>>> +++ b/src/share/classes/javax/swing/text/html/parser/Parser.java
>>> @@ -952,7 +952,7 @@
>>>                          ch = readCh();
>>>                          break;
>>>                  }
>>> -                char data[] = {mapNumericReference((char) n)};
>>> +                char data[] = mapNumericReference(n);
>>>                  return data;
>>>              }
>>>              addString('#');
>>> @@ -1021,7 +1021,7 @@
>>>      }
>>>
>>>      /**
>>> -     * Converts numeric character reference to Unicode character.
>>> +     * Converts numeric character reference to char array.
>>>       *
>>>       * Normally the code in a reference should be always converted
>>>       * to the Unicode character with the same code, but due to
>>> @@ -1030,13 +1030,21 @@
>>>       * to displayable characters with other codes.
>>>       *
>>>       * @param c the code of numeric character reference.
>>> -     * @return the character corresponding to the reference code.
>>> +     * @return a char array corresponding to the reference code.
>>>       */
>>> -    private char mapNumericReference(char c) {
>>> -        if (c < 130 || c > 159) {
>>> -            return c;
>>> +    private char[] mapNumericReference(int c) {
>>> +        char[] data;
>>> +        if (c >= 0xffff) { // outside unicode BMP.
>>> +            try {
>>> +                data = Character.toChars(c);
>>> +            } catch (IllegalArgumentException e) {
>>> +                data = new char[0];
>>>          }
>>> -        return cp1252Map[c - 130];
>>> +        } else {
>>> +            data = new char[1];
>>> +            data[0] = (c < 130 || c > 159) ? (char)c : cp1252Map[c 
>>> - 130];
>>> +        }
>>> +        return data;
>>>      }
>>>
>>>
>>> On 06/28/12 07:58 AM, Vladislav Karnaukhov wrote:
>>>> Hello Phil, Pavel,
>>>>
>>>> thank you for your comments. I've reworked fix and testcase, please 
>>>> find new webrev here: 
>>>> http://cr.openjdk.java.net/~vkarnauk/6836089/webrev.03/
>>>>
>>>> Regards,
>>>> - Vlad
>>>>
>>>> On 6/27/2012 10:13 PM, Phil Race wrote:
>>>>> Well its not only unnecessary but is likely wrong .. I don't think
>>>>> you looked at what mapNumericReference() does.
>>>>>
>>>>> -phil.
>>>>>
>>>>> On 6/27/2012 11:03 AM, Vladislav Karnaukhov wrote:
>>>>>> Hello Phil,
>>>>>>
>>>>>> I used Character.toChars() in both branches because I wanted to 
>>>>>> delegate code point conversion to char or surrogate pair entirely 
>>>>>> to Character class...
>>>>>>
>>>>>> Regards,
>>>>>> - Vlad
>>>>>>
>>>>>> On 6/26/2012 9:49 PM, Phil Race wrote:
>>>>>>> I don't understand why you call Character.toChars() if you've 
>>>>>>> just determined
>>>>>>> you don't need to ?
>>>>>>>
>>>>>>> ie what was wrong with
>>>>>>>
>>>>>>> data = ( n >>> 16 == 0) ? {mapNumericReference((char) n)} : 
>>>>>>> Character.toChars(n);
>>>>>>>
>>>>>>> ?
>>>>>>>
>>>>>>> In the case of an invalid supplementary pair, maybe it would be 
>>>>>>> safer to  return { ' ' } ?
>>>>>>>
>>>>>>>
>>>>>>> One thing I see in the parsing code that is not new or changed 
>>>>>>> here, that
>>>>>>> may bear examination, is that there's a loop that keeps on 
>>>>>>> reading so long
>>>>>>> so long there are new digits. I am not sure its wise to keep 
>>>>>>> going once
>>>>>>> you overflow.
>>>>>>>
>>>>>>> -phil.
>>>>>>>
>>>>>>> On 6/26/2012 12:37 AM, Vladislav Karnaukhov wrote:
>>>>>>>> Hello Pavel,
>>>>>>>>
>>>>>>>> I can provide you with the link to 6u19, but this is direct 
>>>>>>>> forward-port and no code changes were made.
>>>>>>>>
>>>>>>>> I'll make changes as you've pointed out in 1) and 2)
>>>>>>>>
>>>>>>>> About 3) - is it a requirement to use "? :" operator? I 
>>>>>>>> personally prefer single-line if-else, but I don't want to 
>>>>>>>> argue over code style, and surely I'll follow code design 
>>>>>>>> practices.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> - Vlad
>>>>>>>>
>>>>>>>> On 6/25/2012 6:43 PM, Pavel Porvatov wrote:
>>>>>>>>> Hi Vladislav,
>>>>>>>>>
>>>>>>>>> Do you have a link to the fix for 6u19?
>>>>>>>>>
>>>>>>>>> I didn't investigate the fix deeply, but
>>>>>>>>>
>>>>>>>>> 1.
>>>>>>>>>  private final int MAX_BMP_BOUND = 65535;
>>>>>>>>> should be static (otherwise variable name should be in lower 
>>>>>>>>> case)
>>>>>>>>>
>>>>>>>>> 2. Add a space in single line comments
>>>>>>>>>
>>>>>>>>> 3.
>>>>>>>>> +                    char data[];
>>>>>>>>> +                    if (n <= MAX_BMP_BOUND) {
>>>>>>>>> +                        data = 
>>>>>>>>> Character.toChars(mapNumericReference((char) n));
>>>>>>>>> +                    } else {
>>>>>>>>> +                        data = Character.toChars(n);
>>>>>>>>> +                    }
>>>>>>>>> +
>>>>>>>>>                  return data;
>>>>>>>>>
>>>>>>>>> can be written in one line via "? :" operator and looks more 
>>>>>>>>> readable for me
>>>>>>>>>
>>>>>>>>> Thanks, Pavel
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> please review the fix for 6836089: Swing HTML parser can't 
>>>>>>>>>> properly decode codepoints outside the Unicode Plane 0 into a 
>>>>>>>>>> surrogate pair. This is a forward port from JDK6 (fixed 
>>>>>>>>>> escalated issue, fix integrated) to JDK7.
>>>>>>>>>>
>>>>>>>>>> The issue is a defect in Swing HTML Parser: if the codepoint 
>>>>>>>>>> is outside BMP (Unicode Plain 0), Parser incorrectly decodes 
>>>>>>>>>> codepoint into surrogate pair. The fix is to use 
>>>>>>>>>> Character.toChars() method if codepoint value is greater than 
>>>>>>>>>> upper bound of BMP.
>>>>>>>>>>
>>>>>>>>>> Webrev: http://cr.openjdk.java.net/~vkarnauk/6836089/webrev.00/
>>>>>>>>>> Bug description: 
>>>>>>>>>> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6836089
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> - Vlad
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>
>