RFR(JDK12/JAXP/java.xml) 8207760: SAXException: Invalid UTF-16 surrogate detected: d83c ?

Mon Sep 17 18:44:26 UTC 2018

Yikes, right, the process must avoid advancing the index if the pair is 
written when the current char is low surrogate.

webrev02:
http://cr.openjdk.java.net/~joehw/jdk12/8207760/webrev02/

Thanks,
Joe

On 9/17/18, 3:10 AM, Daniel Fuchs wrote:
> Hi Joe,
>
> hmmm... I'm not sure I'm completely convinced.
>
> Let's take an example, and assume your input
> has been split in two strings of 4 characters each
> (the code I'm seeing implies that this is possible,
> right?) In the sketch below the first line represents
> the loop iterations, the second line represents
> the two strings as char buffers (hs = high surrogate,
> ls = low surrogate, x,y = just regular chars), the third
> line represents index increments, and the fourth line
> represents the result of the writeUTF16Surrogate method:
>
> L1:   1   -  2    3        1      2    3   -
> L2: [ hs  ls x    hs ]   [ ls     y    hs  ls ]
> L3:   \i+2/  i++  i++      i++    i++  \i+2/
> L4:   cp||0       -1       cp||-1      cp||0
>
> We have two calls to writeAttrString, the first
> call writes the first 4 chars string, the second
> write the next 4 chars.
>
> In the sketch above, i++ indicates a regular
> increment from the loop, and i+2 indicates
> where i++ must be called a second time within
> the loop because two characters were consumed
> in a single iteration.
>
> So it seems to me that the only time you want to
> increment the index a second time is when you read
> (and wrote) both surrogate in one shot.
>
> In the pathological case - where the surrogate pair
> was split at the string limit, there will be two
> calls to writeUTF16Surrogate(...). One with the
> high surrogate, that will return -1, and one with
> the low surrogate that will either return -1 or a
> code point. And in that case you don't want to
> increment i a second time - or the character
> 'y' will be skipped.
>
> Am I still missing something?
>
> best regards,
>
> -- daniel
>
>
> On 14/09/2018 18:08, Joe Wang wrote:
>> Hi Daniel,
>>
>> The additional advance is made when the pair has already been 
>> written, which is indicated by the return value of 
>> writeUTF16Surrogate being >= 0*. Encodings.isHighUTF16Surrogate(ch) 
>> therefore would be redundant.
>>
>> * Note that the return value is: -1 when nothing is written, 0 when 
>> the pair is written, so the condition of >= 0 means no matter there 
>> is a codepoint value or not, the index increment as long as the pair 
>> is written (the low surrogate is consumed).
>>
>> Best,
>> Joe
>>
>> On 9/14/18, 3:00 AM, Daniel Fuchs wrote:
>>> Hi Joe,
>>>
>>> Thanks for doing that - it's a far better solution.
>>>
>>> ToHTMLStream.java:
>>>
>>> 1432    // move the index if the surrogate pair has been written.
>>> 1433    if (writeUTF16Surrogate(ch, chars, i, end) >= 0) {
>>> 1434        i++;
>>> 1435    }
>>>
>>> shouldn't this be:
>>>
>>> 1433    if (writeUTF16Surrogate(ch, chars, i, end) >= 0) {
>>>             if (Encodings.isHighUTF16Surrogate(ch)) {
>>>                 // two input characters processed, increase
>>>                 // the index again.
>>>                 i++;
>>>             }
>>>
>>> IIUC you only want to increase the index if the ch was the
>>> high surrogate and the function has advanced to the low
>>> surrogate?
>>>
>>> I mean - a codepoint could have been returned if ch was the
>>> low surrogate, and in that case you don't want to increase
>>> the index twice as only one character has been consumed.
>>>
>>> I guess there's the same issue in ToStream.java at lines
>>> 1154-1156 and in ToTextStream.java at line 303...
>>>
>>> Or am I missing something?
>>>
>>> best regards,
>>>
>>> -- daniel
>>>
>>>
>>> On 14/09/2018 04:13, Joe Wang wrote:
>>>> Thanks Daniel.
>>>>
>>>> I changed the return of writeUTF16Surrogate so that it is clearer 
>>>> within writeUTF16Surrogate when an additional index increment is 
>>>> needed. Other corresponding changes are in ToHTMLStream and 
>>>> ToTextStream where similar calls to the method are made. It's been 
>>>> an issue in some part of JAXP impl where error or warning messages 
>>>> are printed out to the console (e.g. JDK-8000621). But I kept it as 
>>>> is in ToTextStream for this patch.
>>>>
>>>> webrev01:
>>>> http://cr.openjdk.java.net/~joehw/jdk12/8207760/webrev01/
>>>>
>>>> Best,
>>>> Joe
>>>>
>>>> On 9/13/18, 2:23 AM, Daniel Fuchs wrote:
>>>>> Hi Joe,
>>>>>
>>>>> On 13/09/2018 00:25, Lance Andersen wrote:
>>>>>> Hi Joe,
>>>>>>
>>>>>> The change  seems reasonable
>>>>>
>>>>> Agreed. However the following condition in ToStream::handleEscaping
>>>>> is a bit cryptic:
>>>>>
>>>>> 1155             if ((ihs && (i + 1 < end)) || (ils && i != 0)) {
>>>>> 1156                 i++ ; // process two input characters
>>>>> 1157             }
>>>>>
>>>>> could the comment be fleshed out to explain it?
>>>>>
>>>>> I suspect that: `(ihs && (i + 1 < end))` means that
>>>>> `writeUTF16Surrogate(c, ch, i, end);` has written the two 
>>>>> surrogate, in
>>>>> which case i should be incremented in order to skip the low surrogate
>>>>> which has just been written.
>>>>>
>>>>> I am not sure what `(ils && i != 0)` means, though...
>>>>>
>>>>> best regards
>>>>>
>>>>> -- daniel
>>>>>
>>>>>>
>>>>>>> On Sep 12, 2018, at 2:11 PM, Joe Wang <huizhe.wang at oracle.com> 
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Please review a patch for a situation where a surrogate pair is 
>>>>>>> at the edge of a buffer. What the existing impl did was to 
>>>>>>> report it as an error. This patch fixes it by caching the high 
>>>>>>> surrogate and prints it out along with the low surrogate. 
>>>>>>> Similar issue exists also in the CDATA section and is fixed in 
>>>>>>> this patch. The CDATA impl had a couple of bugs where an indent 
>>>>>>> could be written inside the CDATA and an unicode character 
>>>>>>> written in between two CDATA sections. Both are fixed in this 
>>>>>>> patch.
>>>>>>>
>>>>>>> JBS: https://bugs.openjdk.java.net/browse/JDK-8207760
>>>>>>> webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8207760/webrev/
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Joe
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> <http://oracle.com/us/design/oracle-email-sig-198324.gif>
>>>>>> <http://oracle.com/us/design/oracle-email-sig-198324.gif> 
>>>>>> <http://oracle.com/us/design/oracle-email-sig-198324.gif>
>>>>>> <http://oracle.com/us/design/oracle-email-sig-198324.gif>Lance 
>>>>>> Andersen| Principal Member of Technical Staff | +1.781.442.2037
>>>>>> Oracle Java Engineering
>>>>>> 1 Network Drive
>>>>>> Burlington, MA 01803
>>>>>> Lance.Andersen at oracle.com <mailto:Lance.Andersen at oracle.com>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>
>