RFR(JDK12/JAXP/java.xml) 8207760: SAXException: Invalid UTF-16 surrogate detected: d83c ?

Joe Wang huizhe.wang at oracle.com
Fri Sep 14 17:08:52 UTC 2018


Hi Daniel,

The additional advance is made when the pair has already been written, 
which is indicated by the return value of writeUTF16Surrogate being >= 
0*. Encodings.isHighUTF16Surrogate(ch) therefore would be redundant.

* Note that the return value is: -1 when nothing is written, 0 when the 
pair is written, so the condition of >= 0 means no matter there is a 
codepoint value or not, the index increment as long as the pair is 
written (the low surrogate is consumed).

Best,
Joe

On 9/14/18, 3:00 AM, Daniel Fuchs wrote:
> Hi Joe,
>
> Thanks for doing that - it's a far better solution.
>
> ToHTMLStream.java:
>
> 1432    // move the index if the surrogate pair has been written.
> 1433    if (writeUTF16Surrogate(ch, chars, i, end) >= 0) {
> 1434        i++;
> 1435    }
>
> shouldn't this be:
>
> 1433    if (writeUTF16Surrogate(ch, chars, i, end) >= 0) {
>             if (Encodings.isHighUTF16Surrogate(ch)) {
>                 // two input characters processed, increase
>                 // the index again.
>                 i++;
>             }
>
> IIUC you only want to increase the index if the ch was the
> high surrogate and the function has advanced to the low
> surrogate?
>
> I mean - a codepoint could have been returned if ch was the
> low surrogate, and in that case you don't want to increase
> the index twice as only one character has been consumed.
>
> I guess there's the same issue in ToStream.java at lines
> 1154-1156 and in ToTextStream.java at line 303...
>
> Or am I missing something?
>
> best regards,
>
> -- daniel
>
>
> On 14/09/2018 04:13, Joe Wang wrote:
>> Thanks Daniel.
>>
>> I changed the return of writeUTF16Surrogate so that it is clearer 
>> within writeUTF16Surrogate when an additional index increment is 
>> needed. Other corresponding changes are in ToHTMLStream and 
>> ToTextStream where similar calls to the method are made. It's been an 
>> issue in some part of JAXP impl where error or warning messages are 
>> printed out to the console (e.g. JDK-8000621). But I kept it as is in 
>> ToTextStream for this patch.
>>
>> webrev01:
>> http://cr.openjdk.java.net/~joehw/jdk12/8207760/webrev01/
>>
>> Best,
>> Joe
>>
>> On 9/13/18, 2:23 AM, Daniel Fuchs wrote:
>>> Hi Joe,
>>>
>>> On 13/09/2018 00:25, Lance Andersen wrote:
>>>> Hi Joe,
>>>>
>>>> The change  seems reasonable
>>>
>>> Agreed. However the following condition in ToStream::handleEscaping
>>> is a bit cryptic:
>>>
>>> 1155             if ((ihs && (i + 1 < end)) || (ils && i != 0)) {
>>> 1156                 i++ ; // process two input characters
>>> 1157             }
>>>
>>> could the comment be fleshed out to explain it?
>>>
>>> I suspect that: `(ihs && (i + 1 < end))` means that
>>> `writeUTF16Surrogate(c, ch, i, end);` has written the two surrogate, in
>>> which case i should be incremented in order to skip the low surrogate
>>> which has just been written.
>>>
>>> I am not sure what `(ils && i != 0)` means, though...
>>>
>>> best regards
>>>
>>> -- daniel
>>>
>>>>
>>>>> On Sep 12, 2018, at 2:11 PM, Joe Wang <huizhe.wang at oracle.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> Please review a patch for a situation where a surrogate pair is at 
>>>>> the edge of a buffer. What the existing impl did was to report it 
>>>>> as an error. This patch fixes it by caching the high surrogate and 
>>>>> prints it out along with the low surrogate. Similar issue exists 
>>>>> also in the CDATA section and is fixed in this patch. The CDATA 
>>>>> impl had a couple of bugs where an indent could be written inside 
>>>>> the CDATA and an unicode character written in between two CDATA 
>>>>> sections. Both are fixed in this patch.
>>>>>
>>>>> JBS: https://bugs.openjdk.java.net/browse/JDK-8207760
>>>>> webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8207760/webrev/
>>>>>
>>>>> Thanks,
>>>>> Joe
>>>>>
>>>>>
>>>>
>>>> <http://oracle.com/us/design/oracle-email-sig-198324.gif>
>>>> <http://oracle.com/us/design/oracle-email-sig-198324.gif> 
>>>> <http://oracle.com/us/design/oracle-email-sig-198324.gif>
>>>> <http://oracle.com/us/design/oracle-email-sig-198324.gif>Lance 
>>>> Andersen| Principal Member of Technical Staff | +1.781.442.2037
>>>> Oracle Java Engineering
>>>> 1 Network Drive
>>>> Burlington, MA 01803
>>>> Lance.Andersen at oracle.com <mailto:Lance.Andersen at oracle.com>
>>>>
>>>>
>>>>
>>>
>


More information about the core-libs-dev mailing list