RFR(JDK12/JAXP/java.xml) 8207760: SAXException: Invalid UTF-16 surrogate detected: d83c ?
Daniel Fuchs
daniel.fuchs at oracle.com
Fri Sep 14 10:00:08 UTC 2018
Hi Joe,
Thanks for doing that - it's a far better solution.
ToHTMLStream.java:
1432 // move the index if the surrogate pair has been written.
1433 if (writeUTF16Surrogate(ch, chars, i, end) >= 0) {
1434 i++;
1435 }
shouldn't this be:
1433 if (writeUTF16Surrogate(ch, chars, i, end) >= 0) {
if (Encodings.isHighUTF16Surrogate(ch)) {
// two input characters processed, increase
// the index again.
i++;
}
IIUC you only want to increase the index if the ch was the
high surrogate and the function has advanced to the low
surrogate?
I mean - a codepoint could have been returned if ch was the
low surrogate, and in that case you don't want to increase
the index twice as only one character has been consumed.
I guess there's the same issue in ToStream.java at lines
1154-1156 and in ToTextStream.java at line 303...
Or am I missing something?
best regards,
-- daniel
On 14/09/2018 04:13, Joe Wang wrote:
> Thanks Daniel.
>
> I changed the return of writeUTF16Surrogate so that it is clearer within
> writeUTF16Surrogate when an additional index increment is needed. Other
> corresponding changes are in ToHTMLStream and ToTextStream where similar
> calls to the method are made. It's been an issue in some part of JAXP
> impl where error or warning messages are printed out to the console
> (e.g. JDK-8000621). But I kept it as is in ToTextStream for this patch.
>
> webrev01:
> http://cr.openjdk.java.net/~joehw/jdk12/8207760/webrev01/
>
> Best,
> Joe
>
> On 9/13/18, 2:23 AM, Daniel Fuchs wrote:
>> Hi Joe,
>>
>> On 13/09/2018 00:25, Lance Andersen wrote:
>>> Hi Joe,
>>>
>>> The change seems reasonable
>>
>> Agreed. However the following condition in ToStream::handleEscaping
>> is a bit cryptic:
>>
>> 1155 if ((ihs && (i + 1 < end)) || (ils && i != 0)) {
>> 1156 i++ ; // process two input characters
>> 1157 }
>>
>> could the comment be fleshed out to explain it?
>>
>> I suspect that: `(ihs && (i + 1 < end))` means that
>> `writeUTF16Surrogate(c, ch, i, end);` has written the two surrogate, in
>> which case i should be incremented in order to skip the low surrogate
>> which has just been written.
>>
>> I am not sure what `(ils && i != 0)` means, though...
>>
>> best regards
>>
>> -- daniel
>>
>>>
>>>> On Sep 12, 2018, at 2:11 PM, Joe Wang <huizhe.wang at oracle.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Please review a patch for a situation where a surrogate pair is at
>>>> the edge of a buffer. What the existing impl did was to report it as
>>>> an error. This patch fixes it by caching the high surrogate and
>>>> prints it out along with the low surrogate. Similar issue exists
>>>> also in the CDATA section and is fixed in this patch. The CDATA impl
>>>> had a couple of bugs where an indent could be written inside the
>>>> CDATA and an unicode character written in between two CDATA
>>>> sections. Both are fixed in this patch.
>>>>
>>>> JBS: https://bugs.openjdk.java.net/browse/JDK-8207760
>>>> webrevs: http://cr.openjdk.java.net/~joehw/jdk12/8207760/webrev/
>>>>
>>>> Thanks,
>>>> Joe
>>>>
>>>>
>>>
>>> <http://oracle.com/us/design/oracle-email-sig-198324.gif>
>>> <http://oracle.com/us/design/oracle-email-sig-198324.gif>
>>> <http://oracle.com/us/design/oracle-email-sig-198324.gif>
>>> <http://oracle.com/us/design/oracle-email-sig-198324.gif>Lance
>>> Andersen| Principal Member of Technical Staff | +1.781.442.2037
>>> Oracle Java Engineering
>>> 1 Network Drive
>>> Burlington, MA 01803
>>> Lance.Andersen at oracle.com <mailto:Lance.Andersen at oracle.com>
>>>
>>>
>>>
>>
More information about the core-libs-dev
mailing list