RFR: 8043592: The basic XML parser based on UKit fails to read XML files encoded in UTF-16BE or LE
Xueming Shen
xueming.shen at oracle.com
Tue May 27 23:42:07 UTC 2014
looks good.
-Sherman
On 5/27/14 3:36 PM, huizhe wang wrote:
> Thanks Sherman!
>
> On 5/27/2014 1:46 PM, Xueming Shen wrote:
>> One more nit,
>>
>> ln#2876-2879
>>
>> Do we really need to create a new ReaderUTF8 if the encoding is
>> indeed UTF-8, which I would assume is true for most use scenarios?
>> Maybe the following would be better?
>>
>> // Encoding is defined by the xml text decl.
>> reader = enc("UTF-8", is.getByteStream());
>> expenc = xml(reader);
>> if (!expenc.equals("UTF-8")) {
>>     if (expenc.startsWith("UTF-16")) {
>>         panic(FAULT); // UTF-16 must have BOM [#4.3.3]
>>     }
>>     reader = enc(expenc, is.getByteStream());
>> }
>
> Updated to reflect the above suggestion:
> http://cr.openjdk.java.net/~joehw/jdk9/8043592/webrev/.
>
>
> For the performance improvement, I'll create a new bug to track it.
> Reading is buffered in the regular jaxp parser, unlike in the UKit one.
> It would be nice if the benchmark had a separate measurement of parsing
> performance. We currently have only indirect measurements through
> validation and transform.
>
> Buffer size will affect performance. UKit sets the default to 512 (but
> actually reads byte by byte from the underlying stream, as you noted),
> while the jaxp parser defaults to 8k. For a small parser such as UKit,
> it may make sense to use a smaller buffer.
>
> -Joe
>
>>
>> -Sherman
>>
>> On 05/27/2014 12:54 PM, Xueming Shen wrote:
>>> On 05/27/2014 10:46 AM, huizhe wang wrote:
>>>> Hi,
>>>>
>>>> Are you okay with the updated patch?
>>>>
>>>> Thanks,
>>>> Joe
>>>>
>>>
>>> looks fine to me.
>>>
>>> Btw, I took a quick look at the UTF8 reader, and my observation
>>> suggests that reading byte by byte from the underlying stream is
>>> probably the bottleneck of the overall "parsing". Attached is a
>>> buffered version; my simple test (just the parsing, using the
>>> default handler, which does nothing) indicates it might double the
>>> parsing speed. Sure, the overall performance will depend on the
>>> individual handler, but it might be worth considering, any second
>>> counts :-) The code is not fully tested though, just for your
>>> reference.
>>>
>>> -Sherman
>>>
>>> package jdk.internal.util.xml.impl;
>>>
>>> import java.io.Reader;
>>> import java.io.InputStream;
>>> import java.io.IOException;
>>> import java.io.UnsupportedEncodingException;
>>>
>>> /**
>>>  * UTF-8 transformed UCS-2 character stream reader.
>>>  *
>>>  * This reader converts UTF-8 transformed UCS-2 characters to Java
>>>  * characters.
>>>  * The UCS-2 subset of UTF-8 transformation is described in RFC-2279 #2
>>>  * "UTF-8 definition":
>>>  * 0000 0000-0000 007F 0xxxxxxx
>>>  * 0000 0080-0000 07FF 110xxxxx 10xxxxxx
>>>  * 0000 0800-0000 FFFF 1110xxxx 10xxxxxx 10xxxxxx
>>>  *
>>>  * This reader will return an incorrect last character on a broken
>>>  * UTF-8 stream.
>>>  */
>>> public class ReaderUTF8 extends Reader {
>>>
>>>     private InputStream is;
>>>
>>>     private static final int DEFAULT_BUFFER_SIZE = 8192;
>>>     private byte[] buf;
>>>     private int pos, limit;
>>>
>>>     /**
>>>      * Constructor.
>>>      *
>>>      * @param is A byte input stream.
>>>      */
>>>     public ReaderUTF8(InputStream is) {
>>>         this.is = is;
>>>         this.buf = new byte[DEFAULT_BUFFER_SIZE];
>>>         this.pos = limit = 0;
>>>     }
>>>
>>>     private void fill() throws IOException {
>>>         if (pos >= buf.length) { // no room left in buffer
>>>             pos = limit = 0;
>>>         }
>>>         int n = is.read(buf, pos, buf.length - pos);
>>>         if (n > 0) {
>>>             limit = n + pos;
>>>         }
>>>     }
>>>
>>>     /**
>>>      * Reads characters into a portion of an array.
>>>      *
>>>      * @param cbuf Destination buffer.
>>>      * @param off Offset at which to start storing characters.
>>>      * @param len Maximum number of characters to read.
>>>      * @exception IOException If any IO errors occur.
>>>      * @exception UnsupportedEncodingException If a UCS-4 character
>>>      * occurs in the stream.
>>>      */
>>>     public int read(char[] cbuf, int off, int len) throws IOException {
>>>         int off0 = off;
>>>         int end = off + len;
>>>         // Fast path: copy leading ASCII bytes straight from the buffer.
>>>         while (off < end) {
>>>             if (pos >= limit) {
>>>                 fill();
>>>                 if (pos >= limit) {
>>>                     return (off != off0) ? off - off0 : -1;
>>>                 }
>>>             }
>>>             int val = buf[pos] & 0xff;
>>>             if (val >= 0x80) {
>>>                 break;
>>>             }
>>>             cbuf[off++] = (char) val;
>>>             pos++;
>>>         }
>>>         // General path: decode one-, two-, and three-byte sequences.
>>>         while (off < end) {
>>>             if (pos >= limit) {
>>>                 fill();
>>>                 if (pos >= limit) {
>>>                     return (off != off0) ? off - off0 : -1;
>>>                 }
>>>             }
>>>             int val = buf[pos++] & 0xff;
>>>             switch (val & 0xf0) {
>>>             case 0xc0:
>>>             case 0xd0:
>>>                 if (pos >= limit) {
>>>                     fill();
>>>                 }
>>>                 if (pos >= limit) {
>>>                     cbuf[off++] = (char) (((val & 0x1f) << 6) |
>>>                                           (is.read() & 0x3f));
>>>                 } else {
>>>                     cbuf[off++] = (char) (((val & 0x1f) << 6) |
>>>                                           (buf[pos++] & 0x3f));
>>>                 }
>>>                 break;
>>>             case 0xe0:
>>>                 if (pos >= limit) {
>>>                     fill();
>>>                 }
>>>                 val = (val & 0x0f) << 12;
>>>                 if (pos >= limit) {
>>>                     val |= ((is.read() & 0x3f) << 6);
>>>                 } else {
>>>                     val |= ((buf[pos++] & 0x3f) << 6);
>>>                 }
>>>                 // Third byte: use the buffer only if it still has data.
>>>                 if (pos >= limit) {
>>>                     val |= (is.read() & 0x3f);
>>>                 } else {
>>>                     val |= (buf[pos++] & 0x3f);
>>>                 }
>>>                 cbuf[off++] = (char) val;
>>>                 break;
>>>             case 0xf0: // UCS-4 character
>>>                 throw new UnsupportedEncodingException(
>>>                         "UTF-32 (or UCS-4) encoding not supported.");
>>>             default:
>>>                 cbuf[off++] = (char) val;
>>>                 break;
>>>             }
>>>         }
>>>         return off - off0;
>>>     }
>>>
>>>     /**
>>>      * Reads a single character.
>>>      *
>>>      * @return The character read, as an integer in the range 0 to 65535
>>>      * (0x00-0xffff), or -1 if the end of the stream has been reached.
>>>      * @exception IOException If any IO errors occur.
>>>      * @exception UnsupportedEncodingException If a UCS-4 character
>>>      * occurs in the stream.
>>>      */
>>>     public int read() throws IOException {
>>>         int val;
>>>         if (pos >= limit) {
>>>             val = is.read();
>>>             if (val < 0) {
>>>                 return -1; // end of stream; -1 & 0xf0 would match 0xf0
>>>             }
>>>         } else {
>>>             val = buf[pos++] & 0xff;
>>>         }
>>>         switch (val & 0xf0) {
>>>         case 0xc0:
>>>         case 0xd0:
>>>             if (pos >= limit) {
>>>                 val = ((val & 0x1f) << 6) | (is.read() & 0x3f);
>>>             } else {
>>>                 val = ((val & 0x1f) << 6) | (buf[pos++] & 0x3f);
>>>             }
>>>             break;
>>>         case 0xe0:
>>>             val = (val & 0x0f) << 12;
>>>             if (pos >= limit) {
>>>                 val |= ((is.read() & 0x3f) << 6);
>>>             } else {
>>>                 val |= ((buf[pos++] & 0x3f) << 6);
>>>             }
>>>             if (pos >= limit) {
>>>                 val |= (is.read() & 0x3f);
>>>             } else {
>>>                 val |= (buf[pos++] & 0x3f);
>>>             }
>>>             break;
>>>         case 0xf0: // UCS-4 character
>>>             throw new UnsupportedEncodingException();
>>>         default:
>>>             break;
>>>         }
>>>         return val;
>>>     }
>>>
>>>     /**
>>>      * Closes the stream.
>>>      *
>>>      * @exception IOException If any IO errors occur.
>>>      */
>>>     public void close() throws IOException {
>>>         is.close();
>>>     }
>>> }
>>>
>>
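For reference, the bit layouts listed in the ReaderUTF8 class comment can be checked by hand on a couple of characters. The following is only a minimal sketch: the class name Utf8BmpDecode and the method decode are hypothetical helpers made up for illustration, not part of UKit, and the sketch assumes the input holds one complete sequence.

```java
// Hypothetical helper for illustration; not the UKit ReaderUTF8 class.
class Utf8BmpDecode {

    /** Decodes one 1-, 2-, or 3-byte UTF-8 sequence starting at b[0]. */
    static char decode(byte[] b) {
        int lead = b[0] & 0xff;
        if (lead < 0x80) {                  // 0xxxxxxx
            return (char) lead;
        }
        if ((lead & 0xe0) == 0xc0) {        // 110xxxxx 10xxxxxx
            return (char) (((lead & 0x1f) << 6) | (b[1] & 0x3f));
        }
        if ((lead & 0xf0) == 0xe0) {        // 1110xxxx 10xxxxxx 10xxxxxx
            return (char) (((lead & 0x0f) << 12)
                    | ((b[1] & 0x3f) << 6)
                    | (b[2] & 0x3f));
        }
        // Continuation bytes and 4-byte lead bytes are out of scope here.
        throw new IllegalArgumentException("not a 1- to 3-byte lead byte");
    }

    public static void main(String[] args) {
        // U+00E9 is encoded as C3 A9, U+20AC as E2 82 AC.
        char e = decode(new byte[] {(byte) 0xC3, (byte) 0xA9});
        char euro = decode(new byte[] {(byte) 0xE2, (byte) 0x82, (byte) 0xAC});
        System.out.println(Integer.toHexString(e));
        System.out.println(Integer.toHexString(euro));
    }
}
```

The masks 0x1f, 0x0f, and 0x3f are the same ones used in the attached reader; only the lead-byte tests differ, since the reader switches on the high nibble instead.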
>
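To make the byte-by-byte versus buffered discussion in this thread concrete, here is a rough sketch that counts how many read calls reach the underlying stream for each strategy. Everything in it is an assumption for illustration: ReadCallDemo and CountingInputStream are made-up names, the input is an in-memory array of zero bytes, and it measures call counts only, not parsing time.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical demo class; not part of UKit or the jaxp parser.
class ReadCallDemo {

    /** Wraps a stream and counts the read calls that reach it. */
    static final class CountingInputStream extends InputStream {
        private final InputStream in;
        int calls;

        CountingInputStream(InputStream in) {
            this.in = in;
        }

        @Override
        public int read() throws IOException {
            calls++;
            return in.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return in.read(b, off, len);
        }
    }

    static CountingInputStream source(int size) {
        return new CountingInputStream(new ByteArrayInputStream(new byte[size]));
    }

    /** Drains size bytes one call per byte; returns the number of calls. */
    static int drainByteByByte(int size) {
        CountingInputStream in = source(size);
        try {
            while (in.read() >= 0) {
                // discard the byte
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return in.calls;
    }

    /** Drains size bytes through a buffer; returns the number of calls. */
    static int drainBuffered(int size, int bufSize) {
        CountingInputStream in = source(size);
        byte[] buf = new byte[bufSize];
        try {
            while (in.read(buf, 0, buf.length) >= 0) {
                // discard the chunk
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return in.calls;
    }

    public static void main(String[] args) {
        int size = 64 * 1024;
        System.out.println("byte by byte: " + drainByteByByte(size));
        System.out.println("512 buffer:   " + drainBuffered(size, 512));
        System.out.println("8k buffer:    " + drainBuffered(size, 8192));
    }
}
```

With a 64 KiB in-memory input, the counts are 65537, 129, and 9 calls respectively, each including the final call that reports end of stream. How much that gap matters in practice depends on the per-call cost of the real underlying stream, which this sketch does not model.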