RFR: 8043592: The basic XML parser based on UKit fails to read XML files encoded in UTF-16BE or LE

Xueming Shen xueming.shen at oracle.com
Tue May 27 23:42:07 UTC 2014

looks good.


On 5/27/14 3:36 PM, huizhe wang wrote:
> Thanks Sherman!
> On 5/27/2014 1:46 PM, Xueming Shen wrote:
>> One more nit,
>> ln#2876-2879
>> Do we really need to create a new ReaderUTF8 if the encoding
>> is indeed UTF-8? I would assume that is true for most use
>> scenarios. Maybe the following would be better?
>>     // Encoding is defined by the xml text decl.
>>     reader = enc("UTF-8", is.getByteStream());
>>     expenc = xml(reader);
>>     if (!expenc.equals("UTF-8")) {
>>         if (expenc.startsWith("UTF-16")) {
>>             panic(FAULT);  // UTF-16 must have BOM [#4.3.3]
>>         }
>>         reader = enc(expenc, is.getByteStream());
>>     }
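
The panic(FAULT) branch above follows from the rule that a UTF-16 entity must start with a byte order mark, so by the time a text declaration names UTF-16 the BOM should already have been seen. As a rough illustration of that earlier step, here is a minimal sketch of BOM sniffing. BomSniffer and detect are hypothetical names, not part of UKit, and the sketch ignores UTF-32 BOMs.

```java
/** Hypothetical helper, not part of UKit: guesses an encoding from a leading BOM. */
final class BomSniffer {
    /**
     * Returns the encoding implied by a byte order mark at the start of head,
     * or null if no BOM is recognized. UTF-32 BOMs are deliberately ignored.
     */
    static String detect(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xff) == 0xEF
                && (head[1] & 0xff) == 0xBB
                && (head[2] & 0xff) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2) {
            int b0 = head[0] & 0xff;
            int b1 = head[1] & 0xff;
            if (b0 == 0xFE && b1 == 0xFF) {
                return "UTF-16BE";  // big-endian: high byte first
            }
            if (b0 == 0xFF && b1 == 0xFE) {
                return "UTF-16LE";  // little-endian: low byte first
            }
        }
        return null;  // no BOM: fall back to the text declaration
    }
}
```

A caller would peek at the first few bytes of the stream, pick the reader from the result, and only consult the text declaration when detect returns null.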
> Updated to reflect the above suggestion: 
> http://cr.openjdk.java.net/~joehw/jdk9/8043592/webrev/.
> For the performance improvement, I'll create a new bug to track it.
> Reading is buffered in the regular JAXP parser but not in the UKit
> one. It would be nice if the benchmark had a separate measurement of
> parsing performance; we currently measure it only indirectly, through
> validation and transform.
> Buffer size will affect performance: UKit sets the default to 512 (but
> actually reads byte by byte from the underlying stream, as you noted),
> while the JAXP parser defaults to 8k. For a small parser such as UKit,
> it may make sense to use a smaller buffer.
> -Joe
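
To make the buffer-size point concrete, the sketch below counts how many read calls reach the underlying stream when a consumer pulls one byte at a time, with no buffer, a 512-byte buffer, and an 8k buffer. It is not a benchmark: it uses an in-memory ByteArrayInputStream and java.io.BufferedInputStream as a stand-in for the parsers' own buffering, and ReadCallCounter and callsFor are hypothetical names introduced here.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical helper: counts how often the wrapped stream is asked for data. */
final class ReadCallCounter extends FilterInputStream {
    int calls;

    ReadCallCounter(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        calls++;
        return super.read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        calls++;
        return super.read(b, off, len);
    }

    /**
     * Drains dataSize bytes one read() at a time and returns the number of
     * calls that reached the underlying stream. bufferSize <= 0 means unbuffered.
     */
    static int callsFor(int dataSize, int bufferSize) {
        ReadCallCounter counter =
                new ReadCallCounter(new ByteArrayInputStream(new byte[dataSize]));
        InputStream in = bufferSize > 0
                ? new BufferedInputStream(counter, bufferSize)
                : counter;
        try {
            while (in.read() >= 0) {
                // discard the byte; only the call count matters here
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counter.calls;
    }

    public static void main(String[] args) {
        for (int size : new int[] {0, 512, 8192}) {
            System.out.println("buffer " + size + ": "
                    + callsFor(1 << 16, size) + " underlying read calls");
        }
    }
}
```

The call count is only a proxy for cost. With a real file or socket stream each underlying call is far more expensive than with an in-memory array, which is where the difference between 512 and 8k would need to be measured.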
>> -Sherman
>> On 05/27/2014 12:54 PM, Xueming Shen wrote:
>>> On 05/27/2014 10:46 AM, huizhe wang wrote:
>>>> Hi,
>>>> Are you okay with the updated patch?
>>>> Thanks,
>>>> Joe
>>> Looks fine to me.
>>> Btw, I took a quick look at the UTF8 reader, and my observation
>>> suggests that reading byte by byte from the underlying stream is
>>> probably the bottleneck of the overall "parsing". Attached is a
>>> buffered version; my simple test (just the parsing, with the
>>> default handler doing nothing) indicates it might double the
>>> parsing speed. Sure, the overall performance will depend on the
>>> individual handler, but it might be worth considering, any second
>>> counts :-) The code is not fully tested though, just for your
>>> reference.
>>> -Sherman
>>> package jdk.internal.util.xml.impl;
>>> import java.io.Reader;
>>> import java.io.InputStream;
>>> import java.io.IOException;
>>> import java.io.UnsupportedEncodingException;
>>> /**
>>>  * UTF-8 transformed UCS-2 character stream reader.
>>>  *
>>>  * This reader converts UTF-8 transformed UCS-2 characters to Java
>>>  * characters.
>>>  * The UCS-2 subset of UTF-8 transformation is described in RFC-2279 #2
>>>  * "UTF-8 definition":
>>>  *  0000 0000-0000 007F   0xxxxxxx
>>>  *  0000 0080-0000 07FF   110xxxxx 10xxxxxx
>>>  *  0000 0800-0000 FFFF   1110xxxx 10xxxxxx 10xxxxxx
>>>  *
>>>  * This reader will return an incorrect last character on a broken
>>>  * UTF-8 stream.
>>>  */
>>> public class ReaderUTF8 extends Reader {
>>>     private InputStream is;
>>>     private static final int DEFAULT_BUFFER_SIZE = 8192;
>>>     private byte[] buf;
>>>     private int pos, limit;
>>>     /**
>>>      * Constructor.
>>>      *
>>>      * @param is A byte input stream.
>>>      */
>>>     public ReaderUTF8(InputStream is) {
>>>         this.is = is;
>>>         this.buf = new byte[DEFAULT_BUFFER_SIZE];
>>>         this.pos = limit = 0;
>>>     }
>>>     private void fill() throws IOException {
>>>         if (pos >= buf.length) {  // no room left in buffer
>>>             pos = limit = 0;
>>>         }
>>>         int n = is.read(buf, pos, buf.length - pos);
>>>         if (n > 0) {
>>>             limit = n + pos;
>>>         }
>>>     }
>>>     /**
>>>      * Reads characters into a portion of an array.
>>>      *
>>>      * @param cbuf Destination buffer.
>>>      * @param off Offset at which to start storing characters.
>>>      * @param len Maximum number of characters to read.
>>>      * @exception IOException If any IO errors occur.
>>>      * @exception UnsupportedEncodingException If a UCS-4 character
>>>      *  occurs in the stream.
>>>      */
>>>     public int read(char[] cbuf, int off, int len) throws IOException {
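>>>         int off0 = off;
>>>         int end = off + len;
>>>         // Fast path for ASCII; bounded by end, not len, so that a
>>>         // nonzero off is handled correctly.
>>>         while (off < end) {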
>>>             if (pos >= limit) {
>>>                 fill();
>>>                 if (pos >= limit) {
>>>                     return (off != off0) ? off - off0 : -1;
>>>                 }
>>>             }
>>>             int val = buf[pos] & 0xff;
>>>             if (val >= 0x80) {
>>>                 break;
>>>             }
>>>             cbuf[off++] = (char) val;
>>>             pos++;
>>>         }
>>>         while (off < end) {
>>>             if (pos >= limit) {
>>>                 fill();
>>>                 if (pos >= limit) {
>>>                     return (off != off0) ? off - off0 : -1;
>>>                 }
>>>             }
>>>             int val = buf[pos++] & 0xff;
>>>             switch (val & 0xf0) {
>>>                 case 0xc0:
>>>                 case 0xd0:
>>>                     if (pos >= limit) {
>>>                         fill();
>>>                     }
>>>                     if (pos >= limit) {
>>>                         cbuf[off++] = (char) (((val & 0x1f) << 6) |
>>>                                 (is.read() & 0x3f));
>>>                     } else {
>>>                         cbuf[off++] = (char) (((val & 0x1f) << 6) |
>>>                                 (buf[pos++] & 0x3f));
>>>                     }
>>>                     break;
>>>                 case 0xe0:
>>>                     if (pos >= limit) {
>>>                         fill();
>>>                     }
>>>                     val = (val & 0x0f) << 12;
>>>                     if (pos >= limit) {
>>>                         val |= ((is.read() & 0x3f) << 6);
>>>                     } else {
>>>                         val |= ((buf[pos++] & 0x3f) << 6);
>>>                     }
>>>                     if (pos >= limit) {
>>>                         // buffer exhausted: take the last byte from the stream
>>>                         val |= (is.read() & 0x3f);
>>>                     } else {
>>>                         val |= (buf[pos++] & 0x3f);
>>>                     }
>>>                     cbuf[off++] = (char) val;
>>>                     break;
>>>                 case 0xf0:      // UCS-4 character
>>>                     throw new UnsupportedEncodingException(
>>>                             "UTF-32 (or UCS-4) encoding not supported.");
>>>                 default:
>>>                     cbuf[off++] = (char) val;
>>>                     break;
>>>             }
>>>         }
>>>         return off - off0;
>>>     }
>>>     /**
>>>      * Reads a single character.
>>>      *
>>>      * @return The character read, as an integer in the range 0 to 65535
>>>      *  (0x00-0xffff), or -1 if the end of the stream has been reached.
>>>      * @exception IOException If any IO errors occur.
>>>      * @exception UnsupportedEncodingException If a UCS-4 character
>>>      *  occurs in the stream.
>>>      */
>>>     public int read() throws IOException {
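>>>         int val;
>>>         if (pos >= limit) {
>>>             val = is.read();
>>>             if (val < 0) {
>>>                 // End of stream. Without this check -1 & 0xf0 == 0xf0
>>>                 // and EOF would be reported as a UCS-4 character.
>>>                 return -1;
>>>             }
>>>         } else {
>>>             val = buf[pos++] & 0xff;
>>>         }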
>>>         switch (val & 0xf0) {
>>>             case 0xc0:
>>>             case 0xd0:
>>>                 if (pos >= limit) {
>>>                     val = ((val & 0x1f) << 6) | (is.read() & 0x3f);
>>>                 } else {
>>>                     val = ((val & 0x1f) << 6) | (buf[pos++] & 0x3f);
>>>                 }
>>>                 break;
>>>             case 0xe0:
>>>                 val = (val & 0x0f) << 12;
>>>                 if (pos >= limit) {
>>>                     val |= ((is.read() & 0x3f) << 6);
>>>                 } else {
>>>                     val |= ((buf[pos++] & 0x3f) << 6);
>>>                 }
>>>                 if (pos >= limit) {
>>>                     val |= (is.read() & 0x3f);
>>>                 } else {
>>>                     val |= (buf[pos++] & 0x3f);
>>>                 }
>>>                 break;
>>>             case 0xf0:  // UCS-4 character
>>>                 throw new UnsupportedEncodingException();
>>>             default:
>>>                 break;
>>>         }
>>>         return val;
>>>     }
>>>     /**
>>>      * Closes the stream.
>>>      *
>>>      * @exception IOException If any IO errors occur.
>>>      */
>>>     public void close() throws IOException {
>>>         is.close();
>>>     }
>>> }
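
For reference, the bit arithmetic used in both read methods follows directly from the three-row table in the class comment. The sketch below isolates it for a single, well-formed sequence held in a byte array; Utf8Bmp and decode are hypothetical names used only for illustration, and the method does no buffering and no validation of continuation bytes.

```java
/** Hypothetical illustration of the 1- to 3-byte UTF-8 decoding used above. */
final class Utf8Bmp {
    /**
     * Decodes the character whose encoding starts at b[i].
     * Assumes a well-formed sequence of one to three bytes.
     */
    static char decode(byte[] b, int i) {
        int v = b[i] & 0xff;
        if (v < 0x80) {
            // 0xxxxxxx: 7 payload bits
            return (char) v;
        }
        if ((v & 0xe0) == 0xc0) {
            // 110xxxxx 10xxxxxx: 5 + 6 payload bits
            return (char) (((v & 0x1f) << 6) | (b[i + 1] & 0x3f));
        }
        if ((v & 0xf0) == 0xe0) {
            // 1110xxxx 10xxxxxx 10xxxxxx: 4 + 6 + 6 payload bits
            return (char) (((v & 0x0f) << 12)
                    | ((b[i + 1] & 0x3f) << 6)
                    | (b[i + 2] & 0x3f));
        }
        throw new IllegalArgumentException("not a 1- to 3-byte UTF-8 sequence");
    }
}
```

For example, the bytes E2 82 AC decode to U+20AC: the lead byte contributes 0x2 << 12, the second byte 0x02 << 6, and the third 0x2C.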
