8219196: DataOutputStream.writeUTF may throw unexpected exceptions

Martin Buchholz martinrb at google.com
Mon Mar 18 20:03:13 UTC 2019


You've again caught me only half paying attention.
Below is another attempt at micro-optimization (might be too tricky!), BUT:
- consider documenting the UTFDataFormatException and the 64k length limit
- the error message would actually be more useful if it contained a few chars
from the head and tail of the string instead of just the length (see the
sketch after tooLongMsg below)
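
For the first point, the added doc might be as simple as this (the wording
here is just a guess, not actual spec text):

    /**
     * ...
     * @throws UTFDataFormatException if the modified UTF-8 encoding of
     *         {@code str} would require more than 65535 bytes
     * @throws IOException if an I/O error occurs
     */

Anyway, the attempt: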

    static int writeUTF(String str, DataOutput out) throws IOException {
        final int strlen = str.length();
        int utflen = strlen;    // optimized for ASCII

        for (int i = 0; i < strlen; i++) {
            int c = str.charAt(i);
            if (c >= 0x80 || c == 0)
                utflen += (c >= 0x800) ? 2 : 1;
        }

        // If the 32-bit count wrapped around, the wrapped value is always
        // less than strlen (the true count is at most 3 * strlen, which is
        // less than strlen + 2^32), so utflen < strlen detects overflow.
        if (utflen > 65535 || /* overflow */ utflen < strlen)
            throw new UTFDataFormatException(tooLongMsg(utflen));

        final byte[] bytearr;
        if (out instanceof DataOutputStream) {
            DataOutputStream dos = (DataOutputStream)out;
            if (dos.bytearr == null || (dos.bytearr.length < (utflen + 2)))
                dos.bytearr = new byte[(utflen*2) + 2];
            bytearr = dos.bytearr;
        } else {
            bytearr = new byte[utflen + 2];
        }

        int count = 0;
        bytearr[count++] = (byte) ((utflen >>> 8) & 0xFF);
        bytearr[count++] = (byte) ((utflen >>> 0) & 0xFF);

        int i = 0;
        for (i = 0; i < strlen; i++) { // optimized for initial run of ASCII
            int c = str.charAt(i);
            if (c >= 0x80 || c == 0) break;
            bytearr[count++] = (byte) c;
        }

        for (; i < strlen; i++) {
            int c = str.charAt(i);
            if (c < 0x80 && c != 0) {
                bytearr[count++] = (byte) c;
            } else if (c >= 0x800) {
                bytearr[count++] = (byte) (0xE0 | ((c >> 12) & 0x0F));
                bytearr[count++] = (byte) (0x80 | ((c >>  6) & 0x3F));
                bytearr[count++] = (byte) (0x80 | ((c >>  0) & 0x3F));
            } else {
                bytearr[count++] = (byte) (0xC0 | ((c >>  6) & 0x1F));
                bytearr[count++] = (byte) (0x80 | ((c >>  0) & 0x3F));
            }
        }
        out.write(bytearr, 0, utflen + 2);
        return utflen + 2;
    }

    private static String tooLongMsg(int bits32) {
        // bits32 is the (possibly wrapped) 32-bit byte count; try to
        // recover the unsigned value for the message.
        long actualLength = (bits32 > 0xFFFF) ? bits32
            // handle int overflow
            : ((bits32 < 0)
               ? ((long)bits32 & 0xFFFFFFFFL)   // wrapped past Integer.MAX_VALUE
               : bits32 + 0x1_0000_0000L);      // wrapped past 2^32
        return "encoded string too long: " + actualLength + " bytes";
    }
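
On the head-and-tail point, tooLongMsg could take the string itself and
show a recognizable prefix and suffix; a rough sketch (the 8-char window
and the long count parameter are arbitrary choices here):

    private static String tooLongMsg(String s, long actualLength) {
        int slen = s.length();
        String head = s.substring(0, Math.min(slen, 8));
        String tail = s.substring(Math.max(slen - 8, 0), slen);
        return "encoded string (" + head + "..." + tail + ") too long: "
            + actualLength + " bytes";
    }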



On Mon, Mar 18, 2019 at 11:36 AM Roger Riggs <Roger.Riggs at oracle.com> wrote:

> Hi,
>
> Collapsing a couple of emails.
>
> On 03/18/2019 01:43 PM, Martin Buchholz wrote:
>
> Hmmmm ....
>
> I took another look at this and Roger tickled my UTF-8 optimization spidey
> sense.  Rewritten using Martin style:
>
>     static int writeUTF(String str, DataOutput out) throws IOException {
>         int strlen = str.length();
>         int utflen = strlen;    // optimized for ASCII
>
>         for (int i = 0; i < strlen; i++) {
>             int c = str.charAt(i);
>             if (c >= 0x80 || c == 0)
>                 utflen += (c >= 0x800) ? 2 : 1;
>         }
>
> @Martin, you'd need utflen to be a long to avoid the original
> overflow-to-negative problem for an input string length > MAXINT / 3.
> And that might be the cleanest approach, compared to adding extra
> conditions in the loop.
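
For reference, counting into a long instead (so the count can never wrap
and the single range check suffices) might look something like:

        long utflen = strlen;    // at most 3 * strlen, which always fits in a long
        for (int i = 0; i < strlen; i++) {
            int c = str.charAt(i);
            if (c >= 0x80 || c == 0)
                utflen += (c >= 0x800) ? 2 : 1;
        }
        if (utflen > 65535)
            throw new UTFDataFormatException(
                "encoded string too long: " + utflen + " bytes");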
>
> Otherwise, this code looks good, it might optimize to fewer branches than
> the original.
>
>
>         if (utflen > 65535)
>             throw new UTFDataFormatException(
>                 "encoded string too long: " + utflen + " bytes");
>
>         final byte[] bytearr;
>         if (out instanceof DataOutputStream) {
>             DataOutputStream dos = (DataOutputStream)out;
>             if (dos.bytearr == null || (dos.bytearr.length < (utflen + 2)))
>                 dos.bytearr = new byte[(utflen*2) + 2];
>             bytearr = dos.bytearr;
>         } else {
>             bytearr = new byte[utflen + 2];
>         }
>
>         int count = 0;
>         bytearr[count++] = (byte) ((utflen >>> 8) & 0xFF);
>         bytearr[count++] = (byte) ((utflen >>> 0) & 0xFF);
>
>         int i = 0;
>         for (i = 0; i < strlen; i++) { // optimized for initial run of ASCII
>             int c = str.charAt(i);
>             if (c >= 0x80 || c == 0) break;
>             bytearr[count++] = (byte) c;
>         }
>
>         for (; i < strlen; i++) {
>             int c = str.charAt(i);
>             if (c < 0x80 && c != 0) {
>                 bytearr[count++] = (byte) c;
>             }
>             else if (c >= 0x800) {
>                 bytearr[count++] = (byte) (0xE0 | ((c >> 12) & 0x0F));
>                 bytearr[count++] = (byte) (0x80 | ((c >>  6) & 0x3F));
>                 bytearr[count++] = (byte) (0x80 | ((c >>  0) & 0x3F));
>             } else {
>                 bytearr[count++] = (byte) (0xC0 | ((c >>  6) & 0x1F));
>                 bytearr[count++] = (byte) (0x80 | ((c >>  0) & 0x3F));
>             }
>         }
>         out.write(bytearr, 0, utflen + 2);
>         return utflen + 2;
>     }
>
>
> On Mon, Mar 18, 2019 at 6:41 AM Roger Riggs <Roger.Riggs at oracle.com>
> wrote:
>
>> Hi Brian,
>>
>> On 03/15/2019 05:06 PM, Brian Burkhalter wrote:
>>
>> Updated: http://cr.openjdk.java.net/~bpb/8219196/webrev.01/
>>
>>
>> I'm at odds with Martin on including utflen in the loop.  (Or that's not
>> what he meant).
>> Now it has to do two comparisons for every input byte instead of 1.
>>
>> There is no need to have an early exit for input strings that are too
>> long.
>> It is a very unusual case, probably never except in testing.
>>
>> There might be a case for checking up front that strlen <= 65535
>> (even at one output byte per input char, a longer string would be too large).
>> But even that is probably not worth the compare and branch.
>>
>
> The test only needs enough memory to create the input string (MAXINT/3+1
> chars) and a bit more for the default-sized ByteArrayOutputStream.
> So it should run in 2G configurations.
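
A rough sketch of that scenario (not the actual test; the class name and
the String.repeat construction are just illustrative, and it needs roughly
a 2g heap):

    import java.io.*;

    public class WriteUTFOverflow {
        public static void main(String[] args) throws IOException {
            // Integer.MAX_VALUE/3 + 1 three-byte chars: the worst-case
            // modified UTF-8 length (3 bytes per char) overflows an int.
            String s = "\u0800".repeat(Integer.MAX_VALUE / 3 + 1);
            DataOutputStream out =
                new DataOutputStream(new ByteArrayOutputStream());
            try {
                out.writeUTF(s);
                throw new AssertionError("expected UTFDataFormatException");
            } catch (UTFDataFormatException expected) {
                System.out.println("threw as expected: " + expected);
            }
        }
    }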
>
> So yes, removing the enabled = false is ok.
>
> Roger
>
>
>
>>
>> Instead of disabling the test statically, you could make it conditional
>> on Runtime.maxMemory but the test will fail quickly anyway.
>>
>>
>> For this I simply added the requirement to the jtreg tags instead.
>>
>> With the fix, it will throw immediately before any allocation, so there
>> is
>> no need for a restriction on memory size.
>> The only time that would be necessary would be running the test on a
>> runtime without the fix.
>>
>> The @requires >4g restriction is unnecessary.
>>
>> $.02, Roger
>>
>>
>>
>

