String encodeUTF8 latin1 with negatives

Brett Okken brett.okken.os at gmail.com
Sun Jul 27 20:45:46 UTC 2025


In String.encodeUTF8, when the coder is LATIN1, StringCoding.hasNegatives is
called to determine whether any special handling is needed.
If there are no negative bytes, a clone of val is returned.
If there are, the method loops over every byte again, from the beginning,
and encodes each negative byte as two UTF-8 bytes.

Would it be better to call StringCoding.countPositives instead? If the result
equals the length, the clone can still be returned. If it does not, the
leading bytes already known to be non-negative can simply be copied to the
target byte[], and only the bytes from that point on need to be checked
individually.

https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/String.java#L1287-L1300

        if (!StringCoding.hasNegatives(val, 0, val.length)) {
            return val.clone();
        }

        int dp = 0;
        byte[] dst = StringUTF16.newBytesFor(val.length);
        for (byte c : val) {
            if (c < 0) {
                dst[dp++] = (byte) (0xc0 | ((c & 0xff) >> 6));
                dst[dp++] = (byte) (0x80 | (c & 0x3f));
            } else {
                dst[dp++] = c;
            }
        }


This could be changed to look like:

        int positives = StringCoding.countPositives(val, 0, val.length);
        if (positives == val.length) {
            return val.clone();
        }

        int dp = positives;
        byte[] dst = StringUTF16.newBytesFor(val.length);
        if (positives > 0) {
            System.arraycopy(val, 0, dst, 0, positives);
        }
        for (int i = dp; i < val.length; ++i) {
            byte c = val[i];
            if (c < 0) {
                dst[dp++] = (byte) (0xc0 | ((c & 0xff) >> 6));
                dst[dp++] = (byte) (0x80 | (c & 0x3f));
            } else {
                dst[dp++] = c;
            }
        }
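
For anyone who wants to try the idea outside the JDK, here is a minimal
standalone sketch of the proposed shape. It is not the JDK code: the
countPositives method below is a hypothetical plain-Java stand-in for the
internal StringCoding.countPositives, new byte[val.length << 1] stands in for
StringUTF16.newBytesFor, and the final Arrays.copyOf trim is my assumption
about what happens after the excerpt above. The class and method names are
made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Latin1Utf8Sketch {

    // Hypothetical stand-in for the JDK-internal StringCoding.countPositives:
    // returns the number of leading non-negative bytes in ba[off, off + len).
    static int countPositives(byte[] ba, int off, int len) {
        for (int i = 0; i < len; i++) {
            if (ba[off + i] < 0) {
                return i;
            }
        }
        return len;
    }

    // Proposed shape: bulk-copy the ASCII prefix, then encode only the tail.
    static byte[] encodeLatin1AsUtf8(byte[] val) {
        int positives = countPositives(val, 0, val.length);
        if (positives == val.length) {
            return val.clone();
        }
        // Worst case: every Latin-1 byte expands to two UTF-8 bytes.
        byte[] dst = new byte[val.length << 1];
        System.arraycopy(val, 0, dst, 0, positives);
        int dp = positives;
        for (int i = positives; i < val.length; i++) {
            byte c = val[i];
            if (c < 0) {
                // Bytes 0x80..0xFF become a two-byte UTF-8 sequence.
                dst[dp++] = (byte) (0xc0 | ((c & 0xff) >> 6));
                dst[dp++] = (byte) (0x80 | (c & 0x3f));
            } else {
                dst[dp++] = c;
            }
        }
        // Assumed trimming step; the excerpt above does not show it.
        return Arrays.copyOf(dst, dp);
    }

    public static void main(String[] args) {
        String[] samples = {
            "", "plain ascii", "caf\u00e9", "\u00e9 first", "mid\u00fcdle", "\u00ff\u00ff"
        };
        for (String s : samples) {
            byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
            byte[] expected = s.getBytes(StandardCharsets.UTF_8);
            byte[] actual = encodeLatin1AsUtf8(latin1);
            if (!Arrays.equals(expected, actual)) {
                throw new AssertionError("mismatch for: " + s);
            }
        }
        System.out.println("all samples match");
    }
}
```

The main method only checks that the output agrees with
String.getBytes(StandardCharsets.UTF_8) for a few inputs with the first
negative byte at the start, middle, and end. It says nothing about
performance, since the real gain depends on countPositives being intrinsified.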



I have done a bit of testing with the StringEncode JMH benchmark on my
local Windows machine.

encodeLatin1LongEnd speeds up significantly (~70%)
encodeLatin1LongStart slows down (~20%)
encodeLatin1Mixed speeds up by ~30%

The remaining tests do not show much difference either way.

Brett