RFR: 8314774: Optimize URLEncoder
Glavo
duke at openjdk.org
Wed Aug 23 23:58:28 UTC 2023
On Wed, 23 Aug 2023 18:51:37 GMT, Daniel Fuchs <dfuchs at openjdk.org> wrote:
> I don't particularly like the idea of embedding the logic of encoding UTF-8 into that class though, that increases the complexity significantly, and Charset encoders are there for that.
Unfortunately, `CharsetEncoder` is too generic. Because we know the target encoding is UTF-8, implementing it inline eliminates unnecessary temporary objects. Some places already do this, such as `String`.
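To illustrate the temporaries in question, here is a minimal sketch of the generic route. It is not code from the PR; the class and method names are hypothetical, and only the `java.nio` calls are real API.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class, not part of the JDK or the PR.
final class GenericEncoderSketch {
    // Encodes a string through the generic CharsetEncoder API.
    // Each call allocates an encoder, a CharBuffer wrapper and a ByteBuffer
    // before the final byte[] is produced.
    static byte[] encodeGeneric(String s) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder(); // temporary object
        try {
            CharBuffer in = CharBuffer.wrap(s);                       // temporary object
            ByteBuffer out = encoder.encode(in);                      // temporary object
            byte[] bytes = new byte[out.remaining()];
            out.get(bytes);
            return bytes;
        } catch (CharacterCodingException e) {
            // Thrown for malformed input such as an unpaired surrogate.
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // U+00E9 is a two-byte character in UTF-8, so "a" plus U+00E9 is 3 bytes.
        System.out.println(encodeGeneric("a\u00e9").length);
    }
}
```

An inline encoder that writes straight into the destination needs none of these three intermediate objects.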
I'm thinking we might be able to extract this logic into a static helper class.

    public class UTF8EncodeUtils {
        public static boolean isSingleByte(char c) { return c < 0x80; }

        // Only meaningful after isSingleByte(c) has returned false.
        public static boolean isDoubleBytes(char c) { return c < 0x800; }

        // Each encode method packs the UTF-8 bytes into an int, with the first
        // byte in the highest used position. Every byte is masked with 0xff
        // before being combined, so a negative byte is not sign-extended.
        public static int encodeDoubleBytes(char c) {
            byte b0 = (byte) (0xc0 | (c >> 6));
            byte b1 = (byte) (0x80 | (c & 0x3f));
            return ((b0 & 0xff) << 8) | (b1 & 0xff);
        }

        // The caller must handle surrogates separately.
        public static int encodeThreeBytes(char c) {
            byte b0 = (byte) (0xe0 | (c >> 12));
            byte b1 = (byte) (0x80 | ((c >> 6) & 0x3f));
            byte b2 = (byte) (0x80 | (c & 0x3f));
            return ((b0 & 0xff) << 16) | ((b1 & 0xff) << 8) | (b2 & 0xff);
        }

        // For supplementary code points (0x10000 to 0x10FFFF).
        public static int encodeCodePoint(int uc) {
            byte b0 = (byte) (0xf0 | (uc >> 18));
            byte b1 = (byte) (0x80 | ((uc >> 12) & 0x3f));
            byte b2 = (byte) (0x80 | ((uc >> 6) & 0x3f));
            byte b3 = (byte) (0x80 | (uc & 0x3f));
            return ((b0 & 0xff) << 24) | ((b1 & 0xff) << 16) | ((b2 & 0xff) << 8) | (b3 & 0xff);
        }
    }

Once we have made sure it adds no overhead, we can use this helper class to reimplement the UTF-8 encoding in `String` and the UTF-8 `CharsetEncoder`, then use it to implement more UTF-8 fast paths.
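As a rough idea of how a caller such as `URLEncoder` could consume packed values like these, here is a minimal, self-contained sketch. It is an assumption about usage, not code from the PR: `PackedUtf8Sketch`, `encodePacked`, `byteCount` and `appendPercentEncoded` are hypothetical names, and the sketch assumes its input is a valid Unicode scalar value (no unpaired surrogates).

```java
// Hypothetical illustration class, not part of the JDK or the PR.
final class PackedUtf8Sketch {
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // Number of UTF-8 bytes needed for a code point.
    static int byteCount(int cp) {
        return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    }

    // Packs the UTF-8 bytes of a code point into an int, with the first byte
    // in the highest used position. Working on ints avoids sign extension.
    static int encodePacked(int cp) {
        if (cp < 0x80) {
            return cp;
        } else if (cp < 0x800) {
            return ((0xc0 | (cp >> 6)) << 8)
                    | (0x80 | (cp & 0x3f));
        } else if (cp < 0x10000) {
            return ((0xe0 | (cp >> 12)) << 16)
                    | ((0x80 | ((cp >> 6) & 0x3f)) << 8)
                    | (0x80 | (cp & 0x3f));
        } else {
            return ((0xf0 | (cp >> 18)) << 24)
                    | ((0x80 | ((cp >> 12) & 0x3f)) << 16)
                    | ((0x80 | ((cp >> 6) & 0x3f)) << 8)
                    | (0x80 | (cp & 0x3f));
        }
    }

    // Appends every encoded byte as %XX, without allocating a byte[] or buffer.
    static void appendPercentEncoded(StringBuilder out, int cp) {
        int packed = encodePacked(cp);
        for (int shift = (byteCount(cp) - 1) * 8; shift >= 0; shift -= 8) {
            int b = (packed >>> shift) & 0xff;
            out.append('%').append(HEX[b >> 4]).append(HEX[b & 0xf]);
        }
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        appendPercentEncoded(sb, 0x20AC); // EURO SIGN
        System.out.println(sb);           // prints %E2%82%AC
    }
}
```

Returning the bytes packed in an `int` keeps the helper allocation-free, at the cost of the caller having to know how many bytes are significant.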
I've also been doing some work on `OutputStreamWriter` recently. Implementing a fast path for UTF-8 gives speedups of over 20x in some cases. I think we may be able to get exciting improvements in more places.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/15354#issuecomment-1690789474