RFR: 8372353: API to compute the byte length of a String encoded in a given Charset [v5]
Volkan Yazici
vyazici at openjdk.org
Thu Jan 15 16:33:30 UTC 2026
On Thu, 15 Jan 2026 14:42:34 GMT, Liam Miller-Cushon <cushon at openjdk.org> wrote:
>> This implements an API to return the byte length of a String encoded in a given charset. See [JDK-8372353](https://bugs.openjdk.org/browse/JDK-8372353) for background.
>>
>> ---
>>
>>
>> Benchmark (encoding) (stringLength) Mode Cnt Score Error Units
>> StringLoopJmhBenchmark.getBytes ASCII 10 thrpt 5 406782650.595 ± 16960032.852 ops/s
>> StringLoopJmhBenchmark.getBytes ASCII 100 thrpt 5 172936926.189 ± 4532029.201 ops/s
>> StringLoopJmhBenchmark.getBytes ASCII 1000 thrpt 5 38830681.232 ± 2413274.766 ops/s
>> StringLoopJmhBenchmark.getBytes ASCII 100000 thrpt 5 458881.155 ± 12818.317 ops/s
>> StringLoopJmhBenchmark.getBytes LATIN1 10 thrpt 5 37193762.990 ± 3962947.391 ops/s
>> StringLoopJmhBenchmark.getBytes LATIN1 100 thrpt 5 55400876.236 ± 1267331.434 ops/s
>> StringLoopJmhBenchmark.getBytes LATIN1 1000 thrpt 5 11104514.001 ± 41718.545 ops/s
>> StringLoopJmhBenchmark.getBytes LATIN1 100000 thrpt 5 182535.414 ± 10296.120 ops/s
>> StringLoopJmhBenchmark.getBytes UTF16 10 thrpt 5 113474681.457 ± 8326589.199 ops/s
>> StringLoopJmhBenchmark.getBytes UTF16 100 thrpt 5 37854103.127 ± 4808526.773 ops/s
>> StringLoopJmhBenchmark.getBytes UTF16 1000 thrpt 5 4139833.009 ± 70636.784 ops/s
>> StringLoopJmhBenchmark.getBytes UTF16 100000 thrpt 5 57644.637 ± 1887.112 ops/s
>> StringLoopJmhBenchmark.getBytesLength ASCII 10 thrpt 5 946701647.247 ± 76938927.141 ops/s
>> StringLoopJmhBenchmark.getBytesLength ASCII 100 thrpt 5 396615374.479 ± 15167234.884 ops/s
>> StringLoopJmhBenchmark.getBytesLength ASCII 1000 thrpt 5 100464784.979 ± 794027.897 ops/s
>> StringLoopJmhBenchmark.getBytesLength ASCII 100000 thrpt 5 1215487.689 ± 1916.468 ops/s
>> StringLoopJmhBenchmark.getBytesLength LATIN1 10 thrpt 5 221265102.323 ± 17013983.056 ops/s
>> StringLoopJmhBenchmark.getBytesLength LATIN1 100 thrpt 5 137617873.887 ± 5842185.781 ops/s
>> StringLoopJmhBenchmark.getBytesLength LATIN1 1000 thrpt 5 92540259.1...
>
> Liam Miller-Cushon has updated the pull request incrementally with one additional commit since the last revision:
>
> Update tests
src/java.base/share/classes/java/lang/String.java line 1097:
> 1095: return dp;
> 1096: }
> 1097:
Verified that this is indeed identical to `encodeASCII` such that only the length computation is taken into account.
src/java.base/share/classes/java/lang/String.java line 1512:
> 1510: return dp;
> 1511: }
> 1512:
Verified that this is indeed identical to `encodeUTF8` such that only the length computation is taken into account.
src/java.base/share/classes/java/lang/String.java line 1585:
> 1583:
> 1584: // This follows the implementation of encodeUTF8_UTF16
> 1585: private static int encodedLengthUTF8_UTF16(byte[] val) {
Doesn't this duplicate the `computeSizeUTF8_UTF16`?
AFAICS, `computeSizeUTF8_UTF16` is missing the ASCII fast loop, but we can enhance it.
FWIW, if we decide reuse `computeSizeUTF8_UTF16`, it might be better to rename it to `encodedLengthUTF8_UTF16`, which will be in line with the introduced `encodedLength*` method family.
src/java.base/share/classes/java/lang/String.java line 2143:
> 2141: */
> 2142: public int getBytesLength(Charset cs) {
> 2143: if (cs == UTF_8.INSTANCE) {
It'd be nice to catch null values as early as possible. I suggest adding a `Objects.requireNonNull(cs)` along with `@throws NullPointerException If {@code cs} is null` in docs.
src/java.base/share/classes/java/lang/String.java line 2148:
> 2146: if (isLatin1()) {
> 2147: return value.length;
> 2148: }
Any particular reason you avoided introducing a `encodedLength8859_1` here? (There is a `encode8859_1` method.)
src/java.base/share/classes/java/lang/String.java line 2151:
> 2149: } else if (cs == US_ASCII.INSTANCE) {
> 2150: return encodedLengthASCII(coder, value);
> 2151: } else if (cs instanceof sun.nio.cs.UTF_16LE || cs instanceof sun.nio.cs.UTF_16BE) {
I see that `sun.nio.cs.UTF_16{LE,BE}` specialization is suggested by @ExE-Boss [here]. Though I'm not really sure if this is really needed. I cannot spot any other usage of these constants in `java.base`, except `jdk.internal.foreign.StringSupport`, which is irrelevant.
[here]: https://github.com/openjdk/jdk/pull/28454/files#r2552768341
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/28454#discussion_r2695007692
PR Review Comment: https://git.openjdk.org/jdk/pull/28454#discussion_r2695024667
PR Review Comment: https://git.openjdk.org/jdk/pull/28454#discussion_r2695036430
PR Review Comment: https://git.openjdk.org/jdk/pull/28454#discussion_r2695065623
PR Review Comment: https://git.openjdk.org/jdk/pull/28454#discussion_r2695076034
PR Review Comment: https://git.openjdk.org/jdk/pull/28454#discussion_r2695105076
More information about the core-libs-dev
mailing list