<i18n dev> RFR: 8372353: API to compute the byte length of a String encoded in a given Charset [v17]
Eirik Bjørsnøs
eirbjo at openjdk.org
Mon Feb 9 21:31:47 UTC 2026
On Fri, 30 Jan 2026 15:56:20 GMT, Liam Miller-Cushon <cushon at openjdk.org> wrote:
>> This implements an API to return the byte length of a String encoded in a given charset. See [JDK-8372353](https://bugs.openjdk.org/browse/JDK-8372353) for background.
>>
>> ---
>>
>>
>> Benchmark                              (encoding)  (stringLength)   Mode  Cnt          Score          Error  Units
>> StringLoopJmhBenchmark.getBytes             ASCII              10  thrpt    5  406782650.595 ± 16960032.852  ops/s
>> StringLoopJmhBenchmark.getBytes             ASCII             100  thrpt    5  172936926.189 ±  4532029.201  ops/s
>> StringLoopJmhBenchmark.getBytes             ASCII            1000  thrpt    5   38830681.232 ±  2413274.766  ops/s
>> StringLoopJmhBenchmark.getBytes             ASCII          100000  thrpt    5     458881.155 ±    12818.317  ops/s
>> StringLoopJmhBenchmark.getBytes            LATIN1              10  thrpt    5   37193762.990 ±  3962947.391  ops/s
>> StringLoopJmhBenchmark.getBytes            LATIN1             100  thrpt    5   55400876.236 ±  1267331.434  ops/s
>> StringLoopJmhBenchmark.getBytes            LATIN1            1000  thrpt    5   11104514.001 ±    41718.545  ops/s
>> StringLoopJmhBenchmark.getBytes            LATIN1          100000  thrpt    5     182535.414 ±    10296.120  ops/s
>> StringLoopJmhBenchmark.getBytes             UTF16              10  thrpt    5  113474681.457 ±  8326589.199  ops/s
>> StringLoopJmhBenchmark.getBytes             UTF16             100  thrpt    5   37854103.127 ±  4808526.773  ops/s
>> StringLoopJmhBenchmark.getBytes             UTF16            1000  thrpt    5    4139833.009 ±    70636.784  ops/s
>> StringLoopJmhBenchmark.getBytes             UTF16          100000  thrpt    5      57644.637 ±     1887.112  ops/s
>> StringLoopJmhBenchmark.getBytesLength       ASCII              10  thrpt    5  946701647.247 ± 76938927.141  ops/s
>> StringLoopJmhBenchmark.getBytesLength       ASCII             100  thrpt    5  396615374.479 ± 15167234.884  ops/s
>> StringLoopJmhBenchmark.getBytesLength       ASCII            1000  thrpt    5  100464784.979 ±   794027.897  ops/s
>> StringLoopJmhBenchmark.getBytesLength       ASCII          100000  thrpt    5    1215487.689 ±     1916.468  ops/s
>> StringLoopJmhBenchmark.getBytesLength      LATIN1              10  thrpt    5  221265102.323 ± 17013983.056  ops/s
>> StringLoopJmhBenchmark.getBytesLength      LATIN1             100  thrpt    5  137617873.887 ±  5842185.781  ops/s
>> StringLoopJmhBenchmark.getBytesLength      LATIN1            1000  thrpt    5   92540259.1...
>
> Liam Miller-Cushon has updated the pull request incrementally with one additional commit since the last revision:
>
> Rename getBytesLength to getByteLength
Should we also consider the inverse operation, that is, computing the length a String would have if it were decoded from a sequence of bytes?
`new String(byte[], Charset).length()`
Someone will eventually ask for this. I see a potential use case in the `ZipFile` implementation, where knowing the length ahead of decoding could let us reject strings efficiently, without decoding them and without looking at String contents.
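To illustrate what "length without decoding" could look like, here is a minimal sketch for UTF-8 only. It assumes the input is well-formed UTF-8 (malformed input, which `new String` would replace with U+FFFD, is not handled), and `decodedLengthUtf8` is a hypothetical helper name, not an existing or proposed JDK method:

```java
import java.nio.charset.StandardCharsets;

class Utf8DecodedLength {
    // Hypothetical helper: counts the UTF-16 code units that the bytes
    // would decode to. Assumes well-formed UTF-8; no validation is done.
    static int decodedLengthUtf8(byte[] bytes) {
        int chars = 0;
        for (byte b : bytes) {
            int u = b & 0xFF;
            // Continuation bytes (10xxxxxx) do not start a code point.
            if ((u & 0xC0) != 0x80) {
                // A 4-byte sequence (lead byte >= 0xF0) decodes to a
                // surrogate pair, i.e. two chars; everything else is one.
                chars += (u >= 0xF0) ? 2 : 1;
            }
        }
        return chars;
    }

    public static void main(String[] args) {
        String[] samples = {"hello", "h\u00e9llo", "\u20ac", "\uD83D\uDE00"};
        for (String s : samples) {
            byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
            // Baseline: decode fully, then ask for the length.
            int baseline = new String(bytes, StandardCharsets.UTF_8).length();
            System.out.println(decodedLengthUtf8(bytes) + " vs baseline " + baseline);
        }
    }
}
```

The loop never allocates a String or a char array, which is the property that would matter for an early-rejection check.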
Not saying we need to add it now, just that the name chosen here should leave room for a future addition of this inverse operation.
Something like:
str.getEncodedLength(Charset); // Encoded length of this string
String.getDecodedLength(byte[], Charset); // Decoded length of byte sequence
or, with the current scheme:
str.getByteLength(Charset); // Encoded length of this string
String.getStringLength(byte[], Charset); // Decoded length of byte sequence
EDIT:
Moving this out of `java.lang.String` and onto `Charset` unlocks:
* Symmetry, in that both operations can be instance methods
* We would be free to support `ByteBuffer` and any `CharSequence`, not just strings:
Charset cs = StandardCharsets.UTF_8;
String h = "hello";
byte[] bytes = h.getBytes(cs);
cs.encodedLength(CharBuffer.wrap(h));
cs.encodedLength(new StringBuilder(h));
cs.decodedLength(bytes);
cs.decodedLength(ByteBuffer.wrap(bytes));
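For illustration only, both operations can be approximated today with `CharsetEncoder` and `CharsetDecoder` and a small scratch buffer. The `CharsetLengths` class and its static methods are hypothetical stand-ins for the `encodedLength`/`decodedLength` methods sketched above. Note that this still does the full transcoding work and only avoids allocating the complete output; the `REPLACE` error actions are an assumption meant to resemble `String.getBytes` / `new String`, not a verified match in every case:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class CharsetLengths {
    // Hypothetical helper: number of bytes `chars` would encode to in `cs`.
    static long encodedLength(Charset cs, CharSequence chars) {
        CharsetEncoder enc = cs.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        CharBuffer in = CharBuffer.wrap(chars);
        ByteBuffer scratch = ByteBuffer.allocate(1024); // reused, never kept
        long total = 0;
        CoderResult r;
        do { // encode in chunks, counting and discarding the output
            r = enc.encode(in, scratch, true);
            total += scratch.position();
            scratch.clear();
        } while (r.isOverflow());
        do { // some encoders emit trailing bytes on flush
            r = enc.flush(scratch);
            total += scratch.position();
            scratch.clear();
        } while (r.isOverflow());
        return total;
    }

    // Hypothetical helper: number of chars `bytes` would decode to in `cs`.
    static long decodedLength(Charset cs, ByteBuffer bytes) {
        CharsetDecoder dec = cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer in = bytes.duplicate(); // leave the caller's position alone
        CharBuffer scratch = CharBuffer.allocate(1024);
        long total = 0;
        CoderResult r;
        do {
            r = dec.decode(in, scratch, true);
            total += scratch.position();
            scratch.clear();
        } while (r.isOverflow());
        do {
            r = dec.flush(scratch);
            total += scratch.position();
            scratch.clear();
        } while (r.isOverflow());
        return total;
    }

    public static void main(String[] args) {
        Charset cs = StandardCharsets.UTF_8;
        String h = "hello";
        byte[] bytes = h.getBytes(cs);
        System.out.println(encodedLength(cs, CharBuffer.wrap(h)));
        System.out.println(encodedLength(cs, new StringBuilder(h)));
        System.out.println(decodedLength(cs, ByteBuffer.wrap(bytes)));
    }
}
```

A real implementation inside the JDK could presumably skip the scratch buffer entirely for common charsets; the sketch only shows that the `CharSequence` and `ByteBuffer` shapes of the API fit the existing coder model.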
-------------
PR Comment: https://git.openjdk.org/jdk/pull/28454#issuecomment-3869643511