<i18n dev> RFR: 8372353: API to compute the byte length of a String encoded in a given Charset [v17]

Eirik Bjørsnøs eirbjo at openjdk.org
Tue Feb 10 12:23:45 UTC 2026


On Fri, 30 Jan 2026 15:56:20 GMT, Liam Miller-Cushon <cushon at openjdk.org> wrote:

>> This implements an API to return the byte length of a String encoded in a given charset. See [JDK-8372353](https://bugs.openjdk.org/browse/JDK-8372353) for background.
>> 
>> ---
>> 
>> 
>> Benchmark                              (encoding)  (stringLength)   Mode  Cnt          Score          Error  Units
>> StringLoopJmhBenchmark.getBytes             ASCII              10  thrpt    5  406782650.595 ± 16960032.852  ops/s
>> StringLoopJmhBenchmark.getBytes             ASCII             100  thrpt    5  172936926.189 ±  4532029.201  ops/s
>> StringLoopJmhBenchmark.getBytes             ASCII            1000  thrpt    5   38830681.232 ±  2413274.766  ops/s
>> StringLoopJmhBenchmark.getBytes             ASCII          100000  thrpt    5     458881.155 ±    12818.317  ops/s
>> StringLoopJmhBenchmark.getBytes            LATIN1              10  thrpt    5   37193762.990 ±  3962947.391  ops/s
>> StringLoopJmhBenchmark.getBytes            LATIN1             100  thrpt    5   55400876.236 ±  1267331.434  ops/s
>> StringLoopJmhBenchmark.getBytes            LATIN1            1000  thrpt    5   11104514.001 ±    41718.545  ops/s
>> StringLoopJmhBenchmark.getBytes            LATIN1          100000  thrpt    5     182535.414 ±    10296.120  ops/s
>> StringLoopJmhBenchmark.getBytes             UTF16              10  thrpt    5  113474681.457 ±  8326589.199  ops/s
>> StringLoopJmhBenchmark.getBytes             UTF16             100  thrpt    5   37854103.127 ±  4808526.773  ops/s
>> StringLoopJmhBenchmark.getBytes             UTF16            1000  thrpt    5    4139833.009 ±    70636.784  ops/s
>> StringLoopJmhBenchmark.getBytes             UTF16          100000  thrpt    5      57644.637 ±     1887.112  ops/s
>> StringLoopJmhBenchmark.getBytesLength       ASCII              10  thrpt    5  946701647.247 ± 76938927.141  ops/s
>> StringLoopJmhBenchmark.getBytesLength       ASCII             100  thrpt    5  396615374.479 ± 15167234.884  ops/s
>> StringLoopJmhBenchmark.getBytesLength       ASCII            1000  thrpt    5  100464784.979 ±   794027.897  ops/s
>> StringLoopJmhBenchmark.getBytesLength       ASCII          100000  thrpt    5    1215487.689 ±     1916.468  ops/s
>> StringLoopJmhBenchmark.getBytesLength      LATIN1              10  thrpt    5  221265102.323 ± 17013983.056  ops/s
>> StringLoopJmhBenchmark.getBytesLength      LATIN1             100  thrpt    5  137617873.887 ±  5842185.781  ops/s
>> StringLoopJmhBenchmark.getBytesLength      LATIN1            1000  thrpt    5   92540259.1...
>
> Liam Miller-Cushon has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Rename getBytesLength to getByteLength

> For completeness, here's a demo of it in `CharsetEncoder` (#29639). As expected it's possible to implement it that way and preserve equivalent performance, by adding a package visibility method to `String` and using `JavaLangAccess`. With that change, `string.getByteLength(UTF_8)` could be expressed as:
> 
> ```java
>     try {
>         int byteLength = StandardCharsets.UTF_8.newEncoder()
>                 .onUnmappableCharacter(CodingErrorAction.REPLACE)
>                 .onMalformedInput(CodingErrorAction.REPLACE)
>                 .getByteLength(stringData);
>     } catch (CharacterCodingException e) {
>         throw new IllegalStateException(e);
>     }
> ```
> 
> I can update the CSR to document this as an alternative.

This looks verbose at first sight, but I like that it gives control over the coding error actions, which makes it possible to validate the input and compute the length in a single pass.
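To illustrate what I mean by validating and counting in one pass: something similar can already be approximated with the existing `CharsetEncoder` API by encoding into a small, reused scratch buffer and summing the bytes produced. This is only a sketch under my own assumptions; `EncodedLengthSketch` and `encodedLength` are hypothetical names that are not part of the PR or the JDK, and I would expect it to be slower than the optimized path in the demo.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class EncodedLengthSketch {

    // Hypothetical helper: counts encoded bytes without materializing the full byte[].
    // REPORT makes malformed or unmappable input surface as a CharacterCodingException.
    static long encodedLength(CharSequence s, Charset cs) throws CharacterCodingException {
        CharsetEncoder encoder = cs.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        CharBuffer in = CharBuffer.wrap(s);
        ByteBuffer scratch = ByteBuffer.allocate(1024); // reused; contents are discarded
        long total = 0;
        while (true) {
            CoderResult result = encoder.encode(in, scratch, true);
            total += scratch.position();
            scratch.clear();
            if (result.isUnderflow()) {
                break; // all input consumed
            }
            if (result.isError()) {
                result.throwException(); // validation failure
            }
            // Otherwise overflow: the scratch buffer was full, so keep encoding.
        }
        while (true) {
            CoderResult result = encoder.flush(scratch);
            total += scratch.position();
            scratch.clear();
            if (result.isUnderflow()) {
                break;
            }
            if (result.isError()) {
                result.throwException();
            }
        }
        return total;
    }

    public static void main(String[] args) throws CharacterCodingException {
        // "h\u00e9llo" has one two-byte character in UTF-8.
        System.out.println(encodedLength("h\u00e9llo", StandardCharsets.UTF_8));
    }
}
```

The point of the sketch is only that the error actions configured on the encoder decide whether bad input is counted (`REPLACE`) or rejected (`REPORT`); a dedicated method could offer the same choice without the scratch buffer.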

Your demo seems to optimize only for `CodingErrorAction.REPLACE`, but that's probably more of an implementation detail than a limitation of the API design, right?

The demo focuses on the encoding side, but for completeness I guess the decoding side (with validation) could look like:

```java
    try {
        int stringLength = StandardCharsets.UTF_8.newDecoder()
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .onMalformedInput(CodingErrorAction.REPORT)
                .getDecodedLength(byteData);
    } catch (CharacterCodingException e) {
        throw new IllegalStateException(e);
    }
```
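For comparison, the decoding side can be approximated with the existing `CharsetDecoder` API in the same spirit: decode into a small, reused `CharBuffer` and sum the chars produced. Again this is only a sketch; `DecodedLengthSketch` and `decodedLength` are hypothetical names that do not exist in the JDK or the PR, and the result is counted in UTF-16 code units, matching `String.length()`.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class DecodedLengthSketch {

    // Hypothetical helper: validates the bytes and counts the decoded chars
    // (UTF-16 code units) without building the String.
    static long decodedLength(byte[] bytes, Charset cs) throws CharacterCodingException {
        CharsetDecoder decoder = cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(bytes);
        CharBuffer scratch = CharBuffer.allocate(1024); // reused; contents are discarded
        long total = 0;
        while (true) {
            CoderResult result = decoder.decode(in, scratch, true);
            total += scratch.position();
            scratch.clear();
            if (result.isUnderflow()) {
                break; // all input consumed
            }
            if (result.isError()) {
                result.throwException(); // malformed or unmappable bytes
            }
            // Otherwise overflow: the scratch buffer was full, so keep decoding.
        }
        while (true) {
            CoderResult result = decoder.flush(scratch);
            total += scratch.position();
            scratch.clear();
            if (result.isUnderflow()) {
                break;
            }
            if (result.isError()) {
                result.throwException();
            }
        }
        return total;
    }

    public static void main(String[] args) throws CharacterCodingException {
        byte[] utf8 = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodedLength(utf8, StandardCharsets.UTF_8));
    }
}
```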

Did creating the stateful `CharsetEncoder` meaningfully affect your benchmark results?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/28454#issuecomment-3877259235

