RFR: 8372353: API to compute the byte length of a String encoded in a given Charset [v5]
Liam Miller-Cushon
cushon at openjdk.org
Thu Jan 15 18:29:42 UTC 2026
On Thu, 15 Jan 2026 16:17:22 GMT, Volkan Yazici <vyazici at openjdk.org> wrote:
>> Liam Miller-Cushon has updated the pull request incrementally with one additional commit since the last revision:
>>
>> Update tests
>
> src/java.base/share/classes/java/lang/String.java line 2143:
>
>> 2141: */
>> 2142: public int getBytesLength(Charset cs) {
>> 2143: if (cs == UTF_8.INSTANCE) {
>
> It'd be nice to catch null values as early as possible. I suggest adding a `Objects.requireNonNull(cs)` along with `@throws NullPointerException If {@code cs} is null` in docs.
I added the `requireNonNull`, omitting the `@throws` as suggested in https://github.com/openjdk/jdk/pull/28454/changes#r2695394410
> src/java.base/share/classes/java/lang/String.java line 2148:
>
>> 2146: if (isLatin1()) {
>> 2147: return value.length;
>> 2148: }
>
> Any particular reason you avoided introducing a `encodedLength8859_1` here? (There is a `encode8859_1` method.)
I have tentatively added `encodedLength8859_1`
`encode8859_1` is implemented in terms of the `implEncodeISOArray`, so it is less similar than the other examples. In general I figured there was a tradeoff between the performance benefit and the additional code to have fast paths for each charset, and UTF-8 may be more frequently used.
> src/java.base/share/classes/java/lang/String.java line 2151:
>
>> 2149: } else if (cs == US_ASCII.INSTANCE) {
>> 2150: return encodedLengthASCII(coder, value);
>> 2151: } else if (cs instanceof sun.nio.cs.UTF_16LE || cs instanceof sun.nio.cs.UTF_16BE) {
>
> I see that `sun.nio.cs.UTF_16{LE,BE}` specialization is suggested by @ExE-Boss [here]. Though I'm not really sure if this is really needed. I cannot spot any other usage of these constants in `java.base`, except `jdk.internal.foreign.StringSupport`, which is irrelevant.
>
> [here]: https://github.com/openjdk/jdk/pull/28454/files#r2552768341
I don't have a strong opinion about these charsets. It's nice that the encoded length for them can be calculated in constant time, but on the other hand if they are less frequently used and there isn't precedent for special casing them in `java.base`, then this part could be dropped.
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/28454#discussion_r2695394410
PR Review Comment: https://git.openjdk.org/jdk/pull/28454#discussion_r2695468134
PR Review Comment: https://git.openjdk.org/jdk/pull/28454#discussion_r2695471999
More information about the core-libs-dev
mailing list