CharsetEncoder.maxBytesPerChar()
naoto.sato at oracle.com
Fri Sep 20 20:25:38 UTC 2019
Hello,
I am looking at the following bug:
https://bugs.openjdk.java.net/browse/JDK-8230531
and hoping someone who is familiar with the encoder can clear things
up. As noted in the bug report, the method description reads:
--
Returns the maximum number of bytes that will be produced for each
character of input. This value may be used to compute the worst-case
size of the output buffer required for a given input sequence.
--
Initially I thought it would return the maximum number of encoded bytes
for an arbitrary input "char" value, i.e., a code unit of the UTF-16
encoding. For example, any of the UTF-16 charsets (UTF-16, UTF-16BE, and
UTF-16LE) would return 2 from the method, as a code unit is a 16-bit
value. In reality, the encoder of the UTF-16 charset returns 4, which
accounts for the initial byte-order mark (2 bytes for a code unit, plus
the size of the BOM). This is justifiable, though, since the value is
meant to represent the worst-case scenario. I believe this
implementation has been there since the inception of java.nio, i.e.,
JDK 1.4.
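For reference, here is a small sketch (the class name and input string
are just illustrative) that queries the reported values and shows the
BOM appearing in the UTF-16 output:
--
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MaxBytesPerCharDemo {
    public static void main(String[] args) throws Exception {
        // Reported maxBytesPerChar() values; only UTF-16 reserves room for a BOM
        for (Charset cs : new Charset[] {
                StandardCharsets.UTF_16,      // reports 4.0
                StandardCharsets.UTF_16BE,    // reports 2.0
                StandardCharsets.UTF_16LE}) { // reports 2.0
            System.out.println(cs + ": " + cs.newEncoder().maxBytesPerChar());
        }

        // Encoding a single char with the UTF-16 charset produces 4 bytes:
        // 2 for the byte-order mark plus 2 for the code unit itself.
        ByteBuffer bb = StandardCharsets.UTF_16.newEncoder()
                .encode(CharBuffer.wrap("A"));
        System.out.println("UTF-16 bytes for \"A\": " + bb.remaining());
    }
}
--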
Obviously I can clarify the spec of maxBytesPerChar() to account for
conversion-independent prefix (or suffix) bytes, such as a BOM, but I am
not sure of the initial intent of the method. If it is intended to return
the pure maximum number of bytes for a single input char, UTF-16 should
also have been returning 2. But in that case, the caller would not be
able to calculate the worst-case byte buffer size as described in the
bug report.
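To make that concrete, here is a minimal sketch of the sizing
calculation I believe the spec intends (names are mine; it assumes the
current UTF-16 value of 4):
--
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class WorstCaseBuffer {
    public static void main(String[] args) {
        CharsetEncoder enc = StandardCharsets.UTF_16.newEncoder();
        String input = "A"; // a single char

        // Worst-case sizing as the spec suggests: length * maxBytesPerChar()
        int capacity = (int) Math.ceil(input.length() * enc.maxBytesPerChar());
        ByteBuffer out = ByteBuffer.allocate(capacity);

        CoderResult cr = enc.encode(CharBuffer.wrap(input), out, true);
        enc.flush(out);

        // With maxBytesPerChar() == 4 the buffer holds BOM + code unit (4 bytes).
        // If the method returned 2, this allocation would not fit the BOM.
        System.out.println(cr + ", bytes written = " + out.position());
    }
}
--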
Naoto