CharsetEncoder.maxBytesPerChar()
naoto.sato at oracle.com
Fri Sep 20 20:25:38 UTC 2019
Hello,
I am looking at the following bug:
https://bugs.openjdk.java.net/browse/JDK-8230531
and hoping someone who is familiar with the encoder can clear things
up. As noted in the bug report, the method description reads:
--
Returns the maximum number of bytes that will be produced for each
character of input. This value may be used to compute the worst-case
size of the output buffer required for a given input sequence.
--
Initially I thought it would return the maximum number of encoded bytes
for an arbitrary input "char" value, i.e., a code unit of the UTF-16
encoding. For example, any of the UTF-16 charsets (UTF-16, UTF-16BE, and
UTF-16LE) would return 2 from the method, as a code unit is a 16-bit
value. In reality, the encoder of the UTF-16 charset returns 4, which
accounts for the initial byte-order mark (2 bytes for a code unit, plus
the size of the BOM). This is justifiable, though, since the value is
meant to represent the worst-case scenario. I believe this
implementation has been there since the inception of java.nio, i.e.,
JDK 1.4.
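For reference, here is a small sketch (the class name and input string
are just illustrative) that queries the reported values and shows the
BOM appearing in the UTF-16 output:
--
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MaxBytesPerCharDemo {
    public static void main(String[] args) throws Exception {
        // Reported maxBytesPerChar() values; only UTF-16 reserves room for a BOM
        for (Charset cs : new Charset[] {
                StandardCharsets.UTF_16,      // reports 4.0
                StandardCharsets.UTF_16BE,    // reports 2.0
                StandardCharsets.UTF_16LE}) { // reports 2.0
            System.out.println(cs + ": " + cs.newEncoder().maxBytesPerChar());
        }

        // Encoding a single char with the UTF-16 charset produces 4 bytes:
        // 2 for the byte-order mark plus 2 for the code unit itself.
        ByteBuffer bb = StandardCharsets.UTF_16.newEncoder()
                .encode(CharBuffer.wrap("A"));
        System.out.println("UTF-16 bytes for \"A\": " + bb.remaining());
    }
}
--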
Obviously I can clarify the spec of maxBytesPerChar() to account for
conversion-independent prefix (or suffix) bytes, such as a BOM, but I am
not sure of the initial intent of the method. If it is intended to return
the pure maximum number of bytes for a single input char, UTF-16 should
also have been returning 2. But in that case, the caller would not be
able to calculate the worst-case byte buffer size as described in the
bug report.
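To make that concrete, here is a minimal sketch of the sizing
calculation I believe the spec intends (names are mine; it assumes the
current UTF-16 value of 4):
--
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class WorstCaseBuffer {
    public static void main(String[] args) {
        CharsetEncoder enc = StandardCharsets.UTF_16.newEncoder();
        String input = "A"; // a single char

        // Worst-case sizing as the spec suggests: length * maxBytesPerChar()
        int capacity = (int) Math.ceil(input.length() * enc.maxBytesPerChar());
        ByteBuffer out = ByteBuffer.allocate(capacity);

        CoderResult cr = enc.encode(CharBuffer.wrap(input), out, true);
        enc.flush(out);

        // With maxBytesPerChar() == 4 the buffer holds BOM + code unit (4 bytes).
        // If the method returned 2, this allocation would not fit the BOM.
        System.out.println(cr + ", bytes written = " + out.position());
    }
}
--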
Naoto