CharsetEncoder.maxBytesPerChar()

mark.reinhold at oracle.com
Fri Sep 20 22:03:39 UTC 2019


2019/9/20 13:25:38 -0700, naoto.sato at oracle.com:
> I am looking at the following bug:
> 
> https://bugs.openjdk.java.net/browse/JDK-8230531
> 
> and hoping someone who is familiar with the encoder can clear things 
> up. As in the bug report, the method description reads:
> 
> --
> Returns the maximum number of bytes that will be produced for each 
> character of input. This value may be used to compute the worst-case 
> size of the output buffer required for a given input sequence.
> --
> 
> Initially I thought it would return the maximum number of encoded bytes 
> for an arbitrary input "char" value, i.e. a code unit of UTF-16 
> encoding. For example, any of the UTF-16 Charsets (UTF-16, UTF-16BE, and 
> UTF-16LE) would return 2 from the method, as the code unit is a 16-bit 
> value. In reality, the encoder of the UTF-16 Charset returns 4, which 
> accounts for the initial byte-order mark (2 bytes for a code unit, plus 
> the size of the BOM).

Exactly.  A comment in the implementation, in sun.nio.cs.UnicodeEncoder,
mentions this (perhaps you already saw it):

    protected UnicodeEncoder(Charset cs, int bo, boolean m) {
        super(cs, 2.0f,
*             // Four bytes max if you need a BOM
*             m ? 4.0f : 2.0f,
              // Replacement depends upon byte order
              ((bo == BIG)
               ? new byte[] { (byte)0xff, (byte)0xfd }
               : new byte[] { (byte)0xfd, (byte)0xff }));
        usesMark = needsMark = m;
        byteOrder = bo;
    }
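
A quick way to see the effect, sketched against a current JDK (the demo
class is mine), is to ask each encoder directly; only the BOM-writing
UTF-16 charset reports the extra two bytes:

    import java.nio.charset.Charset;

    public class MaxBytesDemo {
        public static void main(String[] args) {
            // UTF-16 writes a byte-order mark, so its encoder reports
            // 4.0 (2 bytes for a code unit + 2 bytes for the BOM)
            System.out.println(Charset.forName("UTF-16").newEncoder().maxBytesPerChar());
            // UTF-16BE and UTF-16LE never write a BOM, so they report 2.0
            System.out.println(Charset.forName("UTF-16BE").newEncoder().maxBytesPerChar());
            System.out.println(Charset.forName("UTF-16LE").newEncoder().maxBytesPerChar());
        }
    }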

>                   This is justifiable because it is meant to be the 
> worst-case scenario, though. I believe this implementation has been 
> there since the inception of java.nio, i.e., JDK 1.4.

Yes, it has.

> Obviously I can clarify the spec of maxBytesPerChar() to account for the 
> conversion-independent prefix (or suffix) bytes, such as the BOM, but I 
> am not sure of the initial intent of the method. If it is intended to 
> return the pure maximum bytes for a single input char, UTF-16 should 
> also have returned 2. But in that case, the caller would not be able to 
> calculate the worst-case byte-buffer size as in the bug report.

The original intent is that the return value of this method can be used
to allocate a buffer that is guaranteed to be large enough for any
possible output.  Returning 2 for UTF-16 would, as you observe, not work
for that purpose.
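
To make the intended use concrete, here is a minimal sketch (the class
and helper names are mine, not part of any API): sizing the output
buffer by maxBytesPerChar() guarantees that a single encode() call can
never overflow it, BOM included.

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.CharsetEncoder;
    import java.nio.charset.StandardCharsets;

    public class WorstCaseBuffer {
        // Allocate by the advertised worst case, then encode in one call
        static ByteBuffer encodeWorstCase(CharsetEncoder enc, String s) {
            int cap = (int) Math.ceil(s.length() * (double) enc.maxBytesPerChar());
            ByteBuffer out = ByteBuffer.allocate(cap);
            enc.reset();
            enc.encode(CharBuffer.wrap(s), out, true);  // endOfInput = true
            enc.flush(out);
            out.flip();
            return out;
        }

        public static void main(String[] args) {
            ByteBuffer b = encodeWorstCase(StandardCharsets.UTF_16.newEncoder(), "A");
            System.out.println(b.remaining());  // 4: 2-byte BOM + 2 bytes for 'A'
        }
    }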

To avoid this confusion, a more verbose specification might read:

     * Returns the maximum number of $otype$s that will be produced for each
     * $itype$ of input.  This value may be used to compute the worst-case size
     * of the output buffer required for a given input sequence. This value
     * accounts for any necessary content-independent prefix or suffix
#if[encoder]
     * $otype$s, such as byte-order marks.
#end[encoder]
#if[decoder]
     * $otype$s.
#end[decoder]

(The example of byte-order marks applies only to CharsetEncoders, so
 I’ve conditionalized that text for Charset-X-Coder.java.template.)

- Mark

