CharsetEncoder.maxBytesPerChar()

naoto.sato at oracle.com
Fri Sep 20 22:18:58 UTC 2019


Hi Mark,

Thank you for the crystal clear explanation. I will go ahead and clarify 
the method description.

Naoto

On 9/20/19 3:03 PM, mark.reinhold at oracle.com wrote:
> 2019/9/20 13:25:38 -0700, naoto.sato at oracle.com:
>> I am looking at the following bug:
>>
>> https://bugs.openjdk.java.net/browse/JDK-8230531
>>
>> and hoping someone who is familiar with the encoder can clear things
>> up. As in the bug report, the method description reads:
>>
>> --
>> Returns the maximum number of bytes that will be produced for each
>> character of input. This value may be used to compute the worst-case
>> size of the output buffer required for a given input sequence.
>> --
>>
>> Initially I thought it would return the maximum number of encoded bytes
>> for an arbitrary input "char" value, i.e., a code unit of the UTF-16
>> encoding. For example, any UTF-16 Charset (UTF-16, UTF-16BE, and
>> UTF-16LE) would return 2 from the method, as the code unit is a 16-bit
>> value. In reality, the encoder of the UTF-16 Charset returns 4, which
>> accounts for the initial byte-order mark (2 bytes for a code unit, plus
>> the size of the BOM).
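>>
>> A quick check along these lines shows the values in question (just a
>> sketch; the class name is arbitrary):
>>
>>     import java.nio.charset.Charset;
>>
>>     public class MaxBytesCheck {
>>         public static void main(String[] args) {
>>             for (String name : new String[] { "UTF-16", "UTF-16BE", "UTF-16LE" }) {
>>                 Charset cs = Charset.forName(name);
>>                 // UTF-16 reports 4.0 (2 bytes per code unit plus a possible BOM);
>>                 // UTF-16BE and UTF-16LE report 2.0, since they never write a BOM
>>                 System.out.println(name + ": " + cs.newEncoder().maxBytesPerChar());
>>             }
>>         }
>>     }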
> 
> Exactly.  A comment in the implementation, in sun.nio.cs.UnicodeEncoder,
> mentions this (perhaps you already saw it):
> 
>      protected UnicodeEncoder(Charset cs, int bo, boolean m) {
>          super(cs, 2.0f,
>                // Four bytes max if you need a BOM
>                m ? 4.0f : 2.0f,
>                // Replacement depends upon byte order
>                ((bo == BIG)
>                 ? new byte[] { (byte)0xff, (byte)0xfd }
>                 : new byte[] { (byte)0xfd, (byte)0xff }));
>          usesMark = needsMark = m;
>          byteOrder = bo;
>      }
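> 
> (For reference, the second and third arguments to that super() call are
> averageBytesPerChar and maxBytesPerChar, so the 4.0f is exactly the
> value that maxBytesPerChar() reports for the BOM-writing UTF-16
> encoder.)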
> 
>>                    This is justifiable because it is meant to be the
>> worst-case scenario, though. I believe this implementation has been
>> there since the inception of java.nio, i.e., JDK 1.4.
> 
> Yes, it has.
> 
>> Obviously I can clarify the spec of maxBytesPerChar() to account for
>> the conversion-independent prefix (or suffix) bytes, such as the BOM,
>> but I am not sure of the initial intent of the method. If it is meant
>> to return the pure maximum bytes for a single input char, UTF-16 should
>> also have been returning 2. But in that case, the caller would not be
>> able to calculate the worst-case byte buffer size as in the bug report.
> 
> The original intent is that the return value of this method can be used
> to allocate a buffer that is guaranteed to be large enough for any
> possible output.  Returning 2 for UTF-16 would, as you observe, not work
> for that purpose.
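> 
> In code, the intended usage is roughly the following (a sketch; the
> class and method names are made up for illustration):
> 
>     import java.nio.ByteBuffer;
>     import java.nio.CharBuffer;
>     import java.nio.charset.CharsetEncoder;
>     import java.nio.charset.StandardCharsets;
> 
>     public class WorstCaseEncode {
>         static ByteBuffer encode(String input) {
>             CharsetEncoder enc = StandardCharsets.UTF_16.newEncoder();
>             // Worst case: every char may need maxBytesPerChar() bytes, and
>             // for UTF-16 that figure already covers the byte-order mark
>             int maxBytes = (int) Math.ceil(input.length() * enc.maxBytesPerChar());
>             ByteBuffer out = ByteBuffer.allocate(maxBytes);
>             enc.encode(CharBuffer.wrap(input), out, true);
>             enc.flush(out);
>             out.flip();  // make the encoded bytes readable
>             return out;
>         }
>     }
> 
> If the method returned 2 for UTF-16, maxBytes above would be two bytes
> short whenever the encoder emits the byte-order mark.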
> 
> To avoid this confusion, a more verbose specification might read:
> 
>       * Returns the maximum number of $otype$s that will be produced for each
>       * $itype$ of input.  This value may be used to compute the worst-case size
>       * of the output buffer required for a given input sequence. This value
>       * accounts for any necessary content-independent prefix or suffix
> #if[encoder]
>       * $otype$s, such as byte-order marks.
> #end[encoder]
> #if[decoder]
>       * $otype$s.
> #end[decoder]
> 
> (The example of byte-order marks applies only to CharsetEncoders, so
>   I’ve conditionalized that text for Charset-X-Coder.java.template.)
> 
> - Mark
> 