CharsetEncoder.maxBytesPerChar()

Ulf Zibis Ulf.Zibis at CoSoCo.de
Mon Sep 30 09:35:16 UTC 2019


Hey Martin,

Great that you understood my concern. The link you shared is a good
basis for this discussion.

Maybe in some places, e.g. in the "upfront specifications", the term
"UTF-16 char" or "UTF-16 code unit" could additionally be introduced,
with "char" or "{@code char}" then defined as shorthand for it.
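
To illustrate the distinction, here is a minimal sketch of how a Java
"char" (one UTF-16 code unit) differs from a Unicode code point. The
class and helper names are made up for this example; only the
java.lang.String and java.lang.Character calls are real API.

```java
public class CharVsCodePoint {
    // Number of chars, i.e. UTF-16 code units, in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the BMP, so UTF-16 needs a surrogate pair for it.
        String s = new String(Character.toChars(0x1F600));
        System.out.println(codeUnits(s));                           // 2
        System.out.println(codePoints(s));                          // 1
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
    }
}
```

So a single "character" in the unicode.org sense can occupy two chars,
which is why the wording matters for per-char size estimates.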

-Ulf

On 27.09.19 at 15:04, Martin Buchholz wrote:
> Like Ulf, I am sometimes annoyed by the use of the "character"
> misnomer throughout the API docs, and would support an effort to use
> "character" the way that unicode.org uses it.
> "char" no longer represents a Unicode character, but at least it
> provides a short clear name, in the Java language, for "UTF-16 code
> unit" - if we use it consistently!
> https://unicode.org/faq/utf_bom.html#utf16-1
>
> On Thu, Sep 26, 2019 at 2:24 PM <mark.reinhold at oracle.com> wrote:
>
>     2019/9/24 13:00:21 -0700, ulf.zibis at cosoco.de:
>     > On 21.09.19 at 00:03, mark.reinhold at oracle.com wrote:
>     >> To avoid this confusion, a more verbose specification might read:
>     >>     * Returns the maximum number of $otype$s that will be produced for each
>     >>     * $itype$ of input.  This value may be used to compute the worst-case size
>     >>     * of the output buffer required for a given input sequence.  This value
>     >>     * accounts for any necessary content-independent prefix or suffix
>     >> #if[encoder]
>     >>     * $otype$s, such as byte-order marks.
>     >> #end[encoder]
>     >> #if[decoder]
>     >>     * $otype$s.
>     >> #end[decoder]
>     >
>     > Wouldn't it be clearer to use "char" or even "{@code char}" instead
>     > of "character" as the replacement for the $xtype$ parameters?
>
>     The specifications of the Charset{De,En}coder classes make it clear
>     up front that “character” means “sixteen-bit Unicode character,” so
>     I don’t think changing “character” everywhere to “{@code char}” is
>     necessary.
>
>     This usage of “character” is common throughout the API specification.
>     With the introduction of 32-bit Unicode characters we started calling
>     those “code points,” but kept on calling sixteen-bit characters just
>     “characters.”  (I don’t think the official term “Unicode code unit”
>     ever caught on, and it’s a bit of a mouthful anyway.)
>
>     - Mark
>
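
For reference, a minimal sketch of the worst-case sizing that the quoted
specification text describes: multiply the number of input chars (UTF-16
code units, not code points) by maxBytesPerChar(). The class and helper
names are hypothetical; CharsetEncoder.maxBytesPerChar() and
CharsetEncoder.encode(CharBuffer) are the real java.nio.charset API.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class WorstCaseBuffer {
    // Upper bound on the encoded size: input length in chars times maxBytesPerChar().
    static int worstCaseBytes(CharsetEncoder enc, CharSequence input) {
        return (int) Math.ceil(input.length() * (double) enc.maxBytesPerChar());
    }

    // Actual encoded size, using the one-shot convenience method.
    static int encodedBytes(CharsetEncoder enc, CharSequence input) {
        try {
            return enc.encode(CharBuffer.wrap(input)).remaining();
        } catch (CharacterCodingException e) {
            throw new IllegalStateException("input not encodable", e);
        }
    }

    public static void main(String[] args) {
        // One BMP character plus one supplementary character (a surrogate pair).
        String s = "A" + new String(Character.toChars(0x1F600));
        CharsetEncoder[] encoders = {
            StandardCharsets.UTF_8.newEncoder(),
            StandardCharsets.UTF_16.newEncoder()   // writes a byte-order mark
        };
        for (CharsetEncoder enc : encoders) {
            System.out.println(enc.charset() + ": actual=" + encodedBytes(enc, s)
                    + " worstCase=" + worstCaseBytes(enc, s));
        }
    }
}
```

The bound is deliberately loose; the point is only that it is computed
per char, so a surrogate pair counts as two input units.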

