StandardCharset vs. StandardCharsets

Sat May 7 18:33:06 UTC 2011

Hi Ulf,
the javadoc doesn't say explicitly that the result of 
charset.newDecoder() will be used,
so I don't see the point.

I even read the last sentence:
   "The CharsetDecoder 
<http://download.java.net/jdk7/docs/api/java/nio/charset/CharsetDecoder.html> 
class should be used when more control over the decoding process is 
required."
as a way of saying that it's OK to reuse a previously existing decoder.
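
As a rough illustration of what reusing a decoder could look like, here is 
a minimal sketch. The class name ReusedDecoder and the helper decode() are 
made up for this example, and it assumes a JDK recent enough to have 
ThreadLocal.withInitial. It relies on the documented behaviour that 
CharsetDecoder.decode(ByteBuffer) resets the decoder before each complete 
decoding operation.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class ReusedDecoder {
    // Hypothetical cache: one decoder per thread, because a CharsetDecoder
    // is not safe for concurrent use by multiple threads.
    private static final ThreadLocal<CharsetDecoder> UTF8 =
            ThreadLocal.withInitial(() -> StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPLACE)
                    .onUnmappableCharacter(CodingErrorAction.REPLACE));

    static String decode(byte[] bytes) {
        try {
            // decode(ByteBuffer) performs a full operation: reset, decode, flush.
            return UTF8.get().decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Not expected with REPLACE actions; wrapped to keep the sketch simple.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "R\u00e9mi".getBytes(StandardCharsets.UTF_8);
        // Both calls go through the same cached decoder instance.
        System.out.println(decode(data));
        System.out.println(decode(data));
    }
}
```

Whether String itself could cache a decoder this way is exactly what is 
disputed further down, so this only shows the caller-side pattern.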

Rémi

On 05/07/2011 07:55 PM, Ulf Zibis wrote:
> Rémi, thanks for your feedback.
>
> On 07.05.2011 18:00, Rémi Forax wrote:
>> On 05/07/2011 02:17 PM, Ulf Zibis wrote:
>>> Hi all,
>>>
>>> please excuse that I still have trouble accepting this additional 
>>> class, but +1 for the plural name.
>>>
>>> If those charset constants are there, people _will use_ them without 
>>> regard for the existing _performance disadvantages_.
>>> A typical use case would be: String.getBytes(...)
>>> On small strings there is a performance loss of up to 25 % using the 
>>> charset variant vs. the charset name variant. See:
>>> http://cr.openjdk.java.net/~sherman/7040220/client
>>> http://markmail.org/message/2tbas5skgkve52mz
>>> http://markmail.org/thread/lnrozcbnpcl5kmzs
>>>
>>> So I still think we should have the standard charset names as 
>>> constants in class j.n.c.Charset:
>>>     public static final String UTF_8 = "UTF-8";  etc... 
>>
>> Using objects instead of string is a better design.
>
> I agree 50 %.
> 100 % would be to have:
>     byte[] String.getBytes(CharsetEncoder encoder)
>     String(byte[] bytes, CharsetDecoder decoder)
> So for convenience we would then have to introduce constants for 
> CharsetDecoders and CharsetEncoders in j.n.c.StandardCharsets, which 
> would result in 12 additional classes being loaded and instantiated 
> at once, even if only one of them is actually used.
>
> But anyway, it would be better to have the canonical names of the 
> standard charsets declared in 1 place, not in 3 (Charset, 
> j.n.c.StandardCharsets, s.n.c.StandardCharsets)
>
>> I see the fact that the String method variants that take a Charset 
>> are slower than the ones that take a String
>> as a performance bug, not as a design issue.
>>
>> The String method that takes a Charset should reuse the class-local 
>> decoder
>> and the performance problem will go away.
>> (The analysis in StringCoding.decode(Charset, ...) (point 1) forgets 
>> that initializing a decoder also has a cost)
>
> Unfortunately this is not possible.
> See the following discussion (my last post from 26.03.2009 - 00:52 CET; 
> unfortunately this was a private conversation):
>
>
> On 19.03.2009 20:02, Xueming Shen wrote:
>> Ulf Zibis wrote:
>>>
>>> Isn't there any way to avoid instantiating a new ..Array-X-coder 
>>> for each invocation of StringCoding.x-code(Charset cs, ...)?
>>> Method x-code(byte/char[]) seems to be threadsafe if the replacement 
>>> isn't changed, so I suppose we could cache the ..Array-X-coder.
>>>
>> no. an "external" charset can do whatever it likes; it might still be 
>> the same "object", and the de/encoder it "creates" might still be the 
>> same "object", or look like the same object you might have cached, 
>> but do a totally different thing.
>
>
> At first glance a user might think that String#getBytes(byte[] buf, 
> Charset cs) is faster than String#getBytes(byte[] buf, String 
> csn), because he assumes that the Charset would be created internally 
> from csn.
> As this is only true for the first call, there should be a *note* in 
> the JavaDoc about the relative cost of those methods. Don't forget 
> the (byte[] ...) constructor's JavaDoc too.
>
> Secondly, I think that ASCII and ISO-8859-1 account for a high 
> percentage here, especially for CORBA applications, so why not have a 
> fast shortcut in class String that does not internally use a 
> Charset-X-coder, like getASCIIbytes() + getISO_8859_1Bytes(), or more 
> general and sophisticated:
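>    // mask: 0x7F for ASCII, (byte)0xFF for ISO-8859-1
>    int getBytes(byte[] buf, byte mask) {
>        int m = mask & 0xFF; // undo sign extension of the byte mask
>        int j = 0;
>        for (int i = 0; i < values.length; i++, j++) {
>            if ((values[i] | m) == m) { // char fits the mask: copy it
>                buf[j] = (byte)values[i];
>                continue;
>            }
>            if (Character.isHighSurrogate(values[i]))
>                i++; // skip the low surrogate of the pair
>            buf[j] = '?'; // or default replacement
>        }
>        return j;
>    }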
>
> -Ulf
>
>
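
The mask-based getBytes(byte[] buf, byte mask) idea quoted above is written 
as a method inside String, so it cannot be run on its own. The following is 
a standalone sketch of the same technique under a few assumptions: the class 
name MaskEncoder and the constants ASCII and ISO_8859_1 are invented for 
illustration, the mask is an int to sidestep byte sign extension, and the 
input is passed as a CharSequence instead of reading String's internal 
array. The caller is assumed to supply a buffer of at least s.length() bytes.

```java
public class MaskEncoder {
    // Hypothetical mask constants: every bit a char may use in the target charset.
    static final int ASCII = 0x7F;
    static final int ISO_8859_1 = 0xFF;

    /** Writes one byte per code point of s into buf and returns the byte count. */
    static int getBytes(CharSequence s, byte[] buf, int mask) {
        int j = 0;
        for (int i = 0; i < s.length(); i++, j++) {
            char c = s.charAt(i);
            if ((c | mask) == mask) {   // no bits outside the mask: maps 1:1
                buf[j] = (byte) c;
                continue;
            }
            // A valid surrogate pair is a single code point, so it yields one '?'.
            if (Character.isHighSurrogate(c) && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                i++;
            }
            buf[j] = '?';               // fixed replacement byte
        }
        return j;
    }

    public static void main(String[] args) {
        String s = "a\u00e9\uD83D\uDE00b";
        byte[] buf = new byte[s.length()];
        int n = getBytes(s, buf, ISO_8859_1);
        System.out.println(n + " bytes written");
    }
}
```

This only covers charsets whose code points coincide with the low Unicode 
range; anything else still needs a real CharsetEncoder.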