StandardCharset vs. StandardCharsets

Ulf Zibis Ulf.Zibis at gmx.de
Sat May 7 17:55:07 UTC 2011


Rémi, thanks for your feedback.

On 07.05.2011 18:00, Rémi Forax wrote:
> On 05/07/2011 02:17 PM, Ulf Zibis wrote:
>> Hi all,
>>
>> please excuse that I still have trouble accepting this additional class, but +1 for the plural
>> name.
>>
>> If those charset constants are there, people _will use_ them without regard to the existing
>> _performance disadvantages_.
>> A typical use case would be: String.getBytes(...)
>> On small strings there is a performance loss of up to 25 % using the charset variant vs. the charset
>> name variant. See:
>> http://cr.openjdk.java.net/~sherman/7040220/client
>> http://markmail.org/message/2tbas5skgkve52mz
>> http://markmail.org/thread/lnrozcbnpcl5kmzs
>>
>> So I still think we should have the standard charset names as constants in class j.n.c.Charset:
>>     public static final String UTF_8 = "UTF-8";  etc... 
>
> Using objects instead of strings is a better design.

I agree 50 %.
100 % would be to have:
     byte[] String.getBytes(CharsetEncoder encoder)
     String(byte[] bytes, CharsetDecoder decoder)
For convenience we would then also have to introduce constants for the CharsetDecoders and
CharsetEncoders in j.n.c.StandardCharsets, which would result in 12 additional classes being loaded
and instantiated at once as soon as any one of them is used.
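
For illustration, the two signatures above can be approximated with the existing java.nio.charset
API. This is only a sketch: the class CoderSketch and the helpers getBytes and newString are
hypothetical names, not part of the JDK, and wrapping the checked CharacterCodingException in an
IllegalArgumentException is an assumption made to keep the example short.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;

final class CoderSketch {
    // Hypothetical stand-in for the proposed String.getBytes(CharsetEncoder).
    static byte[] getBytes(String s, CharsetEncoder encoder) {
        try {
            // encode(CharBuffer) resets the encoder and encodes the whole input.
            ByteBuffer bb = encoder.encode(CharBuffer.wrap(s));
            byte[] out = new byte[bb.remaining()];
            bb.get(out);
            return out;
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Hypothetical stand-in for the proposed String(byte[], CharsetDecoder).
    static String newString(byte[] bytes, CharsetDecoder decoder) {
        try {
            // decode(ByteBuffer) resets the decoder and decodes the whole input.
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        // Coders are not thread-safe; here each is created once and reused by one thread.
        CharsetEncoder enc = utf8.newEncoder();
        CharsetDecoder dec = utf8.newDecoder();
        byte[] bytes = getBytes("h\u00e9llo", enc);
        System.out.println(bytes.length + " bytes, round trip: " + newString(bytes, dec));
    }
}
```

With such signatures the caller, not String, owns the encoder or decoder and decides how long to
keep it.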

But anyway, it would be better to have the canonical names of the standard charsets declared in one
place, not in three (Charset, j.n.c.StandardCharsets, s.n.c.StandardCharsets).

> I see the fact that the String method variants that take a Charset are slower than the ones that
> use a String
> as a performance bug, not as a design issue.
>
> The String method that takes a Charset should reuse the class-local decoder
> and the performance problem will go away.
> (The analysis in StringCoding.decode(Charset, ...) (point 1) forgets that initializing a decoder
> also has a cost)

Unfortunately this is not possible.
See the following discussion (my last post from 26.03.2009, 00:52 CET; unfortunately this was a private
conversation):


On 19.03.2009 20:02, Xueming Shen wrote:
> Ulf Zibis wrote:
>>
>> Isn't there any way to avoid instantiating a new ..Array-X-coder for each invocation of
>> StringCoding.x-code(Charset cs, ...)?
>> Method x-code(byte/char[]) seems to be thread-safe if the replacement isn't changed, so I suppose we
>> could cache the ..Array-X-coder.
>>
> No. An "external" charset can do whatever it likes. It might still be the same "object", and the
> de/encoder it "creates" might
> still be the same "object", or look like the same object you might have cached, but do a totally
> different thing.


At first glance, a user might think that String#getBytes(byte[] buf, Charset cs) is faster
than String#getBytes(byte[] buf, String csn), on the assumption that the Charset would be
created internally from csn.
As this is only true for the first call, there should be a *note* in the JavaDoc about the relative
cost of those methods. Don't forget the JavaDoc of the (byte[] ...) constructors too.
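
For reference, the two variants being compared can be put side by side as follows. This is a
minimal sketch using the overloads that exist on String today, getBytes(String) and
getBytes(Charset); the class GetBytesVariants and the constant UTF_8_NAME are hypothetical and
only stand in for a name constant such as the one proposed earlier for j.n.c.Charset. It shows
the API shape only and makes no claim about the relative speed.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.util.Arrays;

final class GetBytesVariants {
    // Hypothetical name constant (not a JDK field).
    static final String UTF_8_NAME = "UTF-8";

    // Charset name variant: declares a checked exception for unknown names.
    static byte[] byName(String s) {
        try {
            return s.getBytes(UTF_8_NAME);
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is one of the charsets every Java platform must support.
            throw new AssertionError(e);
        }
    }

    // Charset object variant: no checked exception, the Charset is looked up by the caller.
    static byte[] byCharset(String s) {
        return s.getBytes(Charset.forName(UTF_8_NAME));
    }

    public static void main(String[] args) {
        String s = "h\u00e9llo";
        System.out.println(Arrays.equals(byName(s), byCharset(s)));
    }
}
```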

Secondly, I think that ASCII and ISO-8859-1 account for a high percentage of uses here, especially
in CORBA applications, so why not have a fast shortcut in class String that does not internally use a
Charset-X-coder, like getASCIIbytes() + getISO_8859_1Bytes(), or something more general and sophisticated:
    // mask: 0x7F for ASCII, 0xFF for ISO-8859-1 (int, as a byte 0xFF would sign-extend to -1)
    int getBytes(byte[] buf, int mask) {
        int j = 0;
        for (int i = 0; i < values.length; i++, j++) {
            if ((values[i] | mask) == mask) { // no bits outside the mask: copy as is
                buf[j] = (byte) values[i];
                continue;
            }
            if (Character.isHighSurrogate(values[i]))
                i++; // skip the low surrogate: one replacement per code point
            buf[j] = '?'; // or default replacement
        }
        return j;
    }
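
The same idea can be tried outside of class String. The following is a sketch under a few
assumptions: the class name MaskEncoder is hypothetical, the method takes the String as a
parameter instead of reading its internal char array, the mask is an int (0x7F for ASCII, 0xFF
for ISO-8859-1), and buf is assumed to hold at least s.length() bytes.

```java
final class MaskEncoder {
    /**
     * Copies every char whose bits all lie inside mask into buf as a single byte and
     * writes '?' for everything else. Returns the number of bytes written.
     */
    static int getBytes(String s, byte[] buf, int mask) {
        int j = 0;
        for (int i = 0; i < s.length(); i++, j++) {
            char c = s.charAt(i);
            if ((c | mask) == mask) {       // no bits outside the mask: copy as is
                buf[j] = (byte) c;
                continue;
            }
            // A valid surrogate pair is one code point, so it gets one replacement byte.
            if (Character.isHighSurrogate(c) && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                i++;
            }
            buf[j] = '?';                   // or a configurable replacement
        }
        return j;
    }

    public static void main(String[] args) {
        String s = "caf\u00e9";
        byte[] buf = new byte[s.length()];
        int n = getBytes(s, buf, 0x7F);
        System.out.println(n + " bytes, last = " + (char) buf[n - 1]);
    }
}
```

The mask test only works for charsets whose code points map one to one onto the low char values,
which is the case for ASCII and ISO-8859-1.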

-Ulf




