StandardCharset vs. StandardCharsets
Rémi Forax
forax at univ-mlv.fr
Sat May 7 18:33:06 UTC 2011
Hi Ulf,
the javadoc doesn't say explicitly that the result of
charset.newDecoder() will be used,
so I don't see the point.
I even read the last sentence:
"The CharsetDecoder class
<http://download.java.net/jdk7/docs/api/java/nio/charset/CharsetDecoder.html>
should be used when more control over the decoding process is
required."
as a way to say that it's OK to reuse a previously existing decoder.
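
A minimal sketch of what reusing a previously existing decoder could look like, assuming a per-thread cache. The class name ReusedDecoder and its decode helper are hypothetical and not part of the JDK; only the java.nio.charset calls are real API. It relies on the documented behaviour that CharsetDecoder.decode(ByteBuffer) resets the decoder before each complete decoding operation.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper (not JDK code): keeps one UTF-8 decoder per thread
// instead of calling charset.newDecoder() on every invocation.
final class ReusedDecoder {
    private static final ThreadLocal<CharsetDecoder> UTF8 =
        new ThreadLocal<CharsetDecoder>() {
            @Override
            protected CharsetDecoder initialValue() {
                return Charset.forName("UTF-8").newDecoder()
                        .onMalformedInput(CodingErrorAction.REPLACE)
                        .onUnmappableCharacter(CodingErrorAction.REPLACE);
            }
        };

    static String decode(byte[] bytes) {
        try {
            // decode(ByteBuffer) performs a full operation: reset, decode, flush.
            // That makes the cached instance safe to reuse on the same thread.
            return UTF8.get().decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Not expected with REPLACE actions; wrapped to keep the sketch simple.
            throw new IllegalStateException(e);
        }
    }
}
```

CharsetDecoder instances are not safe for concurrent use, which is why the sketch caches per thread rather than in a shared static field.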
Rémi
On 05/07/2011 07:55 PM, Ulf Zibis wrote:
> Rémi, thanks for your feedback.
>
> On 07.05.2011 18:00, Rémi Forax wrote:
>> On 05/07/2011 02:17 PM, Ulf Zibis wrote:
>>> Hi all,
>>>
>>> please excuse that I still have trouble accepting this additional
>>> class, but +1 for the plural name.
>>>
>>> If those charset constants are there, people _will use_ them without
>>> regard for the existing _performance disadvantages_.
>>> A typical use case would be: String.getBytes(...)
>>> On small strings there is a performance loss of up to 25 % using the
>>> charset variant vs. the charset name variant. See:
>>> http://cr.openjdk.java.net/~sherman/7040220/client
>>> http://markmail.org/message/2tbas5skgkve52mz
>>> http://markmail.org/thread/lnrozcbnpcl5kmzs
>>>
>>> So I still think we should have the standard charset names as
>>> constants in class j.n.c.Charset:
>>> public static final String UTF_8 = "UTF-8"; etc...
>>
>> Using objects instead of strings is a better design.
>
> I agree 50 %.
> 100 % would be to have:
> byte[] String.getBytes(CharsetEncoder encoder)
> String(byte[] bytes, CharsetDecoder decoder)
> Consequently, for convenience we would have to introduce constants for
> CharsetDecoders and CharsetEncoders in j.n.c.StandardCharsets, which
> would result in 12 additional classes being loaded and instantiated at
> once, even if only one of them is actually used.
>
> But anyway, it would be better to have the canonical names of the
> standard charsets declared in 1 place, not in 3 (Charset,
> j.n.c.StandardCharsets, s.n.c.StandardCharsets)
>
>> I see the fact that the String method variants that take a Charset
>> are slower than the ones that take a String
>> as a performance bug, not as a design issue.
>>
>> The String method that takes a Charset should reuse the class-local
>> decoder
>> and the performance problem will go away.
>> (The analysis in StringCoding.decode(Charset, ...) (point 1) forgets
>> that initializing a decoder also has a cost)
>
> Unfortunately this is not possible.
> See the following discussion (my last post from 26.03.2009 - 00:52 CET,
> unfortunately this was a private conversation):
>
>
> On 19.03.2009 20:02, Xueming Shen wrote:
>> Ulf Zibis wrote:
>>>
>>> Isn't there any way even to avoid instantiating new ..Array-X-coder
>>> for each invocation of StringCoding.x-code(Charset cs, ...)?
>>> Method x-code(byte/char[]) seems to be threadsafe, if replacement
>>> isn't changed, so I suppose, we could cache the ..Array-X-coder.
>>>
>> no. an "external" charset can do whatever it likes, it might be still
>> the same "object", the de/encoder it "creates" might
>> be still the same "object" or look like the same object you might
>> have cached, but do a totally different thing.
>
>
> At first, a user might assume that String#getBytes(byte[] buf,
> Charset cs) is faster than String#getBytes(byte[] buf, String
> csn), because he assumes that the Charset would be created internally
> from csn.
> As this is only true for the first call, there should be a *note* in
> the JavaDoc about the relative cost of those methods. Don't forget
> the (byte[] ...) constructors' JavaDoc too.
>
> Secondly, I think that ASCII and ISO-8859-1 account for a high share
> here, especially for CORBA applications, so why not have a fast shortcut
> in class String without internally using Charset-X-coder, like
> getASCIIbytes() + getISO_8859_1Bytes(), or more general and
> sophisticated:
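> int getBytes(byte[] buf, byte mask) {
>     int m = mask & 0xFF;  // avoid sign extension, e.g. for mask 0xFF
>     int j = 0;
>     for (int i = 0; i < values.length; i++, j++) {
>         if ((values[i] | m) == m) {
>             buf[j] = (byte) values[i];  // char fits within the mask
>             continue;
>         }
>         if (Character.isHighSurrogate(values[i]))
>             i++;  // skip the second half of the surrogate pair
>         buf[j] = '?';  // or default replacement
>     }
>     return j;
> }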
>
> -Ulf
>
>
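
For reference, a standalone sketch of the mask-based shortcut quoted above, under a few assumptions: the class name MaskEncoder is hypothetical, String's internal char array is replaced by a CharSequence parameter, and the mask is an int (0x7F for ASCII, 0xFF for ISO-8859-1) so that 0xFF is not sign-extended. The caller is assumed to supply a buffer of at least src.length() bytes.

```java
// Hypothetical standalone version of the proposal; not JDK code.
final class MaskEncoder {
    /**
     * Copies chars that fit within the mask into buf and writes '?' for
     * everything else. A surrogate pair yields a single '?'.
     * Returns the number of bytes written.
     */
    static int getBytes(CharSequence src, byte[] buf, int mask) {
        int j = 0;
        for (int i = 0; i < src.length(); i++, j++) {
            char c = src.charAt(i);
            if ((c | mask) == mask) {
                buf[j] = (byte) c;  // representable: copy the low byte
                continue;
            }
            // Consume the low surrogate too, so one code point gives one '?'.
            if (Character.isHighSurrogate(c) && i + 1 < src.length()
                    && Character.isLowSurrogate(src.charAt(i + 1))) {
                i++;
            }
            buf[j] = '?';  // or a configurable default replacement
        }
        return j;
    }
}
```

With mask 0x7F the character U+00E9 is replaced by '?', while with mask 0xFF it is written as the single byte 0xE9.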