StandardCharset vs. StandardCharsets

Sun May 8 13:09:23 UTC 2011

On 05/07/2011 08:54 PM, Xueming Shen wrote:
> On 05-07-2011 上午 9:00, Rémi Forax wrote:
>> On 05/07/2011 02:17 PM, Ulf Zibis wrote:
>>> Hi all,
>>>
>>> please excuse, that I have still problems to accept this additional 
>>> class, but +1 for the plural name.
>>>
>>> If those charset constants are there, people _will use_ them without 
>>> respect on the existing _performance disadvantages_.
>>> A common typical use case should be: String.getBytes(...)
>>> On small strings there is a performance lost up to 25 % using the 
>>> charset variant vs. the charset name variant. See:
>>> http://cr.openjdk.java.net/~sherman/7040220/client
>>> http://markmail.org/message/2tbas5skgkve52mz
>>> http://markmail.org/thread/lnrozcbnpcl5kmzs
>>>
>>> So I still think, we should have the standard charset names as 
>>> constants in class j.n.c.Charset:
>>> public static final String UTF_8 = "UTF-8"; etc... 
>>
>> Using objects instead of string is a better design.
>> I see the fact that the String method variants that takes a Charset 
>> are slower that the ones that use a String
>> as a performance bug, not as a design issue.
>>
>> The String method that takes a Charset should reuse the class-local 
>> decoder
>> and the performance problem will go away.
>> (The analysis in StringCoding.decode(Charset, ...) (point 1) forget 
>> that initializing a decoder has also a cost)
>
> I do know the "slowness" is from initializing cs.newDe/Encoder():-) 
> But it is just not "easy" to cache
> the de/encoder in this case. There is no guarantee that the cs passed 
> in this time is the same one
> you had last time, even the name might be the same.

I agree. You can load a new charset this will override an existing one.

> Or even the cs this time is indeed the same
> instance you had last time (you did the cache), there is no guarantee 
> the dec/enc returned from
> newDecoder()/Encoder() this time will be the same one in your cache, 
> until you invoke the
> newDecoder()/Encoder() , 

Here, you don't care if it's the same encoder/decoder.
Any decoder/encoder will be valid, if the charset is the same (==) as 
the thread-local one.
The spec allows to reuse encoder/decoder, so it's valid to reuse one 
created from the same
charset. As I said, the javadoc offers no guarantee that we should 
always call newEncoder/newDecoder
and even says that if you want control over the encoder/decoder you 
should call newEncoder/newDecoder
yourself.

> get the enc/dec and compare to the on in your cache, but then why cache
> it:-) Something you can do is to do the cache if the cs passed in is 
> indeed the one from our own
> charset repository (can be trusted that would not do something 
> tricky), you can do this by invoking
> getClassLoader0() == null, which is expensive, I kinda remember the 
> measure showed this might
> not be something worthing doing last time when I was there. 

Yes, weirdly, getClassLoader() is not a VM intrinsic even if by example 
getSuperClass() which is
far more complex than getClassLoader() is an intrinsic.
It should be a good idea to open a bug to create an intrinsic for 
getClassLoader,
checking if a classloader of a class is null or not is a common pattern.

> Sure, those charsets in
> StandardChsets can be treated specially, if desirable, probably only 
> the ascii, iso8859-1 and utf8,
> such as
>
> if (cs == StandardCharsets.UTF_8 || cs == StandardCharsets.US_ASCII...) {
> ...
> }

I think it's better to reuse the caching mechanism used if the charset 
is a String.

>
> Will do some measurement later to see if to separate the "else" part 
> in a side method will speed
> up a little, we can do that if the inline does help, but not for 7:-)

Yes, not for 7, but 8 is around the corner.

>
> Thanks!
> -Sherman

cheers,
Rémi