RFR [10]: 8186517: sun.nio.cs.StandardCharsets$Aliases and ClassMap can be lazily loaded

Mon Aug 21 22:14:10 UTC 2017

I think the internal representation and handling in sun...StandardCharset
has little effect on the performance of the public method to get at
charsets, i.e., Charset.forName, since we're relying on a two-entry cache
to give us the recently resolved charset as quickly as possible.
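
To make the cache behaviour concrete, here is a minimal sketch of a
two-entry, most-recently-used cache keyed on the exact spelling of the
name. It is a hypothetical model, not the JDK's Charset code: the class
name, fields and the slowLookups counter are assumptions made for
illustration, and Charset.forName merely stands in for the slow path.

```java
import java.nio.charset.Charset;

// Hypothetical model of a two-entry, most-recently-used lookup cache.
// It is NOT the JDK's Charset implementation; keys must match exactly.
class TwoEntryCacheSketch {
    private String name1, name2;   // cached lookup keys, most recent first
    private Charset cs1, cs2;      // charsets resolved for those keys
    int slowLookups;               // how often the cache did not help

    Charset forName(String name) {
        if (name.equals(name1)) {
            return cs1;                       // hit in the first slot
        }
        if (name.equals(name2)) {
            // Hit in the second slot: swap so this entry becomes most recent.
            String n = name1; Charset c = cs1;
            name1 = name2;    cs1 = cs2;
            name2 = n;        cs2 = c;
            return cs1;
        }
        // Miss: resolve the slow way, then evict the older entry.
        slowLookups++;
        Charset cs = Charset.forName(name);   // stand-in for the full lookup
        name2 = name1; cs2 = cs1;
        name1 = name;  cs1 = cs;
        return cs;
    }

    // Replays the given spellings in order for a number of rounds and
    // reports how many lookups missed the cache.
    static int countSlowLookups(int rounds, String... names) {
        TwoEntryCacheSketch cache = new TwoEntryCacheSketch();
        for (int i = 0; i < rounds; i++) {
            for (String name : names) {
                cache.forName(name);
            }
        }
        return cache.slowLookups;
    }

    public static void main(String[] args) {
        // Two spellings in rotation: only the first use of each misses.
        System.out.println(countSlowLookups(1000, "utf-8", "utf-8", "UTF-8"));
        // Three spellings in rotation: every lookup evicts the next one needed.
        System.out.println(countSlowLookups(1000, "utf-8", "Utf-8", "UTF-8"));
    }
}
```

In this model two spellings in rotation miss only twice in total, while
three spellings in rotation miss on every single lookup.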

There might be some issues at that level that would hurt applications
that are *inconsistent* in how they spell charset names, or that use more
than two charsets, though:

     import java.nio.charset.Charset;
     import org.openjdk.jmh.annotations.Benchmark;
     import org.openjdk.jmh.infra.Blackhole;

     public class CharsetMicro {

         // Two distinct spellings: both fit in the two-entry cache.
         @Benchmark
         public void getCharsetMixedSome(Blackhole bh) {
             bh.consume(Charset.forName("utf-8"));
             bh.consume(Charset.forName("utf-8"));
             bh.consume(Charset.forName("UTF-8"));
         }

         // Three distinct spellings: one more than the cache holds.
         @Benchmark
         public void getCharsetMixedFull(Blackhole bh) {
             bh.consume(Charset.forName("utf-8"));
             bh.consume(Charset.forName("Utf-8"));
             bh.consume(Charset.forName("UTF-8"));
         }

         // One spelling, canonical upper case.
         @Benchmark
         public void getCharsetSameUpper(Blackhole bh) {
             bh.consume(Charset.forName("UTF-8"));
             bh.consume(Charset.forName("UTF-8"));
             bh.consume(Charset.forName("UTF-8"));
         }

         // One spelling, lower case.
         @Benchmark
         public void getCharsetSameLower(Blackhole bh) {
             bh.consume(Charset.forName("utf-8"));
             bh.consume(Charset.forName("utf-8"));
             bh.consume(Charset.forName("utf-8"));
         }
     }

Benchmark                          Mode  Cnt   Score   Error   Units
CharsetMicro.getCharsetMixedFull  thrpt    4   4.391 ± 0.457  ops/us
CharsetMicro.getCharsetMixedSome  thrpt    4  36.202 ± 0.324  ops/us
CharsetMicro.getCharsetSameLower  thrpt    4  89.987 ± 1.667  ops/us
CharsetMicro.getCharsetSameUpper  thrpt    4  89.957 ± 4.962  ops/us

Using equalsIgnoreCase for the cache comparison would definitely help the
worst cases above, but might incur a penalty on well-behaved, consistent
code.
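
As a rough illustration of that trade-off, the sketch below models a
two-entry cache whose keys are compared with String.equalsIgnoreCase.
It is a hypothetical model, not a proposed patch: the class and counter
names are assumptions, and Charset.forName stands in for the slow path.

```java
import java.nio.charset.Charset;

// Hypothetical two-entry cache whose keys match case-insensitively.
// Not the JDK's code; it only illustrates the effect on cache hits.
class IgnoreCaseCacheSketch {
    private String name1, name2;   // cached lookup keys, most recent first
    private Charset cs1, cs2;      // charsets resolved for those keys
    int slowLookups;               // how often the cache did not help

    Charset forName(String name) {
        if (name.equalsIgnoreCase(name1)) {
            return cs1;                       // any casing hits the first slot
        }
        if (name.equalsIgnoreCase(name2)) {
            // Hit in the second slot: swap so this entry becomes most recent.
            String n = name1; Charset c = cs1;
            name1 = name2;    cs1 = cs2;
            name2 = n;        cs2 = c;
            return cs1;
        }
        // Miss: resolve the slow way, then evict the older entry.
        slowLookups++;
        Charset cs = Charset.forName(name);   // stand-in for the full lookup
        name2 = name1; cs2 = cs1;
        name1 = name;  cs1 = cs;
        return cs;
    }

    // Replays the given names in order and reports the number of misses.
    static int countSlowLookups(int rounds, String... names) {
        IgnoreCaseCacheSketch cache = new IgnoreCaseCacheSketch();
        for (int i = 0; i < rounds; i++) {
            for (String name : names) {
                cache.forName(name);
            }
        }
        return cache.slowLookups;
    }

    public static void main(String[] args) {
        // Three spellings of one charset now share a single cache entry.
        System.out.println(countSlowLookups(1000, "utf-8", "Utf-8", "UTF-8"));
        // Three different charsets still overflow a two-entry cache.
        System.out.println(countSlowLookups(1000, "UTF-8", "ISO-8859-1", "US-ASCII"));
    }
}
```

The model only counts misses. It says nothing about the per-call cost of
the case-insensitive comparison on the hit path, which is the penalty in
question and would need a benchmark to quantify.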

TL;DR: Consistently use Charset.forName("UTF-8"), lest ye be judged!
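
For code that only needs UTF-8, one way to be consistent is to skip the
name lookup altogether and use the UTF_8 constant in
java.nio.charset.StandardCharsets (the public class, not the sun.nio.cs
one under review). This is a minimal sketch; the class and method names
are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper showing the constant used in place of a name lookup.
class Utf8ConstantSketch {
    // Encodes and decodes with the constant, so no charset name is parsed.
    static String roundTrip(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("h\u00e9llo"));
        // A lookup by name, in any casing, resolves to an equal Charset.
        System.out.println(Charset.forName("utf-8").equals(StandardCharsets.UTF_8));
    }
}
```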

/Claes

On 2017-08-21 22:57, Martin Buchholz wrote:
> RFC https://tools.ietf.org/html/rfc3629 and Wikipedia
> https://en.wikipedia.org/wiki/UTF-8 agree that the name is "UTF-8".
>
> On Mon, Aug 21, 2017 at 1:24 PM, Xueming Shen
> <xueming.shen at oracle.com> wrote:
>
>     On 8/21/17, 1:07 PM, Martin Buchholz wrote:
>>     xUEMING, we should assume by default that people will use proper
>>     names, especially in real code, where sloppiness should be left
>>     behind.  The real name of the UTF-8 charset is "UTF-8".  Reward
>>     careful coders; punish sloppy ones.
>>
>
>     both our spec and iana spec say the charset name is NOT
>     "case-sensitive", so strictly speaking it's not "sloppy" to use
>     lowercase for the charset name.
>
>     but i have to admit it does look sloppy to spell the name in this
>     case combination :-)
>
>
>
>>     On Mon, Aug 21, 2017 at 12:45 PM, Xueming Shen
>>     <xueming.shen at oracle.com> wrote:
>>
>>         On 8/21/17, 12:04 PM, Martin Buchholz wrote:
>>>         OK, but ...
>>>
>>>         I'd like to see further improvements here later, like
>>>         switching to upper case.
>>
>>         what's the benefit of switching to upper case? i would assume
>>         the original assumption is that people tend to use lower case
>>         charset names in their code, in that case (if the assumption
>>         is correct) the "toLower()" then needs to do nothing.
>>
>>         the aliases and classes mappings are generated at build time,
>>         so it does not matter whether it's lowercase or uppercase
>>
>>
>>>
>>>         I just realized we have
>>>         java/nio/charset/StandardCharsets.java
>>>         sun/nio/cs/StandardCharsets.java
>>>
>>>         and they both have a UTF_8 field !
>>>
>>>
>>>
>>>         On Mon, Aug 21, 2017 at 11:53 AM, Claes Redestad
>>>         <claes.redestad at oracle.com> wrote:
>>>
>>>
>>>             On 2017-08-21 20:05, Martin Buchholz wrote:
>>>
>>>                 I agree we should optimize for common charset names,
>>>                 in part to help the world move to UTF-8.
>>>
>>>
>>>             Agreed.
>>>
>>>
>>>                 It's *weird* to canonicalize to lower case, when the
>>>                 canonical charset names are all uppercase ("UTF-8"
>>>                 instead of "utf-8").
>>>
>>>
>>>             A pre-existing weirdness, and it goes deep enough that I
>>>             haven't dared changing it.
>>>
>>>
>>>                 ---
>>>                    62     public static final String UTF_8 = "UTF-8";
>>>                 Is this still used?
>>>
>>>                 Maybe the very first thing lookup() should do is check
>>>                 charsetName == UTF_8
>>>
>>>
>>>             Subsequent lookups are very likely to hit the
>>>             two-element cache in
>>>             Charset, so I've not seen this add up.
>>>
>>>
>>>                 ---
>>>
>>>                 Is switching from char[] to StringBuilder really an
>>>                 improvement?  Charset names are all short, so the
>>>                 cost of copying the char[] to a byte[] is negligible.
>>>
>>>
>>>             This allows us to not load and touch the code to deflate
>>>             a char[] to a byte[] (StringUTF16), so a tiny, tiny
>>>             startup win. Throughput-wise it's likely no different.
>>>
>>>             /Claes
>>>
>>>
>>>
>>>                 On Mon, Aug 21, 2017 at 6:46 AM, Claes Redestad
>>>                 <claes.redestad at oracle.com> wrote:
>>>
>>>                     Hi,
>>>
>>>                     the Aliases and Classes inner classes in
>>>                 StandardCharsets can be
>>>                     lazily-loaded by restructuring how we check for
>>>                 the three
>>>                     default-loaded charsets. This removes some
>>>                 classloading and
>>>                     work from happening during critical phases of
>>>                 the VM startup,
>>>                     as well as a net gain on any systems that
>>>                 default to any of the
>>>                     three standard charsets (UTF-8, Latin-1, ASCII).
>>>
>>>                     Webrev:
>>>                 http://cr.openjdk.java.net/~redestad/8186517/jdk.00/
>>>                     Bug:
>>>                 https://bugs.openjdk.java.net/browse/JDK-8186517
>>>
>>>                     I'm not sure if the pre-existing optimization to
>>>                 allow
>>>                 StandardCharsets.charsets() unsynchronized access to
>>>                 internals
>>>                     is really necessary (or even 100% correct), but
>>>                 by ensuring we
>>>                     retrieve the Aliases and Classes instances in a
>>>                 synchronized block
>>>                     we should be no worse off semantically here.
>>>
>>>                     /Claes
>>>
>>>
>>>
>>>
>>
>>
>
>