RFR [10]: 8186517: sun.nio.cs.StandardCharsets$Aliases and ClassMap can be lazily loaded
Claes Redestad
claes.redestad at oracle.com
Mon Aug 21 22:14:10 UTC 2017
I think the internal representation and handling in
sun...StandardCharset has little effect on the performance of the
public method for getting at charsets, i.e., Charset.forName, since
we rely on a two-entry cache to return a recently resolved charset as
quickly as possible.
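
To make the cache behaviour concrete, here is a minimal sketch of a two-entry, most-recently-used cache keyed on the exact name string. This is an illustration only, not the actual code in java.nio.charset.Charset: the class name, fields and the slowLookups counter are made up, and Charset.forName merely stands in for whatever slow path the real implementation takes on a miss.

```java
import java.nio.charset.Charset;

// Hypothetical illustration of a two-entry cache; not the JDK's implementation.
final class TwoEntryCharsetCache {
    private String name1, name2;   // most recent and second most recent keys
    private Charset cs1, cs2;      // charsets resolved for those keys
    int slowLookups;               // counts misses, for illustration only

    synchronized Charset lookup(String name) {
        // The key comparison is exact, so "utf-8" and "UTF-8" are different keys.
        if (name.equals(name1)) {
            return cs1;
        }
        if (name.equals(name2)) {
            // Hit on the second entry: swap it into the first slot.
            String tmpName = name1;
            Charset tmpCs = cs1;
            name1 = name2;
            cs1 = cs2;
            name2 = tmpName;
            cs2 = tmpCs;
            return cs1;
        }
        // Miss: take the slow path (assumed stand-in) and evict the older entry.
        slowLookups++;
        Charset cs = Charset.forName(name);
        name2 = name1;
        cs2 = cs1;
        name1 = name;
        cs1 = cs;
        return cs;
    }
}
```

With a cache shaped like this, cycling through three spellings of the same charset misses on every call, while alternating between two spellings only misses the first time each one is seen.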
There might be some issues at that level that would hurt applications
that are *inconsistent* in how they spell charset names, or that use
more than two charsets, though:
import java.nio.charset.Charset;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.infra.Blackhole;

public class CharsetMicro {

    // Two distinct spellings of the same charset name.
    @Benchmark
    public void getCharsetMixedSome(Blackhole bh) {
        bh.consume(Charset.forName("utf-8"));
        bh.consume(Charset.forName("utf-8"));
        bh.consume(Charset.forName("UTF-8"));
    }

    // Three distinct spellings of the same charset name.
    @Benchmark
    public void getCharsetMixedFull(Blackhole bh) {
        bh.consume(Charset.forName("utf-8"));
        bh.consume(Charset.forName("Utf-8"));
        bh.consume(Charset.forName("UTF-8"));
    }

    // One spelling, upper case throughout.
    @Benchmark
    public void getCharsetSameUpper(Blackhole bh) {
        bh.consume(Charset.forName("UTF-8"));
        bh.consume(Charset.forName("UTF-8"));
        bh.consume(Charset.forName("UTF-8"));
    }

    // One spelling, lower case throughout.
    @Benchmark
    public void getCharsetSameLower(Blackhole bh) {
        bh.consume(Charset.forName("utf-8"));
        bh.consume(Charset.forName("utf-8"));
        bh.consume(Charset.forName("utf-8"));
    }
}
Benchmark                          Mode  Cnt   Score   Error   Units
CharsetMicro.getCharsetMixedFull  thrpt    4   4.391 ± 0.457  ops/us
CharsetMicro.getCharsetMixedSome  thrpt    4  36.202 ± 0.324  ops/us
CharsetMicro.getCharsetSameLower  thrpt    4  89.987 ± 1.667  ops/us
CharsetMicro.getCharsetSameUpper  thrpt    4  89.957 ± 4.962  ops/us
Using equalsIgnoreCase would definitely help the worst cases above, but
might incur a penalty for well-behaved, consistent code.
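
As a rough sketch of that trade-off (the helper class and method names below are hypothetical, not anything in the JDK), the two ways a cache could compare a cached name against a requested one look like this:

```java
// Hypothetical helper, not JDK code: two ways a cache could compare names.
final class CacheKeyMatch {
    // Exact comparison: differently cased spellings never match.
    static boolean exact(String cachedName, String requested) {
        return cachedName.equals(requested);
    }

    // Relaxed comparison: try the exact match first, then fall back to a
    // case-insensitive one so that "utf-8" can hit an entry stored as "UTF-8".
    static boolean relaxed(String cachedName, String requested) {
        return cachedName.equals(requested)
                || cachedName.equalsIgnoreCase(requested);
    }
}
```

The relaxed variant does extra work whenever the exact comparison fails, including for a consistent caller that is simply asking for a different charset, which is presumably where the penalty would come from.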
TL;DR: Consistently use Charset.forName("UTF-8"), lest ye be judged!
/Claes
On 2017-08-21 22:57, Martin Buchholz wrote:
> RFC https://tools.ietf.org/html/rfc3629 and Wikipedia
> https://en.wikipedia.org/wiki/UTF-8 agree that the name is "UTF-8".
>
> On Mon, Aug 21, 2017 at 1:24 PM, Xueming Shen
> <xueming.shen at oracle.com> wrote:
>
> On 8/21/17, 1:07 PM, Martin Buchholz wrote:
>> xUEMING, we should assume by default that people will use proper
>> names, especially in real code, where sloppiness should be left
>> behind. The real name of the UTF-8 charset is "UTF-8". Reward
>> careful coders; punish sloppy ones.
>>
>
> both our spec and iana spec say the charset name is NOT
> "case-sensitive", so strictly speaking it's not
> "sloppy" to use lowercase for the charset name.
>
> but i have to admit it does look sloppy to spell the name in this
> case combination :-)
>
>
>
>> On Mon, Aug 21, 2017 at 12:45 PM, Xueming Shen
>> <xueming.shen at oracle.com> wrote:
>>
>> On 8/21/17, 12:04 PM, Martin Buchholz wrote:
>>> OK, but ...
>>>
>>> I'd like to see further improvements here later, like
>>> switching to upper case.
>>
>> what's the benefit of switching to upper case? i would assume
>> the original assumption is that people tend to use lower case
>> charset names in their code, in that case (if the assumption is
>> correct) the "toLower()" then needs to do nothing.
>>
>> the aliases and classes mapping are generated during the
>> build time, so it does not matter whether it's lowercase or
>> uppercase
>>
>>
>>>
>>> I just realized we have
>>> java/nio/charset/StandardCharsets.java
>>> sun/nio/cs/StandardCharsets.java
>>>
>>> and they both have a UTF_8 field !
>>>
>>>
>>>
>>> On Mon, Aug 21, 2017 at 11:53 AM, Claes Redestad
>>> <claes.redestad at oracle.com> wrote:
>>>
>>>
>>> On 2017-08-21 20:05, Martin Buchholz wrote:
>>>
>>> I agree we should optimize for common charset names,
>>> in part to help the world move to UTF-8.
>>>
>>>
>>> Agreed.
>>>
>>>
>>> It's *weird* to canonicalize to lower case, when the
>>> canonical charset names are all uppercase ("UTF-8"
>>> instead of "utf-8").
>>>
>>>
>>> A pre-existing weirdness, and it goes deep enough that I
>>> haven't dared changing it.
>>>
>>>
>>> ---
>>> 62 public static final String UTF_8 = "UTF-8";
>>> Is this still used?
>>>
>>> Maybe the very first thing lookup() should do is check
>>> charsetName == UTF_8
>>>
>>>
>>> Subsequent lookups are very likely to hit the
>>> two-element cache in
>>> Charset, so I've not seen this add up.
>>>
>>>
>>> ---
>>>
>>> Is switching from char[] to StringBuilder really an
>>> improvement? Charset names are all short, so the
>>> cost of copying the char[] to a byte[] is negligible.
>>>
>>>
>>> This allows us to not load and touch the code to deflate
>>> a char[] to a byte[] (StringUTF16), so a tiny, tiny
>>> startup win. Throughput-wise it's likely no different.
>>>
>>> /Claes
>>>
>>>
>>>
>>> On Mon, Aug 21, 2017 at 6:46 AM, Claes Redestad
>>> <claes.redestad at oracle.com> wrote:
>>>
>>> Hi,
>>>
>>> the Aliases and Classes inner classes in
>>> StandardCharsets can be
>>> lazily-loaded by restructuring how we check for
>>> the three
>>> default-loaded charsets. This removes some
>>> classloading and
>>> work from happening during critical phases of
>>> the VM startup,
>>> as well as a net gain on any systems that
>>> default to any of the
>>> three standard charsets (UTF-8, Latin-1, ASCII).
>>>
>>> Webrev:
>>> http://cr.openjdk.java.net/~redestad/8186517/jdk.00/
>>> Bug:
>>> https://bugs.openjdk.java.net/browse/JDK-8186517
>>>
>>> I'm not sure if the pre-existing optimization to
>>> allow
>>> StandardCharsets.charsets() unsynchronized access to
>>> internals
>>> is really necessary (or even 100% correct), but
>>> by ensuring we
>>> retrieve the Aliases and Classes instances in a
>>> synchronized block
>>> we should be no worse off semantically here.
>>>
>>> /Claes
>>>
>>>
>>>
>>>
>>
>>
>
>