StandardCharset vs. StandardCharsets

Sat May 7 12:17:42 UTC 2011

Hi all,

please excuse, that I have still problems to accept this additional class, but +1 for the plural name.

If those charset constants are there, people _will use_ them without respect on the existing 
_performance disadvantages_.
A common typical use case should be: String.getBytes(...)
On small strings there is a performance lost up to 25 % using the charset variant vs. the charset 
name variant. See:
http://cr.openjdk.java.net/~sherman/7040220/client
http://markmail.org/message/2tbas5skgkve52mz
http://markmail.org/thread/lnrozcbnpcl5kmzs

So I still think, we should have the standard charset names as constants in class j.n.c.Charset:
     public static final String UTF_8 = "UTF-8";  etc...
To avoid duplicate declarations, class sun.nio.cs.StandardCharsets should refer them. (After 
automatic build generation of s.n.c.StandardCharsets.java generation replace "UTF-8" by 
java.nio.charset.Charset.UTF_8 etc...)

If I understand right, the reasonable for this j.n.c.StandardCharsets class is to avoid too much 
classes to be loaded while booting the VM.

(1.) This could easily be achieved by lazy initialized static final methods: 
j.n.c.Charset.charsetUTF_8() etc... or more short j.n.c.Charset.UTF_8() etc... (unconventional 
upper-case method name would indicate that a constant is returned). Aside it would remember that the 
returned constants are platform vendor implementation dependent.
So there is no need for an additional class j.n.c.StandardCharsets to be maintained and loaded.

(2.) Alternatively/Additionally ...
In ***Bug 100098* <https://bugs.openjdk.java.net/show_bug.cgi?id=100098> - Make sun.nio.cs.* charset 
objects light-weight ** I have demonstrated that there is no need to have an *individual class for 
each charset*, so we won't need to care about 6 charset classes to be loaded during VM boot.
Besides the lookup of a charset should be much faster then before, as it doesn't anymore need a 
Class.forName(...) call, as you can see there, in my FastCharsetProvider class.
This change seems to be too invasive in the current state of JDK-7, but in the meantime we could 
have something (which demonstrates the advantage to have the canonical names as constant) like:

package sun.nio.cs;
...
import static java.nio.charset.Charset.*;
import static sun.nio.cs.StandardCharsets.*;
public class FastCharsetProvider extends CharsetProvider {
     ...

     private Charset lookup(String charsetName) {

         String csn = canonicalize(toLower(charsetName));

         // Check cache first
         Charset cs = cache.get(csn);
         if (cs != null)
             return cs;

         StandardCharset: {
             switch (charsetName) {
                 case US_ASCII : cs = new StandardCharset(US_ASCII, aliases_US_ASCII); break;
                 case ISO_8859_1 : cs = new StandardCharset(ISO_8859_1, aliases_ISO_8859_1); break;
                 case UTF_8 : cs = new StandardCharset(UTF_8, aliases_UTF_8); break;
                 case UTF_16 : cs = return new StandardCharset(UTF_16, aliases_UTF_16); break;
                 case UTF_16BE : cs = new StandardCharset(UTF_16BE, aliases_UTF_16BE); break;
                 case UTF_16LE : cs = new StandardCharset(UTF_16LE, aliases_UTF_16LE); break;
                 default : break StandardCharset;
             }
             cache.put(csn, cs);
             return cs;
         }

         // Do we even support this charset?
         String cln = classMap.get(csn);
         if (cln == null)
             return null;

         // Instantiate the charset and cache it
         ...
     }

     private final class StandardCharset extends Charset {
         private StandardCharset(String canonicalName, String[] aliases) {
             super(canonicalName, aliases);
         }
         newDecoder() {
             switch (name()) {
                 case US_ASCII : return new sun.nio.cs.US_ASCII.Decoder();
                 case ISO_8859_1 : return new sun.nio.cs.ISO_8859_1.Decoder();
                 case UTF_8 : return new sun.nio.cs.UTF_8.Decoder();
                 case UTF_16 : return new sun.nio.cs.UTF_16.Decoder();
                 case UTF_16BE : return new sun.nio.cs.UTF_16BE.Decoder();
                 case UTF_16LE : return new sun.nio.cs.UTF_16LE.Decoder();
                 default : return Charset.defaultCharset().newDecoder();
         }
         newEncoder() { ... }
         ...
     }
     ...

}

So I tend to prefer the original request from 4884238 (have the canonical names as constants), as 
the lookup via Charset.forName(...) then could be very fast compared to the anyway following heavy 
de/encoding work.

In that case we additionally could avoid such weird constructs
(potentially superfluously forces the UTF-8 charset class to be loaded):
  114     private ZipCoder(Charset cs) {
  115         this.cs = cs;
  116         this.isUTF8 = cs.name().equals(StandardCharset.UTF_8.name());
  117     }

Instead we could have:
  116         this.isUTF8 = cs.name().equals(Charset.UTF_8);

-Ulf

Am 02.05.2011 23:16, schrieb mark.reinhold at oracle.com:
> 2011/4/29 16:28 -0700, mike.duigou at oracle.com:
>> Hi Mark;
>>
>> I'm still not on your wavelength on this issue. For two reasons:
>>
>> - Collections, Arrays, Channels, Objects aren't typically used for
>> holding constants. They are either factories or collections of static
>> utility methods. Only Collections actually defines any constants and
>> those probably should have been put on the respective interfaces.
> The constants in the Collections class weren't put into the related
> interfaces because that would have polluted any implementing class or
> subinterface with that constant name.  Constants in interfaces are
> almost always a bad idea.
>
>> The closest example seems to be PosixFilePermission (an enum)
>> vs. PosixFilePermissions (static utility methods). The one difference
>> is that PosixFilePermission is exhaustive, no other permissions are
>> possible, vs the standard charsets where they represent only a subset
>> of the available charsets. AFAIK even ME distributions have more than
>> just the standard charsets.
>>
>> For the standard charsets we've got no methods, only constants. Is
>> Charsets preferred only because we want to avoid putting these
>> constants on Charset itself for the previously cited performance
>> reasons?
> To be clear, I'm not objecting to the prefix "Standard" -- I'm just
> arguing in favor of the suffix "s".
>
>> - The standard charset constants do seem like an enum. The enumeration
>> they belong to is easily named and part of the platform
>> specification.
> Conceptually, I agree.
>
>>                 Putting these definitions all into one class makes
>> membership obvious--the standard charsets are the ones in this
>> class.
> Yep.
>
>>         Charsets not defined in this class are not standard and thus not
>> guaranteed to be available. (which is an interesting point, though the
>> standard charsets are now visible in the platform we still don't
>> provide a way to test whether a given charset is a member of the set of
>> standard charsets).
> That is an omission, but I wonder if anyone cares?  (A possible fix would
> be to define StandardCharsets.ALL as a set containing all the singletons
> defined in the class.)
>
>> Barring history and implementation details I don't see why we wouldn't
>> be defining these constants as an enum implementing Charset the same
>> way as we've defined the jsr203 Standard* enums. The enum members would
>> have more implementation than the jsr203 Standard* enums, ie. more than
>> identity but almost all of that is constant data itself.
> That would require Charset to have been defined as an interface, which
> in turn would've added substantial complexity to the original design.
> At any rate, it's water under the bridge at this point.
>
>> What makes these constants *not* an enumeration?
> If I have the enum type
>
>      enum Foo { BAR, BAZ }
>
> then the following assignment works:
>
>      Foo x = Foo.BAR;
>
> A statement of that form does not work with your StandardCharset class:
>
>      StandardCharset x = StandardCharset.US_ASCII;
>
> That's because StandardCharset is a final class and it does not extend
> Charset, nor any other class or interface except Object.  It's just a
> holder for a bunch of convenient constants, like java.util.Arrays and
> j.u.Collections, etc.  The only difference is that it defines some
> static constant values but no static methods.
>
> In short, the hallmark of an enum type is that its members all have the
> same type, and that type is the enum type itself.  This is true whether
> you use the fancy enum construct from Java 5 or the older hand-coded enum
> pattern.  Your StandardCharset class is not an enum type.
>
> In order to fix this you could change StandardCharset to extend Charset,
> but then you'd have to jump through some hoops to make the standard
> charset implementation classes extend StandardCharset rather than plain
> Charset, and you'd have to be careful not to allow anyone to extend
> StandardCharset.  Offhand this seems to require moving the standard
> charset implementations into the java.nio.charset package itself.
>
> That's an awful lot of effort for just a handful of constants.
>
>> I'm not trying to be difficult on this, honestly.
> I understand.  Me neither.
>
> - Mark
>