Rewrite of IBM doublebyte charsets

Ulf Zibis Ulf.Zibis at
Thu May 21 23:41:24 UTC 2009

On 21.05.2009 00:22, Xueming Shen wrote:
> Ulf Zibis wrote:
>> (6) Unload b2cStr from memory after startup:
>>    - outsource b2cStr to additional class file like EUC_TW approach
>>    - set b2cStr = null after startup (remove final modifier)
>>   Benefit[6]: avoid 100 % superfluous memory-footprint
> I doubt it really saves something real, since the "class" should still 
> keep its copy somewhere...and
> I will need it for c2b (now I'm "delaying" the c2b init)

I always thought that an object would be garbage-collected automatically 
once its reference was set to null after use. Am I wrong?
... but we can do the c2b init from b2c[][] instead of from b2cStr[], so 
why keep it?
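
As a minimal sketch of that idea (not the actual JDK code), the encoder 
table could be filled by walking b2c[][] once. The class name, the flat 
64K c2b layout and the 0xFFFD "unmappable" sentinel are assumptions made 
for illustration only; the real converters may use a segmented table.

```java
import java.util.Arrays;

// Hypothetical helper: derives the char -> bytes table from b2c[][]
// so that b2cStr is no longer needed once b2c has been built.
final class C2BFromB2C {
    static final char UNMAPPABLE_DECODING = '\uFFFD';
    static final char UNMAPPABLE_ENCODING = '\uFFFD'; // assumed sentinel

    // b2c[b1][b2] holds the char for the byte pair (b1, b2);
    // rows may be null when a lead byte has no mappings.
    static char[] initC2B(char[][] b2c) {
        char[] c2b = new char[0x10000];
        Arrays.fill(c2b, UNMAPPABLE_ENCODING);
        for (int b1 = 0; b1 < b2c.length; b1++) {
            char[] row = b2c[b1];
            if (row == null)
                continue;
            for (int b2 = 0; b2 < row.length; b2++) {
                char c = row[b2];
                if (c != UNMAPPABLE_DECODING)
                    c2b[c] = (char) ((b1 << 8) | b2); // pack both bytes
            }
        }
        return c2b;
    }

    public static void main(String[] args) {
        char[][] b2c = new char[256][];
        b2c[0x41] = new char[256];
        Arrays.fill(b2c[0x41], UNMAPPABLE_DECODING);
        b2c[0x41][0x42] = '\u3042'; // made-up mapping, for the demo only
        char[] c2b = initC2B(b2c);
        System.out.println(Integer.toHexString(c2b['\u3042']));
    }
}
```

With a table built this way, nothing but b2c[][] has to stay reachable 
after initialization.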

>> (7) Avoid copying b2cStr to b2c:
>>    (String#charAt() is fast as char[] access)
>>   Benefit[7]: increase startup performance for decoder
> I tried again last night. char[][] is much faster than the String[] 
> version in both client
> and server vm. So keep it asis. (this was actually I switched from 
> String[] to char[][])

I'm surprised, because I remember from older benchmarks that 
char_array[index] had the same speed as str.charAt(index) after 
HotSpot optimization.
I also got these results here:

>> (12) Get rid of package dependency:
>>   Benefit[13]: avoid superfluous disk-footprint
>>   Benefit[14]: save maintenance of converters
>>   Disadvantage[1]: published under JRL (waiting for launch of 
>> OpenJDK-7 project "charset-enhancement") ;-)
> This is not something about engineering. It's about license, policy...

So hopefully we will have the OpenJDK7 project "charset-enhancement" soon.

>> (17) Decoder#decodeArrayLoop: shortcut for single byte only:
>>      int sr = src.remaining();
>>      int sl = sp + sr;
>>      int dr = dst.remaining();
>>      int dl = dp + dr;
>>      // single byte only loop
>>      int slSB = sp + (sr < dr ? sr : dr);
>>      while (sp < slSB) {
>>          char c = b2cSB[sa[sp] & 0xff];
>>          if (c == UNMAPPABLE_DECODING)
>>              break;
>>          da[dp++] = c;
>>          sp++;
>>      }
>>     Same for Encoder#encodeArrayLoop
>> (18) Decoder_EBCDIC: boolean singlebyteState:
>>      if (singlebyteState)
>>          ...
>> (19) Decoder_EBCDIC: decode single byte first:
>>      if (singlebyteState)
>>          c = b2cSB[b1];
>>      if (c == UNMAPPABLE_DECODING) {
>>          ...
>>      }
>>   Benefit[20]: should be faster
> Not like when we dealing with singlebyte charsets. For doublebyte 
> charsets
> the priority should be given to doublebyte codepoints, if possible. 
> Not single
> byte codepoints.

- I am assuming that single-byte-only input is a common use 
case. Am I wrong in the case of EBCDIC?
- This hack doesn't make processing of "normal" mixed input slower once 
it has fallen through to the "normal" while(...) loop.
- This hack was copied from the UTF-8 coder, where ASCII-only input is a 
common use case.
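
To make the intended control flow concrete, here is a minimal, 
self-contained sketch of the single-byte shortcut. It reuses the names 
from the snippet in (17) (sa, sp, da, dp, b2cSB, UNMAPPABLE_DECODING); 
the class name, the method signature and the demo table are assumptions 
for illustration, not the real decoder API.

```java
import java.util.Arrays;

// Hypothetical stand-alone version of the shortcut from (17).
final class SingleByteFastPath {
    static final char UNMAPPABLE_DECODING = '\uFFFD';

    // Decodes leading single-byte input from sa[sp..sl) into da[dp..dl).
    // Stops at the first byte without a single-byte mapping, or when
    // source or destination is exhausted. Returns the bytes consumed,
    // which equals the chars produced.
    static int decodeSingleBytePrefix(byte[] sa, int sp, int sl,
                                      char[] da, int dp, int dl,
                                      char[] b2cSB) {
        int start = sp;
        // limit by whichever of source and destination is shorter
        int slSB = sp + Math.min(sl - sp, dl - dp);
        while (sp < slSB) {
            char c = b2cSB[sa[sp] & 0xff]; // bitwise AND, not &&
            if (c == UNMAPPABLE_DECODING)
                break; // leave the rest to the normal mixed loop
            da[dp++] = c;
            sp++;
        }
        return sp - start;
    }

    public static void main(String[] args) {
        // Demo table (made up): only printable ASCII maps to itself.
        char[] b2cSB = new char[256];
        Arrays.fill(b2cSB, UNMAPPABLE_DECODING);
        for (int i = 0x20; i < 0x7f; i++)
            b2cSB[i] = (char) i;

        byte[] src = { 'a', 'b', (byte) 0x81, 0x40, 'c' };
        char[] dst = new char[10];
        int n = decodeSingleBytePrefix(src, 0, src.length,
                                       dst, 0, dst.length, b2cSB);
        System.out.println(n + " bytes handled by the shortcut");
    }
}
```

The caller would continue with the usual double-byte loop at position 
sp + n, so mixed input only pays for one failed table lookup.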

>> *** Encoder-Suggestions:
>> (21) join *.nr to *.c2b files (25->000a becomes 000a->fffd):
>>   Benefit[21]: reduce no. of files
>>   Benefit[22]: simplifies initC2B() (avoids 2 loops)
> In theory you can do some magic to "join" .nr into .c2b. The price 
> might be more complicated
> logic depends on the codepoints. You may end up doing some table 
> lookup for each codepoint
> in b2c when processing.

This "magic" should be done at build time, so the price is paid only 
once, while building the JDK. But to be honest, it could be done 
by hand for those few mapping pairs. See my single-byte IBMxxx mappings 
... and don't forget, it avoids copying the whole b2c.

> And big thanks for all the suggestions.

Thanks for your appreciation. :-)


More information about the core-libs-dev mailing list