<i18n dev> RFR: 8289834: Add SBCS and DBCS Only EBCDIC charsets

Mon Oct 3 07:16:29 UTC 2022

On Fri, 26 Aug 2022 09:25:55 GMT, Alan Bateman <alanb at openjdk.org> wrote:

>> OpenJDK supports "Japanese EBCDIC - Katakana" and "Korean EBCDIC" SBCS and DBCS Only charsets.
>> |Charset|Mix|SBCS|DBCS|
>> | -- | -- | -- | -- |
>> | Japanese EBCDIC - Katakana | Cp930 | Cp290 | Cp300 |
>> | Korean | Cp933 | Cp833 | Cp834 |
>> 
>> But OpenJDK does not support some of the "Japanese EBCDIC - English" / "Simplified Chinese EBCDIC" / "Traditional Chinese EBCDIC" SBCS and DBCS Only charsets.
>> 
>> I'd like to request Cp1027/Cp835/Cp836/Cp837 for consistency
>> |Charset|Mix|SBCS|DBCS|
>> | ------------- | ------------- | ------------- | ------------- |
>> | Japanese EBCDIC - English | Cp939 | **Cp1027** | Cp300 |
>> | Simplified Chinese EBCDIC | Cp935 | **Cp836** | **Cp837** |
>> | Traditional Chinese EBCDIC | Cp937 | (*1) | **Cp835** | 
>> 
>> *1: Cp037 compatible
>
>> Use the following options:
>> OpenJDK: `java -cp icu4j-71_1.jar:icu4j-charset-71_1.jar:. tc IBM-1047 20000 1 1`
>> ICU4J: `java -cp icu4j-71_1.jar:icu4j-charset-71_1.jar:. tc IBM-1047_P100-1995 20000 1 1`
>> 
>> Actually, I'm confused by this result. Previously, I was just comparing A/A with B/B on OpenJDK's charset. I didn't think ICU4J's result would make a difference.
> 
> My initial reaction is one of relief that the icu4j provider can be used with current JDK builds. This means there is an option should we decide to stop adding more EBCDIC charsets to the JDK.
> 
> The test uses IBM-1047 and I can't tell if the icu4j provider is used or not. Charset doesn't define a provider method, but I think it would be useful to print cs.getClass() or cs.getClass().getModule() so we know which Charset implementation is used. Also, I think any discussion of performance would be better served by a JMH benchmark than by a standalone test.

Hello @AlanBateman .
Sorry I'm late.

I created a Charset SPI JAR (`custom-charsets.jar`) that provides `x-IBM1047_SPI`. It was ported from `sun.nio.cs.SingleByte.java` and the generated `IBM1047.java`.
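
For readers unfamiliar with the Charset SPI, the following is a minimal sketch of what a table-driven single-byte charset and its provider can look like when written only against public API. It is not the code in `custom-charsets.jar`: the names `DemoCharset`, `DemoCharsetProvider`, and `x-Demo-SPI` are hypothetical, and the mapping table is copied from ISO-8859-1 as a stand-in rather than from the IBM1047 mapping.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;
import java.nio.charset.spi.CharsetProvider;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;

// Hypothetical table-driven single-byte charset (illustration only).
final class DemoCharset extends Charset {
    private final char[] b2c = new char[256];    // byte value -> char
    private final int[] c2b = new int[0x10000];  // char -> byte value, -1 if unmappable

    // Assumption: the tables are copied from an existing single-byte charset.
    DemoCharset(String name, Charset singleByteSource) {
        super(name, null);
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) all[i] = (byte) i;
        String decoded = new String(all, singleByteSource);
        if (decoded.length() != 256) {
            throw new IllegalArgumentException("not a single-byte charset");
        }
        Arrays.fill(c2b, -1);
        for (int i = 0; i < 256; i++) {
            b2c[i] = decoded.charAt(i);
            if (b2c[i] != '\uFFFD') c2b[b2c[i]] = i;
        }
    }

    @Override public boolean contains(Charset cs) { return cs.equals(this); }

    @Override public CharsetDecoder newDecoder() {
        return new CharsetDecoder(this, 1.0f, 1.0f) {
            @Override protected CoderResult decodeLoop(ByteBuffer in, CharBuffer out) {
                while (in.hasRemaining()) {
                    if (!out.hasRemaining()) return CoderResult.OVERFLOW;
                    out.put(b2c[in.get() & 0xFF]);
                }
                return CoderResult.UNDERFLOW;
            }
        };
    }

    @Override public CharsetEncoder newEncoder() {
        return new CharsetEncoder(this, 1.0f, 1.0f) {
            @Override protected CoderResult encodeLoop(CharBuffer in, ByteBuffer out) {
                while (in.hasRemaining()) {
                    if (!out.hasRemaining()) return CoderResult.OVERFLOW;
                    int b = c2b[in.get(in.position())];  // peek without consuming
                    if (b < 0) return CoderResult.unmappableForLength(1);
                    in.get();                            // consume the char
                    out.put((byte) b);
                }
                return CoderResult.UNDERFLOW;
            }
        };
    }
}

// Hypothetical provider. A real one must be a public class so ServiceLoader can
// instantiate it; it is package-private here only to keep the sketch in one file.
final class DemoCharsetProvider extends CharsetProvider {
    private final Charset cs = new DemoCharset("x-Demo-SPI", StandardCharsets.ISO_8859_1);

    @Override public Charset charsetForName(String name) {
        return cs.name().equalsIgnoreCase(name) ? cs : null;
    }

    @Override public Iterator<Charset> charsets() {
        return Collections.singletonList(cs).iterator();
    }
}
```

A provider like this is normally registered by listing its class name in `META-INF/services/java.nio.charset.spi.CharsetProvider` inside the JAR, after which `Charset.forName` and `String.getBytes(String)` can resolve the name. An encoder written this way goes through the generic `CharsetEncoder` buffer path, which is the kind of implementation the SPI side of the benchmark exercises.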

Test code:

package com.example;

import java.nio.charset.Charset;
import org.openjdk.jmh.annotations.Benchmark;

public class MyBenchmark {

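    // Input: 0x2000 (8192) chars cycling through U+0000..U+00FF,
    // so every char fits in a single byte.
    static final String s;

    static {
        char[] ca = new char[0x2000];
        for (int i = 0; i < ca.length; i++) {
            ca[i] = (char) (i & 0xFF);
        }
        s = new String(ca);
    }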

    @Benchmark
    public void testIBM1047() throws Exception {
        byte[] ba = s.getBytes("IBM1047");
    }

    @Benchmark
    public void testIBM1047_SPI() throws Exception {
        byte[] ba = s.getBytes("x-IBM1047_SPI");
    }

}

All test related files are in [JDK-8289834](https://bugs.openjdk.org/browse/JDK-8289834).
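
On the earlier question of which implementation actually serves a given charset name, a small check along the lines suggested in the quoted review could be run next to the benchmark. This is only a sketch: the class name `CharsetOrigin` and the `describe` helper are hypothetical, and `getModule()` assumes Java 9 or later.

```java
import java.nio.charset.Charset;

final class CharsetOrigin {
    /** Reports the implementing class and module of the charset registered under a name. */
    static String describe(String name) {
        if (!Charset.isSupported(name)) {
            return name + ": not supported by this runtime";
        }
        Charset cs = Charset.forName(name);
        // getModule() needs Java 9+; on Java 8, cs.getClass().getClassLoader() is an alternative.
        return name + " -> " + cs.getClass().getName() + " in " + cs.getClass().getModule();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"IBM1047", "x-IBM1047_SPI"}) {
            System.out.println(describe(name));
        }
    }
}
```

With the SPI JAR on the class path, the two names should report different implementing classes, which confirms that the two benchmark methods exercise different encoders.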

Test results on RHEL 8.6 x86_64 (Intel Core i7 3520M) are as follows:

1.8.0_345-b01
Benchmark                     Mode  Cnt      Score     Error  Units
MyBenchmark.testIBM1047      thrpt   25  53213.092 ± 126.962  ops/s
MyBenchmark.testIBM1047_SPI  thrpt   25  47442.669 ± 349.003  ops/s

20-ea+17-1181
Benchmark                     Mode  Cnt       Score      Error  Units
MyBenchmark.testIBM1047      thrpt   25  136331.141 ± 1078.481  ops/s
MyBenchmark.testIBM1047_SPI  thrpt   25   51563.213 ±  843.238  ops/s

On JDK 20, the built-in IBM1047 is about 2.6 times faster than the SPI version (136331 vs. 51563 ops/s), while on JDK 8 the difference is only about 12%.
I think these results are related to **JEP 254: Compact Strings**.
As I requested before, we'd like to be able to use the `sun.nio.cs.SingleByte*` and `sun.nio.cs.DoubleByte*` classes as public API.
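
The speedup figures follow directly from the throughput scores in the tables above. A short sketch of the arithmetic, with the hypothetical helper `ratio` and the error margins ignored:

```java
import java.util.Locale;

final class SpeedupRatio {
    /** Throughput of the built-in charset divided by throughput of the SPI charset. */
    static double ratio(double builtinOpsPerSec, double spiOpsPerSec) {
        return builtinOpsPerSec / spiOpsPerSec;
    }

    public static void main(String[] args) {
        // Scores copied from the JMH output above (ops/s).
        System.out.println(String.format(Locale.ROOT, "JDK 8:  %.2fx", ratio(53213.092, 47442.669)));   // 1.12x
        System.out.println(String.format(Locale.ROOT, "JDK 20: %.2fx", ratio(136331.141, 51563.213))); // 2.64x
    }
}
```

Between the two releases the SPI score stays in the same range (about 47k to 52k ops/s), while the built-in score rises from about 53k to about 136k ops/s, so the wider gap comes from the built-in path getting faster rather than the SPI path getting slower.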

-------------

PR: https://git.openjdk.org/jdk/pull/9399