<i18n dev> RFR: 8289834: Add SBCS and DBCS Only EBCDIC charsets
Ichiroh Takiguchi
itakiguchi at openjdk.org
Mon Oct 3 07:16:29 UTC 2022
On Fri, 26 Aug 2022 09:25:55 GMT, Alan Bateman <alanb at openjdk.org> wrote:
>> OpenJDK supports "Japanese EBCDIC - Katakana" and "Korean EBCDIC" SBCS and DBCS Only charsets.
>> |Charset|Mix|SBCS|DBCS|
>> | -- | -- | -- | -- |
>> | Japanese EBCDIC - Katakana | Cp930 | Cp290 | Cp300 |
>> | Korean | Cp933 | Cp833 | Cp834 |
>>
>> But OpenJDK does not support some of the "Japanese EBCDIC - English" / "Simplified Chinese EBCDIC" / "Traditional Chinese EBCDIC" SBCS and DBCS Only charsets.
>>
>> I'd like to request Cp1027/Cp835/Cp836/Cp837 for consistency
>> |Charset|Mix|SBCS|DBCS|
>> | ------------- | ------------- | ------------- | ------------- |
>> | Japanese EBCDIC - English | Cp939 | **Cp1027** | Cp300 |
>> | Simplified Chinese EBCDIC | Cp935 | **Cp836** | **Cp837** |
>> | Traditional Chinese EBCDIC | Cp937 | (*1) | **Cp835** |
>>
>> *1: Cp037 compatible
>
>> Use following options, like
>> OpenJDK: `java -cp icu4j-71_1.jar:icu4j-charset-71_1.jar:. tc IBM-1047 20000 1 1`
>> ICU4J: `java -cp icu4j-71_1.jar:icu4j-charset-71_1.jar:. tc IBM-1047_P100-1995 20000 1 1`
>>
>> Actually, I'm confused by this result. Previously, I was just comparing A/A with B/B on OpenJDK's charset. I didn't think ICU4J's result would make a difference.
>
> My initial reaction is one of relief that the icu4j provider can be used with current JDK builds. This means there is an option should we decide to stop adding more EBCDIC charsets to the JDK.
>
> The test uses IBM-1047 and I can't tell if the icu4j provider is used or not. Charset doesn't define a provider method but I think it would be useful to print cs.getClass() or cs.getClass().getModule() so we know which Charset implementation is used. Also I think any discussion on performance would be better served with a JMH benchmark rather than a standalone test.
Hello @AlanBateman .
Sorry I'm late.
I created a Charset SPI JAR for `x-IBM1047_SPI` (`custom-charsets.jar`), which was ported from `sun.nio.cs.SingleByte.java` and the generated `IBM1047.java`.
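For readers unfamiliar with the Charset SPI, here is a minimal sketch of the general shape of such a provider. It is not the code in `custom-charsets.jar`: the names `DemoCharsetProvider`, `DemoCharset` and `x-Demo_SPI` are hypothetical, and the charset simply delegates to ISO-8859-1 instead of porting `SingleByte`. A real JAR would also list the provider class in `META-INF/services/java.nio.charset.spi.CharsetProvider` so that `Charset.forName` can find it.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.charset.spi.CharsetProvider;
import java.util.Collections;
import java.util.Iterator;

// Hypothetical provider; class and charset names are illustrative only.
public class DemoCharsetProvider extends CharsetProvider {

    // Hypothetical charset that delegates to ISO-8859-1 to keep the sketch short.
    static final class DemoCharset extends Charset {
        DemoCharset() {
            super("x-Demo_SPI", new String[0]);
        }

        @Override
        public boolean contains(Charset cs) {
            return cs instanceof DemoCharset;
        }

        @Override
        public CharsetDecoder newDecoder() {
            return StandardCharsets.ISO_8859_1.newDecoder();
        }

        @Override
        public CharsetEncoder newEncoder() {
            return StandardCharsets.ISO_8859_1.newEncoder();
        }
    }

    private final Charset demo = new DemoCharset();

    // All charsets offered by this provider.
    @Override
    public Iterator<Charset> charsets() {
        return Collections.singletonList(demo).iterator();
    }

    // Return the charset for a matching name, or null so other providers are tried.
    @Override
    public Charset charsetForName(String name) {
        return demo.name().equalsIgnoreCase(name) ? demo : null;
    }
}
```

The provider can be exercised directly, without service registration, by calling `new DemoCharsetProvider().charsetForName("x-Demo_SPI")`.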
Test code:
    package com.example;

    import java.nio.charset.Charset;
    import org.openjdk.jmh.annotations.Benchmark;

    public class MyBenchmark {
        final static String s;
        static {
            char[] ca = new char[0x2000];
            for (int i = 0; i < ca.length; i++) {
                ca[i] = (char) (i & 0xFF);
            }
            s = new String(ca);
        }

        @Benchmark
        public void testIBM1047() throws Exception {
            byte[] ba = s.getBytes("IBM1047");
        }

        @Benchmark
        public void testIBM1047_SPI() throws Exception {
            byte[] ba = s.getBytes("x-IBM1047_SPI");
        }
    }
All test related files are in [JDK-8289834](https://bugs.openjdk.org/browse/JDK-8289834).
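To make it visible which implementation serves each name, as suggested in the quoted comment, a small check along these lines could be run next to the benchmark. This is only a sketch: `WhichCharset` and `describe` are hypothetical names, and `Class.getModule()` needs JDK 9 or later, so on 1.8.0 only `getClass()` is available.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of the attached test files.
public class WhichCharset {

    // Describe the class and module that back a charset name.
    static String describe(String name) {
        Charset cs = Charset.forName(name);
        Class<?> c = cs.getClass();
        return name + " -> " + c.getName() + " in " + c.getModule();
    }

    public static void main(String[] args) {
        // Pass charset names such as IBM1047 on the command line;
        // ISO-8859-1 is only a default that every JDK provides.
        String[] names = args.length > 0 ? args : new String[] {"ISO-8859-1"};
        for (String n : names) {
            System.out.println(describe(n));
        }
    }
}
```

Running it with `IBM1047 x-IBM1047_SPI` and the SPI JAR on the class path should show one class from the JDK and one from the JAR.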
Test results are as follows on RHEL8.6 x86_64 (Intel Core i7 3520M):

1.8.0_345-b01

    Benchmark                     Mode  Cnt       Score      Error  Units
    MyBenchmark.testIBM1047      thrpt   25   53213.092 ±  126.962  ops/s
    MyBenchmark.testIBM1047_SPI  thrpt   25   47442.669 ±  349.003  ops/s

20-ea+17-1181

    Benchmark                     Mode  Cnt       Score      Error  Units
    MyBenchmark.testIBM1047      thrpt   25  136331.141 ± 1078.481  ops/s
    MyBenchmark.testIBM1047_SPI  thrpt   25   51563.213 ±  843.238  ops/s

IBM1047 is 2.6 times faster than the SPI version on JDK20.
I think these results are related to **JEP 254: Compact Strings**.
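As background for the JEP 254 point: with Compact Strings, a `String` whose chars all fit in one byte is stored internally as Latin-1 bytes. The sketch below rebuilds the benchmark input and confirms that every char is at most 0xFF, so it would qualify for that compact form. `Latin1Check` and its methods are hypothetical names, and the link from this property to the measured gap remains my assumption rather than something this code proves.

```java
// Hypothetical helper, not part of the attached test files.
public class Latin1Check {

    // Rebuild the same 0x2000-char input that MyBenchmark uses.
    static String benchmarkInput() {
        char[] ca = new char[0x2000];
        for (int i = 0; i < ca.length; i++) {
            ca[i] = (char) (i & 0xFF);
        }
        return new String(ca);
    }

    // True if every char fits in a single Latin-1 byte.
    static boolean isLatin1(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("Latin-1 only: " + isLatin1(benchmarkInput()));
    }
}
```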
As I requested before, we'd like to be able to use the `sun.nio.cs.SingleByte*` and `sun.nio.cs.DoubleByte*` classes as public API.
-------------
PR: https://git.openjdk.org/jdk/pull/9399