<i18n dev> RFR: 8289834: Add SBCS and DBCS Only EBCDIC charsets

Fri Aug 26 07:19:04 UTC 2022

On Mon, 8 Aug 2022 09:22:32 GMT, Alan Bateman <alanb at openjdk.org> wrote:

>> Hello @AlanBateman .
>> Sorry I'm late.
>> I got some responses from ICU. [ICU-22091](https://unicode-org.atlassian.net/browse/ICU-22091)
>> I'm not sure if they're interested in the new charset...
>> 
>> As you know, the `sun.nio.cs.ArrayDecoder` and `sun.nio.cs.ArrayEncoder` interfaces have a performance advantage.
>> The built-in charset decoders/encoders have some other performance advantages as well.
>> Is it possible to create a simple public API based on the `sun.nio.cs.SingleByte` and `sun.nio.cs.DoubleByte*` classes?
>> We'd like to use a stable conversion loop.
>
> If they have ASCII compatible regions then that may be so, but I haven't seen any performance data published on that. Do you know of any experiments that have deployed a CharsetProvider for the EBCDIC charsets and compared the performance with the charsets that are in the JDK? There may be merit in exploring adding abstract base implementations of CharsetEncoder/CharsetDecoder to java.nio.charset.spi to support single and double byte charsets, to see how such base implementations might look, how they would help performance, and whether there are any security downsides.

Hello @AlanBateman .
Sorry for the late reply.
My test results are below (no guarantees about their accuracy).

I wrote the small test program below; I'm not sure whether it is a good test.

import java.nio.*;
import java.nio.charset.*;

// Usage: java tc <charset> <iterations> <1 = wrap char[], else wrap String> <1 = heap ByteBuffer, else direct>
public class tc {
  public static void main(String[] args) throws Exception {
    Charset cs = Charset.forName(args[0]);
    int cnt = Integer.parseInt(args[1]);
    boolean useCA = "1".equals(args[2]); // CharBuffer wraps a char[] (true) or a String (false)
    boolean useBA = "1".equals(args[3]); // ByteBuffer from allocate() (true) or allocateDirect() (false)
    CharsetEncoder ce = cs.newEncoder();
    // Build the input text by decoding the byte values 0x00..0xFF, repeated to fill 16 KiB.
    byte[] ba = new byte[0x4000];
    for (int i = 0; i < ba.length; i++) {
      ba[i] = (byte) i;
    }
    String s = new String(ba, cs);
    char[] ca = s.toCharArray();
    ByteBuffer bb = useBA ? ByteBuffer.allocate(ca.length) : ByteBuffer.allocateDirect(ca.length);
    CharBuffer cb = useCA ? CharBuffer.wrap(ca) : CharBuffer.wrap(s);
    System.out.println("CharBuffer.hasArray() = " + cb.hasArray());
    System.out.println("ByteBuffer.hasArray() = " + bb.hasArray());
    // Warm-up: 200 encode passes before the measured loop.
    long start_t = System.currentTimeMillis();
    for (int i = 0; i < 200; i++) {
      ce.reset();
      bb.position(0);
      cb.position(0);
      ce.encode(cb, bb, true);
    }
    System.out.println("Warmup: " + (System.currentTimeMillis() - start_t));
    // Measured loop: cnt encode passes over the same buffers, elapsed time in milliseconds.
    start_t = System.currentTimeMillis();
    for (int i = 0; i < cnt; i++) {
      ce.reset();
      bb.position(0);
      cb.position(0);
      ce.encode(cb, bb, true);
    }
    System.out.println("Test: " + (System.currentTimeMillis() - start_t));
  }
}

The following results apply only to my test environment:
* CPU: Intel (on-premises machines; the platforms did not run on the same machine)
* Each case was executed 5 times; the values are the averages of the times printed by the program (ms)
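
The averaging step can be sketched roughly as follows. This is only an illustration of "run N times and take the mean"; the class `AverageRuns` and its method names are hypothetical and are not part of the test program above.

```java
import java.util.function.LongSupplier;

// Hypothetical helper: averages the elapsed time of several runs.
class AverageRuns {
    // Calls oneRunMillis 'runs' times and returns the arithmetic mean of the results.
    static double average(int runs, LongSupplier oneRunMillis) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        long total = 0;
        for (int i = 0; i < runs; i++) {
            total += oneRunMillis.getAsLong();
        }
        return (double) total / runs;
    }

    // Times a single task with System.nanoTime and converts the result to milliseconds.
    static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Placeholder workload; a real run would call the encode loop instead.
        Runnable work = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append(i);
            }
        };
        System.out.println("Average ms over 5 runs: " + average(5, () -> timeMillis(work)));
    }
}
```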

I ran the program with options like the following.
OpenJDK:
`java -cp icu4j-71_1.jar:icu4j-charset-71_1.jar:. tc IBM-1047 20000 1 1`
ICU4J:
`java -cp icu4j-71_1.jar:icu4j-charset-71_1.jar:. tc IBM-1047_P100-1995 20000 1 1`
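
Since the two commands differ only in the charset name, it may be worth confirming which implementation class `Charset.forName` actually resolves for each name. A minimal sketch using only the standard `java.nio.charset` API; the class name `WhichCharset` and the output format are made up for illustration.

```java
import java.nio.charset.Charset;

// Hypothetical helper: reports which Charset implementation a name resolves to.
class WhichCharset {
    static String describe(String name) {
        // isSupported returns false for a legal but unknown charset name.
        if (!Charset.isSupported(name)) {
            return name + ": not supported";
        }
        Charset cs = Charset.forName(name);
        // The implementation class name hints at the provider (JDK built-in or third party).
        return name + " -> " + cs.name() + " (" + cs.getClass().getName() + ")";
    }

    public static void main(String[] args) {
        for (String name : args) {
            System.out.println(describe(name));
        }
    }
}
```

Run with the same class path as above and the arguments `IBM-1047 IBM-1047_P100-1995`, this should show whether each name maps to a JDK class or to an ICU4J class.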

I used jdk-20 b12.
Only the A/A case with OpenJDK uses the `ArrayEncoder` (`ArrayDecoder`) interface.
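
To illustrate why array-backed buffers can matter, here is a rough sketch of an encode routine that takes a raw-array loop when both buffers expose a backing array and otherwise falls back to buffer accessors. This is not the `sun.nio.cs` code: the class `LoopDispatch` is hypothetical, and the identity mapping for chars up to U+00FF is only a stand-in for a real EBCDIC table.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CoderResult;

// Hypothetical sketch of dispatching on hasArray(); not the JDK implementation.
class LoopDispatch {
    static boolean usesArrayLoop(CharBuffer src, ByteBuffer dst) {
        return src.hasArray() && dst.hasArray();
    }

    static CoderResult encode(CharBuffer src, ByteBuffer dst) {
        return usesArrayLoop(src, dst) ? encodeArrayLoop(src, dst) : encodeBufferLoop(src, dst);
    }

    // Fast path: index directly into the backing arrays, then publish the new positions.
    private static CoderResult encodeArrayLoop(CharBuffer src, ByteBuffer dst) {
        char[] sa = src.array();
        int sp = src.arrayOffset() + src.position();
        int sl = src.arrayOffset() + src.limit();
        byte[] da = dst.array();
        int dp = dst.arrayOffset() + dst.position();
        int dl = dst.arrayOffset() + dst.limit();
        CoderResult cr = CoderResult.UNDERFLOW;
        while (sp < sl) {
            char c = sa[sp];
            if (c > 0xFF) { cr = CoderResult.unmappableForLength(1); break; }
            if (dp >= dl) { cr = CoderResult.OVERFLOW; break; }
            da[dp++] = (byte) c; // stand-in mapping; a real charset would use a lookup table
            sp++;
        }
        src.position(sp - src.arrayOffset());
        dst.position(dp - dst.arrayOffset());
        return cr;
    }

    // General path: works for String-backed CharBuffers and direct ByteBuffers.
    private static CoderResult encodeBufferLoop(CharBuffer src, ByteBuffer dst) {
        while (src.hasRemaining()) {
            char c = src.get(src.position()); // absolute get, does not advance the position
            if (c > 0xFF) return CoderResult.unmappableForLength(1);
            if (!dst.hasRemaining()) return CoderResult.OVERFLOW;
            dst.put((byte) c);
            src.position(src.position() + 1);
        }
        return CoderResult.UNDERFLOW;
    }

    public static void main(String[] args) {
        CharBuffer cb = CharBuffer.wrap("abc".toCharArray());
        ByteBuffer bb = ByteBuffer.allocate(8);
        System.out.println("array loop: " + usesArrayLoop(cb, bb) + ", result: " + encode(cb, bb));
    }
}
```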

| Platform (charset implementation) | A/A | A/B | B/A | B/B |
| -- | --: | --: | --: | --: |
| Linux (OpenJDK) | 862 | 1265 | 1838 | 1843 |
| Linux (ICU4J) | 1450 | 1410 | 1152 | 1138 |
| Windows (OpenJDK) | 921 | 1231 | 1959 | 1850 |
| Windows (ICU4J) | 1431 | 1446 | 2227 | 2265 |
| Mac (OpenJDK) | 820 | 1163 | 1799 | 1774 |
| Mac (ICU4J) | 1282 | 1242 | 994 | 1049 |

Notes (the last two program arguments select the case):
* A/A (`1 1`): the CharBuffer wraps a char[], the ByteBuffer is created by allocate()
* A/B (`1 0`): the CharBuffer wraps a char[], the ByteBuffer is created by allocateDirect()
* B/A (`0 1`): the CharBuffer wraps a String, the ByteBuffer is created by allocate()
* B/B (`0 0`): the CharBuffer wraps a String, the ByteBuffer is created by allocateDirect()
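
The A/B labels correspond to whether each buffer reports a backing array through `hasArray()`, which is what the test program prints at startup. A small sketch of that mapping using only standard `java.nio` calls; the class name `BufferKinds` is made up for illustration.

```java
import java.nio.Buffer;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;

// Hypothetical helper: maps buffer construction choices to the A/B labels used in the table.
class BufferKinds {
    // "A" for a buffer with an accessible backing array, "B" otherwise.
    static String label(Buffer b) {
        return b.hasArray() ? "A" : "B";
    }

    static String caseName(boolean useCharArray, boolean useHeapBytes) {
        String text = "sample";
        CharBuffer cb = useCharArray ? CharBuffer.wrap(text.toCharArray()) : CharBuffer.wrap(text);
        ByteBuffer bb = useHeapBytes ? ByteBuffer.allocate(16) : ByteBuffer.allocateDirect(16);
        return label(cb) + "/" + label(bb);
    }

    public static void main(String[] args) {
        boolean[] choices = {true, false};
        for (boolean ca : choices) {
            for (boolean ba : choices) {
                System.out.println("char[]=" + ca + ", heap=" + ba + " -> " + caseName(ca, ba));
            }
        }
    }
}
```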

Actually, I'm confused by this result.
Previously, I was just comparing A/A with B/B on OpenJDK's charset.
I didn't think ICU4J's result would make a difference.

Anyway, please evaluate these results,
and let me know if I should investigate further.

-------------

PR: https://git.openjdk.org/jdk/pull/9399