[jdk8u] RFR: 8186801: Add regression test to test mapping based charsets (generated at build time)

Thu Jun 22 11:31:08 UTC 2023

On Fri, 16 Jun 2023 15:55:38 GMT, Andrew John Hughes <andrew at openjdk.org> wrote:

> This is a pre-requisite for [JDK-8301119](https://bugs.openjdk.org/browse/JDK-8301119) ("Support for GB18030-2022") and so is being proposed for inclusion in 8u382 during rampdown, so that the changes are in place for when GB18030-2022 enforcement begins in August. It introduces `GB18030.map`, containing the character mappings for GB18030, and tests to verify the correct mappings happen at run-time.
> 
> A number of changes were necessary for 8u. One main reason was the inclusion of [JDK-8186803](https://bugs.openjdk.org/browse/JDK-8186803) "Update Cp1140-Cp1149 EBEDIC euro charset to map \u000A to EBCDIC 0x15" as part of this fix in OpenJDK 10+. I have removed these changes in the 8u version so as to avoid making potentially incompatible library changes and focus on testing the current character mappings in 8u.
> 
> Another is that the character set data is spread across three files - `dbcs`, `sbcs` and `extsbcs` - in 8u, whereas it was amalgamated into a single file, `charsets`, during the introduction of modules. The coding test (`TestCharsetMapping.java`) has been adapted to use the 8u data format.
> 
> The detailed changes from the OpenJDK 10 patch are as follows:
> 
> 1. Drop the introduction of the `IBM114x.nr` files which implement JDK-8186803.
> 2. Drop the change to `charsets` which doesn't exist in 8u and any equivalent change may lead to compatibility issues
> 3. `EUC_TW.java` has slightly different context in 8u (no `pkg` argument) but the filename change is the same
> 4. Drop the alias changes in `MS932_0213.java` & `x-MS932_0213` to avoid a compatibility risk.
> 5. Changes to `EuroConverter.java` are dropped as they relate to JDK-8186803.
> 6. Detect the IBM0114x character sets in `TestCharsetMapping.java` and expect an additional 0xA -> 25 mapping rather than counting this as an error
> 7. Remove the parsing and checking of aliases in `TestCharsetMapping.java` as the 8u data files don't store them
> 8. Handle the internal character sets in `TestCharsetMapping.java` as they are not marked as such in the 8u data files
> 9. Change the data file parsing in `TestCharsetMapping.java` to handle the three 8u data files. `dbcs` has additional fields to the other two, but the first five fields that we actually use are mostly the same (`dbcs` has a type field, the other two assume a type of `sbcs`).
> 10. `TestEBCDICLineFeed.java` is modified to handle IBM0114x as is, without the JDK-8186803 change

Second review round. Looked at TestCharSetMappings exclusively.

Most of my remarks are centered on code readability. 

My suggestion would be to test this test:

- build JDK and let it generate its charset
- then modify various *.map files and the charset definitions and check if the jtreg test picks up the changes as errors.

Unfortunately, I have no idea how to automate such tests, but as a sanity test that should be fine.

jdk/test/sun/nio/cs/TestCharsetMapping.java line 125:

> 123:     // 8u does not have JDK-8186803 so leHackIBM is true for
> 124:     // IBM1140-1149 charsets that map U+000A to 0x25.
> 125:     private boolean leHackIBM = false;

Can we give this a slightly better name? (I first thought le was leetspeak  pronoun :)

proposal: ebcdicLFHack or IBM114xLFHack or similar

jdk/test/sun/nio/cs/TestCharsetMapping.java line 477:

> 475:             if (csName.endsWith("_Solaris") ||
> 476:                 csName.endsWith("_MS5022X") ||
> 477:                 csName.endsWith("_MS932")) {

I would probably move this to the `charsets()` parsing function. To have all the code that deals with different data formats between 8 and later versions in one place.

Also, why the suffix solution? Easier to understand would be just listing the affected 5 charsets by full name, that way one can directly compare with later versions of the charsets file.

jdk/test/sun/nio/cs/TestCharsetMapping.java line 542:

> 540:                 continue;
> 541:             }
> 542:             CharsetInfo cs = new CharsetInfo(tokens[1], tokens[0]);

Small nit, could you add comments to the constructor args like this:

new CharsetInfo(/*csName*/ tokens[1], /*clzName*/ tokens[0]);

(We are different from the JDK 10 version anyway in this hunk)

jdk/test/sun/nio/cs/TestCharsetMapping.java line 549:

> 547:             } else {
> 548:                 cs.type = tokens[3];
> 549:             }

Mentally parsing:

Format 1, dbcs:
clzName  csName     hisName  dbtype    pkg               ascii   b1min  b1max  b2min b2max

hisname = 2
pkgName = 4
type = sbcs

Format 2, sbcs and extsbcs:
clzName	csName		hisName		containASCII	pkg

hisname = 2
pkgName = 4
type = sbcs

Okay, checks out.

jdk/test/sun/nio/cs/TestCharsetMapping.java line 576:

> 574:         Set<CharsetInfo> charsets = charsets(dir.resolve("sbcs"), true);
> 575:         charsets.addAll(charsets(dir.resolve("extsbcs"), true));
> 576:         charsets.addAll(charsets(dir.resolve("dbcs"), false));

small nit, can you reorder and parse dbcs first, then comment, then the other two? So its clear that dbcs is excluded from comment

-------------

Changes requested by stuefe (Committer).

PR Review: https://git.openjdk.org/jdk8u/pull/43#pullrequestreview-1492894268
PR Review Comment: https://git.openjdk.org/jdk8u/pull/43#discussion_r1238335801
PR Review Comment: https://git.openjdk.org/jdk8u/pull/43#discussion_r1238368942
PR Review Comment: https://git.openjdk.org/jdk8u/pull/43#discussion_r1238384200
PR Review Comment: https://git.openjdk.org/jdk8u/pull/43#discussion_r1238382586
PR Review Comment: https://git.openjdk.org/jdk8u/pull/43#discussion_r1238381122