[jdk8u] RFR: 8186801: Add regression test to test mapping based charsets (generated at build time)

Fri Jun 23 01:51:08 UTC 2023

On Thu, 22 Jun 2023 08:45:27 GMT, Thomas Stuefe <stuefe at openjdk.org> wrote:

>> This is a pre-requisite for [JDK-8301119](https://bugs.openjdk.org/browse/JDK-8301119) ("Support for GB18030-2022") and so is being proposed for inclusion in 8u382 during rampdown, so that the changes are in place for when GB18030-2022 enforcement begins in August. It introduces `GB18030.map`, containing the character mappings for GB18030, and tests to verify the correct mappings happen at run-time.
>> 
>> A number of changes were necessary for 8u. One main reason was the inclusion of [JDK-8186803](https://bugs.openjdk.org/browse/JDK-8186803) "Update Cp1140-Cp1149 EBEDIC euro charset to map \u000A to EBCDIC 0x15" as part of this fix in OpenJDK 10+. I have removed these changes in the 8u version so as to avoid making potentially incompatible library changes and focus on testing the current character mappings in 8u.
>> 
>> Another is that the character set data is spread across three files - `dbcs`, `sbcs` and `extsbcs` - in 8u, whereas it was amalgamated into a single file, `charsets`, during the introduction of modules. The coding test (`TestCharsetMapping.java`) has been adapted to use the 8u data format.
>> 
>> The detailed changes from the OpenJDK 10 patch are as follows:
>> 
>> 1. Drop the introduction of the `IBM114x.nr` files which implement JDK-8186803.
>> 2. Drop the change to `charsets` which doesn't exist in 8u and any equivalent change may lead to compatibility issues
>> 3. `EUC_TW.java` has slightly different context in 8u (no `pkg` argument) but the filename change is the same
>> 4. Drop the alias changes in `MS932_0213.java` & `x-MS932_0213` to avoid a compatibility risk.
>> 5. Changes to `EuroConverter.java` are dropped as they relate to JDK-8186803.
>> 6. Detect the IBM0114x character sets in `TestCharsetMapping.java` and expect an additional 0xA -> 25 mapping rather than counting this as an error
>> 7. Remove the parsing and checking of aliases in `TestCharsetMapping.java` as the 8u data files don't store them
>> 8. Handle the internal character sets in `TestCharsetMapping.java` as they are not marked as such in the 8u data files
>> 9. Change the data file parsing in `TestCharsetMapping.java` to handle the three 8u data files. `dbcs` has additional fields to the other two, but the first five fields that we actually use are mostly the same (`dbcs` has a type field, the other two assume a type of `sbcs`).
>> 10. `TestEBCDICLineFeed.java` is modified to handle IBM0114x as is, without the JDK-8186803 change
>
> jdk/test/sun/nio/cs/TestEBCDICLineFeed.java line 56:
> 
>> 54:                                   cs,  bb[0] & 0xff);
>> 55:                 errs++;
>> 56:             }
> 
> Took some thinking, but I think this is okay.
> 
> Are we sure the old (pre-8186803) translations are always "U+000A" -> 25? I am not sure, it looked to me like it could be either mapped to E15 or E25.

The changes I've made to detect IBM0114x were necessary to get the test to pass. Prior to that, it was failing because those character sets were producing 0x25.

-------------

PR Review Comment: https://git.openjdk.org/jdk8u/pull/43#discussion_r1239203978