[jdk8u] RFR: 8186801: Add regression test to test mapping based charsets (generated at build time) [v2]

Fri Jun 23 03:58:07 UTC 2023

On Fri, 23 Jun 2023 02:46:50 GMT, Andrew John Hughes <andrew at openjdk.org> wrote:

>> This is a pre-requisite for [JDK-8301119](https://bugs.openjdk.org/browse/JDK-8301119) ("Support for GB18030-2022") and so is being proposed for inclusion in 8u382 during rampdown, so that the changes are in place for when GB18030-2022 enforcement begins in August. It introduces `GB18030.map`, containing the character mappings for GB18030, and tests to verify the correct mappings happen at run-time.
>> 
>> A number of changes were necessary for 8u. One main reason was the inclusion of [JDK-8186803](https://bugs.openjdk.org/browse/JDK-8186803) "Update Cp1140-Cp1149 EBCDIC euro charset to map \u000A to EBCDIC 0x15" as part of this fix in OpenJDK 10+. I have removed these changes in the 8u version so as to avoid making potentially incompatible library changes and focus on testing the current character mappings in 8u.
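To make the mapping difference concrete, here is a minimal sketch of a probe that reports which byte U+000A encodes to in the IBM0114x charsets. The class and method names (`EbcdicLineFeedProbe`, `encodeLineFeed`) are hypothetical and not part of the patch; only standard `java.nio.charset.Charset` calls are used. The result depends on the JDK it runs on: with JDK-8186803 applied the description above implies 0x15, and without it the older mapping should be seen.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of the patch under review.
final class EbcdicLineFeedProbe {

    /**
     * Returns the single byte that '\n' (U+000A) encodes to in the named
     * charset, or -1 if the charset is not available on this JDK or the
     * encoding is not exactly one byte long.
     */
    static int encodeLineFeed(String charsetName) {
        if (!Charset.isSupported(charsetName)) {
            return -1;
        }
        byte[] encoded = "\n".getBytes(Charset.forName(charsetName));
        return encoded.length == 1 ? (encoded[0] & 0xff) : -1;
    }

    public static void main(String[] args) {
        // IBM01140 .. IBM01149 are the EBCDIC euro charsets discussed above.
        for (int i = 0; i <= 9; i++) {
            String name = "IBM0114" + i;
            int b = encodeLineFeed(name);
            System.out.println(name + ": "
                    + (b < 0 ? "unavailable" : String.format("U+000A -> 0x%02X", b)));
        }
    }
}
```

Running the probe on an 8u build and on a newer JDK is one way to see the behavioural difference that the 8u backport deliberately leaves untouched.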
>> 
>> Another is that the character set data is spread across three files - `dbcs`, `sbcs` and `extsbcs` - in 8u, whereas it was amalgamated into a single file, `charsets`, during the introduction of modules. The coding test (`TestCharsetMapping.java`) has been adapted to use the 8u data format.
>> 
>> The detailed changes from the OpenJDK 10 patch are as follows:
>> 
>> 1. Drop the introduction of the `IBM114x.nr` files which implement JDK-8186803.
>> 2. Drop the change to `charsets`, which doesn't exist in 8u; any equivalent change may lead to compatibility issues.
>> 3. `EUC_TW.java` has slightly different context in 8u (no `pkg` argument) but the filename change is the same
>> 4. Drop the alias changes in `MS932_0213.java` & `x-MS932_0213` to avoid a compatibility risk.
>> 5. Changes to `EuroConverter.java` are dropped as they relate to JDK-8186803.
>> 6. Detect the IBM0114x character sets in `TestCharsetMapping.java` and expect an additional 0xA -> 0x25 mapping rather than counting this as an error
>> 7. Remove the parsing and checking of aliases in `TestCharsetMapping.java` as the 8u data files don't store them
>> 8. Handle the internal character sets in `TestCharsetMapping.java` as they are not marked as such in the 8u data files
>> 9. Change the data file parsing in `TestCharsetMapping.java` to handle the three 8u data files. `dbcs` has more fields than the other two, but the first five fields, which are the ones we actually use, are mostly the same (`dbcs` has a type field, whereas the other two assume a type of `sbcs`).
>> 10. `TestEBCDICLineFeed.java` is modified to handle IBM0114x as is, without the JDK-8186803 change
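The data file handling described in item 9 can be sketched roughly as follows. This is an illustration only, not the code in `TestCharsetMapping.java`: the class name, the `Entry` type and the column positions (class name in column 0, charset name in column 1, and, for `dbcs` only, the type in column 3) are all assumptions made for the example. The real 8u field order should be taken from the data files themselves.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; names and column positions are assumptions.
final class CharsetDataSketch {

    /** One parsed data-file entry (illustrative fields only). */
    static final class Entry {
        final String className;
        final String charsetName;
        final String type;

        Entry(String className, String charsetName, String type) {
            this.className = className;
            this.charsetName = charsetName;
            this.type = type;
        }
    }

    /**
     * Parses whitespace-separated lines, skipping blank lines, '#' comments
     * and lines with fewer than five fields.
     * ASSUMPTION: when hasTypeField is true (the dbcs case) the type is in
     * column 3; otherwise the type defaults to "sbcs".
     */
    static List<Entry> parse(List<String> lines, boolean hasTypeField) {
        List<Entry> entries = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            String[] fields = trimmed.split("\\s+");
            if (fields.length < 5) {
                continue; // not enough columns for the fields we use
            }
            String type = hasTypeField ? fields[3] : "sbcs";
            entries.add(new Entry(fields[0], fields[1], type));
        }
        return entries;
    }

    public static void main(String[] args) {
        // Made-up sample lines, not copied from the 8u data files.
        List<String> dbcsLike = Arrays.asList(
                "# class  charset  historical  type  package",
                "Foo x-Foo Foo basic pkg.example true");
        for (Entry e : parse(dbcsLike, true)) {
            System.out.println(e.className + " " + e.charsetName + " " + e.type);
        }
    }
}
```

Passing a flag for the presence of the type field keeps a single parsing loop for all three files, which matches the observation that the leading fields are mostly shared.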
>
> Andrew John Hughes has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Review changes:
>   
>   * Rename leHackIBM => ebcdicLFHack
>   * Move internal handling to charsets method
>   * Use full class name for internal check
>   * Add field headers in comments to ease code readability
>   * Parse dbcs first to aid code readability
>   * Drop redundant @modules from TestEBCDICLineFeed

Looks good to me!

-------------

Marked as reviewed by stuefe (Committer).

PR Review: https://git.openjdk.org/jdk8u/pull/43#pullrequestreview-1494353597