RFR: 8195686: ISO-8859-8-i charset cannot be decoded, should be mapped to ISO-8859-8

Thu Oct 3 08:55:36 UTC 2024

On Fri, 23 Aug 2024 10:38:38 GMT, Pratiksha.Sawant <duke at openjdk.org> wrote:

> Mapping ISO-8859-8-I charset to ISO-8859-8.
> Below mentioned 2 aliases are added as part of this:-
> **ISO-8859-8-I**
> **ISO8859-8-I**
> 
> The bug report for the same:- https://bugs.openjdk.org/browse/JDK-8195686

One more thing: I forgot to explain why the alias ISO-8859-8-i -> ISO-8859-8 would definitely be correct.

Java strings are stored in logical order. That is true for both LTR and RTL languages. This is plainly apparent from the OpenJDK String source code, and it is also explained explicitly in the official tutorials, e.g. here: https://docs.oracle.com/javase/tutorial/2d/text/textlayoutbidirectionaltext.html#ordering_text

ISO-8859-8-i texts are always sent in logical order (by definition). **So decoding an ISO-8859-8-i text into a Java string via an alias to ISO-8859-8 will result in the correct order of characters in the Java string, i.e. logical order, and thus is always 100% correct by definition.**
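To illustrate, here is a minimal sketch of what the alias amounts to from an application's point of view. The class name `LogicalOrderDecode` and the helper `hebrewCharset` are hypothetical and only exist for this example. The helper falls back to plain ISO-8859-8 when the running JDK does not know the `ISO-8859-8-I` name, which is my assumption of how an application would work around the missing alias today. The sample bytes spell the Hebrew word "shalom" in logical order.

```java
import java.nio.charset.Charset;

public class LogicalOrderDecode {

    // Hypothetical helper: resolve the name if the JDK supports it; otherwise
    // treat "ISO-8859-8-I" as plain ISO-8859-8 (same byte-to-char mapping).
    static Charset hebrewCharset(String name) {
        if (Charset.isSupported(name)) {
            return Charset.forName(name);
        }
        if (name.equalsIgnoreCase("ISO-8859-8-I")) {
            return Charset.forName("ISO-8859-8");
        }
        throw new IllegalArgumentException("Unsupported charset: " + name);
    }

    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, hebrewCharset(charsetName));
    }

    public static void main(String[] args) {
        // shin, lamed, vav, final mem, in logical (reading) order
        byte[] logical = {(byte) 0xF9, (byte) 0xEC, (byte) 0xE5, (byte) 0xED};
        String s = decode(logical, "ISO-8859-8-I");
        // The string holds the characters in the same order as the bytes.
        for (char c : s.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```

Whether the name resolves through the new alias or through the fallback, the resulting string should be identical, because in both cases the plain ISO-8859-8 decoder does the work.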

Technically speaking, and for completeness' sake, here is the full list of cases for regular ISO-8859-8 today:

1. ISO-8859-8 texts may contain either LTR language content, in which case the text is correctly decoded to a Java string in logical order. -> OK
2. ISO-8859-8 texts may also contain RTL language content in logical order (newer applications already do this), in which case the text is also correctly decoded to a Java string in logical order. -> OK (this is the case if the alias is added)
3. But: If an ISO-8859-8 text contains RTL language content in visual order (old applications, historically the case), the text would be decoded to a Java string in visual order. This is technically incorrect and may be a source of bugs (e.g. concatenation won't work correctly). However, this behavior cannot be changed in OpenJDK anymore, as (old) applications may rely on it.

So: As long as nobody adds an "auto-reverse visual to logical order" heuristic for RTL ISO-8859-8 text decoding in OpenJDK (which I'm fairly certain can't / mustn't be done), a simple alias ISO-8859-8-i -> ISO-8859-8 will always be correct. The alias will result in case 2, i.e. texts will always be decoded into the correct Java string in logical order.
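The difference between cases 2 and 3 can be sketched as follows. The class name `VisualVsLogical` is hypothetical, and the example makes a simplifying assumption: for a single pure-Hebrew word, visual order is just the reverse of logical order (real visual-order texts with mixed digits or Latin runs are not a simple reversal). The point is only that the decoder maps bytes to chars one by one and never reorders them.

```java
import java.nio.charset.Charset;

public class VisualVsLogical {

    static final Charset ISO_8859_8 = Charset.forName("ISO-8859-8");

    // Plain decode: one byte becomes one char, order is preserved.
    static String decode(byte[] bytes) {
        return new String(bytes, ISO_8859_8);
    }

    public static void main(String[] args) {
        // Case 2: "shalom" in logical order (shin, lamed, vav, final mem)
        byte[] logical = {(byte) 0xF9, (byte) 0xEC, (byte) 0xE5, (byte) 0xED};
        // Case 3: the same word in visual order (simplified: reversed)
        byte[] visual = {(byte) 0xED, (byte) 0xE5, (byte) 0xEC, (byte) 0xF9};

        String fromLogical = decode(logical);
        String fromVisual = decode(visual);

        // The decoder does not "repair" visual order, so the strings differ.
        System.out.println(fromLogical.equals(fromVisual)); // false

        // Only an explicit reversal (which the decoder never does) matches.
        String reversed = new StringBuilder(fromVisual).reverse().toString();
        System.out.println(reversed.equals(fromLogical)); // true
    }
}
```

Since ISO-8859-8-i input is in logical order by definition, it always takes the first path and never needs any reordering.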

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20690#issuecomment-2390872037