<i18n dev> RFR: 8195686: ISO-8859-8-i charset cannot be decoded, should be mapped to ISO-8859-8

Wed Oct 9 20:27:14 UTC 2024

On Thu, 3 Oct 2024 08:52:01 GMT, Jeremie Miserez <duke at openjdk.org> wrote:

>> Mapping ISO-8859-8-I charset to ISO-8859-8.
>> The following two aliases are added as part of this change:
>> **ISO-8859-8-I**
>> **ISO8859-8-I**
>> 
>> Bug report: https://bugs.openjdk.org/browse/JDK-8195686
>
> One more thing: I forgot to explain why the alias ISO-8859-8-i -> ISO-8859-8 would definitely be correct.
> 
> Java strings are stored in logical order. That is true for both LTR and RTL languages. This is plainly apparent from the OpenJDK String source code, but also explicitly mentioned/explained e.g. by official tutorials such as here: https://docs.oracle.com/javase/tutorial/2d/text/textlayoutbidirectionaltext.html#ordering_text
> 
> ISO-8859-8-i texts are always sent in logical order (by definition). **So decoding an ISO-8859-8-i text into a Java string using the ISO-8859-8 alias will result in the correct order of characters in the Java string, i.e. logical order, and thus is always 100% correct by definition.**
> 
> Technically speaking, and for completeness' sake, here is the full list of cases for regular ISO-8859-8 today:
> 
> 1. ISO-8859-8 texts may contain either LTR language content, in which case the text is correctly decoded to a Java string in logical order. -> OK
> 2. ISO-8859-8 texts may also contain RTL language content in logical order (newer applications already do this), in which case the text is also correctly decoded to a Java string in logical order. -> OK
> 3. But: If an ISO-8859-8 text contains RTL language content in visual order (old applications, historically the case), the text would be decoded to a Java string in visual order. This is technically incorrect and may be a source of bugs (e.g. concatenation won't work correctly). However, this behavior can no longer be changed in OpenJDK, as (old) applications may rely on it.
> 
> So: Case 2 is what would happen if the alias were added. As long as nobody adds an "auto-reverse visual to logical order" heuristic for RTL ISO-8859-8 text decoding in OpenJDK (which I'm fairly certain can't / mustn't be done), a simple alias ISO-8859-8-i -> ISO-8859-8 will always be correct. The alias will result in case 2, i.e. texts will always be decoded into the correct Java string in logical order.

@jmiserez wrote:

> But: If an ISO-8859-8 text contains RTL language content in visual order (old applications, historically the case), the text would be decoded to a Java string in visual order. This is actually technically incorrect and may be a source of bugs (e.g. concatenation won't work correctly). However this behavior cannot be changed in OpenJDK anymore as (old) applications may rely on it.

In other words, Java _may_ have been incorrectly handling `ISO-8859-8` all this time if the content was in visual order. Adding this alias means that ISO-8859-8-I, which is always in logical order, will be handled correctly.
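
To make the point concrete, here is a minimal sketch (not code from the PR) of what the decoder does with logical-order and visual-order bytes. The class name `Iso88598AliasSketch` and the helpers `resolve` and `decode` are hypothetical names made up for illustration. The sketch assumes the runtime provides the `ISO-8859-8` charset, which a full JDK does; the fallback branch only matters on a runtime that does not yet know the `ISO-8859-8-I` label.

```java
import java.nio.charset.Charset;
import java.nio.charset.UnsupportedCharsetException;
import java.util.Locale;

public class Iso88598AliasSketch {

    // Hypothetical helper (not a JDK API): resolve a charset label, falling back
    // to ISO-8859-8 for the "-I" labels on runtimes that lack the alias.
    static Charset resolve(String label) {
        if (Charset.isSupported(label)) {
            return Charset.forName(label);
        }
        String upper = label.toUpperCase(Locale.ROOT);
        if (upper.equals("ISO-8859-8-I") || upper.equals("ISO8859-8-I")) {
            return Charset.forName("ISO-8859-8");
        }
        throw new UnsupportedCharsetException(label);
    }

    // Hypothetical helper: decode bytes using the resolved charset.
    static String decode(String label, byte[] bytes) {
        return new String(bytes, resolve(label));
    }

    public static void main(String[] args) {
        // The Hebrew word "shalom" (shin, lamed, vav, final mem) in ISO-8859-8 bytes.
        byte[] logical = {(byte) 0xF9, (byte) 0xEC, (byte) 0xE5, (byte) 0xED};
        // The same word as a legacy visual-order sender would transmit it.
        byte[] visual = {(byte) 0xED, (byte) 0xE5, (byte) 0xEC, (byte) 0xF9};

        String expected = "\u05E9\u05DC\u05D5\u05DD"; // logical order

        // Case 2: logical-order input yields a logical-order Java string.
        String fromLogical = decode("ISO-8859-8-I", logical);
        System.out.println("logical input matches: " + fromLogical.equals(expected));

        // Case 3: the decoder maps byte by byte and never reorders, so
        // visual-order input stays reversed in the resulting string.
        String fromVisual = decode("ISO-8859-8", visual);
        String reversed = new StringBuilder(expected).reverse().toString();
        System.out.println("visual input stays reversed: " + fromVisual.equals(reversed));
    }
}
```

Because the decoder is a plain byte-to-character mapping, the label only selects the table; the order of characters in the resulting string is whatever order the sender used.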

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20690#issuecomment-2403364716