Add String & Character ASCII case conversion methods

Tue Apr 11 19:40:14 UTC 2023

Thanks for your reply. I've opened a PR [1] where I hope to continue
discussing this issue.

Glavo

[1] https://github.com/openjdk/jdk/pull/13434

On Tue, Apr 11, 2023 at 9:02 PM Quân Anh Mai <anhmdq at gmail.com> wrote:

> Hi,
>
> To propose deprecation of String::toLowerCase() and String::toUpperCase(),
> you can create a patch as normal, together with a CSR ticket that
> describes the situation and the proposed solution. After that, you can ask
> someone from core-libs to review the ticket. The change can be merged
> once it has sufficient reviews and the CSR has been approved. You can look
> at other CSRs in JBS to see the overall structure, as well as other
> deprecations in the JDK to see a typical deprecation description. More
> details regarding CSRs can be found in the OpenJDK Wiki
> <https://wiki.openjdk.org/display/csr/Main>.
>
> Hope this helps,
> Quan Anh
>
> On Mon, 10 Apr 2023 at 00:47, Glavo <zjx001202 at gmail.com> wrote:
>
>> Hi,
>>
>> We discussed this issue on this mailing list[1] earlier this year.
>>
>> I investigated the usage of these two methods and found that all use cases
>> within the JDK are suspicious, resulting in many hard-to-notice bugs.
>>
>> I hope to create a PR for this issue that deprecates these two methods and
>> adds alternative methods for them. However, I have no experience making
>> such changes, so I may need some guidance, or someone more experienced
>> may need to take this on.
>>
>> Glavo
>>
>> [1]
>> https://mail.openjdk.org/pipermail/core-libs-dev/2023-January/099375.html
>>
>> On Sun, Apr 9, 2023 at 10:58 PM <
>> some-java-user-99206970363698485155 at vodafonemail.de> wrote:
>>
>>> Hello,
>>> could you please add String & Character ASCII case conversion methods,
>>> that is, methods which only perform case conversion on ASCII characters in
>>> the input and leave any other characters unchanged? The conversion should
>>> not depend on the default locale. For example:
>>> - String:
>>>   - toAsciiLowerCase
>>>   - toAsciiUpperCase
>>>   - equalsAsciiIgnoreCase (or a better name)
>>>   - compareToAsciiIgnoreCase (or a better name)
>>> - Character:
>>>   - toAsciiLowerCase
>>>   - toAsciiUpperCase
>>>
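A minimal sketch of what the methods listed above might look like, written as static helpers in a hypothetical `AsciiCase` class. The class name and the exact signatures are assumptions made for illustration; they are not an existing or agreed JDK API. Only the letters A-Z and a-z are mapped, every other char is returned unchanged, and no locale is consulted.

```java
/** Hypothetical helper class; the names mirror the list above but are not a JDK API. */
final class AsciiCase {
    private AsciiCase() {}

    /** Maps 'A'..'Z' to 'a'..'z'; any other char is returned unchanged. */
    static char toAsciiLowerCase(char c) {
        return (c >= 'A' && c <= 'Z') ? (char) (c + ('a' - 'A')) : c;
    }

    /** Maps 'a'..'z' to 'A'..'Z'; any other char is returned unchanged. */
    static char toAsciiUpperCase(char c) {
        return (c >= 'a' && c <= 'z') ? (char) (c - ('a' - 'A')) : c;
    }

    static String toAsciiLowerCase(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            chars[i] = toAsciiLowerCase(chars[i]);
        }
        return new String(chars);
    }

    static String toAsciiUpperCase(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            chars[i] = toAsciiUpperCase(chars[i]);
        }
        return new String(chars);
    }

    /** Compares char by char after ASCII-only lower-casing of both sides. */
    static int compareToAsciiIgnoreCase(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            char x = toAsciiLowerCase(a.charAt(i));
            char y = toAsciiLowerCase(b.charAt(i));
            if (x != y) {
                return x - y;
            }
        }
        return a.length() - b.length();
    }

    static boolean equalsAsciiIgnoreCase(String a, String b) {
        return a.length() == b.length() && compareToAsciiIgnoreCase(a, b) == 0;
    }
}
```

Because each char maps to exactly one char, the result always has the same length as the input, and indices computed on the input remain valid for the output.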
>>> This would give the following advantages:
>>> - Increased performance (and no vulnerability to denial-of-service
>>> attacks)
>>> - Reduced number of bugs in applications
>>>
>>>
>>> Please read on for a detailed explanation.
>>>
>>> I assume that for historic reasons (Applets) the current case conversion
>>> methods use the Unicode conversion rules and, even worse,
>>> String.toLowerCase() and String.toUpperCase() use the default locale. This
>>> might have been a reasonable choice back then, because Applets ran
>>> locally in the browser and localization was a nice-to-have feature (or even
>>> a requirement). Nowadays, however, Java is largely used for back-end
>>> systems, and case conversion is often done for technical strings rather
>>> than display text. In this context applications mostly process ASCII
>>> strings.
>>> However, because Java does not offer any specific case conversion
>>> methods for these cases, users still use the standard String & Character
>>> methods. This causes the following problems [1]:
>>>
>>> - String.toLowerCase() & String.toUpperCase() using default locale
>>>   This means that, depending on the OS locale, your application
>>> might behave differently or fail [2]. To see the scale of this, search
>>> the OpenJDK bug database:
>>> https://bugs.openjdk.org/issues/?jql=text ~ "turkish locale"
>>>   At this point you probably have to add a disclaimer to any Java
>>> program that running it on systems with Turkish (and possibly others) as
>>> locale is not supported, because either your own code or the libraries you
>>> are using might be calling toLowerCase() or toUpperCase() [3].
>>>
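To make the default-locale problem concrete, here is a small sketch. Instead of changing the JVM default, it passes the locale explicitly; `TURKISH` stands in for the default locale of a machine configured for Turkish. The class and method names are made up for this example.

```java
import java.util.Locale;

final class DefaultLocaleDemo {
    /** Stand-in for the default locale of a Turkish system (assumed to be "tr-TR"). */
    static final Locale TURKISH = Locale.forLanguageTag("tr-TR");

    /**
     * Matches a technical keyword by lower-casing the input, which is what
     * code calling toLowerCase() without arguments does with the default locale.
     */
    static boolean isTitleKeyword(String input, Locale effectiveLocale) {
        return input.toLowerCase(effectiveLocale).equals("title");
    }

    public static void main(String[] args) {
        System.out.println(isTitleKeyword("TITLE", Locale.ROOT)); // true
        // false: under Turkish rules 'I' lower-cases to dotless i (U+0131)
        System.out.println(isTitleKeyword("TITLE", TURKISH));
    }
}
```

The same input is accepted or rejected depending only on the locale, which is why such bugs tend to show up only on some users' machines.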
>>> - Bad performance for Unicode aware case conversions
>>>   Compared to simply performing ASCII case conversion, applying Unicode
>>> case conversion has worse performance. In some cases it can even have
>>> extremely bad performance (JDK-8292573), which could expose an
>>> application to denial-of-service attacks.
>>>
>>> - Bugs due to case conversion changing string length
>>>   Unicode case conversion for certain strings can change the length,
>>> either increasing or decreasing the size of the string (or when combining
>>> both, shifting position of characters in the string while keeping the
>>> length the same). If an application assumes that the length of the string
>>> remains the same and uses data derived from the original string (e.g.
>>> character indices or length) on the converted string this can lead to
>>> exceptions or potentially even security issues.
>>>
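A short sketch of the length-change problem, using the German sharp s (U+00DF), whose upper-case form is the two-char sequence "SS". The class and method names are invented for the example.

```java
import java.util.Locale;

final class LengthChangeDemo {
    /** Returns how many chars longer the upper-cased form is than the input. */
    static int upperCaseGrowth(String s) {
        return s.toUpperCase(Locale.ROOT).length() - s.length();
    }

    public static void main(String[] args) {
        String original = "stra\u00DFe";                  // 6 chars, sharp s at index 4
        String upper = original.toUpperCase(Locale.ROOT); // "STRASSE", 7 chars
        int index = original.indexOf('e');                // 5, computed on the original

        System.out.println(upperCaseGrowth(original));    // 1
        // The index from the original now points at the second 'S', not at 'E'.
        System.out.println(upper.charAt(index));          // S
    }
}
```

Any offset, length, or substring range derived from the original string is off by one past the converted character, which is where exceptions or incorrect slicing can come from.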
>>> - Unicode characters mapping to ASCII chars
>>>   When performing case conversion on certain non-ASCII Unicode
>>> characters, the results are ASCII characters. For example
>>> `Character.toLowerCase('\u212A') == 'k'`. This could have security
>>> implications.
>>>
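The KELVIN SIGN example above can be extended to whole strings. In this sketch a string that is not pure ASCII becomes pure ASCII after lower-casing, so a check done before the conversion and a check done after it can disagree. The `isAscii` helper and the class name are assumptions for illustration.

```java
import java.util.Locale;

final class KelvinSignDemo {
    /** True if every char of the string is in the ASCII range. */
    static boolean isAscii(String s) {
        return s.chars().allMatch(c -> c < 0x80);
    }

    public static void main(String[] args) {
        String input = "\u212Aernel"; // starts with KELVIN SIGN, not the letter K

        System.out.println(isAscii(input));                   // false
        System.out.println(input.toLowerCase(Locale.ROOT));   // kernel
        System.out.println(input.equalsIgnoreCase("kernel")); // true
    }
}
```

An application that validates or filters names before case conversion could therefore let through a value that later compares equal to a reserved ASCII name.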
>>> - Update to Unicode data changing application behavior
>>>   Unicode evolves over time, and the JDK regularly updates the Unicode
>>> data it is using. Even if an application is not affected by the issues
>>> mentioned above, it could become affected by them when the Unicode data is
>>> updated in a newer JDK version.
>>>
>>> My main point here is that (I assume) in many cases Java applications
>>> don't need Unicode case conversion, let alone Unicode case conversion using
>>> the default locale. If Java offered ASCII-only case conversion methods,
>>> then hopefully users would (where applicable) switch to these methods over
>>> time and avoid all the issues mentioned above. And even if they
>>> accidentally use the ASCII-only methods for display text, the result might
>>> be a minor inconvenience for users reading that text, whereas the
>>> existing methods can cause application bugs and security
>>> vulnerabilities.
>>>
>>> Related information about other programming languages:
>>> - Rust: Has dedicated methods for ASCII case conversion, e.g.
>>> https://doc.rust-lang.org/std/string/struct.String.html#method.to_ascii_lowercase
>>> - Kotlin: Functions which implicitly use the default locale were
>>> deprecated, see https://youtrack.jetbrains.com/issue/KT-43023
>>>
>>> Risks:
>>> - ASCII case conversion could lead to undesired results in some cases,
>>> see the example for the word "café" on
>>> https://doc.rust-lang.org/std/ascii/trait.AsciiExt.html (though that
>>> specific example is about a display string, for which these ASCII-only
>>> methods are not intended)
>>> - When applications start to mix ASCII-only and the existing Unicode
>>> conversion methods this could lead to bugs and security issues as well;
>>> though it might also indicate a flaw in the application if it performs case
>>> conversion on the same value in different places
>>>
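Both risks can be illustrated with a small sketch. The ASCII-only helpers below are hypothetical stand-ins for the proposed methods, not an existing API. The first part shows the "café" case; the second shows two comparisons of the same values disagreeing when ASCII-only and Unicode-aware methods are mixed.

```java
import java.util.Locale;

final class MixedConversionDemo {
    /** Hypothetical ASCII-only upper-casing: maps 'a'..'z' only. */
    static String toAsciiUpperCase(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            char c = chars[i];
            if (c >= 'a' && c <= 'z') {
                chars[i] = (char) (c - ('a' - 'A'));
            }
        }
        return new String(chars);
    }

    /** Hypothetical ASCII-only case-insensitive equality. */
    static boolean equalsAsciiIgnoreCase(String a, String b) {
        return toAsciiUpperCase(a).equals(toAsciiUpperCase(b));
    }

    public static void main(String[] args) {
        String cafe = "caf\u00E9";
        System.out.println(toAsciiUpperCase(cafe));        // CAF + unchanged U+00E9
        System.out.println(cafe.toUpperCase(Locale.ROOT)); // CAF + U+00C9

        String kelvin = "\u212A"; // KELVIN SIGN
        System.out.println(kelvin.equalsIgnoreCase("k"));       // true
        System.out.println(equalsAsciiIgnoreCase(kelvin, "k")); // false
    }
}
```

If one code path uses the Unicode-aware comparison and another the ASCII-only one, the same pair of values is treated as equal in one place and different in the other.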
>>> I hope you consider this suggestion. Feedback is appreciated!
>>>
>>> Kind regards
>>>
>>> ----
>>>
>>> [1] I am not saying that Java is the only affected language; the problem
>>> definitely affects others as well. But that should not prevent improving
>>> the Java API.
>>> [2] Tool for detecting usage of such methods:
>>> https://github.com/policeman-tools/forbidden-apis
>>> [3] Maybe it would also be worth discussing deprecating
>>> String.toLowerCase() and String.toUpperCase() because they seem to do more
>>> harm than good.
>>>
>>>
>>>