Add String & Character ASCII case conversion methods

Tue Apr 11 13:01:56 UTC 2023

Hi,

To propose deprecating String::toLowerCase() and String::toUpperCase(),
you can create a patch as usual and additionally file a CSR ticket that
describes the situation and the proposed solution. After that, you can ask
someone from core-libs to review the ticket. The change can be merged
once it has received sufficient reviews and the CSR has been approved. You
can look at other CSRs in JBS to see the overall structure, and at other
deprecations in the JDK to see a typical deprecation description. More
details on CSRs can be found in the OpenJDK Wiki
<https://wiki.openjdk.org/display/csr/Main>.

Hope this helps,
Quan Anh

On Mon, 10 Apr 2023 at 00:47, Glavo <zjx001202 at gmail.com> wrote:

> Hi,
>
> We discussed this issue on this mailing list[1] earlier this year.
>
> I investigated the usage of these two methods and found that all use cases
> within the JDK are suspicious and have resulted in many hard-to-notice bugs.
>
> I would like to create a PR for this issue that deprecates these two methods
> and adds alternative methods for them. However, I have no experience making
> such changes, so I may need some guidance, or someone more experienced could
> take this on.
>
> Glavo
>
> [1] https://mail.openjdk.org/pipermail/core-libs-dev/2023-January/099375.html
>
> On Sun, Apr 9, 2023 at 10:58 PM <some-java-user-99206970363698485155 at vodafonemail.de> wrote:
>
>> Hello,
>> could you please add String & Character ASCII case conversion methods,
>> that is, methods which only perform case conversion on ASCII characters in
>> the input and leave any other characters unchanged. The conversion should
>> not depend on the default locale. For example:
>> - String:
>>   - toAsciiLowerCase
>>   - toAsciiUpperCase
>>   - equalsAsciiIgnoreCase (or a better name)
>>   - compareToAsciiIgnoreCase (or a better name)
>> - Character:
>>   - toAsciiLowerCase
>>   - toAsciiUpperCase
>>
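For illustration only, the kind of methods listed above could be sketched as a
small utility class. The class name AsciiCase and these implementations are
assumptions made for this sketch; they are not an existing JDK API and not
necessarily what a final proposal would look like. compareToAsciiIgnoreCase is
omitted for brevity.

```java
// Hypothetical helper class; nothing here exists in the JDK.
final class AsciiCase {
    private AsciiCase() {}

    /** Lowercases only 'A'..'Z'; every other char is returned unchanged. */
    static char toAsciiLowerCase(char c) {
        return (c >= 'A' && c <= 'Z') ? (char) (c + ('a' - 'A')) : c;
    }

    /** Uppercases only 'a'..'z'; every other char is returned unchanged. */
    static char toAsciiUpperCase(char c) {
        return (c >= 'a' && c <= 'z') ? (char) (c - ('a' - 'A')) : c;
    }

    /** Locale-independent; the result always has the same length as the input. */
    static String toAsciiLowerCase(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            chars[i] = toAsciiLowerCase(chars[i]);
        }
        return new String(chars);
    }

    /** Locale-independent; the result always has the same length as the input. */
    static String toAsciiUpperCase(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            chars[i] = toAsciiUpperCase(chars[i]);
        }
        return new String(chars);
    }

    /** Compares two strings, ignoring case differences in ASCII letters only. */
    static boolean equalsAsciiIgnoreCase(String a, String b) {
        if (a.length() != b.length()) {
            return false;
        }
        for (int i = 0; i < a.length(); i++) {
            if (toAsciiLowerCase(a.charAt(i)) != toAsciiLowerCase(b.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(toAsciiLowerCase("Content-TYPE"));     // content-type
        System.out.println(equalsAsciiIgnoreCase("\u212A", "k")); // false
    }
}
```

Because only the 26 ASCII letters are touched, the result does not depend on
the default locale or on the Unicode data shipped with a given JDK.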
>> This would give the following advantages:
>> - Increased performance (and no vulnerability to denial-of-service attacks)
>> - Reduced number of bugs in applications
>>
>>
>> Please read on for a detailed explanation.
>>
>> I assume that, for historic reasons (Applets), the current case conversion
>> methods use the Unicode conversion rules, and, even worse,
>> String.toLowerCase() and String.toUpperCase() use the default locale. While
>> this may have been a reasonable choice back then, because Applets ran
>> locally in the browser and localization was a nice-to-have feature (or even
>> a requirement), nowadays Java is largely used for back-end systems, and case
>> conversion is quite often applied to technical strings rather than display
>> text. In this context applications mostly process ASCII strings.
>> However, because Java does not offer any specific case conversion methods
>> for these cases, users still use the standard String & Character methods.
>> This causes the following problems [1]:
>>
>> - String.toLowerCase() & String.toUpperCase() using default locale
>>   This means that, depending on the OS locale, your application might
>> behave differently or fail [2]. For the scale of this, simply search the
>> OpenJDK bug database:
>> https://bugs.openjdk.org/issues/?jql=text%20~%20%22turkish%20locale%22
>>   At this point you probably have to add a disclaimer to any Java program
>> stating that running it on systems with Turkish (and possibly other
>> languages) as the locale is not supported, because either your own code or
>> the libraries you are using might be calling toLowerCase() or
>> toUpperCase() [3].
>>
>> - Bad performance for Unicode-aware case conversions
>>   Compared to simply performing ASCII case conversion, applying Unicode
>> case conversion has worse performance. In some cases it can even have
>> extremely bad performance (JDK-8292573). This could have security
>> implications.
>>
>> - Bugs due to case conversion changing string length
>>   Unicode case conversion for certain strings can change the length,
>> either increasing or decreasing the size of the string (or when combining
>> both, shifting position of characters in the string while keeping the
>> length the same). If an application assumes that the length of the string
>> remains the same and uses data derived from the original string (e.g.
>> character indices or length) on the converted string this can lead to
>> exceptions or potentially even security issues.
>>
>> - Unicode characters mapping to ASCII chars
>>   When performing case conversion on certain non-ASCII Unicode
>> characters, the results are ASCII characters. For example
>> `Character.toLowerCase('\u212A') == 'k'`. This could have security
>> implications.
>>
>> - Update to Unicode data changing application behavior
>>   Unicode evolves over time, and the JDK regularly updates the Unicode
>> data it is using. Even if an application is not affected by the issues
>> mentioned above, it could become affected by them when the Unicode data is
>> updated in a newer JDK version.
>>
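Several of the problems listed above can be reproduced with existing JDK
methods. The following is a minimal sketch; the class name CasePitfalls and its
helper methods are made up for this illustration, and an explicit Turkish
locale stands in for a machine whose default locale is Turkish.

```java
import java.util.Locale;

// Hypothetical demo class; only the String/Character/Locale calls are JDK API.
final class CasePitfalls {
    private CasePitfalls() {}

    // Stands in for a system whose default locale is Turkish.
    static final Locale TURKISH = Locale.forLanguageTag("tr");

    // Typical "technical string" check. Under Turkish rules 'I' lowercases to
    // the dotless i (U+0131), so "TITLE" no longer matches "title".
    static boolean isTitleKeyword(String s, Locale locale) {
        return s.toLowerCase(locale).equals("title");
    }

    // Unicode case conversion can change the length: the sharp s (U+00DF)
    // uppercases to the two characters "SS".
    static boolean lengthPreserved(String s) {
        return s.toUpperCase(Locale.ROOT).length() == s.length();
    }

    public static void main(String[] args) {
        System.out.println(isTitleKeyword("TITLE", Locale.ROOT)); // true
        System.out.println(isTitleKeyword("TITLE", TURKISH));     // false
        System.out.println(lengthPreserved("stra\u00DFe"));       // false
        // The Kelvin sign (U+212A) lowercases to the ASCII letter 'k'.
        System.out.println(Character.toLowerCase('\u212A') == 'k'); // true
    }
}
```

Passing Locale.ROOT explicitly avoids the default-locale problem, but not the
length change or the mapping of non-ASCII characters to ASCII ones.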
>> My main point here is that (I assume) in many cases Java applications
>> don't need Unicode case conversion, let alone Unicode case conversion using
>> the default locale. If Java offered ASCII-only case conversion methods,
>> then hopefully users would (where applicable) switch to these methods over
>> time and avoid all the issues mentioned above. And even if they
>> accidentally use the ASCII-only methods for display text, the result might
>> be a minor inconvenience for users seeing that text, whereas misuse of the
>> existing methods can cause application bugs and security vulnerabilities.
>>
>> Related information about other programming languages:
>> - Rust: Has dedicated methods for ASCII case conversion, e.g.
>> https://doc.rust-lang.org/std/string/struct.String.html#method.to_ascii_lowercase
>> - Kotlin: Functions which implicitly use the default locale were
>> deprecated, see https://youtrack.jetbrains.com/issue/KT-43023
>>
>> Risks:
>> - ASCII case conversion could lead to undesired results in some cases,
>> see the example for the word "café" on
>> https://doc.rust-lang.org/std/ascii/trait.AsciiExt.html (though that
>> specific example is about a display string, for which these ASCII-only
>> methods are not intended)
>> - When applications start to mix ASCII-only and the existing Unicode
>> conversion methods this could lead to bugs and security issues as well;
>> though it might also indicate a flaw in the application if it performs case
>> conversion on the same value in different places
>>
>> I hope you consider this suggestion. Feedback is appreciated!
>>
>> Kind regards
>>
>> ----
>>
>> [1] I am not saying that Java is the only affected language; this
>> definitely affects others as well. But that should not prevent improving
>> the Java API.
>> [2] Tool for detecting usage of such methods:
>> https://github.com/policeman-tools/forbidden-apis
>> [3] Maybe it would also be worth discussing deprecating
>> String.toLowerCase() and String.toUpperCase() because they seem to do more
>> harm than good.
>>
>>
>>