Add String & Character ASCII case conversion methods

Glavo zjx001202 at gmail.com
Sun Apr 9 16:46:15 UTC 2023


Hi,

We discussed this issue on this mailing list[1] earlier this year.

I investigated the usage of these two methods and found that all use cases
within the JDK are suspicious, resulting in many subtle bugs.

I hope to create a PR for this issue that deprecates these two methods and
adds alternative methods for them. However, I have no experience making
such changes, so I may need some guidance, or someone more experienced
could take this on.

Glavo

[1]
https://mail.openjdk.org/pipermail/core-libs-dev/2023-January/099375.html

On Sun, Apr 9, 2023 at 10:58 PM <
some-java-user-99206970363698485155 at vodafonemail.de> wrote:

> Hello,
> could you please add String & Character ASCII case conversion methods,
> that is, methods which only perform case conversion on ASCII characters in
> the input and leave any other characters unchanged. The conversion should
> not depend on the default locale. For example:
> - String:
>   - toAsciiLowerCase
>   - toAsciiUpperCase
>   - equalsAsciiIgnoreCase (or a better name)
>   - compareToAsciiIgnoreCase (or a better name)
> - Character:
>   - toAsciiLowerCase
>   - toAsciiUpperCase
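
To make the requested methods concrete, here is a minimal sketch of the
intended semantics. The class name AsciiCase and the static-helper shape are
assumptions for illustration only; the request is for methods on String and
Character themselves, and nothing like this exists in the JDK today.

```java
// Hypothetical helper class; not part of the JDK.
final class AsciiCase {
    private AsciiCase() {}

    // Maps 'A'..'Z' to 'a'..'z' and leaves every other char unchanged.
    static char toAsciiLowerCase(char c) {
        return (c >= 'A' && c <= 'Z') ? (char) (c + ('a' - 'A')) : c;
    }

    // Maps 'a'..'z' to 'A'..'Z' and leaves every other char unchanged.
    static char toAsciiUpperCase(char c) {
        return (c >= 'a' && c <= 'z') ? (char) (c - ('a' - 'A')) : c;
    }

    // Locale-independent; the result always has the same length as the input.
    static String toAsciiLowerCase(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            chars[i] = toAsciiLowerCase(chars[i]);
        }
        return new String(chars);
    }

    static String toAsciiUpperCase(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            chars[i] = toAsciiUpperCase(chars[i]);
        }
        return new String(chars);
    }

    // Compares two strings, ignoring case differences in ASCII letters only.
    static boolean equalsAsciiIgnoreCase(String a, String b) {
        if (a.length() != b.length()) {
            return false;
        }
        for (int i = 0; i < a.length(); i++) {
            if (toAsciiLowerCase(a.charAt(i)) != toAsciiLowerCase(b.charAt(i))) {
                return false;
            }
        }
        return true;
    }
}
```

With this sketch, AsciiCase.toAsciiLowerCase("Content-Type") yields
"content-type" regardless of the default locale, and non-ASCII characters
such as U+212A (KELVIN SIGN) pass through untouched.
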
>
> This would give the following advantages:
> - Increased performance (and no vulnerability to denial-of-service attacks)
> - Reduced number of bugs in applications
>
>
> Please read on for a detailed explanation.
>
> I assume that for historical reasons (Applets) the current case conversion
> methods use the Unicode conversion rules and, even worse,
> String.toLowerCase() and String.toUpperCase() use the default locale. While
> this might have been a reasonable choice back then, because Applets ran
> locally in the browser and localization was a nice-to-have feature (or even
> a requirement), nowadays Java is largely used for back-end systems, and case
> conversion is often done on technical strings rather than display text.
> In this context applications mostly process ASCII strings.
> However, because Java does not offer any specific case conversion methods
> for these cases, users still use the standard String & Character methods.
> This causes the following problems [1]:
>
> - String.toLowerCase() & String.toUpperCase() using default locale
>   Depending on the OS locale, your application might behave differently
> or fail [2]. To see the scale of this, search the OpenJDK bug database:
> https://bugs.openjdk.org/issues/?jql=text ~
> "turkish locale"
>   At this point you probably have to add a disclaimer to any Java program
> stating that running it on systems with Turkish (and possibly other
> locales) as the default locale is not supported, because either your own
> code or the libraries you are using might be calling toLowerCase() or
> toUpperCase() [3].
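
A small sketch of the Turkish locale problem described above. The class and
method names (TurkishLocaleDemo, isTitleKey) are hypothetical; the locale is
passed explicitly here so the result does not depend on the machine running
the example, whereas the no-argument toLowerCase() would pick up the default
locale implicitly.

```java
import java.util.Locale;

// Hypothetical demo class; shows why lowercasing a technical key with a
// locale-sensitive method is fragile.
final class TurkishLocaleDemo {
    static final Locale TURKISH = Locale.forLanguageTag("tr-TR");

    // Typical "normalize then compare" check on a technical key.
    static boolean isTitleKey(String key, Locale locale) {
        return key.toLowerCase(locale).equals("title");
    }

    public static void main(String[] args) {
        // Under the root locale, 'I' lowercases to 'i'.
        System.out.println(isTitleKey("TITLE", Locale.ROOT)); // true
        // Under Turkish rules, 'I' lowercases to dotless i (U+0131),
        // so the comparison with "title" fails.
        System.out.println(isTitleKey("TITLE", TURKISH));     // false
    }
}
```
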
>
> - Bad performance for Unicode aware case conversions
>   Compared to simply performing ASCII case conversion, applying Unicode
> case conversion has worse performance. In some cases it can even have
> extremely bad performance (JDK-8292573). This could have security
> implications.
>
> - Bugs due to case conversion changing string length
>   Unicode case conversion for certain strings can change the length,
> either increasing or decreasing the size of the string (or when combining
> both, shifting position of characters in the string while keeping the
> length the same). If an application assumes that the length of the string
> remains the same and uses data derived from the original string (e.g.
> character indices or length) on the converted string this can lead to
> exceptions or potentially even security issues.
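
The length-change problem can be sketched with the well-known sharp s case,
where one char uppercases to two. The class name LengthChangeDemo, the helper
upperRoot, and the file-name scenario are assumptions made up for this
example.

```java
import java.util.Locale;

// Hypothetical demo class; shows an index computed on the original string
// becoming stale after Unicode case conversion.
final class LengthChangeDemo {
    static String upperRoot(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String original = "stra\u00dfe.txt";   // contains U+00DF (sharp s)
        int dot = original.indexOf('.');       // index in the ORIGINAL string
        String upper = upperRoot(original);    // sharp s becomes "SS"

        System.out.println(original.length()); // 10
        System.out.println(upper.length());    // 11
        // Reusing the old index on the converted string cuts in the wrong
        // place: the base name is truncated instead of ending at the dot.
        System.out.println(upper.substring(0, dot)); // STRASS
    }
}
```
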
>
> - Unicode characters mapping to ASCII chars
>   When performing case conversion on certain non-ASCII Unicode characters,
> the results are ASCII characters. For example
> `Character.toLowerCase('\u212A') == 'k'`. This could have security
> implications.
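
A short sketch of the KELVIN SIGN case quoted above. KelvinSignDemo and its
helper methods are hypothetical names; only Character.toLowerCase is a real
JDK call.

```java
// Hypothetical demo class; detects non-ASCII chars whose lowercase form
// lands in the ASCII range.
final class KelvinSignDemo {
    static boolean isAscii(char c) {
        return c <= 0x7F;
    }

    // True if c is not ASCII but Character.toLowerCase(c) is.
    static boolean lowersToAscii(char c) {
        return !isAscii(c) && isAscii(Character.toLowerCase(c));
    }

    public static void main(String[] args) {
        char kelvin = '\u212A'; // KELVIN SIGN, looks like 'K'
        System.out.println(Character.toLowerCase(kelvin) == 'k'); // true
        System.out.println(lowersToAscii(kelvin));                // true
        System.out.println(lowersToAscii('K'));                   // false
    }
}
```

A check that runs on the raw input (for example, rejecting strings that
contain 'k') can therefore be bypassed if the input is lowercased afterwards.
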
>
> - Update to Unicode data changing application behavior
>   Unicode evolves over time, and the JDK regularly updates the Unicode
> data it is using. Even if an application is not affected by the issues
> mentioned above, it could become affected by them when the Unicode data is
> updated in a newer JDK version.
>
> My main point here is that (I assume) in many cases Java applications
> don't need Unicode case conversion, let alone Unicode case conversion using
> the default locale. If Java offered ASCII-only case conversion methods,
> then hopefully users would (where applicable) switch to these methods over
> time and avoid all the issues mentioned above. And even if they
> accidentally use the ASCII-only methods for display text, the result might
> be only a minor inconvenience for users reading that text, rather than the
> application bugs and security vulnerabilities of the other cases.
>
> Related information about other programming languages:
> - Rust: Has dedicated methods for ASCII case conversion, e.g.
> https://doc.rust-lang.org/std/string/struct.String.html#method.to_ascii_lowercase
> - Kotlin: Functions which implicitly use the default locale were
> deprecated, see https://youtrack.jetbrains.com/issue/KT-43023
>
> Risks:
> - ASCII case conversion could lead to undesired results in some cases, see
> the example for the word "café" on
> https://doc.rust-lang.org/std/ascii/trait.AsciiExt.html (though that
> specific example is about a display string, for which these ASCII-only
> methods are not intended)
> - When applications start to mix ASCII-only and the existing Unicode
> conversion methods this could lead to bugs and security issues as well;
> though it might also indicate a flaw in the application if it performs case
> conversion on the same value in different places
>
> I hope you consider this suggestion. Feedback is appreciated!
>
> Kind regards
>
> ----
>
> [1] I am not saying that Java is the only affected language; this
> definitely affects other languages as well. But that should not prevent
> improving the Java API.
> [2] Tool for detecting usage of such methods:
> https://github.com/policeman-tools/forbidden-apis
> [3] Maybe it would also be worth discussing deprecating
> String.toLowerCase() and String.toUpperCase() because they seem to do more
> harm than good.