Add String & Character ASCII case conversion methods
some-java-user-99206970363698485155 at vodafonemail.de
Sun Apr 9 14:57:52 UTC 2023
Hello,
could you please add String & Character ASCII case conversion methods, that is, methods which only perform case conversion on ASCII characters in the input and leave any other characters unchanged. The conversion should not depend on the default locale. For example:
- String:
- toAsciiLowerCase
- toAsciiUpperCase
- equalsAsciiIgnoreCase (or a better name)
- compareToAsciiIgnoreCase (or a better name)
- Character:
- toAsciiLowerCase
- toAsciiUpperCase
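To make the request concrete, here is a minimal sketch of how such methods might behave, written as a static helper class. The class name `AsciiCase` and the static method shapes are only assumptions for illustration; none of this exists in the JDK, and a real implementation inside String could be considerably faster.

```java
/** Hypothetical helper class; these methods are NOT part of the JDK. */
final class AsciiCase {
    private AsciiCase() {}

    /** Lowercases only 'A'..'Z'; every other char is returned unchanged. */
    static char toAsciiLowerCase(char c) {
        return (c >= 'A' && c <= 'Z') ? (char) (c + ('a' - 'A')) : c;
    }

    /** Uppercases only 'a'..'z'; every other char is returned unchanged. */
    static char toAsciiUpperCase(char c) {
        return (c >= 'a' && c <= 'z') ? (char) (c - ('a' - 'A')) : c;
    }

    static String toAsciiLowerCase(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            chars[i] = toAsciiLowerCase(chars[i]);
        }
        return new String(chars);
    }

    static String toAsciiUpperCase(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            chars[i] = toAsciiUpperCase(chars[i]);
        }
        return new String(chars);
    }

    /** Equal if the strings differ at most in the case of ASCII letters. */
    static boolean equalsAsciiIgnoreCase(String a, String b) {
        if (a.length() != b.length()) {
            return false;
        }
        for (int i = 0; i < a.length(); i++) {
            if (toAsciiLowerCase(a.charAt(i)) != toAsciiLowerCase(b.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    /** Compares char by char after ASCII lowercasing, then by length. */
    static int compareToAsciiIgnoreCase(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            char ca = toAsciiLowerCase(a.charAt(i));
            char cb = toAsciiLowerCase(b.charAt(i));
            if (ca != cb) {
                return ca - cb;
            }
        }
        return a.length() - b.length();
    }
}
```

Because non-ASCII chars (including surrogates) are never touched, the result always has the same length as the input and does not depend on the locale or on the Unicode data shipped with the JDK.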
This would give the following advantages:
- Increased performance (and no vulnerability to denial-of-service attacks)
- Reduced number of bugs in applications
Please read on for a detailed explanation.
I assume that, for historical reasons (Applets), the current case conversion methods use the Unicode conversion rules and, even worse, String.toLowerCase() and String.toUpperCase() use the default locale. That might have been a reasonable choice back then, because Applets ran locally in the browser and localization was a nice-to-have feature (or even a requirement). Nowadays, however, Java is largely used for back-end systems, and case conversion is often applied to technical strings rather than display text. In this context applications mostly process ASCII strings.
However, because Java does not offer any specific case conversion methods for these cases, users still use the standard String & Character methods. This causes the following problems [1]:
- String.toLowerCase() & String.toUpperCase() using the default locale
This means that, depending on the OS locale, your application might behave differently or fail [2]. To get a sense of the scale, search the OpenJDK bug database: https://bugs.openjdk.org/issues/?jql=text ~ "turkish locale"
At this point you probably have to add a disclaimer to any Java program saying that running it on systems whose locale is Turkish (and possibly others) is not supported, because either your own code or the libraries you are using might be calling toLowerCase() or toUpperCase() [3].
- Bad performance for Unicode-aware case conversions
Compared to simply performing ASCII case conversion, applying Unicode case conversion has worse performance. In some cases it can even have extremely bad performance (JDK-8292573). This could have security implications.
- Bugs due to case conversion changing string length
Unicode case conversion can change the length of certain strings, making them longer or shorter (or, when both happen within the same string, shifting the position of characters while the overall length stays the same). If an application assumes that the length stays the same and applies data derived from the original string (e.g. character indices or length) to the converted string, this can lead to exceptions or potentially even security issues.
- Unicode characters mapping to ASCII chars
When performing case conversion on certain non-ASCII Unicode characters, the results are ASCII characters. For example `Character.toLowerCase('\u212A') == 'k'` (U+212A is the Kelvin sign). This could have security implications.
- Update to Unicode data changing application behavior
Unicode evolves over time, and the JDK regularly updates the Unicode data it is using. Even if an application is not affected by the issues mentioned above, it could become affected by them when the Unicode data is updated in a newer JDK version.
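Some of the problems listed above can be reproduced with the existing JDK methods alone. The following is a small sketch; the class and method names are made up for illustration, while the String, Character and Locale calls are the standard ones.

```java
import java.util.Locale;

final class CaseConversionPitfalls {
    private static final Locale TURKISH = Locale.forLanguageTag("tr");

    /** Locale-sensitive lowercasing: under Turkish rules 'I' becomes dotless i (U+0131). */
    static String lowerCaseTurkish(String s) {
        return s.toLowerCase(TURKISH);
    }

    /** Even locale-neutral Unicode conversion can change the length: U+00DF uppercases to "SS". */
    static String upperCaseRoot(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    /** The Kelvin sign (U+212A) is not ASCII, but its lowercase form is ASCII 'k'. */
    static boolean kelvinSignLowersToAsciiK() {
        return Character.toLowerCase('\u212A') == 'k';
    }

    public static void main(String[] args) {
        System.out.println("title".equals(lowerCaseTurkish("TITLE"))); // false
        System.out.println(upperCaseRoot("stra\u00DFe").length());     // 7 (input length is 6)
        System.out.println(kelvinSignLowersToAsciiK());                // true
    }
}
```

The Turkish locale is passed explicitly here only to make the behavior reproducible; the parameterless toLowerCase() produces the same result whenever the default locale happens to be Turkish.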
My main point here is that (I assume) in many cases Java applications don't need Unicode case conversion, let alone Unicode case conversion using the default locale. If Java offered ASCII-only case conversion methods, then hopefully users would (where applicable) switch to these methods over time and avoid all the issues mentioned above. And even if they accidentally used the ASCII-only methods for display text, the result might be a minor inconvenience for users reading that text, whereas the problems described above are application bugs and security vulnerabilities.
Related information about other programming languages:
- Rust: Has dedicated methods for ASCII case conversion, e.g. https://doc.rust-lang.org/std/string/struct.String.html#method.to_ascii_lowercase
- Kotlin: Functions which implicitly use the default locale were deprecated, see https://youtrack.jetbrains.com/issue/KT-43023
Risks:
- ASCII case conversion could lead to undesired results in some cases, see the example for the word "café" on https://doc.rust-lang.org/std/ascii/trait.AsciiExt.html (though that specific example is about a display string, for which these ASCII-only methods are not intended)
- When applications start to mix ASCII-only and the existing Unicode conversion methods, this could lead to bugs and security issues as well; though it might also indicate a flaw in the application if it performs case conversion on the same value in different places
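The second risk can be illustrated with a small sketch. It assumes a hypothetical ASCII-only comparison (`equalsAsciiIgnoreCase` below is not a JDK method) and compares it with the existing String.equalsIgnoreCase for the Kelvin sign.

```java
final class MixedCaseChecks {
    private MixedCaseChecks() {}

    /** Hypothetical ASCII-only comparison; NOT a JDK method. */
    static boolean equalsAsciiIgnoreCase(String a, String b) {
        if (a.length() != b.length()) {
            return false;
        }
        for (int i = 0; i < a.length(); i++) {
            if (asciiLower(a.charAt(i)) != asciiLower(b.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    private static char asciiLower(char c) {
        return (c >= 'A' && c <= 'Z') ? (char) (c + ('a' - 'A')) : c;
    }

    public static void main(String[] args) {
        String kelvin = "\u212A"; // Kelvin sign, not an ASCII character
        // Existing Unicode-aware comparison treats it as equal to "k"
        System.out.println(kelvin.equalsIgnoreCase("k"));       // true
        // ASCII-only comparison leaves U+212A untouched
        System.out.println(equalsAsciiIgnoreCase(kelvin, "k")); // false
    }
}
```

If one part of an application validated a value with the Unicode-aware method and another part looked it up with the ASCII-only one, the two parts would disagree about such inputs.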
I hope you consider this suggestion. Feedback is appreciated!
Kind regards
----
[1] I am not saying that Java is the only affected language; the problem definitely affects others as well. But that should not prevent improving the Java API.
[2] Tool for detecting usage of such methods: https://github.com/policeman-tools/forbidden-apis
[3] Maybe it would also be worth discussing deprecating String.toLowerCase() and String.toUpperCase() because they seem to do more harm than good.
More information about the core-libs-dev
mailing list