RFR: 8303623: Compiler should disallow non-standard UTF-8 string encodings [v5]
Vicente Romero
vromero at openjdk.org
Fri Mar 24 14:33:39 UTC 2023
On Wed, 22 Mar 2023 02:49:17 GMT, Archie L. Cobbs <duke at openjdk.org> wrote:
>> This patch is a precursor to upcoming refactoring to address these related bugs:
>> * [JDK-8269957](https://bugs.openjdk.org/browse/JDK-8269957) - facilitate alternate impls of NameTable and Name
>> * [JDK-8268622](https://bugs.openjdk.org/browse/JDK-8268622) - Performance issues in javac `Name` class
>>
>> In any multi-byte UTF-8 sequence, the bytes after the first byte are supposed to all look like `10xxxxxx`. But the code in `Convert.utf2chars()` is not checking that, so e.g., you could have `11xxxxxx` instead and it would encode the same character even though the UTF-8 bytes are different. For example, the character "è" normally encodes as `c3 a8`, but `Convert.utf2chars()` would also accept `c3 e8` or `c3 28` for "è". Another way to have non-standard encodings is by using more bytes than necessary. For example, you could encode the character `0x0100` as three bytes `e0 84 80`, but it should really be encoded as two bytes `c4 80`.
>>
>> This leniency poses a problem because the current `Name.Table` implementations store UTF-8 byte sequences, not characters. So the same `Name` could be encoded two different ways, which would cause it to be added to the hash table twice. This violates the guarantee of uniqueness provided by `Name.Table` and could even potentially create a security concern (depending on how the compiler is being used).
>>
>> But regardless of that, JVMS §4.4.7 describes "Modified UTF-8" for encoded strings, and it does not allow for non-standard encodings. Instead, you'll get something like this:
>>
>> $ java Test
>> Error: LinkageError occurred while loading main class Test
>> java.lang.ClassFormatError: Illegal UTF8 string in constant pool in class file Test
>>
>> So the compiler should also reject any invalid classfiles containing them.
>>
>> This patch makes `Convert.utf2chars()` throw a new checked exception `InvalidUtfException` and refactors accordingly, and adds a few minor cleanups along the way.
>
> Archie L. Cobbs has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains seven additional commits since the last revision:
>
> - Merge branch 'master' into JDK-8303623
> - In release 21, emit warning instead of error for invalid UTF-8.
> - Use Fragments factory for creating message fragments.
> - Fix incorrect parameter index in message fragment.
> - Add the standard "This is NOT part of any supported API" warning.
> - Allow longer-than-necessary UTF-8 encoding in classfiles with major < 48.
> - Disallow non-standard UTF-8 string encodings.
src/jdk.compiler/share/classes/com/sun/tools/javac/jvm/PoolReader.java line 235:
> 233: return names.fromUtf(poolbuf.elems, offset + 2, len, utf8validation);
> 234: } catch (InvalidUtfException e) {
> 235: if (Source.DEFAULT == Source.JDK21) {
I think that a new feature should be added to `com.sun.tools.javac.code.Source.Feature`, could be named `WARN_ON_NON_STANDARD_UTF8` or so, and use it for this type of check. I would model the new feature based on `DEPRECATION_ON_IMPORT` with `MIN` as its applicable minimum version and 21 as the maximum applicable one
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/12893#discussion_r1146977932
More information about the compiler-dev
mailing list