RFR: 8303623: Compiler should disallow non-standard UTF-8 string encodings [v3]

Archie L. Cobbs duke at openjdk.org
Mon Mar 20 15:46:05 UTC 2023


> This patch is a precursor to upcoming refactoring to address these related bugs:
> * [JDK-8269957](https://bugs.openjdk.org/browse/JDK-8269957) - facilitate alternate impls of NameTable and Name
> * [JDK-8268622](https://bugs.openjdk.org/browse/JDK-8268622) - Performance issues in javac `Name` class
> 
> In any multi-byte UTF-8 sequence, the bytes after the first byte are supposed to all look like `10xxxxxx`. But the code in `Convert.utf2chars()` is not checking that, so e.g., you could have `11xxxxxx` instead and it would encode the same character even though the UTF-8 bytes are different. For example, the character "è" normally encodes as `c3 a8`, but `Convert.utf2chars()` would also accept `c3 e8` or `c3 28` for "è". Another way to have non-standard encodings is by using more bytes than necessary. For example, you could encode the character `0x0100` as three bytes `e0 84 80`, but it should really be encoded as two bytes `c4 80`.
> 
> This leniency poses a problem because the current `Name.Table` implementations store UTF-8 byte sequences, not characters. So the same `Name` could be encoded two different ways, which would cause it to be added to the hash table twice. This violates the guarantee of uniqueness provided by `Name.Table` and could even potentially create a security concern (depending on how the compiler is being used).
> 
> But regardless of that, JVMS §4.4.7 describes "Modified UTF-8" for encoded strings, and it does not allow for non-standard encodings. If a classfile contains one, the JVM refuses to load it, and you'll get something like this:
> 
> $ java Test
> Error: LinkageError occurred while loading main class Test
> 	java.lang.ClassFormatError: Illegal UTF8 string in constant pool in class file Test
> 
> So the compiler should also reject, as invalid, any classfiles containing such encodings.
> 
> This patch makes `Convert.utf2chars()` throw a new checked exception `InvalidUtfException` and refactors accordingly, and adds a few minor cleanups along the way.
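
To illustrate the two kinds of non-standard encoding described above (unchecked continuation bytes and overlong forms), here is a minimal sketch. It is not the javac code: the class `Utf8Sketch`, the methods `decodeLenient`, `decodeStrict`, `continuation`, `rejects` and `bytes`, and the local `InvalidUtfException` are all hypothetical names made up for this example. The lenient decoder only approximates the behavior described for `Convert.utf2chars()`, and the strict one follows the Modified UTF-8 rules of JVMS §4.4.7 as I understand them.

```java
import java.util.Arrays;

class Utf8Sketch {

    /** Hypothetical stand-in for the checked exception added by the patch. */
    static class InvalidUtfException extends Exception {
        InvalidUtfException(String msg) { super(msg); }
    }

    /** Lenient: masks continuation bytes without checking their top two bits. */
    static String decodeLenient(byte[] buf) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < buf.length) {
            int b = buf[i++] & 0xFF;
            if (b >= 0xE0) {            // treated as a three-byte sequence
                int c = (b & 0x0F) << 12;
                c |= (buf[i++] & 0x3F) << 6;
                c |= buf[i++] & 0x3F;
                sb.append((char) c);
            } else if (b >= 0xC0) {     // treated as a two-byte sequence
                sb.append((char) (((b & 0x1F) << 6) | (buf[i++] & 0x3F)));
            } else {                    // everything else taken as one byte
                sb.append((char) b);
            }
        }
        return sb.toString();
    }

    /** Strict: rejects bad continuation bytes, overlong forms, illegal lead bytes. */
    static String decodeStrict(byte[] buf) throws InvalidUtfException {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < buf.length) {
            int b = buf[i++] & 0xFF;
            int c;
            if (b >= 0x01 && b <= 0x7F) {
                c = b;
            } else if ((b & 0xE0) == 0xC0) {
                c = ((b & 0x1F) << 6) | continuation(buf, i++);
                // Modified UTF-8 allows the two-byte form only for U+0000 and U+0080..U+07FF
                if (c != 0 && c < 0x80)
                    throw new InvalidUtfException("overlong two-byte sequence");
            } else if ((b & 0xF0) == 0xE0) {
                c = (b & 0x0F) << 12;
                c |= continuation(buf, i++) << 6;
                c |= continuation(buf, i++);
                if (c < 0x800)
                    throw new InvalidUtfException("overlong three-byte sequence");
            } else {
                throw new InvalidUtfException("illegal lead byte");
            }
            sb.append((char) c);
        }
        return sb.toString();
    }

    /** Returns the low six bits of buf[i], requiring the 10xxxxxx pattern. */
    static int continuation(byte[] buf, int i) throws InvalidUtfException {
        if (i >= buf.length)
            throw new InvalidUtfException("truncated sequence");
        int b = buf[i] & 0xFF;
        if ((b & 0xC0) != 0x80)
            throw new InvalidUtfException("bad continuation byte");
        return b & 0x3F;
    }

    /** True if the strict decoder refuses the given bytes. */
    static boolean rejects(byte[] buf) {
        try {
            decodeStrict(buf);
            return false;
        } catch (InvalidUtfException e) {
            return true;
        }
    }

    /** Convenience for writing byte arrays as hex int literals. */
    static byte[] bytes(int... values) {
        byte[] out = new byte[values.length];
        for (int k = 0; k < values.length; k++) out[k] = (byte) values[k];
        return out;
    }

    public static void main(String[] args) {
        byte[] standard = bytes(0xC3, 0xA8);
        byte[] nonStandard = bytes(0xC3, 0xE8);
        // Same decoded character, different bytes: a byte-keyed table sees two names.
        System.out.println(decodeLenient(standard).equals(decodeLenient(nonStandard)));
        System.out.println(Arrays.equals(standard, nonStandard));
        System.out.println(rejects(nonStandard));
    }
}
```

The `main` method shows why leniency matters for a table keyed on UTF-8 bytes: the two arrays decode to the same string under the lenient decoder but are not equal as byte sequences, while the strict decoder refuses the non-standard one outright.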

Archie L. Cobbs has updated the pull request incrementally with one additional commit since the last revision:

  Add the standard "This is NOT part of any supported API" warning.

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/12893/files
  - new: https://git.openjdk.org/jdk/pull/12893/files/94dd3db3..62930b59

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=12893&range=02
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12893&range=01-02

  Stats: 5 lines in 1 file changed: 5 ins; 0 del; 0 mod
  Patch: https://git.openjdk.org/jdk/pull/12893.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/12893/head:pull/12893

PR: https://git.openjdk.org/jdk/pull/12893

