RFR: 8303623: Compiler should disallow non-standard UTF-8 string encodings [v2]

Sat Mar 18 18:04:20 UTC 2023

On Sat, 18 Mar 2023 04:05:12 GMT, Vicente Romero <vromero at openjdk.org> wrote:

> I think this one needs a CSR as this fix could provoke binary incompatibilities

Good point - thanks. I'll add it.

> src/jdk.compiler/share/classes/com/sun/tools/javac/jvm/ClassFile.java line 109:
> 
>> 107:     public enum Version {
>> 108:         V45_3(45, 3), // base level for all attributes
>> 109:         V48(48, 0),   // JDK 1.4
> 
> not sure why we are referring to a previous version

Here's why: Part of this change is to disallow encodings that are longer than necessary, for example, encoding the character `0x0100` as `e0 84 80` instead of `c4 80`. This is in accordance with the current JVMS.
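To make the example concrete, here is a minimal sketch (not code from the PR; the class and method names are hypothetical) showing that both byte sequences decode to the same character value, which is why the three-byte form is longer than necessary:

```java
public class OverlongDemo {
    // Decode a 2-byte sequence of the form 110xxxxx 10xxxxxx.
    static int decode2(int b1, int b2) {
        return ((b1 & 0x1f) << 6) | (b2 & 0x3f);
    }

    // Decode a 3-byte sequence of the form 1110xxxx 10xxxxxx 10xxxxxx.
    static int decode3(int b1, int b2, int b3) {
        return ((b1 & 0x0f) << 12) | ((b2 & 0x3f) << 6) | (b3 & 0x3f);
    }

    public static void main(String[] args) {
        int shortForm = decode2(0xc4, 0x80);       // minimal encoding
        int longForm = decode3(0xe0, 0x84, 0x80);  // longer than necessary
        // Both decode to 0x0100; prints "0100 0100"
        System.out.printf("%04x %04x%n", shortForm, longForm);
    }
}
```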

However, classfiles prior to major version 48 were allowed to contain longer-than-necessary character encodings, most likely (I'm guessing) because of insufficiently strict validation in the early JVM implementations. So when we parse a UTF-8 sequence in a classfile, we need to know whether the classfile's major version is below 48 in order to decide whether to allow such encodings. If we didn't do this, we might incorrectly break a compilation where someone is compiling against some very old classfiles.
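Roughly, the version-dependent check looks like the sketch below. This is an illustration only, not the actual javac code: `Utf8VersionGate`, `shortestLength`, and `isAcceptable` are hypothetical names, and the sketch assumes the caller has already decoded the character and knows how many bytes it occupied. Note that classfile UTF-8 encodes U+0000 as the two bytes `c0 80`, so two bytes is the minimum for that character.

```java
public class Utf8VersionGate {
    // Threshold taken from the discussion above: classfiles with a major
    // version below 48 may contain longer-than-necessary encodings.
    static final int STRICT_MAJOR_VERSION = 48;

    // Fewest bytes that classfile (modified) UTF-8 needs for a 16-bit char value.
    // U+0000 is the special case: it is encoded in two bytes (c0 80).
    static int shortestLength(int ch) {
        if (ch >= 0x0001 && ch <= 0x007f) return 1;
        if (ch <= 0x07ff) return 2;   // includes U+0000
        return 3;
    }

    // Accept minimal encodings always; accept longer-than-necessary ones only
    // for old classfiles. Anything shorter than the minimum is rejected here.
    static boolean isAcceptable(int ch, int encodedLength, int majorVersion) {
        int shortest = shortestLength(ch);
        if (encodedLength == shortest) return true;
        return encodedLength > shortest && majorVersion < STRICT_MAJOR_VERSION;
    }

    public static void main(String[] args) {
        // 0x0100 encoded in three bytes (e0 84 80):
        System.out.println(isAcceptable(0x0100, 3, 47)); // old classfile
        System.out.println(isAcceptable(0x0100, 3, 48)); // strict classfile
    }
}
```

With this shape, the parser only needs the classfile's major version at the point where it validates each decoded character.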

-------------

PR: https://git.openjdk.org/jdk/pull/12893