RFR: 8303623: Compiler should disallow non-standard UTF-8 string encodings [v2]

Sat Mar 18 18:50:18 UTC 2023

On Sat, 18 Mar 2023 18:26:06 GMT, Vicente Romero <vromero at openjdk.org> wrote:

>> Here's why: Part of this change is to disallow encodings that are longer than necessary, for example, encoding the character `0x0100` as `e0 84 80` instead of `c4 80`. This is in accordance with the current JVMS.
>> 
>> However, classfiles prior to major version 48 were allowed to contain longer-than-necessary character encodings - most likely, I'm guessing, because of insufficiently strict validation in the early JVM implementations. So when we are parsing a UTF-8 sequence in a classfile, we need to know whether the classfile's major version is below 48 in order to decide whether to allow such longer-than-necessary character encodings. If we didn't do this, we might incorrectly break a compilation where someone is compiling against some very old classfiles.
>
> I see, your point is that class files with version >= 48 that contain such encodings won't be accepted by the JVM anyway

I checked the HotSpot code, in particular the method `UTF8::is_legal_utf8` in `src/hotspot/share/utilities/utf8.cpp`, and yes, if the major version is < 48 the VM will pardon that class file :)
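
For illustration, the version-dependent check discussed in this thread could be sketched roughly as follows. This is a minimal, hypothetical Java sketch, not the javac or HotSpot code: the names `OverlongUtf8Sketch` and `isAcceptable` are made up, and it only looks at one-, two- and three-byte forms and at the longer-than-necessary rule, ignoring every other constraint on classfile strings.

```java
// Hypothetical sketch; not taken from javac or HotSpot.
class OverlongUtf8Sketch {

    // True if b is a continuation byte of the form 10xxxxxx.
    static boolean isContinuation(byte b) {
        return (b & 0xc0) == 0x80;
    }

    // Accepts one-, two- and three-byte sequences. Longer-than-necessary
    // encodings are tolerated only when majorVersion < 48.
    static boolean isAcceptable(byte[] bytes, int majorVersion) {
        boolean allowOverlong = majorVersion < 48;
        int i = 0;
        while (i < bytes.length) {
            int b0 = bytes[i] & 0xff;
            if (b0 < 0x80) {
                i += 1;
            } else if ((b0 & 0xe0) == 0xc0) {
                if (i + 1 >= bytes.length || !isContinuation(bytes[i + 1])) {
                    return false;
                }
                int cp = ((b0 & 0x1f) << 6) | (bytes[i + 1] & 0x3f);
                // c0 80 (the classfile encoding of U+0000) is the one
                // two-byte form below 0x80 that is always permitted.
                if (cp < 0x80 && cp != 0 && !allowOverlong) {
                    return false;
                }
                i += 2;
            } else if ((b0 & 0xf0) == 0xe0) {
                if (i + 2 >= bytes.length
                        || !isContinuation(bytes[i + 1])
                        || !isContinuation(bytes[i + 2])) {
                    return false;
                }
                int cp = ((b0 & 0x0f) << 12)
                        | ((bytes[i + 1] & 0x3f) << 6)
                        | (bytes[i + 2] & 0x3f);
                // Anything below 0x800 would have fit in two bytes.
                if (cp < 0x800 && !allowOverlong) {
                    return false;
                }
                i += 3;
            } else {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] shortForm = {(byte) 0xc4, (byte) 0x80};              // U+0100
        byte[] longForm = {(byte) 0xe0, (byte) 0x84, (byte) 0x80};  // U+0100, overlong
        System.out.println(isAcceptable(shortForm, 52)); // true
        System.out.println(isAcceptable(longForm, 52));  // false
        System.out.println(isAcceptable(longForm, 47));  // true
    }
}
```

With the example from the quoted text, `c4 80` passes for any version, while `e0 84 80` passes only when the major version is below 48.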

-------------

PR: https://git.openjdk.org/jdk/pull/12893