RFR: 8301873: Avoid string decoding in ZipFile.Source.getEntryPos

Mon Feb 6 15:03:43 UTC 2023

On Mon, 6 Feb 2023 11:47:42 GMT, Eirik Bjorsnos <duke at openjdk.org> wrote:

>> src/java.base/share/classes/java/lang/System.java line 2668:
>> 
>>> 2666:             @Override
>>> 2667:             public int mismatchUTF8(String str, byte[] b, int fromIndex, int toIndex) {
>>> 2668:                 byte[] encoded = str.isLatin1() ? str.value() : str.getBytes(UTF_8.INSTANCE);
>> 
>> I think this is incorrect: latin-1 characters above codepoint 127 (non-ascii) would be represented by 2 bytes in UTF-8. What you want here is probably `str.isAscii() ? ...`. The ASCII check will have to look at the bytes, so will incur a minor penalty.
>> 
>> Good news is that you should already be able to do this with what's already exposed via `JLA.getBytesNoRepl(str, StandardCharsets.UTF_8)`, so no need for more shared secrets.
>
> Nice, I have updated the PR such that the new shared secret is replaced with using getBytesNoRepl instead. If there is a performance difference, it seems to hide in the noise.
> 
> I had expected such a regression to be caught by existing tests, which seems not to be the case. I added TestZipFileEncodings.latin1NotAscii to address this.

getBytesNoRepl throws CharacterCodingException "for malformed input or unmappable characters".

This should never happen, since initCEN should already reject it. If it happens anyway, I return NO_MATCH, which ignores the match just as the catch in getEntryPos currently does.
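
To illustrate the two points above, here is a minimal sketch rather than the code in the PR. It uses the public CharsetEncoder API in place of the JDK-internal JLA.getBytesNoRepl, which is not accessible outside java.base. The class name Utf8MismatchSketch, the helper encodeStrict, and the value chosen for NO_MATCH are assumptions made for the example.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8MismatchSketch {
    // Hypothetical sentinel; the real constant and its value live in the PR.
    static final int NO_MATCH = -2;

    // Strict UTF-8 encoding: malformed input (for example an unpaired
    // surrogate) is reported as an exception instead of being replaced.
    static byte[] encodeStrict(String s) throws CharacterCodingException {
        ByteBuffer bb = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(CharBuffer.wrap(s));
        byte[] out = new byte[bb.remaining()];
        bb.get(out);
        return out;
    }

    // Compares the UTF-8 form of str with b[fromIndex, toIndex).
    // Returns -1 when they are equal, the relative index of the first
    // difference otherwise, and NO_MATCH when str cannot be encoded.
    static int mismatchUTF8(String str, byte[] b, int fromIndex, int toIndex) {
        try {
            byte[] encoded = encodeStrict(str);
            return Arrays.mismatch(encoded, 0, encoded.length, b, fromIndex, toIndex);
        } catch (CharacterCodingException e) {
            // Treat an unencodable name as "no such entry".
            return NO_MATCH;
        }
    }
}
```

With this sketch, "\u00e9" encodes to two bytes in UTF-8 but to one byte in Latin-1, so comparing the Latin-1 byte against the UTF-8 form reports a mismatch at index 0, and a lone surrogate such as "\ud800" yields NO_MATCH rather than an exception.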

-------------

PR: https://git.openjdk.org/jdk/pull/12290