Review Request: 6443578: Continuation lines in JAR manifests do not follow RFC-822

Fri Oct 12 17:23:37 UTC 2018

Hi,

Attached is a patch with a proposal to fix bug 6443578 about Manifest
inserting line breaks in between bytes of characters encoded in utf-8 with more than one byte and a test for it.

The Issue

The current Manifest implementation places line breaks for breaking and
continuing values onto the next line always at 72 bytes from the start
of the line without paying attention to characters encoded in utf-8
with more than one byte and to all bytes of one character together. It
occurs that a character encoded in utf-8 with more than one byte is
partially written at the end of one line and then after a line break
for a continuation line ("\r\n ") continued on the next line resulting
in an invalid utf-8 byte sequence where "\r\n " is inserted in between
the multiple bytes of the byte sequence a character is encoded in utf-8 
with.

That cannot happen with characters encoded with one byte in utf-8 which
are used most often. It only affects characters with more than one byte
encoded in utf-8. It also does not affect header names (the keys)
because header names are limited to {A-Z} | {a-z} | {0-9} | - | _ [1]
which all happen to be characters encoded in utf-8 with one byte.

The Manifest implementation removes these continuation line delimiters
("\r\n ") again in [2] and [3] (mind the --) and [4] (mind the 1 as
second parameter to arraycopy skipping the space) and concatenates the
parts from individual continuation lines together in [5] operating
still on a sequence of bytes before decoding the remaining bytes of the
value back into the string value applying the utf-8 decoding in [6].
The process is very similar also for names of named sections, see [7],
[8], [9], and [10]. Manifests written with the current implementation
are read again correctly.

Not only the specification says that sequences of bytes of the same
utf-8 encoded character cannot be broken apart [1]
> value: SPACE *otherchar newline *continuation
> continuation: SPACE *otherchar newline
> otherchar: any UTF-8 character except NUL, CR and LF
but line break and space can occur only between whole utf-8 encoded
characters and also jar manifests cannot be viewed with utf-8 aware
file viewers correctly. Manifests with values with characters illegally
broken onto a new line look for example simplified and with a shortened
line width limit similar like this one in any of my favorite file viewers:

> Manifest-Version: 1.0
> Key: Gänsefüßchen in Übergr?
> ?ße

where ? and ? should in fact be a German o umlaut ö. A file viewer may
understand and decode utf-8 well but has no way to know that the two
bytes, one before and one after the continuation line break actually
constitute the utf-8 encoded byte sequence of one single character.

Don't Put Continuation Line Breaks in Front of UTF-8 Continuation Bytes

The approach chosen in my patch is to prevent putting continuation line
breaks before a utf-8 continuation byte. UTF-* continuation bytes can
be detected with a bit mask and comparison, see [11]. An alternative
would have been to write one character (unicode codepoint and not Java
16-bit char) after another computing the length in number of bytes in
utf-8 encoded form for each and checking if it still fits on the same
line before writing it with or without an addition continuation line
break. When working with strings in Java, an additional challenge would
have been to work with surrogate pairs. I figure the bitmask comparison
fits the purpose here best, it is also used elsewhere in the jdk, see
[12] and [13] among other places, and presumably involves the least
resource usage and performance penalty, unfortunately none of which
accessible (private).

Tricky Strings of Actual Bytes

The current Manifest implementation does some kind of strange
processing when writing and encoding manifests with respect to using
strings with bytes as already described and mentioned in [14] and [15]
among probably more. The values are converted into strings with the
invocation of the deprecated constructor [17] in [16]:
> value = new String(vb, 0, 0, vb.length);
which converts the utf-8 encoded bytes into a string of chars each of
which is set to a byte value in its lower byte and zero in the higher
byte [18]. Each char of that string actually holds a byte and its
length corresponds to the number of bytes of the utf-8 encoded value.
After inserting line breaks in make72Safe [19] the string is written to
the output stream disregarding the higher bytes of the chars by
applying a lossy conversion in [20] thereby restoring the original utf-
8 byte sequence with line breaks added which also don't use higher
bytes.

I'll be glad to also solve bug 8066619 by replacing these weird strings
with byte arrays or anyhow else but prefer to do that in a separate
effort and patch and for this patch here the strings are encoded as
they always were. Above paragraph explains why my patch's
isUTF8ContinuationByte takes a parameter of type char which looks weird
at first glance but after studying the situation it implies the least
possible number of type casts and each char actually represents a byte
held in each chars lower byte accounting for that methods name.

Compatibility

In addition to the tests which usually test the current implementation
it has to be made sure that manifests written by newer jdk can be read
by an older one and manifests written by an older jdk can be read by a
newer one. Obviously also manifests written by older versions of the
jdk should be read correctly by older versions of it which has been addressed in older jdk version already and the test in the patch is supposed to test writing and reading of manifests with all kinds of cases related to breaking values on continuation lines in conjunction with characters near that break with different number of bytes in their utf-8 encoded forms.
Reading the manifests remains unchanged which is why only writing
manifests like it was done before became a part of the new test in the
patch, see writeManifestWithBrokenCharacters and make72Safe in the LineBreakCharacter test in the attached patch.

Performance and Resource Usage Implications

I figure that reading and writing manifests are frequent operations
already optimized to a certain extent guessing for instance from the
name of the class Manifest.FastInputStream. Adding more complexity as
in my patch will certainly make operations not faster. I have no facts
here but state some assumptions.

When writing a manifest, performance and resource usage is only
affected if lines are actually broken onto continuation lines, which I
would assume happens statistically to a minority of headers only.

When breaking a value onto a continuation line when writing a manifest,
at least one and at most four bytes will be checked if they are
continuation bytes. In average I assume that characters encoded with
fewer bytes occur much more often.

It may also happen that the same manifest will need more continuation
lines because lines will be filled with less than the complete 72 byte
limit if at the end of the line a multi-byte character would exceed it.
When the next line starts with the whole character instead of the
remaining bytes the next continuation line break could have to placed
earlier and so on. The utf-8 encoded values will not change but there
might be more continuation line breaks ("\r\n ") than before using
storage space in main memory as well as on disk. A line can be at most
three bytes shorter than the limit and therefore at least 24 lines are
necessary to require an additional continuation line break.

When reading manifests, the only difference might make the increased
manifest file sizes or input stream lengths in number of bytes.

Altogether, I guess that these implications are negligible.

The issue has been open for quite some time now and Java still has
worked well without its resolution. Advantages of applying that patch
would be that the manifest files would comply with respect to utf-8
encoding to its own specification and could be correctly viewed and
also edited with any utf-8 capable viewer or editor with the complete
unicode character set and another bug could be closed.

I'm looking forward to any feedback and would be glad if someone would
volunteer to sponsor the change.

Regards,
Philipp

[1] https://docs.oracle.com/javase/7/docs/technotes/guides/jar/jar.html
#Section-Specification
[2] http://hg.openjdk.org/jdk/jdk/file/331fbd2db6b5/src/java.base/share
/classes/java/util/jar/Attributes.java#l384 byte c = lbuf[--len];
[3] http://hg.openjdk.org/jdk/jdk/file/331fbd2db6b5/src/java.base/share
/classes/java/util/jar/Attributes.java#l392 --len;
[4] http://hg.openjdk.org/jdk/jdk/file/331fbd2db6b5/src/java.base/share
/classes/java/util/jar/Attributes.java#l407 System.arraycopy(lbuf, 1,
buf, lastline.length, len - 1);
[5] http://hg.openjdk.org/jdk/jdk/file/331fbd2db6b5/src/java.base/share
/classes/java/util/jar/Attributes.java#l406 System.arraycopy(lastline,
0, buf, 0, lastline.length);
[6] http://hg.openjdk.org/jdk/jdk/file/331fbd2db6b5/src/java.base/share
/classes/java/util/jar/Attributes.java#l412 value = new String(buf, 0,
buf.length, "UTF8");
[7] http://hg.openjdk.org/jdk/jdk/file/331fbd2db6b5/src/java.base/share
/classes/java/util/jar/Manifest.java#l236 byte c = lbuf[--len];
[8] http://hg.openjdk.org/jdk/jdk/file/331fbd2db6b5/src/java.base/share
/classes/java/util/jar/Manifest.java#l244 --len;
[9] http://hg.openjdk.org/jdk/jdk/file/331fbd2db6b5/src/java.base/share
/classes/java/util/jar/Manifest.java#l267 System.arraycopy(lbuf, 1,
buf, lastline.length, len - 1);
[10] http://hg.openjdk.org/jdk/jdk/file/331fbd2db6b5/src/java.base/shar
e/classes/java/util/jar/Manifest.java#l273 name = new String(buf, 0,
buf.length, "UTF8");
[11] https://tools.ietf.org/html/rfc3629#section-3
[12] http://hg.openjdk.org/jdk/jdk/file/331fbd2db6b5/src/java.base/shar
e/classes/java/lang/StringCoding.java#l632 return (b & 0xc0) != 0x80;
[13] http://hg.openjdk.org/jdk/jdk/file/331fbd2db6b5/src/java.base/shar
e/classes/sun/nio/cs/UTF_8.java#l90 return (b & 0xc0) != 0x80;
[14] https://bugs.openjdk.java.net/browse/JDK-8066619?focusedCommentId=
14085301&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-
tabpanel#comment-14085301
[15] http://mail.openjdk.java.net/pipermail/core-libs-dev/2018-May/0529
46.html
[16] http://hg.openjdk.org/jdk/jdk/file/331fbd2db6b5/src/java.base/shar
e/classes/java/util/jar/Manifest.java#l177 value = new String(vb, 0, 0,
vb.length);
[17] http://hg.openjdk.org/jdk/jdk/file/331fbd2db6b5/src/java.base/shar
e/classes/java/lang/String.java#l375 @Deprecated(since="1.1") public
String(byte ascii[], int hibyte, int offset, int count) {
[18] http://hg.openjdk.org/jdk/jdk/file/331fbd2db6b5/src/java.base/shar
e/classes/java/lang/String.java#l389 StringUTF16.putChar(val, i, hibyte
| (ascii[offset++] & 0xff));
[19] http://hg.openjdk.org/jdk/jdk/file/331fbd2db6b5/src/java.base/shar
e/classes/java/util/jar/Manifest.java#l191
[20] http://hg.openjdk.org/jdk/jdk/file/331fbd2db6b5/src/java.base/shar
e/classes/java/io/DataOutputStream.java#l276
out.write((byte)s.charAt(i));
[21] https://bugs.openjdk.java.net/browse/JDK-6695402
[22] https://tools.ietf.org/html/rfc822#section-3.1.1
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 6443578.patch
Type: text/x-patch
Size: 32640 bytes
Desc: not available
URL: <http://mail.openjdk.java.net/pipermail/core-libs-dev/attachments/20181012/21c977f9/6443578-0001.patch>