6202130: Need to handle UTF-8 values and break up lines longer than 72 bytes

Thu Feb 6 18:37:28 UTC 2020

Hi,

I recently submitted two patches related to jar manifests and UTF-8 and
have not received any reaction so far. I understand and appreciate that
not everyone has time for every wish, and my enquiry is certainly not
urgent. Still, may I gently ask whether I may continue to hope for any
progress, whether I have missed anything, or whether this is of no
interest at all?

Unfortunately, the line breaks in the previous mail went bad, which is
why I am pasting the text again below. I hope it looks nicer this time.

Regards,
Philipp

[1] https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-January/064190.html
[2] https://mail.openjdk.java.net/pipermail/core-libs-dev/2019-December/064149.html

On Thu, 2019-12-26 at 17:50 +0100, Philipp Kunz wrote:
Hi,
The specification says, a line break in a manifest can occur before or
after a Unicode character encoded in UTF-8.

  ...
  value:         SPACE *otherchar newline *continuation
  continuation:  SPACE *otherchar newline
  ...
  otherchar:     any UTF-8 character except NUL, CR and LF

The current implementation breaks manifest lines at 72 bytes regardless
of whether the bytes around the break belong to the same sequence of
bytes encoding one character. Code points may use up to four bytes when
encoded in UTF-8. Manifests with line breaks inside a multi-byte
sequence encoding a Unicode character are not only invalid UTF-8 but
also look ugly in text editors. For example, a manifest could look like
this:

Manifest-Version: 1.0
Some-Key: Some languages have decorated characters, for example: espa?
 ?ol

The code below produces a result as seen above, with some unexpected
question marks where the encoding is invalid:

import java.util.jar.Manifest;
import java.util.jar.Attributes;
import static java.util.jar.Attributes.Name;

public class CharacterBrokenDemo1 {
    public static void main(String[] args) throws Exception {
        Manifest mf = new Manifest();
        Attributes attrs = mf.getMainAttributes();
        attrs.put(Name.MANIFEST_VERSION, "1.0");
        attrs.put(new Name("Some-Key"),
                  "Some languages have decorated characters, " +
                  "for example: español"); // or "espa\u00F1ol"
        mf.write(System.out);
    }
}

This is, of course, an example written with actual question marks to
get valid text for this message. The trick here is that the text from
"Some-Key" to "example: espa" amounts, encoded in UTF-8, to exactly one
byte less than would fit on one line with the 72 byte limit, so that
the subsequent character, encoded with two bytes, gets broken inside
its two-byte sequence across a continuation line break.
When decoding the resulting bytes from UTF-8 as one whole string, the
two question marks will not fit together again even if the line break
with the continuation space is removed. However, Manifest::read removes
the continuation line breaks ("\r\n ") before decoding the manifest
header value from UTF-8 and hence can reproduce the original value.
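
To make the order of operations concrete, here is a minimal sketch. It
does not use Manifest::read at all; the class and method names are made
up for illustration. It inserts a continuation break in the middle of a
two-byte sequence and then compares removing the break before decoding
with removing it after decoding.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; none of these names come from the JDK.
final class SplitSequenceDemo {

    /** Inserts a continuation break ("\r\n ") at byte offset splitAt. */
    static byte[] breakAt(String value, int splitAt) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(utf8, 0, splitAt);
        out.write('\r');
        out.write('\n');
        out.write(' ');
        out.write(utf8, splitAt, utf8.length - splitAt);
        return out.toByteArray();
    }

    /** Removes continuation breaks on the bytes, then decodes. */
    static String unfoldThenDecode(byte[] folded) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < folded.length; i++) {
            boolean atBreak = i + 2 < folded.length
                    && folded[i] == '\r'
                    && folded[i + 1] == '\n'
                    && folded[i + 2] == ' ';
            if (atBreak) {
                i += 2; // skip the break; the loop increment skips its last byte
            } else {
                out.write(folded[i]);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Decodes first, then removes continuation breaks on the chars. */
    static String decodeThenUnfold(byte[] folded) {
        return new String(folded, StandardCharsets.UTF_8).replace("\r\n ", "");
    }

    public static void main(String[] args) {
        // Offset 5 lies between the two bytes (0xC3 0xB1) of the n with tilde.
        byte[] folded = breakAt("espa\u00F1ol", 5);
        System.out.println(unfoldThenDecode(folded));
        System.out.println(decodeThenUnfold(folded));
    }
}
```

The first call restores the original value. The second one cannot,
because each half of the broken sequence has already been decoded to a
replacement character (U+FFFD) by the time the break is removed.
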
Characters encoded in UTF-8 can not only span up to four bytes per code
point; there are also combining characters (combining diacritical
marks, or whatever the appropriate term may be) that combine more than
one code point into what is usually experienced and referred to as a
character.
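
As a small illustration of the three different counts involved, the
following sketch measures one string in UTF-8 bytes, in code points,
and in what java.text.BreakIterator reports as character boundaries.
The class name is made up, and the BreakIterator uses the default
locale, which is an assumption of this sketch.

```java
import java.nio.charset.StandardCharsets;
import java.text.BreakIterator;

// Hypothetical helper class for illustration only.
final class CharacterCounts {

    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Counts the character boundaries a BreakIterator finds after the start. */
    static int userCharacters(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "Angstro\u0308m"; // "o" followed by a combining diaeresis
        System.out.println(utf8Bytes(s) + " bytes, "
                + codePoints(s) + " code points, "
                + userCharacters(s) + " user-perceived characters");
    }
}
```

For "Angstro\u0308m" this gives 10 bytes, 9 code points and 8
user-perceived characters, since the "o" and the diaeresis count as one.
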
The term character really gets ambiguous at this point. I wouldn't know
what the specification actually refers to with the term "character".
So rather than diving too deep into specification or any sort of
theory, let's look at another example:

import java.util.jar.Manifest;
import java.util.jar.Attributes;
import static java.util.jar.Attributes.Name;

public class DemoCharacterBroken2 {
    public static void main(String[] args) throws Exception {
        Manifest mf = new Manifest();
        Attributes attrs = mf.getMainAttributes();
        attrs.put(Name.MANIFEST_VERSION, "1.0");
        attrs.put(new Name("Some-Key"), " ".repeat(53) + "Angstro\u0308m");
        mf.write(System.out);
    }
}

which produces console output as follows:

Manifest-Version: 1.0
Some-Key:                                                       Angstro
 ̈m

(In case this does not display well, the diaeresis is on the m on the
last line)
When the whole manifest is decoded from UTF-8 as one big single string
and continuation line breaks are not removed until after UTF-8
decoding, the diaeresis (umlaut, two dots above, U+0308) apparently
jumps onto the following letter m because it cannot be combined with
the preceding space. The UTF-8 decoder (in all the editors and tools I
tried: Eclipse and its console view, less, gedit, cat and the terminal)
somehow tries to fix that, but the diaeresis does not necessarily jump
back onto the "o" where it originally belonged when the continuation
line break ("\r\n ") is removed after UTF-8 decoding has taken place;
at least that did not work for me.
Hence, combining diacritical marks should ideally not be separated from
whatever they combine with when breaking manifest lines onto a
continuation line. Such combinations, however, seem to be unlimited in
terms of the number of code points combining into the same
"experienced" character. I was able to find combinations that not only
exceed the limit of 72 bytes per line but also exceed the line buffer
size of 512 bytes in Manifest::read. These may be rather uncommon but,
to my own surprise, they are possible.
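
One way to construct such a sequence, shown here only as a sketch with
made-up names, is a single base letter followed by many copies of the
same combining mark. Whether a given editor or BreakIterator treats the
whole run as one character is not checked here; the sketch only shows
how quickly the byte count grows.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only.
final class LongCombiningSequence {

    /** One base letter "o" followed by the given number of U+0308 marks. */
    static String baseWithMarks(int marks) {
        return "o" + "\u0308".repeat(marks);
    }

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String s = baseWithMarks(256);
        // 1 byte for the base letter plus 2 bytes per combining mark
        System.out.println(s.codePointCount(0, s.length()) + " code points, "
                + utf8Length(s) + " bytes");
    }
}
```

With 36 marks the sequence already needs 73 bytes, and with 256 marks
it needs 513 bytes.
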
The next consideration would then be to remove that limit of 512 bytes
per manifest line, but exceeding it would make such manifests
incompatible with previous Manifest::read implementations, so it is not
really an immediately available option at the moment.
As a compromise, characters, including those with combining diacritical
marks, that combine only so many code points that their UTF-8 encoded
form stays within a limit of 71 bytes can be written without an
interrupting continuation line break. This applies to most cases, but
not all. I guess this should suit the values that can practically and
realistically be expected.
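
The compromise could be sketched roughly as follows. This is not the
attached patch, only an illustration under stated assumptions: the
class and method names are made up, lines hold at most 72 bytes not
counting the line break, java.text.BreakIterator with the default
locale decides what belongs together, and anything longer than 71 bytes
falls back to being split between code points.

```java
import java.nio.charset.StandardCharsets;
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the JDK implementation and not the patch.
final class GraphemeFolding {
    static final int MAX_LINE_BYTES = 72; // assumed to exclude the line break

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Splits the header into pieces that must not be broken apart. */
    static List<String> units(String header) {
        List<String> units = new ArrayList<>();
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(header);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            String cluster = header.substring(start, end);
            if (utf8Length(cluster) <= MAX_LINE_BYTES - 1) {
                units.add(cluster);
            } else {
                // Too long for any continuation line: keep only code points whole.
                cluster.codePoints().forEach(
                        cp -> units.add(new String(Character.toChars(cp))));
            }
        }
        return units;
    }

    /** Returns the lines without line breaks; continuation lines start with a space. */
    static List<String> fold(String header) {
        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        int lineBytes = 0;
        for (String unit : units(header)) {
            int unitBytes = utf8Length(unit);
            if (lineBytes + unitBytes > MAX_LINE_BYTES) {
                lines.add(line.toString());
                line = new StringBuilder(" ");
                lineBytes = 1;
            }
            line.append(unit);
            lineBytes += unitBytes;
        }
        lines.add(line.toString());
        return lines;
    }
}
```

For example, folding "Some-Key: " followed by 55 spaces and
"Angstro\u0308m" moves the "o" together with its diaeresis onto the
continuation line instead of separating the two.
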
Another possibility would be to keep characters that are combinations
of multiple Unicode code points together in their UTF-8 encoded form up
to the 512 byte line length limit when reading, minus a space and a
line break, which amounts to 509 bytes. That would still not make
manifests valid Unicode in all corner cases, and I guess it would
probably not be a real improvement in practice over a limit of 71
bytes.
Attached is a patch that tries to implement what was described above
using a BreakIterator. While it works from a functional point of view,
it might be less desirable performance-wise. An alternative would be to
do without the BreakIterator and only keep Unicode code points together
by not placing line breaks before a continuation byte, which, however,
would not address combining diacritical marks as in the second example
above.
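
The simpler alternative, without a BreakIterator, might look like the
sketch below. Again the names are made up and the 72 byte limit is
assumed not to count the line break. A UTF-8 continuation byte has the
bit pattern 10xxxxxx, so the break position is moved back until the
byte following the break is not a continuation byte.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; not the JDK implementation and not the patch.
final class CodePointFolding {

    static boolean isContinuationByte(byte b) {
        return (b & 0xC0) == 0x80;
    }

    /** Writes one header with continuation breaks only between code points. */
    static byte[] fold(String header) {
        byte[] bytes = header.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int pos = 0;
        int room = 72; // the first line has no leading continuation space
        while (bytes.length - pos > room) {
            int end = pos + room;
            // getBytes yields well-formed UTF-8, so this steps back at most 3 bytes.
            while (end > pos && isContinuationByte(bytes[end])) {
                end--;
            }
            out.write(bytes, pos, end - pos);
            out.write('\r');
            out.write('\n');
            out.write(' ');
            pos = end;
            room = 71; // continuation lines lose one byte to the space
        }
        out.write(bytes, pos, bytes.length - pos);
        out.write('\r');
        out.write('\n');
        return out.toByteArray();
    }
}
```

This keeps every line valid UTF-8 on its own, but a combining mark can
still end up as the first character of a continuation line.
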
The jar file specification does not explicitly state that manifests
should be valid UTF-8, and they were not always, but it also does not
state otherwise, leaving an impression that manifests could be
(mis)taken for UTF-8 encoded strings, which they also are in many or
most cases and which has been confused many times. At the moment, the
only case where a valid manifest is not also a valid UTF-8 encoded
string is when a sequence of bytes encoding the same character happens
to be interrupted by a continuation line break. To the best of my
knowledge, all other valid manifests are also valid UTF-8 encoded
strings.
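
Whether a given manifest is also valid UTF-8 as a whole can be checked
with a strict decoder, for instance as in this sketch. The class and
method names are made up; the decoder is configured to report malformed
input instead of silently replacing it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only.
final class Utf8Check {

    /** Returns true if the bytes decode as UTF-8 without any error. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Passing the bytes of a complete manifest file to isValidUtf8 returns
false exactly when some byte sequence in it is malformed, for example
because a continuation line break sits inside a multi-byte sequence.
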
It would be nice if manifests could be viewed and manipulated with all
Unicode capable editors and not only be parsed correctly with
Manifest::read.
Any opinions? Would someone sponsor this patch?
Regards,
Philipp

https://docs.oracle.com/en/java/javase/13/docs/specs/jar/jar.html#specification
https://bugs.openjdk.java.net/browse/JDK-6202130
https://bugs.openjdk.java.net/browse/JDK-6443578
https://github.com/gradle/gradle/issues/5225
https://bugs.openjdk.java.net/browse/JDK-8202525
https://en.wikipedia.org/wiki/Combining_character