6202130: Need to handle UTF-8 values and break up lines longer than 72 bytes

26 Dec 2019


      Hi,
The specification says, a line break in a manifest can occur beforeor
after a Unicode character encoded in UTF-8.
...
...>      value:         SPACE *otherchar newline
*continuation>      continuation:  SPACE *otherchar
newline>    ...>      otherchar:     any UTF-8 character except NUL, CR
and LF
The current implementation breaks manifest lines at 72 bytes regardless
ofhow the bytes around the break are part of a sequence of bytes
encoding acharacter. Code points may use up to four bytes when encoded
in UTF-8.Manifests with line breaks inside of sequences of bytes
encoding Unicodecharacters in UTF-8 with more than one bytes not only
are invalid UTF-8but also look ugly in text editors.For example, a
manifest could look like this:
import java.util.jar.Manifest;import java.util.jar.Attributes;import
static java.util.jar.Attributes.Name;
public class CharacterBrokenDemo1 {    public static void main(String[]
args) throws Exception{        Manifest mf = new
Manifest();        Attributes attrs =
mf.getMainAttributes();        attrs.put(Name.MANIFEST_VERSION,
"1.0");        attrs.put(new Name("Some-Key"),                  "Some
languages have decorated characters, " +                   "for
example: español"); // or
"espa\u00D1ol"        mf.write(System.out);    }}
Above code produces a result as follows with some unexpected question
markswhere the encoding is invalid:
   Manifest-Version: 1.0>    Some-Key: Some languages have decorated
characters, for example: espa?>     ?ol
This is of course an example written with actual question marks to get
a validtext for this message. The trick here is that "Some-Key" to
"example :espa"amounts to exactly one byte less encoded in UTF-8 than
would fit on one linewith the 72 byte limit so that the subsequent
character encoded with two bytesgets broken inside of the sequence of
two bytes for this character across acontinuation line break.
When decoding the resulting bytes from UTF-8 as one whole string, the
twoquestion marks will not fit together again even if the line break
with thecontinuation space is removed. However, Manifest::read removes
the continuationline breaks ("\r\n ") before decoding the manifest
header value from UTF-8 andhence can reproduce the original value.
Characters encoded in UTF-8 can not only span up to four bytes for one
codepoint each, there are also combining characters or classes thereof
or combiningdiacritical marks or whatever the appropriate term could
be, that combine morethan one code point into what is usually
experienced and referred to as acharacter.
The term character really gets ambiguous at this point. I wouldn't know
whatthe specification actually refers to with that term "character". So
rather thandiving in too much specification or any sorts of theory,
let's look at anotherexample:
import java.util.jar.Manifest;import java.util.jar.Attributes;import
static java.util.jar.Attributes.Name;
public class DemoCharacterBroken2 {    public static void main(String[]
args) throws Exception{        Manifest mf = new
Manifest();        Attributes attrs =
mf.getMainAttributes();        attrs.put(Name.MANIFEST_VERSION,
"1.0");        attrs.put(new Name("Some-Key"), " ".repeat(53) +
"Angstro\u0308m");        mf.write(System.out);    }}
which produces console output as follows:
   Manifest-Version: 1.0>    Some-
Key:                                                      Angstro>     
̈m
(In case this does not display well, the diaeresis is on the m on the
last line)
When the whole Manifest is decoded from UTF-8 as one big single string
andcontinuation line breaks are not removed until after UTF-8 decoding
the wholemanifest, the diaeresis (umlaut, two points above, u0308)
apparently kind ofjumps onto the following letter m because somehow it
cannot be combined withthe preceding space. The UTF-8 decoder (all of
my editors I tried, not onlyEclipse and its console view, also less,
gedit, cat and terminal) somehowtries to fix that but the diaeresis may
not necessarily jump back on the "o"where it originally belonged by
removing the continuation line break ("\r\n ")after UTF-8 decoding has
taken place, at least that did not work for me.
Hence, ideally combining diacritical marks should better not be
separated fromwhatever they combine with when breaking manifest lines
onto a continuationline. Such combinations, however, seem to be
unlimited in terms of number ofcode points combining into the same
"experienced" character. I was able tofind combinations that not only
exceed the limit of 72 bytes per line but alsoexceed the line buffer
size of 512 bytes in Manifest::read. These may be ratheruncommon but
still possible to my own surprise.
Next consideration would then be to remove that limit of 512 bytes per
manifestline but exceeding it would make such manifests incompatible
with previousManifest::read implementations and is not really an
immediately availableoption at the moment.
As a compromise, those characters including combining diacritical marks
whichcombine only so many code points as that their binarily encoded
form in UTF-8remains within a limit of 71 bytes can be written without
an interruptingcontinuation line break, which applies to most cases,
but not all. I guess thisshould suit practically and realistically to
be expected values well.
Another possibility would be to allow for characters that are
combinations ofmultiple Unicode code points to be kept together in
their encoded form in UTF-8up to 512 bytes line length limit when
reading minus a space and a line breakamounting to 509 bytes, but that
would still not make manifests be representedas valid Unicode in all
corner cases and I guess would not probably make a realimprovement in
practice over a limit of 71 bytes.
Attached is a patch that tries to implement what was described above
using aBreakIterator. While it works from a functional point of view,
this might beless desirable performance-wise. Alternatively could be
considered to do withoutthe BreakIterator and only keep Unicode code
points together by not placingline breaks before a continuation byte,
which however would not addresscombining diacritical marks as in the
second example above.
The jar file specification does not explicitly state that manifest
should bevalid UTF-8, and they were not always, but it also does not
state otherwise,leaving an impression that manifests could be
(mis)taken for UTF-8 encodedstrings, which they also are in many or
most cases and which has been confusedmany times. At the moment, the
only case where a valid manifest is not also avalid UTF-8 encoded
string is when a sequence of bytes encoding the samecharacter happens
to be interrupted with a continuation line break. To the bestof my
knowledge, all other valid manifests are also valid UTF-8 encoded
strings.
It would be nice, if manifests could be viewed and manipulated with all
Unicodecapable editors and not only be parsed correctly with
Manifest::read.
Any opinions? Would someone sponsor this patch?
Regards,Philipp
https://docs.oracle.com/en/java/javase/13/docs/specs/jar/jar.html#specificat...

Philipp Kunz

Philipp Kunz

Lance Andersen

Lance Andersen

Lance Andersen

naoto.sato＠oracle.com

Philipp Kunz

naoto.sato＠oracle.com

Philipp Kunz

naoto.sato＠oracle.com

Philipp Kunz

naoto.sato＠oracle.com

Philipp Kunz

naoto.sato＠oracle.com

Philipp Kunz

Brent Christian

Philipp Kunz

tags

participants (4)