6202130: Need to handle UTF-8 values and break up lines longer than 72 bytes

Mon Feb 10 21:22:56 UTC 2020

Hi all,

Here is a webrev for the patch that Philipp is proposing which will make it easier to review:  http://cr.openjdk.java.net/~lancea/6202130/webrev.00 <http://cr.openjdk.java.net/~lancea/6202130/webrev.00>

> On Dec 26, 2019, at 11:50 AM, Philipp Kunz <philipp.kunz at paratix.ch> wrote:
> 
> Hi,
> The specification says, a line break in a manifest can occur beforeor
> after a Unicode character encoded in UTF-8.
>>   ...>      value:         SPACE *otherchar newline
> *continuation>      continuation:  SPACE *otherchar
> newline>    ...>      otherchar:     any UTF-8 character except NUL, CR
> and LF
> The current implementation breaks manifest lines at 72 bytes regardless
> ofhow the bytes around the break are part of a sequence of bytes
> encoding acharacter. Code points may use up to four bytes when encoded
> in UTF-8.Manifests with line breaks inside of sequences of bytes
> encoding Unicodecharacters in UTF-8 with more than one bytes not only
> are invalid UTF-8but also look ugly in text editors.For example, a
> manifest could look like this:
> import java.util.jar.Manifest;import java.util.jar.Attributes;import
> static java.util.jar.Attributes.Name;
> public class CharacterBrokenDemo1 {    public static void main(String[]
> args) throws Exception{        Manifest mf = new
> Manifest();        Attributes attrs =
> mf.getMainAttributes();        attrs.put(Name.MANIFEST_VERSION,
> "1.0");        attrs.put(new Name("Some-Key"),                  "Some
> languages have decorated characters, " +                   "for
> example: español"); // or
> "espa\u00D1ol"        mf.write(System.out);    }}
> Above code produces a result as follows with some unexpected question
> markswhere the encoding is invalid:
>>   Manifest-Version: 1.0>    Some-Key: Some languages have decorated
> characters, for example: espa?>     ?ol
> This is of course an example written with actual question marks to get
> a validtext for this message. The trick here is that "Some-Key" to
> "example :espa"amounts to exactly one byte less encoded in UTF-8 than
> would fit on one linewith the 72 byte limit so that the subsequent
> character encoded with two bytesgets broken inside of the sequence of
> two bytes for this character across acontinuation line break.
> When decoding the resulting bytes from UTF-8 as one whole string, the
> twoquestion marks will not fit together again even if the line break
> with thecontinuation space is removed. However, Manifest::read removes
> the continuationline breaks ("\r\n ") before decoding the manifest
> header value from UTF-8 andhence can reproduce the original value.
> Characters encoded in UTF-8 can not only span up to four bytes for one
> codepoint each, there are also combining characters or classes thereof
> or combiningdiacritical marks or whatever the appropriate term could
> be, that combine morethan one code point into what is usually
> experienced and referred to as acharacter.
> The term character really gets ambiguous at this point. I wouldn't know
> whatthe specification actually refers to with that term "character". So
> rather thandiving in too much specification or any sorts of theory,
> let's look at anotherexample:
> import java.util.jar.Manifest;import java.util.jar.Attributes;import
> static java.util.jar.Attributes.Name;
> public class DemoCharacterBroken2 {    public static void main(String[]
> args) throws Exception{        Manifest mf = new
> Manifest();        Attributes attrs =
> mf.getMainAttributes();        attrs.put(Name.MANIFEST_VERSION,
> "1.0");        attrs.put(new Name("Some-Key"), " ".repeat(53) +
> "Angstro\u0308m");        mf.write(System.out);    }}
> which produces console output as follows:
>>   Manifest-Version: 1.0>    Some-
> Key:                                                      Angstro>     
> ̈m
> (In case this does not display well, the diaeresis is on the m on the
> last line)
> When the whole Manifest is decoded from UTF-8 as one big single string
> andcontinuation line breaks are not removed until after UTF-8 decoding
> the wholemanifest, the diaeresis (umlaut, two points above, u0308)
> apparently kind ofjumps onto the following letter m because somehow it
> cannot be combined withthe preceding space. The UTF-8 decoder (all of
> my editors I tried, not onlyEclipse and its console view, also less,
> gedit, cat and terminal) somehowtries to fix that but the diaeresis may
> not necessarily jump back on the "o"where it originally belonged by
> removing the continuation line break ("\r\n ")after UTF-8 decoding has
> taken place, at least that did not work for me.
> Hence, ideally combining diacritical marks should better not be
> separated fromwhatever they combine with when breaking manifest lines
> onto a continuationline. Such combinations, however, seem to be
> unlimited in terms of number ofcode points combining into the same
> "experienced" character. I was able tofind combinations that not only
> exceed the limit of 72 bytes per line but alsoexceed the line buffer
> size of 512 bytes in Manifest::read. These may be ratheruncommon but
> still possible to my own surprise.
> Next consideration would then be to remove that limit of 512 bytes per
> manifestline but exceeding it would make such manifests incompatible
> with previousManifest::read implementations and is not really an
> immediately availableoption at the moment.
> As a compromise, those characters including combining diacritical marks
> whichcombine only so many code points as that their binarily encoded
> form in UTF-8remains within a limit of 71 bytes can be written without
> an interruptingcontinuation line break, which applies to most cases,
> but not all. I guess thisshould suit practically and realistically to
> be expected values well.
> Another possibility would be to allow for characters that are
> combinations ofmultiple Unicode code points to be kept together in
> their encoded form in UTF-8up to 512 bytes line length limit when
> reading minus a space and a line breakamounting to 509 bytes, but that
> would still not make manifests be representedas valid Unicode in all
> corner cases and I guess would not probably make a realimprovement in
> practice over a limit of 71 bytes.
> Attached is a patch that tries to implement what was described above
> using aBreakIterator. While it works from a functional point of view,
> this might beless desirable performance-wise. Alternatively could be
> considered to do withoutthe BreakIterator and only keep Unicode code
> points together by not placingline breaks before a continuation byte,
> which however would not addresscombining diacritical marks as in the
> second example above.
> The jar file specification does not explicitly state that manifest
> should bevalid UTF-8, and they were not always, but it also does not
> state otherwise,leaving an impression that manifests could be
> (mis)taken for UTF-8 encodedstrings, which they also are in many or
> most cases and which has been confusedmany times. At the moment, the
> only case where a valid manifest is not also avalid UTF-8 encoded
> string is when a sequence of bytes encoding the samecharacter happens
> to be interrupted with a continuation line break. To the bestof my
> knowledge, all other valid manifests are also valid UTF-8 encoded
> strings.
> It would be nice, if manifests could be viewed and manipulated with all
> Unicodecapable editors and not only be parsed correctly with
> Manifest::read.
> Any opinions? Would someone sponsor this patch?
> Regards,Philipp
> 
> https://docs.oracle.com/en/java/javase/13/docs/specs/jar/jar.html#specificationhttps://bugs.openjdk.java.net/browse/JDK-6202130https://bugs.openjdk.java.net/browse/JDK-6443578https://github.com/gradle/gradle/issues/5225https://bugs.openjdk.java.net/browse/JDK-8202525https://en.wikipedia.org/wiki/Combining_character
> 
> <6202130-manifestutf8linebreak.patch>

 <http://oracle.com/us/design/oracle-email-sig-198324.gif>
 <http://oracle.com/us/design/oracle-email-sig-198324.gif> <http://oracle.com/us/design/oracle-email-sig-198324.gif>
 <http://oracle.com/us/design/oracle-email-sig-198324.gif>Lance Andersen| Principal Member of Technical Staff | +1.781.442.2037
Oracle Java Engineering 
1 Network Drive 
Burlington, MA 01803
Lance.Andersen at oracle.com <mailto:Lance.Andersen at oracle.com>