Here is the webrev for Phillip’s proposed change for the splash image http://cr.openjdk.java.net/~lancea/6202130/splash/ <http://cr.openjdk.java.net/~lancea/6202130/splash/>
On Feb 10, 2020, at 4:22 PM, Lance Andersen <lance.andersen@oracle.com> wrote:
Hi all,
Here is a webrev for the patch that Philipp is proposing which will make it easier to review: http://cr.openjdk.java.net/~lancea/6202130/webrev.00 <http://cr.openjdk.java.net/~lancea/6202130/webrev.00>
On Dec 26, 2019, at 11:50 AM, Philipp Kunz <philipp.kunz@paratix.ch> wrote:
Hi, The specification says, a line break in a manifest can occur beforeor after a Unicode character encoded in UTF-8.
...> value: SPACE *otherchar newline *continuation> continuation: SPACE *otherchar newline> ...> otherchar: any UTF-8 character except NUL, CR and LF The current implementation breaks manifest lines at 72 bytes regardless ofhow the bytes around the break are part of a sequence of bytes encoding acharacter. Code points may use up to four bytes when encoded in UTF-8.Manifests with line breaks inside of sequences of bytes encoding Unicodecharacters in UTF-8 with more than one bytes not only are invalid UTF-8but also look ugly in text editors.For example, a manifest could look like this: import java.util.jar.Manifest;import java.util.jar.Attributes;import static java.util.jar.Attributes.Name; public class CharacterBrokenDemo1 { public static void main(String[] args) throws Exception{ Manifest mf = new Manifest(); Attributes attrs = mf.getMainAttributes(); attrs.put(Name.MANIFEST_VERSION, "1.0"); attrs.put(new Name("Some-Key"), "Some languages have decorated characters, " + "for example: español"); // or "espa\u00D1ol" mf.write(System.out); }} Above code produces a result as follows with some unexpected question markswhere the encoding is invalid: Manifest-Version: 1.0> Some-Key: Some languages have decorated characters, for example: espa?> ?ol This is of course an example written with actual question marks to get a validtext for this message. The trick here is that "Some-Key" to "example :espa"amounts to exactly one byte less encoded in UTF-8 than would fit on one linewith the 72 byte limit so that the subsequent character encoded with two bytesgets broken inside of the sequence of two bytes for this character across acontinuation line break. When decoding the resulting bytes from UTF-8 as one whole string, the twoquestion marks will not fit together again even if the line break with thecontinuation space is removed. However, Manifest::read removes the continuationline breaks ("\r\n ") before decoding the manifest header value from UTF-8 andhence can reproduce the original value. Characters encoded in UTF-8 can not only span up to four bytes for one codepoint each, there are also combining characters or classes thereof or combiningdiacritical marks or whatever the appropriate term could be, that combine morethan one code point into what is usually experienced and referred to as acharacter. The term character really gets ambiguous at this point. I wouldn't know whatthe specification actually refers to with that term "character". So rather thandiving in too much specification or any sorts of theory, let's look at anotherexample: import java.util.jar.Manifest;import java.util.jar.Attributes;import static java.util.jar.Attributes.Name; public class DemoCharacterBroken2 { public static void main(String[] args) throws Exception{ Manifest mf = new Manifest(); Attributes attrs = mf.getMainAttributes(); attrs.put(Name.MANIFEST_VERSION, "1.0"); attrs.put(new Name("Some-Key"), " ".repeat(53) + "Angstro\u0308m"); mf.write(System.out); }} which produces console output as follows: Manifest-Version: 1.0> Some- Key: Angstro> ̈m (In case this does not display well, the diaeresis is on the m on the last line) When the whole Manifest is decoded from UTF-8 as one big single string andcontinuation line breaks are not removed until after UTF-8 decoding the wholemanifest, the diaeresis (umlaut, two points above, u0308) apparently kind ofjumps onto the following letter m because somehow it cannot be combined withthe preceding space. The UTF-8 decoder (all of my editors I tried, not onlyEclipse and its console view, also less, gedit, cat and terminal) somehowtries to fix that but the diaeresis may not necessarily jump back on the "o"where it originally belonged by removing the continuation line break ("\r\n ")after UTF-8 decoding has taken place, at least that did not work for me. Hence, ideally combining diacritical marks should better not be separated fromwhatever they combine with when breaking manifest lines onto a continuationline. Such combinations, however, seem to be unlimited in terms of number ofcode points combining into the same "experienced" character. I was able tofind combinations that not only exceed the limit of 72 bytes per line but alsoexceed the line buffer size of 512 bytes in Manifest::read. These may be ratheruncommon but still possible to my own surprise. Next consideration would then be to remove that limit of 512 bytes per manifestline but exceeding it would make such manifests incompatible with previousManifest::read implementations and is not really an immediately availableoption at the moment. As a compromise, those characters including combining diacritical marks whichcombine only so many code points as that their binarily encoded form in UTF-8remains within a limit of 71 bytes can be written without an interruptingcontinuation line break, which applies to most cases, but not all. I guess thisshould suit practically and realistically to be expected values well. Another possibility would be to allow for characters that are combinations ofmultiple Unicode code points to be kept together in their encoded form in UTF-8up to 512 bytes line length limit when reading minus a space and a line breakamounting to 509 bytes, but that would still not make manifests be representedas valid Unicode in all corner cases and I guess would not probably make a realimprovement in practice over a limit of 71 bytes. Attached is a patch that tries to implement what was described above using aBreakIterator. While it works from a functional point of view, this might beless desirable performance-wise. Alternatively could be considered to do withoutthe BreakIterator and only keep Unicode code points together by not placingline breaks before a continuation byte, which however would not addresscombining diacritical marks as in the second example above. The jar file specification does not explicitly state that manifest should bevalid UTF-8, and they were not always, but it also does not state otherwise,leaving an impression that manifests could be (mis)taken for UTF-8 encodedstrings, which they also are in many or most cases and which has been confusedmany times. At the moment, the only case where a valid manifest is not also avalid UTF-8 encoded string is when a sequence of bytes encoding the samecharacter happens to be interrupted with a continuation line break. To the bestof my knowledge, all other valid manifests are also valid UTF-8 encoded strings. It would be nice, if manifests could be viewed and manipulated with all Unicodecapable editors and not only be parsed correctly with Manifest::read. Any opinions? Would someone sponsor this patch? Regards,Philipp
https://docs.oracle.com/en/java/javase/13/docs/specs/jar/jar.html#specificat...
<6202130-manifestutf8linebreak.patch>
<http://oracle.com/us/design/oracle-email-sig-198324.gif> <http://oracle.com/us/design/oracle-email-sig-198324.gif> <http://oracle.com/us/design/oracle-email-sig-198324.gif> <http://oracle.com/us/design/oracle-email-sig-198324.gif>Lance Andersen| Principal Member of Technical Staff | +1.781.442.2037 Oracle Java Engineering 1 Network Drive Burlington, MA 01803 Lance.Andersen@oracle.com <mailto:Lance.Andersen@oracle.com>
<http://oracle.com/us/design/oracle-email-sig-198324.gif> <http://oracle.com/us/design/oracle-email-sig-198324.gif> <http://oracle.com/us/design/oracle-email-sig-198324.gif> <http://oracle.com/us/design/oracle-email-sig-198324.gif>Lance Andersen| Principal Member of Technical Staff | +1.781.442.2037 Oracle Java Engineering 1 Network Drive Burlington, MA 01803 Lance.Andersen@oracle.com <mailto:Lance.Andersen@oracle.com>