6202130: Need to handle UTF-8 values and break up lines longer than 72 bytes
Hi, The specification says, a line break in a manifest can occur beforeor after a Unicode character encoded in UTF-8.
...> value: SPACE *otherchar newline *continuation> continuation: SPACE *otherchar newline> ...> otherchar: any UTF-8 character except NUL, CR and LF The current implementation breaks manifest lines at 72 bytes regardless ofhow the bytes around the break are part of a sequence of bytes encoding acharacter. Code points may use up to four bytes when encoded in UTF-8.Manifests with line breaks inside of sequences of bytes encoding Unicodecharacters in UTF-8 with more than one bytes not only are invalid UTF-8but also look ugly in text editors.For example, a manifest could look like this: import java.util.jar.Manifest;import java.util.jar.Attributes;import static java.util.jar.Attributes.Name; public class CharacterBrokenDemo1 { public static void main(String[] args) throws Exception{ Manifest mf = new Manifest(); Attributes attrs = mf.getMainAttributes(); attrs.put(Name.MANIFEST_VERSION, "1.0"); attrs.put(new Name("Some-Key"), "Some languages have decorated characters, " + "for example: español"); // or "espa\u00D1ol" mf.write(System.out); }} Above code produces a result as follows with some unexpected question markswhere the encoding is invalid: Manifest-Version: 1.0> Some-Key: Some languages have decorated characters, for example: espa?> ?ol This is of course an example written with actual question marks to get a validtext for this message. The trick here is that "Some-Key" to "example :espa"amounts to exactly one byte less encoded in UTF-8 than would fit on one linewith the 72 byte limit so that the subsequent character encoded with two bytesgets broken inside of the sequence of two bytes for this character across acontinuation line break. When decoding the resulting bytes from UTF-8 as one whole string, the twoquestion marks will not fit together again even if the line break with thecontinuation space is removed. However, Manifest::read removes the continuationline breaks ("\r\n ") before decoding the manifest header value from UTF-8 andhence can reproduce the original value. Characters encoded in UTF-8 can not only span up to four bytes for one codepoint each, there are also combining characters or classes thereof or combiningdiacritical marks or whatever the appropriate term could be, that combine morethan one code point into what is usually experienced and referred to as acharacter. The term character really gets ambiguous at this point. I wouldn't know whatthe specification actually refers to with that term "character". So rather thandiving in too much specification or any sorts of theory, let's look at anotherexample: import java.util.jar.Manifest;import java.util.jar.Attributes;import static java.util.jar.Attributes.Name; public class DemoCharacterBroken2 { public static void main(String[] args) throws Exception{ Manifest mf = new Manifest(); Attributes attrs = mf.getMainAttributes(); attrs.put(Name.MANIFEST_VERSION, "1.0"); attrs.put(new Name("Some-Key"), " ".repeat(53) + "Angstro\u0308m"); mf.write(System.out); }} which produces console output as follows: Manifest-Version: 1.0> Some- Key: Angstro> ̈m (In case this does not display well, the diaeresis is on the m on the last line) When the whole Manifest is decoded from UTF-8 as one big single string andcontinuation line breaks are not removed until after UTF-8 decoding the wholemanifest, the diaeresis (umlaut, two points above, u0308) apparently kind ofjumps onto the following letter m because somehow it cannot be combined withthe preceding space. The UTF-8 decoder (all of my editors I tried, not onlyEclipse and its console view, also less, gedit, cat and terminal) somehowtries to fix that but the diaeresis may not necessarily jump back on the "o"where it originally belonged by removing the continuation line break ("\r\n ")after UTF-8 decoding has taken place, at least that did not work for me. Hence, ideally combining diacritical marks should better not be separated fromwhatever they combine with when breaking manifest lines onto a continuationline. Such combinations, however, seem to be unlimited in terms of number ofcode points combining into the same "experienced" character. I was able tofind combinations that not only exceed the limit of 72 bytes per line but alsoexceed the line buffer size of 512 bytes in Manifest::read. These may be ratheruncommon but still possible to my own surprise. Next consideration would then be to remove that limit of 512 bytes per manifestline but exceeding it would make such manifests incompatible with previousManifest::read implementations and is not really an immediately availableoption at the moment. As a compromise, those characters including combining diacritical marks whichcombine only so many code points as that their binarily encoded form in UTF-8remains within a limit of 71 bytes can be written without an interruptingcontinuation line break, which applies to most cases, but not all. I guess thisshould suit practically and realistically to be expected values well. Another possibility would be to allow for characters that are combinations ofmultiple Unicode code points to be kept together in their encoded form in UTF-8up to 512 bytes line length limit when reading minus a space and a line breakamounting to 509 bytes, but that would still not make manifests be representedas valid Unicode in all corner cases and I guess would not probably make a realimprovement in practice over a limit of 71 bytes. Attached is a patch that tries to implement what was described above using aBreakIterator. While it works from a functional point of view, this might beless desirable performance-wise. Alternatively could be considered to do withoutthe BreakIterator and only keep Unicode code points together by not placingline breaks before a continuation byte, which however would not addresscombining diacritical marks as in the second example above. The jar file specification does not explicitly state that manifest should bevalid UTF-8, and they were not always, but it also does not state otherwise,leaving an impression that manifests could be (mis)taken for UTF-8 encodedstrings, which they also are in many or most cases and which has been confusedmany times. At the moment, the only case where a valid manifest is not also avalid UTF-8 encoded string is when a sequence of bytes encoding the samecharacter happens to be interrupted with a continuation line break. To the bestof my knowledge, all other valid manifests are also valid UTF-8 encoded strings. It would be nice, if manifests could be viewed and manipulated with all Unicodecapable editors and not only be parsed correctly with Manifest::read. Any opinions? Would someone sponsor this patch? Regards,Philipp
https://docs.oracle.com/en/java/javase/13/docs/specs/jar/jar.html#specificat...
Hi, I recently submitted two patches related to jar manifests and UTF-8 and haven't got any reaction so far. I understand and appreciate that everyone has not time for every wish and my enquiry is certainly not urgent, but still, may I gently ask if I may continue to hope for any progress, have missed anything, or if this of no interest at all? Unfortunately the line breaks in the previous mail went bad which is why I paste the text again below and hope it looks nicer this time. Regards, Philipp [1] https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-January/064190.ht... [2] https://mail.openjdk.java.net/pipermail/core-libs-dev/2019-December/064149.h... On Thu, 2019-12-26 at 17:50 +0100, Philipp Kunz wrote: Hi, The specification says, a line break in a manifest can occur before or after a Unicode character encoded in UTF-8. ... value: SPACE *otherchar newline *continuation continuation: SPACE *otherchar newline... otherchar: any UTF-8 character except NUL, CR and LF The current implementation breaks manifest lines at 72 bytes regardless of how the bytes around the break are part of a sequence of bytes encoding a character. Code points may use up to four bytes when encoded in UTF-8. Manifests with line breaks inside of sequences of bytes encoding Unicode characters in UTF-8 with more than one bytes not only are invalid UTF-8 but also look ugly in text editors. For example, a manifest could look like this: Manifest-Version: 1.0Some-Key: Some languages have decorated characters, for example: espa? ?ol Below code produces a result as seen above with some unexpected question marks where the encoding is invalid: import java.util.jar.Manifest;import java.util.jar.Attributes;import static java.util.jar.Attributes.Name; public class CharacterBrokenDemo1 { public static void main(String[] args) throws Exception{ Manifest mf = new Manifest(); Attributes attrs = mf.getMainAttributes(); attrs.put(Name.MANIFEST_VERSION, "1.0"); attrs.put(new Name("Some-Key"), "Some languages have decorated characters, " + "for example: español"); // or "espa\u00D1ol" mf.write(System.out); }} This is of course an example written with actual question marks to get a valid text for this message. The trick here is that "Some-Key" to "example :espa" amounts to exactly one byte less encoded in UTF-8 than would fit on one line with the 72 byte limit so that the subsequent character encoded with two bytes gets broken inside of the sequence of two bytes for this character across a continuation line break. When decoding the resulting bytes from UTF-8 as one whole string, the two question marks will not fit together again even if the line break with the continuation space is removed. However, Manifest::read removes the continuation line breaks ("\r\n ") before decoding the manifest header value from UTF-8 and hence can reproduce the original value. Characters encoded in UTF-8 can not only span up to four bytes for one code point each, there are also combining characters or classes thereof or combining diacritical marks or whatever the appropriate term could be, that combine more than one code point into what is usually experienced and referred to as a character. The term character really gets ambiguous at this point. I wouldn't know what the specification actually refers to with that term "character". So rather than diving in too much specification or any sorts of theory, let's look at another example: import java.util.jar.Manifest;import java.util.jar.Attributes;import static java.util.jar.Attributes.Name; public class DemoCharacterBroken2 { public static void main(String[] args) throws Exception{ Manifest mf = new Manifest(); Attributes attrs = mf.getMainAttributes(); attrs.put(Name.MANIFEST_VERSION, "1.0"); attrs.put(new Name("Some-Key"), " ".repeat(53) + "Angstro\u0308m"); mf.write(System.out); }} which produces console output as follows: Manifest-Version: 1.0Some-Key: Angstro ̈m (In case this does not display well, the diaeresis is on the m on the last line) When the whole Manifest is decoded from UTF-8 as one big single string and continuation line breaks are not removed until after UTF-8 decoding the whole manifest, the diaeresis (umlaut, two points above, u0308) apparently kind of jumps onto the following letter m because somehow it cannot be combined with the preceding space. The UTF-8 decoder (all of my editors I tried, not only Eclipse and its console view, also less, gedit, cat and terminal) somehow tries to fix that but the diaeresis may not necessarily jump back on the "o" where it originally belonged by removing the continuation line break ("\r\n ") after UTF-8 decoding has taken place, at least that did not work for me. Hence, ideally combining diacritical marks should better not be separated from whatever they combine with when breaking manifest lines onto a continuation line. Such combinations, however, seem to be unlimited in terms of number of code points combining into the same "experienced" character. I was able to find combinations that not only exceed the limit of 72 bytes per line but also exceed the line buffer size of 512 bytes in Manifest::read. These may be rather uncommon but still possible to my own surprise. Next consideration would then be to remove that limit of 512 bytes per manifest line but exceeding it would make such manifests incompatible with previous Manifest::read implementations and is not really an immediately available option at the moment. As a compromise, those characters including combining diacritical marks which combine only so many code points as that their binarily encoded form in UTF-8 remains within a limit of 71 bytes can be written without an interrupting continuation line break, which applies to most cases, but not all. I guess this should suit practically and realistically to be expected values well. Another possibility would be to allow for characters that are combinations of multiple Unicode code points to be kept together in their encoded form in UTF-8 up to 512 bytes line length limit when reading minus a space and a line break amounting to 509 bytes, but that would still not make manifests be represented as valid Unicode in all corner cases and I guess would not probably make a real improvement in practice over a limit of 71 bytes. Attached is a patch that tries to implement what was described above using a BreakIterator. While it works from a functional point of view, this might be less desirable performance-wise. Alternatively could be considered to do without the BreakIterator and only keep Unicode code points together by not placing line breaks before a continuation byte, which however would not address combining diacritical marks as in the second example above. The jar file specification does not explicitly state that manifest should be valid UTF-8, and they were not always, but it also does not state otherwise, leaving an impression that manifests could be (mis)taken for UTF-8 encoded strings, which they also are in many or most cases and which has been confused many times. At the moment, the only case where a valid manifest is not also a valid UTF-8 encoded string is when a sequence of bytes encoding the same character happens to be interrupted with a continuation line break. To the best of my knowledge, all other valid manifests are also valid UTF-8 encoded strings. It would be nice, if manifests could be viewed and manipulated with all Unicode capable editors and not only be parsed correctly with Manifest::read. Any opinions? Would someone sponsor this patch? Regards,Philipp https://docs.oracle.com/en/java/javase/13/docs/specs/jar/jar.html#specificat...
Hi Philipp, Thank you for the reminder. Your email came in during the holidays so it kind of was lost in my backlog. First, thank you for the proposed patch. This will take some time to review but it is on the list Best Lance
On Feb 6, 2020, at 1:37 PM, Philipp Kunz <philipp.kunz@paratix.ch> wrote:
Hi,
I recently submitted two patches related to jar manifests and UTF-8 and haven't got any reaction so far. I understand and appreciate that everyone has not time for every wish and my enquiry is certainly not urgent, but still, may I gently ask if I may continue to hope for any progress, have missed anything, or if this of no interest at all?
Unfortunately the line breaks in the previous mail went bad which is why I paste the text again below and hope it looks nicer this time.
Regards, Philipp
[1] https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-January/064190.ht... <https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-January/064190.html> [2] https://mail.openjdk.java.net/pipermail/core-libs-dev/2019-December/064149.h... <https://mail.openjdk.java.net/pipermail/core-libs-dev/2019-December/064149.html>
On Thu, 2019-12-26 at 17:50 +0100, Philipp Kunz wrote: Hi, The specification says, a line break in a manifest can occur before or after a Unicode character encoded in UTF-8. ... value: SPACE *otherchar newline *continuation continuation: SPACE *otherchar newline... otherchar: any UTF-8 character except NUL, CR and LF The current implementation breaks manifest lines at 72 bytes regardless of how the bytes around the break are part of a sequence of bytes encoding a character. Code points may use up to four bytes when encoded in UTF-8. Manifests with line breaks inside of sequences of bytes encoding Unicode characters in UTF-8 with more than one bytes not only are invalid UTF-8 but also look ugly in text editors. For example, a manifest could look like this: Manifest-Version: 1.0Some-Key: Some languages have decorated characters, for example: espa? ?ol Below code produces a result as seen above with some unexpected question marks where the encoding is invalid: import java.util.jar.Manifest;import java.util.jar.Attributes;import static java.util.jar.Attributes.Name; public class CharacterBrokenDemo1 { public static void main(String[] args) throws Exception{ Manifest mf = new Manifest(); Attributes attrs = mf.getMainAttributes(); attrs.put(Name.MANIFEST_VERSION, "1.0"); attrs.put(new Name("Some-Key"), "Some languages have decorated characters, " + "for example: español"); // or "espa\u00D1ol" mf.write(System.out); }} This is of course an example written with actual question marks to get a valid text for this message. The trick here is that "Some-Key" to "example :espa" amounts to exactly one byte less encoded in UTF-8 than would fit on one line with the 72 byte limit so that the subsequent character encoded with two bytes gets broken inside of the sequence of two bytes for this character across a continuation line break. When decoding the resulting bytes from UTF-8 as one whole string, the two question marks will not fit together again even if the line break with the continuation space is removed. However, Manifest::read removes the continuation line breaks ("\r\n ") before decoding the manifest header value from UTF-8 and hence can reproduce the original value. Characters encoded in UTF-8 can not only span up to four bytes for one code point each, there are also combining characters or classes thereof or combining diacritical marks or whatever the appropriate term could be, that combine more than one code point into what is usually experienced and referred to as a character. The term character really gets ambiguous at this point. I wouldn't know what the specification actually refers to with that term "character". So rather than diving in too much specification or any sorts of theory, let's look at another example: import java.util.jar.Manifest;import java.util.jar.Attributes;import static java.util.jar.Attributes.Name; public class DemoCharacterBroken2 { public static void main(String[] args) throws Exception{ Manifest mf = new Manifest(); Attributes attrs = mf.getMainAttributes(); attrs.put(Name.MANIFEST_VERSION, "1.0"); attrs.put(new Name("Some-Key"), " ".repeat(53) + "Angstro\u0308m"); mf.write(System.out); }} which produces console output as follows: Manifest-Version: 1.0Some-Key: Angstro ̈m (In case this does not display well, the diaeresis is on the m on the last line) When the whole Manifest is decoded from UTF-8 as one big single string and continuation line breaks are not removed until after UTF-8 decoding the whole manifest, the diaeresis (umlaut, two points above, u0308) apparently kind of jumps onto the following letter m because somehow it cannot be combined with the preceding space. The UTF-8 decoder (all of my editors I tried, not only Eclipse and its console view, also less, gedit, cat and terminal) somehow tries to fix that but the diaeresis may not necessarily jump back on the "o" where it originally belonged by removing the continuation line break ("\r\n ") after UTF-8 decoding has taken place, at least that did not work for me. Hence, ideally combining diacritical marks should better not be separated from whatever they combine with when breaking manifest lines onto a continuation line. Such combinations, however, seem to be unlimited in terms of number of code points combining into the same "experienced" character. I was able to find combinations that not only exceed the limit of 72 bytes per line but also exceed the line buffer size of 512 bytes in Manifest::read. These may be rather uncommon but still possible to my own surprise. Next consideration would then be to remove that limit of 512 bytes per manifest line but exceeding it would make such manifests incompatible with previous Manifest::read implementations and is not really an immediately available option at the moment. As a compromise, those characters including combining diacritical marks which combine only so many code points as that their binarily encoded form in UTF-8 remains within a limit of 71 bytes can be written without an interrupting continuation line break, which applies to most cases, but not all. I guess this should suit practically and realistically to be expected values well. Another possibility would be to allow for characters that are combinations of multiple Unicode code points to be kept together in their encoded form in UTF-8 up to 512 bytes line length limit when reading minus a space and a line break amounting to 509 bytes, but that would still not make manifests be represented as valid Unicode in all corner cases and I guess would not probably make a real improvement in practice over a limit of 71 bytes. Attached is a patch that tries to implement what was described above using a BreakIterator. While it works from a functional point of view, this might be less desirable performance-wise. Alternatively could be considered to do without the BreakIterator and only keep Unicode code points together by not placing line breaks before a continuation byte, which however would not address combining diacritical marks as in the second example above. The jar file specification does not explicitly state that manifest should be valid UTF-8, and they were not always, but it also does not state otherwise, leaving an impression that manifests could be (mis)taken for UTF-8 encoded strings, which they also are in many or most cases and which has been confused many times. At the moment, the only case where a valid manifest is not also a valid UTF-8 encoded string is when a sequence of bytes encoding the same character happens to be interrupted with a continuation line break. To the best of my knowledge, all other valid manifests are also valid UTF-8 encoded strings. It would be nice, if manifests could be viewed and manipulated with all Unicode capable editors and not only be parsed correctly with Manifest::read. Any opinions? Would someone sponsor this patch? Regards,Philipp
https://docs.oracle.com/en/java/javase/13/docs/specs/jar/jar.html#specificat...
<http://oracle.com/us/design/oracle-email-sig-198324.gif> <http://oracle.com/us/design/oracle-email-sig-198324.gif> <http://oracle.com/us/design/oracle-email-sig-198324.gif> <http://oracle.com/us/design/oracle-email-sig-198324.gif>Lance Andersen| Principal Member of Technical Staff | +1.781.442.2037 Oracle Java Engineering 1 Network Drive Burlington, MA 01803 Lance.Andersen@oracle.com <mailto:Lance.Andersen@oracle.com>
Hi all, Here is a webrev for the patch that Philipp is proposing which will make it easier to review: http://cr.openjdk.java.net/~lancea/6202130/webrev.00 <http://cr.openjdk.java.net/~lancea/6202130/webrev.00>
On Dec 26, 2019, at 11:50 AM, Philipp Kunz <philipp.kunz@paratix.ch> wrote:
Hi, The specification says, a line break in a manifest can occur beforeor after a Unicode character encoded in UTF-8.
...> value: SPACE *otherchar newline *continuation> continuation: SPACE *otherchar newline> ...> otherchar: any UTF-8 character except NUL, CR and LF The current implementation breaks manifest lines at 72 bytes regardless ofhow the bytes around the break are part of a sequence of bytes encoding acharacter. Code points may use up to four bytes when encoded in UTF-8.Manifests with line breaks inside of sequences of bytes encoding Unicodecharacters in UTF-8 with more than one bytes not only are invalid UTF-8but also look ugly in text editors.For example, a manifest could look like this: import java.util.jar.Manifest;import java.util.jar.Attributes;import static java.util.jar.Attributes.Name; public class CharacterBrokenDemo1 { public static void main(String[] args) throws Exception{ Manifest mf = new Manifest(); Attributes attrs = mf.getMainAttributes(); attrs.put(Name.MANIFEST_VERSION, "1.0"); attrs.put(new Name("Some-Key"), "Some languages have decorated characters, " + "for example: español"); // or "espa\u00D1ol" mf.write(System.out); }} Above code produces a result as follows with some unexpected question markswhere the encoding is invalid: Manifest-Version: 1.0> Some-Key: Some languages have decorated characters, for example: espa?> ?ol This is of course an example written with actual question marks to get a validtext for this message. The trick here is that "Some-Key" to "example :espa"amounts to exactly one byte less encoded in UTF-8 than would fit on one linewith the 72 byte limit so that the subsequent character encoded with two bytesgets broken inside of the sequence of two bytes for this character across acontinuation line break. When decoding the resulting bytes from UTF-8 as one whole string, the twoquestion marks will not fit together again even if the line break with thecontinuation space is removed. However, Manifest::read removes the continuationline breaks ("\r\n ") before decoding the manifest header value from UTF-8 andhence can reproduce the original value. Characters encoded in UTF-8 can not only span up to four bytes for one codepoint each, there are also combining characters or classes thereof or combiningdiacritical marks or whatever the appropriate term could be, that combine morethan one code point into what is usually experienced and referred to as acharacter. The term character really gets ambiguous at this point. I wouldn't know whatthe specification actually refers to with that term "character". So rather thandiving in too much specification or any sorts of theory, let's look at anotherexample: import java.util.jar.Manifest;import java.util.jar.Attributes;import static java.util.jar.Attributes.Name; public class DemoCharacterBroken2 { public static void main(String[] args) throws Exception{ Manifest mf = new Manifest(); Attributes attrs = mf.getMainAttributes(); attrs.put(Name.MANIFEST_VERSION, "1.0"); attrs.put(new Name("Some-Key"), " ".repeat(53) + "Angstro\u0308m"); mf.write(System.out); }} which produces console output as follows: Manifest-Version: 1.0> Some- Key: Angstro> ̈m (In case this does not display well, the diaeresis is on the m on the last line) When the whole Manifest is decoded from UTF-8 as one big single string andcontinuation line breaks are not removed until after UTF-8 decoding the wholemanifest, the diaeresis (umlaut, two points above, u0308) apparently kind ofjumps onto the following letter m because somehow it cannot be combined withthe preceding space. The UTF-8 decoder (all of my editors I tried, not onlyEclipse and its console view, also less, gedit, cat and terminal) somehowtries to fix that but the diaeresis may not necessarily jump back on the "o"where it originally belonged by removing the continuation line break ("\r\n ")after UTF-8 decoding has taken place, at least that did not work for me. Hence, ideally combining diacritical marks should better not be separated fromwhatever they combine with when breaking manifest lines onto a continuationline. Such combinations, however, seem to be unlimited in terms of number ofcode points combining into the same "experienced" character. I was able tofind combinations that not only exceed the limit of 72 bytes per line but alsoexceed the line buffer size of 512 bytes in Manifest::read. These may be ratheruncommon but still possible to my own surprise. Next consideration would then be to remove that limit of 512 bytes per manifestline but exceeding it would make such manifests incompatible with previousManifest::read implementations and is not really an immediately availableoption at the moment. As a compromise, those characters including combining diacritical marks whichcombine only so many code points as that their binarily encoded form in UTF-8remains within a limit of 71 bytes can be written without an interruptingcontinuation line break, which applies to most cases, but not all. I guess thisshould suit practically and realistically to be expected values well. Another possibility would be to allow for characters that are combinations ofmultiple Unicode code points to be kept together in their encoded form in UTF-8up to 512 bytes line length limit when reading minus a space and a line breakamounting to 509 bytes, but that would still not make manifests be representedas valid Unicode in all corner cases and I guess would not probably make a realimprovement in practice over a limit of 71 bytes. Attached is a patch that tries to implement what was described above using aBreakIterator. While it works from a functional point of view, this might beless desirable performance-wise. Alternatively could be considered to do withoutthe BreakIterator and only keep Unicode code points together by not placingline breaks before a continuation byte, which however would not addresscombining diacritical marks as in the second example above. The jar file specification does not explicitly state that manifest should bevalid UTF-8, and they were not always, but it also does not state otherwise,leaving an impression that manifests could be (mis)taken for UTF-8 encodedstrings, which they also are in many or most cases and which has been confusedmany times. At the moment, the only case where a valid manifest is not also avalid UTF-8 encoded string is when a sequence of bytes encoding the samecharacter happens to be interrupted with a continuation line break. To the bestof my knowledge, all other valid manifests are also valid UTF-8 encoded strings. It would be nice, if manifests could be viewed and manipulated with all Unicodecapable editors and not only be parsed correctly with Manifest::read. Any opinions? Would someone sponsor this patch? Regards,Philipp
https://docs.oracle.com/en/java/javase/13/docs/specs/jar/jar.html#specificat...
<6202130-manifestutf8linebreak.patch>
<http://oracle.com/us/design/oracle-email-sig-198324.gif> <http://oracle.com/us/design/oracle-email-sig-198324.gif> <http://oracle.com/us/design/oracle-email-sig-198324.gif> <http://oracle.com/us/design/oracle-email-sig-198324.gif>Lance Andersen| Principal Member of Technical Staff | +1.781.442.2037 Oracle Java Engineering 1 Network Drive Burlington, MA 01803 Lance.Andersen@oracle.com <mailto:Lance.Andersen@oracle.com>
Here is the webrev for Phillip’s proposed change for the splash image http://cr.openjdk.java.net/~lancea/6202130/splash/ <http://cr.openjdk.java.net/~lancea/6202130/splash/>
On Feb 10, 2020, at 4:22 PM, Lance Andersen <lance.andersen@oracle.com> wrote:
Hi all,
Here is a webrev for the patch that Philipp is proposing which will make it easier to review: http://cr.openjdk.java.net/~lancea/6202130/webrev.00 <http://cr.openjdk.java.net/~lancea/6202130/webrev.00>
On Dec 26, 2019, at 11:50 AM, Philipp Kunz <philipp.kunz@paratix.ch> wrote:
Hi, The specification says, a line break in a manifest can occur beforeor after a Unicode character encoded in UTF-8.
...> value: SPACE *otherchar newline *continuation> continuation: SPACE *otherchar newline> ...> otherchar: any UTF-8 character except NUL, CR and LF The current implementation breaks manifest lines at 72 bytes regardless ofhow the bytes around the break are part of a sequence of bytes encoding acharacter. Code points may use up to four bytes when encoded in UTF-8.Manifests with line breaks inside of sequences of bytes encoding Unicodecharacters in UTF-8 with more than one bytes not only are invalid UTF-8but also look ugly in text editors.For example, a manifest could look like this: import java.util.jar.Manifest;import java.util.jar.Attributes;import static java.util.jar.Attributes.Name; public class CharacterBrokenDemo1 { public static void main(String[] args) throws Exception{ Manifest mf = new Manifest(); Attributes attrs = mf.getMainAttributes(); attrs.put(Name.MANIFEST_VERSION, "1.0"); attrs.put(new Name("Some-Key"), "Some languages have decorated characters, " + "for example: español"); // or "espa\u00D1ol" mf.write(System.out); }} Above code produces a result as follows with some unexpected question markswhere the encoding is invalid: Manifest-Version: 1.0> Some-Key: Some languages have decorated characters, for example: espa?> ?ol This is of course an example written with actual question marks to get a validtext for this message. The trick here is that "Some-Key" to "example :espa"amounts to exactly one byte less encoded in UTF-8 than would fit on one linewith the 72 byte limit so that the subsequent character encoded with two bytesgets broken inside of the sequence of two bytes for this character across acontinuation line break. When decoding the resulting bytes from UTF-8 as one whole string, the twoquestion marks will not fit together again even if the line break with thecontinuation space is removed. However, Manifest::read removes the continuationline breaks ("\r\n ") before decoding the manifest header value from UTF-8 andhence can reproduce the original value. Characters encoded in UTF-8 can not only span up to four bytes for one codepoint each, there are also combining characters or classes thereof or combiningdiacritical marks or whatever the appropriate term could be, that combine morethan one code point into what is usually experienced and referred to as acharacter. The term character really gets ambiguous at this point. I wouldn't know whatthe specification actually refers to with that term "character". So rather thandiving in too much specification or any sorts of theory, let's look at anotherexample: import java.util.jar.Manifest;import java.util.jar.Attributes;import static java.util.jar.Attributes.Name; public class DemoCharacterBroken2 { public static void main(String[] args) throws Exception{ Manifest mf = new Manifest(); Attributes attrs = mf.getMainAttributes(); attrs.put(Name.MANIFEST_VERSION, "1.0"); attrs.put(new Name("Some-Key"), " ".repeat(53) + "Angstro\u0308m"); mf.write(System.out); }} which produces console output as follows: Manifest-Version: 1.0> Some- Key: Angstro> ̈m (In case this does not display well, the diaeresis is on the m on the last line) When the whole Manifest is decoded from UTF-8 as one big single string andcontinuation line breaks are not removed until after UTF-8 decoding the wholemanifest, the diaeresis (umlaut, two points above, u0308) apparently kind ofjumps onto the following letter m because somehow it cannot be combined withthe preceding space. The UTF-8 decoder (all of my editors I tried, not onlyEclipse and its console view, also less, gedit, cat and terminal) somehowtries to fix that but the diaeresis may not necessarily jump back on the "o"where it originally belonged by removing the continuation line break ("\r\n ")after UTF-8 decoding has taken place, at least that did not work for me. Hence, ideally combining diacritical marks should better not be separated fromwhatever they combine with when breaking manifest lines onto a continuationline. Such combinations, however, seem to be unlimited in terms of number ofcode points combining into the same "experienced" character. I was able tofind combinations that not only exceed the limit of 72 bytes per line but alsoexceed the line buffer size of 512 bytes in Manifest::read. These may be ratheruncommon but still possible to my own surprise. Next consideration would then be to remove that limit of 512 bytes per manifestline but exceeding it would make such manifests incompatible with previousManifest::read implementations and is not really an immediately availableoption at the moment. As a compromise, those characters including combining diacritical marks whichcombine only so many code points as that their binarily encoded form in UTF-8remains within a limit of 71 bytes can be written without an interruptingcontinuation line break, which applies to most cases, but not all. I guess thisshould suit practically and realistically to be expected values well. Another possibility would be to allow for characters that are combinations ofmultiple Unicode code points to be kept together in their encoded form in UTF-8up to 512 bytes line length limit when reading minus a space and a line breakamounting to 509 bytes, but that would still not make manifests be representedas valid Unicode in all corner cases and I guess would not probably make a realimprovement in practice over a limit of 71 bytes. Attached is a patch that tries to implement what was described above using aBreakIterator. While it works from a functional point of view, this might beless desirable performance-wise. Alternatively could be considered to do withoutthe BreakIterator and only keep Unicode code points together by not placingline breaks before a continuation byte, which however would not addresscombining diacritical marks as in the second example above. The jar file specification does not explicitly state that manifest should bevalid UTF-8, and they were not always, but it also does not state otherwise,leaving an impression that manifests could be (mis)taken for UTF-8 encodedstrings, which they also are in many or most cases and which has been confusedmany times. At the moment, the only case where a valid manifest is not also avalid UTF-8 encoded string is when a sequence of bytes encoding the samecharacter happens to be interrupted with a continuation line break. To the bestof my knowledge, all other valid manifests are also valid UTF-8 encoded strings. It would be nice, if manifests could be viewed and manipulated with all Unicodecapable editors and not only be parsed correctly with Manifest::read. Any opinions? Would someone sponsor this patch? Regards,Philipp
https://docs.oracle.com/en/java/javase/13/docs/specs/jar/jar.html#specificat...
<6202130-manifestutf8linebreak.patch>
<http://oracle.com/us/design/oracle-email-sig-198324.gif> <http://oracle.com/us/design/oracle-email-sig-198324.gif> <http://oracle.com/us/design/oracle-email-sig-198324.gif> <http://oracle.com/us/design/oracle-email-sig-198324.gif>Lance Andersen| Principal Member of Technical Staff | +1.781.442.2037 Oracle Java Engineering 1 Network Drive Burlington, MA 01803 Lance.Andersen@oracle.com <mailto:Lance.Andersen@oracle.com>
<http://oracle.com/us/design/oracle-email-sig-198324.gif> <http://oracle.com/us/design/oracle-email-sig-198324.gif> <http://oracle.com/us/design/oracle-email-sig-198324.gif> <http://oracle.com/us/design/oracle-email-sig-198324.gif>Lance Andersen| Principal Member of Technical Staff | +1.781.442.2037 Oracle Java Engineering 1 Network Drive Burlington, MA 01803 Lance.Andersen@oracle.com <mailto:Lance.Andersen@oracle.com>
Hi, Can the code in Manifest.java be simplified using regex? E.g., var gc = Pattern.compile("\\X"); gc.matcher(line).results() .map(MatchResult::group) .forEach( // print till 72 bytes in UTF-8 ); Just a thought. If BreakIterator is preferred, it should take Locale.US as an argument to the factory method, so that it would produce the same result no matter what the default locale is. Naoto On 2/10/20 1:22 PM, Lance Andersen wrote:
Hi all,
Here is a webrev for the patch that Philipp is proposing which will make it easier to review: http://cr.openjdk.java.net/~lancea/6202130/webrev.00 <http://cr.openjdk.java.net/~lancea/6202130/webrev.00>
On Dec 26, 2019, at 11:50 AM, Philipp Kunz <philipp.kunz@paratix.ch> wrote:
Hi, The specification says, a line break in a manifest can occur beforeor after a Unicode character encoded in UTF-8.
...> value: SPACE *otherchar newline *continuation> continuation: SPACE *otherchar newline> ...> otherchar: any UTF-8 character except NUL, CR and LF The current implementation breaks manifest lines at 72 bytes regardless ofhow the bytes around the break are part of a sequence of bytes encoding acharacter. Code points may use up to four bytes when encoded in UTF-8.Manifests with line breaks inside of sequences of bytes encoding Unicodecharacters in UTF-8 with more than one bytes not only are invalid UTF-8but also look ugly in text editors.For example, a manifest could look like this: import java.util.jar.Manifest;import java.util.jar.Attributes;import static java.util.jar.Attributes.Name; public class CharacterBrokenDemo1 { public static void main(String[] args) throws Exception{ Manifest mf = new Manifest(); Attributes attrs = mf.getMainAttributes(); attrs.put(Name.MANIFEST_VERSION, "1.0"); attrs.put(new Name("Some-Key"), "Some languages have decorated characters, " + "for example: español"); // or "espa\u00D1ol" mf.write(System.out); }} Above code produces a result as follows with some unexpected question markswhere the encoding is invalid: Manifest-Version: 1.0> Some-Key: Some languages have decorated characters, for example: espa?> ?ol This is of course an example written with actual question marks to get a validtext for this message. The trick here is that "Some-Key" to "example :espa"amounts to exactly one byte less encoded in UTF-8 than would fit on one linewith the 72 byte limit so that the subsequent character encoded with two bytesgets broken inside of the sequence of two bytes for this character across acontinuation line break. When decoding the resulting bytes from UTF-8 as one whole string, the twoquestion marks will not fit together again even if the line break with thecontinuation space is removed. However, Manifest::read removes the continuationline breaks ("\r\n ") before decoding the manifest header value from UTF-8 andhence can reproduce the original value. Characters encoded in UTF-8 can not only span up to four bytes for one codepoint each, there are also combining characters or classes thereof or combiningdiacritical marks or whatever the appropriate term could be, that combine morethan one code point into what is usually experienced and referred to as acharacter. The term character really gets ambiguous at this point. I wouldn't know whatthe specification actually refers to with that term "character". So rather thandiving in too much specification or any sorts of theory, let's look at anotherexample: import java.util.jar.Manifest;import java.util.jar.Attributes;import static java.util.jar.Attributes.Name; public class DemoCharacterBroken2 { public static void main(String[] args) throws Exception{ Manifest mf = new Manifest(); Attributes attrs = mf.getMainAttributes(); attrs.put(Name.MANIFEST_VERSION, "1.0"); attrs.put(new Name("Some-Key"), " ".repeat(53) + "Angstro\u0308m"); mf.write(System.out); }} which produces console output as follows: Manifest-Version: 1.0> Some- Key: Angstro> ̈m (In case this does not display well, the diaeresis is on the m on the last line) When the whole Manifest is decoded from UTF-8 as one big single string andcontinuation line breaks are not removed until after UTF-8 decoding the wholemanifest, the diaeresis (umlaut, two points above, u0308) apparently kind ofjumps onto the following letter m because somehow it cannot be combined withthe preceding space. The UTF-8 decoder (all of my editors I tried, not onlyEclipse and its console view, also less, gedit, cat and terminal) somehowtries to fix that but the diaeresis may not necessarily jump back on the "o"where it originally belonged by removing the continuation line break ("\r\n ")after UTF-8 decoding has taken place, at least that did not work for me. Hence, ideally combining diacritical marks should better not be separated fromwhatever they combine with when breaking manifest lines onto a continuationline. Such combinations, however, seem to be unlimited in terms of number ofcode points combining into the same "experienced" character. I was able tofind combinations that not only exceed the limit of 72 bytes per line but alsoexceed the line buffer size of 512 bytes in Manifest::read. These may be ratheruncommon but still possible to my own surprise. Next consideration would then be to remove that limit of 512 bytes per manifestline but exceeding it would make such manifests incompatible with previousManifest::read implementations and is not really an immediately availableoption at the moment. As a compromise, those characters including combining diacritical marks whichcombine only so many code points as that their binarily encoded form in UTF-8remains within a limit of 71 bytes can be written without an interruptingcontinuation line break, which applies to most cases, but not all. I guess thisshould suit practically and realistically to be expected values well. Another possibility would be to allow for characters that are combinations ofmultiple Unicode code points to be kept together in their encoded form in UTF-8up to 512 bytes line length limit when reading minus a space and a line breakamounting to 509 bytes, but that would still not make manifests be representedas valid Unicode in all corner cases and I guess would not probably make a realimprovement in practice over a limit of 71 bytes. Attached is a patch that tries to implement what was described above using aBreakIterator. While it works from a functional point of view, this might beless desirable performance-wise. Alternatively could be considered to do withoutthe BreakIterator and only keep Unicode code points together by not placingline breaks before a continuation byte, which however would not addresscombining diacritical marks as in the second example above. The jar file specification does not explicitly state that manifest should bevalid UTF-8, and they were not always, but it also does not state otherwise,leaving an impression that manifests could be (mis)taken for UTF-8 encodedstrings, which they also are in many or most cases and which has been confusedmany times. At the moment, the only case where a valid manifest is not also avalid UTF-8 encoded string is when a sequence of bytes encoding the samecharacter happens to be interrupted with a continuation line break. To the bestof my knowledge, all other valid manifests are also valid UTF-8 encoded strings. It would be nice, if manifests could be viewed and manipulated with all Unicodecapable editors and not only be parsed correctly with Manifest::read. Any opinions? Would someone sponsor this patch? Regards,Philipp
https://docs.oracle.com/en/java/javase/13/docs/specs/jar/jar.html#specificat...
<6202130-manifestutf8linebreak.patch>
<http://oracle.com/us/design/oracle-email-sig-198324.gif> <http://oracle.com/us/design/oracle-email-sig-198324.gif> <http://oracle.com/us/design/oracle-email-sig-198324.gif> <http://oracle.com/us/design/oracle-email-sig-198324.gif>Lance Andersen| Principal Member of Technical Staff | +1.781.442.2037 Oracle Java Engineering 1 Network Drive Burlington, MA 01803 Lance.Andersen@oracle.com <mailto:Lance.Andersen@oracle.com>
Hi Naoto and everyone, When I tried the regular expression approach you proposed which really looks delightful, ValueUtf8Coding test ran into a timeout, which is why I changed Pattern and Grapheme classes to start the loop in nextBoundary to start at the offset off rather than at the beginning of the character sequence src rendering isBoundary redundant. The changes to Pattern and Grapheme relate only by means of performance and could otherwise be applied seperately as well. I admit I'm not too confident and familiar with the regex parts of the openjdk and would welcome not only thorough review as usual but also any other feedback or thought shared about the matter. Then I started to wonder about an issue of copy paste between Grapheme and GraphemeTest already out of date and also about some comments in RegExTest and if no more matches shouldn't also be asserted, which also could possibly be changed in a different change but still relate to the actual changes to certain extent. I could not find a way to access UCD_FILES which has no or is in the default package. What about passing the name and the value parameters of Manifest.printLine72 to it as separate parameters? Because names cannot exceed 70 bytes and the allowed characters are limited to ones that are represented with one byte each in UTF-8 the name can/could always be printed without a line break. I don't know if Manifest.println72 can be removed or has to be retained and deprecated for compatibility. See attached patch. Looking forward to receiving any kind of feedback and regards,Philipp On Tue, 2020-02-11 at 14:20 -0800, naoto.sato@oracle.com wrote:
Hi, Can the code in Manifest.java be simplified using regex? E.g., var gc = Pattern.compile("\\X");gc.matcher(line).results() .map(MatchResul t::group) .forEach( // print till 72 bytes in UTF-8 ); Just a thought. If BreakIterator is preferred, it should take Locale.US as an argument to the factory method, so that it would produce the same result no matter what the default locale is. Naoto On 2/10/20 1:22 PM, Lance Andersen wrote:
Hi all, Here is a webrev for the patch that Philipp is proposing which will make it easier to review: http://cr.openjdk.java.net/~lancea/6202130/webrev.00 < http://cr.openjdk.java.net/~lancea/6202130/webrev.00>
On Dec 26, 2019, at 11:50 AM, Philipp Kunz < philipp.kunz@paratix.ch> wrote: Hi,The specification says, a line break in a manifest can occur beforeorafter a Unicode character encoded in UTF-8.
...> value: SPACE *otherchar newline *continuation> continuation: SPACE *othercharnewline> ...> otherchar: any UTF-8 character except NUL, CRand LFThe current implementation breaks manifest lines at 72 bytes regardlessofhow the bytes around the break are part of a sequence of bytesencoding acharacter. Code points may use up to four bytes when encodedin UTF-8.Manifests with line breaks inside of sequences of bytesencoding Unicodecharacters in UTF-8 with more than one bytes not onlyare invalid UTF-8but also look ugly in text editors.For example, amanifest could look like this:import java.util.jar.Manifest;import java.util.jar.Attributes;importstatic java.util.jar.Attributes.Name;public class CharacterBrokenDemo1 { public static void main(String[]args) throws Exception{ Manifest mf = newManifest(); Attributes attrs =mf.getMainAttributes(); attrs.put(Name.MANIFEST_VERSION," 1.0"); attrs.put(new Name("Some- Key"), "Somelanguages have decorated characters, " + "forexample: español"); // or"espa\u00D1ol" mf.write(System.out); }}Above code produces a result as follows with some unexpected questionmarkswhere the encoding is invalid: Manifest-Version: 1.0> Some-Key: Some languages have decorated characters, for example: espa?> ?olThis is of course an example written with actual question marks to geta validtext for this message. The trick here is that "Some-Key" to"example :espa"amounts to exactly one byte less encoded in UTF-8 thanwould fit on one linewith the 72 byte limit so that the subsequentcharacter encoded with two bytesgets broken inside of the sequence oftwo bytes for this character across acontinuation line break.When decoding the resulting bytes from UTF-8 as one whole string, thetwoquestion marks will not fit together again even if the line breakwith thecontinuation space is removed. However, Manifest::read removesthe continuationline breaks ("\r\n ") before decoding the manifestheader value from UTF-8 andhence can reproduce the original value.Characters encoded in UTF-8 can not only span up to four bytes for onecodepoint each, there are also combining characters or classes thereofor combiningdiacritical marks or whatever the appropriate term couldbe, that combine morethan one code point into what is usuallyexperienced and referred to as acharacter.The term character really gets ambiguous at this point. I wouldn't knowwhatthe specification actually refers to with that term "character". Sorather thandiving in too much specification or any sorts of theory,let's look at anotherexample:import java.util.jar.Manifest;import java.util.jar.Attributes;importstatic java.util.jar.Attributes.Name;public class DemoCharacterBroken2 { public static void main(String[]args) throws Exception{ Manifest mf = newManifest(); Attributes attrs =mf.getMainAttributes(); attrs.put(Name.MANIFEST_VERSION," 1.0"); attrs.put(new Name("Some-Key"), " ".repeat(53) +"Angstro\u0308m"); mf.write(System.out); }}which produces console output as follows: Manifest-Version: 1.0> Some- Key: Angstro ̈m(In case this does not display well, the diaeresis is on the m on thelast line)When the whole Manifest is decoded from UTF-8 as one big single stringandcontinuation line breaks are not removed until after UTF-8 decodingthe wholemanifest, the diaeresis (umlaut, two points above, u0308)apparently kind ofjumps onto the following letter m because somehow itcannot be combined withthe preceding space. The UTF-8 decoder (all ofmy editors I tried, not onlyEclipse and its console view, also less,gedit, cat and terminal) somehowtries to fix that but the diaeresis maynot necessarily jump back on the "o"where it originally belonged byremoving the continuation line break ("\r\n ")after UTF-8 decoding hastaken place, at least that did not work for me.Hence, ideally combining diacritical marks should better not beseparated fromwhatever they combine with when breaking manifest linesonto a continuationline. Such combinations, however, seem to beunlimited in terms of number ofcode points combining into the same"experienced" character. I was able tofind combinations that not onlyexceed the limit of 72 bytes per line but alsoexceed the line buffersize of 512 bytes in Manifest::read. These may be ratheruncommon butstill possible to my own surprise.Next consideration would then be to remove that limit of 512 bytes permanifestline but exceeding it would make such manifests incompatiblewith previousManifest::read implementations and is not really animmediately availableoption at the moment.As a compromise, those characters including combining diacritical markswhichcombine only so many code points as that their binarily encodedform in UTF-8remains within a limit of 71 bytes can be written withoutan interruptingcontinuation line break, which applies to most cases,but not all. I guess thisshould suit practically and realistically tobe expected values well.Another possibility would be to allow for characters that arecombinations ofmultiple Unicode code points to be kept together intheir encoded form in UTF-8up to 512 bytes line length limit whenreading minus a space and a line breakamounting to 509 bytes, but thatwould still not make manifests be representedas valid Unicode in allcorner cases and I guess would not probably make a realimprovement inpractice over a limit of 71 bytes.Attached is a patch that tries to implement what was described aboveusing aBreakIterator. While it works from a functional point of view,this might beless desirable performance-wise. Alternatively could beconsidered to do withoutthe BreakIterator and only keep Unicode codepoints together by not placingline breaks before a continuation byte,which however would not addresscombining diacritical marks as in thesecond example above.The jar file specification does not explicitly state that manifestshould bevalid UTF-8, and they were not always, but it also does notstate otherwise,leaving an impression that manifests could be(mis)taken for UTF-8 encodedstrings, which they also are in many ormost cases and which has been confusedmany times. At the moment, theonly case where a valid manifest is not also avalid UTF-8 encodedstring is when a sequence of bytes encoding the samecharacter happensto be interrupted with a continuation line break. To the bestof myknowledge, all other valid manifests are also valid UTF-8 encodedstrings.It would be nice, if manifests could be viewed and manipulated with allUnicodecapable editors and not only be parsed correctly withManifest::read.Any opinions? Would someone sponsor this patch?Regards,Philipp https://docs.oracle.com/en/java/javase/13/docs/specs/jar/jar.html#specificat...
<6202130-manifestutf8linebreak.patch>
<http://oracle.com/us/design/oracle-email-sig-198324.gif> < http://oracle.com/us/design/oracle-email-sig-198324.gif> < http://oracle.com/us/design/oracle-email-sig-198324.gif> < http://oracle.com/us/design/oracle-email-sig-198324.gif>Lance Andersen| Principal Member of Technical Staff | +1.781.442.2037Oracle Java Engineering1 Network DriveBurlington, MA 01803Lance.Andersen@oracle.com <mailto:Lance.Andersen@oracle.com>
Hi Philipp, The reason that the current implementation of Grapheme boundary check does it is that, some rules require information before the boundary, i.e. the rule GB12 in Unicode Segmentation [1] reads: "do not break between regional indicator (RI) symbols if there is an odd number of RI characters *before* the break point." Of course this itself can be improved, but I am not so keen on modifying Grapheme/Pattern code at this time, so using BreakIterator (with Locale.US) is more preferred solution to me. Naoto [1] https://unicode.org/reports/tr29/ On 3/10/20 12:11 AM, Philipp Kunz wrote:
Hi Naoto and everyone,
When I tried the regular expression approach you proposed which really looks delightful, ValueUtf8Coding test ran into a timeout, which is why I changed Pattern and Grapheme classes to start the loop in nextBoundary to start at the offset off rather than at the beginning of the character sequence src rendering isBoundary redundant. The changes to Pattern and Grapheme relate only by means of performance and could otherwise be applied seperately as well. I admit I'm not too confident and familiar with the regex parts of the openjdk and would welcome not only thorough review as usual but also any other feedback or thought shared about the matter.
Then I started to wonder about an issue of copy paste between Grapheme and GraphemeTest already out of date and also about some comments in RegExTest and if no more matches shouldn't also be asserted, which also could possibly be changed in a different change but still relate to the actual changes to certain extent. I could not find a way to access UCD_FILES which has no or is in the default package.
What about passing the name and the value parameters of Manifest.printLine72 to it as separate parameters? Because names cannot exceed 70 bytes and the allowed characters are limited to ones that are represented with one byte each in UTF-8 the name can/could always be printed without a line break. I don't know if Manifest.println72 can be removed or has to be retained and deprecated for compatibility.
See attached patch.
Looking forward to receiving any kind of feedback and regards, Philipp
On Tue, 2020-02-11 at 14:20 -0800, naoto.sato@oracle.com wrote:
Hi,
Can the code in Manifest.java be simplified using regex? E.g.,
var gc = Pattern.compile("\\X"); gc.matcher(line).results() .map(MatchResult::group) .forEach( // print till 72 bytes in UTF-8 );
Just a thought. If BreakIterator is preferred, it should take Locale.US as an argument to the factory method, so that it would produce the same result no matter what the default locale is.
Naoto
On 2/10/20 1:22 PM, Lance Andersen wrote:
Hi all,
Here is a webrev for the patch that Philipp is proposing which will make it easier to review: http://cr.openjdk.java.net/~lancea/6202130/webrev.00 < http://cr.openjdk.java.net/~lancea/6202130/webrev.00
On Dec 26, 2019, at 11:50 AM, Philipp Kunz < philipp.kunz@paratix.ch <mailto:philipp.kunz@paratix.ch>
wrote:
Hi, The specification says, a line break in a manifest can occur beforeor after a Unicode character encoded in UTF-8.
...> value: SPACE *otherchar newline
*continuation> continuation: SPACE *otherchar newline> ...> otherchar: any UTF-8 character except NUL, CR and LF The current implementation breaks manifest lines at 72 bytes regardless ofhow the bytes around the break are part of a sequence of bytes encoding acharacter. Code points may use up to four bytes when encoded in UTF-8.Manifests with line breaks inside of sequences of bytes encoding Unicodecharacters in UTF-8 with more than one bytes not only are invalid UTF-8but also look ugly in text editors.For example, a manifest could look like this: import java.util.jar.Manifest;import java.util.jar.Attributes;import static java.util.jar.Attributes.Name; public class CharacterBrokenDemo1 { public static void main(String[] args) throws Exception{ Manifest mf = new Manifest(); Attributes attrs = mf.getMainAttributes(); attrs.put(Name.MANIFEST_VERSION, "1.0"); attrs.put(new Name("Some-Key"), "Some languages have decorated characters, " + "for example: español"); // or "espa\u00D1ol" mf.write(System.out); }} Above code produces a result as follows with some unexpected question markswhere the encoding is invalid:
Manifest-Version: 1.0> Some-Key: Some languages have decorated
characters, for example: espa?> ?ol This is of course an example written with actual question marks to get a validtext for this message. The trick here is that "Some-Key" to "example :espa"amounts to exactly one byte less encoded in UTF-8 than would fit on one linewith the 72 byte limit so that the subsequent character encoded with two bytesgets broken inside of the sequence of two bytes for this character across acontinuation line break. When decoding the resulting bytes from UTF-8 as one whole string, the twoquestion marks will not fit together again even if the line break with thecontinuation space is removed. However, Manifest::read removes the continuationline breaks ("\r\n ") before decoding the manifest header value from UTF-8 andhence can reproduce the original value. Characters encoded in UTF-8 can not only span up to four bytes for one codepoint each, there are also combining characters or classes thereof or combiningdiacritical marks or whatever the appropriate term could be, that combine morethan one code point into what is usually experienced and referred to as acharacter. The term character really gets ambiguous at this point. I wouldn't know whatthe specification actually refers to with that term "character". So rather thandiving in too much specification or any sorts of theory, let's look at anotherexample: import java.util.jar.Manifest;import java.util.jar.Attributes;import static java.util.jar.Attributes.Name; public class DemoCharacterBroken2 { public static void main(String[] args) throws Exception{ Manifest mf = new Manifest(); Attributes attrs = mf.getMainAttributes(); attrs.put(Name.MANIFEST_VERSION, "1.0"); attrs.put(new Name("Some-Key"), " ".repeat(53) + "Angstro\u0308m"); mf.write(System.out); }} which produces console output as follows:
Manifest-Version: 1.0> Some-
Key: Angstro> ̈m (In case this does not display well, the diaeresis is on the m on the last line) When the whole Manifest is decoded from UTF-8 as one big single string andcontinuation line breaks are not removed until after UTF-8 decoding the wholemanifest, the diaeresis (umlaut, two points above, u0308) apparently kind ofjumps onto the following letter m because somehow it cannot be combined withthe preceding space. The UTF-8 decoder (all of my editors I tried, not onlyEclipse and its console view, also less, gedit, cat and terminal) somehowtries to fix that but the diaeresis may not necessarily jump back on the "o"where it originally belonged by removing the continuation line break ("\r\n ")after UTF-8 decoding has taken place, at least that did not work for me. Hence, ideally combining diacritical marks should better not be separated fromwhatever they combine with when breaking manifest lines onto a continuationline. Such combinations, however, seem to be unlimited in terms of number ofcode points combining into the same "experienced" character. I was able tofind combinations that not only exceed the limit of 72 bytes per line but alsoexceed the line buffer size of 512 bytes in Manifest::read. These may be ratheruncommon but still possible to my own surprise. Next consideration would then be to remove that limit of 512 bytes per manifestline but exceeding it would make such manifests incompatible with previousManifest::read implementations and is not really an immediately availableoption at the moment. As a compromise, those characters including combining diacritical marks whichcombine only so many code points as that their binarily encoded form in UTF-8remains within a limit of 71 bytes can be written without an interruptingcontinuation line break, which applies to most cases, but not all. I guess thisshould suit practically and realistically to be expected values well. Another possibility would be to allow for characters that are combinations ofmultiple Unicode code points to be kept together in their encoded form in UTF-8up to 512 bytes line length limit when reading minus a space and a line breakamounting to 509 bytes, but that would still not make manifests be representedas valid Unicode in all corner cases and I guess would not probably make a realimprovement in practice over a limit of 71 bytes. Attached is a patch that tries to implement what was described above using aBreakIterator. While it works from a functional point of view, this might beless desirable performance-wise. Alternatively could be considered to do withoutthe BreakIterator and only keep Unicode code points together by not placingline breaks before a continuation byte, which however would not addresscombining diacritical marks as in the second example above. The jar file specification does not explicitly state that manifest should bevalid UTF-8, and they were not always, but it also does not state otherwise,leaving an impression that manifests could be (mis)taken for UTF-8 encodedstrings, which they also are in many or most cases and which has been confusedmany times. At the moment, the only case where a valid manifest is not also avalid UTF-8 encoded string is when a sequence of bytes encoding the samecharacter happens to be interrupted with a continuation line break. To the bestof my knowledge, all other valid manifests are also valid UTF-8 encoded strings. It would be nice, if manifests could be viewed and manipulated with all Unicodecapable editors and not only be parsed correctly with Manifest::read. Any opinions? Would someone sponsor this patch? Regards,Philipp
https://docs.oracle.com/en/java/javase/13/docs/specs/jar/jar.html#specificat...
<6202130-manifestutf8linebreak.patch>
< http://oracle.com/us/design/oracle-email-sig-198324.gif
< http://oracle.com/us/design/oracle-email-sig-198324.gif
< http://oracle.com/us/design/oracle-email-sig-198324.gif>Lance Andersen| Principal Member of Technical Staff | +1.781.442.2037 Oracle Java Engineering 1 Network Drive Burlington, MA 01803 Lance.Andersen@oracle.com <mailto:Lance.Andersen@oracle.com> <mailto: Lance.Andersen@oracle.com <mailto:Lance.Andersen@oracle.com>
Hi Naoto, Thank you for your reply. I fully agree to your reluctance or inhibition to change the regex part just along with the jar manifest. On the other hand, i'm quite confident that all invocations of Grapheme.nextBoundary with my previous patch start matching at the previous boundary and therefore I believe the cited requirement is met which also the existing tests confirm. Obviously, it is difficult to prove the correctness or absence of a mistake. In fact I was kind of hoping someone would get inspired who knows the area better than I do. Maybe relates to JDK-8216332 or JDK-8046101 to some extent, but I guess not exactly. Shall we file a new bug about a performance issue in regex grapheme matchers, the performance of which potentially depends more than linearly and necessarily on the total input sequence size in Grapheme.nextBoundary?Shall we file another bug about an issue of copy and paste in GraphemeTest?I am not entitled to create new bugs myself but if there were I might contribute a patch later. Regards,Philipp On Tue, 2020-03-10 at 10:45 -0700, naoto.sato@oracle.com wrote:
Hi Philipp, The reason that the current implementation of Grapheme boundary check does it is that, some rules require information before the boundary, i.e. the rule GB12 in Unicode Segmentation [1] reads: "do not break between regional indicator (RI) symbols if there is an odd number of RI characters *before* the break point." Of course this itself can be improved, but I am not so keen on modifying Grapheme/Pattern code at this time, so using BreakIterator (with Locale.US) is more preferred solution to me. Naoto [1] https://unicode.org/reports/tr29/
On 3/10/20 12:11 AM, Philipp Kunz wrote:
Hi Naoto and everyone, When I tried the regular expression approach you proposed which really looks delightful, ValueUtf8Coding test ran into a timeout, which is why I changed Pattern and Grapheme classes to start the loop in nextBoundary to start at the offset off rather than at the beginning of the character sequence src rendering isBoundary redundant. The changes to Pattern and Grapheme relate only by means of performance and could otherwise be applied seperately as well. I admit I'm not too confident and familiar with the regex parts of the openjdk and would welcome not only thorough review as usual but also any other feedback or thought shared about the matter. Then I started to wonder about an issue of copy paste between Grapheme and GraphemeTest already out of date and also about some comments in RegExTest and if no more matches shouldn't also be asserted, which also could possibly be changed in a different change but still relate to the actual changes to certain extent. I could not find a way to access UCD_FILES which has no or is in the default package. What about passing the name and the value parameters of Manifest.printLine72 to it as separate parameters? Because names cannot exceed 70 bytes and the allowed characters are limited to ones that are represented with one byte each in UTF-8 the name can/could always be printed without a line break. I don't know if Manifest.println72 can be removed or has to be retained and deprecated for compatibility. See attached patch. Looking forward to receiving any kind of feedback and regards,Philipp
On Tue, 2020-02-11 at 14:20 -0800, naoto.sato@oracle.com wrote:
Hi, Can the code in Manifest.java be simplified using regex? E.g., var gc = Pattern.compile("\\X");gc.matcher(line).results() .map(Match Result::group) .forEach( // print till 72 bytes in UTF- 8 ); Just a thought. If BreakIterator is preferred, it should take Locale.USas an argument to the factory method, so that it would produce the sameresult no matter what the default locale is. Naoto On 2/10/20 1:22 PM, Lance Andersen wrote:
Hi all, Here is a webrev for the patch that Philipp is proposing which will make it easier to review: http://cr.openjdk.java.net/~lancea/6202130/webrev.00 <http://cr.openjdk.java.net/~lancea/6202130/webrev.00
On Dec 26, 2019, at 11:50 AM, Philipp Kunz < philipp.kunz@paratix.ch <mailto:philipp.kunz@paratix.ch>
wrote:
Hi,The specification says, a line break in a manifest can occur beforeorafter a Unicode character encoded in UTF-8.
...> value: SPACE *otherchar newline
*continuation> continuation: SPACE *othercharnewline> ...> otherchar: any UTF-8 character except NUL, CRand LFThe current implementation breaks manifest lines at 72 bytes regardlessofhow the bytes around the break are part of a sequence of bytesencoding acharacter. Code points may use up to four bytes when encodedin UTF-8.Manifests with line breaks inside of sequences of bytesencoding Unicodecharacters in UTF-8 with more than one bytes not onlyare invalid UTF-8but also look ugly in text editors.For example, amanifest could look like this:import java.util.jar.Manifest;import java.util.jar.Attributes;importstatic java.util.jar.Attributes.Name;public class CharacterBrokenDemo1 { public static void main(String[]args) throws Exception{ Manifest mf = newManifest(); Attributes attrs =mf.getMainAttributes(); attrs.put(Name.MANIFEST_VERSI ON,"1.0"); attrs.put(new Name("Some- Key"), "Somelanguages have decorated characters, " + "forexample: español"); // or"espa\u00D1ol" mf.write(System.out); }}Above code produces a result as follows with some unexpected questionmarkswhere the encoding is invalid:
Manifest-Version: 1.0> Some-Key: Some languages have decorated
characters, for example: espa?> ?olThis is of course an example written with actual question marks to geta validtext for this message. The trick here is that "Some-Key" to"example :espa"amounts to exactly one byte less encoded in UTF-8 thanwould fit on one linewith the 72 byte limit so that the subsequentcharacter encoded with two bytesgets broken inside of the sequence oftwo bytes for this character across acontinuation line break.When decoding the resulting bytes from UTF-8 as one whole string, thetwoquestion marks will not fit together again even if the line breakwith thecontinuation space is removed. However, Manifest::read removesthe continuationline breaks ("\r\n ") before decoding the manifestheader value from UTF-8 andhence can reproduce the original value.Characters encoded in UTF-8 can not only span up to four bytes for onecodepoint each, there are also combining characters or classes thereofor combiningdiacritical marks or whatever the appropriate term couldbe, that combine morethan one code point into what is usuallyexperienced and referred to as acharacter.The term character really gets ambiguous at this point. I wouldn't knowwhatthe specification actually refers to with that term "character". Sorather thandiving in too much specification or any sorts of theory,let's look at anotherexample:import java.util.jar.Manifest;import java.util.jar.Attributes;importstatic java.util.jar.Attributes.Name;public class DemoCharacterBroken2 { public static void main(String[]args) throws Exception{ Manifest mf = newManifest(); Attributes attrs =mf.getMainAttributes(); attrs.put(Name.MANIFEST_VERSI ON,"1.0"); attrs.put(new Name("Some-Key"), " ".repeat(53) +"Angstro\u0308m"); mf.write(System.out); }}which produces console output as follows:
Manifest-Version: 1.0> Some-
Key: Ang stro>̈m(In case this does not display well, the diaeresis is on the m on thelast line)When the whole Manifest is decoded from UTF-8 as one big single stringandcontinuation line breaks are not removed until after UTF-8 decodingthe wholemanifest, the diaeresis (umlaut, two points above, u0308)apparently kind ofjumps onto the following letter m because somehow itcannot be combined withthe preceding space. The UTF-8 decoder (all ofmy editors I tried, not onlyEclipse and its console view, also less,gedit, cat and terminal) somehowtries to fix that but the diaeresis maynot necessarily jump back on the "o"where it originally belonged byremoving the continuation line break ("\r\n ")after UTF-8 decoding hastaken place, at least that did not work for me.Hence, ideally combining diacritical marks should better not beseparated fromwhatever they combine with when breaking manifest linesonto a continuationline. Such combinations, however, seem to beunlimited in terms of number ofcode points combining into the same"experienced" character. I was able tofind combinations that not onlyexceed the limit of 72 bytes per line but alsoexceed the line buffersize of 512 bytes in Manifest::read. These may be ratheruncommon butstill possible to my own surprise.Next consideration would then be to remove that limit of 512 bytes permanifestline but exceeding it would make such manifests incompatiblewith previousManifest::read implementations and is not really animmediately availableoption at the moment.As a compromise, those characters including combining diacritical markswhichcombine only so many code points as that their binarily encodedform in UTF-8remains within a limit of 71 bytes can be written withoutan interruptingcontinuation line break, which applies to most cases,but not all. I guess thisshould suit practically and realistically tobe expected values well.Another possibility would be to allow for characters that arecombinations ofmultiple Unicode code points to be kept together intheir encoded form in UTF-8up to 512 bytes line length limit whenreading minus a space and a line breakamounting to 509 bytes, but thatwould still not make manifests be representedas valid Unicode in allcorner cases and I guess would not probably make a realimprovement inpractice over a limit of 71 bytes.Attached is a patch that tries to implement what was described aboveusing aBreakIterator. While it works from a functional point of view,this might beless desirable performance-wise. Alternatively could beconsidered to do withoutthe BreakIterator and only keep Unicode codepoints together by not placingline breaks before a continuation byte,which however would not addresscombining diacritical marks as in thesecond example above.The jar file specification does not explicitly state that manifestshould bevalid UTF-8, and they were not always, but it also does notstate otherwise,leaving an impression that manifests could be(mis)taken for UTF-8 encodedstrings, which they also are in many ormost cases and which has been confusedmany times. At the moment, theonly case where a valid manifest is not also avalid UTF-8 encodedstring is when a sequence of bytes encoding the samecharacter happensto be interrupted with a continuation line break. To the bestof myknowledge, all other valid manifests are also valid UTF-8 encodedstrings.It would be nice, if manifests could be viewed and manipulated with allUnicodecapable editors and not only be parsed correctly withManifest::read.Any opinions? Would someone sponsor this patch?Regards,Philipp https://docs.oracle.com/en/java/javase/13/docs/specs/jar/jar.html#specificat...
<6202130-manifestutf8linebreak.patch>
<http://oracle.com/us/design/oracle-email-sig-198324.gif <http://oracle.com/us/design/oracle-email-sig-198324.gif
< http://oracle.com/us/design/oracle-email-sig-198324.gif < http://oracle.com/us/design/oracle-email-sig-198324.gif>Lance Andersen| Principal Member of Technical Staff | +1.781.442.2037Oracle Java Engineering1 Network DriveBurlington, MA 01803Lance.Andersen@oracle.com <mailto:Lance.Andersen@oracle.com> <mailto: Lance.Andersen@oracle.com <mailto:Lance.Andersen@oracle.com>
Hi Phillip, Sure. Would you please file Grapheme related issues at bugs.java.com? This is the route for whom doesn't have JBS id to file bugs. Naoto On 3/10/20 11:27 PM, Philipp Kunz wrote:
Hi Naoto,
Thank you for your reply. I fully agree to your reluctance or inhibition to change the regex part just along with the jar manifest.
On the other hand, i'm quite confident that all invocations of Grapheme.nextBoundary with my previous patch start matching at the previous boundary and therefore I believe the cited requirement is met which also the existing tests confirm. Obviously, it is difficult to prove the correctness or absence of a mistake. In fact I was kind of hoping someone would get inspired who knows the area better than I do. Maybe relates to JDK-8216332 or JDK-8046101 to some extent, but I guess not exactly.
Shall we file a new bug about a performance issue in regex grapheme matchers, the performance of which potentially depends more than linearly and necessarily on the total input sequence size in Grapheme.nextBoundary? Shall we file another bug about an issue of copy and paste in GraphemeTest? I am not entitled to create new bugs myself but if there were I might contribute a patch later.
Regards, Philipp
On Tue, 2020-03-10 at 10:45 -0700, naoto.sato@oracle.com wrote:
Hi Philipp,
The reason that the current implementation of Grapheme boundary check does it is that, some rules require information before the boundary, i.e. the rule GB12 in Unicode Segmentation [1] reads:
"do not break between regional indicator (RI) symbols if there is an odd number of RI characters *before* the break point."
Of course this itself can be improved, but I am not so keen on modifying Grapheme/Pattern code at this time, so using BreakIterator (with Locale.US) is more preferred solution to me.
Naoto
[1] https://unicode.org/reports/tr29/ <https://urldefense.com/v3/__https://unicode.org/reports/tr29/__;!!GqivPVa7Brio!PryBHBgba8QeMFfAG7kw5V4WCdaId7c9nfo_1k7Uav-uXDxSwn5ga1aPYvhF_D6x$>
On 3/10/20 12:11 AM, Philipp Kunz wrote:
Hi Naoto and everyone,
When I tried the regular expression approach you proposed which really looks delightful, ValueUtf8Coding test ran into a timeout, which is why I changed Pattern and Grapheme classes to start the loop in nextBoundary to start at the offset off rather than at the beginning of the character sequence src rendering isBoundary redundant. The changes to Pattern and Grapheme relate only by means of performance and could otherwise be applied seperately as well. I admit I'm not too confident and familiar with the regex parts of the openjdk and would welcome not only thorough review as usual but also any other feedback or thought shared about the matter.
Then I started to wonder about an issue of copy paste between Grapheme and GraphemeTest already out of date and also about some comments in RegExTest and if no more matches shouldn't also be asserted, which also could possibly be changed in a different change but still relate to the actual changes to certain extent. I could not find a way to access UCD_FILES which has no or is in the default package.
What about passing the name and the value parameters of Manifest.printLine72 to it as separate parameters? Because names cannot exceed 70 bytes and the allowed characters are limited to ones that are represented with one byte each in UTF-8 the name can/could always be printed without a line break. I don't know if Manifest.println72 can be removed or has to be retained and deprecated for compatibility.
See attached patch.
Looking forward to receiving any kind of feedback and regards, Philipp
On Tue, 2020-02-11 at 14:20 -0800, naoto.sato@oracle.com <mailto:naoto.sato@oracle.com> wrote:
Hi,
Can the code in Manifest.java be simplified using regex? E.g.,
var gc = Pattern.compile("\\X"); gc.matcher(line).results() .map(MatchResult::group) .forEach( // print till 72 bytes in UTF-8 );
Just a thought. If BreakIterator is preferred, it should take Locale.US as an argument to the factory method, so that it would produce the same result no matter what the default locale is.
Naoto
On 2/10/20 1:22 PM, Lance Andersen wrote:
Hi all,
Here is a webrev for the patch that Philipp is proposing which will make it easier to review: http://cr.openjdk.java.net/~lancea/6202130/webrev.00
< http://cr.openjdk.java.net/~lancea/6202130/webrev.00
On Dec 26, 2019, at 11:50 AM, Philipp Kunz < philipp.kunz@paratix.ch <mailto:philipp.kunz@paratix.ch>
<mailto: philipp.kunz@paratix.ch <mailto:philipp.kunz@paratix.ch> > > wrote:
Hi, The specification says, a line break in a manifest can occur beforeor after a Unicode character encoded in UTF-8. > ...> value: SPACE *otherchar newline *continuation> continuation: SPACE *otherchar newline> ...> otherchar: any UTF-8 character except NUL, CR and LF The current implementation breaks manifest lines at 72 bytes regardless ofhow the bytes around the break are part of a sequence of bytes encoding acharacter. Code points may use up to four bytes when encoded in UTF-8.Manifests with line breaks inside of sequences of bytes encoding Unicodecharacters in UTF-8 with more than one bytes not only are invalid UTF-8but also look ugly in text editors.For example, a manifest could look like this: import java.util.jar.Manifest;import java.util.jar.Attributes;import static java.util.jar.Attributes.Name; public class CharacterBrokenDemo1 { public static void main(String[] args) throws Exception{ Manifest mf = new Manifest(); Attributes attrs = mf.getMainAttributes(); attrs.put(Name.MANIFEST_VERSION, "1.0"); attrs.put(new Name("Some-Key"), "Some languages have decorated characters, " + "for example: español"); // or "espa\u00D1ol" mf.write(System.out); }} Above code produces a result as follows with some unexpected question markswhere the encoding is invalid: > Manifest-Version: 1.0> Some-Key: Some languages have decorated characters, for example: espa?> ?ol This is of course an example written with actual question marks to get a validtext for this message. The trick here is that "Some-Key" to "example :espa"amounts to exactly one byte less encoded in UTF-8 than would fit on one linewith the 72 byte limit so that the subsequent character encoded with two bytesgets broken inside of the sequence of two bytes for this character across acontinuation line break. When decoding the resulting bytes from UTF-8 as one whole string, the twoquestion marks will not fit together again even if the line break with thecontinuation space is removed. However, Manifest::read removes the continuationline breaks ("\r\n ") before decoding the manifest header value from UTF-8 andhence can reproduce the original value. Characters encoded in UTF-8 can not only span up to four bytes for one codepoint each, there are also combining characters or classes thereof or combiningdiacritical marks or whatever the appropriate term could be, that combine morethan one code point into what is usually experienced and referred to as acharacter. The term character really gets ambiguous at this point. I wouldn't know whatthe specification actually refers to with that term "character". So rather thandiving in too much specification or any sorts of theory, let's look at anotherexample: import java.util.jar.Manifest;import java.util.jar.Attributes;import static java.util.jar.Attributes.Name; public class DemoCharacterBroken2 { public static void main(String[] args) throws Exception{ Manifest mf = new Manifest(); Attributes attrs = mf.getMainAttributes(); attrs.put(Name.MANIFEST_VERSION, "1.0"); attrs.put(new Name("Some-Key"), " ".repeat(53) + "Angstro\u0308m"); mf.write(System.out); }} which produces console output as follows: > Manifest-Version: 1.0> Some- Key: Angstro> ̈m (In case this does not display well, the diaeresis is on the m on the last line) When the whole Manifest is decoded from UTF-8 as one big single string andcontinuation line breaks are not removed until after UTF-8 decoding the wholemanifest, the diaeresis (umlaut, two points above, u0308) apparently kind ofjumps onto the following letter m because somehow it cannot be combined withthe preceding space. The UTF-8 decoder (all of my editors I tried, not onlyEclipse and its console view, also less, gedit, cat and terminal) somehowtries to fix that but the diaeresis may not necessarily jump back on the "o"where it originally belonged by removing the continuation line break ("\r\n ")after UTF-8 decoding has taken place, at least that did not work for me. Hence, ideally combining diacritical marks should better not be separated fromwhatever they combine with when breaking manifest lines onto a continuationline. Such combinations, however, seem to be unlimited in terms of number ofcode points combining into the same "experienced" character. I was able tofind combinations that not only exceed the limit of 72 bytes per line but alsoexceed the line buffer size of 512 bytes in Manifest::read. These may be ratheruncommon but still possible to my own surprise. Next consideration would then be to remove that limit of 512 bytes per manifestline but exceeding it would make such manifests incompatible with previousManifest::read implementations and is not really an immediately availableoption at the moment. As a compromise, those characters including combining diacritical marks whichcombine only so many code points as that their binarily encoded form in UTF-8remains within a limit of 71 bytes can be written without an interruptingcontinuation line break, which applies to most cases, but not all. I guess thisshould suit practically and realistically to be expected values well. Another possibility would be to allow for characters that are combinations ofmultiple Unicode code points to be kept together in their encoded form in UTF-8up to 512 bytes line length limit when reading minus a space and a line breakamounting to 509 bytes, but that would still not make manifests be representedas valid Unicode in all corner cases and I guess would not probably make a realimprovement in practice over a limit of 71 bytes. Attached is a patch that tries to implement what was described above using aBreakIterator. While it works from a functional point of view, this might beless desirable performance-wise. Alternatively could be considered to do withoutthe BreakIterator and only keep Unicode code points together by not placingline breaks before a continuation byte, which however would not addresscombining diacritical marks as in the second example above. The jar file specification does not explicitly state that manifest should bevalid UTF-8, and they were not always, but it also does not state otherwise,leaving an impression that manifests could be (mis)taken for UTF-8 encodedstrings, which they also are in many or most cases and which has been confusedmany times. At the moment, the only case where a valid manifest is not also avalid UTF-8 encoded string is when a sequence of bytes encoding the samecharacter happens to be interrupted with a continuation line break. To the bestof my knowledge, all other valid manifests are also valid UTF-8 encoded strings. It would be nice, if manifests could be viewed and manipulated with all Unicodecapable editors and not only be parsed correctly with Manifest::read. Any opinions? Would someone sponsor this patch? Regards,Philipp
https://docs.oracle.com/en/java/javase/13/docs/specs/jar/jar.html#specificat...
<6202130-manifestutf8linebreak.patch>
< http://oracle.com/us/design/oracle-email-sig-198324.gif
< http://oracle.com/us/design/oracle-email-sig-198324.gif
< http://oracle.com/us/design/oracle-email-sig-198324.gif>Lance
Andersen| Principal Member of Technical Staff | +1.781.442.2037 Oracle Java Engineering 1 Network Drive Burlington, MA 01803 Lance.Andersen@oracle.com <mailto:Lance.Andersen@oracle.com>
<mailto: Lance.Andersen@oracle.com <mailto:Lance.Andersen@oracle.com>
<mailto: Lance.Andersen@oracle.com <mailto:Lance.Andersen@oracle.com>
<mailto: Lance.Andersen@oracle.com <mailto:Lance.Andersen@oracle.com>
Hi Naoto and everyone, There are almost as many occurrences of Locale.ROOT as Locale.US which made me wonder which one is more appropriately locale-independent and which is probably another topic and not actually relevant here because BreakIterator.getCharacterInstance is locale-agnostic as far as I can tell. See attached patch with another attempt to fix bug 6202130. Regards, Philipp On Tue, 2020-03-10 at 10:45 -0700, naoto.sato@oracle.com wrote:
Hi Philipp,
..., so using BreakIterator (with Locale.US) is more preferred solution to me.
Naoto
Hi Philipp, Right, Locale.ROOT is more appropriate here by definition, though the implementation is the same. Naoto On 3/21/20 5:19 AM, Philipp Kunz wrote:
Hi Naoto and everyone,
There are almost as many occurrences of Locale.ROOT as Locale.US which made me wonder which one is more appropriately locale-independent and which is probably another topic and not actually relevant here because BreakIterator.getCharacterInstance is locale-agnostic as far as I can tell.
See attached patch with another attempt to fix bug 6202130.
Regards, Philipp
On Tue, 2020-03-10 at 10:45 -0700, naoto.sato@oracle.com wrote:
Hi Philipp,
..., so using BreakIterator (with Locale.US) is more preferred solution to me.
Naoto
Hi Naoto, See another patch attached with Locale.ROOT for the BreakIterator. I will be glad to hear of any feedback. There is another patch [1] around dealing with manifest attributes during application launch. It certainly is related to 6202130 but feels like a distinct set of changes with a slightly different concern. Any opinion as how to proceed with that one? Regards,Philipp [1] https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-February/064720.h... On Mon, 2020-03-23 at 09:05 -0700, naoto.sato@oracle.com wrote:
Hi Philipp, Right, Locale.ROOT is more appropriate here by definition, though the implementation is the same. Naoto On 3/21/20 5:19 AM, Philipp Kunz wrote:
Hi Naoto and everyone, There are almost as many occurrences of Locale.ROOT as Locale.US which made me wonder which one is more appropriately locale- independent and which is probably another topic and not actually relevant here because BreakIterator.getCharacterInstance is locale- agnostic as far as I can tell. See attached patch with another attempt to fix bug 6202130. Regards,Philipp
On Tue, 2020-03-10 at 10:45 -0700, naoto.sato@oracle.com wrote:
Hi Philipp, ..., so using BreakIterator (withLocale.US) is more preferred solution to me. Naoto
Hi Philipp, Sorry for the delay. On 3/25/20 11:52 AM, Philipp Kunz wrote:
Hi Naoto,
See another patch attached with Locale.ROOT for the BreakIterator. I will be glad to hear of any feedback.
I wonder how your change affects the performance, as it will do String.getBytes(UTF-8) per each character. I think this can definitely be improved by adding some fastpath, e.g., for ASCII. The usage of the BreakIterator is fine, though.
There is another patch [1] around dealing with manifest attributes during application launch. It certainly is related to 6202130 but feels like a distinct set of changes with a slightly different concern. Any opinion as how to proceed with that one?
I am not quite sure which patch you are referring to, but I agree that creating an issue would not hurt. Naoto
Regards, Philipp
[1] https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-February/064720.h...
On Mon, 2020-03-23 at 09:05 -0700, naoto.sato@oracle.com wrote:
Hi Philipp,
Right, Locale.ROOT is more appropriate here by definition, though the implementation is the same.
Naoto
On 3/21/20 5:19 AM, Philipp Kunz wrote:
Hi Naoto and everyone,
There are almost as many occurrences of Locale.ROOT as Locale.US which made me wonder which one is more appropriately locale-independent and which is probably another topic and not actually relevant here because BreakIterator.getCharacterInstance is locale-agnostic as far as I can tell.
See attached patch with another attempt to fix bug 6202130.
Regards, Philipp
On Tue, 2020-03-10 at 10:45 -0700, naoto.sato@oracle.com <mailto:naoto.sato@oracle.com> wrote:
Hi Philipp,
..., so using BreakIterator (with Locale.US) is more preferred solution to me.
Naoto
Hi Naoto, You are absolutely right to raise the question. I've also thought about this but haven't come up so far with a compellingly elegant solution, at least not yet. If only String.isLatin1 was public that would come in very handy. Character or anything else I looked at cannot tell if a string is ascii. BreakIterator supports iterating backwards so we could start at the potential line end but that requires a start position that is a boundary to start with and that is not obviously possible due to the regional indicators and probably other code points requiring stateful processing. Same with regexes and I wouldn't know how to express groups that could count bytes. It does not even seem to be possible to guess any number of characters to start searching for a boundary because of the statefulness. Even the most simple approach to detect latin1 Strings requires an encoding or a regex such as "[^\\p{ASCII}]" which essentially is another inefficient loop. It also does not work to encode the string into UTF-8 in a single pass because then it is not known which grapheme boundary matches to which byte position. I also tried to write the header names and the ": " delimiter without breaking first but it did not seem to significantly affect performance. UTF_8.newEncoder cannot process single surrogates, admittedly an edge case, but required for compatibility. I added a fast path, see patch, the best way I could think of. Did I miss a better way to tell ascii strings from others? What I found actually improving performance is based on the consideration that strings are composed of primitive chars that will be encoded into three bytes each maximum always that up to 24 characters can be passed to writeChar72 at a time reducing the number of writeChar72 and in turn String.getBytes invocations. The number of characters that can be passed to writeChar72 is at most the number of bytes remaining on the current manifest line (linePos) divided by three but at least one. See attached patch. Regards,Philipp On Mon, 2020-03-30 at 14:31 -0700, naoto.sato@oracle.com wrote:
Hi Philipp, Sorry for the delay. On 3/25/20 11:52 AM, Philipp Kunz wrote:
Hi Naoto, See another patch attached with Locale.ROOT for the BreakIterator. I will be glad to hear of any feedback.
I wonder how your change affects the performance, as it will do String.getBytes(UTF-8) per each character. I think this can definitely be improved by adding some fastpath, e.g., for ASCII. The usage of the BreakIterator is fine, though.
There is another patch [1] around dealing with manifest attributes during application launch. It certainly is related to 6202130 but feels like a distinct set of changes with a slightly different concern. Any opinion as how to proceed with that one?
I am not quite sure which patch you are referring to, but I agree that creating an issue would not hurt. Naoto
Regards,Philipp
[1] https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-February/064720.h...
On Mon, 2020-03-23 at 09:05 -0700, naoto.sato@oracle.com wrote:
Hi Philipp, Right, Locale.ROOT is more appropriate here by definition, though theimplementation is the same. Naoto On 3/21/20 5:19 AM, Philipp Kunz wrote:
Hi Naoto and everyone, There are almost as many occurrences of Locale.ROOT as Locale.US whichmade me wonder which one is more appropriately locale-independent andwhich is probably another topic and not actually relevant here becauseBreakIterator.getCharacterInstance is locale-agnostic as far as I can tell. See attached patch with another attempt to fix bug 6202130. Regards,Philipp
On Tue, 2020-03-10 at 10:45 -0700,naoto.sato@oracle.com <mailto:naoto.sato@oracle.com> wrote:
Hi Philipp, ..., so using BreakIterator (withLocale.US) is more preferred solution to me. Naoto
Hi, Philipp. Sorry for the further delay. Just to step back a bit: The current manifest-writing behavior dates back to JDK 1.2 (at least), so compatibility is an overriding concern. It is unfortunate that multi-byte characters weren't better accounted for in the placement of line breaks from the beginning. But typically in cases like this, we would update the specification to fit the long-standing behavior, and avoid the risk of breaking applications in the wild. We can test that manifests written with your changes can be read by the JDK. But there is also manifest-reading code external to the JDK. Some examples, that would need to be investigated: jarmanifest https://pypi.org/project/jarmanifest/ Chromium https://chromium.googlesource.com/external/googleappengine/python/+/7e0ab775... Maven http://maven.apache.org/shared/maven-archiver/index.html Apache Ant https://en.wikipedia.org/wiki/JAR_(file_format)#Apache_Ant_Zip/JAR_support IDEs would be another good place to check, and maybe also consumers of JavaEE WAR / EAR files. Performance is also a consideration. JMH benchmarks would be needed to confirm that reading the new manifest is not slower, and to gauge any regression in Manifest writing performance. So that is the work that would need to be done to support this change. The question then will be if the benefit of this change (seen only outside of running code) outweighs the risk of changing behavior that has not been an issue for 20+ years. It seems unlikely that a strong enough case could be made. -Brent On 4/13/20 10:29 AM, Philipp Kunz wrote:
Hi Naoto, You are absolutely right to raise the question. I've also thought about this but haven't come up so far with a compellingly elegant solution, at least not yet. If only String.isLatin1 was public that would come in very handy. Character or anything else I looked at cannot tell if a string is ascii. BreakIterator supports iterating backwards so we could start at the potential line end but that requires a start position that is a boundary to start with and that is not obviously possible due to the regional indicators and probably other code points requiring stateful processing. Same with regexes and I wouldn't know how to express groups that could count bytes. It does not even seem to be possible to guess any number of characters to start searching for a boundary because of the statefulness. Even the most simple approach to detect latin1 Strings requires an encoding or a regex such as "[^\\p{ASCII}]" which essentially is another inefficient loop. It also does not work to encode the string into UTF-8 in a single pass because then it is not known which grapheme boundary matches to which byte position. I also tried to write the header names and the ": " delimiter without breaking first but it did not seem to significantly affect performance. UTF_8.newEncoder cannot process single surrogates, admittedly an edge case, but required for compatibility. I added a fast path, see patch, the best way I could think of. Did I miss a better way to tell ascii strings from others? What I found actually improving performance is based on the consideration that strings are composed of primitive chars that will be encoded into three bytes each maximum always that up to 24 characters can be passed to writeChar72 at a time reducing the number of writeChar72 and in turn String.getBytes invocations. The number of characters that can be passed to writeChar72 is at most the number of bytes remaining on the current manifest line (linePos) divided by three but at least one. See attached patch. Regards,Philipp
On Mon, 2020-03-30 at 14:31 -0700, naoto.sato@oracle.com wrote:
Hi Philipp, Sorry for the delay. On 3/25/20 11:52 AM, Philipp Kunz wrote:
Hi Naoto, See another patch attached with Locale.ROOT for the BreakIterator. I will be glad to hear of any feedback.
I wonder how your change affects the performance, as it will do String.getBytes(UTF-8) per each character. I think this can definitely be improved by adding some fastpath, e.g., for ASCII. The usage of the BreakIterator is fine, though.
There is another patch [1] around dealing with manifest attributes during application launch. It certainly is related to 6202130 but feels like a distinct set of changes with a slightly different concern. Any opinion as how to proceed with that one?
I am not quite sure which patch you are referring to, but I agree that creating an issue would not hurt. Naoto
Regards,Philipp
[1] https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-February/064720.h...
On Mon, 2020-03-23 at 09:05 -0700, naoto.sato@oracle.com wrote:
Hi Philipp, Right, Locale.ROOT is more appropriate here by definition, though theimplementation is the same. Naoto On 3/21/20 5:19 AM, Philipp Kunz wrote:
Hi Naoto and everyone, There are almost as many occurrences of Locale.ROOT as Locale.US whichmade me wonder which one is more appropriately locale-independent andwhich is probably another topic and not actually relevant here becauseBreakIterator.getCharacterInstance is locale-agnostic as far as I can tell. See attached patch with another attempt to fix bug 6202130. Regards,Philipp
On Tue, 2020-03-10 at 10:45 -0700,naoto.sato@oracle.com <mailto:naoto.sato@oracle.com> wrote:
Hi Philipp, ..., so using BreakIterator (withLocale.US) is more preferred solution to me. Naoto
Hi Brent, Thank you very much indeed for your serious answer. Still a lot of work left but to start with I randomly picked Ant and found it suffers a bug. See attached bunch of files to reproduce. The bug does not occur with my proposed patch. One reason for this bug to occur is cutting and concatenating binary portions of manifests as strings which don't properly align with Unicode character boundaries which just does not work and my patch avoids. If I may share an opinion I'd prefer to avoid manifests which are not also valid UTF-8 in the first place. Regards,Philipp On Fri, 2020-05-08 at 11:04 -0700, Brent Christian wrote:
Hi, Philipp. Sorry for the further delay. Just to step back a bit: The current manifest-writing behavior dates back to JDK 1.2 (at least), so compatibility is an overriding concern. It is unfortunate that multi-byte characters weren't better accounted for in the placement of line breaks from the beginning. But typically in cases like this, we would update the specification to fit the long-standing behavior, and avoid the risk of breaking applications in the wild. We can test that manifests written with your changes can be read by the JDK. But there is also manifest-reading code external to the JDK. Some examples, that would need to be investigated: jarmanifesthttps://pypi.org/project/jarmanifest/
Chromium https://chromium.googlesource.com/external/googleappengine/python/+/7e0ab775...
Mavenhttp://maven.apache.org/shared/maven-archiver/index.html
Apache Ant https://en.wikipedia.org/wiki/JAR_(file_format)#Apache_Ant_Zip/JAR_support
IDEs would be another good place to check, and maybe also consumers of JavaEE WAR / EAR files.
Performance is also a consideration. JMH benchmarks would be needed to confirm that reading the new manifest is not slower, and to gauge any regression in Manifest writing performance.
So that is the work that would need to be done to support this change. The question then will be if the benefit of this change (seen only outside of running code) outweighs the risk of changing behavior that has not been an issue for 20+ years. It seems unlikely that a strong enough case could be made. -Brent On 4/13/20 10:29 AM, Philipp Kunz wrote:
Hi Naoto,You are absolutely right to raise the question. I've also thought aboutthis but haven't come up so far with a compellingly elegant solution,at least not yet.If only String.isLatin1 was public that would come in very handy.Character or anything else I looked at cannot tell if a string isascii. BreakIterator supports iterating backwards so we could start atthe potential line end but that requires a start position that is aboundary to start with and that is not obviously possible due to theregional indicators and probably other code points requiring statefulprocessing. Same with regexes and I wouldn't know how to express groupsthat could count bytes. It does not even seem to be possible to guessany number of characters to start searching for a boundary because ofthe statefulness. Even the most simple approach to detect latin1Strings requires an encoding or a regex such as "[^\\p{ASCII}]" whichessentially is another inefficient loop. It also does not work toencode the string into UTF-8 in a single pass because then it is notknown which grapheme boundary matches to which byte position. I alsotried to write the header names and the ": " delimiter without breakingfirst but it did not seem to significantly affect performance.UTF_8.newEncoder cannot process single surrogates, admittedly an edgecase, but required for compatibility. I added a fast path, see patch,the best way I could think of. Did I miss a better way to tell asciistrings from others?What I found actually improving performance is based on theconsideration that strings are composed of primitive chars that will beencoded into three bytes each maximum always that up to 24 characterscan be passed to writeChar72 at a time reducing the number ofwriteChar72 and in turn String.getBytes invocations. The number ofcharacters that can be passed to writeChar72 is at most the number ofbytes remaining on the current manifest line (linePos) divided by threebut at least one. See attached patch.Regards,Philipp On Mon, 2020-03-30 at 14:31 -0700, naoto.sato@oracle.com wrote:
Hi Philipp,Sorry for the delay.On 3/25/20 11:52 AM, Philipp Kunz wrote:
Hi Naoto,See another patch attached with Locale.ROOT for the BreakIterator.I will be glad to hear of any feedback.
I wonder how your change affects the performance, as it will doString.getBytes(UTF-8) per each character. I think this candefinitely be improved by adding some fastpath, e.g., for ASCII. Theusage of the BreakIterator is fine, though.
There is another patch [1] around dealing with manifest attributesduring application launch. It certainly is related to 6202130 butfeels like a distinct set of changes with a slightly differentconcern. Any opinion as how to proceed with that one?
I am not quite sure which patch you are referring to, but I agreethat creating an issue would not hurt.Naoto
Regards,Philipp
[1] https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-February/064720.h...
On Mon, 2020-03-23 at 09:05 -0700, naoto.sato@oracle.com wrote:
Hi Philipp,Right, Locale.ROOT is more appropriate here by definition, thoughtheimplementation is the same.NaotoOn 3/21/20 5:19 AM, Philipp Kunz wrote:
Hi Naoto and everyone,There are almost as many occurrences of Locale.ROOT asLocale.US whichmade me wonder which one is more appropriatelylocale-independent andwhich is probably another topic and notactually relevant herebecauseBreakIterator.getCharacterInstance is locale- agnostic asfar as I can tell.See attached patch with another attempt to fix bug 6202130.Regards,Philipp On Tue, 2020-03-10 at 10:45 -0700,naoto.sato@oracle.com <mailto:naoto.sato@oracle.com> wrote: > Hi Philipp,..., so using BreakIterator (withLocale.US) is > more preferredsolution to me.Naoto
participants (4)
-
Brent Christian
-
Lance Andersen
-
naoto.sato@oracle.com
-
Philipp Kunz