6202130: Need to handle UTF-8 values and break up lines longer than 72 bytes
Philipp Kunz
philipp.kunz at paratix.ch
Wed Mar 11 06:27:55 UTC 2020
Hi Naoto,
Thank you for your reply. I fully agree to your reluctance or
inhibition to change the regex part just along with the jar manifest.
On the other hand, i'm quite confident that all invocations of
Grapheme.nextBoundary with my previous patch start matching at the
previous boundary and therefore I believe the cited requirement is met
which also the existing tests confirm. Obviously, it is difficult to
prove the correctness or absence of a mistake. In fact I was kind of
hoping someone would get inspired who knows the area better than I do.
Maybe relates to JDK-8216332 or JDK-8046101 to some extent, but I guess
not exactly.
Shall we file a new bug about a performance issue in regex grapheme
matchers, the performance of which potentially depends more than
linearly and necessarily on the total input sequence size
in Grapheme.nextBoundary?Shall we file another bug about an issue of
copy and paste in GraphemeTest?I am not entitled to create new bugs
myself but if there were I might contribute a patch later.
Regards,Philipp
On Tue, 2020-03-10 at 10:45 -0700, naoto.sato at oracle.com wrote:
> Hi Philipp,
> The reason that the current implementation of Grapheme boundary check
> does it is that, some rules require information before the boundary,
> i.e. the rule GB12 in Unicode Segmentation [1] reads:
> "do not break between regional indicator (RI) symbols if there is an
> odd number of RI characters *before* the break point."
> Of course this itself can be improved, but I am not so keen on
> modifying Grapheme/Pattern code at this time, so using BreakIterator
> (with Locale.US) is more preferred solution to me.
> Naoto
> [1] https://unicode.org/reports/tr29/
>
>
> On 3/10/20 12:11 AM, Philipp Kunz wrote:
> > Hi Naoto and everyone,
> > When I tried the regular expression approach you proposed which
> > really looks delightful, ValueUtf8Coding test ran into a timeout,
> > which is why I changed Pattern and Grapheme classes to start the
> > loop in nextBoundary to start at the offset off rather than at the
> > beginning of the character sequence src rendering isBoundary
> > redundant. The changes to Pattern and Grapheme relate only by means
> > of performance and could otherwise be applied seperately as well. I
> > admit I'm not too confident and familiar with the regex parts of
> > the openjdk and would welcome not only thorough review as usual but
> > also any other feedback or thought shared about the matter.
> > Then I started to wonder about an issue of copy paste between
> > Grapheme and GraphemeTest already out of date and also about some
> > comments in RegExTest and if no more matches shouldn't also be
> > asserted, which also could possibly be changed in a different
> > change but still relate to the actual changes to certain extent. I
> > could not find a way to access UCD_FILES which has no or is in the
> > default package.
> > What about passing the name and the value parameters of
> > Manifest.printLine72 to it as separate parameters? Because names
> > cannot exceed 70 bytes and the allowed characters are limited to
> > ones that are represented with one byte each in UTF-8 the name
> > can/could always be printed without a line break. I don't know if
> > Manifest.println72 can be removed or has to be retained and
> > deprecated for compatibility.
> > See attached patch.
> > Looking forward to receiving any kind of feedback and
> > regards,Philipp
> >
> >
> >
> >
> > On Tue, 2020-02-11 at 14:20 -0800, naoto.sato at oracle.com wrote:
> > > Hi,
> > > Can the code in Manifest.java be simplified using regex? E.g.,
> > > var gc =
> > > Pattern.compile("\\X");gc.matcher(line).results() .map(Match
> > > Result::group) .forEach( // print till 72 bytes in UTF-
> > > 8 );
> > > Just a thought. If BreakIterator is preferred, it should take
> > > Locale.USas an argument to the factory method, so that it would
> > > produce the sameresult no matter what the default locale is.
> > > Naoto
> > > On 2/10/20 1:22 PM, Lance Andersen wrote:
> > > > Hi all,
> > > > Here is a webrev for the patch that Philipp is proposing which
> > > > will make it easier to review:
> > > > http://cr.openjdk.java.net/~lancea/6202130/webrev.00
> > > > <http://cr.openjdk.java.net/~lancea/6202130/webrev.00
> > > > > On Dec 26, 2019, at 11:50 AM, Philipp Kunz <
> > > > > philipp.kunz at paratix.ch
> > > > > <mailto:philipp.kunz at paratix.ch>
> > > > > > wrote:
> > > > >
> > > > > Hi,The specification says, a line break in a manifest can
> > > > > occur beforeorafter a Unicode character encoded in UTF-8.
> > > > > > ...> value: SPACE *otherchar newline
> > > > > *continuation> continuation: SPACE
> > > > > *othercharnewline> ...> otherchar: any UTF-8
> > > > > character except NUL, CRand LFThe current implementation
> > > > > breaks manifest lines at 72 bytes regardlessofhow the bytes
> > > > > around the break are part of a sequence of bytesencoding
> > > > > acharacter. Code points may use up to four bytes when
> > > > > encodedin UTF-8.Manifests with line breaks inside of
> > > > > sequences of bytesencoding Unicodecharacters in UTF-8 with
> > > > > more than one bytes not onlyare invalid UTF-8but also look
> > > > > ugly in text editors.For example, amanifest could look like
> > > > > this:import java.util.jar.Manifest;import
> > > > > java.util.jar.Attributes;importstatic
> > > > > java.util.jar.Attributes.Name;public class
> > > > > CharacterBrokenDemo1 { public static void
> > > > > main(String[]args) throws Exception{ Manifest mf =
> > > > > newManifest(); Attributes attrs
> > > > > =mf.getMainAttributes(); attrs.put(Name.MANIFEST_VERSI
> > > > > ON,"1.0"); attrs.put(new Name("Some-
> > > > > Key"), "Somelanguages have decorated
> > > > > characters, " + "forexample: español"); //
> > > > > or"espa\u00D1ol" mf.write(System.out); }}Above code
> > > > > produces a result as follows with some unexpected
> > > > > questionmarkswhere the encoding is invalid:
> > > > > > Manifest-Version: 1.0> Some-Key: Some languages have
> > > > > > decorated
> > > > > characters, for example: espa?> ?olThis is of course an
> > > > > example written with actual question marks to geta validtext
> > > > > for this message. The trick here is that "Some-Key"
> > > > > to"example :espa"amounts to exactly one byte less encoded in
> > > > > UTF-8 thanwould fit on one linewith the 72 byte limit so that
> > > > > the subsequentcharacter encoded with two bytesgets broken
> > > > > inside of the sequence oftwo bytes for this character across
> > > > > acontinuation line break.When decoding the resulting bytes
> > > > > from UTF-8 as one whole string, thetwoquestion marks will not
> > > > > fit together again even if the line breakwith thecontinuation
> > > > > space is removed. However, Manifest::read removesthe
> > > > > continuationline breaks ("\r\n ") before decoding the
> > > > > manifestheader value from UTF-8 andhence can reproduce the
> > > > > original value.Characters encoded in UTF-8 can not only span
> > > > > up to four bytes for onecodepoint each, there are also
> > > > > combining characters or classes thereofor
> > > > > combiningdiacritical marks or whatever the appropriate term
> > > > > couldbe, that combine morethan one code point into what is
> > > > > usuallyexperienced and referred to as acharacter.The term
> > > > > character really gets ambiguous at this point. I wouldn't
> > > > > knowwhatthe specification actually refers to with that term
> > > > > "character". Sorather thandiving in too much specification or
> > > > > any sorts of theory,let's look at anotherexample:import
> > > > > java.util.jar.Manifest;import
> > > > > java.util.jar.Attributes;importstatic
> > > > > java.util.jar.Attributes.Name;public class
> > > > > DemoCharacterBroken2 { public static void
> > > > > main(String[]args) throws Exception{ Manifest mf =
> > > > > newManifest(); Attributes attrs
> > > > > =mf.getMainAttributes(); attrs.put(Name.MANIFEST_VERSI
> > > > > ON,"1.0"); attrs.put(new Name("Some-Key"), "
> > > > > ".repeat(53)
> > > > > +"Angstro\u0308m"); mf.write(System.out); }}which
> > > > > produces console output as follows:
> > > > > > Manifest-Version: 1.0> Some-
> > > > > Key: Ang
> > > > > stro>̈m(In case this does not display well, the diaeresis is
> > > > > on the m on thelast line)When the whole Manifest is decoded
> > > > > from UTF-8 as one big single stringandcontinuation line
> > > > > breaks are not removed until after UTF-8 decodingthe
> > > > > wholemanifest, the diaeresis (umlaut, two points above,
> > > > > u0308)apparently kind ofjumps onto the following letter m
> > > > > because somehow itcannot be combined withthe preceding space.
> > > > > The UTF-8 decoder (all ofmy editors I tried, not onlyEclipse
> > > > > and its console view, also less,gedit, cat and terminal)
> > > > > somehowtries to fix that but the diaeresis maynot necessarily
> > > > > jump back on the "o"where it originally belonged byremoving
> > > > > the continuation line break ("\r\n ")after UTF-8 decoding
> > > > > hastaken place, at least that did not work for me.Hence,
> > > > > ideally combining diacritical marks should better not
> > > > > beseparated fromwhatever they combine with when breaking
> > > > > manifest linesonto a continuationline. Such combinations,
> > > > > however, seem to beunlimited in terms of number ofcode points
> > > > > combining into the same"experienced" character. I was able
> > > > > tofind combinations that not onlyexceed the limit of 72 bytes
> > > > > per line but alsoexceed the line buffersize of 512 bytes in
> > > > > Manifest::read. These may be ratheruncommon butstill possible
> > > > > to my own surprise.Next consideration would then be to remove
> > > > > that limit of 512 bytes permanifestline but exceeding it
> > > > > would make such manifests incompatiblewith
> > > > > previousManifest::read implementations and is not really
> > > > > animmediately availableoption at the moment.As a compromise,
> > > > > those characters including combining diacritical
> > > > > markswhichcombine only so many code points as that their
> > > > > binarily encodedform in UTF-8remains within a limit of 71
> > > > > bytes can be written withoutan interruptingcontinuation line
> > > > > break, which applies to most cases,but not all. I guess
> > > > > thisshould suit practically and realistically tobe expected
> > > > > values well.Another possibility would be to allow for
> > > > > characters that arecombinations ofmultiple Unicode code
> > > > > points to be kept together intheir encoded form in UTF-8up to
> > > > > 512 bytes line length limit whenreading minus a space and a
> > > > > line breakamounting to 509 bytes, but thatwould still not
> > > > > make manifests be representedas valid Unicode in allcorner
> > > > > cases and I guess would not probably make a realimprovement
> > > > > inpractice over a limit of 71 bytes.Attached is a patch that
> > > > > tries to implement what was described aboveusing
> > > > > aBreakIterator. While it works from a functional point of
> > > > > view,this might beless desirable performance-wise.
> > > > > Alternatively could beconsidered to do withoutthe
> > > > > BreakIterator and only keep Unicode codepoints together by
> > > > > not placingline breaks before a continuation byte,which
> > > > > however would not addresscombining diacritical marks as in
> > > > > thesecond example above.The jar file specification does not
> > > > > explicitly state that manifestshould bevalid UTF-8, and they
> > > > > were not always, but it also does notstate otherwise,leaving
> > > > > an impression that manifests could be(mis)taken for UTF-8
> > > > > encodedstrings, which they also are in many ormost cases and
> > > > > which has been confusedmany times. At the moment, theonly
> > > > > case where a valid manifest is not also avalid UTF-8
> > > > > encodedstring is when a sequence of bytes encoding the
> > > > > samecharacter happensto be interrupted with a continuation
> > > > > line break. To the bestof myknowledge, all other valid
> > > > > manifests are also valid UTF-8 encodedstrings.It would be
> > > > > nice, if manifests could be viewed and manipulated with
> > > > > allUnicodecapable editors and not only be parsed correctly
> > > > > withManifest::read.Any opinions? Would someone sponsor this
> > > > > patch?Regards,Philipp
> > > > > https://docs.oracle.com/en/java/javase/13/docs/specs/jar/jar.html#specificationhttps://bugs.openjdk.java.net/browse/JDK-6202130https://bugs.openjdk.java.net/browse/JDK-6443578https://github.com/gradle/gradle/issues/5225https://bugs.openjdk.java.net/browse/JDK-8202525https://en.wikipedia.org/wiki/Combining_character
> > > > >
> > > > >
> > > > > <6202130-manifestutf8linebreak.patch>
> > > >
> > > > <http://oracle.com/us/design/oracle-email-sig-198324.gif
> > > > <http://oracle.com/us/design/oracle-email-sig-198324.gif
> > > > > <
> > > > http://oracle.com/us/design/oracle-email-sig-198324.gif
> > > > <
> > > > http://oracle.com/us/design/oracle-email-sig-198324.gif>Lance
> > > > Andersen| Principal Member of Technical Staff |
> > > > +1.781.442.2037Oracle Java Engineering1 Network
> > > > DriveBurlington, MA 01803Lance.Andersen at oracle.com
> > > > <mailto:Lance.Andersen at oracle.com> <mailto:
> > > > Lance.Andersen at oracle.com
> > > > <mailto:Lance.Andersen at oracle.com>
> > > >
> > > >
More information about the core-libs-dev
mailing list