6202130: Need to handle UTF-8 values and break up lines longer than 72 bytes

Wed Mar 11 16:52:41 UTC 2020

Hi Phillip,

Sure. Would you please file Grapheme related issues at bugs.java.com? 
This is the route for whom doesn't have JBS id to file bugs.

Naoto

On 3/10/20 11:27 PM, Philipp Kunz wrote:
> Hi Naoto,
> 
> Thank you for your reply. I fully agree to your reluctance or inhibition 
> to change the regex part just along with the jar manifest.
> 
> On the other hand, i'm quite confident that all invocations of 
> Grapheme.nextBoundary with my previous patch start matching at the 
> previous boundary and therefore I believe the cited requirement is met 
> which also the existing tests confirm. Obviously, it is difficult to 
> prove the correctness or absence of a mistake. In fact I was kind of 
> hoping someone would get inspired who knows the area better than I do. 
> Maybe relates to JDK-8216332 or JDK-8046101 to some extent, but I guess 
> not exactly.
> 
> Shall we file a new bug about a performance issue in regex grapheme 
> matchers, the performance of which potentially depends more than 
> linearly and necessarily on the total input sequence size 
> in Grapheme.nextBoundary?
> Shall we file another bug about an issue of copy and paste in GraphemeTest?
> I am not entitled to create new bugs myself but if there were I might 
> contribute a patch later.
> 
> Regards,
> Philipp
> 
> 
> On Tue, 2020-03-10 at 10:45 -0700, naoto.sato at oracle.com wrote:
>> Hi Philipp,
>>
>> The reason that the current implementation of Grapheme boundary check
>> does it is that, some rules require information before the boundary,
>> i.e. the rule GB12 in Unicode Segmentation [1] reads:
>>
>> "do not break between regional indicator (RI) symbols if there is an odd
>> number of RI characters *before* the break point."
>>
>> Of course this itself can be improved, but I am not so keen on modifying
>> Grapheme/Pattern code at this time, so using BreakIterator (with
>> Locale.US) is more preferred solution to me.
>>
>> Naoto
>>
>> [1]
>> https://unicode.org/reports/tr29/
>>  <https://urldefense.com/v3/__https://unicode.org/reports/tr29/__;!!GqivPVa7Brio!PryBHBgba8QeMFfAG7kw5V4WCdaId7c9nfo_1k7Uav-uXDxSwn5ga1aPYvhF_D6x$>
>>
>>
>>
>> On 3/10/20 12:11 AM, Philipp Kunz wrote:
>>> Hi Naoto and everyone,
>>>
>>> When I tried the regular expression approach you proposed which really
>>> looks delightful, ValueUtf8Coding test ran into a timeout, which is why
>>> I changed Pattern and Grapheme classes to start the loop in nextBoundary
>>> to start at the offset off rather than at the beginning of the character
>>> sequence src rendering isBoundary redundant. The changes to Pattern and
>>> Grapheme relate only by means of performance and could otherwise be
>>> applied seperately as well. I admit I'm not too confident and familiar
>>> with the regex parts of the openjdk and would welcome not only thorough
>>> review as usual but also any other feedback or thought shared about the
>>> matter.
>>>
>>> Then I started to wonder about an issue of copy paste between Grapheme
>>> and GraphemeTest already out of date and also about some comments in
>>> RegExTest and if no more matches shouldn't also be asserted, which also
>>> could possibly be changed in a different change but still relate to the
>>> actual changes to certain extent. I could not find a way to access
>>> UCD_FILES which has no or is in the default package.
>>>
>>> What about passing the name and the value parameters of
>>> Manifest.printLine72 to it as separate parameters? Because names cannot
>>> exceed 70 bytes and the allowed characters are limited to ones that are
>>> represented with one byte each in UTF-8 the name can/could always be
>>> printed without a line break. I don't know if Manifest.println72 can be
>>> removed or has to be retained and deprecated for compatibility.
>>>
>>> See attached patch.
>>>
>>> Looking forward to receiving any kind of feedback and regards,
>>> Philipp
>>>
>>>
>>>
>>>
>>>
>>> On Tue, 2020-02-11 at 14:20 -0800,
>>> naoto.sato at oracle.com
>>>  <mailto:naoto.sato at oracle.com>
>>>   wrote:
>>>> Hi,
>>>>
>>>> Can the code in Manifest.java be simplified using regex? E.g.,
>>>>
>>>> var gc = Pattern.compile("\\X");
>>>> gc.matcher(line).results()
>>>>        .map(MatchResult::group)
>>>>        .forEach(
>>>> 	// print till 72 bytes in UTF-8
>>>>        );
>>>>
>>>> Just a thought. If BreakIterator is preferred, it should take Locale.US
>>>> as an argument to the factory method, so that it would produce the same
>>>> result no matter what the default locale is.
>>>>
>>>> Naoto
>>>>
>>>> On 2/10/20 1:22 PM, Lance Andersen wrote:
>>>>> Hi all,
>>>>>
>>>>> Here is a webrev for the patch that Philipp is proposing which will make it easier to review:
>>>>> http://cr.openjdk.java.net/~lancea/6202130/webrev.00
>>>>>
>>>>>    <
>>>>> http://cr.openjdk.java.net/~lancea/6202130/webrev.00
>>>>>
>>>>>>
>>>>>
>>>>>> On Dec 26, 2019, at 11:50 AM, Philipp Kunz <
>>>>>> philipp.kunz at paratix.ch
>>>>>>  <mailto:philipp.kunz at paratix.ch>
>>>>>>
>>>>>>   <mailto:
>>>>>> philipp.kunz at paratix.ch
>>>>>>  <mailto:philipp.kunz at paratix.ch>
>>>>>> >
>>>>>>> wrote:
>>>>>>
>>>>>> Hi,
>>>>>> The specification says, a line break in a manifest can occur beforeor
>>>>>> after a Unicode character encoded in UTF-8.
>>>>>>>      ...>      value:         SPACE *otherchar newline
>>>>>> *continuation>      continuation:  SPACE *otherchar
>>>>>> newline>    ...>      otherchar:     any UTF-8 character except NUL, CR
>>>>>> and LF
>>>>>> The current implementation breaks manifest lines at 72 bytes regardless
>>>>>> ofhow the bytes around the break are part of a sequence of bytes
>>>>>> encoding acharacter. Code points may use up to four bytes when encoded
>>>>>> in UTF-8.Manifests with line breaks inside of sequences of bytes
>>>>>> encoding Unicodecharacters in UTF-8 with more than one bytes not only
>>>>>> are invalid UTF-8but also look ugly in text editors.For example, a
>>>>>> manifest could look like this:
>>>>>> import java.util.jar.Manifest;import java.util.jar.Attributes;import
>>>>>> static java.util.jar.Attributes.Name;
>>>>>> public class CharacterBrokenDemo1 {    public static void main(String[]
>>>>>> args) throws Exception{        Manifest mf = new
>>>>>> Manifest();        Attributes attrs =
>>>>>> mf.getMainAttributes();        attrs.put(Name.MANIFEST_VERSION,
>>>>>> "1.0");        attrs.put(new Name("Some-Key"),                  "Some
>>>>>> languages have decorated characters, " +                   "for
>>>>>> example: español"); // or
>>>>>> "espa\u00D1ol"        mf.write(System.out);    }}
>>>>>> Above code produces a result as follows with some unexpected question
>>>>>> markswhere the encoding is invalid:
>>>>>>>      Manifest-Version: 1.0>    Some-Key: Some languages have decorated
>>>>>> characters, for example: espa?>     ?ol
>>>>>> This is of course an example written with actual question marks to get
>>>>>> a validtext for this message. The trick here is that "Some-Key" to
>>>>>> "example :espa"amounts to exactly one byte less encoded in UTF-8 than
>>>>>> would fit on one linewith the 72 byte limit so that the subsequent
>>>>>> character encoded with two bytesgets broken inside of the sequence of
>>>>>> two bytes for this character across acontinuation line break.
>>>>>> When decoding the resulting bytes from UTF-8 as one whole string, the
>>>>>> twoquestion marks will not fit together again even if the line break
>>>>>> with thecontinuation space is removed. However, Manifest::read removes
>>>>>> the continuationline breaks ("\r\n ") before decoding the manifest
>>>>>> header value from UTF-8 andhence can reproduce the original value.
>>>>>> Characters encoded in UTF-8 can not only span up to four bytes for one
>>>>>> codepoint each, there are also combining characters or classes thereof
>>>>>> or combiningdiacritical marks or whatever the appropriate term could
>>>>>> be, that combine morethan one code point into what is usually
>>>>>> experienced and referred to as acharacter.
>>>>>> The term character really gets ambiguous at this point. I wouldn't know
>>>>>> whatthe specification actually refers to with that term "character". So
>>>>>> rather thandiving in too much specification or any sorts of theory,
>>>>>> let's look at anotherexample:
>>>>>> import java.util.jar.Manifest;import java.util.jar.Attributes;import
>>>>>> static java.util.jar.Attributes.Name;
>>>>>> public class DemoCharacterBroken2 {    public static void main(String[]
>>>>>> args) throws Exception{        Manifest mf = new
>>>>>> Manifest();        Attributes attrs =
>>>>>> mf.getMainAttributes();        attrs.put(Name.MANIFEST_VERSION,
>>>>>> "1.0");        attrs.put(new Name("Some-Key"), " ".repeat(53) +
>>>>>> "Angstro\u0308m");        mf.write(System.out);    }}
>>>>>> which produces console output as follows:
>>>>>>>      Manifest-Version: 1.0>    Some-
>>>>>> Key:                                                      Angstro>
>>>>>> ̈m
>>>>>> (In case this does not display well, the diaeresis is on the m on the
>>>>>> last line)
>>>>>> When the whole Manifest is decoded from UTF-8 as one big single string
>>>>>> andcontinuation line breaks are not removed until after UTF-8 decoding
>>>>>> the wholemanifest, the diaeresis (umlaut, two points above, u0308)
>>>>>> apparently kind ofjumps onto the following letter m because somehow it
>>>>>> cannot be combined withthe preceding space. The UTF-8 decoder (all of
>>>>>> my editors I tried, not onlyEclipse and its console view, also less,
>>>>>> gedit, cat and terminal) somehowtries to fix that but the diaeresis may
>>>>>> not necessarily jump back on the "o"where it originally belonged by
>>>>>> removing the continuation line break ("\r\n ")after UTF-8 decoding has
>>>>>> taken place, at least that did not work for me.
>>>>>> Hence, ideally combining diacritical marks should better not be
>>>>>> separated fromwhatever they combine with when breaking manifest lines
>>>>>> onto a continuationline. Such combinations, however, seem to be
>>>>>> unlimited in terms of number ofcode points combining into the same
>>>>>> "experienced" character. I was able tofind combinations that not only
>>>>>> exceed the limit of 72 bytes per line but alsoexceed the line buffer
>>>>>> size of 512 bytes in Manifest::read. These may be ratheruncommon but
>>>>>> still possible to my own surprise.
>>>>>> Next consideration would then be to remove that limit of 512 bytes per
>>>>>> manifestline but exceeding it would make such manifests incompatible
>>>>>> with previousManifest::read implementations and is not really an
>>>>>> immediately availableoption at the moment.
>>>>>> As a compromise, those characters including combining diacritical marks
>>>>>> whichcombine only so many code points as that their binarily encoded
>>>>>> form in UTF-8remains within a limit of 71 bytes can be written without
>>>>>> an interruptingcontinuation line break, which applies to most cases,
>>>>>> but not all. I guess thisshould suit practically and realistically to
>>>>>> be expected values well.
>>>>>> Another possibility would be to allow for characters that are
>>>>>> combinations ofmultiple Unicode code points to be kept together in
>>>>>> their encoded form in UTF-8up to 512 bytes line length limit when
>>>>>> reading minus a space and a line breakamounting to 509 bytes, but that
>>>>>> would still not make manifests be representedas valid Unicode in all
>>>>>> corner cases and I guess would not probably make a realimprovement in
>>>>>> practice over a limit of 71 bytes.
>>>>>> Attached is a patch that tries to implement what was described above
>>>>>> using aBreakIterator. While it works from a functional point of view,
>>>>>> this might beless desirable performance-wise. Alternatively could be
>>>>>> considered to do withoutthe BreakIterator and only keep Unicode code
>>>>>> points together by not placingline breaks before a continuation byte,
>>>>>> which however would not addresscombining diacritical marks as in the
>>>>>> second example above.
>>>>>> The jar file specification does not explicitly state that manifest
>>>>>> should bevalid UTF-8, and they were not always, but it also does not
>>>>>> state otherwise,leaving an impression that manifests could be
>>>>>> (mis)taken for UTF-8 encodedstrings, which they also are in many or
>>>>>> most cases and which has been confusedmany times. At the moment, the
>>>>>> only case where a valid manifest is not also avalid UTF-8 encoded
>>>>>> string is when a sequence of bytes encoding the samecharacter happens
>>>>>> to be interrupted with a continuation line break. To the bestof my
>>>>>> knowledge, all other valid manifests are also valid UTF-8 encoded
>>>>>> strings.
>>>>>> It would be nice, if manifests could be viewed and manipulated with all
>>>>>> Unicodecapable editors and not only be parsed correctly with
>>>>>> Manifest::read.
>>>>>> Any opinions? Would someone sponsor this patch?
>>>>>> Regards,Philipp
>>>>>>
>>>>>> https://docs.oracle.com/en/java/javase/13/docs/specs/jar/jar.html#specificationhttps://bugs.openjdk.java.net/browse/JDK-6202130https://bugs.openjdk.java.net/browse/JDK-6443578https://github.com/gradle/gradle/issues/5225https://bugs.openjdk.java.net/browse/JDK-8202525https://en.wikipedia.org/wiki/Combining_character
>>>>>>
>>>>>>
>>>>>>
>>>>>> <6202130-manifestutf8linebreak.patch>
>>>>>
>>>>>     <
>>>>> http://oracle.com/us/design/oracle-email-sig-198324.gif
>>>>>
>>>>>>
>>>>>     <
>>>>> http://oracle.com/us/design/oracle-email-sig-198324.gif
>>>>>
>>>>>> <
>>>>> http://oracle.com/us/design/oracle-email-sig-198324.gif
>>>>>
>>>>>>
>>>>>     <
>>>>> http://oracle.com/us/design/oracle-email-sig-198324.gif>Lance
>>>>>
>>>>>    Andersen| Principal Member of Technical Staff | +1.781.442.2037
>>>>> Oracle Java Engineering
>>>>> 1 Network Drive
>>>>> Burlington, MA 01803
>>>>> Lance.Andersen at oracle.com
>>>>>  <mailto:Lance.Andersen at oracle.com>
>>>>>
>>>>>   <mailto:
>>>>> Lance.Andersen at oracle.com
>>>>>  <mailto:Lance.Andersen at oracle.com>
>>>>> >
>>>>>    <mailto:
>>>>> Lance.Andersen at oracle.com
>>>>>  <mailto:Lance.Andersen at oracle.com>
>>>>>
>>>>>   <mailto:
>>>>> Lance.Andersen at oracle.com
>>>>>  <mailto:Lance.Andersen at oracle.com>
>>>>> >
>>>>>>
>>>>>
>>>>>
>>>>>