6202130: Need to handle UTF-8 values and break up lines longer than 72 bytes
naoto.sato at oracle.com
naoto.sato at oracle.com
Tue Mar 10 17:45:00 UTC 2020
Hi Philipp,
The reason that the current implementation of Grapheme boundary check
does it is that, some rules require information before the boundary,
i.e. the rule GB12 in Unicode Segmentation [1] reads:
"do not break between regional indicator (RI) symbols if there is an odd
number of RI characters *before* the break point."
Of course this itself can be improved, but I am not so keen on modifying
Grapheme/Pattern code at this time, so using BreakIterator (with
Locale.US) is more preferred solution to me.
Naoto
[1] https://unicode.org/reports/tr29/
On 3/10/20 12:11 AM, Philipp Kunz wrote:
> Hi Naoto and everyone,
>
> When I tried the regular expression approach you proposed which really
> looks delightful, ValueUtf8Coding test ran into a timeout, which is why
> I changed Pattern and Grapheme classes to start the loop in nextBoundary
> to start at the offset off rather than at the beginning of the character
> sequence src rendering isBoundary redundant. The changes to Pattern and
> Grapheme relate only by means of performance and could otherwise be
> applied seperately as well. I admit I'm not too confident and familiar
> with the regex parts of the openjdk and would welcome not only thorough
> review as usual but also any other feedback or thought shared about the
> matter.
>
> Then I started to wonder about an issue of copy paste between Grapheme
> and GraphemeTest already out of date and also about some comments in
> RegExTest and if no more matches shouldn't also be asserted, which also
> could possibly be changed in a different change but still relate to the
> actual changes to certain extent. I could not find a way to access
> UCD_FILES which has no or is in the default package.
>
> What about passing the name and the value parameters of
> Manifest.printLine72 to it as separate parameters? Because names cannot
> exceed 70 bytes and the allowed characters are limited to ones that are
> represented with one byte each in UTF-8 the name can/could always be
> printed without a line break. I don't know if Manifest.println72 can be
> removed or has to be retained and deprecated for compatibility.
>
> See attached patch.
>
> Looking forward to receiving any kind of feedback and regards,
> Philipp
>
>
>
>
>
> On Tue, 2020-02-11 at 14:20 -0800, naoto.sato at oracle.com wrote:
>> Hi,
>>
>> Can the code in Manifest.java be simplified using regex? E.g.,
>>
>> var gc = Pattern.compile("\\X");
>> gc.matcher(line).results()
>> .map(MatchResult::group)
>> .forEach(
>> // print till 72 bytes in UTF-8
>> );
>>
>> Just a thought. If BreakIterator is preferred, it should take Locale.US
>> as an argument to the factory method, so that it would produce the same
>> result no matter what the default locale is.
>>
>> Naoto
>>
>> On 2/10/20 1:22 PM, Lance Andersen wrote:
>>> Hi all,
>>>
>>> Here is a webrev for the patch that Philipp is proposing which will make it easier to review:
>>> http://cr.openjdk.java.net/~lancea/6202130/webrev.00
>>> <
>>> http://cr.openjdk.java.net/~lancea/6202130/webrev.00
>>> >
>>>
>>>> On Dec 26, 2019, at 11:50 AM, Philipp Kunz <
>>>> philipp.kunz at paratix.ch
>>>> <mailto:philipp.kunz at paratix.ch>
>>>> > wrote:
>>>>
>>>> Hi,
>>>> The specification says, a line break in a manifest can occur beforeor
>>>> after a Unicode character encoded in UTF-8.
>>>>> ...> value: SPACE *otherchar newline
>>>> *continuation> continuation: SPACE *otherchar
>>>> newline> ...> otherchar: any UTF-8 character except NUL, CR
>>>> and LF
>>>> The current implementation breaks manifest lines at 72 bytes regardless
>>>> ofhow the bytes around the break are part of a sequence of bytes
>>>> encoding acharacter. Code points may use up to four bytes when encoded
>>>> in UTF-8.Manifests with line breaks inside of sequences of bytes
>>>> encoding Unicodecharacters in UTF-8 with more than one bytes not only
>>>> are invalid UTF-8but also look ugly in text editors.For example, a
>>>> manifest could look like this:
>>>> import java.util.jar.Manifest;import java.util.jar.Attributes;import
>>>> static java.util.jar.Attributes.Name;
>>>> public class CharacterBrokenDemo1 { public static void main(String[]
>>>> args) throws Exception{ Manifest mf = new
>>>> Manifest(); Attributes attrs =
>>>> mf.getMainAttributes(); attrs.put(Name.MANIFEST_VERSION,
>>>> "1.0"); attrs.put(new Name("Some-Key"), "Some
>>>> languages have decorated characters, " + "for
>>>> example: español"); // or
>>>> "espa\u00D1ol" mf.write(System.out); }}
>>>> Above code produces a result as follows with some unexpected question
>>>> markswhere the encoding is invalid:
>>>>> Manifest-Version: 1.0> Some-Key: Some languages have decorated
>>>> characters, for example: espa?> ?ol
>>>> This is of course an example written with actual question marks to get
>>>> a validtext for this message. The trick here is that "Some-Key" to
>>>> "example :espa"amounts to exactly one byte less encoded in UTF-8 than
>>>> would fit on one linewith the 72 byte limit so that the subsequent
>>>> character encoded with two bytesgets broken inside of the sequence of
>>>> two bytes for this character across acontinuation line break.
>>>> When decoding the resulting bytes from UTF-8 as one whole string, the
>>>> twoquestion marks will not fit together again even if the line break
>>>> with thecontinuation space is removed. However, Manifest::read removes
>>>> the continuationline breaks ("\r\n ") before decoding the manifest
>>>> header value from UTF-8 andhence can reproduce the original value.
>>>> Characters encoded in UTF-8 can not only span up to four bytes for one
>>>> codepoint each, there are also combining characters or classes thereof
>>>> or combiningdiacritical marks or whatever the appropriate term could
>>>> be, that combine morethan one code point into what is usually
>>>> experienced and referred to as acharacter.
>>>> The term character really gets ambiguous at this point. I wouldn't know
>>>> whatthe specification actually refers to with that term "character". So
>>>> rather thandiving in too much specification or any sorts of theory,
>>>> let's look at anotherexample:
>>>> import java.util.jar.Manifest;import java.util.jar.Attributes;import
>>>> static java.util.jar.Attributes.Name;
>>>> public class DemoCharacterBroken2 { public static void main(String[]
>>>> args) throws Exception{ Manifest mf = new
>>>> Manifest(); Attributes attrs =
>>>> mf.getMainAttributes(); attrs.put(Name.MANIFEST_VERSION,
>>>> "1.0"); attrs.put(new Name("Some-Key"), " ".repeat(53) +
>>>> "Angstro\u0308m"); mf.write(System.out); }}
>>>> which produces console output as follows:
>>>>> Manifest-Version: 1.0> Some-
>>>> Key: Angstro>
>>>> ̈m
>>>> (In case this does not display well, the diaeresis is on the m on the
>>>> last line)
>>>> When the whole Manifest is decoded from UTF-8 as one big single string
>>>> andcontinuation line breaks are not removed until after UTF-8 decoding
>>>> the wholemanifest, the diaeresis (umlaut, two points above, u0308)
>>>> apparently kind ofjumps onto the following letter m because somehow it
>>>> cannot be combined withthe preceding space. The UTF-8 decoder (all of
>>>> my editors I tried, not onlyEclipse and its console view, also less,
>>>> gedit, cat and terminal) somehowtries to fix that but the diaeresis may
>>>> not necessarily jump back on the "o"where it originally belonged by
>>>> removing the continuation line break ("\r\n ")after UTF-8 decoding has
>>>> taken place, at least that did not work for me.
>>>> Hence, ideally combining diacritical marks should better not be
>>>> separated fromwhatever they combine with when breaking manifest lines
>>>> onto a continuationline. Such combinations, however, seem to be
>>>> unlimited in terms of number ofcode points combining into the same
>>>> "experienced" character. I was able tofind combinations that not only
>>>> exceed the limit of 72 bytes per line but alsoexceed the line buffer
>>>> size of 512 bytes in Manifest::read. These may be ratheruncommon but
>>>> still possible to my own surprise.
>>>> Next consideration would then be to remove that limit of 512 bytes per
>>>> manifestline but exceeding it would make such manifests incompatible
>>>> with previousManifest::read implementations and is not really an
>>>> immediately availableoption at the moment.
>>>> As a compromise, those characters including combining diacritical marks
>>>> whichcombine only so many code points as that their binarily encoded
>>>> form in UTF-8remains within a limit of 71 bytes can be written without
>>>> an interruptingcontinuation line break, which applies to most cases,
>>>> but not all. I guess thisshould suit practically and realistically to
>>>> be expected values well.
>>>> Another possibility would be to allow for characters that are
>>>> combinations ofmultiple Unicode code points to be kept together in
>>>> their encoded form in UTF-8up to 512 bytes line length limit when
>>>> reading minus a space and a line breakamounting to 509 bytes, but that
>>>> would still not make manifests be representedas valid Unicode in all
>>>> corner cases and I guess would not probably make a realimprovement in
>>>> practice over a limit of 71 bytes.
>>>> Attached is a patch that tries to implement what was described above
>>>> using aBreakIterator. While it works from a functional point of view,
>>>> this might beless desirable performance-wise. Alternatively could be
>>>> considered to do withoutthe BreakIterator and only keep Unicode code
>>>> points together by not placingline breaks before a continuation byte,
>>>> which however would not addresscombining diacritical marks as in the
>>>> second example above.
>>>> The jar file specification does not explicitly state that manifest
>>>> should bevalid UTF-8, and they were not always, but it also does not
>>>> state otherwise,leaving an impression that manifests could be
>>>> (mis)taken for UTF-8 encodedstrings, which they also are in many or
>>>> most cases and which has been confusedmany times. At the moment, the
>>>> only case where a valid manifest is not also avalid UTF-8 encoded
>>>> string is when a sequence of bytes encoding the samecharacter happens
>>>> to be interrupted with a continuation line break. To the bestof my
>>>> knowledge, all other valid manifests are also valid UTF-8 encoded
>>>> strings.
>>>> It would be nice, if manifests could be viewed and manipulated with all
>>>> Unicodecapable editors and not only be parsed correctly with
>>>> Manifest::read.
>>>> Any opinions? Would someone sponsor this patch?
>>>> Regards,Philipp
>>>>
>>>> https://docs.oracle.com/en/java/javase/13/docs/specs/jar/jar.html#specificationhttps://bugs.openjdk.java.net/browse/JDK-6202130https://bugs.openjdk.java.net/browse/JDK-6443578https://github.com/gradle/gradle/issues/5225https://bugs.openjdk.java.net/browse/JDK-8202525https://en.wikipedia.org/wiki/Combining_character
>>>>
>>>>
>>>> <6202130-manifestutf8linebreak.patch>
>>>
>>> <
>>> http://oracle.com/us/design/oracle-email-sig-198324.gif
>>> >
>>> <
>>> http://oracle.com/us/design/oracle-email-sig-198324.gif
>>> > <
>>> http://oracle.com/us/design/oracle-email-sig-198324.gif
>>> >
>>> <
>>> http://oracle.com/us/design/oracle-email-sig-198324.gif>Lance
>>> Andersen| Principal Member of Technical Staff | +1.781.442.2037
>>> Oracle Java Engineering
>>> 1 Network Drive
>>> Burlington, MA 01803
>>> Lance.Andersen at oracle.com
>>> <mailto:Lance.Andersen at oracle.com>
>>> <mailto:
>>> Lance.Andersen at oracle.com
>>> <mailto:Lance.Andersen at oracle.com>
>>> >
>>>
>>>
>>>
More information about the core-libs-dev
mailing list