6202130: Need to handle UTF-8 values and break up lines longer than 72 bytes
Philipp Kunz
philipp.kunz at paratix.ch
Mon May 18 21:42:43 UTC 2020
Hi Brent,
Thank you very much indeed for your serious answer.
Still a lot of work left but to start with I randomly picked Ant and
found it suffers a bug. See attached bunch of files to reproduce.
The bug does not occur with my proposed patch. One reason for this bug
to occur is cutting and concatenating binary portions of manifests as
strings which don't properly align with Unicode character boundaries
which just does not work and my patch avoids. If I may share an opinion
I'd prefer to avoid manifests which are not also valid UTF-8 in the
first place.
Regards,Philipp
On Fri, 2020-05-08 at 11:04 -0700, Brent Christian wrote:
> Hi, Philipp. Sorry for the further delay.
> Just to step back a bit:
> The current manifest-writing behavior dates back to JDK 1.2 (at
> least), so compatibility is an overriding concern. It is unfortunate
> that multi-byte characters weren't better accounted for in the
> placement of line breaks from the beginning.
> But typically in cases like this, we would update the specification
> to fit the long-standing behavior, and avoid the risk of breaking
> applications in the wild.
> We can test that manifests written with your changes can be read by
> the JDK. But there is also manifest-reading code external to the
> JDK. Some examples, that would need to be investigated:
> jarmanifesthttps://pypi.org/project/jarmanifest/
>
> Chromium
> https://chromium.googlesource.com/external/googleappengine/python/+/7e0ab775c587657f0f93a3134f2db99e46bb98bd/google/appengine/tools/jarfile.py
>
> Mavenhttp://maven.apache.org/shared/maven-archiver/index.html
>
> Apache Ant
> https://en.wikipedia.org/wiki/JAR_(file_format)#Apache_Ant_Zip/JAR_support
>
> IDEs would be another good place to check, and maybe also consumers
> of JavaEE WAR / EAR files.
>
> Performance is also a consideration. JMH benchmarks would be needed
> to confirm that reading the new manifest is not slower, and to gauge
> any regression in Manifest writing performance.
>
> So that is the work that would need to be done to support this
> change.
> The question then will be if the benefit of this change (seen only
> outside of running code) outweighs the risk of changing behavior that
> has not been an issue for 20+ years. It seems unlikely that a strong
> enough case could be made.
> -Brent
> On 4/13/20 10:29 AM, Philipp Kunz wrote:
> > Hi Naoto,You are absolutely right to raise the question. I've also
> > thought aboutthis but haven't come up so far with a compellingly
> > elegant solution,at least not yet.If only String.isLatin1 was
> > public that would come in very handy.Character or anything else I
> > looked at cannot tell if a string isascii. BreakIterator supports
> > iterating backwards so we could start atthe potential line end but
> > that requires a start position that is aboundary to start with and
> > that is not obviously possible due to theregional indicators and
> > probably other code points requiring statefulprocessing. Same with
> > regexes and I wouldn't know how to express groupsthat could count
> > bytes. It does not even seem to be possible to guessany number of
> > characters to start searching for a boundary because ofthe
> > statefulness. Even the most simple approach to detect latin1Strings
> > requires an encoding or a regex such as "[^\\p{ASCII}]"
> > whichessentially is another inefficient loop. It also does not work
> > toencode the string into UTF-8 in a single pass because then it is
> > notknown which grapheme boundary matches to which byte position. I
> > alsotried to write the header names and the ": " delimiter without
> > breakingfirst but it did not seem to significantly affect
> > performance.UTF_8.newEncoder cannot process single surrogates,
> > admittedly an edgecase, but required for compatibility. I added a
> > fast path, see patch,the best way I could think of. Did I miss a
> > better way to tell asciistrings from others?What I found actually
> > improving performance is based on theconsideration that strings are
> > composed of primitive chars that will beencoded into three bytes
> > each maximum always that up to 24 characterscan be passed to
> > writeChar72 at a time reducing the number ofwriteChar72 and in turn
> > String.getBytes invocations. The number ofcharacters that can be
> > passed to writeChar72 is at most the number ofbytes remaining on
> > the current manifest line (linePos) divided by threebut at least
> > one. See attached patch.Regards,Philipp
> > On Mon, 2020-03-30 at 14:31 -0700, naoto.sato at oracle.com wrote:
> > > Hi Philipp,Sorry for the delay.On 3/25/20 11:52 AM, Philipp Kunz
> > > wrote:
> > > > Hi Naoto,See another patch attached with Locale.ROOT for the
> > > > BreakIterator.I will be glad to hear of any feedback.
> > >
> > > I wonder how your change affects the performance, as it will
> > > doString.getBytes(UTF-8) per each character. I think this
> > > candefinitely be improved by adding some fastpath, e.g., for
> > > ASCII. Theusage of the BreakIterator is fine, though.
> > > > There is another patch [1] around dealing with manifest
> > > > attributesduring application launch. It certainly is related to
> > > > 6202130 butfeels like a distinct set of changes with a slightly
> > > > differentconcern. Any opinion as how to proceed with that one?
> > >
> > > I am not quite sure which patch you are referring to, but I
> > > agreethat creating an issue would not hurt.Naoto
> > > > Regards,Philipp
> > > >
> > > > [1]
> > > > https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-February/064720.html
> > > >
> > > >
> > > > On Mon, 2020-03-23 at 09:05 -0700, naoto.sato at oracle.com wrote:
> > > > > Hi Philipp,Right, Locale.ROOT is more appropriate here by
> > > > > definition, thoughtheimplementation is the same.NaotoOn
> > > > > 3/21/20 5:19 AM, Philipp Kunz wrote:
> > > > > > Hi Naoto and everyone,There are almost as many occurrences
> > > > > > of Locale.ROOT asLocale.US whichmade me wonder which one is
> > > > > > more appropriatelylocale-independent andwhich is probably
> > > > > > another topic and notactually relevant
> > > > > > herebecauseBreakIterator.getCharacterInstance is locale-
> > > > > > agnostic asfar as I can tell.See attached patch with
> > > > > > another attempt to fix bug 6202130.Regards,Philipp
> > > > > > On Tue, 2020-03-10 at 10:45 -0700,naoto.sato at oracle.com
> > > > > > <mailto:naoto.sato at oracle.com> wrote:
> > > > > > > Hi Philipp,..., so using BreakIterator (withLocale.US) is
> > > > > > > more preferredsolution to me.Naoto
More information about the core-libs-dev
mailing list