6202130: Need to handle UTF-8 values and break up lines longer than 72 bytes

Philipp Kunz philipp.kunz at paratix.ch
Mon Apr 13 17:29:46 UTC 2020


Hi Naoto,
You are absolutely right to raise the question. I have thought about
this as well but have not yet come up with a compellingly elegant
solution.
If only String.isLatin1 were public, that would come in very handy.
Neither Character nor anything else I looked at can tell whether a
string is ASCII. BreakIterator supports iterating backwards, so we
could start at the potential line end, but that requires a start
position that is itself a boundary, which is not obviously possible
because of regional indicators and probably other code points that
require stateful processing. The same applies to regexes, and I would
not know how to express groups that count bytes. It does not even seem
possible to guess a number of characters from which to start searching
for a boundary, because of the statefulness. Even the simplest approach
to detecting Latin-1 strings requires an encoding or a regex such as
"[^\\p{ASCII}]", which essentially is another inefficient loop. Encoding
the whole string into UTF-8 in a single pass does not work either,
because then it is not known which grapheme boundary corresponds to
which byte position. I also tried writing the header names and the ": "
delimiter without breaking first, but that did not seem to affect
performance significantly.
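To illustrate the per-grapheme approach discussed above, here is a
minimal Java sketch. It is not taken from the patch, and the class and
method names are hypothetical. It walks a value with
BreakIterator.getCharacterInstance(Locale.ROOT) and encodes each
grapheme cluster on its own, so every boundary maps to a known byte
count. The String.getBytes call per cluster is the cost the performance
question is about.

```java
import java.nio.charset.StandardCharsets;
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class GraphemeBytes {

    /** Returns the UTF-8 byte length of each grapheme cluster of value, in order. */
    static List<Integer> graphemeByteLengths(String value) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(value);
        List<Integer> lengths = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // One encoding call per cluster: simple, but costly for long values.
            byte[] bytes = value.substring(start, end).getBytes(StandardCharsets.UTF_8);
            lengths.add(bytes.length);
        }
        return lengths;
    }

    public static void main(String[] args) {
        // 'a' takes 1 byte, U+00E9 takes 2 bytes, U+20AC takes 3 bytes.
        System.out.println(graphemeByteLengths("a\u00e9\u20ac")); // prints [1, 2, 3]
    }
}
```

A line-breaking writer could sum these lengths and start a continuation
line before a cluster that would push the line past 72 bytes.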
UTF_8.newEncoder cannot process single (unpaired) surrogates, which is
admittedly an edge case but one that has to be handled for
compatibility. I added a fast path (see patch), the best I could think
of. Did I miss a better way to tell ASCII strings from others?
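As a sketch of the two points above (an assumption about what such
checks could look like, not the code in the patch): a plain loop that
tests whether a string is ASCII only, and a demonstration that a
default UTF-8 CharsetEncoder rejects an unpaired surrogate while
String.getBytes substitutes a replacement byte. The class and method
names are hypothetical.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public class AsciiFastPath {

    /** True if every char is below 0x80, so chars and UTF-8 bytes map one to one. */
    static boolean isAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) >= 0x80) {
                return false;
            }
        }
        return true;
    }

    /** True if a fresh UTF-8 encoder with default error actions rejects the input. */
    static boolean encoderRejects(String s) {
        try {
            StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(s));
            return false;
        } catch (CharacterCodingException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(isAscii("Manifest-Version: 1.0"));
        System.out.println(isAscii("caf\u00e9"));
        // A lone high surrogate: the encoder reports it as malformed input,
        // whereas String.getBytes silently replaces it with a single '?' byte.
        String lone = "\ud800";
        System.out.println(encoderRejects(lone));
        System.out.println(lone.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

For an ASCII-only value the byte position equals the char index, so
lines can be broken by index without any grapheme analysis.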
What I found to actually improve performance is based on the
consideration that strings are composed of primitive chars, each of
which is always encoded into at most three bytes, so that up to 24
characters can be passed to writeChar72 at a time, reducing the number
of writeChar72 and, in turn, String.getBytes invocations. The number of
characters that can be passed to writeChar72 is the number of bytes
remaining on the current manifest line (linePos) divided by three, but
at least one. See attached patch.
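The arithmetic can be sketched as follows. This is an illustration
only, not the code from the patch. It assumes that linePos counts the
bytes already written on the current line and that a line holds 72
bytes. The class, constant, and method names are hypothetical.

```java
public class ChunkSize {

    static final int MAX_LINE_BYTES = 72;    // manifest line length limit in bytes
    static final int MAX_BYTES_PER_CHAR = 3; // a single Java char encodes to at most 3 UTF-8 bytes

    /**
     * Number of chars that are guaranteed to fit in the rest of the line,
     * but never fewer than one so that the writer always makes progress.
     * Assumes linePos is the number of bytes already on the line.
     */
    static int maxCharsPerCall(int linePos) {
        int remainingBytes = MAX_LINE_BYTES - linePos;
        return Math.max(1, remainingBytes / MAX_BYTES_PER_CHAR);
    }

    public static void main(String[] args) {
        System.out.println(maxCharsPerCall(0));  // prints 24
        System.out.println(maxCharsPerCall(36)); // prints 12
        System.out.println(maxCharsPerCall(71)); // prints 1
    }
}
```

A supplementary code point is two chars and four bytes, which is still
under three bytes per char, so the bound holds for surrogate pairs too.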
Regards,
Philipp

On Mon, 2020-03-30 at 14:31 -0700, naoto.sato at oracle.com wrote:
> Hi Philipp,
> Sorry for the delay.
> On 3/25/20 11:52 AM, Philipp Kunz wrote:
> > Hi Naoto,
> > See another patch attached with Locale.ROOT for the BreakIterator.
> > I will be glad to hear of any feedback.
> 
> I wonder how your change affects the performance, as it will do
> String.getBytes(UTF-8) per each character. I think this can
> definitely be improved by adding some fastpath, e.g., for ASCII. The
> usage of the BreakIterator is fine, though.
> > There is another patch [1] around dealing with manifest attributes
> > during application launch. It certainly is related to 6202130 but
> > feels like a distinct set of changes with a slightly different
> > concern. Any opinion as how to proceed with that one?
> 
> I am not quite sure which patch you are referring to, but I agree
> that creating an issue would not hurt.
> Naoto
> > Regards,
> > Philipp
> > 
> > 
> > [1] 
> > https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-February/064720.html
> > 
> > 
> > On Mon, 2020-03-23 at 09:05 -0700, naoto.sato at oracle.com wrote:
> > > Hi Philipp,
> > > Right, Locale.ROOT is more appropriate here by definition, though
> > > the implementation is the same.
> > > Naoto
> > > On 3/21/20 5:19 AM, Philipp Kunz wrote:
> > > > Hi Naoto and everyone,
> > > > There are almost as many occurrences of Locale.ROOT as
> > > > Locale.US which made me wonder which one is more appropriately
> > > > locale-independent and which is probably another topic and not
> > > > actually relevant here because
> > > > BreakIterator.getCharacterInstance is locale-agnostic as far as
> > > > I can tell.
> > > > See attached patch with another attempt to fix bug 6202130.
> > > > Regards,
> > > > Philipp
> > > > 
> > > > On Tue, 2020-03-10 at 10:45 -0700, naoto.sato at oracle.com wrote:
> > > > > Hi Philipp,
> > > > > ..., so using BreakIterator (with Locale.US) is more preferred
> > > > > solution to me.
> > > > > Naoto


-------------- next part --------------
A non-text attachment was scrubbed...
Name: 20200413-bug6202130-manifestutf8.patch
Type: text/x-patch
Size: 60361 bytes
Desc: not available
URL: <https://mail.openjdk.java.net/pipermail/core-libs-dev/attachments/20200413/91b6769c/20200413-bug6202130-manifestutf8-0001.patch>

