6202130: Need to handle UTF-8 values and break up lines longer than 72 bytes

Fri May 8 18:04:32 UTC 2020

Hi, Philipp.  Sorry for the further delay.

Just to step back a bit:

The current manifest-writing behavior dates back to JDK 1.2 (at least), 
so compatibility is an overriding concern.  It is unfortunate that 
multi-byte characters weren't better accounted for in the placement of 
line breaks from the beginning.

But typically in cases like this, we would update the specification to 
fit the long-standing behavior, and avoid the risk of breaking 
applications in the wild.

We can test that manifests written with your changes can be read by the 
JDK.  But there is also manifest-reading code external to the JDK.  Some 
examples, that would need to be investigated:

jarmanifest
https://pypi.org/project/jarmanifest/

Chromium
https://chromium.googlesource.com/external/googleappengine/python/+/7e0ab775c587657f0f93a3134f2db99e46bb98bd/google/appengine/tools/jarfile.py

Maven
http://maven.apache.org/shared/maven-archiver/index.html

Apache Ant
https://en.wikipedia.org/wiki/JAR_(file_format)#Apache_Ant_Zip/JAR_support

IDEs would be another good place to check, and maybe also consumers of 
JavaEE WAR / EAR files.

Performance is also a consideration.  JMH benchmarks would be needed to 
confirm that reading the new manifest is not slower, and to gauge any 
regression in Manifest writing performance.

So that is the work that would need to be done to support this change.

The question then will be if the benefit of this change (seen only 
outside of running code) outweighs the risk of changing behavior that 
has not been an issue for 20+ years.  It seems unlikely that a strong 
enough case could be made.

-Brent

On 4/13/20 10:29 AM, Philipp Kunz wrote:
> Hi Naoto,
> You are absolutely right to raise the question. I've also thought about
> this but haven't come up so far with a compellingly elegant solution,
> at least not yet.
> If only String.isLatin1 was public that would come in very handy.
> Character or anything else I looked at cannot tell if a string is
> ascii. BreakIterator supports iterating backwards so we could start at
> the potential line end but that requires a start position that is a
> boundary to start with and that is not obviously possible due to the
> regional indicators and probably other code points requiring stateful
> processing. Same with regexes and I wouldn't know how to express groups
> that could count bytes. It does not even seem to be possible to guess
> any number of characters to start searching for a boundary because of
> the statefulness. Even the most simple approach to detect latin1
> Strings requires an encoding or a regex such as "[^\\p{ASCII}]" which
> essentially is another inefficient loop. It also does not work to
> encode the string into UTF-8 in a single pass because then it is not
> known which grapheme boundary matches to which byte position. I also
> tried to write the header names and the ": " delimiter without breaking
> first but it did not seem to significantly affect performance.
> UTF_8.newEncoder cannot process single surrogates, admittedly an edge
> case, but required for compatibility. I added a fast path, see patch,
> the best way I could think of. Did I miss a better way to tell ascii
> strings from others?
> What I found actually improving performance is based on the
> consideration that strings are composed of primitive chars that will be
> encoded into three bytes each maximum always that up to 24 characters
> can be passed to writeChar72 at a time reducing the number of
> writeChar72 and in turn String.getBytes invocations. The number of
> characters that can be passed to writeChar72 is at most the number of
> bytes remaining on the current manifest line (linePos) divided by three
> but at least one. See attached patch.
> Regards,Philipp
> 
> On Mon, 2020-03-30 at 14:31 -0700, naoto.sato at oracle.com wrote:
>> Hi Philipp,
>> Sorry for the delay.
>> On 3/25/20 11:52 AM, Philipp Kunz wrote:
>>> Hi Naoto,
>>> See another patch attached with Locale.ROOT for the BreakIterator.
>>> I will be glad to hear of any feedback.
>>
>> I wonder how your change affects the performance, as it will do
>> String.getBytes(UTF-8) per each character. I think this can
>> definitely be improved by adding some fastpath, e.g., for ASCII. The
>> usage of the BreakIterator is fine, though.
>>> There is another patch [1] around dealing with manifest attributes
>>> during application launch. It certainly is related to 6202130 but
>>> feels like a distinct set of changes with a slightly different
>>> concern. Any opinion as how to proceed with that one?
>>
>> I am not quite sure which patch you are referring to, but I agree
>> that creating an issue would not hurt.
>> Naoto
>>> Regards,Philipp
>>>
>>>
>>> [1]
>>> https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-February/064720.html
>>>
>>>
>>> On Mon, 2020-03-23 at 09:05 -0700, naoto.sato at oracle.com wrote:
>>>> Hi Philipp,
>>>> Right, Locale.ROOT is more appropriate here by definition, though
>>>> theimplementation is the same.
>>>> Naoto
>>>> On 3/21/20 5:19 AM, Philipp Kunz wrote:
>>>>> Hi Naoto and everyone,
>>>>> There are almost as many occurrences of Locale.ROOT as
>>>>> Locale.US whichmade me wonder which one is more appropriately
>>>>> locale-independent andwhich is probably another topic and not
>>>>> actually relevant here
>>>>> becauseBreakIterator.getCharacterInstance is locale-agnostic as
>>>>> far as I can tell.
>>>>> See attached patch with another attempt to fix bug 6202130.
>>>>> Regards,Philipp
>>>>>
>>>>> On Tue, 2020-03-10 at 10:45 -0700,naoto.sato at oracle.com
>>>>>   <mailto:naoto.sato at oracle.com>  wrote:
>>>>>> Hi Philipp,
>>>>>> ..., so using BreakIterator (withLocale.US) is more preferred
>>>>>> solution to me.
>>>>>> Naoto
> 
>