RFR for 6443578 and 6202130: UTF-8 in Manifests

Thu May 3 01:12:22 UTC 2018

Hi,

Here is patch for 6443578 and 6202130 also in webrev form.

http://files.paratix.ch/jdk/6372077and6443578/webrev.01/
http://files.paratix.ch/jdk/6372077and6443578/webrev.01.zip

Hope it helps. With all the patience, can I do anything to make it 
easier to get feedback or find a sponsor?

Regards,
Philipp

On 02.05.2018 07:21, Philipp Kunz wrote:
> Hi,
>
> Recently, I tried to fix only bug 6202130 with the intention to fix 
> bug 6443578 later with the intention to get some opportunity for 
> feedback, but haven't got any, and propose now a fix for both together 
> which in my opinion makes more sense.
>
> See attached patch.
>
> Some considerations, assumptions, and explanations
>
>   * In my opinion, the code for writing manifests was distributed in
>     the two classes Attributes and Manifest in an elegant way but
>     somewhat difficult to explain the coherence. I chose to group the
>     code that writes manifests into a new class ManifestWriter. The
>     main incentive for that was to prevent or reduce duplicated code I
>     would have had to change twice otherwise. This also results in a
>     source file of a suitable size.
>   * I could not support the assumption that the write and writeMain
>     methods in Attributes couldn't be referenced anywhere so I
>     deprecated them rather than having them removed.
>   * I assumed the patch will not make it into JDK 10 and, hence, the
>     deprecated annotations are attributed with since = 11.
>   * I could not figure out any reason for the use of DataOutputStream
>     and did not use it.
>   * Performance-wise I assume that the code is approximately
>     comparable to the previous version. The biggest improvement in
>     this respect I hope comes from removing the String that contains
>     the byte array constructed with deprecated String(byte[], int,
>     int, int) and then copying it over again to a StringBuffer and
>     from there to a String again and then Characters. On the other
>     hand, keeping whole characters together when breaking lines might
>     make it slightly slower. I hope my changes are an overall
>     improvement, but I haven't measured it.
>   * For telling first from continuation bytes of utf-8 characters
>     apart I re-used a method isNotUtfContinuationByte from either
>     StringCoding or UTF_8.Decoder. Unfortunately I found no way not to
>     duplicate it.
>   * Where it said before "XXX Need to handle UTF8 values and break up
>     lines longer than 72 bytes" in Attributes#writeMain I did not dare
>     to remove the comment completely because it still does not deal
>     correctly with version headers longer than 72 bytes and the set of
>     allowed values. I changed it accordingly. Two similar comments are
>     removed in the patch.
>   * I added two tests, WriteDeprecated and NullKeysAndValues, to
>     demonstrate compatibility as good as I could. Might however not be
>     desired to keep and having to maintain.
>   * LineBrokenMultiByteCharacter for jarsigner should not be removed
>     or not so immediately because someone might attempt to sign an
>     older jarfile created without that patch with a newer jarsigner
>     that already contains it.
>
>
>
> suggested changes or additions to the bug database: (i have no 
> permissions to edit it myself)
>
>   * Re-combine copies of isNotUtfContinuationByte (three by now).
>     Relates to 6184334. Worth to file another issue?
>   * Manifest versions have specific specifications, cannot break
>     across lines and can contain a subset of characters only. Bug
>     6910466 relates but is not exactly the same. If someone else is
>     convinced that writing a manifest should issue a warning or any
>     other way to deal with a version that does not conform to the
>     specification, I'd suggest to create a separate bug for that.
>
>
> Now, I would be glad if someone sponsored a review. This is only my 
> third attempt to submit a patch which is why I chose a lesser 
> important subject to fix in order to get familiar and now I understand 
> it's not the most attractive patch to review. Please don't hesitate to 
> suggest what I could do better or differently.
>
> As a bonus, with these changes, manifest files will always be 
> displayed correctly with just any utf capable viewer even if they 
> contain multi-byte utf characters that would have been broken across a 
> line break with the current/previous implementation and all manifests 
> will become also valid strings in Java.
>
> Regards,
> Philipp
>
>
>
> On 20.04.2018 00:58, Philipp Kunz wrote:
>> Hi,
>>
>> I tried to fix bug 6202130 about manifest utf support and come up now 
>> with a test as suggested in the bug's comments that shows that utf 
>> charset actually works before removing the comments from the code.
>>
>> When I wanted to remove the XXX comments about utf it occurred to me 
>> that version attributes ("Signature-Version" and "Manifest-Version") 
>> would never be broken across lines and should anyway not support the 
>> whole utf character set which sounds more like related to bugs 
>> 6910466 or 4935610 but it's not a real fit. Therefore, I could not 
>> remove one such comment of Attributes#writeMain but I changed it. The 
>> first comment in bug 6202130 mentions only two comments but there are 
>> three in Attributes. In the attached patch I removed only two of 
>> three and changed the remaining third to not mention utf anymore.
>>
>> At the moment, at least until 6443578 is fixed, multi-byte utf 
>> characters can be broken across lines. It might be worth a 
>> consideration to test that explicitly as well but then I guess there 
>> is not much of a point in testing the current behavior that will 
>> change with 6443578, hopefully soon. There are in my opinion enough 
>> characters broken across lines in the attached test that demonstrate 
>> that this still works like it did before.
>>
>> I would have preferred also to remove the calls to deprecated 
>> String(byte[], int, int, int) but then figured it relates more to bug 
>> 6443578 than 6202130 and now prefer to do that in another separate 
>> patch.
>>
>> Bug 6202130 also states that lines are broken by String.length not by 
>> byte length. While it looks so at first glance, I could not confirm. 
>> The combination of getBytes("UTF8"), String(byte[], int, int, int), 
>> and then DataOutputStream.writeBytes(String) in that combination does 
>> not drop high-bytes because every byte (whether a whole character or 
>> only a part of a multi-byte character) becomes a character in 
>> String(...) containing that byte in its low-byte which will be read 
>> again from writeBytes(...). Or put in a different way, every utf 
>> encoded byte becomes a character and multi-byte utf characters are 
>> converted into multiple string characters containing one byte each in 
>> their lower bytes. The current solution is not nice, but at least 
>> works. With that respect I'd like to suggest to deprecate 
>> DataOutputStream.writeBytes(String) because it does something not 
>> exactly expected when guessing from its name and that would suit a 
>> byte[] parameter better very much like it has been done with 
>> String(byte[], int, int, int). Any advice about the procedure to 
>> deprecate something?
>>
>> I was surprised that it was not trivial to list all valid utf 
>> characters. If someone has a better idea than isValidUtfCharacter in 
>> the attached test, let me know.
>>
>> Altogether, I would not consider 6202130 resolved completely, unless 
>> maybe all remaining points are copied to 6443578 and maybe another 
>> bug about valid values for "Signature-Version" and "Manifest-Version" 
>> if at all desired. But still I consider the attached patch an 
>> improvement and most of the remainder can then be solved in 6443578 
>> and so far I am looking forward to any kind of feedback.
>>
>> Regards,
>> Philipp
>>