RFR for 6443578 and 6202130: UTF-8 in Manifests

Fri May 4 19:51:57 UTC 2018

Hi Phillip,

Just a reminder that OpenJDK can *only* accept patches via 
cr.openjdk.java.net
(as an author) or inline or attached in email.

Thanks, Roger

On 5/2/18 9:12 PM, Philipp Kunz wrote:
> Hi,
>
> Here is patch for 6443578 and 6202130 also in webrev form.
>
> http://files.paratix.ch/jdk/6372077and6443578/webrev.01/
> http://files.paratix.ch/jdk/6372077and6443578/webrev.01.zip
>
> Hope it helps. With all the patience, can I do anything to make it 
> easier to get feedback or find a sponsor?
>
> Regards,
> Philipp
>
>
> On 02.05.2018 07:21, Philipp Kunz wrote:
>> Hi,
>>
>> Recently, I tried to fix only bug 6202130 with the intention to fix 
>> bug 6443578 later with the intention to get some opportunity for 
>> feedback, but haven't got any, and propose now a fix for both 
>> together which in my opinion makes more sense.
>>
>> See attached patch.
>>
>> Some considerations, assumptions, and explanations
>>
>>   * In my opinion, the code for writing manifests was distributed in
>>     the two classes Attributes and Manifest in an elegant way but
>>     somewhat difficult to explain the coherence. I chose to group the
>>     code that writes manifests into a new class ManifestWriter. The
>>     main incentive for that was to prevent or reduce duplicated code
>>     I would have had to change twice otherwise. This also results in
>>     a source file of a suitable size.
>>   * I could not support the assumption that the write and writeMain
>>     methods in Attributes couldn't be referenced anywhere so I
>>     deprecated them rather than having them removed.
>>   * I assumed the patch will not make it into JDK 10 and, hence, the
>>     deprecated annotations are attributed with since = 11.
>>   * I could not figure out any reason for the use of DataOutputStream
>>     and did not use it.
>>   * Performance-wise I assume that the code is approximately
>>     comparable to the previous version. The biggest improvement in
>>     this respect I hope comes from removing the String that contains
>>     the byte array constructed with deprecated String(byte[], int,
>>     int, int) and then copying it over again to a StringBuffer and
>>     from there to a String again and then Characters. On the other
>>     hand, keeping whole characters together when breaking lines might
>>     make it slightly slower. I hope my changes are an overall
>>     improvement, but I haven't measured it.
>>   * For telling first from continuation bytes of utf-8 characters
>>     apart I re-used a method isNotUtfContinuationByte from either
>>     StringCoding or UTF_8.Decoder. Unfortunately I found no way not
>>     to duplicate it.
>>   * Where it said before "XXX Need to handle UTF8 values and break up
>>     lines longer than 72 bytes" in Attributes#writeMain I did not
>>     dare to remove the comment completely because it still does not
>>     deal correctly with version headers longer than 72 bytes and the
>>     set of allowed values. I changed it accordingly. Two similar
>>     comments are removed in the patch.
>>   * I added two tests, WriteDeprecated and NullKeysAndValues, to
>>     demonstrate compatibility as good as I could. Might however not
>>     be desired to keep and having to maintain.
>>   * LineBrokenMultiByteCharacter for jarsigner should not be removed
>>     or not so immediately because someone might attempt to sign an
>>     older jarfile created without that patch with a newer jarsigner
>>     that already contains it.
>>
>>
>>
>> suggested changes or additions to the bug database: (i have no 
>> permissions to edit it myself)
>>
>>   * Re-combine copies of isNotUtfContinuationByte (three by now).
>>     Relates to 6184334. Worth to file another issue?
>>   * Manifest versions have specific specifications, cannot break
>>     across lines and can contain a subset of characters only. Bug
>>     6910466 relates but is not exactly the same. If someone else is
>>     convinced that writing a manifest should issue a warning or any
>>     other way to deal with a version that does not conform to the
>>     specification, I'd suggest to create a separate bug for that.
>>
>>
>> Now, I would be glad if someone sponsored a review. This is only my 
>> third attempt to submit a patch which is why I chose a lesser 
>> important subject to fix in order to get familiar and now I 
>> understand it's not the most attractive patch to review. Please don't 
>> hesitate to suggest what I could do better or differently.
>>
>> As a bonus, with these changes, manifest files will always be 
>> displayed correctly with just any utf capable viewer even if they 
>> contain multi-byte utf characters that would have been broken across 
>> a line break with the current/previous implementation and all 
>> manifests will become also valid strings in Java.
>>
>> Regards,
>> Philipp
>>
>>
>>
>> On 20.04.2018 00:58, Philipp Kunz wrote:
>>> Hi,
>>>
>>> I tried to fix bug 6202130 about manifest utf support and come up 
>>> now with a test as suggested in the bug's comments that shows that 
>>> utf charset actually works before removing the comments from the code.
>>>
>>> When I wanted to remove the XXX comments about utf it occurred to me 
>>> that version attributes ("Signature-Version" and "Manifest-Version") 
>>> would never be broken across lines and should anyway not support the 
>>> whole utf character set which sounds more like related to bugs 
>>> 6910466 or 4935610 but it's not a real fit. Therefore, I could not 
>>> remove one such comment of Attributes#writeMain but I changed it. 
>>> The first comment in bug 6202130 mentions only two comments but 
>>> there are three in Attributes. In the attached patch I removed only 
>>> two of three and changed the remaining third to not mention utf 
>>> anymore.
>>>
>>> At the moment, at least until 6443578 is fixed, multi-byte utf 
>>> characters can be broken across lines. It might be worth a 
>>> consideration to test that explicitly as well but then I guess there 
>>> is not much of a point in testing the current behavior that will 
>>> change with 6443578, hopefully soon. There are in my opinion enough 
>>> characters broken across lines in the attached test that demonstrate 
>>> that this still works like it did before.
>>>
>>> I would have preferred also to remove the calls to deprecated 
>>> String(byte[], int, int, int) but then figured it relates more to 
>>> bug 6443578 than 6202130 and now prefer to do that in another 
>>> separate patch.
>>>
>>> Bug 6202130 also states that lines are broken by String.length not 
>>> by byte length. While it looks so at first glance, I could not 
>>> confirm. The combination of getBytes("UTF8"), String(byte[], int, 
>>> int, int), and then DataOutputStream.writeBytes(String) in that 
>>> combination does not drop high-bytes because every byte (whether a 
>>> whole character or only a part of a multi-byte character) becomes a 
>>> character in String(...) containing that byte in its low-byte which 
>>> will be read again from writeBytes(...). Or put in a different way, 
>>> every utf encoded byte becomes a character and multi-byte utf 
>>> characters are converted into multiple string characters containing 
>>> one byte each in their lower bytes. The current solution is not 
>>> nice, but at least works. With that respect I'd like to suggest to 
>>> deprecate DataOutputStream.writeBytes(String) because it does 
>>> something not exactly expected when guessing from its name and that 
>>> would suit a byte[] parameter better very much like it has been done 
>>> with String(byte[], int, int, int). Any advice about the procedure 
>>> to deprecate something?
>>>
>>> I was surprised that it was not trivial to list all valid utf 
>>> characters. If someone has a better idea than isValidUtfCharacter in 
>>> the attached test, let me know.
>>>
>>> Altogether, I would not consider 6202130 resolved completely, unless 
>>> maybe all remaining points are copied to 6443578 and maybe another 
>>> bug about valid values for "Signature-Version" and 
>>> "Manifest-Version" if at all desired. But still I consider the 
>>> attached patch an improvement and most of the remainder can then be 
>>> solved in 6443578 and so far I am looking forward to any kind of 
>>> feedback.
>>>
>>> Regards,
>>> Philipp
>>>
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/security-dev/attachments/20180504/051196ac/attachment.htm>