RFR for 6443578 and 6202130: UTF-8 in Manifests
Philipp Kunz
philipp.kunz at paratix.ch
Thu May 3 01:12:22 UTC 2018
Hi,
Here is patch for 6443578 and 6202130 also in webrev form.
http://files.paratix.ch/jdk/6372077and6443578/webrev.01/
http://files.paratix.ch/jdk/6372077and6443578/webrev.01.zip
Hope it helps. With all the patience, can I do anything to make it
easier to get feedback or find a sponsor?
Regards,
Philipp
On 02.05.2018 07:21, Philipp Kunz wrote:
> Hi,
>
> Recently, I tried to fix only bug 6202130 with the intention to fix
> bug 6443578 later with the intention to get some opportunity for
> feedback, but haven't got any, and propose now a fix for both together
> which in my opinion makes more sense.
>
> See attached patch.
>
> Some considerations, assumptions, and explanations
>
> * In my opinion, the code for writing manifests was distributed in
> the two classes Attributes and Manifest in an elegant way but
> somewhat difficult to explain the coherence. I chose to group the
> code that writes manifests into a new class ManifestWriter. The
> main incentive for that was to prevent or reduce duplicated code I
> would have had to change twice otherwise. This also results in a
> source file of a suitable size.
> * I could not support the assumption that the write and writeMain
> methods in Attributes couldn't be referenced anywhere so I
> deprecated them rather than having them removed.
> * I assumed the patch will not make it into JDK 10 and, hence, the
> deprecated annotations are attributed with since = 11.
> * I could not figure out any reason for the use of DataOutputStream
> and did not use it.
> * Performance-wise I assume that the code is approximately
> comparable to the previous version. The biggest improvement in
> this respect I hope comes from removing the String that contains
> the byte array constructed with deprecated String(byte[], int,
> int, int) and then copying it over again to a StringBuffer and
> from there to a String again and then Characters. On the other
> hand, keeping whole characters together when breaking lines might
> make it slightly slower. I hope my changes are an overall
> improvement, but I haven't measured it.
> * For telling first from continuation bytes of utf-8 characters
> apart I re-used a method isNotUtfContinuationByte from either
> StringCoding or UTF_8.Decoder. Unfortunately I found no way not to
> duplicate it.
> * Where it said before "XXX Need to handle UTF8 values and break up
> lines longer than 72 bytes" in Attributes#writeMain I did not dare
> to remove the comment completely because it still does not deal
> correctly with version headers longer than 72 bytes and the set of
> allowed values. I changed it accordingly. Two similar comments are
> removed in the patch.
> * I added two tests, WriteDeprecated and NullKeysAndValues, to
> demonstrate compatibility as good as I could. Might however not be
> desired to keep and having to maintain.
> * LineBrokenMultiByteCharacter for jarsigner should not be removed
> or not so immediately because someone might attempt to sign an
> older jarfile created without that patch with a newer jarsigner
> that already contains it.
>
>
>
> suggested changes or additions to the bug database: (i have no
> permissions to edit it myself)
>
> * Re-combine copies of isNotUtfContinuationByte (three by now).
> Relates to 6184334. Worth to file another issue?
> * Manifest versions have specific specifications, cannot break
> across lines and can contain a subset of characters only. Bug
> 6910466 relates but is not exactly the same. If someone else is
> convinced that writing a manifest should issue a warning or any
> other way to deal with a version that does not conform to the
> specification, I'd suggest to create a separate bug for that.
>
>
> Now, I would be glad if someone sponsored a review. This is only my
> third attempt to submit a patch which is why I chose a lesser
> important subject to fix in order to get familiar and now I understand
> it's not the most attractive patch to review. Please don't hesitate to
> suggest what I could do better or differently.
>
> As a bonus, with these changes, manifest files will always be
> displayed correctly with just any utf capable viewer even if they
> contain multi-byte utf characters that would have been broken across a
> line break with the current/previous implementation and all manifests
> will become also valid strings in Java.
>
> Regards,
> Philipp
>
>
>
> On 20.04.2018 00:58, Philipp Kunz wrote:
>> Hi,
>>
>> I tried to fix bug 6202130 about manifest utf support and come up now
>> with a test as suggested in the bug's comments that shows that utf
>> charset actually works before removing the comments from the code.
>>
>> When I wanted to remove the XXX comments about utf it occurred to me
>> that version attributes ("Signature-Version" and "Manifest-Version")
>> would never be broken across lines and should anyway not support the
>> whole utf character set which sounds more like related to bugs
>> 6910466 or 4935610 but it's not a real fit. Therefore, I could not
>> remove one such comment of Attributes#writeMain but I changed it. The
>> first comment in bug 6202130 mentions only two comments but there are
>> three in Attributes. In the attached patch I removed only two of
>> three and changed the remaining third to not mention utf anymore.
>>
>> At the moment, at least until 6443578 is fixed, multi-byte utf
>> characters can be broken across lines. It might be worth a
>> consideration to test that explicitly as well but then I guess there
>> is not much of a point in testing the current behavior that will
>> change with 6443578, hopefully soon. There are in my opinion enough
>> characters broken across lines in the attached test that demonstrate
>> that this still works like it did before.
>>
>> I would have preferred also to remove the calls to deprecated
>> String(byte[], int, int, int) but then figured it relates more to bug
>> 6443578 than 6202130 and now prefer to do that in another separate
>> patch.
>>
>> Bug 6202130 also states that lines are broken by String.length not by
>> byte length. While it looks so at first glance, I could not confirm.
>> The combination of getBytes("UTF8"), String(byte[], int, int, int),
>> and then DataOutputStream.writeBytes(String) in that combination does
>> not drop high-bytes because every byte (whether a whole character or
>> only a part of a multi-byte character) becomes a character in
>> String(...) containing that byte in its low-byte which will be read
>> again from writeBytes(...). Or put in a different way, every utf
>> encoded byte becomes a character and multi-byte utf characters are
>> converted into multiple string characters containing one byte each in
>> their lower bytes. The current solution is not nice, but at least
>> works. With that respect I'd like to suggest to deprecate
>> DataOutputStream.writeBytes(String) because it does something not
>> exactly expected when guessing from its name and that would suit a
>> byte[] parameter better very much like it has been done with
>> String(byte[], int, int, int). Any advice about the procedure to
>> deprecate something?
>>
>> I was surprised that it was not trivial to list all valid utf
>> characters. If someone has a better idea than isValidUtfCharacter in
>> the attached test, let me know.
>>
>> Altogether, I would not consider 6202130 resolved completely, unless
>> maybe all remaining points are copied to 6443578 and maybe another
>> bug about valid values for "Signature-Version" and "Manifest-Version"
>> if at all desired. But still I consider the attached patch an
>> improvement and most of the remainder can then be solved in 6443578
>> and so far I am looking forward to any kind of feedback.
>>
>> Regards,
>> Philipp
>>
More information about the core-libs-dev
mailing list