JDK-6202130: java.util.jar.Attributes.writeMain() can't handle multi-byte chars

Thu Apr 19 22:58:05 UTC 2018

Hi,

I tried to fix bug 6202130 about UTF-8 support in manifests and have now 
come up with a test, as suggested in the bug's comments, which shows that 
UTF-8 actually works, before removing the comments from the code.

When I wanted to remove the XXX comments about UTF-8, it occurred to me 
that the version attributes ("Signature-Version" and "Manifest-Version") 
would never be broken across lines and should not support the whole 
UTF-8 character set anyway. That sounds more closely related to bugs 
6910466 or 4935610, though neither is a real fit. Therefore, I could not 
remove one such comment in Attributes#writeMain, but I changed it. The 
first comment in bug 6202130 mentions only two such comments, but there 
are three in Attributes. In the attached patch I removed two of the 
three and changed the remaining one so that it no longer mentions UTF-8.

At the moment, at least until 6443578 is fixed, multi-byte UTF-8 
characters can be broken across lines. It might be worth testing that 
explicitly as well, but I guess there is not much point in testing 
current behavior that will change with 6443578, hopefully soon. In my 
opinion, enough characters are broken across lines in the attached test 
to demonstrate that this still works as it did before.
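
As a rough illustration of the kind of check meant here (not the attached 
test itself), the following sketch writes a manifest whose value is long 
enough to be wrapped and reads it back. The class name ManifestRoundTrip, 
the helper roundTrip and the attribute name "Key" are made up for this 
example. Whether a line break actually falls inside a multi-byte 
character depends on the JDK version in use.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

// Hypothetical helper, not part of the attached patch.
class ManifestRoundTrip {

    /** Writes a main attribute to a manifest and reads its value back. */
    static String roundTrip(String key, String value) {
        try {
            Manifest written = new Manifest();
            Attributes main = written.getMainAttributes();
            // Without Manifest-Version the main section is not written.
            main.put(Attributes.Name.MANIFEST_VERSION, "1.0");
            main.putValue(key, value);

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            written.write(out);

            Manifest read = new Manifest(
                    new ByteArrayInputStream(out.toByteArray()));
            return read.getMainAttributes().getValue(key);
        } catch (IOException e) {
            // Not expected for in-memory streams.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // 100 two-byte characters: 200 bytes, so the value must be
        // continued over several manifest lines.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100; i++) {
            sb.append('\u00e4');
        }
        String value = sb.toString();
        System.out.println(value.equals(roundTrip("Key", value)));
    }
}
```

Reading works regardless of where the breaks fall because continuation 
lines are joined as bytes before the value is decoded.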

I would also have preferred to remove the calls to the deprecated 
String(byte[], int, int, int) constructor, but then figured that this 
relates more to bug 6443578 than to 6202130, and I now prefer to do that 
in a separate patch.

Bug 6202130 also states that lines are broken by String.length, not by 
byte length. While it looks that way at first glance, I could not confirm 
it. The combination of getBytes("UTF8"), String(byte[], int, int, int), 
and DataOutputStream.writeBytes(String) does not drop high bytes: every 
byte (whether a whole character or only part of a multi-byte character) 
becomes one character in String(...) that holds the byte in its low 
byte, and writeBytes(...) reads exactly that low byte back. Put 
differently, every UTF-8 encoded byte becomes a character, and a 
multi-byte UTF-8 character is converted into multiple string characters, 
each holding one byte in its low byte. String.length of that string 
therefore equals the byte length. The current solution is not nice, but 
at least it works. In that respect, I'd like to suggest deprecating 
DataOutputStream.writeBytes(String), because it does not do exactly what 
its name suggests and a byte[] parameter would suit it better, much as 
was done with String(byte[], int, int, int). Any advice about the 
procedure for deprecating something?
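
To make the byte-to-character widening described above concrete, here is 
a small standalone sketch. It is not the code from Attributes; the class 
name WriteBytesDemo and the helper widenAndWrite are invented for this 
example, and it only strings together the three calls mentioned above.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the attached patch.
class WriteBytesDemo {

    /**
     * Encodes the value as UTF-8, widens every byte to one char with
     * high byte 0, and writes the result with writeBytes(String).
     */
    @SuppressWarnings("deprecation")
    static byte[] widenAndWrite(String value) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        // Deprecated String(byte[] ascii, int hibyte, int offset, int count):
        // each byte becomes the low byte of one char.
        String widened = new String(utf8, 0, 0, utf8.length);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DataOutputStream dos = new DataOutputStream(out)) {
            // writeBytes discards the high byte of each char, which is 0 here.
            dos.writeBytes(widened);
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String value = "\u00e4\u20ac"; // 2-byte and 3-byte UTF-8 characters
        byte[] expected = value.getBytes(StandardCharsets.UTF_8);
        byte[] actual = widenAndWrite(value);
        System.out.println(expected.length);                 // 5 bytes
        System.out.println(Arrays.equals(expected, actual)); // true
    }
}
```

The widened string has one char per UTF-8 byte, which is why breaking it 
by String.length amounts to breaking by byte length.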

I was surprised that it was not trivial to list all valid UTF-8 
characters. If someone has a better idea than isValidUtfCharacter in the 
attached test, please let me know.
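
One possible alternative, offered only as a sketch and not as what 
isValidUtfCharacter in the attached test does: treat every Unicode code 
point except the surrogate range as encodable, and cross-check that with 
an encode/decode round trip. The names Utf8CodePoints, isEncodable and 
survivesRoundTrip are made up here. Manifest values may impose further 
restrictions (for example on NUL, CR and LF) that this sketch ignores.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not taken from the attached test.
class Utf8CodePoints {

    /** True for every code point that is not a surrogate. */
    static boolean isEncodable(int codePoint) {
        return Character.isValidCodePoint(codePoint)
                && (codePoint < Character.MIN_SURROGATE
                    || codePoint > Character.MAX_SURROGATE);
    }

    /** Cross-check: does the code point survive UTF-8 encode and decode? */
    static boolean survivesRoundTrip(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return s.equals(new String(utf8, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        int count = 0;
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            if (isEncodable(cp) != survivesRoundTrip(cp)) {
                throw new AssertionError("mismatch at U+"
                        + Integer.toHexString(cp));
            }
            if (isEncodable(cp)) {
                count++;
            }
        }
        System.out.println(count); // 1112064 code points
    }
}
```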

Altogether, I would not consider 6202130 completely resolved, unless 
perhaps all remaining points are copied to 6443578 and, if desired at 
all, to another bug about valid values for "Signature-Version" and 
"Manifest-Version". Still, I consider the attached patch an improvement, 
and most of the remainder can then be solved in 6443578. I am looking 
forward to any kind of feedback.

Regards,
Philipp

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 6202130.patch
Type: text/x-patch
Size: 10586 bytes
Desc: not available
URL: <http://mail.openjdk.java.net/pipermail/core-libs-dev/attachments/20180420/74a22cff/6202130-0001.patch>