JDK-6202130: java.util.jar.Attributes.writeMain() can't handle multi-byte chars
Philipp Kunz
philipp.kunz at paratix.ch
Thu Apr 19 22:58:05 UTC 2018
Hi,
I tried to fix bug 6202130 about manifest UTF-8 support and have now
come up with a test, as suggested in the bug's comments, which shows
that the UTF-8 charset actually works before the comments are removed
from the code.
When I wanted to remove the XXX comments about UTF-8, it occurred to me
that the version attributes ("Signature-Version" and "Manifest-Version")
would never be broken across lines and should not support the whole
UTF-8 character set anyway. That sounds related to bugs 6910466 or
4935610, but it is not a real fit. Therefore I could not remove one such
comment in Attributes#writeMain, but I changed it. The first comment in
bug 6202130 mentions only two comments, but there are three in
Attributes. In the attached patch I removed only two of the three and
changed the remaining third so that it no longer mentions UTF-8.
At the moment, at least until 6443578 is fixed, multi-byte UTF-8
characters can be broken across lines. It might be worth considering
testing that explicitly as well, but I guess there is not much point in
testing current behavior that will change with 6443578, hopefully soon.
In my opinion, enough characters are broken across lines in the attached
test to demonstrate that this still works as it did before.
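
A minimal sketch of that kind of check, not taken from the attached
test: it writes a main attribute whose UTF-8 form spans several
72-byte lines and reads it back. The class name, the attribute name
"Test-Value", and the sample characters are assumptions made up for
illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

public class ManifestRoundTrip {

    // Hypothetical attribute name, chosen for this example only.
    static final String NAME = "Test-Value";

    /** Builds a value whose UTF-8 form is far longer than one 72-byte line. */
    static String longValue() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 60; i++) {
            sb.append("\u00e4\u20ac"); // a 2-byte and a 3-byte UTF-8 sequence
        }
        return sb.toString();
    }

    /** Writes the value as a main attribute and reads it back. */
    static String roundTrip(String value) {
        try {
            Manifest written = new Manifest();
            Attributes main = written.getMainAttributes();
            // The Manifest.write documentation asks for the version
            // attribute to be set before writing.
            main.put(Attributes.Name.MANIFEST_VERSION, "1.0");
            main.putValue(NAME, value);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            written.write(out);

            Manifest read = new Manifest(
                    new ByteArrayInputStream(out.toByteArray()));
            return read.getMainAttributes().getValue(NAME);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String value = longValue();
        // Prints whether the value survived writing and reading.
        System.out.println(value.equals(roundTrip(value)));
    }
}
```

Because the reader joins continuation lines as bytes before decoding,
the value should come back unchanged whether or not the writer happens
to split a multi-byte character at a line break.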
I would also have preferred to remove the calls to the deprecated
String(byte[], int, int, int), but then figured that this relates more
to bug 6443578 than to 6202130, and I now prefer to do that in a
separate patch.
Bug 6202130 also states that lines are broken by String.length, not by
byte length. While it looks so at first glance, I could not confirm it.
The combination of getBytes("UTF8"), String(byte[], int, int, int), and
then DataOutputStream.writeBytes(String) does not drop high bytes,
because every byte (whether a whole character or only part of a
multi-byte character) becomes a character in String(...) containing
that byte in its low byte, which writeBytes(...) then reads back. Put
differently, every UTF-8 encoded byte becomes a character, and
multi-byte UTF-8 characters are converted into multiple string
characters, each containing one byte in its low byte. The length of
that intermediate string therefore equals the byte length. The current
solution is not nice, but at least it works. In that respect I'd like to
suggest deprecating DataOutputStream.writeBytes(String), because it does
not do exactly what one would guess from its name, and that name would
better suit a byte[] parameter, much like it has been done with
String(byte[], int, int, int). Any advice about the procedure to
deprecate something?
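
The round trip described above can be sketched as follows. This is an
illustration only, not code from Attributes or from the patch: the
class and method names are made up, and StandardCharsets.UTF_8 stands
in for getBytes("UTF8").

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class WriteBytesRoundTrip {

    /** Widens each UTF-8 byte to one char with a zero high byte. */
    @SuppressWarnings("deprecation")
    static String oneCharPerByte(String value) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        // Deprecated String(byte[] ascii, int hibyte, int offset, int count):
        // with hibyte 0, each resulting char is (byte & 0xff).
        return new String(utf8, 0, 0, utf8.length);
    }

    /** Writes the widened string with writeBytes and returns the output. */
    static byte[] viaWriteBytes(String value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DataOutputStream dos = new DataOutputStream(out)) {
            // writeBytes discards the high eight bits of every char,
            // which are all zero here, so nothing is lost.
            dos.writeBytes(oneCharPerByte(value));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // 2-byte, 3-byte and 4-byte UTF-8 sequences.
        String value = "\u00e4\u20ac\ud83d\ude00";
        byte[] expected = value.getBytes(StandardCharsets.UTF_8);

        // The intermediate string has one char per encoded byte.
        System.out.println(oneCharPerByte(value).length() == expected.length);
        // The bytes written are exactly the UTF-8 encoding.
        System.out.println(Arrays.equals(expected, viaWriteBytes(value)));
    }
}
```

Both checks should print true. Calling writeBytes directly on the
original string instead would keep only the low byte of each char and
so would corrupt anything outside Latin-1.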
I was surprised that it was not trivial to list all valid UTF-8
characters. If someone has a better idea than isValidUtfCharacter in the
attached test, let me know.
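
One possible approach, sketched here under the assumption that "valid"
means "can be encoded in UTF-8", is to take every code point except the
surrogates. The names below are hypothetical and this is not the
isValidUtfCharacter from the attached test; whether unassigned code
points should also be excluded (for example with Character.isDefined)
is left open.

```java
import java.nio.charset.StandardCharsets;

public class ScalarValues {

    /** True for every code point that is not a surrogate (U+D800..U+DFFF). */
    static boolean isScalarValue(int codePoint) {
        return Character.isValidCodePoint(codePoint)
                && (codePoint < Character.MIN_SURROGATE
                    || codePoint > Character.MAX_SURROGATE);
    }

    /** Cross-check: does the code point survive UTF-8 encoding and decoding? */
    static boolean survivesUtf8(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        // getBytes substitutes a replacement for a lone surrogate,
        // so such a string does not decode back to itself.
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return s.equals(new String(utf8, StandardCharsets.UTF_8));
    }

    /** Counts the code points accepted by isScalarValue. */
    static int countScalarValues() {
        int count = 0;
        for (int cp = Character.MIN_CODE_POINT;
                cp <= Character.MAX_CODE_POINT; cp++) {
            if (isScalarValue(cp)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // 0x110000 code points minus 0x800 surrogates.
        System.out.println(countScalarValues()); // prints 1112064
    }
}
```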
Altogether, I would not consider 6202130 completely resolved, unless
perhaps all remaining points are copied to 6443578 and, if desired at
all, to another bug about valid values for "Signature-Version" and
"Manifest-Version". Still, I consider the attached patch an improvement,
and most of the remainder can then be solved in 6443578. I am looking
forward to any kind of feedback.
Regards,
Philipp
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 6202130.patch
Type: text/x-patch
Size: 10586 bytes
Desc: not available
URL: <http://mail.openjdk.java.net/pipermail/core-libs-dev/attachments/20180420/74a22cff/6202130-0001.patch>