RFR for 6202130: java.util.jar.Attributes.writeMain() can't handle multi-byte chars

Fri May 4 07:49:56 UTC 2018

Hi Sherman,

Thank you very much for your reply. Your question led me to reconsider 
if so many changes were actually necessary and they are really not. 
Thanks. I prepared a smaller patch for 6202130:

http://files.paratix.ch/jdk/6202130/webrev.02/
http://files.paratix.ch/jdk/6202130/webrev.02.zip

I fully agree that utf-8 + 72-bytes breaks have worked before 
(http://mail.openjdk.java.net/pipermail/core-libs-dev/2018-April/052688.html). 
However, a comment in the bug says: "But before we do that, we should 
have a test". Breaking the lines is already covered in 
LineBreakLineWidth. And for utf support I wrote ValueUtfEncoding and 
with that, provided the test is good, the comments can be removed.

Breaking lines and correct utf support in combination results in the 
additional requirement not to break inside of multi-byte characters onto 
continuation lines. I confused it in my previous patch with the subject 
here, bug 6202130, but now I think that can be addressed separately in 
bug 6443578.

The version header, which you also pointed out,
"            if ((version != null) && !(name.equalsIgnoreCase(vername))) {"
is subject of other bugs 4935610, 6910466, 8196371, or 4271239.

Version headers are defined separately in the specification and writing 
them is also implemented slightly differently than other headers. 
According to the specification, version headers can contain only digits 
and periods and cannot be continued on the next line. Lines must not 
exceed a length and therefore I would conclude that version values 
should have their own length limit. In the current implementation 
version headers are never broken on continuation lines but they can 
exceed the maximum allowed line width which is not correct. Could this 
be worth a comment in a bug?

With the previous patch, I also tried to remove the use of deprecated 
String(byte[], int, int, int). There also are duplicated portions of 
code nearby. Would it be, maybe if not just too crazy an idea, an option 
to create another bug for these points?

Regards,
Philipp

On 03.05.2018 08:56, Xueming Shen wrote:
> Philipp,
>
> I kinda recalled JDK-6202130 which I read long time ago and I believed 
> it's not a bug but
> just wanted to leave it open and for double check, then it's off the 
> priority list somehow...
>
> Reread the code below I think it handles utf8 and the 72-length-limit 
> appropriately, though
> a little tricky,
>
> (1) value.getByte("utf8") + deprecated String(byte[], int, int, int);
>      which creates a String that each "char" inside that String object 
> actually represents one byte
>      value of the resulted utf8 byte sequence, with its utf16 value = 
> [hi-byte= 0 and low-byte="utf8-byte"]
>
> (2) append this String to the StringBuffer (--> StringBuilder), now 
> the sb.length() actually is the
>      length of the utf8 byte sequence, make72Safe(...) is used to 
> adjust the length to 72 working
>      on "chars" inside the buffer.
>
> (3) write out the adjusted buffer via DataOutputStream.writeBytes()
>      in which the "write" cuts off the hi-byte of that utf16, so you 
> actually get the original
>      utf-8 byte sequence output to the stream.
>
> Sure the implementation looks "interesting" and uses deprecated String 
> constructor, but it was
> written 2 decades ago.
>
> Back then I think we were thinking maybe the fix is to simply remove 
> the "misleading" comment
> line below in the source code, if the implementation actually supports 
> utf-8 + 72-limit.
>
> "     * XXX Need to handle UTF8 values and break up lines longer than 
> 72 bytes"
>
> Any reason/test case to believe the mechanism does not work?
>
> BTW the logic of line below in writeMain() appears to be incorrect. I 
> do have another bug for it.
> "            if ((version != null) && 
> !(name.equalsIgnoreCase(vername))) {"
>
> ----------------------
>
>                 StringBuffer buffer = new StringBuffer(name);
>                 buffer.append(": ");
>
>                 String value = (String) e.getValue();
>                 if (value != null) {
>                     byte[] vb = value.getBytes("UTF8");
>                     value = new String(vb, 0, 0, vb.length);
>                 }
>                 buffer.append(value);
>
>                 Manifest.make72Safe(buffer);
>                 buffer.append("\r\n");
>                 out.writeBytes(buffer.toString());
>
> -------------------------
>
> sherman
>
>
>
> On 5/1/18, 10:21 PM, Philipp Kunz wrote:
>> Hi,
>>
>> Recently, I tried to fix only bug 6202130 with the intention to fix 
>> bug 6443578 later with the intention to get some opportunity for 
>> feedback, but haven't got any, and propose now a fix for both 
>> together which in my opinion makes more sense.
>>
>> See attached patch.
>>
>> Some considerations, assumptions, and explanations
>>
>>  * In my opinion, the code for writing manifests was distributed in the
>>    two classes Attributes and Manifest in an elegant way but somewhat
>>    difficult to explain the coherence. I chose to group the code that
>>    writes manifests into a new class ManifestWriter. The main incentive
>>    for that was to prevent or reduce duplicated code I would have had
>>    to change twice otherwise. This also results in a source file of a
>>    suitable size.
>>  * I could not support the assumption that the write and writeMain
>>    methods in Attributes couldn't be referenced anywhere so I
>>    deprecated them rather than having them removed.
>>  * I assumed the patch will not make it into JDK 10 and, hence, the
>>    deprecated annotations are attributed with since = 11.
>>  * I could not figure out any reason for the use of DataOutputStream
>>    and did not use it.
>>  * Performance-wise I assume that the code is approximately comparable
>>    to the previous version. The biggest improvement in this respect I
>>    hope comes from removing the String that contains the byte array
>>    constructed with deprecated String(byte[], int, int, int) and then
>>    copying it over again to a StringBuffer and from there to a String
>>    again and then Characters. On the other hand, keeping whole
>>    characters together when breaking lines might make it slightly
>>    slower. I hope my changes are an overall improvement, but I haven't
>>    measured it.
>>  * For telling first from continuation bytes of utf-8 characters apart
>>    I re-used a method isNotUtfContinuationByte from either StringCoding
>>    or UTF_8.Decoder. Unfortunately I found no way not to duplicate it.
>>  * Where it said before "XXX Need to handle UTF8 values and break up
>>    lines longer than 72 bytes" in Attributes#writeMain I did not dare
>>    to remove the comment completely because it still does not deal
>>    correctly with version headers longer than 72 bytes and the set of
>>    allowed values. I changed it accordingly. Two similar comments are
>>    removed in the patch.
>>  * I added two tests, WriteDeprecated and NullKeysAndValues, to
>>    demonstrate compatibility as good as I could. Might however not be
>>    desired to keep and having to maintain.
>>  * LineBrokenMultiByteCharacter for jarsigner should not be removed or
>>    not so immediately because someone might attempt to sign an older
>>    jarfile created without that patch with a newer jarsigner that
>>    already contains it.
>>
>>
>>
>> suggested changes or additions to the bug database: (i have no 
>> permissions to edit it myself)
>>
>>  * Re-combine copies of isNotUtfContinuationByte (three by now).
>>    Relates to 6184334. Worth to file another issue?
>>  * Manifest versions have specific specifications, cannot break across
>>    lines and can contain a subset of characters only. Bug 6910466
>>    relates but is not exactly the same. If someone else is convinced
>>    that writing a manifest should issue a warning or any other way to
>>    deal with a version that does not conform to the specification, I'd
>>    suggest to create a separate bug for that.
>>
>>
>> Now, I would be glad if someone sponsored a review. This is only my 
>> third attempt to submit a patch which is why I chose a lesser 
>> important subject to fix in order to get familiar and now I 
>> understand it's not the most attractive patch to review. Please don't 
>> hesitate to suggest what I could do better or differently.
>>
>> As a bonus, with these changes, manifest files will always be 
>> displayed correctly with just any utf capable viewer even if they 
>> contain multi-byte utf characters that would have been broken across 
>> a line break with the current/previous implementation and all 
>> manifests will become also valid strings in Java.
>>
>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 6202130.02.patch
Type: text/x-patch
Size: 10458 bytes
Desc: not available
URL: <http://mail.openjdk.java.net/pipermail/core-libs-dev/attachments/20180504/ae868873/6202130.02-0001.patch>