RFR: 8170769 Provide a simple hexdump facility for binary data

Thu Dec 20 21:40:48 UTC 2018

Hi Vincent,

This looks good.

The doc for HEXDUMP_FORMATTER should include an example of the formatting.
Its easier to see it than to reconstruct it from the example code. like:
     00000000  61 62 63 64 65 66 67 68  69 6a 6b 6c 6d 6e 6f 70 
|abcdefghijklmnop|
     00000010  71 72 73 74 75 76 77 78  79 7a                    
|qrstuvwxyz|

I'm not sure the reference to hexdump(1) is useful, it could be removed
without loss of precision or usefulness.

Thanks, Roger

On 12/12/2018 01:21 PM, Vincent Ryan wrote:
> FYI I’ve updated the webrev/javadoc with latest edits including the 2 new ByteBuffer methods:
>
> http://cr.openjdk.java.net/~vinnie/8170769/webrev.09/ <http://cr.openjdk.java.net/~vinnie/8170769/webrev.09/>
> http://cr.openjdk.java.net/~vinnie/8170769/javadoc.09/api/java.base/java/util/HexFormat.html <http://cr.openjdk.java.net/~vinnie/8170769/javadoc.09/api/java.base/java/util/HexFormat.html>
> (NOTE: internal link to HexFormat.Formatter class is currently broken)
>
>
> If that is an acceptable compromise then I believe all outstanding issues have now been addressed.
> Thanks to all those who provided review comments.
>
>
>> On 12 Dec 2018, at 12:32, Vincent Ryan <vincent.x.ryan at oracle.com> wrote:
>>
>> Thanks for your proposal.
>> Comments below.
>>
>>> On 12 Dec 2018, at 02:35, Stuart Marks <stuart.marks at oracle.com> wrote:
>>>
>>>
>>>
>>> On 12/11/18 1:21 PM, Vincent Ryan wrote:
>>>> My preference is for PrintStream rather than Writer, for the same reason as Roger: it’s more convenient
>>>> for handling System.out. Does that address your concern?
>>> PrintStream is fine with me.
>>>
>>>> I cannot simply cast 8859-1 characters into UTF-8 because UTF-8 is multi-byte charset so some 0x8X characters
>>>> Will trigger the multi-byte sequence and will end up being misinterpreted. Hence my rather awkward conversion to a String.
>>>> Is there a better way?
>>> In toPrintableString(),
>>>
>>> 259         StringBuilder printable = new StringBuilder(toIndex - fromIndex);
>>> 260         for (int i = fromIndex; i < toIndex; i++) {
>>> 261             if (bytes[i] > 0x1F && bytes[i] < 0x7F) {
>>> 262                 printable.append((char) bytes[i]);
>>> 263             } else if (bytes[i] > (byte)0x9F && bytes[i] <= (byte)0xFF) {
>>> 264                 printable.append(new String(new byte[]{bytes[i]}, ISO_8859_1));
>>> 265
>>> 266             } else {
>>> 267                 printable.append('.');
>>> 268             }
>>> 269         }
>>>
>>> It works to cast ASCII bytes char, because the 7-bit ASCII range overlaps the low 7 bits of the UTF-16 char range. The bytes values of ISO 8859-1 overlap the low 8 bits of UTF-16, so casts work for them too.
>>>
>>> For any other charset, you'd need to do codeset conversion. But you're cleverly supporting only ISO 8859-1, so you don't have to do any conversion. :-)
>>>
>>>> I’m not sure I’ve addressed your concern regarding IOExceptions - can you elaborate?
>>> Taking out the OutputStream overloads addressed my concerns. In at least one case the code would wrap the OutputStream into a PrintStream, print stuff to it, and then throw away the PrintStream. If an output error occurred, any error state in the PrintStream would also be thrown away. The creation of the PrintStream wrapper would also use the system's default charset instead of letting the caller control it.
>>>
>>> The dump() overloads now all take PrintStream, so it's the caller's responsibility to ensure that the PrintStream is using the right charset and to check for errors after. So this is all OK now.
>>>
>>> Note that the internal getPrintStream(), to wrap an OutputStream in a PrintStream, is now obsolete and can be removed.
>>>
>>> (Oh, I see Roger has said much the same things. Oh well, the peril of parallel reviews.)
>>>
>>> **
>>>
>>>> BTW updated webrev/javadoc available:
>>>> http://cr.openjdk.java.net/~vinnie/8170769/webrev.08/
>>>> http://cr.openjdk.java.net/~vinnie/8170769/javadoc.08/api/java.base/java/util/HexFormat.html
>>> Now we have a somewhat unsatisfying asymmetry in the APIs.
>>>
>>> There are four kinds of inputs:
>>>
>>> 1. byte[]
>>> 2. byte[] subrange
>>> 3. InputStream
>>> 4. ByteBuffer
>>>
>>> and two kinds of outputs:
>>>
>>> 1. PrintStream
>>> 2. Stream<String>
>>>
>>> and two variations of formatters:
>>>
>>> 1. default formatter
>>> 2. custom formatter + chunk size
>>>
>>> This is a total of 16 combinations. But there are only eight methods: three PrintStream methods with choice of input, two stream-output methods using the default formatter, and three stream-output methods using custom chunk+formatter.
>>>
>>> You don't have to provide ALL combinations, but what's here is an odd subset with some apparently arbitrary choices. For example, if I have a ByteBuffer and I want to dump it to System.out using default formatting, I have to go the Stream.forEachOrdered route AND provide the default chunk size and formatter.
>>>
>>>    HexFormat.dumpAsStream(buf, DEFAULT_CHUNK_SIZE, HEXDUMP_FORMATTER)
>>>             .forEachOrdered(System.out::println);
>>>
>>> These aren't huge deals, but they're easily stumbled over.
>>>
>>> One approach to organizing this is to have a HexFormat instance that contains the setup information and then to have instance methods that either update the setup or perform conversion and output. I'd use static factory methods instead of constructors. For example, you could do this:
>>>
>>> static factories methods:
>>> - from(byte[])
>>> - from(byte[], fromIndex, toIndex)
>>> - from(InputStream)
>>> - from(ByteBuffer)
>>>
>>> formatter setup instance methods:
>>> - format(chunksize, formatter)
>>>
>>> output:
>>> - void dump(PrintStream)
>>> - Stream<String> stream()
>>>
>>> Using this approach my example from above could be performed as follows:
>>>
>>>   HexFormat.from(buf).dump(System.out);
>>>
>>> Note, I'm not saying that you HAVE to do it this way. (In particular, the naming could use work.) This is quite possibly overkill. But it's something you might consider, as it gets you all 16 combinations using seven methods, compared to the eight static methods in the current proposal that cover only half of the combinations.
>>>
>>> Alternatively, pare down the set of static methods to a bare minimum. Provide one that can do everything, and then provide one or two more that are essentially the same as the first but with some hardwired defaults. For example, to help minimize things, you can wrap a ByteBuffer around a byte array subrange, or get an InputStream from a byte array subrange. But you can't get an InputStream from a ByteBuffer or vice-versa, without a lot of work.
>>>
>>> (I haven't looked at the to* or from* methods.)
>>>
>>> s’marks
>> Your chaining approach has merit and the method chaining is attractive but it would be a significant change to the API at this advanced stage.
>>
>> I agree that there are gaps in the combinations and perhaps that could be addressed by introducing a few convenience methods.
>> I think ByteBuffer is under-represented and propose adding the following two methods to handle the simplified, default cases:
>>
>>     public static Stream<String> dumpAsStream(ByteBuffer buffer)
>>
>>     public static void dump(ByteBuffer buffer, PrintStream out)
>>
>>
>> That would be 10 combos for 10 methods versus 16 combos for 7 methods which is still not full coverage but perhaps more methods could
>> be added in the future if there is demand for them.
>>
>>
>>
>>