RFR: 8170769 Provide a simple hexdump facility for binary data

Tue Dec 11 02:11:33 UTC 2018

On 12/7/18 10:22 AM, Vincent Ryan wrote:
>> I'm not convinced that the overloads that send output to an OutputStream pull their weight. They basically wrap the OutputStream in a PrintStream, which conveniently doesn't declare IOException, making it easy to use from a lambda passed to forEachOrdered(). If an error writing the output occurs, this is recorded by the PrintStream wrapper; however, the wrapper is then thrown away, making it impossible for the caller to check its error status.
> The intent is to support a trivial convenience method call that generates the well-known hexdump format.
> Especially for users that are interested in the hexdump data rather than the low-level details of how to terminate a stream.
> The dumpAsStream methods are available to support cases that differ from that format.
> 
> Have you a suggestion to improve the dump() methods, or you’d like to see them omitted?
> 
>> The PrintStream wrapper also uses the platform default charset, and doesn't provide any way for the caller to override the charset.
> Is there a need for that? Originally the requirement was driven by the hexdump format which is ASCII-only.
> Recently the class has been enhanced to also support the printable characters from ISO 8859-1.
> A custom formatter be supplied to dumpAsStream() to cater for all other cases?

OK, let's step back from this a bit. I see this hexdump as a little subsystem 
that has the following facets:

1) a source of bytes
2) a converter to hex
3) a destination

The converter is HexDump.Formatter, which converts and formats a subrange of 
byte[] to a String. Since the user can supply the Formatter function, the result 
String can contain any unicode character. Thus, the destination needs to handle 
any unicode character. It can be a Writer, which accepts String data. Or if you 
want it to write bytes, it can be an OutputStream, which raises the issue of 
encoding (charset). I would recommend against relying on the platform default 
charset, as this has been a source of subtle bugs. The preferred approach these 
days is to default to UTF-8 and provide an overload that takes an explicit charset.

An alternative is PrintStream. (This overlaps somewhat with your recent exchange 
with Roger on this topic.) PrintStream also does charset encoding, and the 
charset it uses depends on how it's created. I think the same approach should be 
applied as I described above with OutputStream, namely avoid the platform 
default charset; default to UTF-8; and provide an overload that takes an 
explicit charset.

I'm not sure which of these is the right thing. You should decide which is the 
most convenient for the use cases you expect to see. However, the solution needs 
to handle charset encoding. (And it should also properly deal with I/O 
exceptions, per my previous message.)

**

ISO 8859-1 comes up in a different place. The toPrintableString() method (used 
by the default formatter) considers a byte "printable" if it encodes a valid ISO 
8859-1 character. The byte is properly decoded to a String, so this is ok. Note 
this is a distinct issue from the encoding of the OutputStream or PrintStream as 
described above.

(As an aside I think that the encoding of ISO 8859-1 matches the corresponding 
code units of UTF-16, so you don't have to do the new String(..., ISO_8859_1) 
jazz. You can just cast the byte to a char and append it to the StringBuilder.)

s'marks