RFR 8251989: Hex encoder and decoder utility

Thu Aug 27 00:27:53 UTC 2020

Hi Mark,

The updates to the API based on the comments received are in the works.

On 8/25/20 8:01 PM, mark.reinhold at oracle.com wrote:
> 2020/8/20 13:59:39 -0700, roger.riggs at oracle.com:
>> On 8/20/20 3:10 PM, mark.reinhold at oracle.com wrote:
>>> ...
>>>
>>> A few comments:
>>>
>>>    - Why do the short-form `encoder` factory methods return encoders that
>>>      produce upper-case hex strings?  `Integer::toHexString` and other
>>>      existing `toHex` methods return lower-case hex strings.  That’s also
>>>      what you get from common Unix CLI tools (e.g., `od -tx1`).
>>>
>>>      Please consider changing these methods to return lower-case encoders.
>> It's (almost) a toss up and easy to change; many of the existing uses
>> produce uppercase.
> Please change the default to lower case.  It’s what people are going to
> expect.
ok
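
For reference, a minimal sketch of the two conventions side by side. The
class and helper names here (HexCaseDemo, toLowerHex, toUpperHex) are
hypothetical and are not part of the proposed API; only
Integer.toHexString and Character.forDigit are existing JDK methods.

```java
import java.util.Locale;

class HexCaseDemo {
    // Hypothetical helper: two lower-case hex digits per byte.
    static String toLowerHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            // Character.forDigit yields '0'-'9' and lower-case 'a'-'f'.
            sb.append(Character.forDigit((b >> 4) & 0xF, 16));
            sb.append(Character.forDigit(b & 0xF, 16));
        }
        return sb.toString();
    }

    // Hypothetical helper: the same digits, upper-cased.
    static String toUpperHex(byte[] bytes) {
        return toLowerHex(bytes).toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0xCA, (byte) 0xFE, 0x0A};
        System.out.println(Integer.toHexString(0xCAFE)); // existing JDK method, lower case
        System.out.println(toLowerHex(data));
        System.out.println(toUpperHex(data));
    }
}
```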
>
>> Perhaps the short-form no-arg factory should be replaced with a
>> short-form constructor that takes true/false, so the case is explicit
>> at the use site, or the case could be put in the name:
>> encodeToUpper(), encodeToLower().
>> (A boolean parameter is not very informative; an enum would be better
>> but perhaps a bit heavyweight.)
> Either a boolean or an enum parameter would be massive overkill.

See the use cases (in the JDK and tests) below.
>
>>>    - Is it worth having static `Hex.encode(byte[])` and
>>>      `Hex.decode(CharSequence)` convenience methods for the simplest
>>>      cases?
>> There was some discussion of that but the idea was to minimize the
>> surface area.
> Sorry, where was that discussion?  I can’t find it in the core-libs-dev
> archive.
Several colleagues suggested starting with a minimal API and adding
convenience APIs as the need becomes apparent.
>
>>>    - [Warning: Bikeshed] The verbs “encode” and “decode” seem unfortunate.
>>>
>>>      Over in `java.nio.charset` we have encoders that transform
>>>      characters to bytes, and decoders that transform bytes to characters.
>>>      The coded thing is the bytes; the uncoded thing is the characters.
>>>
>>>      In `java.util` we already have the `Base64` class, which provides
>>>      encoders that transform bytes to characters, and decoders that
>>>      transform characters to bytes.  The coded thing is the characters;
>>>      the uncoded thing is the bytes.
>>>
>>>      The use of “encode” and “decode” in `Base64` was likely inspired by
>>>      the fact that the format has been known as “base 64 encoding” for
>>>      decades, having originated as a hack for transporting non-ASCII data
>>>      via SMTP.  Developers looking to do base-64 operations will, thus,
>>>      expect this terminology.
>> The bias of encoding vs. decoding terminology is subtle, based on the
>> 'native' form of the data. For the Charset classes, the native form of
>> the data is characters, and the encoded form is bytes. For Base64, the
>> native form of the data is bytes, and the encoded form is Base64 lines.
> Indeed -- and my point is that, in this case, encoding and decoding
> might not be the best concepts for API discoverability.  (But, maybe
> they are.)
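
To make the two directions concrete, here is a small sketch using only
existing JDK classes (Charset and Base64). The wrapper class and method
names are hypothetical and exist only for this illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class EncodeDirectionDemo {
    // Charset: "encode" goes from characters to bytes.
    static byte[] charsetEncode(String text) {
        ByteBuffer buf = StandardCharsets.UTF_8.encode(text);
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    // Charset: "decode" goes from bytes back to characters.
    static String charsetDecode(byte[] bytes) {
        return StandardCharsets.UTF_8.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // Base64: "encode" goes from bytes to characters.
    static String base64Encode(byte[] bytes) {
        return Base64.getEncoder().encodeToString(bytes);
    }

    // Base64: "decode" goes from characters back to bytes.
    static byte[] base64Decode(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        byte[] raw = charsetEncode("hi");        // chars -> bytes
        String armored = base64Encode(raw);      // bytes -> chars
        System.out.println(armored);
        System.out.println(charsetDecode(base64Decode(armored)));
    }
}
```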
>
> It’s technically true that this API does encoding and decoding,
> whichever way you look at it.  What I wonder about are the typical use
> cases.  Will the primary uses of this API be to encode bytes into
> characters so that they can be transported via a medium that can only
> handle characters, and then decode the characters back into bytes on the
> other end?  Or will the primary uses be to format bytes into a readable
> form, and likewise parse arbitrary-length hex strings into bytes?

Within the JDK, there are several different cases.
The network classes encode non-printable characters to meet protocol
requirements.

Most of the other uses present binary information in exceptions and
messages, so the conversion is one-way, from bytes to Strings.  For
example, reporting hashes on Modules that don't match in the Resolver.

In the security classes, most of the use is for debugging support, to
report unexpected differences.  The parsing of Strings to binary is mostly
used to encode binary objects into source code, for example certificates
for testing.
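
As a rough sketch of those two patterns: bytes rendered one-way into a
message, and a hex String constant in source parsed into bytes. All names
here (HexUseCasesDemo, toHex, fromHex, mismatchMessage) and the message
wording are hypothetical; this is not the code in the JDK.

```java
class HexUseCasesDemo {
    // One-way: bytes to a lower-case hex String for a message.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(Character.forDigit((b >> 4) & 0xF, 16));
            sb.append(Character.forDigit(b & 0xF, 16));
        }
        return sb.toString();
    }

    // Parse a hex String (either case) into bytes, e.g. test data kept in source.
    static byte[] fromHex(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("odd number of hex digits");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("not a hex digit near index " + (2 * i));
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    // Hypothetical diagnostic text, in the style of a hash-mismatch report.
    static String mismatchMessage(byte[] expected, byte[] actual) {
        return "hash mismatch: expected " + toHex(expected) + ", found " + toHex(actual);
    }

    public static void main(String[] args) {
        byte[] expected = fromHex("00FF10");
        byte[] actual = fromHex("00ff11");
        System.out.println(mismatchMessage(expected, actual));
    }
}
```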

Uppercase is used more frequently than lowercase: in the security tests and
debugging support, 52 of the 81 uses of hex formatters produce uppercase
characters.

>
>>>      Here you’re proposing that the `Hex` class follow the `Base64` class.
>>>      Consistency with existing nearby APIs is a worthy goal.  If I were
>>>      just looking to convert a byte array into a readable hex string,
>>>      however, I’d probably want to “format” it rather than “encode” it,
>>>      something like `String.format("%x")` on steroids.  Likewise, if I
>>>      were looking to convert a hex string into bytes then I’d want to
>>>      “parse” it rather than “decode” it, i.e., `Integer::parseInt` on
>>>      steroids.
>>>
>>>      If you were to rename the nested classes to `Hex.Formatter` and
>>>      `Hex.Parser`, and rename all methods accordingly, then this API would
>>>      be inconsistent with the nearby `Base64` but likely more consistent
>>>      with developer expectations.
>>>
>>>      (`Hex` is already inconsistent with `Base64` in that it doesn’t
>>>       prefix the names of its factory methods with `get`, which is a good
>>>       thing.)
>> The question is at what level of control "encoding" becomes formatting.
>> There are very few formatting features in the API: no control of leading
>> zeros, no control over incidental whitespace, and no control over width.
>> Similarly with parsing: what flexibility does parsing imply that decoding
>> does not? Is whitespace ignored, are line endings/joins handled, etc.?
> Optional support for prefix, suffix, and delimiter characters in both
> operations, and support for both lower- and upper-case characters on
> input, sure look like formatting and parsing to me.

ok
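
A minimal sketch of what prefix, suffix, and delimiter support with
case-insensitive input could look like. The class name, method names, and
parameter order are assumptions made for illustration only and do not
reflect the API under review. The sketch also assumes a non-empty
delimiter.

```java
import java.util.regex.Pattern;

class DelimitedHexDemo {
    // Format each byte as prefix + two hex digits + suffix, joined by delimiter.
    static String format(byte[] bytes, String prefix, String suffix,
                         String delimiter, boolean upper) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append(delimiter);
            }
            // %02x / %02X on a Byte prints its unsigned two-digit value.
            String digits = String.format(upper ? "%02X" : "%02x", bytes[i]);
            sb.append(prefix).append(digits).append(suffix);
        }
        return sb.toString();
    }

    // Parse the same layout back to bytes; hex digits may be either case.
    static byte[] parse(String text, String prefix, String suffix, String delimiter) {
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("this sketch needs a non-empty delimiter");
        }
        if (text.isEmpty()) {
            return new byte[0];
        }
        String[] parts = text.split(Pattern.quote(delimiter), -1);
        byte[] out = new byte[parts.length];
        int expectedLength = prefix.length() + 2 + suffix.length();
        for (int i = 0; i < parts.length; i++) {
            String p = parts[i];
            if (p.length() != expectedLength || !p.startsWith(prefix) || !p.endsWith(suffix)) {
                throw new IllegalArgumentException("malformed element: " + p);
            }
            int hi = Character.digit(p.charAt(prefix.length()), 16);
            int lo = Character.digit(p.charAt(prefix.length() + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("not hex digits: " + p);
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = {0x0A, (byte) 0xFF};
        String s = format(data, "0x", "", ", ", false);
        System.out.println(s);
        System.out.println(parse("0x0A, 0xff", "0x", "", ", ").length);
    }
}
```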

Thanks, Roger

>
> - Mark