Richard Startin blog on AVX2 and Base64

Thu Aug 23 12:12:06 UTC 2018

It seems there's a whole class of algorithms which become feasible only 
if certain amount of hardware support is available. And programmers who 
code them are ready to exchange portability for better performance.

I see 2 complementary problems here:
   (1) reliable access to individual platform-specific instructions
   (2) platform introspection capabilities to decide whether proper 
support is available

In C/C++ land, users solves that by using compiler intrinsics and CPU 
dispatching (CPUID on x86 or equivalents). Machine code snippets 
experiment (vectorSnippets branch) explored that approach in Java, but I 
believe we can achieve similar results in pure Java (based on Vector API).

A library of methods which JIT reliably translates into proper 
instructions when hardware support is available turns #1 into a 
performance problem: users have to wisely choose what methods to use 
depending on the hardware they run on, otherwise they'll "fall off the 
performance cliff".

Such approach definitely has scaling issues (e.g., it doesn't cover AES 
instructions), but, as recent experiment with VPMULDQ [1] demonstrates, 
it works pretty well for exposing ISA irregularities on API level.

Introspection capabilities can be provided through an additional API 
point to query whether a library method has JVM & hardware support or 
not. Users have to perform right set of queries before using some 
implementation to get expected performance behavior. It is similar to 
CPU dispatching, but doesn't compromise safety/correctness guarantees 
even if proper JVM/HW support is absent.

Moreover, a fail-fast mode can be implemented to diagnose improper 
usages. In that case, JVM is free to throw an exception when a method 
without proper support is called.

Best regards,
Vladimir Ivanov

[1] 
http://mail.openjdk.java.net/pipermail/panama-dev/2018-August/002440.html

On 15/08/2018 03:14, John Rose wrote:
> from http://richardstartin.uk/base64-encoding/ :
>> Would it be possible to implement a permutation like this in the
>> Vector API? I expect this will be too complex to be expressed
>> precisely because it works around and therefore embraces vpshufb
>> performing two independent 128-bit permutations, rather than a single
>> 256-bit permutation. This could be achieved with two SSE 128-bit loads
>> and permutes, but loading 256-bit vectors from pairs of 128-bit
>> vectors is convoluted as things stand.
>>
>> For the extraction step, I doubt the semantics of _mm256_mulhi_epi16
>> or _mm256_mullo_epi16 will ever be exposed to Java programmers, but it
>> is possible to take the slow path and perform independent shifts. It
>> just so happens that the calculation of the offset key relies on
>> unsigned 8-bit arithmetic which does not exist in Java, but there are
>> simpler but slower techniques to calculate the offset key. AVX2 is
>> weird, with an abundance of unexpected capabilities alongside
>> screaming feature gaps, and all the AVX2 code I read is teeming with
>> ingenious hacks. I can imagine language designers being reticent to
>> enshrine these peculiarities in a higher level language.
> 
> The cited algorithm uses the "semantic snowflake" vpshufb to perform
> a required bit-level movement within a vector.
> 
> I hope we can find ways to more directly model such semantic snowflakes
> in some adjunct to the Vector API, even though they are often hardware
> specific.  The alternative is to "fall off the usability cliff" when somebody
> like Richard needs to get to the best possible performance.  Worse, from
> my perspective, is that we (the OpenJDK engineers) are then presented
> with an inscrutable bunch of assembly code for integration into our source
> base.  Such code is hard to understand regardless of assembly or Java
> syntax, but it is *much easier* to read, validate, test, debug, repair, or
> otherwise work with if it is in Java (compared to assembly or even C).
> 
> I won't say when, but I have seen new security bugs go into security
> accelerator code because the author was compelled to write in assembly
> code, and didn't fully understand the very tricky problems involved with
> integrating said assembly safely into our platform.  Such risks would
> surely also exist if the Vector API were used instead, but they would
> surely be easier to manage.
> 
> Also, there are a number of vector-unit instructions which are not
> complete semantic snowflakes, but are odd enough not to fit into
> the center of the Vector API design, which is lanewise-SIMD
> operations on Java primitive values, using Java primitive oeprators.
> You can fall out of that center if you need to do arithmetic on
> values larger than 64 bits, or if you need carryless multiply or
> similar bit-linear operations, or if you want to tweak overflow behavior
> (saturate? treat as unsigned?) or if need an operation that is
> standard but which Java doesn't make a primitive.  In all of those
> cases, there are legitimate portable algorithms which need writing
> on our platform, we we should be able to write in a Java IDE,
> rather than by taking a mud-bath in assembly code.
> 
> In that last category of fall-outs, a standard primitive not found
> on a four-function calculator, is my favorite bit-twiddling operation,
> the AES step.  A pair of these will efficiently blenderize 128 bits;
> the resulting data smoothie can be used for many interesting
> purposes (beyond crypto), notably fast tweakable hash codes.
> I don't foresee effectively using this operation for Java, though,
> until there is a Java-level way to invoke it.  And with full efficiency
> of course.  And portably, of course (the ARM and X86 versions
> of AES being trivially interconvertible).
> 
> Those are the reasons why I'd like to push the Vector API toward
> a full-service solution for AVX hackers, at least on our platform.
> 
> One good way to move in that direction is to curate an API for each
> hardware platform that holds miscellaneous "snowflake" operations,
> as well as hard-to-categorize ones like CLMUL and AESENC.  It
> would be admittedly non-portable but it would serve well as the basis
> for experimentation, and for creating portable implementations that
> optimize properly on each major platform.
> 
> — John
>