Richard Startin blog on AVX2 and Base64

Wed Aug 15 00:14:30 UTC 2018

from http://richardstartin.uk/base64-encoding/ :
> Would it be possible to implement a permutation like this in the
> Vector API? I expect this will be too complex to be expressed
> precisely because it works around and therefore embraces vpshufb
> performing two independent 128-bit permutations, rather than a single
> 256-bit permutation. This could be achieved with two SSE 128-bit loads
> and permutes, but loading 256-bit vectors from pairs of 128-bit
> vectors is convoluted as things stand.
> 
> For the extraction step, I doubt the semantics of _mm256_mulhi_epi16
> or _mm256_mullo_epi16 will ever be exposed to Java programmers, but it
> is possible to take the slow path and perform independent shifts. It
> just so happens that the calculation of the offset key relies on
> unsigned 8-bit arithmetic which does not exist in Java, but there are
> simpler but slower techniques to calculate the offset key. AVX2 is
> weird, with an abundance of unexpected capabilities alongside
> screaming feature gaps, and all the AVX2 code I read is teeming with
> ingenious hacks. I can imagine language designers being reticent to
> enshrine these peculiarities in a higher level language.

The cited algorithm uses the "semantic snowflake" vpshufb to perform
a required bit-level movement within a vector.

I hope we can find ways to more directly model such semantic snowflakes
in some adjunct to the Vector API, even though they are often hardware
specific.  The alternative is to "fall off the usability cliff" when somebody
like Richard needs to get to the best possible performance.  Worse, from
my perspective, is that we (the OpenJDK engineers) are then presented
with an inscrutable bunch of assembly code for integration into our source
base.  Such code is hard to understand regardless of assembly or Java
syntax, but it is *much easier* to read, validate, test, debug, repair, or
otherwise work with if it is in Java (compared to assembly or even C).

I won't say when, but I have seen new security bugs go into security
accelerator code because the author was compelled to write in assembly
code, and didn't fully understand the very tricky problems involved with
integrating said assembly safely into our platform.  Such risks would
surely also exist if the Vector API were used instead, but they would
surely be easier to manage.

Also, there are a number of vector-unit instructions which are not
complete semantic snowflakes, but are odd enough not to fit into
the center of the Vector API design, which is lanewise-SIMD
operations on Java primitive values, using Java primitive oeprators.
You can fall out of that center if you need to do arithmetic on
values larger than 64 bits, or if you need carryless multiply or
similar bit-linear operations, or if you want to tweak overflow behavior
(saturate? treat as unsigned?) or if need an operation that is
standard but which Java doesn't make a primitive.  In all of those
cases, there are legitimate portable algorithms which need writing
on our platform, we we should be able to write in a Java IDE,
rather than by taking a mud-bath in assembly code.

In that last category of fall-outs, a standard primitive not found
on a four-function calculator, is my favorite bit-twiddling operation,
the AES step.  A pair of these will efficiently blenderize 128 bits;
the resulting data smoothie can be used for many interesting
purposes (beyond crypto), notably fast tweakable hash codes.
I don't foresee effectively using this operation for Java, though,
until there is a Java-level way to invoke it.  And with full efficiency
of course.  And portably, of course (the ARM and X86 versions
of AES being trivially interconvertible).

Those are the reasons why I'd like to push the Vector API toward
a full-service solution for AVX hackers, at least on our platform.

One good way to move in that direction is to curate an API for each
hardware platform that holds miscellaneous "snowflake" operations,
as well as hard-to-categorize ones like CLMUL and AESENC.  It
would be admittedly non-portable but it would serve well as the basis
for experimentation, and for creating portable implementations that
optimize properly on each major platform.

— John