Questions re: loading 512-bit constant data through ExternalAddress calls

Thu Feb 10 21:03:14 UTC 2022

Hello all,

Sorry in advance for the long email, but I thought it might be good to 
give a little background on what I'm trying to do:

I have an intrinsic that I'm working on for the ChaCha20 block 
function.  I have versions of it to support different processor 
capabilities, specifically SSE2+AVX, AVX2 and AVX512.  The first two are 
working great.  The AVX512 is giving me some headaches with a couple 
specific instructions.

I prototyped all of these in C using inline assembly before I started 
working in hotspot.  In the AVX512 implementation, there are a few 
places where one of the arguments for the EVEX.512 variant of vpaddd 
is literal data at a memory location.  I achieved this in assembly 
like this:

// state is backed by uint32[16] and keystream is uint8[256]
void cc2Ax512(uint32_t *state, uint8_t *keystream) {
     asm (
         ".data;"
             "ctrAddMaskAvx512:;"
             ".long 0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0;"

         ".text;"
             // load data into zmm0/1/2/3 - that all works fine
             "vpaddd %%zmm3, %%zmm3, [ctrAddMaskAvx512];"

             // complete the rest of the function

         : // No output registers
         : "m"(state), "m"(keystream)
         : "rbx", "rdx", "ecx"
     );
}

I left out the rest of the routine for brevity, but the data loads 
from the ctrAddMaskAvx512 address and is added correctly to the values 
already in zmm3 at the time of the vpaddd.
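
For reference, here is a minimal scalar C++ sketch of what that vpaddd 
is expected to compute.  It is only a model for checking register dumps 
against, not the prototype itself.  It assumes zmm3 holds the fourth 
ChaCha20 state row (counter plus nonce words) repeated in each of the 
four 128-bit lanes; the names kCtrAddMask, add_lanes, broadcast_row and 
demo_counter_add are made up for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// The sixteen 32-bit lanes from the .long directive above.
constexpr std::array<uint32_t, 16> kCtrAddMask = {
    0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0};

// Scalar stand-in for a 512-bit vpaddd: sixteen independent 32-bit
// adds, each wrapping modulo 2^32.
std::array<uint32_t, 16> add_lanes(const std::array<uint32_t, 16>& a,
                                   const std::array<uint32_t, 16>& b) {
    std::array<uint32_t, 16> out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = a[i] + b[i];
    }
    return out;
}

// Hypothetical helper (an assumption, see above): repeat one 4-word
// state row across the four 128-bit lanes of a 512-bit register.
std::array<uint32_t, 16> broadcast_row(const std::array<uint32_t, 4>& row) {
    std::array<uint32_t, 16> out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = row[i % 4];
    }
    return out;
}

// Usage example: only the counter word of each lane should change.
void demo_counter_add() {
    const std::array<uint32_t, 4> row = {7, 0xA, 0xB, 0xC};
    const auto r = add_lanes(broadcast_row(row), kCtrAddMask);
    assert(r[0] == 7 && r[4] == 8 && r[8] == 9 && r[12] == 10);
    assert(r[1] == 0xA && r[6] == 0xB && r[15] == 0xC);
}
```

Under that assumption, a correct add leaves the nonce words untouched 
and yields four consecutive counter values, one per 128-bit lane.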

When it came time to try the equivalent approach in hotspot, I looked 
around for anything else that might be doing this and I found some 
examples of constant values being created in functions and passed into 
what look like ExternalAddress calls (or are they constructors?  I 
haven't gone looking in-depth yet on that one).

I used the ghash_shufflemask_addr() as a template for what I thought I 
was supposed to do:

  * In stubGenerator_x86_64.cpp I created a function
    chacha20_ctradd_avx512() that is basically 8 emit_data64() calls
    with the 512 bits of data I wanted to reference.  The ghash function
    I was using as a reference only writes 128 bits, but otherwise my
    function is put together the same way.
  * Further down in stubGenerator_x86_64.cpp I assign the address this
    function returns to a StubRoutines::x86 field,
    "StubRoutines::x86::_chacha20_counter_addmask_avx512 =
    chacha20_ctradd_avx512();"
  * In stubRoutines_x86.cpp/hpp I define
    "_chacha20_counter_addmask_avx512" and create a method
    chacha20_counter_addmask_avx512() that simply returns
    _chacha20_counter_addmask_avx512.  All of this follows the
    ghash_shufflemask_addr approach.
  * Finally, when I wish to use it, say for a vpaddd call, it looks
    something like this:
      o __ vpaddd(zmm_dVec, zmm_dVec,
        ExternalAddress(StubRoutines::x86::chacha20_counter_addmask_avx512()),
        Assembler::AVX_512bit, rax);
      o By comparison, the ghash code I was using as a template passes
        its address to movdqu, so it's just a 128-bit move.  The source
        is an ExternalAddress just like mine, though, so I thought the
        technique would carry over to EVEX instructions that accept a
        memory source operand.
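
As a sanity check on the data itself, here is a small standalone C++ 
sketch of how the sixteen 32-bit lanes would pack into the eight 64-bit 
values handed to the emit_data64() calls.  It assumes the quadwords are 
emitted at increasing addresses and read back little-endian, so each 
quadword carries the even-numbered lane in its low half.  The names 
pack_lanes and check_counter_mask_packing are hypothetical and nothing 
here calls into hotspot.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Pack sixteen 32-bit lanes into eight 64-bit words.  On a
// little-endian target the lower-addressed lane is the low half of
// each quadword.
std::array<uint64_t, 8> pack_lanes(const std::array<uint32_t, 16>& lanes) {
    std::array<uint64_t, 8> quads{};
    for (std::size_t i = 0; i < quads.size(); ++i) {
        const uint64_t lo = lanes[2 * i];
        const uint64_t hi = lanes[2 * i + 1];
        quads[i] = lo | (hi << 32);
    }
    return quads;
}

// Usage example: the counter add mask from the C prototype.
void check_counter_mask_packing() {
    const std::array<uint32_t, 16> mask = {
        0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0};
    const std::array<uint64_t, 8> expected = {0, 0, 1, 0, 2, 0, 3, 0};
    assert(pack_lanes(mask) == expected);
}
```

If the eight emitted values differ from that sequence, the lanes seen 
by vpaddd will not line up with the .long layout in the prototype.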

This all compiles and for some weird reason I saw it actually work 
correctly one time.  But most of the time the output after the add call 
is completely unrecognizable, as if it's adding data from some oddball 
address.  If I comment out that vpaddd statement, the data in the 
register is exactly what I would expect it to be before the add takes 
place.  So I'm fairly confident that the statements before that 
particular add are correct.

Here's where it gets weird.  I have a similar method for my AVX2 version 
of the intrinsic.  In that case the function makes only 4 emit_data64 
calls, and the resulting address is passed into vpaddd the same way, 
but of course the vector_len is Assembler::AVX_256bit.  It works 
perfectly every time.  I don't have a good sense of why it always works 
there but not with my 512-bit counterpart.

I could definitely use some hotspot insights.  This approach in general 
was my best guess at loading/using 512-bit literals as source arguments 
but I'm definitely open to alternatives.  I have also tried the built-in 
generate_vector_custom_i32() function, since that would let me do away 
with my own custom functions, provided it doesn't hurt performance.  It 
seems to fail in the same way that my own functions do.

I am fairly new to assembly and these intrinsics so if you have 
suggestions/comments bear in mind that I don't eat/sleep/breathe hotspot 
like I would imagine some of the folks on this alias do. :)  But at 
least from a functional perspective, I know once I can get these literal 
512-bit values working the rest of the intrinsic function should fall 
into place because my C/assembly prototype works like a champ for all 
vector length variants.

Definitely open to your insights/comments,

Thanks,

--Jamil