Questions re: loading 512-bit constant data through ExternalAddress calls
Jamil Nimeh
jamil.j.nimeh at oracle.com
Thu Feb 10 21:03:14 UTC 2022
Hello all,
Sorry in advance for the long email, but I thought it might be good to
give a little background on what I'm trying to do:
I have an intrinsic that I'm working on for the ChaCha20 block
function. I have versions of it to support different processor
capabilities, specifically SSE2+AVX, AVX2 and AVX512. The first two are
working great. The AVX512 version is giving me some headaches with a
couple of specific instructions.
I prototyped all of these in C using inline assembly before I got down
to playing in hotspot. For the AVX512 implementation, there are a few
places where one of the arguments for the EVEX.512 variant of vpaddd
is literal data at a memory location. I achieved this in assembly
like this:
// state is backed by uint32[16] and keystream is uint8[256]
void cc2Ax512(uint32_t *state, uint8_t *keystream) {
    asm (
        ".data;"
        "ctrAddMaskAvx512:;"
        ".long 0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0;"
        ".text;"
        // load data into zmm0/1/2/3 - that all works fine
        "vpaddd %%zmm3, %%zmm3, [ctrAddMaskAvx512];"
        // complete the rest of the function
        : // No output registers
        : "m"(state), "m"(keystream)
        : "rbx", "rdx", "ecx"
    );
}
I didn't put the whole routine in there for brevity, but the data loads
from that ctrAddMaskAvx512 address and adds properly to the values
already in zmm3 at the time of the vpaddd.
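When checking what zmm3 should hold after that add, a scalar model can serve as a reference. The following is only a minimal sketch and not part of the original prototype: the names CTR_ADD_MASK and vpaddd512_model are made up for illustration, and it assumes zmm3 holds four copies of the 128-bit state row whose first 32-bit lane is the block counter, which is how the mask layout reads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The sixteen 32-bit lanes from the .long directive above. Under the
 * assumption stated in the lead-in, each group of four lanes belongs to
 * one ChaCha20 block and only the first lane of each group is bumped. */
static const uint32_t CTR_ADD_MASK[16] = {
    0, 0, 0, 0,  1, 0, 0, 0,  2, 0, 0, 0,  3, 0, 0, 0
};

/* Scalar model of a 512-bit vpaddd: sixteen independent 32-bit adds,
 * each wrapping modulo 2^32 with no carry into the neighbouring lane. */
static void vpaddd512_model(uint32_t dst[16],
                            const uint32_t a[16],
                            const uint32_t b[16]) {
    assert(dst != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < 16; i++) {
        dst[i] = a[i] + b[i];   /* unsigned arithmetic wraps by definition */
    }
}
```

Dumping the register right after the vector add and comparing it lane by lane against vpaddd512_model(out, d_row, CTR_ADD_MASK) would show whether the instruction read the intended sixteen lanes or data from somewhere else.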
When it came time to try the equivalent approach in hotspot, I looked
around for anything else that might be doing this and I found some
examples of constant values being created in functions and passed into
what look like ExternalAddress calls (or are they constructors? I
haven't gone looking in-depth yet on that one).
I used ghash_shufflemask_addr() as a template for what I thought I
was supposed to do:
* in stubGenerator_x86_64.cpp I created a function
chacha20_ctradd_avx512() that is basically 8 emit_data64() calls
with the 512 bits of data I wanted to reference. The ghash function
I was using as a reference only writes 128 bits, but otherwise my
function is put together the same way.
* further down in stubGenerator_x86_64.cpp I also assign this function
to a StubRoutines::x86 field,
"StubRoutines::x86::_chacha20_counter_addmask_avx512 =
chacha20_ctradd_avx512();"
* in stubRoutines_x86.cpp/hpp I define
"_chacha20_counter_addmask_avx512" and create a method
chacha20_counter_addmask_avx512() that simply returns
_chacha20_counter_addmask_avx512. All of this follows the
ghash_shufflemask_addr approach.
* Finally when I wish to use it, say for a vpaddd call, it would look
something like this:
    o __ vpaddd(zmm_dVec, zmm_dVec,
      ExternalAddress(StubRoutines::x86::chacha20_counter_addmask_avx512()),
      Assembler::AVX_512bit, rax);
    o By comparison, the ghash approach I was using as a template is
      used with movdqu, so it's just a 128-bit move, but the source is
      an ExternalAddress similar to what I'm doing so I thought the
      technique would work more or less for EVEX variants that can
      have memory source addresses.
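Since the constant is written as eight 64-bit values but consumed as sixteen 32-bit lanes, it can be worth double-checking the packing. The sketch below is plain C, not HotSpot code: pack_lanes_to_qwords and CTR_ADD_LANES are hypothetical names, and it assumes the 64-bit values end up in memory in x86 little-endian order, so that the even-numbered lane of each pair is the low half of its word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The intended 32-bit lanes, in the order they should appear in memory. */
static const uint32_t CTR_ADD_LANES[16] = {
    0, 0, 0, 0,  1, 0, 0, 0,  2, 0, 0, 0,  3, 0, 0, 0
};

/* Pack sixteen 32-bit lanes into eight 64-bit words. When word i is
 * stored little-endian, lane 2*i occupies its low four bytes and lane
 * 2*i+1 its high four bytes, so the lanes keep their memory order. */
static void pack_lanes_to_qwords(const uint32_t lanes[16],
                                 uint64_t qwords[8]) {
    assert(lanes != NULL && qwords != NULL);
    for (size_t i = 0; i < 8; i++) {
        qwords[i] = (uint64_t)lanes[2 * i]
                  | ((uint64_t)lanes[2 * i + 1] << 32);
    }
}
```

For the counter mask this yields the words 0, 0, 1, 0, 2, 0, 3, 0, which would be the eight values to compare against whatever the emit_data64() calls are given.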
This all compiles and for some weird reason I saw it actually work
correctly one time. But most of the time the output after the add call
is completely unrecognizable, as if it's adding data from some oddball
address. If I comment out that vpaddd statement, the data in the
register is exactly what I would expect it to be before the add takes
place. So I'm fairly confident that the statements before that
particular add are correct.
Here's where it gets weird. I have a similar method for my AVX2 version
of the intrinsic. In that case, it's only doing 4 emit_data64 calls,
and the address is passed the same way into a vpaddd, but of course the
vector_len is Assembler::AVX_256bit. It works perfectly every time. I
don't have a good sense of why it's always working there but not with my
512-bit counterpart.
I could definitely use some hotspot insights. This approach in general
was my best guess at loading/using 512-bit literals as source arguments
but I'm definitely open to alternatives. I have also tried the built-in
generate_vector_custom_i32() function since that would allow me to do
away with my own custom functions, so long as it doesn't hurt from a
performance standpoint. It seems to fail in the same way that my own
functions do.
I am fairly new to assembly and these intrinsics so if you have
suggestions/comments bear in mind that I don't eat/sleep/breathe hotspot
like I would imagine some of the folks on this alias do. :) But at
least from a functional perspective, I know once I can get these literal
512-bit values working the rest of the intrinsic function should fall
into place because my C/assembly prototype works like a champ for all
vector length variants.
Definitely open to your insights/comments,
Thanks,
--Jamil