[vectorIntrinsics] RFR: Add utf8 decoding benchmarks

Wed Nov 25 14:46:01 UTC 2020

On Wed, 25 Nov 2020 07:53:58 GMT, John R Rose <jrose at openjdk.org> wrote:

> I wonder if there's any benefit to intrinsifying some or all of the steps between deriving a syndrome number and applying the corresponding selected shuffle(s). In this example the steps are: Do a compare, convert the comparison to a scalar bit mask (syndrome number), use it as a get key on a Java object, make some more indirections, grab a shuffle vector, and finally use it to steer the original data. There's also bits of control flow interspersed. 

Yes, you summarize the steps well. Other than the validation step, all of them are fairly straightforward. Precomputing as much as possible ahead of time (the lookup table) is what gets you the speed, at least in theory. However, as discussed on the mailing list, this has the disadvantage of computing all of that information at startup.
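
For readers following along, here is a minimal scalar sketch of the pattern described above: compare, turn the comparison into a bit mask (the syndrome), use it as a key into a table built ahead of time, and apply the selected shuffle to the original data. It is not the code in the PR. The class and method names are hypothetical, the 8-lane block size is an assumption, and the shuffle only compacts ASCII bytes to the front of the block; it does not decode or validate UTF-8.

```java
import java.util.Arrays;

// Hypothetical illustration only: scalar emulation of "syndrome -> table -> shuffle".
public class SyndromeShuffleSketch {
    static final int LANES = 8; // assumed block size, chosen to keep the table small

    // One shuffle per possible syndrome, computed once when the class is initialized.
    // This is the startup cost mentioned above.
    static final int[][] SHUFFLES = new int[1 << LANES][];

    static {
        for (int syndrome = 0; syndrome < SHUFFLES.length; syndrome++) {
            int[] shuffle = new int[LANES];
            Arrays.fill(shuffle, -1); // -1 means "no source lane, write zero"
            int out = 0;
            for (int lane = 0; lane < LANES; lane++) {
                if ((syndrome & (1 << lane)) == 0) { // bit clear: ASCII lane
                    shuffle[out++] = lane;
                }
            }
            SHUFFLES[syndrome] = shuffle;
        }
    }

    // Compare each lane against zero and pack the results into a scalar bit mask.
    // Bit i is set when lane i holds a non-ASCII byte (high bit set).
    static int syndrome(byte[] block) {
        int mask = 0;
        for (int lane = 0; lane < LANES; lane++) {
            if (block[lane] < 0) {
                mask |= 1 << lane;
            }
        }
        return mask;
    }

    // Steer the original data with the selected shuffle.
    static byte[] rearrange(byte[] block, int[] shuffle) {
        byte[] result = new byte[LANES];
        for (int lane = 0; lane < LANES; lane++) {
            int src = shuffle[lane];
            result[lane] = src < 0 ? 0 : block[src];
        }
        return result;
    }

    // The whole chain: compare, syndrome, table lookup, shuffle.
    static byte[] compactAscii(byte[] block) {
        return rearrange(block, SHUFFLES[syndrome(block)]);
    }
}
```

In a Vector API version, these steps would roughly correspond to a lane-wise compare, VectorMask.toLong(), a table lookup, and rearrange() with a VectorShuffle. The table lookup and the indirections around it are the part the JIT has to see through.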

What's interesting is that the same algorithm implemented in C++ yields significant improvements, but we're not getting similar gains with the Vector API. I'm still working out why the JIT isn't able to generate the right code, and I'll report my results back to the mailing list. In the meantime, would you be fine with merging the PR with the current implementation?

> That's a lot of stuff for the JIT to "see through".

Do you mean "see through" in the specific case of using the Vector API? To try to help the JIT, I tried storing in DecoderLutEntry only the byte[] and short[] arrays backing the decoding and validation vectors, and loading the vectors from those arrays. However, I didn't see any performance improvement from that; I even observed a non-negligible regression.

-------------

PR: https://git.openjdk.java.net/panama-vector/pull/26