[vectorIntrinsics] RFR: Add utf8 decoding benchmarks

Wed Nov 25 15:08:25 UTC 2020

>> I wonder if there's any benefit to intrinsifying some or all of the steps between deriving a syndrome number and applying the corresponding selected shuffle(s). In this example the steps are: do a compare, convert the comparison to a scalar bit mask (the syndrome number), use it as a lookup key into a Java object, follow some more indirections, grab a shuffle vector, and finally use it to steer the original data. There are also bits of control flow interspersed.
> 
> Yes, you summarize the steps well. Other than the validation step, all of them are fairly straightforward. Offloading as much as possible ahead-of-time (the lookup table) really is what gets you the speed (in theory). However, as discussed on the mailing list, this has the disadvantage of computing all this information at startup.
> 
> What's interesting is that the same algorithm implemented in C++ yields significant improvements, but we're not getting similar gains with the Vector API. I'm still working out why the JIT isn't able to generate the right code, and I'll come back to the mailing list with my results. In the meantime, would it be OK with you to merge the PR with the current implementation?

Implementation-wise, there are multiple places in the JVM where things 
can currently go wrong before the actual algorithm starts to matter: 
failures during intrinsification or vector box elimination are usually 
the ones that cause the most severe slowdowns.
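
For reference, the sequence of steps quoted above (compare, collapse to a
scalar mask, table lookup, shuffle) can be sketched in plain scalar Java.
This is only an illustration under stated assumptions, not the code from
the PR and not the Vector API: the class name, the 8-lane width and the
"compact the ASCII bytes to the front" use case are all hypothetical,
chosen to keep the example small.

```java
import java.util.Arrays;

// Scalar emulation of "syndrome -> shuffle lookup". Hypothetical sketch:
// names, lane count and use case are assumptions, not taken from the PR.
public class SyndromeShuffleSketch {
    static final int LANES = 8;

    // Lookup table built once at class initialization: one shuffle
    // (array of source lane indexes) per possible mask value.
    static final int[][] SHUFFLES = new int[1 << LANES][];

    static {
        for (int mask = 0; mask < (1 << LANES); mask++) {
            int[] shuffle = new int[LANES];
            Arrays.fill(shuffle, -1);           // -1 means "emit zero"
            int out = 0;
            for (int lane = 0; lane < LANES; lane++) {
                if ((mask & (1 << lane)) != 0) {
                    shuffle[out++] = lane;      // selected lanes move to the front
                }
            }
            SHUFFLES[mask] = shuffle;
        }
    }

    // Steps 1 and 2: compare every lane, then collapse the per-lane
    // results into a scalar bit mask (the "syndrome number").
    static int syndrome(byte[] lanes) {
        int mask = 0;
        for (int lane = 0; lane < LANES; lane++) {
            if (lanes[lane] >= 0) {             // ASCII bytes are non-negative
                mask |= 1 << lane;
            }
        }
        return mask;
    }

    // Final step: use the selected shuffle to steer the original data.
    static byte[] rearrange(byte[] lanes, int[] shuffle) {
        byte[] result = new byte[LANES];
        for (int i = 0; i < LANES; i++) {
            result[i] = shuffle[i] < 0 ? 0 : lanes[shuffle[i]];
        }
        return result;
    }

    // All steps together: mask -> table lookup -> rearrange.
    static byte[] compactAscii(byte[] lanes) {
        int mask = syndrome(lanes);
        int[] shuffle = SHUFFLES[mask];         // the lookup keyed by the syndrome
        return rearrange(lanes, shuffle);
    }

    public static void main(String[] args) {
        // "a", U+00E9 (2 bytes), "b", "c", U+20AC (3 bytes) in UTF-8
        byte[] in = {'a', (byte) 0xC3, (byte) 0xA9, 'b', 'c',
                     (byte) 0xE2, (byte) 0x82, (byte) 0xAC};
        System.out.println(Arrays.toString(compactAscii(in)));
    }
}
```

In a Vector API version each of these scalar loops would be expected to
map to a single vector operation, so the table lookup and the control
flow around it are the parts that stay scalar.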

Best regards,
Vladimir Ivanov