RFR: 8314124: RISC-V: implement Base64 intrinsic - decoding

Camel Coder duke at openjdk.org
Tue Aug 20 18:31:12 UTC 2024


On Fri, 5 Jul 2024 13:48:24 GMT, Hamlin Li <mli at openjdk.org> wrote:

>> ## Performance
>> benchmarks run on CanVM-K230
>> 
>> data
>> Benchmark m2+m1+scalar | (addSpecial) | (errorIndex) | (lineSize) | (maxNumBytes) | Mode | Cnt | Score +intrinsic+rvv | Score -intrinsic | Error | Units | Improvement
>> -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
>> Base64Decode.testBase64Decode | 0 | 144 | 4 | 1 | avgt | 10 | 97.771 | 98.506 | 0.713 | ns/op | 1.008
>> Base64Decode.testBase64Decode | 0 | 144 | 4 | 3 | avgt | 10 | 117.715 | 118.422 | 0.428 | ns/op | 1.006
>> Base64Decode.testBase64Decode | 0 | 144 | 4 | 7 | avgt | 10 | 174.625 | 172.767 | 7.671 | ns/op | 0.989
>> Base64Decode.testBase64Decode | 0 | 144 | 4 | 32 | avgt | 10 | 286.391 | 317.175 | 11.443 | ns/op | 1.107
>> Base64Decode.testBase64Decode | 0 | 144 | 4 | 64 | avgt | 10 | 336.932 | 503.257 | 15.738 | ns/op | 1.494
>> Base64Decode.testBase64Decode | 0 | 144 | 4 | 80 | avgt | 10 | 418.894 | 625.485 | 7.21 | ns/op | 1.493
>> Base64Decode.testBase64Decode | 0 | 144 | 4 | 96 | avgt | 10 | 353.813 | 698.67 | 15.485 | ns/op | 1.975
>> Base64Decode.testBase64Decode | 0 | 144 | 4 | 112 | avgt | 10 | 499.243 | 866.909 | 4.427 | ns/op | 1.736
>> Base64Decode.testBase64Decode | 0 | 144 | 4 | 512 | avgt | 10 | 1451.277 | 3530.048 | 3.685 | ns/op | 2.432
>> Base64Decode.testBase64Decode | 0 | 144 | 4 | 1000 | avgt | 10 | 2258.785 | 5964.066 | 9.075 | ns/op | 2.64
>> Base64Decode.testBase64Decode | 0 | 144 | 4 | 20000 | avgt | 10 | 39689.204 | 122334.929 | 255.195 | ns/op | 3.082
>> Base64Decode.testBase64MIMEDecode | 0 | 144 | 4 | 1 | avgt | 10 | 187.032 | 158.558 | 7.606 | ns/op | 0.848
>> Base64Decode.testBase64MIMEDecode | 0 | 144 | 4 | 3 | avgt | 10 | 209.558 | 200.774 | 7.648 | ns/op | 0.958
>> Base64Decode.testBase64MIMEDecode | 0 | 144 | 4 | 7 | avgt | 10 | 556.696 | 505.072 | 8.748 | ns/op | 0.907
>> Base64Decode.testBase64MIMEDecode | 0 | 144 | 4 | 32 | avgt | 10 | 2139.767 | 1876.825 | 13.787 | ns/op | 0.877
>> Base64Decode.testBase64MIMEDecode | 0 | 144 | 4 | 64 | avgt | 10 | 6142.353 | 3818.199 | 35.622 | ns/op | 0.622
>> Base64Decode.testBase64MIMEDecode | 0 | 144 | 4 | 80 | avgt | 10 | 8746.205 | 4787.155 | 109.819 | ns/op ...
>
> To continue the discussion at https://github.com/openjdk/jdk/pull/19973#issuecomment-2210907011.
> 
> The vrgroup implementation brings some regression for large data sizes compared with the current implementation in this PR. (vrgroup also brings a regression for small data sizes, but we can ignore that, as the current implementation uses the scalar version when the data size is small, so it's expected.)
> An implementation with vrgroup is at https://github.com/openjdk/jdk/compare/master...Hamlin-Li:jdk:baes64-decode-vrgroup?expand=1
> 
> A comparison between this implementation and vrgroup:
> Benchmark +/- vrgroup | (addSpecial) | (errorIndex) | (lineSize) | (maxNumBytes) | Mode | Cnt | Score +intrinsic+rvv+vrgroup | Score +intrinsic+rvv-vrgroup | Error | Units | Improvement of vrgroup
> -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
> Base64Decode.testBase64Decode | 0 | 144 | 4 | 1 | avgt | 10 | 101.993 | 99.2 | 0.781 | ns/op | 0.973
> Base64Decode.testBase64Decode | 0 | 144 | 4 | 3 | avgt | 10 | 117.832 | 117.596 | 2.431 | ns/op | 0.998
> Base64Decode.testBase64Decode | 0 | 144 | 4 | 7 | avgt | 10 | 429.577 | 174.873 | 4.125 | ns/op | 0.407
> Base64Decode.testBase64Decode | 0 | 144 | 4 | 32 | avgt | 10 | 1760.438 | 286.046 | 3.946 | ns/op | 0.162
> Base64Decode.testBase64Decode | 0 | 144 | 4 | 64 | avgt | 10 | 1060.156 | 339.35 | 1.789 | ns/op | 0.32
> Base64Decode.testBase64Decode | 0 | 144 | 4 | 80 | avgt | 10 | 1929.515 | 422.906 | 48.816 | ns/op | 0.219
> Base64Decode.testBase64Decode | 0 | 144 | 4 | 96 | avgt | 10 | 398.397 | 340.595 | 1.805 | ns/op | 0.855
> Base64Decode.testBase64Decode | 0 | 144 | 4 | 112 | avgt | 10 | 1257.429 | 495.14 | 1.849 | ns/op | 0.394
> Base64Decode.testBase64Decode | 0 | 144 | 4 | 512 | avgt | 10 | 3115.738 | 1451.795 | 17.349 | ns/op | 0.466
> Base64Decode.testBase64Decode | 0 | 144 | 4 | 1000 | avgt | 10 | 4719.422 | 2321.598 | 582.276 | ns/op | 0.492
> Base64Decode.testBase64Decode | 0 | 144 | 4 | 20000 | avgt | 10 | 48630.78 | 40487.502 | 370.749 | ns/op | 0.833
> Base64Decode.testBase64MIMEDecode | 0 | 144 | 4 | 1 | avgt | 10 | 252.071 | 187.793 | 12.937 | ns/op | 0.745
> Base64Decode.testBase64MIMEDe...

@Hamlin-Li Yeah, you are right, yours is faster on the C908, and also on the X60.
I measured that yours took 0.93x (C908) and 0.85x (X60) the amount of time mine took. (Note: I had modified the code linked in the base64simd issue because I thought I'd found an easy optimization, and added that modified code to my first post here; it turns out the change made it slightly slower, so I used the original variant: https://godbolt.org/z/hrs61x9aP.)

I think I have an idea of what's going on. First, look at these 4-bit LUT benchmarks, specifically `rvv_gathers_m1` and `rvv_vluxei8_m2`:

On the [C908](https://camel-cdr.github.io/rvv-bench-results/canmv_k230/LUT4.html), using LMUL=1 vrgather for lookup tables is roughly twice as fast as using `rvv_vluxei8_m2`; however, your code uses four vluxei8 instructions, while mine uses twelve LMUL=1 vrgathers, so yours ends up faster.
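
For reference, here is a minimal sketch of the two lookup strategies being compared, written with the RVV C intrinsics (my own illustration with made-up function names, not the PR's code or my exact benchmark code):

```c
#include <riscv_vector.h>
#include <stdint.h>
#include <stddef.h>

/* Strategy A: keep the 16-entry table in a vector register and look it up
 * with an LMUL=1 vrgather. No memory traffic per lookup, but vrgather can't
 * start until the whole index register is ready.
 * (Loading the table with vl=16 assumes VLEN >= 128.) */
static inline vuint8m1_t lut4_vrgather(vuint8m1_t nibbles,
                                       const uint8_t tbl[16], size_t vl) {
    vuint8m1_t vtbl = __riscv_vle8_v_u8m1(tbl, 16);      /* table -> vreg   */
    return __riscv_vrgather_vv_u8m1(vtbl, nibbles, vl);  /* per-lane lookup */
}

/* Strategy B: leave the table in memory and use an indexed load (vluxei8);
 * every lane turns into a byte-sized memory access. */
static inline vuint8m1_t lut4_vluxei8(vuint8m1_t nibbles,
                                      const uint8_t tbl[16], size_t vl) {
    return __riscv_vluxei8_v_u8m1(tbl, nibbles, vl);
}
```

Which strategy wins depends on how a given core implements the two instructions, which is exactly what the per-CPU numbers below show.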

Now for the [X60](https://camel-cdr.github.io/rvv-bench-results/bpi_f3/LUT4.html): there, LMUL=1 vrgather is about four times faster for smaller input sizes (presumably when the data fits into cache), yet somehow slower than vluxei8 for larger input sizes (presumably once the accesses actually go to memory). I mentioned that yours is faster on the X60, but after looking at this graph I tried restricting the input to under 200KB, and as predicted mine is now 1.3x faster.

On the [C920](https://camel-cdr.github.io/rvv-bench-results/milkv_pioneer/LUT4.html), LMUL=1 vrgather ~~is about 4.5 times faster than using vluxei8 for smaller inputs, but somehow it's up to 8.5x faster for large inputs??~~ Sorry, I accidentally looked at the `rvv_m1_gather_m2` graph; in the `rvv_gather_m1` graph it's 15x faster for small inputs and 8x faster for large ones.

That all seems very weird, but I think I know what's going on, and what distinguishes the current ooo implementations from the in-order ones: vector chaining support.

I think it's safe to say that the X60 does use vector chaining for its loads and stores, but not for vrgather. That would explain how vrgather ended up slower than the vluxei8 variant: vrgather isn't chained and needs all of its elements ready, while vluxei8 chains with the other loads/stores, since both need to access memory directly. If you compare the measurements in the graphs, you'll see that vluxei8 takes about 0.88x the time of vrgather for large inputs, which is quite close to the 0.85x I measured for the base64 decode. Although that is probably a bit of a coincidence.
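
To make the chaining argument concrete, here is a hedged sketch (illustrative names, a plain nibble-translate loop rather than the actual base64 kernel) of the streaming pattern in question, where the lookup sits between a vector load and a vector store:

```c
#include <riscv_vector.h>
#include <stdint.h>
#include <stddef.h>

/* Translate every byte of src through a 16-entry table (low nibble only,
 * to keep the example small). With chaining, the vluxei8 can begin
 * consuming index elements while the vle8 is still producing them; a
 * vrgather in its place would stall until the whole index register is
 * written back. */
void nibble_translate(uint8_t *dst, const uint8_t *src, size_t n,
                      const uint8_t tbl[16]) {
    for (size_t vl; n > 0; n -= vl, src += vl, dst += vl) {
        vl = __riscv_vsetvl_e8m1(n);
        vuint8m1_t idx = __riscv_vle8_v_u8m1(src, vl);         /* producer  */
        idx = __riscv_vand_vx_u8m1(idx, 0x0f, vl);             /* 4-bit idx */
        vuint8m1_t out = __riscv_vluxei8_v_u8m1(tbl, idx, vl); /* chainable */
        __riscv_vse8_v_u8m1(dst, out, vl);                     /* consumer  */
    }
}
```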

I don't know if ooo makes vector chaining considerably harder, but I'll make the prediction that most first-gen ooo processors won't implement it, because the people working on those cores have a lot of experience doing fixed-width SIMD without chaining on Arm or x86 cores. Chaining is also more useful when DLEN < VLEN, which is unlikely to be common among the targets that high-performance ooo implementations aim for.

~~I'm not sure why the performance of vrgather does the 2x jump. It might be a measurement artifact, considering that the peaks for smaller inputs go up to that speed.~~

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20026#issuecomment-2212611777

