RFR: 8314125: RISC-V: implement Base64 intrinsic - encoding [v2]

Thu Jul 4 17:56:20 UTC 2024

On Mon, 1 Jul 2024 15:36:03 GMT, Hamlin Li <mli at openjdk.org> wrote:

>> Hamlin Li has updated the pull request incrementally with one additional commit since the last revision:
>> 
>>   use pure scalar version when rvv is not supported
>
> With the pure scalar implementation, there is also some performance improvement at all source sizes, so the intrinsic is now also enabled when RVV is not supported.
> 
> performance data
> 
> Benchmark | (maxNumBytes) | Mode | Cnt | Score -intrinsic | Score +intrinsic, scalar | Error | Units | Perf opt
> -- | -- | -- | -- | -- | -- | -- | -- | --
> Base64Encode.testBase64Encode | 1 | avgt | 10 | 86.784 | 86.75 | 0.38 | ns/op | 1
> Base64Encode.testBase64Encode | 2 | avgt | 10 | 93.71 | 93.824 | 1.954 | ns/op | 0.999
> Base64Encode.testBase64Encode | 3 | avgt | 10 | 121.824 | 123.487 | 0.559 | ns/op | 0.987
> Base64Encode.testBase64Encode | 6 | avgt | 10 | 138.984 | 137.697 | 0.273 | ns/op | 1.009
> Base64Encode.testBase64Encode | 7 | avgt | 10 | 161.243 | 157.696 | 0.875 | ns/op | 1.022
> Base64Encode.testBase64Encode | 9 | avgt | 10 | 169.724 | 155.223 | 1.908 | ns/op | 1.093
> Base64Encode.testBase64Encode | 10 | avgt | 10 | 185.92 | 176.339 | 5.875 | ns/op | 1.054
> Base64Encode.testBase64Encode | 48 | avgt | 10 | 408.467 | 347.269 | 1.799 | ns/op | 1.176
> Base64Encode.testBase64Encode | 512 | avgt | 10 | 3665.34 | 2718.442 | 26.954 | ns/op | 1.348
> Base64Encode.testBase64Encode | 1000 | avgt | 10 | 7022.025 | 5290.003 | 33.216 | ns/op | 1.327
> Base64Encode.testBase64Encode | 20000 | avgt | 10 | 135819.7 | 101988.94 | 2209.887 | ns/op | 1.332

@Hamlin-Li Hi, we looked at RVV base64 encode/decode for another project before; however, no single implementation was obviously best across the different hardware: https://github.com/WojciechMula/base64simd/issues/9 (see the issue for benchmarks, and the repo for code)

I think we currently can't tell how the complex loads/stores will perform on future hardware. Segmented loads/stores, for example, are quite fast on the current in-order RVV 1.0 boards, but they are very slow on the out-of-order (ooo) C910 and XiangShan (current master, may change) cores (LLVM-MCA for the SiFive P670 indicates that they might also be slow on that core). I'm not sure whether that is because these cores are ooo, which adds constraints, but I wouldn't rely on segmented loads/stores just yet.
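For readers unfamiliar with segmented loads/stores, here is a minimal scalar model of what a unit-stride segment load and store do to the data. It is only a sketch of the semantics (it ignores `vl`, masking, and tail handling, and assumes the input length is a multiple of the field count); the function names `segment_load` and `segment_store` are made up for illustration and say nothing about performance.

```python
def segment_load(mem, nf):
    """Model of a unit-stride segment load such as vlseg3e8.v:
    field i of every nf-byte segment ends up in destination register i."""
    return [list(mem[i::nf]) for i in range(nf)]


def segment_store(regs):
    """Model of the matching segment store (e.g. vsseg3e8.v):
    element j of every register is written back as one contiguous segment."""
    return bytes(b for segment in zip(*regs) for b in segment)


src = b"abcdefghijkl"
r0, r1, r2 = segment_load(src, 3)
print(bytes(r0), bytes(r1), bytes(r2))
print(segment_store([r0, r1, r2]) == src)
```

The de-interleaving is what makes these instructions convenient for base64 (3 input bytes per 4 output characters), and it is also the part whose hardware cost is hard to predict.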

For now, I think the safest bet for encode would be "RISC-V RVV (LMUL=1)" ([`encode`](https://github.com/WojciechMula/base64simd/blob/master/encode/encode.rvv.cpp#L60C14-L60C20) + [`lookup_pshufb_improved`](https://github.com/WojciechMula/base64simd/blob/master/encode/lookup.rvv.cpp#L7)), as this only uses instructions with predictable performance, except for LMUL=1 `vrgather.vv`, which I think will need to be fast on any application-class core. (Compare the x86 equivalent, `vperm*`.)
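To make the lookup idea concrete, here is a scalar model of a 16-entry offset-table lookup in the style of `lookup_pshufb_improved`, as far as that technique is understood here. It is a sketch and not code from the linked repository: the names `SHIFT_LUT`, `lookup`, and `encode_full_groups` are hypothetical, and only whole 3-byte groups are handled (no tail, no `=` padding). In a vector version, the table access would be the single LMUL=1 `vrgather.vv`.

```python
import base64

# Offset added to a 6-bit index to obtain its ASCII code, one entry per range.
SHIFT_LUT = (
    [ord("a") - 26]          # slot 0:     indices 26..51 -> 'a'..'z'
    + [ord("0") - 52] * 10   # slots 1-10: indices 52..61 -> '0'..'9'
    + [ord("+") - 62]        # slot 11:    index 62       -> '+'
    + [ord("/") - 63]        # slot 12:    index 63       -> '/'
    + [ord("A")]             # slot 13:    indices 0..25  -> 'A'..'Z'
    + [0, 0]                 # slots 14-15 unused
)


def lookup(index):
    """Map a 6-bit value to its base64 character via the offset table."""
    slot = max(index - 51, 0)  # saturating subtract: 0 for 0..51, 1..12 above
    if index < 26:
        slot = 13              # separate 'A'..'Z' from 'a'..'z'
    return chr(index + SHIFT_LUT[slot])


def encode_full_groups(data):
    """Encode only the complete 3-byte groups of `data` (no padding)."""
    out = []
    for k in range(0, len(data) - len(data) % 3, 3):
        word = data[k] << 16 | data[k + 1] << 8 | data[k + 2]
        out.extend(lookup((word >> shift) & 63) for shift in (18, 12, 6, 0))
    return "".join(out)


sample = bytes(range(256)) * 3  # length is a multiple of 3
print(encode_full_groups(sample) == base64.b64encode(sample).decode())
```

Apart from the table access, every step is a shift, mask, compare, or add, which is why the performance of this variant should be comparatively easy to predict.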

For decode, I'm not really happy with any implementation. Yours uses multiple `vluxei8` + `vlseg4e8` + `vsseg3e8`; the others from base64simd use LMUL=8 `vrgather.vv`, which will take `LMUL^2=8^2=64` times the number of cycles an LMUL=1 `vrgather.vv` takes (on sane implementations, [see my reasoning](https://gitlab.com/riseproject/riscv-optimization-guide/-/issues/1#note_1977583125)). As I said, I'm fairly certain LMUL=1 `vrgather.vv` will have to be relatively fast, so if I had to choose, I'd prefer [my implementation](https://godbolt.org/z/7qc1xhMao), which uses LMUL=1 `vrgather.vv`s + `vlseg4e8` + `vsseg3e8`, but using `vsseg*` is not ideal. (Note that gcc currently chokes on the register allocation, so you should use clang for now.)
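The `LMUL^2` estimate can be written out as a small cost model. This is only a sketch under the assumption stated above (each destination register may need elements from every source register of the group); the function names are hypothetical and real cores may behave differently.

```python
def vrgather_relative_cycles(lmul):
    """Cycles relative to one LMUL=1 vrgather.vv under the LMUL^2 assumption:
    each of the `lmul` destination registers may pull elements from any of
    the `lmul` source registers, so lmul * lmul register-sized gathers."""
    return lmul * lmul


def relative_cycles_per_register(lmul):
    """The same cost, normalised by how many registers one instruction covers."""
    return vrgather_relative_cycles(lmul) // lmul


for lmul in (1, 2, 4, 8):
    print(lmul, vrgather_relative_cycles(lmul), relative_cycles_per_register(lmul))
```

Under this model an LMUL=8 gather covers 8 times the data of an LMUL=1 gather at 64 times the cost, so it is 8 times more expensive per register of data processed, which is the argument for staying at LMUL=1.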

-------------

PR Comment: https://git.openjdk.org/jdk/pull/19973#issuecomment-2209403751