RFR: 8314125: RISC-V: implement Base64 intrinsic - encoding [v4]

Hamlin Li mli at openjdk.org
Fri Jul 5 13:48:24 UTC 2024


On Tue, 2 Jul 2024 14:16:35 GMT, Hamlin Li <mli at openjdk.org> wrote:

>> Hi,
>> Can you help to review the patch?
>> 
>> I'm also working on a base64 decode intrinsic, but there is some performance regression in some cases. Since decode and encode are totally independent of each other, I will send out the decode review in another PR once I fix the performance regression in it.
>> 
>> Thanks.
>> 
>> ## Test
>> benchmarks run on CanVM-K230
>> 
>> I've tried several implementations, respectively with vector group
>> * m2+m1+scalar
>> * m2+scalar
>> * m1+scalar
>> * pure scalar
>> The best one is the combination of m2+m1; it has the best performance across all source sizes.
>> 
>> this implementation (m2+m1)
>> Benchmark | (maxNumBytes) | Mode | Cnt | Score -intrinsic | Score +intrinsic, m1+m2 | Error | Units | -intrinsic/+intrinsic
>> -- | -- | -- | -- | -- | -- | -- | -- | --
>> Base64Encode.testBase64Encode | 1 | avgt | 10 | 86.784 | 86.996 | 0.459 | ns/op | 0.9975631063
>> Base64Encode.testBase64Encode | 2 | avgt | 10 | 93.603 | 94.026 | 1.081 | ns/op | 0.9955012443
>> Base64Encode.testBase64Encode | 3 | avgt | 10 | 121.927 | 123.227 | 0.342 | ns/op | 0.989450364
>> Base64Encode.testBase64Encode | 6 | avgt | 10 | 139.554 | 137.4 | 1.221 | ns/op | 1.015676856
>> Base64Encode.testBase64Encode | 7 | avgt | 10 | 160.698 | 162.25 | 2.36 | ns/op | 0.9904345146
>> Base64Encode.testBase64Encode | 9 | avgt | 10 | 161.085 | 153.772 | 1.505 | ns/op | 1.047557423
>> Base64Encode.testBase64Encode | 10 | avgt | 10 | 187.963 | 174.763 | 1.204 | ns/op | 1.075530862
>> Base64Encode.testBase64Encode | 48 | avgt | 10 | 405.212 | 199.4 | 6.374 | ns/op | 2.032156469
>> Base64Encode.testBase64Encode | 512 | avgt | 10 | 3652.555 | 1111.009 | 3.462 | ns/op | 3.287601631
>> Base64Encode.testBase64Encode | 1000 | avgt | 10 | 7217.187 | 2011.943 | 227.784 | ns/op | 3.587172698
>> Base64Encode.testBase64Encode | 20000 | avgt | 10 | 135165.706 | 33864.592 | 57.557 | ns/op | 3.991357876
>> 
>> vector with only m2
>> ...
>
> Hamlin Li has updated the pull request incrementally with one additional commit since the last revision:
> 
>   move label

Thanks a lot for sharing the information.

> @Hamlin-Li Hi, we looked at RVV base64 encode/decode for another project before, however there wasn't one implementation that obviously was best across the different hardware: [WojciechMula/base64simd#9](https://github.com/WojciechMula/base64simd/issues/9) (see issue for benchmark, and repo for code)

Agreed, I think your observation is right.

> I think we currently can't tell how, the complex load/stores will perform on future hardware. Segmented load/stores for example are quite fast on the current in-order RVV 1.0 boards, however it's very slow on the ooo C910, and XiangShan (current master, may change) cores (SiFive P670 LLVM-MCA indicates that it might also be slow on that core). I'm not sure if that is because they are ooo and that gives you additional constraints, but I wouldn't rely on it just yet.

I don't know how that (`it's very slow on the ooo`) happens, and currently I don't have these types of machines. It's also a bit strange that they are very slow with those instructions; could it be that these machines are simply not yet fully optimized for those instructions?
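(For readers following along: a segmented load such as `vlseg3e8.v` de-interleaves groups of 3 consecutive bytes into 3 separate vector registers in one instruction, which is exactly the gather pattern base64 encoding needs. A plain-C model of that semantics, with made-up function and array names, just to illustrate what the hardware has to do:)

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of what vlseg3e8.v computes: de-interleave vl
 * groups of 3 consecutive source bytes into three destination
 * "registers" (modeled here as plain arrays). */
static void seg3_load(const uint8_t *src, size_t vl,
                      uint8_t *v0, uint8_t *v1, uint8_t *v2) {
    for (size_t i = 0; i < vl; i++) {
        v0[i] = src[3 * i + 0];
        v1[i] = src[3 * i + 1];
        v2[i] = src[3 * i + 2];
    }
}
```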

> I think the safest bet for encode would be for now "RISC-V RVV (LMUL=1)" ([`encode`](https://github.com/WojciechMula/base64simd/blob/master/encode/encode.rvv.cpp#L60C14-L60C20) + [`lookup_pshufb_improved`](https://github.com/WojciechMula/base64simd/blob/master/encode/lookup.rvv.cpp#L7)), as this only uses instructions with predictable performance, except for LMUL=1 `vrgather.vv`, which I think will need to be fast on any application class core. (See x86 equivalent vperm*)

My current tests on the K230 show that m2+m1+scalar brings the best performance across all sizes; I'd like to see test data on other hardware if someone can help run the benchmarks.
Also, in the current implementation it's easy to adjust the LMUL value in the algorithm, so I'm flexible about either LMUL value based on the test data.
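(For reference, the kernel being vectorized here: scalar Base64 encoding splits each 3-byte group into four 6-bit indices and maps each index into the 64-character alphabet. A minimal C sketch, illustrative only and not the HotSpot intrinsic; the branchy range classification below is the scalar equivalent of the offset trick that `lookup_pshufb_improved` does with `vrgather`/`pshufb`:)

```c
#include <stdint.h>

/* Map a 6-bit index to its Base64 character by adding a
 * range-dependent offset -- the same classification that the
 * vectorized lookup performs with a small gather table. */
static uint8_t b64_char(uint8_t idx) {
    if (idx < 26) return (uint8_t)(idx + 'A');        /* 0..25  -> 'A'..'Z' */
    if (idx < 52) return (uint8_t)(idx - 26 + 'a');   /* 26..51 -> 'a'..'z' */
    if (idx < 62) return (uint8_t)(idx - 52 + '0');   /* 52..61 -> '0'..'9' */
    return idx == 62 ? '+' : '/';
}

/* Encode one 3-byte group into 4 output characters:
 * 24 input bits are regrouped into four 6-bit indices. */
static void b64_encode3(const uint8_t in[3], uint8_t out[4]) {
    out[0] = b64_char(in[0] >> 2);
    out[1] = b64_char((uint8_t)(((in[0] & 0x3) << 4) | (in[1] >> 4)));
    out[2] = b64_char((uint8_t)(((in[1] & 0xF) << 2) | (in[2] >> 6)));
    out[3] = b64_char(in[2] & 0x3F);
}
```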

-------------

PR Comment: https://git.openjdk.org/jdk/pull/19973#issuecomment-2210902916


More information about the hotspot-dev mailing list