RFR: 8270340: Base64 decodeBlock intrinsic for Power64 needs cleanup [v2]
Kazunori Ogata
ogatak at openjdk.java.net
Sun Jul 18 20:11:54 UTC 2021
On Sat, 17 Jul 2021 02:01:23 GMT, Corey Ashford <cashford at openjdk.org> wrote:
>> This series of commits was written to accomplish several cleanup and Power10 optimization tasks for the Base64 decodeBlock intrinsic for Power64:
>> * Remove the ISA 3.1+ (Power10+) pextd instruction optimization in decodeBlock. This "optimization" turned out to actually cause a performance hit. Removing it gains back about 3% in performance.
>> * Introduce a constant block, similar to that in use by encodeBlock() to speed up constant loading.
>> * Add the ISA 3.1+ xxpermx instruction and align_prefix() method for use in a Power10 optimization for decodeBlock. Please see the commit log for my concerns about this change.
>> * Implement the xxpermx-based decodeBlock algorithm for Power10+, which gives about a 5% performance boost.
>>
>> More details can be found in the commit logs.
>>
>> I want to note here that I looked into changing the loop_unrolls constant, and found that at large buffer sizes, the values of 2 and 4 give some extra performance gain. For example, on a 20001-byte destination buffer, I see an increase from 4.7X over intrinsic disabled (loop_unrolls=1), to 5.3X over intrinsic disabled (loop_unrolls=4), but on smaller buffer sizes, up to about 512, it causes performance degradation over loop_unrolls=1, so I have decided to stick with the original value of 1, since I don't know where to focus the performance versus buffer length tradeoff.
>
> Corey Ashford has updated the pull request incrementally with one additional commit since the last revision:
>
> stubGenerator_ppc.cpp: fixes for feedback from Kazunori Ogata and Martin Doerr
>
> decodeBlock changes:
>
> * Remove unroll loop and associated comments
> * Change comments referring to "P10+" to "P10 (or later)" to remove ambiguity
> * Make clear the lack of need for "aligh_prefix()" calls when using xxpermx due to the align(32) call
>
> The following change isn't based on feedback:
>
> encodeBlock changes:
> * Fix a comment that still referred to "unrolled loop". Unrolling was removed in an earlier commit.
The changes look fine to me besides the comment about align_prefix(), though I'm not a Reviewer.
src/hotspot/cpu/ppc/stubGenerator_ppc.cpp line 3966:
> 3964: // Note that due to align(32) call above, the xxpermx instructions do
> 3965: // not require align_prefix() calls, since the final xxpermx
> 3966: // prefix+opcode is at byte 24.
The question is whether we should leave align_prefix() in macroAssembler.[ch]pp. Since it appears convenient, someone may want to use it without noticing its intention that is only allowed (or acceptable) for generating intrinsics. If we leave it, I think it is better to add assertion statement to check if it is used for intrinsics generation (though I don't know how to implement it.)
-------------
PR: https://git.openjdk.java.net/jdk/pull/4762
More information about the hotspot-runtime-dev
mailing list