RFR: 8270340: Base64 decodeBlock intrinsic for Power64 needs cleanup [v2]

Sun Jul 18 20:11:54 UTC 2021

On Sat, 17 Jul 2021 02:01:23 GMT, Corey Ashford <cashford at openjdk.org> wrote:

>> This series of commits was written to accomplish several cleanup and Power10 optimization tasks for the Base64 decodeBlock intrinsic for Power64:
>> * Remove the ISA 3.1+ (Power10+) pextd instruction optimization in decodeBlock.  This "optimization" turned out to actually cause a performance hit.  Removing it gains back about 3% in performance.
>> * Introduce a constant block, similar to that in use by encodeBlock() to speed up constant loading.
>> * Add the ISA 3.1+  xxpermx instruction and align_prefix() method for use in a Power10 optimization for decodeBlock.  Please see the commit log for my concerns about this change.
>> * Implement the xxpermx-based decodeBlock algorithm for Power10+, which gives about a 5% performance boost.
>> 
>> More details can be found in the commit logs.
>> 
>> I want to note here that I looked into changing the loop_unrolls constant, and found that at large buffer sizes, the values of 2 and 4 give some extra performance gain. For example, on a 20001-byte destination buffer, I see an increase from 4.7X over intrinsic disabled (loop_unrolls=1), to 5.3X over intrinsic disabled (loop_unrolls=4), but on smaller buffer sizes, up to about 512, it causes performance degradation over loop_unrolls=1, so I have decided to stick with the original value of 1, since I don't know where to focus the performance versus buffer length tradeoff.
>
> Corey Ashford has updated the pull request incrementally with one additional commit since the last revision:
> 
>   stubGenerator_ppc.cpp: fixes for feedback from Kazunori Ogata and Martin Doerr
>   
>   decodeBlock changes:
>   
>   * Remove unroll loop and associated comments
>   * Change comments referring to "P10+" to "P10 (or later)" to remove ambiguity
>   * Make clear the lack of need for "aligh_prefix()" calls when using xxpermx due to the align(32) call
>   
>   The following change isn't based on feedback:
>   
>   encodeBlock changes:
>   * Fix a comment that still referred to "unrolled loop".  Unrolling was removed in an earlier commit.

The changes look fine to me besides the comment about align_prefix(), though I'm not a Reviewer.

src/hotspot/cpu/ppc/stubGenerator_ppc.cpp line 3966:

> 3964:       // Note that due to align(32) call above, the xxpermx instructions do
> 3965:       // not require align_prefix() calls, since the final xxpermx
> 3966:       // prefix+opcode is at byte 24.

The question is whether we should leave align_prefix() in macroAssembler.[ch]pp.  Since it appears convenient, someone may want to use it without noticing its intention that is only allowed (or acceptable) for generating intrinsics.  If we leave it, I think it is better to add assertion statement to check if it is used for intrinsics generation (though I don't know how to implement it.)

-------------

PR: https://git.openjdk.java.net/jdk/pull/4762