RFR: JDK-8270340: Base64 decodeBlock intrinsic for Power64 needs cleanup

Kazunori Ogata ogatak at openjdk.java.net
Thu Jul 15 17:22:15 UTC 2021


On Tue, 13 Jul 2021 04:28:47 GMT, Corey Ashford <cashford at openjdk.org> wrote:

> This series of commits was written to accomplish several cleanup and Power10 optimization tasks for the Base64 decodeBlock intrinsic for Power64:
> * Remove the ISA 3.1+ (Power10+) pextd instruction optimization in decodeBlock.  This "optimization" turned out to actually cause a performance hit.  Removing it gains back about 3% in performance.
> * Introduce a constant block, similar to that in use by encodeBlock() to speed up constant loading.
> * Add the ISA 3.1+  xxpermx instruction and align_prefix() method for use in a Power10 optimization for decodeBlock.  Please see the commit log for my concerns about this change.
> * Implement the xxpermx-based decodeBlock algorithm for Power10+, which gives about a 5% performance boost.
> 
> More details can be found in the commit logs.
> 
> I want to note here that I looked into changing the loop_unrolls constant, and found that at large buffer sizes, the values of 2 and 4 give some extra performance gain. For example, on a 20001-byte destination buffer, I see an increase from 4.7X over intrinsic disabled (loop_unrolls=1), to 5.3X over intrinsic disabled (loop_unrolls=4), but on smaller buffer sizes, up to about 512, it causes performance degradation over loop_unrolls=1, so I have decided to stick with the original value of 1, since I don't know where to focus the performance versus buffer length tradeoff.

src/hotspot/cpu/ppc/stubGenerator_ppc.cpp line 3869:

> 3867:     VectorSRegister M                       = VSR8;
> 3868: 
> 3869:     // P10+ VSR lookup variables

"P10+" should better be "P10 or above" (or something grammatically correct) because it appears like an enhanced version of P10, like P7+.  Although IBM didn't released "+" models after P8, they have manufactured with reduced semiconductor process and apply enhancements, including larger caches, higher clock frequency, etc.

src/hotspot/cpu/ppc/stubGenerator_ppc.cpp line 3995:

> 3993:         __ align_prefix(); __ xxpermx(xlate_b, table_64_79, table_80_95, input->to_vsr(), 2);
> 3994:         __ xxlor(xlate_b, xlate_a, xlate_b);
> 3995:         __ align_prefix(); __ xxpermx(xlate_a, table_96_111, table_112_127, input->to_vsr(), 3);

This is just a reminder comment.  align_prefix() does nothing when loop_unrolls is 1 because the label unrolled_loop_start is aligned to a 32-byte boundary and these three prefixed instructions are within 32bytes after the label, and thus, they never cross a 64-byte boundary.  Since the comment above mentions that the best unrolling factor here is 1, align_prefix() are just safe guards for the case someone try to unroll this loop.

src/hotspot/cpu/ppc/stubGenerator_ppc.cpp line 4038:

> 4036:       __ bne_predict_not_taken(CCR6, unrolled_loop_exit);
> 4037: 
> 4038:       // The Base64 characters had no errors, so add the offsets

It may be helpful to add comments for P10 case, where each byte of input contains (decoded 6-bit binary | 0x80) and offsets contains 0x80, and vaddubm  is used to clear the MSB.

-------------

PR: https://git.openjdk.java.net/jdk/pull/4762


More information about the hotspot-runtime-dev mailing list