RFR: 8256431: [PPC64] Implement Base64 encodeBlock() for Power64-LE [v5]

Mon Dec 14 21:43:11 UTC 2020

> Add a vector-based implementation of the Base64 encodeBlock intrinsic for Power9 and Power10, little-endian Linux only.
> 
> This implementation is based upon a paper (linked in comments) describing an Intel SSE vector-based implementation of Base64 encoding.  Although the Intel SSE instruction set and the Power VMX/VSX instruction sets are different, the method used in the paper is adaptable to Power.  In addition, there are a few places in the algorithm where it's possible to gain some performance by using instruction sequences better suited to VMX/VSX, and some additional benefit is gained from the ISA 3.1 additions available in Power10.
> 
> There is one controversial method I used in this implementation: I defined a macro to emit the instruction sequence for encoding 12 bytes in a vector to 16 bytes, because this sequence is needed in three places.  Turning it into a function would have been possible, but I would have needed to pass quite a few register numbers into the function.  I would have liked to have used a nested function, to give the function visibility to the register numbers declared in the outer scope, but alas nested functions are not possible in C++.
> 
> The overall performance advantage on Power9 is about 4.0X, based on the main/java/org/openjdk/micro/bench/java/util/Base64VarLenEncode.java benchmark.  This benchmark covers random buffer lengths from 8 to 20007 bytes.  Short buffers won't perform as well, approaching the performance of the pure Java code (or slightly worse for very short buffers).  Buffers that are consistently long will perform a little better than 4.0X.
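
For readers unfamiliar with the 12-byte to 16-byte step mentioned in the description, here is a minimal scalar sketch of what that step computes. It is not the VMX/VSX instruction sequence from the patch; the names encode12, encode12_to_string and kBase64Alphabet are hypothetical and exist only for illustration. It assumes the standard (non-URL) Base64 alphabet.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Standard Base64 alphabet (RFC 4648). The URL-safe variant differs only in
// the last two characters.
static const char kBase64Alphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Hypothetical scalar reference: encode exactly 12 input bytes into 16
// Base64 characters.  Each group of 3 bytes (24 bits) becomes 4 characters
// (4 x 6 bits), so no padding is ever needed for a 12-byte block.
std::array<char, 16> encode12(const std::uint8_t* in) {
  std::array<char, 16> out{};
  for (std::size_t group = 0; group < 4; ++group) {
    const std::uint8_t* p = in + 3 * group;
    // Pack the three bytes into one 24-bit value, first byte most significant.
    std::uint32_t bits = (std::uint32_t(p[0]) << 16) |
                         (std::uint32_t(p[1]) << 8) |
                         std::uint32_t(p[2]);
    for (std::size_t i = 0; i < 4; ++i) {
      // Extract the i-th 6-bit field, starting from the top of the 24 bits.
      std::uint32_t index = (bits >> (18 - 6 * i)) & 0x3F;
      assert(index < 64);
      out[4 * group + i] = kBase64Alphabet[index];
    }
  }
  return out;
}

// Convenience wrapper for experimenting with string input.
std::string encode12_to_string(const std::string& s) {
  assert(s.size() == 12);
  std::array<char, 16> out =
      encode12(reinterpret_cast<const std::uint8_t*>(s.data()));
  return std::string(out.data(), out.size());
}
```

A vector implementation performs the same bit-field extraction and table lookup on all 12 bytes at once; the scalar loop above is only useful as a reference to check results against.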

Corey Ashford has updated the pull request incrementally with one additional commit since the last revision:

  stubGenerator_ppc.cpp: improve code that loads vector constants

    * Remove extra load of base64_48_63
    * Instead of loading each vector constant via an initialized pointer to
      its constant data, place all vector data in a single block constant and
      use offsets into the block to choose which constant to load.
      Unfortunately this method requires using fixed offsets in the table
      which are not easy to name.  Instead I just documented the offsets in
      the constant block, and used them in the code.
    * I looked at loading pairs of vectors at a time using lxvp, but due to
      using little endian mode, the vectors get loaded in the order that is
      opposite to the order listed in the constant block.  This makes it
      confusing to read and maintain the code, so I decided against that
      approach.
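
The single-block approach described in the commit message can be sketched in plain C++ as follows. This is only an illustration of the idea, not code from stubGenerator_ppc.cpp: the table contents are placeholder bytes rather than the real Base64 constants, and Vec128, kConstBlock, load_const and the kOffset* names are all hypothetical. In the stub generator the offsets are fixed numbers documented next to the data; the named constants below are a convenience that ordinary C++ allows.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kVecBytes = 16;  // one 128-bit vector

// Hypothetical byte offsets of each constant within the block.
constexpr std::size_t kOffsetFirst  = 0 * kVecBytes;
constexpr std::size_t kOffsetSecond = 1 * kVecBytes;
constexpr std::size_t kOffsetThird  = 2 * kVecBytes;
constexpr std::size_t kBlockSize    = 3 * kVecBytes;

// All vector constants live in one aligned block.  The values are
// placeholders chosen so each vector is easy to recognize.
alignas(16) static const std::uint8_t kConstBlock[kBlockSize] = {
    // offset 0: first constant
    0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
    0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F,
    // offset 16: second constant
    0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17,
    0x18, 0x19, 0x1A, 0x1B, 0x1C, 0x1D, 0x1E, 0x1F,
    // offset 32: third constant
    0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27,
    0x28, 0x29, 0x2A, 0x2B, 0x2C, 0x2D, 0x2E, 0x2F,
};

// Stand-in for a vector register.
struct Vec128 {
  std::uint8_t bytes[kVecBytes];
};

// Load one constant using a single base address plus an offset, instead of
// keeping a separate pointer per constant.
Vec128 load_const(std::size_t offset) {
  assert(offset % kVecBytes == 0 && offset + kVecBytes <= kBlockSize);
  Vec128 v;
  std::memcpy(v.bytes, kConstBlock + offset, kVecBytes);
  return v;
}
```

With this layout only one base address has to be materialized; each constant is then reached by adding its offset, for example load_const(kOffsetSecond).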

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/1245/files
  - new: https://git.openjdk.java.net/jdk/pull/1245/files/99417c0e..5badbf6a

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=1245&range=04
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=1245&range=03-04

  Stats: 45 lines in 1 file changed: 5 ins; 11 del; 29 mod
  Patch: https://git.openjdk.java.net/jdk/pull/1245.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/1245/head:pull/1245

PR: https://git.openjdk.java.net/jdk/pull/1245