[vector] Assembly stubs and alternative approaches
Vladimir Ivanov
vladimir.x.ivanov at oracle.com
Tue Aug 21 15:30:28 UTC 2018
Hi,
Recently a number of highly optimized implementations for trigonometric
functions went into the repository:
http://hg.openjdk.java.net/panama/dev/rev/d8b30ae359ec
http://hg.openjdk.java.net/panama/dev/rev/4408db20792c
http://hg.openjdk.java.net/panama/dev/rev/3111a0877994
Short summary of how it works:
(a) the stubs are coded in assembly and bundled as .s files
(b) all stubs are built into the JVM
(c) at startup, the JVM chooses the most appropriate versions
(d) C2 links Vector API calls to the stubs
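For illustration, here's roughly what the Java side looks like (a
sketch in the current incubator API shape; the panama branch at the
time had a slightly different surface). When C2 compiles vectorSin,
the SIN operation is linked to the matching SVML stub; in the
interpreter/C1 it falls back to scalar math per lane:

  import jdk.incubator.vector.FloatVector;
  import jdk.incubator.vector.VectorOperators;
  import jdk.incubator.vector.VectorSpecies;

  public class SvmlLinkExample {
      static final VectorSpecies<Float> S = FloatVector.SPECIES_256;

      // Assumes a.length is a multiple of the species length (8 floats).
      static float[] vectorSin(float[] a) {
          float[] r = new float[a.length];
          for (int i = 0; i < a.length; i += S.length()) {
              FloatVector v = FloatVector.fromArray(S, a, i);
              // C2 links this to e.g. __svml_sinf8_ha_e9 on AVX hardware;
              // the interpreter and C1 use a scalar fallback instead.
              v.lanewise(VectorOperators.SIN).intoArray(r, i);
          }
          return r;
      }
  }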
Though the performance results on x86 are impressive, the overall
approach poses some challenges, especially from a maintenance perspective.
First of all, it is a lot of code:
$ ls -1 src/hotspot/os_cpu/linux_x86/svml* | wc -l
36
$ du -ch src/hotspot/os_cpu/linux_x86/svml*
6.8M total
It's hard to see any way to change/evolve them other than simply
regenerating them from the original SVML intrinsics [1].
Moreover, due to differences in assembly syntax between tool chains,
there's no way to share them even between x86 platforms:
$ ls -1 src/hotspot/os_cpu/windows_x86/svml* | wc -l
36
$ du -ch src/hotspot/os_cpu/windows_x86/svml*
4.1M total
Another aspect which can be problematic in practice is the difference in
results between the interpreter/C1 and C2. Right now, SVML variants are
used only in C2, and the JVM falls back to scalar implementations in
other modes.
SVML has the following statement [1]:
"Using SVML intrinsics is faster than repeatedly calling the scalar
math functions. However, the intrinsics differ from the scalar functions
in accuracy."
It means users may observe different results within the same run, which
is usually perceived as a bug (based on previous experience with
optimizing math functions).
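A sketch of how this can surface (illustrative, not measured), reusing
the hypothetical vectorSin from above:

  import java.util.Arrays;

  public class ConsistencyCheck {
      public static void main(String[] args) {
          float[] input = new float[8];
          Arrays.fill(input, 0.5f);
          // Cold call: interpreted, scalar fallback per lane.
          float first = SvmlLinkExample.vectorSin(input)[0];
          for (int i = 0; i < 100_000; i++) {
              SvmlLinkExample.vectorSin(input);  // warm up until C2 compiles
          }
          // Hot call: C2-compiled, linked to the SVML stub.
          float later = SvmlLinkExample.vectorSin(input)[0];
          // May print false once the SVML variant differs in the last bit:
          System.out.println(Float.floatToIntBits(first) ==
                             Float.floatToIntBits(later));
      }
  }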
So, if possible, it would be good to use exactly the same implementation
across all execution modes, but that requires much more intrusive changes
across the JVM to bind the stubs to all the relevant locations.
Quick summary of the problems:
- hard to maintain
* lots of new code in JVM
* severe code duplication between platforms
- difference in behavior between C2 and interpreter/C1
I see 3 alternative approaches:
(1) Port to macro-assembler
(2) Rewrite in Java using Vector API itself
(3) Put SVML stubs into a native library
More details follow to support my opinion, but I'm in favor of moving to
#3, followed by #2 at some point (in the longer term).
#3 requires additional JVM support to make native functions competitive
with JVM stubs on vector calls.
#2 should put much less pressure on non-x86 platforms (once they follow)
to provide optimized math functions from the very beginning: used as
default implementations, they'd give good enough performance without
requiring platform-specific, low-level coding (in assembler or
otherwise).
======
Comparing those 3 alternatives, #1 (port to macro-assembler) looks like a
waste of resources. It gives some short-term benefits, but doesn't
address the maintenance problems (and even worsens them if repeated
updates are considered).
Pros:
+ single source on x86
Cons:
- Huge amount of manual work
* ... which has to be performed on every update;
* or semi-automated, if a converter tool is created;
* stubs have ABI-specific parts which need to be unified
- Extensive additions into MacroAssembler on x86 are needed
* MacroAssembler instruction coverage on x86 is far from complete
* all instructions occurring in the stubs would need to be supported
- Stubs are still part of the JVM
* the stubs still need to be wired into the interpreter/C1/C2
======
In a longer term, having vectorized math functions written in Vector API
itself (#2) looks very attractive.
Building on top of simpler vector primitives provided by the API should
make the implementation much smaller, simpler, and more manageable.
It would be a very good real-world test for the API & implementation
themselves, and a feedback loop based on comparisons with SVML should
give lots of data to improve on.
The downsides are (1) it's a separate and significant engineering effort
on its own; and (2) it'll be very hard (if at all possible) to compete
with highly optimized low-level alternatives (like SVML).
Pros:
+ Smaller, simpler, and much more manageable
+ Leverage the investments into the implementation
+ Dogfooding
* very attractive, considering there's a performant alternative to
compare against
+ Cross-platform (potentially)
+ Consistent results across different execution modes
* ... if the vector primitives the implementation relies on
behave consistently
Cons:
- Have to be rewritten from scratch
* Hard to reuse anything from SVML, except the algorithms
- More performant than the scalar versions, but it's unlikely they'll
be able to match the highly optimized SVML stubs
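As a rough sketch of what building on simpler vector primitives could
look like (a plain Taylor polynomial in Horner form, valid only for
small |x|; a real replacement would need range reduction and minimax
coefficients to approach SVML's accuracy):

  import jdk.incubator.vector.FloatVector;
  import jdk.incubator.vector.VectorSpecies;

  public class VectorSinPoly {
      static final VectorSpecies<Float> S = FloatVector.SPECIES_PREFERRED;

      // sin(x) ~= x - x^3/6 + x^5/120 - x^7/5040, evaluated in Horner form.
      static FloatVector sinSmall(FloatVector x) {
          FloatVector x2 = x.mul(x);
          return x2.mul(-1f / 5040f).add(1f / 120f)
                   .mul(x2).add(-1f / 6f)
                   .mul(x2).add(1f)
                   .mul(x);
      }
  }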
I briefly looked into some SVML stubs to compare the SSE, AVX/AVX2, and
AVX512 variants. For example, in the case of acos [2] for FloatVector,
there are 6 implementations provided:
(1) Float64Vector/Float128Vector::acos
(a) __svml_acosf4_ha_ex (SSE)
(b) __svml_acosf4_ha_e9 (AVX)
(c) __svml_acosf4_ha_l9 (AVX2/AVX512)
(2) Float256Vector::acos
(a) __svml_acosf8_ha_e9 (AVX)
(b) __svml_acosf8_ha_l9 (AVX2/AVX512)
(3) Float512Vector::acos
(a) __svml_acosf16_ha_z0 (AVX512)
Some observations:
* (1a) and (1b) are identical;
* (1c/2b) differ from (1ab/2a) only in the usage of FMA instructions
* (3a) uses a different algorithm:
while all the other variants (1abc/2ab) fall back to a scalar
implementation (__svml_sacos_ha_cout_rare_internal) to cover corner
cases, (3a) doesn't need that
Unifying (1abc) and (2ab) into a single implementation looks doable, but
for Float512Vector a dedicated Java implementation would be required to
match SVML.
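The FMA split, in particular, disappears in Java: a single source lets
the JIT pick the instruction. A minimal sketch:

  import jdk.incubator.vector.FloatVector;

  public class FmaExample {
      // C2 emits vfmadd* here where FMA is supported (matching the l9
      // variants). Math.fma semantics guarantee a single rounding either
      // way, though the software fallback on non-FMA hardware is slow.
      static FloatVector mulAdd(FloatVector x, FloatVector y, FloatVector z) {
          return x.fma(y, z);  // x*y + z with a single rounding
      }
  }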
======
Reusing the SVML stubs in some way is very attractive from both
performance and maintenance perspectives.
From the maintenance perspective, having them bundled into a separate
native library (#3) and called from the Vector API implementation looks
much more attractive than keeping them part of the JVM. Also, sharing the
same implementation between C2 and the interpreter/C1 comes for free.
To minimize JNI invocation overhead, some assistance from the JVM is
needed. There's already the concept of Critical JNI, which allows
defining customized entry points with lower invocation overhead. It can
be extended to vector functions in a similar way.
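For reference, today's Critical JNI shape (the mechanism that would be
extended; the class/method names below are made up):

  public class CriticalSketch {
      // Eligible for Critical JNI: static, with no object arguments
      // beyond primitive arrays. Besides the regular JNI entry
      //   Java_CriticalSketch_sum(JNIEnv*, jclass, jfloatArray)
      // the native library may export a lower-overhead one
      //   JavaCritical_CriticalSketch_sum(jint len, jfloat* body)
      // which receives the array length/body directly, with no JNIEnv.
      static native float sum(float[] values);
  }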
The Vector API implementation can interact with the library through a
clean & stable interface:
* 2 entries per function: boxed & scalarized
* the scalarized entry is the vectorized SVML stub itself
* the boxed variant delegates to the scalarized one, unboxing the
arguments before the call and boxing the result after it
* C2 bypasses unboxing and calls directly into the vector stubs through
the scalarized entry points;
* C1/interpreter go through the boxed entry points
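In Java terms, the binding could look something like this (names are
illustrative, not an actual interface):

  public class VectorMathLibrary {
      static {
          System.loadLibrary("svml");  // hypothetical bundled library name
      }

      // Boxed entry, used by the interpreter and C1: unboxes the
      // argument, calls the vectorized stub, boxes the result back.
      static native float[] acosF8(float[] lanes);

      // The scalarized entry is the SVML stub itself (e.g.
      // __svml_acosf8_ha_e9); it passes lanes in vector registers, so it
      // has no Java-expressible signature -- C2 calls it directly,
      // bypassing this class.
  }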
Thread state transition (Java -> native -> Java) will still have to
happen during invocation, but:
(1) vector box elimination shouldn't be affected;
(2) IMO, considering the size of the SVML stubs, invocation cost
shouldn't be a dominating factor;
(3) the more performant FFI alternatives being explored in Panama can be
utilized once they become available (see the sketch below):
* first, trusted calls which avoid state transitions altogether
(NativeMethodHandles and MH::linkToNative(...) from the vectorSnippets
branch) can be used;
* once the JIT is able to optimize away excessive Java<->native state
transitions (akin to lock coarsening, but applied to native calls),
there should be much less pressure to sacrifice safety for performance
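For a flavor of what binding a boxed entry without JNI glue looks like,
here's a sketch using the foreign-function API this line of work
eventually produced (java.lang.foreign; the library and symbol names
are hypothetical):

  import java.lang.foreign.*;
  import java.lang.invoke.MethodHandle;

  public class DowncallSketch {
      public static void main(String[] args) throws Throwable {
          Linker linker = Linker.nativeLinker();
          SymbolLookup lib = SymbolLookup.libraryLookup("svml", Arena.global());
          MethodHandle acos = linker.downcallHandle(
              lib.find("vacosf8_boxed").orElseThrow(),  // hypothetical symbol
              FunctionDescriptor.ofVoid(ValueLayout.ADDRESS, ValueLayout.ADDRESS));

          try (Arena arena = Arena.ofConfined()) {
              MemorySegment in = arena.allocateFrom(ValueLayout.JAVA_FLOAT,
                      0.1f, 0.2f, 0.3f, 0.4f, 0.5f, 0.6f, 0.7f, 0.8f);
              MemorySegment out = arena.allocate(ValueLayout.JAVA_FLOAT, 8);
              acos.invokeExact(in, out);  // direct downcall, no JNI glue
          }
      }
  }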
Pros:
+ Much smaller changes in JVM
+ Keeps SVML performance
+ The library can evolve at its own pace and is easily
replaceable/upgradeable
Cons:
- Additional JVM support needed for vector calls
Best regards,
Vladimir Ivanov
[1] https://software.intel.com/en-us/node/524289
[2] http://hg.openjdk.java.net/panama/dev/file/4aa617dd5a8b/src/hotspot/os_cpu/linux_x86/svml_s_acos_linux_x86.s