[vector] Assembly stubs and alternative approaches
Vladimir Ivanov
vladimir.x.ivanov at oracle.com
Tue Aug 21 15:30:28 UTC 2018
Hi,
Recently a number of highly optimized implementations for trigonometric
functions went into the repository:
http://hg.openjdk.java.net/panama/dev/rev/d8b30ae359ec
http://hg.openjdk.java.net/panama/dev/rev/4408db20792c
http://hg.openjdk.java.net/panama/dev/rev/3111a0877994
Short summary of how it works:
(a) the stubs are coded in assembly and bundled as .s files
(b) all stubs are built into the JVM
(c) at startup, the JVM chooses the most appropriate versions
(d) C2 links Vector API calls to the stubs
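For illustration, here's roughly what the Java side looks like (a
sketch in the current incubator API shape; the panama branch at the
time had a slightly different surface). When C2 compiles vectorSin,
the SIN operation is linked to the matching SVML stub; in the
interpreter/C1 it falls back to scalar math per lane:

  import jdk.incubator.vector.FloatVector;
  import jdk.incubator.vector.VectorOperators;
  import jdk.incubator.vector.VectorSpecies;

  public class SvmlLinkExample {
      static final VectorSpecies<Float> S = FloatVector.SPECIES_256;

      // Assumes a.length is a multiple of the species length (8 floats).
      static float[] vectorSin(float[] a) {
          float[] r = new float[a.length];
          for (int i = 0; i < a.length; i += S.length()) {
              FloatVector v = FloatVector.fromArray(S, a, i);
              // C2 links this to e.g. __svml_sinf8_ha_e9 on AVX hardware;
              // the interpreter and C1 use a scalar fallback instead.
              v.lanewise(VectorOperators.SIN).intoArray(r, i);
          }
          return r;
      }
  }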
Though the performance results on x86 are impressive, the overall
approach poses some challenges, especially from a maintenance perspective.
First of all, it is a lot of code:
$ ls -1 src/hotspot/os_cpu/linux_x86/svml* | wc -l
36
$ du -ch src/hotspot/os_cpu/linux_x86/svml*
6.8M total
It's hard to see any way to change/evolve them other than simply
regenerating them from the original SVML intrinsics [1].
Moreover, due to differences in assembly syntax between tool chains,
there's no way to share them even between x86 platforms:
$ ls -1 src/hotspot/os_cpu/windows_x86/svml* | wc -l
36
$ du -ch src/hotspot/os_cpu/windows_x86/svml*
4.1M total
Another aspect which can be problematic in practice is the difference in
results between the interpreter/C1 and C2. Right now, SVML variants are
used only in C2, and the JVM falls back to scalar implementations in
other modes.
SVML has the following statement [1]:
"Using SVML intrinsics is faster than repeatedly calling the scalar
math functions. However, the intrinsics differ from the scalar functions
in accuracy."
It means users may observe different results within the same run, which
is usually perceived as a bug (based on previous experience with
optimizing math functions).
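A sketch of how this can surface (illustrative, not measured), reusing
the hypothetical vectorSin from above:

  import java.util.Arrays;

  public class ConsistencyCheck {
      public static void main(String[] args) {
          float[] input = new float[8];
          Arrays.fill(input, 0.5f);
          // Cold call: interpreted, scalar fallback per lane.
          float first = SvmlLinkExample.vectorSin(input)[0];
          for (int i = 0; i < 100_000; i++) {
              SvmlLinkExample.vectorSin(input);  // warm up until C2 compiles
          }
          // Hot call: C2-compiled, linked to the SVML stub.
          float later = SvmlLinkExample.vectorSin(input)[0];
          // May print false once the SVML variant differs in the last bit:
          System.out.println(Float.floatToIntBits(first) ==
                             Float.floatToIntBits(later));
      }
  }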
So, if possible, it would be good to use exactly the same implementation
across all execution modes, but that requires much more intrusive changes
across the JVM to bind the stubs to all the relevant locations.
Quick summary of the problems:
- hard to maintain
* lots of new code in JVM
* severe code duplication between platforms
- difference in behavior between C2 and interpreter/C1
I see 3 alternative approaches:
(1) Port to macro-assembler
(2) Rewrite in Java using Vector API itself
(3) Put SVML stubs into a native library
More details follow to support my opinion, but I'm in favor of moving to
#3, followed by #2 at some point (in the longer term).
#3 requires additional JVM support to make native functions competitive
with JVM stubs on vector calls.
#2 should put much less pressure on non-x86 platforms (once they follow)
to provide optimized math functions from the very beginning: used as
default implementations, they'd give good enough performance without
requiring platform-specific, low-level coding (in assembler or
otherwise).
======
Comparing those 3 alternatives, #1 (port to macro-assembler) looks like a
waste of resources. It gives some short-term benefits, but doesn't
address the maintenance problems (and even worsens them if repeated
updates are considered).
Pros:
+ single source on x86
Cons:
- Huge amount of manual work
* ... which has to be performed on every update;
* or semi-automated, if a converter tool is created;
* stubs have ABI-specific parts which need to be unified
- Extensive additions into MacroAssembler on x86 are needed
* MacroAssembler instruction coverage on x86 is far from complete
* all instructions occurring in the stubs would need to be supported
- Stubs are still part of the JVM
* the stubs still need to be wired into the interpreter/C1/C2
======
In a longer term, having vectorized math functions written in Vector API
itself (#2) looks very attractive.
Building on top of simpler vector primitives provided by the API should
make the implementation much smaller, simpler, and more manageable.
It would be a very good real-world test for the API & implementation
themselves, and a feedback loop based on comparisons with SVML should
give lots of data to improve on.
The downsides are (1) it's a separate and significant engineering effort
on its own; and (2) it'll be very hard (if at all possible) to compete
with highly optimized low-level alternatives (like SVML).
Pros:
+ Smaller, simpler, and much more manageable
+ Leverage the investments into the implementation
+ Dogfooding
* very attractive, considering there's a performant alternative to
compare against
+ Cross-platform (potentially)
+ Consistent results across different execution modes
* ... if the vector primitives the implementation relies on
behave consistently
Cons:
- Have to be rewritten from scratch
* Hard to reuse anything from SVML, except the algorithms
- More performant than the scalar versions, but it's unlikely they'll
be able to match the highly optimized SVML stubs
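As a rough sketch of what building on simpler vector primitives could
look like (a plain Taylor polynomial in Horner form, valid only for
small |x|; a real replacement would need range reduction and minimax
coefficients to approach SVML's accuracy):

  import jdk.incubator.vector.FloatVector;
  import jdk.incubator.vector.VectorSpecies;

  public class VectorSinPoly {
      static final VectorSpecies<Float> S = FloatVector.SPECIES_PREFERRED;

      // sin(x) ~= x - x^3/6 + x^5/120 - x^7/5040, evaluated in Horner form.
      static FloatVector sinSmall(FloatVector x) {
          FloatVector x2 = x.mul(x);
          return x2.mul(-1f / 5040f).add(1f / 120f)
                   .mul(x2).add(-1f / 6f)
                   .mul(x2).add(1f)
                   .mul(x);
      }
  }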
I briefly looked into some SVML stubs to compare the SSE, AVX/AVX2, and
AVX512 variants. For example, in the case of acos [2] for FloatVector,
there are 6 implementations provided:
(1) Float64Vector/Float128Vector::acos
(a) __svml_acosf4_ha_ex (SSE)
(b) __svml_acosf4_ha_e9 (AVX)
(c) __svml_acosf4_ha_l9 (AVX2/AVX512)
(2) Float256Vector::acos
(a) __svml_acosf8_ha_e9 (AVX)
(b) __svml_acosf8_ha_l9 (AVX2/AVX512)
(3) Float512Vector::acos
(a) __svml_acosf16_ha_z0 (AVX512)
Some observations:
* (1a) and (1b) are identical;
* (1c/2b) differ from (1ab/2a) only in the usage of FMA instructions
* (3a) uses a different algorithm:
while all the other variants (1abc/2ab) fall back to a scalar
implementation (__svml_sacos_ha_cout_rare_internal) to cover corner
cases, (3a) doesn't need that
Unifying (1abc) and (2ab) into a single implementation looks doable, but
for Float512Vector a dedicated Java implementation would be required to
match SVML.
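The FMA split, in particular, disappears in Java: a single source lets
the JIT pick the instruction. A minimal sketch:

  import jdk.incubator.vector.FloatVector;

  public class FmaExample {
      // C2 emits vfmadd* here where FMA is supported (matching the l9
      // variants). Math.fma semantics guarantee a single rounding either
      // way, though the software fallback on non-FMA hardware is slow.
      static FloatVector mulAdd(FloatVector x, FloatVector y, FloatVector z) {
          return x.fma(y, z);  // x*y + z with a single rounding
      }
  }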
======
Reusing the SVML stubs in some way is very attractive from both
performance and maintenance perspectives.
From the maintenance perspective, having them bundled into a separate
native library (#3) and called from the Vector API implementation looks
much more attractive than keeping them part of the JVM. Also, sharing the
same implementation between C2 and the interpreter/C1 comes for free.
To minimize JNI invocation overhead, some assistance from the JVM is
needed. There's already the concept of Critical JNI, which allows
defining customized entry points with lower invocation overhead. It can
be extended to vector functions in a similar way.
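For reference, today's Critical JNI shape (the mechanism that would be
extended; the class/method names below are made up):

  public class CriticalSketch {
      // Eligible for Critical JNI: static, with no object arguments
      // beyond primitive arrays. Besides the regular JNI entry
      //   Java_CriticalSketch_sum(JNIEnv*, jclass, jfloatArray)
      // the native library may export a lower-overhead one
      //   JavaCritical_CriticalSketch_sum(jint len, jfloat* body)
      // which receives the array length/body directly, with no JNIEnv.
      static native float sum(float[] values);
  }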
The Vector API implementation can interact with the library through a
clean & stable interface:
* 2 entries per function: boxed & scalarized
* the scalarized entry is the vectorized SVML stub itself
* the boxed variant delegates to the scalarized one, unboxing the
arguments before the call and boxing the result after it
* C2 bypasses unboxing and calls directly into the vector stubs through
the scalarized entry points;
* C1/interpreter go through the boxed entry points
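In Java terms, the binding could look something like this (names are
illustrative, not an actual interface):

  public class VectorMathLibrary {
      static {
          System.loadLibrary("svml");  // hypothetical bundled library name
      }

      // Boxed entry, used by the interpreter and C1: unboxes the
      // argument, calls the vectorized stub, boxes the result back.
      static native float[] acosF8(float[] lanes);

      // The scalarized entry is the SVML stub itself (e.g.
      // __svml_acosf8_ha_e9); it passes lanes in vector registers, so it
      // has no Java-expressible signature -- C2 calls it directly,
      // bypassing this class.
  }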
Thread state transition (Java -> native -> Java) will still have to
happen during invocation, but:
(1) vector box elimination shouldn't be affected;
(2) IMO, considering the size of the SVML stubs, invocation cost
shouldn't be a dominating factor;
(3) the more performant FFI alternatives being explored in Panama can be
utilized once they become available (see the sketch below):
* first, trusted calls which avoid state transitions altogether
(NativeMethodHandles and MH::linkToNative(...) from the vectorSnippets
branch) can be used;
* once the JIT is able to optimize away excessive Java<->native state
transitions (akin to lock coarsening, but applied to native calls),
there should be much less pressure to sacrifice safety for performance
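For a flavor of what binding a boxed entry without JNI glue looks like,
here's a sketch using the foreign-function API this line of work
eventually produced (java.lang.foreign; the library and symbol names
are hypothetical):

  import java.lang.foreign.*;
  import java.lang.invoke.MethodHandle;

  public class DowncallSketch {
      public static void main(String[] args) throws Throwable {
          Linker linker = Linker.nativeLinker();
          SymbolLookup lib = SymbolLookup.libraryLookup("svml", Arena.global());
          MethodHandle acos = linker.downcallHandle(
              lib.find("vacosf8_boxed").orElseThrow(),  // hypothetical symbol
              FunctionDescriptor.ofVoid(ValueLayout.ADDRESS, ValueLayout.ADDRESS));

          try (Arena arena = Arena.ofConfined()) {
              MemorySegment in = arena.allocateFrom(ValueLayout.JAVA_FLOAT,
                      0.1f, 0.2f, 0.3f, 0.4f, 0.5f, 0.6f, 0.7f, 0.8f);
              MemorySegment out = arena.allocate(ValueLayout.JAVA_FLOAT, 8);
              acos.invokeExact(in, out);  // direct downcall, no JNI glue
          }
      }
  }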
Pros:
+ Much smaller changes in JVM
+ Keeps SVML performance
+ The library can evolve at its own pace and is easily
replaceable/upgradeable
Cons:
- Additional JVM support needed for vector calls
Best regards,
Vladimir Ivanov
[1] https://software.intel.com/en-us/node/524289
[2] http://hg.openjdk.java.net/panama/dev/file/4aa617dd5a8b/src/hotspot/os_cpu/linux_x86/svml_s_acos_linux_x86.s