RFR: 8327652: S390x: Implements SLP support [v7]

Mon Aug 26 10:36:07 UTC 2024

On Mon, 1 Apr 2024 09:01:30 GMT, Martin Doerr <mdoerr at openjdk.org> wrote:

>>> > I think we shouldn't allow `MacroAssembler::string_compress(...)` and `MacroAssembler::string_expand(...)` to use vector registers without specifying this effect. That can be solved by adding a KILL effect for all vector registers which are killed. Alternatively, we could revert to the old implementation before [d5adf1d](https://github.com/openjdk/jdk/commit/d5adf1df921e5ecb8ff4c7e4349a12660069ed28) which doesn't use vector registers. The benefit was not huge if I remember correctly.
>>> 
>>> Agreed. My proposed circumvention is a too dirty hack.
>>> 
>>> I would prefer to add KILL effects to the match rules. I believe the vector implementation had a substantial performance effect. Unfortunately, I can't find any records of performance results from back then.
>>> 
>>> Reverting the commit @TheRealMDoerr mentioned is not possible. It contains many additions that may have been used by unrelated code. The vector code is well encapsulated and could be removed by deleting the
>>> 
>>> ```
>>>   if (VM_Version::has_VectorFacility()) {
>>>   } 
>>> ```
>>> 
>>> block. I would not like that, though.
>> 
>> I didn't mean to back out the whole commit. Only the implementation of string_compress and string_expand. The benefit of the vector version certainly depends on what kind of strings are used. (Effect may also be negative in some cases.) I think that classical benchmarks didn't show a significant performance impact, but I don't remember exactly, either. I'll leave the s390 maintainers free to decide if they want to adapt the vector version or go for the short and simple implementation.
>
>> @TheRealMDoerr and @RealLucy Just for my understanding why GPR and FPR doesn't get affected in intrinsic code as they are also allocated outside of register allocator? why only vector registers usage in intrinsic code get affected or am I missing anything here?
> 
> GPRs and FPRs already have an `effect` specified in the match rules. (If a GPR or FPR is used by a `MachNode` without proper specification, it is a critical bug.) Before this PR, it is legal to use VRs without `effect` because they are not used by register allocation. This is exploited in `MacroAssembler::string_compress(...)` and `MacroAssembler::string_expand(...)`.
> 
> With your change, these 2 intrinsics may overwrite live values! Unfortunately, they use many VRs. So, specifying a `KILL` effect for all of them may cause the register allocation to insert much spill code. May impact performance and code size.

Hi @TheRealMDoerr, 

I added changes on top of SLP PR: https://github.com/openjdk/jdk/commit/3fd8074a52f3cc8b9c3a8e588dd352f4e6c25a76 , and faced build break with this error: 

* For target buildtools_create_symbols_javac__the.COMPILE_CREATE_SYMBOLS_batch:
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/home/amit/jdk/src/hotspot/share/opto/chaitin.cpp:997), pid=3968896, tid=3969099
#  assert(lrgmask.is_aligned_sets(RegMask::SlotsPerVecX)) failed: vector should be aligned
#
# JRE version: OpenJDK Runtime Environment (24.0) (fastdebug build 24-internal-adhoc.amit.jdk)
# Java VM: OpenJDK 64-Bit Server VM (fastdebug 24-internal-adhoc.amit.jdk, mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-s390x)
# Problematic frame:
# V  [libjvm.so+0x5a2d8a]  PhaseChaitin::gather_lrg_masks(bool) [clone .constprop.1]+0x175a
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" (or dumping to /home/amit/jdk/make/core.3968896)
#
# An error report file with more information is saved as:
# /home/amit/jdk/make/hs_err_pid3968896.log

BT:
Stack: [0x000003ff2f900000,0x000003ff2fd00000],  sp=0x000003ff2fcfb7d8,  free space=4077k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x5a2d8a]  PhaseChaitin::gather_lrg_masks(bool) [clone .constprop.1]+0x175a  (chaitin.cpp:997)
V  [libjvm.so+0x5aa122]  PhaseChaitin::Register_Allocate()+0x1ba  (chaitin.cpp:405)
V  [libjvm.so+0x6f181e]  Compile::Code_Gen()+0x2fe  (compile.cpp:2966)
V  [libjvm.so+0x6f4154]  Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x1794  (compile.cpp:885)
V  [libjvm.so+0x52d32a]  C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x1b2  (c2compiler.cpp:142)
V  [libjvm.so+0x7011ae]  CompileBroker::invoke_compiler_on_method(CompileTask*)+0xce6  (compileBroker.cpp:2303)
V  [libjvm.so+0x701cb2]  CompileBroker::compiler_thread_loop()+0x512  (compileBroker.cpp:1961)
V  [libjvm.so+0xb649c6]  JavaThread::thread_main_inner()+0xfe  (javaThread.cpp:758)
V  [libjvm.so+0x137066c]  Thread::call_run()+0xc4  (thread.cpp:225)
V  [libjvm.so+0x109470a]  thread_native_entry(Thread*)+0x132  (os_linux.cpp:858)

Do you think using `Op_VecX` is causing issue here and we should revert https://github.com/openjdk/jdk/pull/18162/commits/3caa470c0f89be306e5b43c5da4ca9e625abfe6b

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18162#issuecomment-2309888699