RFR: 8342662: C2: Add new phase for backend-specific lowering [v3]
Jasmine Karthikeyan
jkarthikeyan at openjdk.org
Wed Dec 11 04:54:39 UTC 2024
On Mon, 11 Nov 2024 08:28:42 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
>> Thanks everyone for the discussion. I've pushed a commit that restructures the pass, removing the backend-specific node definition and making the pass extend `PhaseIterGVN` so that nodes can do further idealizations during lowering without complicating the main lowering switch. I also added a shared component to lowering, to facilitate moving transforms that impact multiple backends, like `DivMod`, to it. Lowering is now also the final phase before final graph reshaping, since late inlines could also use IGVN. Some more comments:
>>
>>> It looks attractive at first, but the downside is that subsequent passes may start to require platform-specific code as well (e.g., think of final graph reshaping, which operates on Node opcodes).
>>
>> This makes sense to me. I agree that the extra complexity required to deal with this change in other parts of the code isn't worth it. The new commit removes this part of the changeset.
>>
>>> BTW it's not clear to me now what particular benefits IGVN brings. `DivMod` transformation doesn't use IGVN and after examining `MacroLogicV` code it can be rewritten to avoid IGVN as well.
>>
>> The main benefits are being able to reuse node hashing to de-duplicate redundant nodes and being able to use the existing IGVN types that were calculated (which #21244 uses). Some examples where GVN could be useful in final graph reshaping are the reshaping of shift nodes and `Op_CmpUL`, where new nodes are created to approximate existing nodes on platforms without support. While I think it is unlikely that any of the created nodes would common with existing nodes except the `ConNode`s, I think it would be nice to reduce the possibility of redundant nodes in the graph before matching. This would include `DivMod` in the cases where the backend doesn't support the `DivMod` node, as multiplication and subtraction are emitted instead. I'm working on refactoring these cases in my example patch. I think it would be nice to make lowering the place where these platform-specific optimizations occur, while final graph reshaping focuses on preparing the graph for matching.
>>
>>> I'd say that if we want the lowering pass being discussed to be truly scalable, it's better to follow the same pattern. I have some doubts that platform-specific ad-hoc IR tweaks will scale well.
>>
>> My main concern with the macro-expansion style is that, with the proposed transforms, unconditional expansion/lowering of nodes isn't always possible. For example, in final graph reshaping for `DivM...
>
> Hi @jaskarth ,
>
> I was trying to lower LShiftVB/URShiftVB IR to LShiftVS/URShiftVS for the x86 backend. I intend to use GVN to factor out the upfront byte-vector-to-short-vector conversion for both the input and shift vectors when these are shared across two operations, since the x86 ISA does not support direct byte vector shifts.
>
> To begin with, I made the following diff expecting the status quo, but I am getting the following fatal error at build time. Can you kindly check?
>
>
> diff --git a/src/hotspot/cpu/x86/c2_lowering_x86.cpp b/src/hotspot/cpu/x86/c2_lowering_x86.cpp
> index cf4c014ffda..bc8df186396 100644
> --- a/src/hotspot/cpu/x86/c2_lowering_x86.cpp
> +++ b/src/hotspot/cpu/x86/c2_lowering_x86.cpp
> @@ -32,6 +32,6 @@ Node* PhaseLowering::lower_node_platform(Node* n) {
> }
>
> bool PhaseLowering::should_lower() {
> - return false;
> + return true;
> }
> #endif // COMPILER2
>
>
> ERROR: Build failed for target 'images' in configuration 'linux-x86_64-server-fastdebug' (exit code 2)
>
> === Output from failing command(s) repeated here ===
> * For target support_interim-image-jlink__jlink_interim_image_exec:
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> # Internal Error (/home/jatinbha/sandboxes/jdk-trunk/jdk/src/hotspot/share/opto/node.hpp:960), pid=1961256, tid=1961293
> # assert(is_MachReturn()) failed: invalid node class: Con
> #
> # JRE version: OpenJDK Runtime Environment (24.0) (fastdebug build 24-internal-adhoc.root.jdk)
> # Java VM: OpenJDK 64-Bit Server VM (fastdebug 24-internal-adhoc.root.jdk, mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
> # Problematic frame:
> # V [libjvm.so+0x140f939] Matcher::Fixup_Save_On_Entry()+0x279
> #
> # Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport %p %s %c %d %P %E" (or dumping to /home/jatinbha/sandboxes/jdk-trunk/jdk/make/core.1961256)
> #
> # An error report file with more information is saved as:
> # /home/jatinbha/sandboxes/jdk-trunk/jdk/make/hs_err_pid1961256.log
> ... (rest of output omitted)
>
> * All command lines available in /home/jatinbha/sandboxes/jdk-trunk/jdk/build/linux-x86_64-server-fastdebug/make-support/failure-logs.
@jatin-bhateja Thanks for the bug report! I was able to reproduce it on my system as well. I apologize for the late response; I had been busy with my studies and only had a chance to look at this recently. It looks like this bug is caused by running `Identity()` on all of the nodes, which finds new optimizations by replacing `LoadNode`s with a non-load identity. This causes an uncommon trap branch to be optimized out and the `HaltNode` to be replaced with a TOP `ConNode`. Usually these are removed by `RootNode::Ideal`, but since we don't do regular `Ideal` in lowering, it ends up running into an assert later.
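For anyone following along, here is a minimal standalone sketch of the failure mode (toy types and names of my own, not HotSpot code): once the `HaltNode` on the dead path has been folded to a top constant and nothing prunes it from under the root, a later pass that expects only return-like exits there hits exactly the kind of assert shown in the hs_err output above.

```
// Toy model only -- the types and names here are illustrative, not HotSpot's.
#include <algorithm>
#include <cassert>
#include <cstdio>
#include <vector>

enum class Kind { Return, Halt, TopCon };  // TopCon models a dead exit folded to top

struct ToyNode { Kind kind; };

// Models what RootNode::Ideal normally does during a regular IGVN round:
// drop dead (top) inputs from under the root.
static void prune_dead_exits(std::vector<ToyNode*>& root_inputs) {
  root_inputs.erase(std::remove_if(root_inputs.begin(), root_inputs.end(),
                                   [](ToyNode* n) { return n->kind == Kind::TopCon; }),
                    root_inputs.end());
}

// Models a later pass that assumes every input under the root is a return-like
// exit; a leftover constant corresponds to the "invalid node class: Con" assert.
static void fixup_exits(const std::vector<ToyNode*>& root_inputs) {
  for (const ToyNode* n : root_inputs) {
    assert(n->kind != Kind::TopCon && "invalid node class: Con");
  }
}

int main() {
  ToyNode ret{Kind::Return};
  ToyNode top{Kind::TopCon};  // the HaltNode on the dead branch, already folded to top
  std::vector<ToyNode*> root_inputs{&ret, &top};

  prune_dead_exits(root_inputs);  // without this step, fixup_exits() would assert
  fixup_exits(root_inputs);
  std::printf("root has %zu live exit(s)\n", root_inputs.size());
  return 0;
}
```

In the real graph that pruning normally happens through `RootNode::Ideal` during an ordinary IGVN round, which is exactly the step the lowering pass skips, so the stale input survives until matching.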
This behavior is interesting because it suggests there are `Identity()` transforms that potentially aren't applied in earlier IGVN rounds. I didn't take a very in-depth look, but it might be good to investigate that separately. Because of that, I'm not sure we can rely on `Identity()` here, especially if there are identity transforms that depend on other `Ideal` calls. @merykitty do you have any thoughts on the best way to fix this?
@eme64 Ah, do you mean the recursive reduction op generation based on vector length? I think that could be handled here as well by splitting the reduction operation into multiple backend-specific nodes. It would still need to be implemented per backend, but there could be benefits when optimizing more complex reductions, as you mention.
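To make the splitting idea a bit more concrete, here is a small scalar sketch of what I mean by reducing in halves based on vector length (plain C++ of my own, purely illustrative; an actual implementation would emit each platform's own vector nodes instead):

```
#include <cstddef>
#include <cstdio>
#include <vector>

// Toy lane-wise model: reduce an n-lane "vector" by repeatedly adding its
// upper half into its lower half, the way a wide reduction could be lowered
// into a sequence of narrower vector adds plus one final scalar result.
// Assumes a power-of-two lane count.
static int reduce_add_lanes(std::vector<int> lanes) {
  while (lanes.size() > 1) {
    std::size_t half = lanes.size() / 2;  // split into lo/hi halves
    for (std::size_t i = 0; i < half; i++) {
      lanes[i] += lanes[i + half];        // one narrower vector add per halving step
    }
    lanes.resize(half);                   // continue the reduction on the lower half
  }
  return lanes[0];
}

int main() {
  std::vector<int> v{1, 2, 3, 4, 5, 6, 7, 8};  // an 8-lane vector
  std::printf("%d\n", reduce_add_lanes(v));    // prints 36
  return 0;
}
```

In node terms, each halving step would correspond to one backend-specific operation, so wider vectors would simply produce a longer chain of progressively narrower ops.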
-------------
PR Comment: https://git.openjdk.org/jdk/pull/21599#issuecomment-2533628139