[code-reflection] RFR: HAT - New examples for optimizing matmul

Tue Sep 30 14:06:42 UTC 2025

This PR adds new examples and tests that show how HAT can optimize matmul kernels. The examples use 2D cache + loop tiling and 2D register tiling.

Two implementations are provided: one specific to how CUDA handles threads, and another that is portable across both CUDA and OpenCL. Both implementations can be tuned further, depending on the GPU.

The goal is to show how matmul, or any other HAT kernel, can be tuned with the current building blocks of HAT. These examples make use of local/shared data structures, private data structures, and local/thread-block IDs to access data.
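
For readers unfamiliar with the technique, here is a minimal sketch of 2D register tiling in plain Java. It is not the code in this PR and it does not use the HAT API: thread blocks and threads are emulated with ordinary loops, the arrays tileA/tileB stand in for local/shared memory, and regA/regB/regC stand in for private (per-thread) storage. The class name, method names, and tile sizes (BM, BN, BK, TM, TN) are assumptions chosen for illustration only.

```java
import java.util.Random;

public class RegisterTiledMatmul {
    // Hypothetical tile sizes, for illustration only (not taken from the PR).
    static final int BM = 16, BN = 16, BK = 8; // each block computes a BM x BN tile of C, walking K in steps of BK
    static final int TM = 4, TN = 4;           // each emulated thread computes a TM x TN sub-tile

    // Reference: straightforward triple loop over square n x n row-major matrices.
    static float[] naive(float[] a, float[] b, int n) {
        float[] c = new float[n * n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                float sum = 0f;
                for (int k = 0; k < n; k++) sum += a[i * n + k] * b[k * n + j];
                c[i * n + j] = sum;
            }
        return c;
    }

    // CPU emulation of shared-memory tiling plus 2D register tiling.
    static float[] tiled(float[] a, float[] b, int n) {
        if (n % BM != 0 || n % BN != 0 || n % BK != 0)
            throw new IllegalArgumentException("n must be a multiple of the tile sizes");
        float[] c = new float[n * n];
        int threadsY = BM / TM, threadsX = BN / TN;
        float[] tileA = new float[BM * BK]; // stands in for local/shared memory
        float[] tileB = new float[BK * BN];
        for (int by = 0; by < n / BM; by++) {
            for (int bx = 0; bx < n / BN; bx++) {
                // One accumulator tile per emulated thread (stands in for private memory).
                float[][] acc = new float[threadsY * threadsX][TM * TN];
                for (int k0 = 0; k0 < n; k0 += BK) {
                    // Cooperative load of the current A and B tiles (a barrier would follow on a GPU).
                    for (int i = 0; i < BM; i++)
                        for (int k = 0; k < BK; k++)
                            tileA[i * BK + k] = a[(by * BM + i) * n + k0 + k];
                    for (int k = 0; k < BK; k++)
                        for (int j = 0; j < BN; j++)
                            tileB[k * BN + j] = b[(k0 + k) * n + bx * BN + j];
                    // Each emulated thread updates its TM x TN accumulators from the tiles.
                    for (int ty = 0; ty < threadsY; ty++) {
                        for (int tx = 0; tx < threadsX; tx++) {
                            float[] regC = acc[ty * threadsX + tx];
                            float[] regA = new float[TM];
                            float[] regB = new float[TN];
                            for (int k = 0; k < BK; k++) {
                                for (int i = 0; i < TM; i++) regA[i] = tileA[(ty * TM + i) * BK + k];
                                for (int j = 0; j < TN; j++) regB[j] = tileB[k * BN + tx * TN + j];
                                for (int i = 0; i < TM; i++)
                                    for (int j = 0; j < TN; j++)
                                        regC[i * TN + j] += regA[i] * regB[j];
                            }
                        }
                    }
                }
                // Write each thread's accumulators back to C.
                for (int ty = 0; ty < threadsY; ty++)
                    for (int tx = 0; tx < threadsX; tx++)
                        for (int i = 0; i < TM; i++)
                            for (int j = 0; j < TN; j++)
                                c[(by * BM + ty * TM + i) * n + bx * BN + tx * TN + j] =
                                        acc[ty * threadsX + tx][i * TN + j];
            }
        }
        return c;
    }

    public static void main(String[] args) {
        int n = 64;
        Random rnd = new Random(42);
        float[] a = new float[n * n], b = new float[n * n];
        for (int i = 0; i < n * n; i++) { a[i] = rnd.nextFloat(); b[i] = rnd.nextFloat(); }
        float[] ref = naive(a, b, n), out = tiled(a, b, n);
        float maxDiff = 0f;
        for (int i = 0; i < n * n; i++) maxDiff = Math.max(maxDiff, Math.abs(ref[i] - out[i]));
        System.out.println("max abs difference vs naive: " + maxDiff);
    }
}
```

The point of the inner loop is that each value read from tileA is reused TN times and each value read from tileB is reused TM times, so the ratio of arithmetic to local-memory reads grows with the per-thread tile. On a real GPU the best values for these sizes depend on the device, which is why they are worth tuning per card.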

How to test? 

HAT=SHOW_CODE java @hat/run ffi-opencl matmul 2DRTPORTABLE

HAT=SHOW_CODE java @hat/run ffi-cuda matmul 2DRTPORTABLE

-------------

Commit messages:
 - [hat] Fix CUDA scheduler
 - Merge branch 'code-reflection' into hat/mxm/opts
 - [hat] Example of matmul with 2D register tiling

Changes: https://git.openjdk.org/babylon/pull/587/files
  Webrev: https://webrevs.openjdk.org/?repo=babylon&pr=587&range=00
  Stats: 732 lines in 9 files changed: 704 ins; 5 del; 23 mod
  Patch: https://git.openjdk.org/babylon/pull/587.diff
  Fetch: git fetch https://git.openjdk.org/babylon.git pull/587/head:pull/587

PR: https://git.openjdk.org/babylon/pull/587