[code-reflection] RFR: [proposal][hat] Enabling local/shared memory and local barriers within compute methods

Juan Fumero jfumero at openjdk.org
Thu Sep 4 12:36:36 UTC 2025


This patch adds a new extension to HAT that allows the use of local memory (OpenCL terminology) / shared memory (CUDA terminology) within compute methods. The extension is preliminary, and the implementation could be improved by introducing a new HAT dialect with new Ops from code reflection (Babylon).

This extension brings two new capabilities to the API:
a) declaration and allocation of data structures in local/shared memory;
b) access to local barriers.

Example:

```java
private interface MyArray extends Buffer {
    void array(long index, int value);
    int array(long index);

    Schema<MyArray> schema = Schema.of(MyArray.class,
            myArray -> myArray
                    .array("array", 16));

    static MyArray create(Accelerator accelerator) {
        return schema.allocate(accelerator, 1);
    }

    static MyArray createLocal() {
        return create(new Accelerator(MethodHandles.lookup(), Backend.FIRST));
    }
}

@CodeReflection
private static void reduceLocal(@RO KernelContext context, @RW S32Array input, @RW S32Array partialSums) {
    int localId = context.lix;
    int localSize = context.lsx;
    int blockId = context.bix;

    // Prototype: allocate an array of 16 ints in local/shared memory
    MyArray sharedArray = MyArray.createLocal();

    // Copy from global to shared memory
    sharedArray.array(localId, input.array(context.gix));

    // Reduction using local memory
    for (int offset = localSize / 2; offset > 0; offset /= 2) {
        context.barrier();
        if (localId < offset) {
            sharedArray.array(localId, sharedArray.array(localId) + sharedArray.array(localId + offset));
        }
    }
    if (localId == 0) {
        // Copy from shared memory to global memory
        partialSums.array(blockId, sharedArray.array(0));
    }
}
```

This brings low-level GPU programming to HAT; for now, it is experimental.
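
For context, a host-side launch of the reduction above could look roughly like the sketch below. The overall shape (a `@CodeReflection` compute method taking a `ComputeContext`, launched via `accelerator.compute(...)` and `cc.dispatchKernel(...)`) follows existing HAT examples; the concrete sizes, and how the 16-wide work-groups implied by `MyArray` are configured, are assumptions rather than details confirmed by this patch.

```java
// Host-side sketch, assuming existing HAT dispatch conventions.
@CodeReflection
static void reduce(ComputeContext cc, S32Array input, S32Array partialSums) {
    // One work-item per input element. How the 16-wide work-groups
    // (matching MyArray's 16-int local array) are selected is assumed
    // to be handled by the backend/dispatch machinery.
    cc.dispatchKernel(input.length(), kc -> reduceLocal(kc, input, partialSums));
}

static void main(String[] args) {
    var accelerator = new Accelerator(MethodHandles.lookup(), Backend.FIRST);
    var input = S32Array.create(accelerator, 1024);
    var partialSums = S32Array.create(accelerator, 1024 / 16); // one partial sum per work-group
    accelerator.compute(cc -> reduce(cc, input, partialSums));
}
```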

### Running a few examples:

a) Reductions:

```
HAT=SHOW_CODE java -cp job.jar hat.java exp ffi-opencl Reduction
HAT=SHOW_CODE java -cp job.jar hat.java exp ffi-cuda Reduction
```

b) Fast matrix multiplication using loop tiling and local memory (a kernel-level sketch of the tiling idea appears after these examples):

```
java @hat/run ffi-opencl matmul 2DTILING
java @hat/run ffi-cuda matmul 2DTILING
```


c) Another example of using shared memory data structures:

```
HAT=SHOW_CODE java -cp job.jar hat.java exp ffi-opencl LocalArray
```
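
As a rough illustration of the loop-tiling technique used by the matmul example (item b above), the kernel below stages TILE x TILE tiles of both inputs through local memory and synchronizes with barriers between tiles. This is a sketch only: `F32Tile`, its `createLocal()` factory, the 2D index fields `liy`/`biy`, and the `TILE` constant are hypothetical names for illustration, not APIs confirmed by this patch; `n` is assumed to be a multiple of `TILE`.

```java
// Hypothetical local tile buffer, analogous to MyArray above (TILE = 16 assumed).
private interface F32Tile extends Buffer {
    void array(long index, float value);
    float array(long index);
    Schema<F32Tile> schema = Schema.of(F32Tile.class, t -> t.array("array", 16 * 16));

    static F32Tile createLocal() { // mirrors MyArray.createLocal() above
        return schema.allocate(new Accelerator(MethodHandles.lookup(), Backend.FIRST), 1);
    }
}

@CodeReflection
private static void matMulTiled(@RO KernelContext context,
                                @RO F32Array a, @RO F32Array b, @RW F32Array c, int n) {
    final int TILE = 16;
    int lx = context.lix;                 // local x id (from this patch)
    int ly = context.liy;                 // ASSUMPTION: hypothetical 2D local id
    int row = context.biy * TILE + ly;    // ASSUMPTION: hypothetical 2D block id
    int col = context.bix * TILE + lx;

    F32Tile tileA = F32Tile.createLocal();    // staged tile of A in local memory
    F32Tile tileB = F32Tile.createLocal();    // staged tile of B in local memory

    float acc = 0f;
    for (int t = 0; t < n / TILE; t++) {
        // Each work-item loads one element of each tile.
        tileA.array(ly * TILE + lx, a.array(row * n + t * TILE + lx));
        tileB.array(ly * TILE + lx, b.array((t * TILE + ly) * n + col));
        context.barrier();                // wait until both tiles are fully loaded
        for (int k = 0; k < TILE; k++) {  // partial dot product over this tile
            acc += tileA.array(ly * TILE + k) * tileB.array(k * TILE + lx);
        }
        context.barrier();                // don't overwrite tiles still being read
    }
    c.array(row * n + col, acc);
}
```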

-------------

Commit messages:
 - [hat] Remove old methods to allocate DS in local memory
 - [hat] Allocation of UDS in private memory disabled
 - [hat] Proposal for shared memory and barriers. Private mem support removed
 - [hat] Move again the isIfaceAccessor method to the OpTk class
 - Fix check for iFaceAccessor without the need for the OpTk old methods
 - minor change
 - Merge branch 'code-reflection' into hat/api/shared
 - [hat] Fix example in private memory
 - Unified interface for handling private and shared mem types
 - minor change
 - ... and 60 more: https://git.openjdk.org/babylon/compare/dbadf0d0...01c7ccfd

Changes: https://git.openjdk.org/babylon/pull/531/files
  Webrev: https://webrevs.openjdk.org/?repo=babylon&pr=531&range=00
  Stats: 922 lines in 22 files changed: 859 ins; 15 del; 48 mod
  Patch: https://git.openjdk.org/babylon/pull/531.diff
  Fetch: git fetch https://git.openjdk.org/babylon.git pull/531/head:pull/531

PR: https://git.openjdk.org/babylon/pull/531

