[code-reflection] RFR: [proposal][hat] Enabling local/shared memory and local barriers within compute methods

duke duke at openjdk.org
Thu Sep 4 12:41:01 UTC 2025


On Fri, 22 Aug 2025 11:18:40 GMT, Juan Fumero <jfumero at openjdk.org> wrote:

> This patch adds a new extension to HAT that allows the use of local memory (OpenCL terminology) / shared memory (CUDA terminology) within compute methods. The extension is preliminary, and the implementation could be improved to use a new HAT dialect with dedicated Ops from code reflection (Babylon).
> 
> This extension adds two new capabilities to the API:
> a) Declaration and allocation of data structures in local/shared memory
> b) Access to local barriers
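> 
> For context, `context.barrier()` follows the usual work-group barrier contract: every work-item in the group must reach the barrier before any of them continues, so the typical use is write-to-local, barrier, then read. A minimal sketch of that pattern, reusing the types from the example below:
> 
> ```java
> // Sketch only: each work-item publishes a value into local memory, the
> // work-group synchronizes, and only then are other work-items' slots read.
> @CodeReflection
> static void barrierPattern(@RO KernelContext context, @RW S32Array input, @RW S32Array output) {
>     MyArray sharedArray = MyArray.createLocal();                // per-work-group local allocation
>     sharedArray.array(context.lix, input.array(context.gix));   // each work-item writes its own slot
>     context.barrier();                                          // wait for the whole work-group
>     output.array(context.gix, sharedArray.array(0));            // slots written by others are now visible
> }
> ```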
> 
> Example:
> 
> 
> ```java
> private interface MyArray extends Buffer {
>     void array(long index, int value);
>     int array(long index);
> 
>     Schema<MyArray> schema = Schema.of(MyArray.class,
>             myArray -> myArray
>                     .array("array", 16));
> 
>     static MyArray create(Accelerator accelerator) {
>         return schema.allocate(accelerator, 1);
>     }
> 
>     static MyArray createLocal() {
>         return create(new Accelerator(MethodHandles.lookup(), Backend.FIRST));
>     }
> }
> 
> @CodeReflection
> private static void reduceLocal(@RO KernelContext context, @RW S32Array input, @RW S32Array partialSums) {
>     int localId = context.lix;
>     int localSize = context.lsx;
>     int blockId = context.bix;
> 
>     // Prototype: allocate an array of 16 ints in local/shared memory
>     MyArray sharedArray = MyArray.createLocal();
> 
>     // Copy from global to shared memory
>     sharedArray.array(localId, input.array(context.gix));
> 
>     // Reduction using local memory
>     for (int offset = localSize / 2; offset > 0; offset /= 2) {
>         context.barrier();
>         if (localId < offset) {
>             sharedArray.array(localId, sharedArray.array(localId) + sharedArray.array(localId + offset));
>         }
>     }
>     if (localId == 0) {
>         // Copy the work-group's result from shared memory back to global memory
>         partialSums.array(blockId, sharedArray.array(0));
>     }
> }
> ``` 
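> 
> Note that the kernel above leaves one partial sum per work-group, so a final combine step is still needed, either on the host or via a second kernel launch. A minimal host-side sketch (not part of this patch), assuming the dispatch used `numGroups` work-groups and reusing the `array(long)` accessor shown above:
> 
> ```java
> // Host-side finish (sketch): sum the per-work-group results written by reduceLocal.
> static int finishReduction(S32Array partialSums, int numGroups) {
>     int total = 0;
>     for (int i = 0; i < numGroups; i++) {
>         total += partialSums.array(i);
>     }
>     return total;
> }
> ```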
> 
> This brings low-level GPU programming to HAT; for now, it is purely experimental.
> 
> ### Running a few examples:
> 
> a) Reductions:
> 
> 
> HAT=SHOW_CODE java -cp job.jar hat.java exp ffi-opencl Reduction
> 
> HAT=SHOW_CODE java -cp job.jar hat.java exp ffi-cuda Reduction
> 
> 
> b) Fast matrix multiplication using loop tiling and local memory (a plain-Java sketch of the tiling idea follows the run commands):
> 
> 
> java @hat/run ffi-opencl matmul 2DTILING 
> 
> java @hat/run ffi-cuda matmul 2DTILING 
> 
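> For reference, a plain-Java, CPU-side sketch of the loop-tiling idea (this is not the HAT kernel used by the example): the matrices are processed in TILE x TILE blocks so each block stays in fast memory, which is the role local/shared memory plays in the GPU kernel.
> 
> ```java
> // Blocked (tiled) matrix multiplication, C = A * B, for n x n row-major matrices.
> // Assumes c is zero-initialized; TILE is chosen so three blocks fit in fast memory.
> static void tiledMatMul(float[] a, float[] b, float[] c, int n) {
>     final int TILE = 16;
>     for (int i0 = 0; i0 < n; i0 += TILE)
>         for (int j0 = 0; j0 < n; j0 += TILE)
>             for (int k0 = 0; k0 < n; k0 += TILE)
>                 for (int i = i0; i < Math.min(i0 + TILE, n); i++)
>                     for (int j = j0; j < Math.min(j0 + TILE, n); j++) {
>                         float acc = c[i * n + j];
>                         for (int k = k0; k < Math.min(k0 + TILE, n); k++)
>                             acc += a[i * n + k] * b[k * n + j];
>                         c[i * n + j] = acc;
>                     }
> }
> ```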
> 
> c) Another example of using shared memory data structures:
> 
> 
> HAT=SHOW_CODE java -cp job.jar hat.java exp ffi-opencl LocalArray

@jjfumero 
Your change (at version 01c7ccfdef3e654f66fa9b2acb4110cdf0ca3331) is now ready to be sponsored by a Committer.

-------------

PR Comment: https://git.openjdk.org/babylon/pull/531#issuecomment-3253503272

