[code-reflection] RFR: [proposal][hat] Enabling local/shared memory and local barriers within compute methods
duke
duke at openjdk.org
Thu Sep 4 12:41:01 UTC 2025
On Fri, 22 Aug 2025 11:18:40 GMT, Juan Fumero <jfumero at openjdk.org> wrote:
> This patch adds an extension to HAT that allows compute methods to use local memory (OpenCL terminology) or shared memory (CUDA terminology). The extension is preliminary; the implementation could later be improved by moving to a new HAT dialect with new Ops from code reflection (Babylon).
>
> This extension brings two new calls to the API:
> a) Declaration and allocation of data structures in local/shared memory
> b) Access to local barriers
>
> Example:
>
> ```java
> private interface MyArray extends Buffer {
>     void array(long index, int value);
>     int array(long index);
>
>     Schema<MyArray> schema = Schema.of(MyArray.class,
>             myArray -> myArray
>                     .array("array", 16));
>
>     static MyArray create(Accelerator accelerator) {
>         return schema.allocate(accelerator, 1);
>     }
>
>     static MyArray createLocal() {
>         return create(new Accelerator(MethodHandles.lookup(), Backend.FIRST));
>     }
> }
>
> @CodeReflection
> private static void reduceLocal(@RO KernelContext context, @RW S32Array input, @RW S32Array partialSums) {
>     int localId = context.lix;
>     int localSize = context.lsx;
>     int blockId = context.bix;
>
>     // Prototype: allocate in shared memory an array of 16 ints
>     MyArray sharedArray = MyArray.createLocal();
>
>     // Copy from global to shared memory
>     sharedArray.array(localId, input.array(context.gix));
>
>     // Reduction using local memory
>     for (int offset = localSize / 2; offset > 0; offset /= 2) {
>         // Wait until every work-item in the group has finished the previous step
>         context.barrier();
>         if (localId < offset) {
>             sharedArray.array(localId, sharedArray.array(localId) + sharedArray.array(localId + offset));
>         }
>     }
>     if (localId == 0) {
>         // Copy from shared memory to global memory
>         partialSums.array(blockId, sharedArray.array(0));
>     }
> }
> ```
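
For readers less familiar with work-group reductions, the loop in the quoted kernel can be traced on the CPU. What follows is only a minimal plain-Java sketch, not HAT code: `LocalReductionSketch` and `reduceGroup` are hypothetical names, an ordinary `int[]` stands in for local/shared memory, and the sequential inner loop stands in for the work-items that run between two `context.barrier()` calls. It assumes the work-group size is a power of two.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

class LocalReductionSketch {
    // Hypothetical helper: sums one work-group's slice of the input
    // using the same halving pattern as the quoted kernel.
    static int reduceGroup(int[] input, int groupStart, int localSize) {
        int[] shared = new int[localSize]; // stands in for local/shared memory
        for (int localId = 0; localId < localSize; localId++) {
            shared[localId] = input[groupStart + localId]; // global -> shared copy
        }
        for (int offset = localSize / 2; offset > 0; offset /= 2) {
            // Completing the previous pass before starting this one plays
            // the role of the local barrier.
            for (int localId = 0; localId < offset; localId++) {
                shared[localId] += shared[localId + offset];
            }
        }
        return shared[0]; // the value work-item 0 writes to partialSums
    }

    public static void main(String[] args) {
        int[] input = IntStream.rangeClosed(1, 32).toArray();
        int localSize = 16;
        int[] partialSums = new int[input.length / localSize];
        for (int blockId = 0; blockId < partialSums.length; blockId++) {
            partialSums[blockId] = reduceGroup(input, blockId * localSize, localSize);
        }
        System.out.println(Arrays.toString(partialSums));
    }
}
```

With the inputs 1..32 and a work-group size of 16 this prints `[136, 392]`, one partial sum per work-group. Within one pass the writes go to indices below `offset` and the reads of the second operand come from indices at or above it, which is why a barrier between passes is enough on the GPU.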
>
> This brings low-level GPU programming to HAT, but for now this is just experimentation.
>
> ### Running a few examples:
>
> a) Reductions:
>
> ```bash
> HAT=SHOW_CODE java -cp job.jar hat.java exp ffi-opencl Reduction
>
> HAT=SHOW_CODE java -cp job.jar hat.java exp ffi-cuda Reduction
> ```
>
> b) Fast matrix multiplication using loop tiling and local memory:
>
> ```bash
> java @hat/run ffi-opencl matmul 2DTILING
>
> java @hat/run ffi-cuda matmul 2DTILING
> ```
>
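
As a rough illustration of what loop tiling with local memory means for matrix multiplication, here is a plain-Java sketch. It is not the code from the PR: `TiledMatMulSketch`, `multiply` and the tile size of 4 are assumptions made for this example, the two small `float[]` buffers stand in for local/shared memory, and the matrix dimension is assumed to be a multiple of the tile size.

```java
import java.util.Arrays;

class TiledMatMulSketch {
    static final int TILE = 4; // hypothetical tile edge; n must be a multiple of it

    // Multiplies two n x n row-major matrices tile by tile.
    static float[] multiply(float[] a, float[] b, int n) {
        float[] c = new float[n * n];
        float[] tileA = new float[TILE * TILE]; // stands in for local/shared memory
        float[] tileB = new float[TILE * TILE];
        for (int bi = 0; bi < n; bi += TILE) {
            for (int bj = 0; bj < n; bj += TILE) {
                for (int bk = 0; bk < n; bk += TILE) {
                    // Stage one tile of A and one tile of B (global -> local copy).
                    for (int i = 0; i < TILE; i++) {
                        for (int j = 0; j < TILE; j++) {
                            tileA[i * TILE + j] = a[(bi + i) * n + (bk + j)];
                            tileB[i * TILE + j] = b[(bk + i) * n + (bj + j)];
                        }
                    }
                    // On a GPU, a local barrier would separate staging from use.
                    for (int i = 0; i < TILE; i++) {
                        for (int j = 0; j < TILE; j++) {
                            float sum = 0f;
                            for (int k = 0; k < TILE; k++) {
                                sum += tileA[i * TILE + k] * tileB[k * TILE + j];
                            }
                            c[(bi + i) * n + (bj + j)] += sum;
                        }
                    }
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        int n = 8;
        float[] a = new float[n * n];
        float[] identity = new float[n * n];
        for (int i = 0; i < n * n; i++) a[i] = i;
        for (int i = 0; i < n; i++) identity[i * n + i] = 1f;
        // Multiplying by the identity should give back the original matrix.
        System.out.println(Arrays.equals(multiply(a, identity, n), a));
    }
}
```

The point of the staging step is that each staged element is reused `TILE` times from the small buffer. On a GPU that reuse is what replaces repeated global-memory reads with local-memory reads.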
> c) Another example of using shared memory data structures:
>
> ```bash
> HAT=SHOW_CODE java -cp job.jar hat.java exp ffi-opencl LocalArray
> ```
@jjfumero
Your change (at version 01c7ccfdef3e654f66fa9b2acb4110cdf0ca3331) is now ready to be sponsored by a Committer.
-------------
PR Comment: https://git.openjdk.org/babylon/pull/531#issuecomment-3253503272