[code-reflection] Integrated: [proposal][hat] Enabling local/shared memory and local barriers within compute methods
Juan Fumero
jfumero at openjdk.org
Thu Sep 4 13:05:05 UTC 2025
On Fri, 22 Aug 2025 11:18:40 GMT, Juan Fumero <jfumero at openjdk.org> wrote:
> This patch adds an extension to HAT that allows the use of local memory (OpenCL terminology) / shared memory (CUDA terminology). This extension is preliminary, and the implementation can be improved to use a new HAT dialect with new Ops from code reflection (Babylon).
>
> This extension brings two new calls to the API:
> a) Declaration and allocation of data structures in local/shared memory
> b) Access to local barriers
>
> Example:
>
> ```java
> private interface MyArray extends Buffer {
>     void array(long index, int value);
>     int array(long index);
>
>     Schema<MyArray> schema = Schema.of(MyArray.class,
>             myArray -> myArray
>                     .array("array", 16));
>
>     static MyArray create(Accelerator accelerator) {
>         return schema.allocate(accelerator, 1);
>     }
>
>     static MyArray createLocal() {
>         return create(new Accelerator(MethodHandles.lookup(), Backend.FIRST));
>     }
> }
>
> @CodeReflection
> private static void reduceLocal(@RO KernelContext context, @RW S32Array input, @RW S32Array partialSums) {
>     int localId = context.lix;
>     int localSize = context.lsx;
>     int blockId = context.bix;
>
>     // Prototype: allocate in shared memory an array of 16 ints
>     MyArray sharedArray = MyArray.createLocal();
>
>     // Copy from global to shared memory
>     sharedArray.array(localId, input.array(context.gix));
>
>     // Reduction using local memory
>     for (int offset = localSize / 2; offset > 0; offset /= 2) {
>         context.barrier();
>         if (localId < offset) {
>             sharedArray.array(localId, sharedArray.array(localId) + sharedArray.array(localId + offset));
>         }
>     }
>     if (localId == 0) {
>         // Copy from shared memory to global memory
>         partialSums.array(blockId, sharedArray.array(0));
>     }
> }
> ```
>
> This brings low-level GPU programming to HAT, but for now this is just experimentation.
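
For readers unfamiliar with the pattern, the reduction in the kernel above can be sketched on the CPU in plain Java. This is not HAT code and does not use the HAT API: the class name `LocalReductionSketch`, the methods `reduceGroup` and `partialSums`, and the local size of 16 are assumptions chosen for illustration. Each pass of the outer loop stands in for one barrier-separated step, and the sketch assumes the local size is a power of two.

```java
import java.util.Arrays;

public class LocalReductionSketch {

    // Hypothetical CPU stand-in for one work-group.
    // The `shared` array plays the role of local/shared memory.
    static int reduceGroup(int[] input, int groupStart, int localSize) {
        int[] shared = new int[localSize];

        // Copy from "global" to "shared" memory, one element per work-item.
        for (int localId = 0; localId < localSize; localId++) {
            shared[localId] = input[groupStart + localId];
        }

        // Tree reduction. On a GPU a barrier separates the steps; here the
        // sequential outer loop gives the same ordering. Within a step,
        // work-items write indices below `offset` and read indices at or
        // above it, so iterating localId sequentially is equivalent.
        for (int offset = localSize / 2; offset > 0; offset /= 2) {
            for (int localId = 0; localId < offset; localId++) {
                shared[localId] += shared[localId + offset];
            }
        }

        // Work-item 0 holds the sum for the whole group.
        return shared[0];
    }

    // One partial sum per work-group, as in the partialSums buffer of the kernel.
    static int[] partialSums(int[] input, int localSize) {
        int groups = input.length / localSize;
        int[] sums = new int[groups];
        for (int blockId = 0; blockId < groups; blockId++) {
            sums[blockId] = reduceGroup(input, blockId * localSize, localSize);
        }
        return sums;
    }

    public static void main(String[] args) {
        int[] input = new int[32];
        for (int i = 0; i < input.length; i++) {
            input[i] = i + 1; // 1..32
        }
        // Two groups of 16 elements each: prints [136, 392]
        System.out.println(Arrays.toString(partialSums(input, 16)));
    }
}
```

The partial sums still need a final pass (on the host or in a second kernel) to produce a single total.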
>
> ### Running a few examples:
>
> a) Reductions:
>
>
> HAT=SHOW_CODE java -cp job.jar hat.java exp ffi-opencl Reduction
>
> HAT=SHOW_CODE java -cp job.jar hat.java exp ffi-cuda Reduction
>
>
> b) Fast matrix multiplication using loop tiling and local memory:
>
>
> java @hat/run ffi-opencl matmul 2DTILING
>
> java @hat/run ffi-cuda matmul 2DTILING
>
>
> c) Another example of using shared memory data structures:
>
>
> HAT=SHOW_CODE java -cp job.jar hat.java exp ffi-opencl LocalArray
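
As a rough illustration of the loop tiling idea behind example b), the following plain-Java sketch stages square tiles of both inputs in small scratch arrays before multiplying them. It is not the HAT matmul example and does not use the HAT API: the class `TiledMatMulSketch`, its methods, and the tile size of 16 are hypothetical, and the sketch assumes square row-major matrices whose dimension is a multiple of the tile size. On a GPU the scratch arrays would live in local/shared memory, with a barrier between staging and use.

```java
public class TiledMatMulSketch {

    static final int TILE = 16; // hypothetical tile size

    // Reference implementation: c = a * b for n x n row-major matrices.
    static float[] multiplyNaive(float[] a, float[] b, int n) {
        float[] c = new float[n * n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                float sum = 0f;
                for (int k = 0; k < n; k++) {
                    sum += a[i * n + k] * b[k * n + j];
                }
                c[i * n + j] = sum;
            }
        }
        return c;
    }

    // Tiled variant. Assumes n is a multiple of TILE.
    static float[] multiplyTiled(float[] a, float[] b, int n) {
        float[] c = new float[n * n];
        float[] tileA = new float[TILE * TILE]; // stand-in for local memory
        float[] tileB = new float[TILE * TILE]; // stand-in for local memory

        for (int rowBase = 0; rowBase < n; rowBase += TILE) {
            for (int colBase = 0; colBase < n; colBase += TILE) {
                for (int kBase = 0; kBase < n; kBase += TILE) {
                    // Stage one tile of a and one tile of b.
                    for (int i = 0; i < TILE; i++) {
                        for (int j = 0; j < TILE; j++) {
                            tileA[i * TILE + j] = a[(rowBase + i) * n + kBase + j];
                            tileB[i * TILE + j] = b[(kBase + i) * n + colBase + j];
                        }
                    }
                    // A GPU kernel would place a local barrier here.
                    // Accumulate the product of the two staged tiles.
                    for (int i = 0; i < TILE; i++) {
                        for (int j = 0; j < TILE; j++) {
                            float sum = 0f;
                            for (int k = 0; k < TILE; k++) {
                                sum += tileA[i * TILE + k] * tileB[k * TILE + j];
                            }
                            c[(rowBase + i) * n + colBase + j] += sum;
                        }
                    }
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        int n = 32;
        float[] a = new float[n * n];
        float[] b = new float[n * n];
        for (int i = 0; i < n * n; i++) {
            a[i] = i % 7; // small integers keep float sums exact
            b[i] = i % 5;
        }
        boolean same = java.util.Arrays.equals(multiplyNaive(a, b, n), multiplyTiled(a, b, n));
        System.out.println("tiled result matches naive result: " + same);
    }
}
```

Each staged element is reused TILE times from the scratch arrays, which is the data reuse that makes local memory worthwhile on a GPU.
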
This pull request has now been integrated.
Changeset: a16b8bf6
Author: Juan Fumero <jfumero at openjdk.org>
Committer: Gary Frost <gfrost at openjdk.org>
URL: https://git.openjdk.org/babylon/commit/a16b8bf6b637cc6bc2c44dc0d733e19d5174512a
Stats: 922 lines in 22 files changed: 859 ins; 15 del; 48 mod
[proposal][hat] Enabling local/shared memory and local barriers within compute methods
-------------
PR: https://git.openjdk.org/babylon/pull/531