[code-reflection] RFR: [proposal][hat] Enabling local/shared memory and local barriers within compute methods
Juan Fumero
jfumero at openjdk.org
Thu Sep 4 12:36:36 UTC 2025
This patch enables a new extension for HAT that allows the use of local memory (OpenCL terminology) / shared memory (CUDA terminology). The extension is preliminary; the implementation could later be improved to use a new HAT dialect with new Ops from code reflection (Babylon).
This extension brings two new calls to the API:
a) Declaration and allocation of data structures in local/shared memory
b) Access to local barriers
Example:

```java
private interface MyArray extends Buffer {
    void array(long index, int value);
    int array(long index);

    Schema<MyArray> schema = Schema.of(MyArray.class,
            myArray -> myArray
                    .array("array", 16));

    static MyArray create(Accelerator accelerator) {
        return schema.allocate(accelerator, 1);
    }

    static MyArray createLocal() {
        return create(new Accelerator(MethodHandles.lookup(), Backend.FIRST));
    }
}
@CodeReflection
private static void reduceLocal(@RO KernelContext context, @RW S32Array input, @RW S32Array partialSums) {
    int localId = context.lix;
    int localSize = context.lsx;
    int blockId = context.bix;

    // Prototype: allocate an array of 16 ints in local/shared memory
    MyArray sharedArray = MyArray.createLocal();

    // Copy from global to local/shared memory
    sharedArray.array(localId, input.array(context.gix));

    // Tree reduction using local memory
    for (int offset = localSize / 2; offset > 0; offset /= 2) {
        context.barrier();
        if (localId < offset) {
            sharedArray.array(localId, sharedArray.array(localId) + sharedArray.array(localId + offset));
        }
    }

    if (localId == 0) {
        // Copy the per-work-group result from local/shared memory to global memory
        partialSums.array(blockId, sharedArray.array(0));
    }
}
```
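For reference, here is a plain-Java sketch (illustrative only, not HAT API; the helper name `reduceOnHost` is made up) of the same per-work-group tree reduction the kernel performs. Reading it sequentially makes it easier to see why `context.barrier()` is needed between reduction steps on the GPU: each step reads values that other work-items wrote in the previous step.

```java
// Sequential sketch of the per-work-group tree reduction above.
// Assumes groupSize is a power of two and input.length is a multiple of
// groupSize, matching the kernel's assumptions.
static int[] reduceOnHost(int[] input, int groupSize) {
    int numGroups = input.length / groupSize;
    int[] partialSums = new int[numGroups];
    for (int group = 0; group < numGroups; group++) {
        // Plays the role of the kernel's MyArray in local/shared memory
        int[] shared = new int[groupSize];
        for (int localId = 0; localId < groupSize; localId++) {
            shared[localId] = input[group * groupSize + localId]; // global -> local copy
        }
        // Halve the active range each step; on the GPU, context.barrier()
        // separates the steps so every work-item sees the previous writes.
        for (int offset = groupSize / 2; offset > 0; offset /= 2) {
            for (int localId = 0; localId < offset; localId++) {
                shared[localId] += shared[localId + offset];
            }
        }
        partialSums[group] = shared[0]; // local -> global copy, done by localId == 0
    }
    return partialSums;
}
```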
This brings low-level GPU programming to HAT, although for now it is experimental.
### Running a few examples:
a) Reductions:

```
HAT=SHOW_CODE java -cp job.jar hat.java exp ffi-opencl Reduction
HAT=SHOW_CODE java -cp job.jar hat.java exp ffi-cuda Reduction
```

b) Fast matrix multiplication using loop tiling and local memory (a plain-Java sketch of the tiling idea follows these examples):

```
java @hat/run ffi-opencl matmul 2DTILING
java @hat/run ffi-cuda matmul 2DTILING
```

c) Another example of using shared-memory data structures:

```
HAT=SHOW_CODE java -cp job.jar hat.java exp ffi-opencl LocalArray
```
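For context, the 2DTILING matmul example stages tiles of the input matrices into local/shared memory before multiplying them. The following plain-Java sketch (illustrative only, not the HAT kernel; the name `tiledMatMul` is made up) shows the loop-tiling structure; the two staging loops correspond to the global-to-local copies that a GPU version would follow with a barrier.

```java
// Sequential sketch of loop tiling for C += A * B on n x n row-major matrices.
// Assumes n is a multiple of tile and that c starts zeroed.
static void tiledMatMul(float[] a, float[] b, float[] c, int n, int tile) {
    float[] aTile = new float[tile * tile]; // plays the role of shared memory for A
    float[] bTile = new float[tile * tile]; // plays the role of shared memory for B
    for (int bi = 0; bi < n; bi += tile) {
        for (int bj = 0; bj < n; bj += tile) {
            for (int bk = 0; bk < n; bk += tile) {
                // Stage the current tiles (the global -> local copy in a GPU version)
                for (int i = 0; i < tile; i++) {
                    for (int j = 0; j < tile; j++) {
                        aTile[i * tile + j] = a[(bi + i) * n + (bk + j)];
                        bTile[i * tile + j] = b[(bk + i) * n + (bj + j)];
                    }
                }
                // Multiply the staged tiles; on the GPU a barrier would precede this
                for (int i = 0; i < tile; i++) {
                    for (int j = 0; j < tile; j++) {
                        float sum = 0f;
                        for (int k = 0; k < tile; k++) {
                            sum += aTile[i * tile + k] * bTile[k * tile + j];
                        }
                        c[(bi + i) * n + (bj + j)] += sum;
                    }
                }
            }
        }
    }
}
```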
-------------
Commit messages:
- [hat] Remove old methods to allocate DS in local memory
- [hat] Allocation of UDS in private memory disabled
- [hat] Proposal for shared memory and barriers. Private mem support removed
- [hat] Move again the isIfaceAccessor method to the OpTk class
- Fix check for iFaceAccessor without the need for the OpTk old methods
- minor change
- Merge branch 'code-reflection' into hat/api/shared
- [hat] Fix example in private memory
- Unified interface for handling private and shared mem types
- minor change
- ... and 60 more: https://git.openjdk.org/babylon/compare/dbadf0d0...01c7ccfd
Changes: https://git.openjdk.org/babylon/pull/531/files
Webrev: https://webrevs.openjdk.org/?repo=babylon&pr=531&range=00
Stats: 922 lines in 22 files changed: 859 ins; 15 del; 48 mod
Patch: https://git.openjdk.org/babylon/pull/531.diff
Fetch: git fetch https://git.openjdk.org/babylon.git pull/531/head:pull/531
PR: https://git.openjdk.org/babylon/pull/531