[code-reflection] Integrated: Hat merge cuda ptx ffi backends

Wed Apr 2 14:20:27 UTC 2025

The merge between ptx and cuda backends continues. 

We still have a bug when using cuda streams (which we need to minimize buffer copies).

Added a pure C++ test (which launches squares directly from C++); it replicates the error we see with the squares example.

Lots of small tidy-ups in the code here, but no real fixes.

Expect more cuda/ptx merge updates.

-------------

Commit messages:
 - whitespace
 - slowly tracking cuda issue. kernels can be launched provided we avoid using streams but we need streams to minimize buffer copies
 - synced with jetson/cuda
 - cuda backend compiles and receives ptx. Failing to execute kernel

Changes: https://git.openjdk.org/babylon/pull/377/files
  Webrev: https://webrevs.openjdk.org/?repo=babylon&pr=377&range=00
  Stats: 1607 lines in 18 files changed: 1161 ins; 399 del; 47 mod
  Patch: https://git.openjdk.org/babylon/pull/377.diff
  Fetch: git fetch https://git.openjdk.org/babylon.git pull/377/head:pull/377

PR: https://git.openjdk.org/babylon/pull/377