Building CUDA bindings for Windows with jextract

Marco Hutter panama at jcuda.org
Mon Feb 18 18:00:01 UTC 2019


Hello,

I used the latest early access build (2019/2/12) to generate 
(experimental) bindings for CUDA on Windows with jextract. This was only 
a basic test, but it worked quite well. I could create an example of a 
GEMM computation with CUBLAS, the CUDA BLAS implementation. Further and 
more sophisticated tests may follow. I'll describe what I did to test 
this, and below you'll find a list of possible issues (or at least 
discussion points).

---------------------------------------------------------------------------

Some context: CUBLAS is a GPU-based BLAS library, built on top of CUDA. 
In order to use CUBLAS, one needs some basic functions from CUDA, mainly 
for memory management. There are some further caveats, e.g. the 
distinction between the "CUDA Runtime API" and the "CUDA Driver API", or 
the difference between CUBLAS and CUBLAS_v2, but I'll omit some of these 
details here. What's relevant: There are two library bindings involved. 
One for the CUDA Runtime API, and one for CUBLAS itself.
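
On the Java side, these eventually show up as two separately bound 
interfaces (taken from the attached test at the end of this mail):

     cuda_runtime_api cudart =
         Libraries.bind(MethodHandles.lookup(), cuda_runtime_api.class);
     cublas_api cublas =
         Libraries.bind(MethodHandles.lookup(), cublas_api.class);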

I packed the jextract calls into .BAT files that can be run at the 
standard Windows command prompt. Apologies for the odd, hard-coded paths 
- this is also addressed in the issue list below.

The call for generating the CUDA Runtime API bindings:

jextract.exe^
   -L "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.0/lib/x64/"^
   -I "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.0/include"^
   -l cudart64_100^
   -t org.jcuda.panama.cudart^
   --record-library-path "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.0/include/cuda_runtime.h"^
   -o cudart.jar^
   --log INFO

The call for generating the CUBLAS bindings:

jextract.exe^
   -L "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.0/lib/x64/"^
   -I "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.0/include"^
   -l cublas64_100^
   -t org.jcuda.panama.cublas^
   --record-library-path "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.0/include/cublas.h"^
   -o cublas.jar^
   --log INFO


With the resulting "cudart.jar" and "cublas.jar", it was possible to run 
the basic tests from the attached file. (It's a bit unstructured and 
thrown together, but it is only meant to show the basic usage.)

A high-level description of doing a GEMM (matrix multiplication) with 
CUBLAS is:
- The matrices are given as 1D Java float[] arrays
- CUDA: Allocate "device memory" (GPU memory) for the matrices
- CUDA: Copy the matrices from Java into the device memory
- CUBLAS: Perform the GEMM
- CUDA: Copy the result matrix from the device memory back to Java
- Profit
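
Condensed from the attached file (and omitting all error checks), the 
core of this flow looks like this:

     try (Scope sc = Scope.newNativeScope())
     {
         // Copy the host data (a 1D Java float[] array) into native memory
         Array<Float> arrayA = sc.allocateArray(NativeTypes.FLOAT, A);

         // Allocate device memory via cudaMalloc...
         Pointer<Pointer<Float>> pp = sc.allocate(NativeTypes.FLOAT.pointer());
         cudart.cudaMalloc(pp, nn * Float.BYTES);
         Pointer<Float> d_A = pp.get();

         // ...and copy the data from the host to the device
         cudart.cudaMemcpy(d_A, arrayA.elementPointer(), nn * Float.BYTES,
             cudaMemcpyHostToDevice);

         // (The GEMM call and the copy back to the host are analogous;
         // see the attached file for the full sequence)
     }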


---------------------------------------------------------------------------
Issues/Discussion points:

1. There are the usual (minor) Windows-vs-Linux quirks.
- The Windows path separator is "\", but it can simply be replaced with "/"
- The line continuation character in .BAT files is not "\" but "^"
- Having the full path in the jextract call, including the "C:/Program 
Files..." part, looks a bit odd, but I don't think there is another 
reasonable solution: there is no fixed install/include/library 
directory on Windows.

---

2. The handling of the actual native library name
(This might not be immediately relevant or specific to Panama, but it 
is worth mentioning.)
Initially, I thought that the "-l" parameter required the name of the 
".lib" file (e.g. "cublas.lib"), but it is actually the name of the 
library that is eventually passed to System.loadLibrary. The actual DLL 
for CUBLAS is called "cublas64_100.dll", and since System.loadLibrary 
appends the ".dll" part itself, the name has to be given as
-l cublas64_100
I know that this is different under Linux and Mac. I've been there: 
https://github.com/jcuda/jcuda/blob/master/JCudaJava/src/main/java/jcuda/LibUtils.java 
And things like the "library.so.2.1" version numbering and symlinks, or 
the RPATH handling on Mac, always cause headaches here and there...
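
For reference, System.mapLibraryName shows what System.loadLibrary will 
look for on each platform:

     // Prints "cublas64_100.dll" on Windows, "libcublas64_100.so" on
     // Linux, and "libcublas64_100.dylib" on Mac
     System.out.println(System.mapLibraryName("cublas64_100"));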

---

3. Sensible toString implementations on some classes would be nice.
E.g. printing a "BoundedPointer" yields output like
     jdk.internal.foreign.memory.LayoutTypeImpl@cb644e
     jdk.internal.foreign.memory.MemoryBoundInfo$1@13805618

---

4. Examining the results (API) of jextract
A successful jextract call creates a JAR. This is somewhat "opaque": 
e.g. for CUBLAS, roughly 50 classes are generated, with names like 
"cublas", "cublas_h", "cublas_api" and "cublas_api_h", and it's hard to 
get an initial grip on that. One can drop the JAR into an IDE and rely 
on the code completion; I dropped it into a decompiler to get an 
overview.
I'm not sure how this can sensibly be addressed, though. In some cases, 
and to some extent, it is possible to extract comments from the header 
files and process them into JavaDoc, but this is brittle (and we all 
know how well most native libraries are commented; Doxygen is a luxury 
here...)
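
One quick way to get a first overview without a decompiler is to simply 
list the generated classes, e.g. with java.util.jar.JarFile:

     import java.util.jar.JarFile;

     public class ListJarClasses
     {
         public static void main(String[] args) throws Exception
         {
             // Print the names of all classes in the generated JAR
             try (JarFile jar = new JarFile("cublas.jar"))
             {
                 jar.stream()
                     .map(entry -> entry.getName())
                     .filter(name -> name.endsWith(".class"))
                     .sorted()
                     .forEach(System.out::println);
             }
         }
     }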

---

5. Duplicate classes
As mentioned above, CUBLAS is a library that is built on top of CUDA. 
There are other libraries built on CUDA, like CUFFT for FFT, and many 
more. Therefore, there are basically these headers:
cuda.h
cublas.h (which includes cuda.h)
cufft.h (which includes cuda.h)
...
The most straightforward approach would then be to use jextract to 
generate JARs:
cuda.jar
cublas.jar (which depends on cuda.jar)
cufft.jar (which depends on cuda.jar)
But of course, each of the latter will contain all classes/functions 
that are also contained in the cuda.jar (only in different packages). 
I'm pretty sure that there are some ways to solve this technically 
(although using the "exclude-symbols" parameter is probably not feasible 
here, given that there are dozens of classes and possibly thousands of 
functions...)
What is the suggested solution for a scenario like this?

---

Some of the following points might be too specific to CUDA/CUBLAS, so 
they could be ignored:

---

6. Handling of #defines for functions.
NVIDIA used some trickery for migrating from their (old) CUBLAS API to a 
CUBLAS_v2 API. The header that is supposed to be included now is 
"cublas_v2.h". This "cublas_v2.h" header includes the "cublas_api.h" 
header. The latter contains function declarations like
     CUBLASAPI cublasStatus_t CUBLASWINAPI cublasSgemm_v2 (cublasHandle_t handle, ...)
But these function declarations are re-wired in the "cublas_v2.h" 
header by lines like
     #define cublasSgemm          cublasSgemm_v2
The functions that are generated by jextract are, of course, the ones 
with the "_v2" suffix.
I'm not sure whether this is an issue or just a natural consequence of 
the abuse of the preprocessor...
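
If one wanted the un-suffixed names on the Java side, the only option 
right now seems to be a hand-written facade that mirrors the #define. 
A hypothetical sketch (assuming that the generated cublasSgemm_v2 
returns the status as an int, like the other generated functions):

     // Java-side analogue of "#define cublasSgemm cublasSgemm_v2"
     static int cublasSgemm(cublas_api cublas, Pointer<cublasContext> handle,
         int transa, int transb, int m, int n, int k,
         Pointer<Float> alpha, Pointer<Float> A, int lda,
         Pointer<Float> B, int ldb, Pointer<Float> beta,
         Pointer<Float> C, int ldc)
     {
         return cublas.cublasSgemm_v2(handle, transa, transb, m, n, k,
             alpha, A, lda, B, ldb, beta, C, ldc);
     }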

---

7. Handling of typedefs
The CUBLAS header contains the following typedef:
     /* Opaque structure holding CUBLAS library context */
     struct cublasContext;
     typedef struct cublasContext *cublasHandle_t;
In the functions generated by jextract, the "cublasHandle_t" type 
arrives as
     Pointer<cublasContext>
This might be expected. One could argue for generating a dedicated 
type, like
     cublasHandle extends Pointer<cublasContext>
or so, just to retain the specific type information, but in many cases 
this might not be appropriate, so there's probably no silver bullet...
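
For illustration, a thin hand-written wrapper (hypothetical, this is 
not something jextract generates) could retain the type information, 
at the cost of unwrapping it for every call:

     // Hypothetical wrapper that preserves the "cublasHandle_t" typedef
     final class CublasHandle
     {
         final Pointer<cublasContext> pointer;

         CublasHandle(Pointer<cublasContext> pointer)
         {
             this.pointer = pointer;
         }
     }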

---

8. The handling of enums
One of the CUDA headers contains the following enum:
     enum __device_builtin__ cudaMemcpyKind
     {
         cudaMemcpyHostToHost          =   0,      /**< Host -> Host */
         cudaMemcpyHostToDevice        =   1,      /**< Host -> Device */
         cudaMemcpyDeviceToHost        =   2,      /**< Device -> Host */
         cudaMemcpyDeviceToDevice      =   3,      /**< Device -> Device */
         cudaMemcpyDefault             =   4       /**< Direction of the transfer is inferred from the pointer values. Requires unified virtual addressing */
     };
This seems to be translated into the following structure:
     public abstract interface driver_types
     {
       ...
       @NativeLocation(file="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v10.0\\include\\driver_types.h", line=917, column=25)
       @Target(ElementType.TYPE_USE)
       @Retention(RetentionPolicy.RUNTIME)
       public static @interface cudaMemcpyKind {}
     }
So there is seemingly no way to obtain the actual enum values...
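
The attached test therefore simply re-declares the values manually, 
which is brittle because they have to be kept in sync with the header:

     private static final int cudaMemcpyHostToDevice = 1;
     private static final int cudaMemcpyDeviceToHost = 2;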

---

9. Accessing GPU memory
This is VERY specific to CUDA, so just for info: Accessing GPU memory 
from the host (i.e. from Java in this case) is not possible. Attempting 
to do so causes this InternalError:
     About to access device memory from host
     Exception in thread "main" java.lang.IllegalStateException: 
java.lang.InternalError: a fault occurred in a recent unsafe memory 
access operation in compiled Java code
         at java.base/java.foreign.memory.Pointer.get(Pointer.java:165)
         at PanamaCudaTest.basicCublasTest(PanamaCudaTest.java:48)
         at PanamaCudaTest.main(PanamaCudaTest.java:17)
     Caused by: java.lang.InternalError: a fault occurred in a recent 
unsafe memory access operation in compiled Java code
         at 
java.base/jdk.internal.foreign.memory.References$OfFloat.getFloat(References.java:286)
         at java.base/java.foreign.memory.Pointer.get(Pointer.java:163)
         ... 2 more
THIS IS FINE! I guess. In a native program, trying to access GPU memory 
from the main C program causes a segfault and a nasty crash. I guess 
this is what the InternalError captures in general. So in fact, this is 
positive: it does NOT crash the VM and generate an "hs_err" log, but 
gracefully uses an exception to tell the programmer that he messed 
things up.
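
The lines that trigger this are still contained (commented out) in the 
attached file:

     // XXX Only a test to see what happens
     System.out.println("About to access device memory from host");
     Float xxx = d_A.get(); // d_A points to device memory - this faults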

---

Sorry for the wall of text. I'm sure that some of this has already been 
discussed and addressed. I'll try to catch up with the mailing list 
discussion as far as reasonably possible.
Best regards,
Marco Hutter





-------------- next part --------------
import java.foreign.Libraries;
import java.foreign.NativeTypes;
import java.foreign.Scope;
import java.foreign.memory.Array;
import java.foreign.memory.LayoutType;
import java.foreign.memory.Pointer;
import java.lang.invoke.MethodHandles;
import java.util.Random;

import org.jcuda.panama.cublas.cublas_api;
import org.jcuda.panama.cublas.cublas_api.cublasContext;
import org.jcuda.panama.cudart.cuda_runtime_api;

public class PanamaCudaTest
{
    public static void main(String[] args)
    {
        // basicCudaTest();
        // basicCublasTest();
        cublasSgemmTest();
    }

    private static final int cudaSuccess = 0;
    private static final int cudaMemcpyHostToDevice = 1;
    private static final int cudaMemcpyDeviceToHost = 2;

    private static void cublasSgemmTest()
    {
        try (Scope sc = Scope.newNativeScope())
        {
            cuda_runtime_api cudart =
                Libraries.bind(MethodHandles.lookup(), cuda_runtime_api.class);
            cublas_api cublas =
                Libraries.bind(MethodHandles.lookup(), cublas_api.class);

            // Create a CUBLAS handle
            Pointer<Pointer<cublasContext>> handlePointer =
                sc.allocate(LayoutType.ofStruct(cublasContext.class).pointer());
            check(cublas.cublasCreate_v2(handlePointer));
            Pointer<cublasContext> handle = handlePointer.get();

            float alpha = 0.3f;
            float beta = 0.7f;
            int n = 10;
            int nn = n * n;

            // Create input data on the host
            float A[] = createRandomFloatData(nn);
            float B[] = createRandomFloatData(nn);
            float C_initial[] = createRandomFloatData(nn);
            float C[] = C_initial.clone();
            Array<Float> arrayA = sc.allocateArray(NativeTypes.FLOAT, A);
            Array<Float> arrayB = sc.allocateArray(NativeTypes.FLOAT, B);
            Array<Float> arrayC = sc.allocateArray(NativeTypes.FLOAT, C);
            Pointer<Float> h_A = arrayA.elementPointer();
            Pointer<Float> h_B = arrayB.elementPointer();
            Pointer<Float> h_C = arrayC.elementPointer();

            // Allocate memory on the device
            Pointer<Float> d_A = allocateFloatDevicePointer(sc, cudart, nn);
            Pointer<Float> d_B = allocateFloatDevicePointer(sc, cudart, nn);
            Pointer<Float> d_C = allocateFloatDevicePointer(sc, cudart, nn);

            // Copy the memory from the host to the device
            cudart.cudaMemcpy(d_A, h_A, nn * Float.BYTES,
                cudaMemcpyHostToDevice);
            cudart.cudaMemcpy(d_B, h_B, nn * Float.BYTES,
                cudaMemcpyHostToDevice);
            cudart.cudaMemcpy(d_C, h_C, nn * Float.BYTES,
                cudaMemcpyHostToDevice);

            // Execute sgemm
            Array<Float> aAlpha =
                sc.allocateArray(NativeTypes.FLOAT, new float[]
                { alpha });
            Array<Float> aBeta = sc.allocateArray(NativeTypes.FLOAT, new float[]
            { beta });
            Pointer<Float> pAlpha = aAlpha.elementPointer();
            Pointer<Float> pBeta = aBeta.elementPointer();
            cublas.cublasSgemm_v2(handle, cublas.CUBLAS_OP_N(),
                cublas.CUBLAS_OP_N(), n, n, n, pAlpha, d_A, n, d_B, n, pBeta,
                d_C, n);

            cudart.cudaDeviceSynchronize();

            // Copy the result from the device to the host
            cudart.cudaMemcpy(h_C, d_C, nn * Float.BYTES,
                cudaMemcpyDeviceToHost);
            for (int i = 0; i < arrayC.length(); i++)
            {
                C[i] = arrayC.get(i);
            }

            // Compute the reference
            float C_ref[] = C_initial.clone();
            sgemmJava(n, alpha, A, B, beta, C_ref);
            System.out.println("Passed? " + equalByNorm(C, C_ref));

            // Clean up
            cudart.cudaFree(d_A);
            cudart.cudaFree(d_B);
            cudart.cudaFree(d_C);
            cublas.cublasDestroy_v2(handle);

            System.out.println("Done");
        }
    }
    
    private static void basicCublasTest()
    {
        try (Scope sc = Scope.newNativeScope())
        {

            cuda_runtime_api cudart =
                Libraries.bind(MethodHandles.lookup(), cuda_runtime_api.class);
            cublas_api cublas =
                Libraries.bind(MethodHandles.lookup(), cublas_api.class);

            // Create a CUBLAS handle
            Pointer<Pointer<cublasContext>> handlePointer =
                sc.allocate(LayoutType.ofStruct(cublasContext.class).pointer());
            check(cublas.cublasCreate_v2(handlePointer));
            Pointer<cublasContext> handle = handlePointer.get();

            // Create input data on the host
            int n = 5;
            float A[] = createRandomFloatData(n);
            Array<Float> arrayA = sc.allocateArray(NativeTypes.FLOAT, A);
            Pointer<Float> h_A = arrayA.elementPointer();

            // Allocate memory on the device
            Pointer<Float> d_A = allocateFloatDevicePointer(sc, cudart, n);

            // XXX Only a test to see what happens
            // System.out.println("About to access device memory from host");
            // Float xxx = d_A.get();

            // Copy the memory from the host to the device
            check(cudart.cudaMemcpy(d_A, h_A, n * Float.BYTES,
                cudaMemcpyHostToDevice));

            // Compute the dot product
            Array<Float> aDot =
                sc.allocateArray(NativeTypes.FLOAT, new float[1]);
            Pointer<Float> pDot = aDot.elementPointer();
            cublas.cublasSetPointerMode_v2(handle,
                cublas.CUBLAS_POINTER_MODE_HOST());
            cublas.cublasSdot_v2(handle, n, d_A, 1, d_A, 1, pDot);

            cudart.cudaDeviceSynchronize();
            System.out.println("Dot: " + aDot.get(0));

            // Clean up
            cudart.cudaFree(d_A);
            cublas.cublasDestroy_v2(handle);

            System.out.println("Done");
        }
    }
    
    private static void basicCudaTest()
    {
        try (Scope sc = Scope.newNativeScope())
        {
            cuda_runtime_api cudart =
                Libraries.bind(MethodHandles.lookup(), cuda_runtime_api.class);

            // Allocate 100 bytes of device memory and read back the
            // resulting device pointer
            Pointer<Pointer<Long>> pointerToPointer =
                sc.allocate(NativeTypes.LONG.pointer());
            long size = 100;
            cudart.cudaMalloc(pointerToPointer, size);
            Pointer<Long> pointer = pointerToPointer.get();

            System.out.println("Pointer " + pointer);

            cudart.cudaFree(pointer);
        }
    }
    
    // Allocates device memory for the given number of float elements
    // with cudaMalloc and returns the resulting device pointer
    private static Pointer<Float> allocateFloatDevicePointer(
        Scope sc, cuda_runtime_api cudart, long size)
    {
        Pointer<Pointer<Float>> pointerToPointer =
            sc.allocate(NativeTypes.FLOAT.pointer());
        check(cudart.cudaMalloc(pointerToPointer, size * Float.BYTES));
        Pointer<Float> pointer = pointerToPointer.get();
        return pointer;
    }

    private static int check(int returnCode)
    {
        if (returnCode != cudaSuccess)
        {
            System.out.println("Error " + returnCode);
            // new Exception().printStackTrace();
        }
        return returnCode;
    }

    public static float[] createRandomFloatData(int n)
    {
        Random random = new Random(0);
        float a[] = new float[n];
        for (int i = 0; i < n; i++)
        {
            a[i] = random.nextFloat();
        }
        return a;
    }

    // Reference implementation of C = alpha * A * B + beta * C for
    // square n x n matrices in column-major order (matching CUBLAS)
    private static void sgemmJava(int n, float alpha, float A[], float B[],
        float beta, float C[])
    {
        for (int i = 0; i < n; ++i)
        {
            for (int j = 0; j < n; ++j)
            {
                float prod = 0;
                for (int k = 0; k < n; ++k)
                {
                    prod += A[k * n + i] * B[j * n + k];
                }
                C[j * n + i] = alpha * prod + beta * C[j * n + i];
            }
        }
    }

    public static boolean equalByNorm(float result[], float reference[])
    {
        if (result == null)
        {
            throw new NullPointerException("The result is null");
        }
        if (reference == null)
        {
            throw new NullPointerException("The reference is null");
        }
        if (result.length != reference.length)
        {
            throw new IllegalArgumentException(
                "The result and reference array have different lengths: "
                    + result.length + " and " + reference.length);
        }
        final float epsilon = 1e-6f;
        float errorNorm = 0;
        float refNorm = 0;
        for (int i = 0; i < result.length; ++i)
        {
            // Accumulate the squared error norm and the squared
            // reference norm
            float diff = reference[i] - result[i];
            errorNorm += diff * diff;
            refNorm += reference[i] * reference[i];
        }
        errorNorm = (float) Math.sqrt(errorNorm);
        refNorm = (float) Math.sqrt(refNorm);
        if (Math.abs(refNorm) < epsilon)
        {
            return false;
        }
        return (errorNorm / refNorm < epsilon);
    }

}

