Building CUDA bindings for Windows with jextract
Marco Hutter
panama at jcuda.org
Mon Feb 18 18:00:01 UTC 2019
Hello,
I used the latest early access build (2019/2/12) to generate
(experimental) bindings for CUDA on Windows with jextract. This was only
a basic test, but it worked quite well. I could create an example of a
GEMM computation with CUBLAS, the CUDA BLAS implementation. Further and
more sophisticated tests may follow. I'll describe what I did to test
this, and below you'll find a list of possible issues (or at least
discussion points).
---------------------------------------------------------------------------
Some context: CUBLAS is a GPU-based BLAS library, built on top of CUDA.
In order to use CUBLAS, one needs some basic functions from CUDA, mainly
for memory management. There are some further caveats, e.g. the
distinction between the "CUDA Runtime API" and the "CUDA Driver API", or
the difference between CUBLAS and CUBLAS_v2, but I'll omit some of these
details here. What's relevant: There are two library bindings involved.
One for the CUDA Runtime API, and one for CUBLAS itself.
I packed the jextract calls into .BAT files that can be run at the
standard Windows command prompt. Apologies for the odd, hard-coded paths
- this is also addressed in the issue list below.
The call for generating the CUDA Runtime API bindings:
jextract.exe^
-L "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.0/lib/x64/"^
-I "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.0/include"^
-l cudart64_100.dll^
-t org.jcuda.panama.cudart^
--record-library-path "C:/Program Files/NVIDIA GPU Computing
Toolkit/CUDA/v10.0/include/cuda_runtime.h"^
-o cudart.jar^
--log INFO
The call for generating the CUBLAS bindings:
jextract.exe^
-L "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.0/lib/x64/"^
-I "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.0/include"^
-l cublas64_100^
-t org.jcuda.panama.cublas^
--record-library-path "C:/Program Files/NVIDIA GPU Computing
Toolkit/CUDA/v10.0/include/cublas.h"^
-o cublas.jar^
--log INFO
With the resulting "cudart.jar" and "cublas.jar", it was possible to run
the basic tests from the attached file. (It's a bit unstructured and
thrown together, but it is only meant to show the basic usage.)
A high-level description of doing a GEMM (matrix multiplication) with
CUBLAS is the following (a condensed code sketch is shown after the list):
- The matrices are given as 1D Java float[] arrays
- CUDA: Allocate "device memory" (GPU memory) for the matrices
- CUDA: Copy the matrices from Java into the device memory
- CUBLAS: Perform the GEMM
- CUDA: Copy the result matrix from the device memory back to Java
- Profit
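The condensed sketch below shows how this maps to the Panama API, using
the bindings generated above. It is only a stripped-down variant of the
attached test: the method name is made up, error checks and the result
verification are omitted, and it reuses the allocateFloatDevicePointer
helper and the cudaMemcpy* constants from the attached file.

// Condensed sketch: C = alpha*A*B + beta*C for n x n matrices given as
// 1D float[] arrays in column-major order
static void sgemmWithCublas(int n, float alpha, float A[], float B[],
    float beta, float C[])
{
    try (Scope sc = Scope.newNativeScope())
    {
        cuda_runtime_api cudart =
            Libraries.bind(MethodHandles.lookup(), cuda_runtime_api.class);
        cublas_api cublas =
            Libraries.bind(MethodHandles.lookup(), cublas_api.class);

        // Create a CUBLAS handle
        Pointer<Pointer<cublasContext>> handlePointer =
            sc.allocate(LayoutType.ofStruct(cublasContext.class).pointer());
        cublas.cublasCreate_v2(handlePointer);
        Pointer<cublasContext> handle = handlePointer.get();

        // Copy the Java arrays into native (host) memory
        int nn = n * n;
        Pointer<Float> h_A = sc.allocateArray(NativeTypes.FLOAT, A).elementPointer();
        Pointer<Float> h_B = sc.allocateArray(NativeTypes.FLOAT, B).elementPointer();
        Array<Float> arrayC = sc.allocateArray(NativeTypes.FLOAT, C);
        Pointer<Float> h_C = arrayC.elementPointer();

        // CUDA: allocate device memory and copy the matrices to the device
        Pointer<Float> d_A = allocateFloatDevicePointer(sc, cudart, nn);
        Pointer<Float> d_B = allocateFloatDevicePointer(sc, cudart, nn);
        Pointer<Float> d_C = allocateFloatDevicePointer(sc, cudart, nn);
        cudart.cudaMemcpy(d_A, h_A, nn * Float.BYTES, cudaMemcpyHostToDevice);
        cudart.cudaMemcpy(d_B, h_B, nn * Float.BYTES, cudaMemcpyHostToDevice);
        cudart.cudaMemcpy(d_C, h_C, nn * Float.BYTES, cudaMemcpyHostToDevice);

        // CUBLAS: perform the GEMM
        Pointer<Float> pAlpha =
            sc.allocateArray(NativeTypes.FLOAT, new float[] { alpha }).elementPointer();
        Pointer<Float> pBeta =
            sc.allocateArray(NativeTypes.FLOAT, new float[] { beta }).elementPointer();
        cublas.cublasSgemm_v2(handle, cublas.CUBLAS_OP_N(), cublas.CUBLAS_OP_N(),
            n, n, n, pAlpha, d_A, n, d_B, n, pBeta, d_C, n);
        cudart.cudaDeviceSynchronize();

        // CUDA: copy the result back into the Java array
        cudart.cudaMemcpy(h_C, d_C, nn * Float.BYTES, cudaMemcpyDeviceToHost);
        for (int i = 0; i < nn; i++)
        {
            C[i] = arrayC.get(i);
        }

        // Clean up
        cudart.cudaFree(d_A);
        cudart.cudaFree(d_B);
        cudart.cudaFree(d_C);
        cublas.cublasDestroy_v2(handle);
    }
}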
---------------------------------------------------------------------------
Issues/Discussion points:
1. There are the usual (minor) Windows-vs-Linux quirks.
- The Windows path separator is "\", but it can simply be replaced with "/"
- The line continuation character in .BAT files is not "\", but "^"
- Having the full path in the jextract call, including the "C:/Program
Files"... part, looks a bit odd, but I don't think there is another
reasonable solution for that. There is no fixed install/include/library
directory otherwise.
---
2. The handling of the actual native library name
(This might not be immediately relevant or specific to Panama, but it is
worth mentioning.)
Initially, I thought that the "-l" parameter required the name of the
".lib" file (e.g. "cublas.lib"), but now I know that it is the name of
the actual library, which is to be passed to System.loadLibrary
eventually. The actual DLL for CUBLAS is called "cublas64_100.dll". So
thanks to the System.loadLibrary magic (which appends the ".dll" part),
the name has to be given as
-l cublas64_100
I know that this is different under Linux and Mac. I've been there:
https://github.com/jcuda/jcuda/blob/master/JCudaJava/src/main/java/jcuda/LibUtils.java
And things like the "library.so.2.1" version numbering and symlinks, or
the RPATH handling on Mac, always cause headaches here and there...
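As an aside, System.mapLibraryName shows which decoration
System.loadLibrary applies on the current platform, which is a quick way
to double-check the value for "-l" (a tiny standalone check, unrelated
to jextract itself):

public class LibNameCheck
{
    public static void main(String[] args)
    {
        // Prints "cublas64_100.dll" on Windows; on Linux this would be
        // "libcublas64_100.so", on Mac "libcublas64_100.dylib"
        System.out.println(System.mapLibraryName("cublas64_100"));
    }
}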
---
3. Sensible toString implementations on some classes could be nice.
E.g. printing a "BoundedPointer" involves
- jdk.internal.foreign.memory.LayoutTypeImpl@cb644e
- jdk.internal.foreign.memory.MemoryBoundInfo$1@13805618
---
4. Examining the results (API) of jextract
A successful jextract call creates a JAR, which is somewhat "opaque":
for CUBLAS, roughly 50 classes are generated, with names like "cublas",
"cublas_h", "cublas_api", "cublas_api_h", and it's hard to get an
initial grip on that. One can drop the JAR into an IDE and rely on code
completion; I dropped the JAR into a decompiler to get an overview.
I'm not sure how this can sensibly be addressed, though. In some cases,
and to some extent, it is possible to extract comments from the header
files and process them into JavaDoc, but this is brittle (and we all
know how well most native libraries are commented. Doxygen is a luxury
here...)
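To get a first overview without a decompiler, it already helps to
simply list the class entries of the generated JAR. A small (made-up)
helper for that could look like this:

import java.util.jar.JarEntry;
import java.util.jar.JarFile;

public class ListGeneratedClasses
{
    public static void main(String[] args) throws Exception
    {
        // Print the names of all classes that jextract put into the JAR
        try (JarFile jar = new JarFile("cublas.jar"))
        {
            jar.stream()
                .map(JarEntry::getName)
                .filter(name -> name.endsWith(".class"))
                .forEach(System.out::println);
        }
    }
}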
---
5. Duplicate classes
As mentioned above, CUBLAS is a library that is built on top of CUDA.
There are other libraries built on CUDA, like CUFFT for FFT, and many
more. Therefore, there are basically headers
cuda.h
cublas.h (which includes cuda.h)
cufft.h (which includes cuda.h)
...
The most straightforward approach would then be to use jextract to
generate JARs:
cuda.jar
cublas.jar (which depends on cuda.jar)
cufft.jar (which depends on cuda.jar)
But of course, each of the latter will contain all classes/functions
that are also contained in the cuda.jar (only in different packages).
I'm pretty sure that there are some ways to solve this technically
(although using the "exclude-symbols" parameter is probably not feasible
here, given that there are dozens of classes and possibly thousands of
functions...)
What is the suggested solution for a scenario like this?
---
Some of the following points might be too specific to CUDA/CUBLAS, so
they could be ignored:
---
6. Handling of #defines for functions.
NVIDIA used some trickery for migrating from their (old) CUBLAS API to a
CUBLAS_v2 API. The header that is supposed to be included now is
"cublas_v2.h". This "cublas_v2.h" header includes the "cublas_api.h"
header. The latter contains the function declarations like
CUBLASAPI cublasStatus_t CUBLASWINAPI cublasSgemm_v2 (cublasHandle_t handle, ...)
But these function declarations are re-wired in the "cublas_v2.h"
header, by lines like
#define cublasSgemm cublasSgemm_v2
The functions that are generated by jextract are, of course, the ones
with the "_v2" suffix.
I'm not sure whether this is an issue or just a natural consequence of
the abuse of the preprocessor...
---
7. Handling of typedefs
The CUBLAS header contains the following typedef:
/* Opaque structure holding CUBLAS library context */
struct cublasContext;
typedef struct cublasContext *cublasHandle_t;
In the functions generated by jextract, this "cublasHandle_t" type
arrives as
Pointer<cublasContext>
This might be expected. One could argue for generating a dedicated type, like
cublasHandle extends Pointer<cublasContext>
or so, just to retain the specific type information, but in many cases
this might not be appropriate, so there's probably no silver bullet...
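For reference, this is how the mapping currently looks in client code
(excerpt from the attached test): the handle is obtained by allocating a
pointer-to-pointer, passing it to cublasCreate_v2 and dereferencing it.

// Excerpt from the attached test: cublasHandle_t arrives as
// Pointer<cublasContext>
Pointer<Pointer<cublasContext>> handlePointer =
    sc.allocate(LayoutType.ofStruct(cublasContext.class).pointer());
cublas.cublasCreate_v2(handlePointer);
Pointer<cublasContext> handle = handlePointer.get();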
---
8. The handling of enums
One of the CUDA headers contains the following enum:
enum __device_builtin__ cudaMemcpyKind
{
cudaMemcpyHostToHost = 0, /**< Host -> Host */
cudaMemcpyHostToDevice = 1, /**< Host -> Device */
cudaMemcpyDeviceToHost = 2, /**< Device -> Host */
cudaMemcpyDeviceToDevice = 3, /**< Device -> Device */
cudaMemcpyDefault = 4 /**< Direction of the transfer is inferred from the pointer values. Requires unified virtual addressing */
};
This seems to be translated into the following structure:
public abstract interface driver_types
{
...
@NativeLocation(file="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v10.0\\include\\driver_types.h", line=917, column=25)
@Target(ElementType.TYPE_USE)
@Retention(RetentionPolicy.RUNTIME)
public static @interface cudaMemcpyKind {}
}
So there is seemingly no way to obtain the actual enum values...
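As a workaround, the attached test simply re-declares the values that it
needs as plain int constants, copied manually from the header:

// Copied manually from the CUDA headers, since the generated
// annotation interface does not carry the enum values
private static final int cudaSuccess = 0;
private static final int cudaMemcpyHostToDevice = 1;
private static final int cudaMemcpyDeviceToHost = 2;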
---
9. Accessing GPU memory
This is VERY specific to CUDA, so just for info: Accessing GPU memory
from the host (i.e. from Java in this case) should not be possible.
Attempting to do so will cause this InternalError:
About to access device memory from host
Exception in thread "main" java.lang.IllegalStateException: java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code
    at java.base/java.foreign.memory.Pointer.get(Pointer.java:165)
    at PanamaCudaTest.basicCublasTest(PanamaCudaTest.java:48)
    at PanamaCudaTest.main(PanamaCudaTest.java:17)
Caused by: java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code
    at java.base/jdk.internal.foreign.memory.References$OfFloat.getFloat(References.java:286)
    at java.base/java.foreign.memory.Pointer.get(Pointer.java:163)
    ... 2 more
THIS IS FINE! I guess. In a native program, trying to access GPU memory
from the main C program causes a segfault and a nasty crash. I guess
this is what is captured with the InternalError in general. So in fact,
this is positive: It does NOT crash the VM and generate an "hs_err" log,
but gracefully uses an exception to tell the programmer that they messed
things up.
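For reference, the probe that triggered this is still contained
(commented out) in the attached test; it simply dereferences a device
pointer from the host:

// d_A points to device (GPU) memory obtained via cudaMalloc.
// Dereferencing it from the host raises the InternalError shown
// above instead of crashing the VM.
Pointer<Float> d_A = allocateFloatDevicePointer(sc, cudart, n);
System.out.println("About to access device memory from host");
Float xxx = d_A.get();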
---
Sorry for the wall of text. I'm sure that some of this has already been
discussed and addressed. I'll try to catch up with the mailing list
discussion as far as reasonably possible.
Best regards,
Marco Hutter
-------------- next part --------------
import java.foreign.Libraries;
import java.foreign.NativeTypes;
import java.foreign.Scope;
import java.foreign.memory.Array;
import java.foreign.memory.LayoutType;
import java.foreign.memory.Pointer;
import java.lang.invoke.MethodHandles;
import java.util.Random;
import org.jcuda.panama.cublas.cublas_api;
import org.jcuda.panama.cublas.cublas_api.cublasContext;
import org.jcuda.panama.cudart.cuda_runtime_api;
public class PanamaCudaTest
{
    public static void main(String[] args)
    {
        // basicCudaTest();
        // basicCublasTest();
        cublasSgemmTest();
    }

    private static final int cudaSuccess = 0;
    private static final int cudaMemcpyHostToDevice = 1;
    private static final int cudaMemcpyDeviceToHost = 2;

    private static void cublasSgemmTest()
    {
        try (Scope sc = Scope.newNativeScope())
        {
            cuda_runtime_api cudart =
                Libraries.bind(MethodHandles.lookup(), cuda_runtime_api.class);
            cublas_api cublas =
                Libraries.bind(MethodHandles.lookup(), cublas_api.class);

            // Create a CUBLAS handle
            Pointer<Pointer<cublasContext>> handlePointer =
                sc.allocate(LayoutType.ofStruct(cublasContext.class).pointer());
            cublas.cublasCreate_v2(handlePointer);
            Pointer<cublasContext> handle = handlePointer.get();

            float alpha = 0.3f;
            float beta = 0.7f;
            int n = 10;
            int nn = n * n;

            // Create input data on the host
            float A[] = createRandomFloatData(nn);
            float B[] = createRandomFloatData(nn);
            float C_initial[] = createRandomFloatData(nn);
            float C[] = C_initial.clone();
            Array<Float> arrayA = sc.allocateArray(NativeTypes.FLOAT, A);
            Array<Float> arrayB = sc.allocateArray(NativeTypes.FLOAT, B);
            Array<Float> arrayC = sc.allocateArray(NativeTypes.FLOAT, C);
            Pointer<Float> h_A = arrayA.elementPointer();
            Pointer<Float> h_B = arrayB.elementPointer();
            Pointer<Float> h_C = arrayC.elementPointer();

            // Allocate memory on the device
            Pointer<Float> d_A = allocateFloatDevicePointer(sc, cudart, nn);
            Pointer<Float> d_B = allocateFloatDevicePointer(sc, cudart, nn);
            Pointer<Float> d_C = allocateFloatDevicePointer(sc, cudart, nn);

            // Copy the memory from the host to the device
            cudart.cudaMemcpy(d_A, h_A, nn * Float.BYTES, cudaMemcpyHostToDevice);
            cudart.cudaMemcpy(d_B, h_B, nn * Float.BYTES, cudaMemcpyHostToDevice);
            cudart.cudaMemcpy(d_C, h_C, nn * Float.BYTES, cudaMemcpyHostToDevice);

            // Execute sgemm
            Array<Float> aAlpha = sc.allocateArray(NativeTypes.FLOAT, new float[] { alpha });
            Array<Float> aBeta = sc.allocateArray(NativeTypes.FLOAT, new float[] { beta });
            Pointer<Float> pAlpha = aAlpha.elementPointer();
            Pointer<Float> pBeta = aBeta.elementPointer();
            cublas.cublasSgemm_v2(handle, cublas.CUBLAS_OP_N(), cublas.CUBLAS_OP_N(),
                n, n, n, pAlpha, d_A, n, d_B, n, pBeta, d_C, n);
            cudart.cudaDeviceSynchronize();

            // Copy the result from the device to the host
            cudart.cudaMemcpy(h_C, d_C, nn * Float.BYTES, cudaMemcpyDeviceToHost);
            for (int i = 0; i < arrayC.length(); i++)
            {
                C[i] = arrayC.get(i);
            }

            // Compute the reference
            float C_ref[] = C_initial.clone();
            sgemmJava(n, alpha, A, B, beta, C_ref);
            System.out.println("Passed? " + equalByNorm(C, C_ref));

            // Clean up
            cudart.cudaFree(d_A);
            cudart.cudaFree(d_B);
            cudart.cudaFree(d_C);
            cublas.cublasDestroy_v2(handle);
            System.out.println("Done");
        }
    }
    private static void basicCublasTest()
    {
        try (Scope sc = Scope.newNativeScope())
        {
            cuda_runtime_api cudart =
                Libraries.bind(MethodHandles.lookup(), cuda_runtime_api.class);
            cublas_api cublas =
                Libraries.bind(MethodHandles.lookup(), cublas_api.class);

            // Create a CUBLAS handle
            Pointer<Pointer<cublasContext>> handlePointer =
                sc.allocate(LayoutType.ofStruct(cublasContext.class).pointer());
            check(cublas.cublasCreate_v2(handlePointer));
            Pointer<cublasContext> handle = handlePointer.get();

            // Create input data on the host
            int n = 5;
            float A[] = createRandomFloatData(n);
            Array<Float> arrayA = sc.allocateArray(NativeTypes.FLOAT, A);
            Pointer<Float> h_A = arrayA.elementPointer();

            // Allocate memory on the device
            Pointer<Float> d_A = allocateFloatDevicePointer(sc, cudart, n);

            // XXX Only a test to see what happens
            // System.out.println("About to access device memory from host");
            // Float xxx = d_A.get();

            // Copy the memory from the host to the device
            check(cudart.cudaMemcpy(d_A, h_A, n * Float.BYTES, cudaMemcpyHostToDevice));

            // Compute the dot product
            Array<Float> aDot = sc.allocateArray(NativeTypes.FLOAT, new float[1]);
            Pointer<Float> pDot = aDot.elementPointer();
            cublas.cublasSetPointerMode_v2(handle, cublas.CUBLAS_POINTER_MODE_HOST());
            cublas.cublasSdot_v2(handle, n, d_A, 1, d_A, 1, pDot);
            cudart.cudaDeviceSynchronize();
            System.out.println("Dot: " + aDot.get(0));

            // Clean up
            cudart.cudaFree(d_A);
            cublas.cublasDestroy_v2(handle);
            System.out.println("Done");
        }
    }
    private static void basicCudaTest()
    {
        try (Scope sc = Scope.newNativeScope())
        {
            cuda_runtime_api cudart =
                Libraries.bind(MethodHandles.lookup(), cuda_runtime_api.class);
            Pointer<Pointer<Long>> pointerToPointer =
                sc.allocate(NativeTypes.LONG.pointer());
            long size = 100;
            cudart.cudaMalloc(pointerToPointer, size);
            Pointer<Long> pointer = pointerToPointer.get();
            System.out.println("Pointer " + pointer);
            cudart.cudaFree(pointer);
        }
    }

    private static Pointer<Float> allocateFloatDevicePointer(
        Scope sc, cuda_runtime_api cudart, long size)
    {
        Pointer<Pointer<Float>> pointerToPointer =
            sc.allocate(NativeTypes.FLOAT.pointer());
        check(cudart.cudaMalloc(pointerToPointer, size * Float.BYTES));
        Pointer<Float> pointer = pointerToPointer.get();
        return pointer;
    }
    private static int check(int returnCode)
    {
        if (returnCode != cudaSuccess)
        {
            System.out.println("Error " + returnCode);
            // new Exception().printStackTrace();
        }
        return returnCode;
    }

    public static float[] createRandomFloatData(int n)
    {
        Random random = new Random(0);
        float a[] = new float[n];
        for (int i = 0; i < n; i++)
        {
            a[i] = random.nextFloat();
        }
        return a;
    }

    private static void sgemmJava(int n, float alpha, float A[], float B[],
        float beta, float C[])
    {
        for (int i = 0; i < n; ++i)
        {
            for (int j = 0; j < n; ++j)
            {
                float prod = 0;
                for (int k = 0; k < n; ++k)
                {
                    prod += A[k * n + i] * B[j * n + k];
                }
                C[j * n + i] = alpha * prod + beta * C[j * n + i];
            }
        }
    }
    public static boolean equalByNorm(float result[], float reference[])
    {
        if (result == null)
        {
            throw new NullPointerException("The result is null");
        }
        if (reference == null)
        {
            throw new NullPointerException("The reference is null");
        }
        if (result.length != reference.length)
        {
            throw new IllegalArgumentException(
                "The result and reference array have different lengths: "
                + result.length + " and " + reference.length);
        }
        final float epsilon = 1e-6f;
        float errorNorm = 0;
        float refNorm = 0;
        for (int i = 0; i < result.length; ++i)
        {
            float diff = reference[i] - result[i];
            errorNorm += diff * diff;
            // Accumulate the squared L2 norm of the reference data
            refNorm += reference[i] * reference[i];
        }
        errorNorm = (float) Math.sqrt(errorNorm);
        refNorm = (float) Math.sqrt(refNorm);
        if (Math.abs(refNorm) < epsilon)
        {
            return false;
        }
        return (errorNorm / refNorm < epsilon);
    }
}