Building CUDA bindings for Windows with jextract

Maurizio Cimadamore maurizio.cimadamore at oracle.com
Mon Feb 18 18:35:07 UTC 2019


Hi Marco,
many thanks for giving Panama a serious try and for putting together the 
detailed feedback below. Some comments inline:

On 18/02/2019 18:00, Marco Hutter wrote:
> Hello,
>
> I used the latest early access build (2019/2/12) to generate 
> (experimental) bindings for CUDA on Windows with jextract. This was 
> only a basic test, but it worked quite well. I could create an example 
> of a GEMM computation with CUBLAS, the CUDA BLAS implementation. 
> Further and more sophisticated tests may follow. I'll describe what I 
> did to test this, and below you'll find a list of possible issues (or 
> at least discussion points).
>
> --------------------------------------------------------------------------- 
>
>
> Some context: CUBLAS is a GPU-based BLAS library, built on top of 
> CUDA. In order to use CUBLAS, one needs some basic functions from 
> CUDA, mainly for the memory management. There are some further 
> caveats, e.g. the distinction between the "CUDA Runtime API" and the 
> "CUDA Driver API", or the difference between CUBLAS and CUBLAS_v2, but 
> I'll omit some of these details here. What's relevant: There are two 
> library bindings involved. One for the CUDA Runtime API, and one for 
> CUBLAS itself.
>
> I packed the jextract calls into .BAT files that can be run at the 
> standard Windows command prompt. Apologies for the odd, hard-coded 
> paths - this is also addressed in the issue list below.
>
> The call for generating the CUDA Runtime API bindings:
>
> jextract.exe^
>   -L "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.0/lib/x64/"^
>   -I "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.0/include"^
>   -l cudart64_100.dll^
>   -t org.jcuda.panama.cudart^
>   --record-library-path "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.0/include/cuda_runtime.h"^
>   -o cudart.jar^
>   --log INFO
>
> The call for generating the CUBLAS bindings:
>
> jextract.exe^
>   -L "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.0/lib/x64/"^
>   -I "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.0/include"^
>   -l cublas64_100^
>   -t org.jcuda.panama.cublas^
>   --record-library-path "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.0/include/cublas.h"^
>   -o cublas.jar^
>   --log INFO
>
>
> With the resulting "cudart.jar" and "cublas.jar", it was possible to 
> run the basic tests from the attached file. (It's a bit unstructured 
> and thrown together, but it's only meant to show the basic usage.)
>
> A high-level description of doing a GEMM (matrix multiplication) with 
> CUBLAS is:
> - Given are the matrices, as 1D Java float[] arrays
> - CUDA: Allocate "device memory" (GPU memory) for the matrices
> - CUDA: Copy the matrices from Java into the device memory
> - CUBLAS: Perform the GEMM
> - CUDA: Copy the result matrix from the device memory back to Java
> - Profit
>
>
> --------------------------------------------------------------------------- 
>
> Issues/Discussion points:
>
> 1. There are the usual (minor) Windows-vs-Linux quirks.
> - The Windows path separator is "\", but it can simply be replaced with "/"
> - The line continuation character in .BAT files is not "\", but "^"
> - Having the full path in the jextract call, including the "C:/Program 
> Files"... part, looks a bit odd, but I don't think there is another 
> reasonable solution for that. There is no fixed 
> install/include/library directory otherwise.


Is this an issue (as in 'bug') or is this just a note on how to read the 
jextract command line above? Is there an alternate command line that you 
would have liked better but that doesn't work?

As for omitting the "Program Files..." part: yes, that's not possible, 
since there's really no default path here.


>
> ---
>
> 2. The handling of the actual native library name
> (This might not be immediately relevant or specific for Panama, but 
> worth mentioning: )
> Initially, I thought that the "-l" parameter required the name of the 
> ".lib" file (e.g. "cublas.lib"), but now I know that it is the name of 
> the actual library, which is to be passed to System.loadLibrary 
> eventually. The actual DLL for CUBLAS is called "cublas64_100.dll". So 
> thanks to the System.loadLibrary magic (which appends the ".dll" 
> part), the name has to be given as
> -l cublas64_100
> I know that this is different under Linux and Mac. I've been there: 
> https://github.com/jcuda/jcuda/blob/master/JCudaJava/src/main/java/jcuda/LibUtils.java 
> . And things like the "library.so.2.1" version numbering and symlinks, 
> or this RPATH thingy on Mac always cause headaches here and there...

Yep - library loading has two modes:

1) library name - e.g. "cublas" - which is expanded into a platform 
shared library name (e.g. libcublas.so on Linux) and passed to 
System.loadLibrary
2) full absolute path to the library, passed to System.load

If you use (1), the JDK will fill in some details (e.g. add the 'lib' 
prefix on Linux, add an extension) and then expect an _exact_ match. If 
that fails, you get an error.
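
For reference, a minimal sketch of the two modes in plain Java (the 
paths below are illustrative):

    // Mode 1: library name; the JDK adds the platform-specific pieces
    // ("cublas64_100" -> cublas64_100.dll on Windows, libcublas64_100.so on Linux)
    System.loadLibrary("cublas64_100");

    // Mode 2: full absolute path, taken verbatim (illustrative location)
    System.load("C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.0/bin/cublas64_100.dll");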

The escape hatch for weird names is to use a full path - but that's not 
what jextract currently does. Sundar and I chatted a bit about this, and 
I think the path forward here would be to allow the "-l" option to take 
a full path, in addition to a library name as it does now. That would 
address your issue and avoid the rename.

>
> ---
>
> 3. Sensible toString implementations on some classes could be nice.
> E.g. printing a "BoundedPointer" yields output like
> - jdk.internal.foreign.memory.LayoutTypeImpl@cb644e
> - jdk.internal.foreign.memory.MemoryBoundInfo$1@13805618
We can do that.
>
> ---
>
> 4. Examining the results (API) of jextract
> The proper jextract call creates a JAR. This is somewhat "opaque". 
> E.g. for CUBLAS, there are roughly 50 classes generated, involving 
> names like "cublas", "cublas_h", "cublas_api", "cublas_api_h", and 
> it's hard to get an initial grip on that. One can drop the JAR into an 
> IDE and rely on the code completion. I dropped the JAR into a 
> decompiler, to get an overview.
> I'm not sure how this can sensibly be addressed, though. In some 
> cases, and to some extent, it is possible to obtain some comments from 
> header files and process them to generate JavaDoc, but this is 
> brittle (and we all know how well most native libraries are commented; 
> Doxygen is a luxury here...)

The main issue here would be addressed by what we have been calling the 
'library-centric' approach - that is, instead of generating many separate 
classes, generate a single root class which has all the required member 
functions (the ones that appear in the shared library), and have all 
required dependencies added in as inner classes. That would indicate more 
clearly where you have to look.

Note also that, for the purpose of looking inside the generated classes 
(without using a jar), you can also run jextract with "-d <dirname>" to 
output the classes into a folder, uncompressed.
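
For example, adapting the CUBLAS command above (a sketch only - the 
exact combination of options may need adjusting):

jextract.exe^
  -L "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.0/lib/x64/"^
  -I "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.0/include"^
  -l cublas64_100^
  -t org.jcuda.panama.cublas^
  -d cublas-classes^
  "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.0/include/cublas.h"

(Here "cublas-classes" is just an illustrative output directory name.)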

>
> ---
>
> 5. Duplicate classes
> As mentioned above, CUBLAS is a library that is built on top of CUDA. 
> There are other libraries built on CUDA, like CUFFT for FFT, and many 
> more. Therefore, there are basically headers
> cuda.h
> cublas.h (which includes cuda.h)
> cufft.h (which includes cuda.h)
> ...
> The most straightforward approach would then be to use jextract to 
> generate JARs:
> cuda.jar
> cublas.jar (which depends on cuda.jar)
> cufft.jar (which depends on cuda.jar)
> But of course, each of the latter will contain all classes/functions 
> that are also contained in the cuda.jar (only in different packages). 
> I'm pretty sure that there are some ways to solve this technically 
> (although using the "exclude-symbols" parameter is probably not 
> feasible here, given that there are dozens of classes and possibly 
> thousands of functions...)
> What is the suggested solution for a scenario like this?

The long-term solution would be the ability to reuse jextract runs by 
pointing jextract at a previously extracted library.

That said, I believe you should be able to extract all three libraries 
in a single shot, by giving the three headers as input to jextract (and 
the three libraries...); that will generate only one version of 
everything. I used this approach for OpenGL which also relies on a 
number of dependent headers - and generated a single jar.
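
For example, a single-shot invocation might look roughly like this 
(untested sketch - the target package, the CUFFT library name and the 
handling of multiple headers/libraries are assumptions):

jextract.exe^
  -L "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.0/lib/x64/"^
  -I "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.0/include"^
  -l cudart64_100^
  -l cublas64_100^
  -l cufft64_100^
  -t org.jcuda.panama.cuda^
  -o cuda-all.jar^
  "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.0/include/cuda_runtime.h"^
  "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.0/include/cublas.h"^
  "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.0/include/cufft.h"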

>
> ---
>
> Some of the following might be too specific for CUDA/CUBLAS, so could 
> be ignored:
>
> ---
>
> 6. Handling of #defines for functions.
> NVIDIA used some trickery for migrating from their (old) CUBLAS API to 
> a CUBLAS_v2 API. The header that is supposed to be included now is 
> "cublas_v2.h". This "cublas_v2.h" header includes the "cublas_api.h" 
> header. The latter contains the function declarations like
> CUBLASAPI cublasStatus_t CUBLASWINAPI cublasSgemm_v2 (cublasHandle_t handle, ...)
> But these function declarations are re-wired in the "cublas_v2.h" 
> header, by lines like
> #define cublasSgemm          cublasSgemm_v2
> The functions that are generated by jextract are, of course, the ones 
> with the "_v2" suffix.
> I'm not sure whether this is an issue or just a natural consequence of 
> the abuse of the preprocessor...
>
Again, I believe this situation will be much improved when we move 
from a header-centric view (which of course exposes all sorts of 
brittleness) towards a more library-centric view of the extraction process.
> ---
>
> 7. Handling of typedefs
> The CUBLAS header contains the following typedef:
>     /* Opaque structure holding CUBLAS library context */
>     struct cublasContext;
>     typedef struct cublasContext *cublasHandle_t;
> In the functions generated by jextract, the "cublasHandle_t" type 
> arrives as
>     Pointer<cublasContext>
> This might be expected. One could argue for generating a dedicated type, like
>     cublasHandle extends Pointer<cublasContext>
> or so, just to retain the specific type information, but for many 
> cases, this might not be appropriate, so there's probably no silver 
> bullet...

I think this is the best we can do in the 'raw' binding mode. This is 
something you would want to handle with more control over the extraction 
process - e.g. by defining special treatment for certain API points. We 
will likely get there with a jextract plugin/API.

In this specific case, the 'civilizer' could define a type called 
CublasHandle and a pair of getter/setter method handles which tell the 
binder how to marshal/unmarshal values of that type. If CublasHandle is 
just a thin wrapper around a Panama pointer (and if we eventually get 
value types from Valhalla), it will be easier to make this both 
expressive and efficient.
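
To illustrate, a minimal sketch of such a wrapper, assuming the 
jextract-generated cublasContext carrier and java.foreign.memory.Pointer 
(the class name and its shape are hypothetical - this is not what 
jextract generates today):

    // Hypothetical 'civilized' wrapper around the raw binding type.
    final class CublasHandle {
        private final Pointer<cublasContext> ptr;

        CublasHandle(Pointer<cublasContext> ptr) {
            this.ptr = ptr;         // wrap the raw Panama pointer
        }

        Pointer<cublasContext> raw() {
            return ptr;             // unwrap when calling the raw bindings
        }
    }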

>
> ---
>
> 8. The handling of enums
> One of the CUDA headers contains the following enum:
>     enum __device_builtin__ cudaMemcpyKind
>     {
>         cudaMemcpyHostToHost          =   0,      /**< Host -> Host */
>         cudaMemcpyHostToDevice        =   1,      /**< Host -> Device */
>         cudaMemcpyDeviceToHost        =   2,      /**< Device -> Host */
>         cudaMemcpyDeviceToDevice      =   3,      /**< Device -> Device */
>         cudaMemcpyDefault             =   4       /**< Direction of the transfer is inferred from the pointer values. Requires unified virtual addressing */
>    };
> This seems to be translated into the following structure:
>     public abstract interface driver_types
>     {
>       ...
>       @NativeLocation(file="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v10.0\\include\\driver_types.h", line=917, column=25)
>       @Target(ElementType.TYPE_USE)
>       @Retention(RetentionPolicy.RUNTIME)
>       public static @interface cudaMemcpyKind {}
>     }
> So there is seemingly no way to obtain the actual enum values...

enums are currently translated away as annotations - Java enums are not 
an option because in C enums are much closer to ints than to Java enums.

That said, the current translation scheme could be improved at least by 
grouping the methods of the enum constants under a common interface 
(which can also define the annotation). That would make it easier to see 
which constants are defined by an enum.

This looks like a bug, or at least something that can be improved.
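
For instance, one possible shape for such a grouping (a sketch only, not 
what jextract currently emits - the constants stay plain ints, since C 
enums are closer to ints than to Java enums):

    // Hypothetical grouping of the constants of cudaMemcpyKind
    public interface cudaMemcpyKind {
        int cudaMemcpyHostToHost     = 0;   // Host -> Host
        int cudaMemcpyHostToDevice   = 1;   // Host -> Device
        int cudaMemcpyDeviceToHost   = 2;   // Device -> Host
        int cudaMemcpyDeviceToDevice = 3;   // Device -> Device
        int cudaMemcpyDefault        = 4;   // Direction inferred from the pointer values
    }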

>
> ---
>
> 9. Accessing GPU memory
> This is VERY specific for CUDA, so just for info: Accessing GPU memory 
> from the host (i.e. from Java in this case) should not be possible. 
> Attempting to do so will cause this InternalError:
>     About to access device memory from host
>     Exception in thread "main" java.lang.IllegalStateException: java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code
>         at java.base/java.foreign.memory.Pointer.get(Pointer.java:165)
>         at PanamaCudaTest.basicCublasTest(PanamaCudaTest.java:48)
>         at PanamaCudaTest.main(PanamaCudaTest.java:17)
>     Caused by: java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code
>         at java.base/jdk.internal.foreign.memory.References$OfFloat.getFloat(References.java:286)
>         at java.base/java.foreign.memory.Pointer.get(Pointer.java:163)
>         ... 2 more
> THIS IS FINE! I guess. In a native program, trying to access GPU 
> memory from the main C program causes a segfault and a nasty crash. I 
> guess this is what is captured by the InternalError in general. So in 
> fact, this is positive: it does NOT crash the VM and generate an 
> "hs_err" log, but gracefully uses an exception to tell the programmer 
> that he messed things up.

Well, yes - accessing non-DRAM memory via Unsafe is bound to fail in 
mysterious ways. I'm less positive than you are regarding gracefulness, 
in the sense that I don't know under which conditions the VM is able to 
recover after a bad Unsafe::put/get. I would expect mileage to vary here 
- but I'm no VM engineer and I'll leave this specific point to others.


>
> ---
>
> Sorry for the wall of text. I'm sure that some of this has already 
> been discussed and addressed. I'll try to catch up with the mailing 
> list discussion as far as reasonably possible.

Thanks again for the great feedback - this gives us a good number of 
actionable items. And I'm overall pleased that you could make this work 
in a relatively short span of time.

Maurizio

> Best regards,
> Marco Hutter
>

