JVM crash by creating VarHandle

Maurizio Cimadamore maurizio.cimadamore at oracle.com
Sun Feb 2 18:43:00 UTC 2020


Actually, I think I might have figured out what's wrong - here:

> MemoryArray<nvmlProcessInfo_t> structArray = new 
> nvmlProcessInfo_t().toArray(intPointer.getValue()); 
The toArray() function doesn't really do what the native library 
expects. Looking online (e.g. [1]), it seems that callers of this 
function just allocate an array of nvmlProcessInfo_t of a certain size, 
and then pass that array to the function, e.g.:

unsigned int num_procs = 32;
nvmlProcessInfo_t procs[32];

...

result = (nvmlReturn_t) nvmlDeviceGetGraphicsRunningProcesses(device, 
&num_procs, procs);


But, your toArray() function is not just creating a contiguous array of 
nvmlProcessInfo_t - instead, it seems to create an array of pointers to 
nvmlProcessInfo_t structs - which is not what the function expects. In 
this case, at least on my machine, since num_procs is "1", allocating an 
array of one pointer means allocating 64 bits - but to be able to write 
one nvmlProcessInfo_t you need at least 128 bits. So the runtime doesn't 
have sufficient size to do the write.
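
To make the size mismatch concrete, here is a minimal Java sketch of the 
arithmetic. It assumes a two-field nvmlProcessInfo_t (a 4-byte pid, 4 
bytes of padding, an 8-byte usedGpuMemory), which matches the 128 bits 
mentioned above, and 8-byte pointers. The class and method names are 
hypothetical and are not part of crosspoint or NVML.

```java
public class ProcessInfoSizes {
    // Assumed layout: unsigned int pid; (padding); unsigned long long usedGpuMemory.
    static final int PID_BYTES = 4;
    static final int PADDING_BYTES = 4;
    static final int USED_GPU_MEMORY_BYTES = 8;
    static final int STRUCT_BYTES = PID_BYTES + PADDING_BYTES + USED_GPU_MEMORY_BYTES;

    // Assumed pointer size on a 64-bit platform.
    static final int POINTER_BYTES = 8;

    // Bytes reserved when the array holds pointers to structs.
    public static long pointerArrayBytes(int length) {
        return (long) length * POINTER_BYTES;
    }

    // Bytes the native function needs when it writes structs in place.
    public static long structArrayBytes(int length) {
        return (long) length * STRUCT_BYTES;
    }

    public static void main(String[] args) {
        int numProcs = 1;
        System.out.println("allocated as pointers: " + pointerArrayBytes(numProcs) + " bytes");
        System.out.println("needed for structs:    " + structArrayBytes(numProcs) + " bytes");
    }
}
```

With one element, the pointer-based allocation is half the size the 
native write needs.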

Then, when you do the dereference, 
nvmlProcessInfo_tMemoryArray::getValue uses a VarHandle with a 
MemoryAddress carrier - but what you are extracting from the segment 
are the structs themselves, not just pointers.

I tried making a couple of changes, which seemed to give the desired 
effect:

1) In the constructor of "nvmlProcessInfo_tMemoryArray" - you should 
replace this:

this.layout = Optional.of(
        MemoryLayout.ofSequence(length, MemoryLayouts.C_POINTER));

With this:

this.layout = Optional.of(MemoryLayout.ofSequence(length, structlayout));

2) In nvmlProcessInfo_tMemoryArray::getValue, replace this:

return new nvmlProcessInfo_t(
        (MemoryAddress) this.handle.get(segment.baseAddress()),
        this.structLayout);

With just this:

return new nvmlProcessInfo_t(segment.baseAddress(), this.structLayout);

(the setValue will likely need a similar change to bulk copy the 
incoming array in the right place).
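
As an illustration of what "structs stored in place" means, here is a 
small sketch using a plain java.nio.ByteBuffer rather than the Panama 
API. It assumes the same 16-byte nvmlProcessInfo_t layout (pid at 
offset 0, usedGpuMemory at offset 8); the ProcessInfoArray class is 
hypothetical and is not the crosspoint implementation. Element i is read 
directly at offset i * 16, with no pointer dereference in between.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ProcessInfoArray {
    // Assumed struct layout: 4-byte pid, 4 bytes padding, 8-byte usedGpuMemory.
    static final int STRUCT_BYTES = 16;
    static final int PID_OFFSET = 0;
    static final int USED_GPU_MEMORY_OFFSET = 8;

    private final ByteBuffer segment;

    // Allocates one contiguous off-heap block big enough for `length` structs.
    public ProcessInfoArray(int length) {
        this.segment = ByteBuffer.allocateDirect(length * STRUCT_BYTES)
                                 .order(ByteOrder.nativeOrder());
    }

    public int length() {
        return segment.capacity() / STRUCT_BYTES;
    }

    // Writes one struct in place (stands in for what the native call would fill in).
    public void set(int index, int pid, long usedGpuMemory) {
        int base = index * STRUCT_BYTES;
        segment.putInt(base + PID_OFFSET, pid);
        segment.putLong(base + USED_GPU_MEMORY_OFFSET, usedGpuMemory);
    }

    // Reads the fields straight out of the block at the element's offset.
    public int pid(int index) {
        return segment.getInt(index * STRUCT_BYTES + PID_OFFSET);
    }

    public long usedGpuMemory(int index) {
        return segment.getLong(index * STRUCT_BYTES + USED_GPU_MEMORY_OFFSET);
    }

    public static void main(String[] args) {
        ProcessInfoArray procs = new ProcessInfoArray(2);
        procs.set(1, 1234, 4096L);
        System.out.println(procs.pid(1) + " uses " + procs.usedGpuMemory(1) + " bytes");
    }
}
```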

With these changes, the test runs successfully and prints "27522". No 
idea if that's correct, but the output seems stable.

Maurizio

[1] - 
https://github.com/TANGO-Project/monitor-infrastructure/blob/master/Collectd/nvidia_plugin/nvidia_plugin.c

On 02/02/2020 17:59, Maurizio Cimadamore wrote:
> I managed to test remotely using a machine with an nvidia GPU. I 
> didn't get any VM crash, but I did get an issue with a misaligned 
> access in the instruction where you get the segfault:
>
> SUCCESS
> SUCCESS
> INSUFFICIENT_SIZE_ERROR
> SUCCESS
> Exception in thread "main" java.lang.IllegalStateException: Misaligned 
> access at address: 140131897994114
>     at 
> java.base/java.lang.invoke.VarHandleMemoryAddressBase.newIllegalStateExceptionForMisalignedAccess(VarHandleMemoryAddressBase.java:54)
>     at 
> java.base/java.lang.invoke.VarHandleMemoryAddressAsInts.offsetNoVMAlignCheck(VarHandleMemoryAddressAsInts.java:69)
>     at 
> java.base/java.lang.invoke.VarHandleMemoryAddressAsInts.get0(VarHandleMemoryAddressAsInts.java:79)
>     at 
> java.base/java.lang.invoke.VarHandleMemoryAddressAsInts0/0x0000000800baf440.get(Unknown 
> Source)
>     at 
> java.base/java.lang.invoke.VarHandleGuards.guard_L_L(VarHandleGuards.java:41)
>     at 
> org.goliath.crosspoint.fields.NumberField.getValue(NumberField.java:57)
>     at 
> org.goliath.crosspoint.fields.NumberField.getValue(NumberField.java:14)
>     at org.goliath.bindings.nvml.main.Test.main(Test.java:45)
>
> Seems like the nvidia lib always puts an address inside the struct of 
> pointer which doesn't seem to be 4-byte aligned, so reading an int out 
> of it fails.
>
> That said, I'm not sure I fully understand how your crosspoint 
> machinery is supposed to work - I'm seeing a bunch of struct creation, 
> even before the nvidia routine is called to fill in the array, which 
> is odd, given that the code is not really allocating any struct. 
> Specifically, this:
>
>         MemoryArray<nvmlProcessInfo_t> structArray = new 
> nvmlProcessInfo_t().toArray(intPointer.getValue());
>
> allocates two off-heap structs - one is allocated by 
> nvmlProcessInfo_t - then another is created in the toArray() call. 
> Which seems completely odd given that the array is supposed to be 
> empty and filled by the nvidia routine?
>
> In any case, the crash doesn't happen on my machine - so I suspect 
> that we'll have to keep an eye out for that issue in case we see some 
> other example which ends up with same crash. It would be useful if you 
> could convert your test not to use the crosspoint library and see if 
> that still has the crash. This should not be super hard given that 
> there aren't many calls in there - and the static wrappers generated 
> by jextract should be good enough to run that test?
>
> Maurizio
>
> On 02/02/2020 07:59, Maurizio Cimadamore wrote:
>>
>> On 02/02/2020 01:51, Ty Young wrote:
>>> I'm not entirely sure what could be done differently but if you have 
>>> suggestions then I'd be glad to hear it. The thing to keep in mind 
>>> with NVML is that it's backwards and cross-platform compatible so 
>>> once things are defined there isn't anything to really worry about 
>>> later.
>>>
>>>
>>> In hindsight the NativeFunction implementations shouldn't force the 
>>> use of higher level abstractions - that should be the job of 
>>> nvml_h.java as it's what enforces type safety to begin with. 
>>
>> I wasn't suggesting you should change the API - just that there are 
>> many layers between the code you see in Test and the actual method 
>> handle, var handle calls - which makes it harder to diagnose.
>>
>> Re-reading the stack trace in the crash, it seems to be a problem 
>> related to classfile parsing, potentially of one of the synthetic 
>> VarHandle classes which we spin on the fly. I'll do more analysis 
>> next week.
>>
>> In the meantime it would be helpful to understand if the crash 
>> started to appear when you updated the Panama repository (which might 
>> suggest some relationship with recent commits, such as the one for 
>> adding VarHandle adapter support), or if it's a failure that you 
>> encountered writing a new test.
>>
>> Maurizio
>>

