JVM crash by creating VarHandle

Ty Young youngty1997 at gmail.com
Sun Feb 2 20:41:38 UTC 2020


On 2/2/20 12:43 PM, Maurizio Cimadamore wrote:
> Actually, I think I might have figured what's wrong - here
>
>> MemoryArray<nvmlProcessInfo_t> structArray = new 
>> nvmlProcessInfo_t().toArray(intPointer.getValue()); 
> The toArray() function doesn't really do what the native library
> expects. Looking online (e.g. [1]), it seems that callers of this
> function just allocate an array of nvmlProcessInfo_t of a certain size
> and then pass that array to the function, e.g.
>
> unsigned int num_procs = 32;
> nvmlProcessInfo_t procs[32];
>
> ...
>
> result = (nvmlReturn_t) nvmlDeviceGetGraphicsRunningProcesses(device, 
> &num_procs, procs);
>
>
> But, your toArray() function is not just creating a contiguous array 
> of nvmlProcessInfo_t - instead, it seems to create an array of 
> pointers to nvmlProcessInfo_t structs - which is not what the function 
> expects. In this case, at least on my machine, since num_procs is "1", 
> allocating an array of one pointer means allocating 64 bits - but to 
> be able to write one nvmlProcessInfo_t you need at least 128 bits. So 
> the runtime doesn't have sufficient size to do the write.
>
> Then, when you do the dereference,
> nvmlProcessInfo_tMemoryArray::getValue uses a VarHandle with a
> MemoryAddress carrier - but what you are extracting from the segment
> are the structs themselves, not just pointers.
>
> I tried making a couple of changes, which seemed to give the desired
> effect:
>
> 1) In the constructor of "nvmlProcessInfo_tMemoryArray" - you should 
> replace this:
>
> this.layout = Optional.of(MemoryLayout.ofSequence(length, 
> MemoryLayouts.C_POINTER));
>
> With this:
>
> this.layout = Optional.of(MemoryLayout.ofSequence(length, structlayout));
>
> 2) In nvmlProcessInfo_tMemoryArray::getValue, replace this:
>
> return new 
> nvmlProcessInfo_t((MemoryAddress)this.handle.get(segment.baseAddress()), 
> this.structLayout);
>
> With just this:
>
> return new nvmlProcessInfo_t(segment.baseAddress(), this.structLayout);
>
> (the setValue will likely need a similar change to bulk copy the
> incoming array into the right place).


You're right. I was treating the Struct MemoryAddress as a pointer. It 
works now. Segfault gone and all.
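
To make the size mismatch concrete, here is a minimal sketch in plain Java (java.nio only, not the Panama API and not the crosspoint code). It assumes a 64-bit platform where nvmlProcessInfo_t is an unsigned int pid followed by an unsigned long long usedGpuMemory, i.e. 16 bytes with padding, which matches the "at least 128 bits" figure above. The class name, constants, and helper methods are hypothetical and only for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ProcessInfoLayoutSketch {
    // Assumed layout (64-bit platform): 4-byte pid, 4 bytes of padding,
    // 8-byte usedGpuMemory. These sizes are assumptions, not read from NVML.
    static final int STRUCT_SIZE = 16;
    static final int POINTER_SIZE = 8;
    static final int PID_OFFSET = 0;
    static final int USED_GPU_MEMORY_OFFSET = 8;

    // Bytes reserved when the array is laid out as a sequence of pointers.
    static int bytesForPointerArray(int count) {
        return count * POINTER_SIZE;
    }

    // Bytes the native function needs: a contiguous sequence of structs.
    static int bytesForStructArray(int count) {
        return count * STRUCT_SIZE;
    }

    // Read the pid field of the struct at the given index straight out of
    // the contiguous buffer; no pointer is dereferenced.
    static int readPid(ByteBuffer structs, int index) {
        return structs.getInt(index * STRUCT_SIZE + PID_OFFSET);
    }

    public static void main(String[] args) {
        int count = 1;
        System.out.println("pointer array bytes: " + bytesForPointerArray(count));
        System.out.println("struct array bytes:  " + bytesForStructArray(count));

        ByteBuffer structs = ByteBuffer
                .allocateDirect(bytesForStructArray(count))
                .order(ByteOrder.nativeOrder());

        // Stand-in for the native call filling in the first element.
        structs.putInt(PID_OFFSET, 27522);
        structs.putLong(USED_GPU_MEMORY_OFFSET, 64L * 1024 * 1024);

        System.out.println("pid: " + readPid(structs, 0));
    }
}
```

With one element, the pointer-based layout reserves only half the bytes the native function writes, which is consistent with the INSUFFICIENT_SIZE_ERROR and the bad reads seen earlier in the thread.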


>
> With these changes, the test runs successfully and prints "27522". No
> idea if that's correct, but the output seems stable.


It's the PID of a process, so it probably is. "nvidia-smi" is Nvidia's 
official utility which prints PIDs and memory usage of processes running 
on the GPU.


>
> Maurizio
>
> [1] - 
> https://github.com/TANGO-Project/monitor-infrastructure/blob/master/Collectd/nvidia_plugin/nvidia_plugin.c
>
> On 02/02/2020 17:59, Maurizio Cimadamore wrote:
>> I managed to test remotely using a machine with an nvidia GPU. I 
>> didn't get any VM crash, but I did get an issue with a misaligned 
>> access in the instruction where you get the segfault:
>>
>> SUCCESS
>> SUCCESS
>> INSUFFICIENT_SIZE_ERROR
>> SUCCESS
>> Exception in thread "main" java.lang.IllegalStateException: 
>> Misaligned access at address: 140131897994114
>>     at 
>> java.base/java.lang.invoke.VarHandleMemoryAddressBase.newIllegalStateExceptionForMisalignedAccess(VarHandleMemoryAddressBase.java:54)
>>     at 
>> java.base/java.lang.invoke.VarHandleMemoryAddressAsInts.offsetNoVMAlignCheck(VarHandleMemoryAddressAsInts.java:69)
>>     at 
>> java.base/java.lang.invoke.VarHandleMemoryAddressAsInts.get0(VarHandleMemoryAddressAsInts.java:79)
>>     at 
>> java.base/java.lang.invoke.VarHandleMemoryAddressAsInts0/0x0000000800baf440.get(Unknown 
>> Source)
>>     at 
>> java.base/java.lang.invoke.VarHandleGuards.guard_L_L(VarHandleGuards.java:41)
>>     at 
>> org.goliath.crosspoint.fields.NumberField.getValue(NumberField.java:57)
>>     at 
>> org.goliath.crosspoint.fields.NumberField.getValue(NumberField.java:14)
>>     at org.goliath.bindings.nvml.main.Test.main(Test.java:45)
>>
>> Seems like the nvidia lib always puts an address inside the struct
>> pointer which doesn't seem to be 4-byte aligned, so reading an int
>> out of it fails.
>>
>> That said, I'm not sure I fully understand how your crosspoint 
>> machinery is supposed to work - I'm seeing a bunch of struct 
>> creation, even before the nvidia routine is called to fill in the 
>> array, which is odd, given that the code is not really allocating any 
>> struct. Specifically, this:
>>
>>         MemoryArray<nvmlProcessInfo_t> structArray = new 
>> nvmlProcessInfo_t().toArray(intPointer.getValue());
>>
>> This allocates two off-heap structs - one is allocated by
>> nvmlProcessInfo_t - then another is created in the toArray() call.
>> That seems odd, given that the array is supposed to be empty and
>> filled by the nvidia routine?
>>
>> In any case, the crash doesn't happen on my machine - so I suspect 
>> that we'll have to keep an eye out for that issue in case we see some 
>> other example which ends up with the same crash. It would be useful if
>> you could convert your test not to use the crosspoint library and see 
>> if that still has the crash. This should not be super hard given that 
>> there aren't many calls in there - and the static wrappers generated 
>> by jextract should be good enough to run that test?
>>
>> Maurizio
>>
>> On 02/02/2020 07:59, Maurizio Cimadamore wrote:
>>>
>>> On 02/02/2020 01:51, Ty Young wrote:
>>>> I'm not entirely sure what could be done differently but if you
>>>> have suggestions then I'd be glad to hear them. The thing to keep in
>>>> mind with NVML is that it's backwards and cross-platform compatible 
>>>> so once things are defined there isn't anything to really worry 
>>>> about later.
>>>>
>>>>
>>>> In hindsight the NativeFunction implementations shouldn't force the 
>>>> use of higher level abstractions - that should be the job of 
>>>> nvml_h.java as it's what enforces type safety to begin with. 
>>>
>>> I wasn't suggesting you should change the API - just that there are 
>>> many layers between the code you see in Test and the actual method 
>>> handle, var handle calls - which makes it harder to diagnose.
>>>
>>> Re-reading the stack trace in the crash, it seems to be a problem
>>> related to classfile parsing, potentially of one of the synthetic
>>> VarHandle classes which we spin on the fly. I'll do more analysis 
>>> next week.
>>>
>>> In the meantime it would be helpful to understand if the crash 
>>> started to appear when you updated the Panama repository (which 
>>> might suggest some relationship with recent commits, such as the one 
>>> for adding VarHandle adapter support), or if it's a failure that you 
>>> encountered writing a new test.
>>>
>>> Maurizio
>>>


More information about the panama-dev mailing list