JVM crash by creating VarHandle
Ty Young
youngty1997 at gmail.com
Sun Feb 2 20:41:38 UTC 2020
On 2/2/20 12:43 PM, Maurizio Cimadamore wrote:
> Actually, I think I might have figured what's wrong - here
>
>> MemoryArray<nvmlProcessInfo_t> structArray = new
>> nvmlProcessInfo_t().toArray(intPointer.getValue());
> The toArray() function doesn't really do what the native library
> expects. Looking online (e.g. [1]), it seems that callers of this
> function just allocate an array of nvmlProcessInfo_t of a certain size,
> and then pass that array to the function, e.g.
>
> unsigned int num_procs = 32;
> nvmlProcessInfo_t procs[32];
>
> ...
>
> result = (nvmlReturn_t) nvmlDeviceGetGraphicsRunningProcesses(device,
> &num_procs, procs);
>
>
> But, your toArray() function is not just creating a contiguous array
> of nvmlProcessInfo_t - instead, it seems to create an array of
> pointers to nvmlProcessInfo_t structs - which is not what the function
> expects. In this case, at least on my machine, since num_procs is "1",
> allocating an array of one pointer means allocating 64 bits - but to
> be able to write one nvmlProcessInfo_t you need at least 128 bits. So
> the runtime doesn't have sufficient size to do the write.
>
> Then, when you do the dereference,
> nvmlProcessInfo_tMemoryArray::getValue uses a VarHandle with a
> MemoryAddress carrier - but what you are extracting from the segment
> are the structs themselves, not just pointers.
>
> I tried making a couple of changes, which seemed to give the desired
> effect:
>
> 1) In the constructor of "nvmlProcessInfo_tMemoryArray" - you should
> replace this:
>
> this.layout = Optional.of(MemoryLayout.ofSequence(length,
> MemoryLayouts.C_POINTER));
>
> With this
>
> this.layout = Optional.of(MemoryLayout.ofSequence(length, structlayout));
>
> 2) In nvmlProcessInfo_tMemoryArray::getValue, replace this:
>
> return new
> nvmlProcessInfo_t((MemoryAddress)this.handle.get(segment.baseAddress()),
> this.structLayout);
>
> With just this:
>
> return new nvmlProcessInfo_t(segment.baseAddress(), this.structLayout);
>
> (the setValue will likely need a similar change to bulk copy the
> incoming array in the right place).
You're right. I was treating the Struct MemoryAddress as a pointer. It
works now. Segfault gone and all.
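
For reference, a minimal sketch of the size difference discussed above. It
models the native buffer with a plain java.nio.ByteBuffer instead of the
Panama API, and it assumes 64-bit pointers and the NVML struct layout
{ unsigned int pid; unsigned long long usedGpuMemory; }, i.e. 16 bytes per
element with padding. The class, method, and constant names are hypothetical,
and the memory value written in main is a placeholder, not real output.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch only: a ByteBuffer stands in for the off-heap segment that the
// native call fills. Sizes and offsets are assumptions (64-bit platform,
// struct { unsigned int pid; unsigned long long usedGpuMemory; }).
final class StructArraySketch {
    static final int POINTER_BYTES = 8;       // assumed 64-bit pointer
    static final int STRUCT_BYTES = 16;       // 4 (pid) + 4 (padding) + 8 (usedGpuMemory)
    static final int PID_OFFSET = 0;
    static final int USED_MEMORY_OFFSET = 8;

    // Bytes reserved when the array is laid out as a sequence of pointers.
    static int pointerArrayBytes(int length) {
        return length * POINTER_BYTES;
    }

    // Bytes reserved when the array is laid out as a sequence of structs.
    static int structArrayBytes(int length) {
        return length * STRUCT_BYTES;
    }

    // Allocates a contiguous, zero-filled array of structs in native byte order.
    static ByteBuffer allocateStructArray(int length) {
        return ByteBuffer.allocateDirect(structArrayBytes(length))
                         .order(ByteOrder.nativeOrder());
    }

    // Reads the pid of element `index` directly from the struct bytes.
    static long pidAt(ByteBuffer array, int index) {
        return Integer.toUnsignedLong(array.getInt(index * STRUCT_BYTES + PID_OFFSET));
    }

    // Reads usedGpuMemory of element `index` directly from the struct bytes.
    static long usedMemoryAt(ByteBuffer array, int index) {
        return array.getLong(index * STRUCT_BYTES + USED_MEMORY_OFFSET);
    }

    public static void main(String[] args) {
        ByteBuffer procs = allocateStructArray(1);
        // Stand-in for the native library writing one element.
        procs.putInt(PID_OFFSET, 27522);
        procs.putLong(USED_MEMORY_OFFSET, 1024L); // placeholder value
        System.out.println(pointerArrayBytes(1) + " bytes vs " + structArrayBytes(1) + " bytes");
        System.out.println(pidAt(procs, 0));
    }
}
```

With one element, the pointer layout reserves 8 bytes while the struct needs
16, which is why the write had nowhere to go. Reading an element is plain
offset arithmetic on the struct bytes, with no pointer to follow.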
>
> With these changes, the test runs successfully and prints "27522". No
> idea if that's correct, but the output seems stable.
It's the PID of a process so probably is. "nvidia-smi" is Nvidia's
official utility which prints PIDs and memory usage of processes running
on the GPU.
>
> Maurizio
>
> [1] -
> https://github.com/TANGO-Project/monitor-infrastructure/blob/master/Collectd/nvidia_plugin/nvidia_plugin.c
>
> On 02/02/2020 17:59, Maurizio Cimadamore wrote:
>> I managed to test remotely using a machine with an nvidia GPU. I
>> didn't get any VM crash, but I did get an issue with a misaligned
>> access in the instruction where you get the segfault:
>>
>> SUCCESS
>> SUCCESS
>> INSUFFICIENT_SIZE_ERROR
>> SUCCESS
>> Exception in thread "main" java.lang.IllegalStateException:
>> Misaligned access at address: 140131897994114
>> at
>> java.base/java.lang.invoke.VarHandleMemoryAddressBase.newIllegalStateExceptionForMisalignedAccess(VarHandleMemoryAddressBase.java:54)
>> at
>> java.base/java.lang.invoke.VarHandleMemoryAddressAsInts.offsetNoVMAlignCheck(VarHandleMemoryAddressAsInts.java:69)
>> at
>> java.base/java.lang.invoke.VarHandleMemoryAddressAsInts.get0(VarHandleMemoryAddressAsInts.java:79)
>> at
>> java.base/java.lang.invoke.VarHandleMemoryAddressAsInts0/0x0000000800baf440.get(Unknown
>> Source)
>> at
>> java.base/java.lang.invoke.VarHandleGuards.guard_L_L(VarHandleGuards.java:41)
>> at
>> org.goliath.crosspoint.fields.NumberField.getValue(NumberField.java:57)
>> at
>> org.goliath.crosspoint.fields.NumberField.getValue(NumberField.java:14)
>> at org.goliath.bindings.nvml.main.Test.main(Test.java:45)
>>
>> Seems like the nvidia lib always puts an address inside the struct of
>> pointer which doesn't seem to be 4-byte aligned, so reading an int
>> out of it fails.
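
A small sketch of the alignment arithmetic behind that exception, using the
address from the stack trace. The helper name isAligned is hypothetical, and
the check assumes the alignment is a power of two; it illustrates the rule
and is not the check the VarHandle implementation performs internally.

```java
// Sketch only: shows why the reported address fails a 4-byte alignment check.
final class AlignmentSketch {
    // True when `address` is a multiple of `alignment`.
    // Assumes `alignment` is a power of two.
    static boolean isAligned(long address, long alignment) {
        return (address & (alignment - 1)) == 0;
    }

    public static void main(String[] args) {
        long address = 140131897994114L; // address from the exception message
        System.out.println(address % 4);           // prints 2
        System.out.println(isAligned(address, 4)); // prints false
        System.out.println(isAligned(address, 2)); // prints true
    }
}
```

The address is only 2-byte aligned, so an int read that requires 4-byte
alignment is rejected.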
>>
>> That said, I'm not sure I fully understand how your crosspoint
>> machinery is supposed to work - I'm seeing a bunch of struct
>> creation, even before the nvidia routine is called to fill in the
>> array, which is odd, given that the code is not really allocating any
>> struct. Specifically, this:
>>
>> MemoryArray<nvmlProcessInfo_t> structArray = new
>> nvmlProcessInfo_t().toArray(intPointer.getValue());
>>
>> This allocates two off-heap structs: one is allocated by the
>> nvmlProcessInfo_t constructor, then another is created in the
>> toArray() call. That seems odd given that the array is supposed to be
>> empty and filled by the nvidia routine.
>>
>> In any case, the crash doesn't happen on my machine - so I suspect
>> that we'll have to keep an eye out for that issue in case we see some
>> other example which ends up with same crash. It would be useful if
>> you could convert your test not to use the crosspoint library and see
>> if that still has the crash. This should not be super hard given that
>> there aren't many calls in there - and the static wrappers generated
>> by jextract should be good enough to run that test?
>>
>> Maurizio
>>
>> On 02/02/2020 07:59, Maurizio Cimadamore wrote:
>>>
>>> On 02/02/2020 01:51, Ty Young wrote:
>>>> I'm not entirely sure what could be done differently, but if you
>>>> have suggestions then I'd be glad to hear them. The thing to keep in
>>>> mind with NVML is that it's backwards and cross-platform compatible
>>>> so once things are defined there isn't anything to really worry
>>>> about later.
>>>>
>>>>
>>>> In hindsight the NativeFunction implementations shouldn't force the
>>>> use of higher level abstractions - that should be the job of
>>>> nvml_h.java as it's what enforces type safety to begin with.
>>>
>>> I wasn't suggesting you should change the API - just that there are
>>> many layers between the code you see in Test and the actual method
>>> handle, var handle calls - which makes it harder to diagnose.
>>>
>>> Re-reading the stack trace in the crash, it seems to be a problem
>>> related to classfile parsing, potentially of one of the synthetic
>>> VarHandle classes which we spin on the fly. I'll do more analysis
>>> next week.
>>>
>>> In the meantime it would be helpful to understand if the crash
>>> started to appear when you updated the Panama repository (which
>>> might suggest some relationship with recent commits, such as the one
>>> for adding VarHandle adapter support), or if it's a failure that you
>>> encountered writing a new test.
>>>
>>> Maurizio
>>>