Random values from NVML functions
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Fri May 15 00:17:59 UTC 2020
I took a look at the code. I couldn't spot anything obviously wrong with
the way the function in question is used, but I may have found something.
One thing I noted is that you seem to call longValue() on the
NativeInteger modelling the "samplesSizePointer". Perhaps you do that
because in the API that pointer is an unsigned int pointer. But the value
is only 32 bits wide, so if it is read as a negative int (because the
unsigned value exceeds Integer.MAX_VALUE), casting it to long as you do
here:
https://github.com/BlueGoliath/Crosspoint/blob/master/src/main/java/org/goliath/crosspoint/numbers/NativeInteger.java#L29
isn't going to help much: the cast sign-extends, so a negative value
stays negative. It would be better to use intValue() and then call
Integer.toUnsignedLong if you really want to protect against that.
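
To illustrate the difference, here is a minimal standalone sketch. The
class and method names (UnsignedIntDemo, widenSigned, widenUnsigned) are
made up for this example and are not part of Crosspoint; only
Integer.toUnsignedLong is the real JDK method.

```java
public class UnsignedIntDemo {
    // Plain widening cast: sign-extends, so a negative int stays negative.
    static long widenSigned(int raw) {
        return (long) raw;
    }

    // Reinterprets the same 32 bits as an unsigned value.
    static long widenUnsigned(int raw) {
        return Integer.toUnsignedLong(raw);
    }

    public static void main(String[] args) {
        // Bit pattern of the unsigned 32-bit value 4026531840,
        // which reads as a negative number when treated as a Java int.
        int raw = 0xF0000000;
        System.out.println(widenSigned(raw));   // -268435456
        System.out.println(widenUnsigned(raw)); // 4026531840
    }
}
```

For values that fit in 31 bits both methods agree, so the difference only
shows up once the native side writes a value above Integer.MAX_VALUE.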
On a weirder note, the layout for nvmlProcessUtilizationSample_t
seems off:
https://github.com/BlueGoliath/java-nvidia-bindings/blob/3445ea5dc42e3901942a328a4d990cde288d55e7/modules/org.goliath.bindings.nvml/src/main/java/org/goliath/bindings/nvml/structs/nvmlProcessUtilizationSample_t.java#L13
I grabbed the struct definition from the nvidia header, pasted it into
a source file, and processed it with clang:

$ cat Foo.c
struct nvmlProcessUtilizationSample_st
{
    unsigned int pid;             //!< PID of process
    unsigned long long timeStamp; //!< CPU Timestamp in microseconds
    unsigned int smUtil;          //!< SM (3D/Compute) Util Value
    unsigned int memUtil;         //!< Frame Buffer Memory Util Value
    unsigned int encUtil;         //!< Encoder Util Value
    unsigned int decUtil;         //!< Decoder Util Value
};

struct nvmlProcessUtilizationSample_st str;

$ clang -cc1 -fdump-record-layouts -emit-llvm Foo.c

*** Dumping AST Record Layout
         0 | struct nvmlProcessUtilizationSample_st
         0 |   unsigned int pid
         8 |   unsigned long long timeStamp
        16 |   unsigned int smUtil
        20 |   unsigned int memUtil
        24 |   unsigned int encUtil
        28 |   unsigned int decUtil
           | [sizeof=32, align=8]

Could this be the issue? Your layout has a lot more padding (you
mentioned 48 bytes for the struct, while clang reports 32). Is it
possible that you are just reading from padding?
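
The offsets clang reports can be reproduced by hand with the usual
natural-alignment rule. The following is only a rough sketch: it assumes
every scalar field is aligned to its own size (4 bytes for unsigned int,
8 bytes for unsigned long long, as on common 64-bit ABIs), and the class
and method names (StructLayoutDemo, alignUp, layout) are invented for
this illustration.

```java
import java.util.Arrays;

public class StructLayoutDemo {
    // Rounds offset up to the next multiple of align (align must be a power of two).
    static long alignUp(long offset, long align) {
        return (offset + align - 1) & -align;
    }

    // Returns the offset of each field, followed by the total struct size
    // as the last element. Assumption: each field's alignment equals its size.
    static long[] layout(long[] sizes) {
        long[] result = new long[sizes.length + 1];
        long offset = 0;
        long maxAlign = 1;
        for (int i = 0; i < sizes.length; i++) {
            long align = sizes[i];
            offset = alignUp(offset, align); // insert padding before the field if needed
            result[i] = offset;
            offset += sizes[i];
            maxAlign = Math.max(maxAlign, align);
        }
        // Trailing padding so that arrays of the struct stay aligned.
        result[sizes.length] = alignUp(offset, maxAlign);
        return result;
    }

    public static void main(String[] args) {
        // pid, timeStamp, smUtil, memUtil, encUtil, decUtil
        long[] sizes = {4, 8, 4, 4, 4, 4};
        System.out.println(Arrays.toString(layout(sizes)));
        // [0, 8, 16, 20, 24, 28, 32]
    }
}
```

The only padding in this struct is the 4 bytes between pid and timeStamp,
so a layout with padding after each of the trailing unsigned int fields
would read the utilization values from the wrong offsets.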
Maurizio
On 14/05/2020 01:21, Ty Young wrote:
>
> On 5/13/20 7:01 PM, Maurizio Cimadamore wrote:
>>
>> Btw - nice-looking app! (I looked at the pic :-) )
>>
>
> Thanks!
>
>
>> If I understand correctly, the place where you are getting garbage
>> values out of is this:
>>
>> https://github.com/BlueGoliath/java-nvidia-bindings/blob/master/modules/org.goliath.bindings.nvml/src/main/java/org/goliath/bindings/nvml/main/nvml_h.java#L426
>>
>>
>> More specifically, after the call, the array of
>> nvmlProcessUtilizationSample_t doesn't contain what you think it
>> should contain. Am I correct?
>>
>
> Right, almost as if the memory isn't being sliced correctly. Although,
> I'm not sure how incorrectly sliced memory, if zero'd, would give
> those numbers to begin with.
>
>
>> Can I see the client code which calls this function, so that I can
>> take a look at all the pieces?
>>
>
> Of course:
>
>
> https://github.com/BlueGoliath/GoliathEnviousNative/blob/master/modules/org.goliath.envious.nvml/src/main/java/org/goliath/envious/nvml/local/attributes/NVMLGPUProcessAttributeData.java
>
>
>
> Be warned though, the code isn't as pretty as the GUI.
>
>
>> Thanks
>> Maurizio
>>
>> On 14/05/2020 00:55, Ty Young wrote:
>>>
>>> On 5/13/20 6:38 PM, Maurizio Cimadamore wrote:
>>>> Hi,
>>>> is this a regression? E.g. did this work before and then suddenly
>>>> start behaving differently (e.g. after a rebuild on panama), or is
>>>> this a new function you are trying to call that is giving you odd
>>>> behavior?
>>>
>>>
>>> Not sure.
>>>
>>>
>>> After converting everything from the Pointer API to FMA, it started
>>> giving me 0 for everything, whereas the Pointer API would give me
>>> seemingly correct non-zero values the majority of the time but would
>>> sometimes give random garbage. Because the old Pointer API never
>>> zeroed memory, I have no idea if those values were valid or not, so I
>>> didn't think much of always getting 0.
>>>
>>>
>>> Yesterday I did some cleanups in the OO code (the layer under JavaFX),
>>> including converting NativeValue<Integer> instances to
>>> NativeInteger (same for longlong), and it started doing this, which I
>>> think is partially correct: if I start a GPU benchmarking
>>> application (Unigine Superposition) and view the processes content in
>>> the GUI, I do see seemingly correct utilization rates that match the
>>> in-app On-Screen-Display FPS.
>>>
>>>
>>> The issue is with Memory Utilization and Video encoder/decoder
>>> Utilization.
>>>
>>>
>>>>
>>>> Maurizio
>>>>
>>>> On 14/05/2020 00:00, Ty Young wrote:
>>>>> Hi,
>>>>>
>>>>>
>>>>> Currently I'm getting random values[1] from this NVML function[2].
>>>>> I've spent a few hours dumping sizes and re-checking my
>>>>> abstraction layer code in order to figure out why it's doing this
>>>>> but am not seeing anything. I'm wondering if there were any recent
>>>>> bugs in FMA that might cause this and have since been fixed. If not,
>>>>> I'm going to have to try asking on the Nvidia forums.
>>>>>
>>>>>
>>>>> For reference, the function binding can be found here:
>>>>>
>>>>>
>>>>> https://github.com/BlueGoliath/java-nvidia-bindings/blob/master/modules/org.goliath.bindings.nvml/src/main/java/org/goliath/bindings/nvml/main/nvml_h.java#L426
>>>>>
>>>>>
>>>>>
>>>>> and the abstraction layer here:
>>>>>
>>>>>
>>>>> https://github.com/BlueGoliath/Crosspoint/tree/master/src/main/java/org/goliath/crosspoint
>>>>>
>>>>>
>>>>>
>>>>> I'm able to read/write other structs just fine, such as:
>>>>>
>>>>>
>>>>> https://github.com/BlueGoliath/java-nvidia-bindings/blob/master/modules/org.goliath.bindings.nvctrl/src/main/java/org/goliath/bindings/nvctrl/structs/NVCTRLAttributeValidValuesRec.java
>>>>>
>>>>>
>>>>>
>>>>> and again, all byte sizes seem correct (48 bytes for the NVML
>>>>> struct), so I'm really lost here.
>>>>>
>>>>>
>>>>>
>>>>> [1] https://imgur.com/a/wrQtOXq
>>>>>
>>>>> [2]
>>>>> https://docs.nvidia.com/deploy/nvml-api/group__nvmlGridQueries.html#group__nvmlGridQueries_1gb0ea5236f5e69e63bf53684a11c233bd
>>>>>
More information about the panama-dev mailing list