Random values from NVML functions
Ty Young
youngty1997 at gmail.com
Fri May 15 00:52:45 UTC 2020
On 5/14/20 7:17 PM, Maurizio Cimadamore wrote:
>
> I took a look at the code, I couldn't spot anything obviously wrong
> with the way the function in question is used, but maybe I found
> something.
>
>
> One thing I noted was that you seem to call longValue() on the
> NativeInteger modelling the "samplesSizePointer". Perhaps you do that
> because in the API that pointer is an unsigned int pointer. But that's
> 32 bit and if you read it as a negative int value (because of
> overflow), casting to long as you do here:
>
> https://github.com/BlueGoliath/Crosspoint/blob/master/src/main/java/org/goliath/crosspoint/numbers/NativeInteger.java#L29
>
> Isn't going to help much. Would be better to just use intValue() and
> then just call Integer.toUnsignedLong if you really want to protect
> against that.
>
There wasn't any reason besides the constructor for the array accepting
a long value. The function should never return negatives.
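For illustration, here is a minimal sketch of the difference between a plain widening cast and Integer.toUnsignedLong when reading a C unsigned int. The class and method names are hypothetical and not part of Crosspoint or the bindings.

```java
final class UnsignedReadSketch {
    /** Widens a 32-bit value read from a C `unsigned int` to a non-negative long. */
    static long asUnsigned(int raw) {
        return Integer.toUnsignedLong(raw);
    }

    public static void main(String[] args) {
        // Hypothetical raw value: the bit pattern of 4294967294 as an unsigned int.
        int raw = 0xFFFFFFFE;
        System.out.println((long) raw);      // sign-extends: prints -2
        System.out.println(asUnsigned(raw)); // zero-extends: prints 4294967294
    }
}
```

For values that fit in 31 bits both conversions agree, which is why the cast only matters if the function ever returns a very large count.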
>
> On a more weird note, the layout for nvmlProcessUtilizationSample_t
> seems off:
>
> https://github.com/BlueGoliath/java-nvidia-bindings/blob/3445ea5dc42e3901942a328a4d990cde288d55e7/modules/org.goliath.bindings.nvml/src/main/java/org/goliath/bindings/nvml/structs/nvmlProcessUtilizationSample_t.java#L13
>
> I tried to grab the struct definition from the nvidia header and paste
> it into a header file and then process it with clang:
>
> $ cat Foo.c
> struct nvmlProcessUtilizationSample_st
> {
> unsigned int pid; //!< PID of process
> unsigned long long timeStamp; //!< CPU Timestamp in microseconds
> unsigned int smUtil; //!< SM (3D/Compute) Util Value
> unsigned int memUtil; //!< Frame Buffer Memory Util Value
> unsigned int encUtil; //!< Encoder Util Value
> unsigned int decUtil; //!< Decoder Util Value
> };
>
> struct nvmlProcessUtilizationSample_st str;
>
> $ clang -cc1 -fdump-record-layouts -emit-llvm Foo.c
>
> *** Dumping AST Record Layout
> 0 | struct nvmlProcessUtilizationSample_st
> 0 | unsigned int pid
> 8 | unsigned long long timeStamp
> 16 | unsigned int smUtil
> 20 | unsigned int memUtil
> 24 | unsigned int encUtil
> 28 | unsigned int decUtil
> | [sizeof=32, align=8]
>
>
> Can this be the issue? Your layout has lots more padding - is it
> possible that you are just trying to read from padding?
>
Seems like it? Memory utilization now no longer returns 0 and I'm not
getting random numbers right now. I could have sworn the layout was from
the first jextract build though as NVML was the first native library I
used it on.
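As a cross-check on the clang dump quoted above, the offsets can be recomputed by hand. This is a rough sketch in plain Java (no FMA API involved), assuming each scalar field is aligned to its own size as on a typical x86-64 ABI. The class and method names are made up for the example.

```java
final class StructLayoutSketch {
    /** Returns each field's offset, followed by the total struct size as the last element. */
    static long[] layout(long[] fieldSizes) {
        long[] result = new long[fieldSizes.length + 1];
        long offset = 0;
        long maxAlign = 1;
        for (int i = 0; i < fieldSizes.length; i++) {
            long align = fieldSizes[i]; // assumption: scalar alignment equals scalar size
            offset = roundUp(offset, align); // insert padding before the field if needed
            result[i] = offset;
            offset += fieldSizes[i];
            maxAlign = Math.max(maxAlign, align);
        }
        // Tail padding: the struct size is a multiple of its strictest field alignment.
        result[fieldSizes.length] = roundUp(offset, maxAlign);
        return result;
    }

    static long roundUp(long value, long align) {
        return (value + align - 1) / align * align;
    }

    public static void main(String[] args) {
        // pid, timeStamp, smUtil, memUtil, encUtil, decUtil (sizes in bytes)
        long[] sizes = {4, 8, 4, 4, 4, 4};
        // prints [0, 8, 16, 20, 24, 28, 32]
        System.out.println(java.util.Arrays.toString(layout(sizes)));
    }
}
```

The only padding is the 4 bytes between pid and timeStamp, giving 32 bytes in total, so a 48-byte layout would place some fields over padding or over the next array element.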
However, the numbers don't match what I'm getting from another
third-party application and I don't know why. It seems to report stale
numbers unless a process is generating significant load. Maybe it's just
an issue with the timestamp.
Anyway, thanks.
> Maurizio
>
>
> On 14/05/2020 01:21, Ty Young wrote:
>>
>> On 5/13/20 7:01 PM, Maurizio Cimadamore wrote:
>>>
>>> Btw - nice-looking app! (I looked at the pic :-) )
>>>
>>
>> Thanks!
>>
>>
>>> If I understand correctly, the place where you are getting garbage
>>> values out of is this:
>>>
>>> https://github.com/BlueGoliath/java-nvidia-bindings/blob/master/modules/org.goliath.bindings.nvml/src/main/java/org/goliath/bindings/nvml/main/nvml_h.java#L426
>>>
>>>
>>> More specifically, after the call, the array of
>>> nvmlProcessUtilizationSample_t doesn't contain what you think it
>>> should contain. Am I correct?
>>>
>>
>> Right, almost as if the memory isn't being sliced correctly.
>> Although, I'm not sure how incorrectly sliced memory, if zero'd,
>> would give those numbers to begin with.
>>
>>
>>> Can I see the client code which calls this function, so that I can
>>> take a look at all the pieces?
>>>
>>
>> Of course:
>>
>>
>> https://github.com/BlueGoliath/GoliathEnviousNative/blob/master/modules/org.goliath.envious.nvml/src/main/java/org/goliath/envious/nvml/local/attributes/NVMLGPUProcessAttributeData.java
>>
>>
>>
>> Be warned though, the code isn't as pretty as the GUI.
>>
>>
>>> Thanks
>>> Maurizio
>>>
>>> On 14/05/2020 00:55, Ty Young wrote:
>>>>
>>>> On 5/13/20 6:38 PM, Maurizio Cimadamore wrote:
>>>>> Hi,
>>>>> is this a regression? E.g. did this work before and now it started
>>>>> to behave differently all of a sudden (e.g. after a rebuild on
>>>>> panama) or is this a new function you are trying to call and you
>>>>> are getting an odd behavior?
>>>>
>>>>
>>>> Not sure.
>>>>
>>>>
>>>> After converting everything to FMA from pointer it started giving
>>>> me 0 for everything where the Pointer API would give me seemingly
>>>> correct non-zero values the majority of the time, but would
>>>> sometimes give random garbage. Because the old Pointer API never
>>>> zero'd memory I have no idea if those values were valid or not, so
>>>> I didn't think much of always getting 0.
>>>>
>>>>
>>>> Yesterday I did some cleanups in the OO code(layer under JavaFX),
>>>> including converting NativeValue<Integer> instances to
>>>> NativeInteger(same for longlong) and it started doing this, which I
>>>> think is partially correct: if I start a GPU benchmarking
>>>> application(Unigine Superposition) and view the processes content
>>>> in the GUI, I do see seemingly correct utilization rates that match
>>>> in-app On-Screen-Display FPS.
>>>>
>>>>
>>>> The issue is with Memory Utilization and Video encoder/decoder
>>>> Utilization.
>>>>
>>>>
>>>>>
>>>>> Maurizio
>>>>>
>>>>> On 14/05/2020 00:00, Ty Young wrote:
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>> Currently I'm getting random values[1] from this NVML
>>>>>> function[2]. I've spent a few hours dumping sizes and re-checking
>>>>>> my abstraction layer code in order to figure out why it's doing
>>>>>> this but am not seeing anything. I'm wondering if there were any
>>>>>> recent bug fixes in FMA that might cause this that were fixed. If
>>>>>> not I'm going to have to try asking on the Nvidia forums.
>>>>>>
>>>>>>
>>>>>> For reference, the function binding can be found here:
>>>>>>
>>>>>>
>>>>>> https://github.com/BlueGoliath/java-nvidia-bindings/blob/master/modules/org.goliath.bindings.nvml/src/main/java/org/goliath/bindings/nvml/main/nvml_h.java#L426
>>>>>>
>>>>>>
>>>>>>
>>>>>> and the abstraction layer here:
>>>>>>
>>>>>>
>>>>>> https://github.com/BlueGoliath/Crosspoint/tree/master/src/main/java/org/goliath/crosspoint
>>>>>>
>>>>>>
>>>>>>
>>>>>> I'm able to read/write other structs just fine, such as:
>>>>>>
>>>>>>
>>>>>> https://github.com/BlueGoliath/java-nvidia-bindings/blob/master/modules/org.goliath.bindings.nvctrl/src/main/java/org/goliath/bindings/nvctrl/structs/NVCTRLAttributeValidValuesRec.java
>>>>>>
>>>>>>
>>>>>>
>>>>>> and again, all byte sizes seem correct(48 bytes for the NVML
>>>>>> struct), so I'm really lost here.
>>>>>>
>>>>>>
>>>>>>
>>>>>> [1] https://imgur.com/a/wrQtOXq
>>>>>>
>>>>>> [2]
>>>>>> https://docs.nvidia.com/deploy/nvml-api/group__nvmlGridQueries.html#group__nvmlGridQueries_1gb0ea5236f5e69e63bf53684a11c233bd
>>>>>>