Random values from NVML functions

Ty Young youngty1997 at gmail.com
Fri May 15 00:52:45 UTC 2020


On 5/14/20 7:17 PM, Maurizio Cimadamore wrote:
>
> I took a look at the code, I couldn't spot anything obviously wrong 
> with the way the function in question is used, but maybe I found 
> something.
>
>
> One thing I noted was that you seem to call longValue() on the 
> NativeInteger modelling the "samplesSizePointer". Perhaps you do that 
> because in the API that pointer is an unsigned int pointer. But that's 
> 32 bit and if you read it as a negative int value (because of 
> overflow), casting to long as you do here:
>
> https://github.com/BlueGoliath/Crosspoint/blob/master/src/main/java/org/goliath/crosspoint/numbers/NativeInteger.java#L29
>
> Isn't going to help much. Would be better to just use intValue() and 
> then just call Integer.toUnsignedLong if you really want to protect 
> against that.
>

There wasn't any reason other than that the array's constructor accepts 
a long value. The function should never return negatives.
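
For reference, a minimal sketch of the difference between a plain 
widening cast and Integer.toUnsignedLong when the 32-bit value has its 
top bit set. The class and method names (UnsignedReadSketch, 
toSampleCount) are made up for illustration and are not part of 
Crosspoint or the bindings:

```java
// Hypothetical helper, not taken from Crosspoint: shows how a C
// "unsigned int" read into a Java int can be widened to long safely.
final class UnsignedReadSketch {

    // Treats the 32 bits of raw as an unsigned value.
    static long toSampleCount(int raw) {
        return Integer.toUnsignedLong(raw);
    }

    public static void main(String[] args) {
        int raw = 0xFFFFFFF0; // top bit set, so negative as a Java int

        // A plain cast sign-extends and keeps the value negative.
        System.out.println((long) raw);         // prints -16

        // The unsigned conversion zero-extends instead.
        System.out.println(toSampleCount(raw)); // prints 4294967280
    }
}
```

For values below 2^31, which is what the sample count should always be 
in practice, both conversions give the same result.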


>
> On a more weird note, the layout for nvmlProcessUtilizationSample_t 
> seems off:
>
> https://github.com/BlueGoliath/java-nvidia-bindings/blob/3445ea5dc42e3901942a328a4d990cde288d55e7/modules/org.goliath.bindings.nvml/src/main/java/org/goliath/bindings/nvml/structs/nvmlProcessUtilizationSample_t.java#L13
>
> I tried to grab the struct definition from the nvidia header and paste 
> it into a header file and then process it with clang:
>
> $ cat Foo.c
> struct nvmlProcessUtilizationSample_st
> {
>     unsigned int pid;                   //!< PID of process
>     unsigned long long timeStamp;       //!< CPU Timestamp in microseconds
>     unsigned int smUtil;                //!< SM (3D/Compute) Util Value
>     unsigned int memUtil;               //!< Frame Buffer Memory Util Value
>     unsigned int encUtil;               //!< Encoder Util Value
>     unsigned int decUtil;               //!< Decoder Util Value
> };
>
> struct nvmlProcessUtilizationSample_st str;
>
> $ clang -cc1 -fdump-record-layouts -emit-llvm Foo.c
>
> *** Dumping AST Record Layout
>          0 | struct nvmlProcessUtilizationSample_st
>          0 |   unsigned int pid
>          8 |   unsigned long long timeStamp
>         16 |   unsigned int smUtil
>         20 |   unsigned int memUtil
>         24 |   unsigned int encUtil
>         28 |   unsigned int decUtil
>            | [sizeof=32, align=8]
>
>
> Can this be the issue? Your layout has lots more padding - is it 
> possible that you are just trying to read from padding?
>

Seems like it? Memory utilization no longer returns 0 and I'm not 
getting random numbers right now. I could have sworn the layout came 
from the first jextract build, though, as NVML was the first native 
library I used it on.
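
For reference, the offsets in the clang dump above can be reproduced 
with a small sketch of the usual natural-alignment rule. This assumes 
each scalar field's alignment equals its size (4 for unsigned int, 8 
for unsigned long long, as in the dump), and the class and method names 
are made up for illustration:

```java
import java.util.Arrays;

// Hypothetical sketch, not jextract output: computes C struct field
// offsets assuming every field is a scalar aligned to its own size.
final class StructLayoutSketch {

    // Rounds offset up to the next multiple of align (a power of two).
    static long alignUp(long offset, long align) {
        return (offset + align - 1) & -align;
    }

    // Returns one offset per field, followed by the total struct size.
    static long[] offsets(int[] fieldSizes) {
        long[] result = new long[fieldSizes.length + 1];
        long offset = 0;
        long maxAlign = 1;
        for (int i = 0; i < fieldSizes.length; i++) {
            offset = alignUp(offset, fieldSizes[i]); // padding before field
            result[i] = offset;
            offset += fieldSizes[i];
            maxAlign = Math.max(maxAlign, fieldSizes[i]);
        }
        // Tail padding so that arrays of the struct stay aligned.
        result[fieldSizes.length] = alignUp(offset, maxAlign);
        return result;
    }

    public static void main(String[] args) {
        // pid, timeStamp, smUtil, memUtil, encUtil, decUtil
        int[] sizes = {4, 8, 4, 4, 4, 4};
        // prints [0, 8, 16, 20, 24, 28, 32]
        System.out.println(Arrays.toString(offsets(sizes)));
    }
}
```

The only padding is the 4 bytes between pid and timeStamp, and the last 
entry (32) is the sizeof reported by clang. A layout with a different 
element size would step through the sample array with the wrong stride.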


However, the numbers don't match what I'm getting from another 
third-party application and I don't know why. It seems to report stale 
numbers unless a process is using significant load. Maybe just an issue 
with the timestamp.


Anyway, thanks.


> Maurizio
>
>
> On 14/05/2020 01:21, Ty Young wrote:
>>
>> On 5/13/20 7:01 PM, Maurizio Cimadamore wrote:
>>>
>>> Btw - nice-looking app! (I looked at the pic :-) )
>>>
>>
>> Thanks!
>>
>>
>>> If I understand correctly, the place where you are getting garbage 
>>> values out of is this:
>>>
>>> https://github.com/BlueGoliath/java-nvidia-bindings/blob/master/modules/org.goliath.bindings.nvml/src/main/java/org/goliath/bindings/nvml/main/nvml_h.java#L426 
>>>
>>>
>>> More specifically, after the call, the array of 
>>> nvmlProcessUtilizationSample_t doesn't contain what you think it 
>>> should contain. Am I correct?
>>>
>>
>> Right, almost as if the memory isn't being sliced correctly. 
>> Although, I'm not sure how incorrectly sliced memory, if zero'd, 
>> would give those numbers to begin with.
>>
>>
>>> Can I see the client code which calls this function, so that I can 
>>> take a look at all the pieces?
>>>
>>
>> Of course:
>>
>>
>> https://github.com/BlueGoliath/GoliathEnviousNative/blob/master/modules/org.goliath.envious.nvml/src/main/java/org/goliath/envious/nvml/local/attributes/NVMLGPUProcessAttributeData.java 
>>
>>
>>
>> Be warned though, the code isn't as pretty as the GUI.
>>
>>
>>> Thanks
>>> Maurizio
>>>
>>> On 14/05/2020 00:55, Ty Young wrote:
>>>>
>>>> On 5/13/20 6:38 PM, Maurizio Cimadamore wrote:
>>>>> Hi,
>>>>> is this a regression? E.g. did this work before and then suddenly 
>>>>> start behaving differently (e.g. after a rebuild on panama), or is 
>>>>> this a new function you are trying to call and you are getting odd 
>>>>> behavior?
>>>>
>>>>
>>>> Not sure.
>>>>
>>>>
>>>> After converting everything from the Pointer API to FMA, it started 
>>>> giving me 0 for everything, whereas the Pointer API would give me 
>>>> seemingly correct non-zero values the majority of the time but would 
>>>> sometimes give random garbage. Because the old Pointer API never 
>>>> zero'd memory, I have no idea if those values were valid or not, so 
>>>> I didn't think much of always getting 0.
>>>>
>>>>
>>>> Yesterday I did some cleanups in the OO code(layer under JavaFX), 
>>>> including converting NativeValue<Integer> instances to 
>>>> NativeInteger(same for longlong) and it started doing this, which I 
>>>> think is partially correct: if I start a GPU benchmarking 
>>>> application(Unigine Superposition) and view the processes content 
>>>> in the GUI, I do see seemingly correct utilization rates that match 
>>>> in-app On-Screen-Display FPS.
>>>>
>>>>
>>>> The issue is with Memory Utilization and Video encoder/decoder 
>>>> Utilization.
>>>>
>>>>
>>>>>
>>>>> Maurizio
>>>>>
>>>>> On 14/05/2020 00:00, Ty Young wrote:
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>> Currently I'm getting random values[1] from this NVML 
>>>>>> function[2]. I've spent a few hours dumping sizes and re-checking 
>>>>>> my abstraction layer code in order to figure out why it's doing 
>>>>>> this but am not seeing anything. I'm wondering if there were any 
>>>>>> recent bugs in FMA that might cause this and have since been fixed. 
>>>>>> If not, I'm going to have to try asking on the Nvidia forums.
>>>>>>
>>>>>>
>>>>>> For reference, the function binding can be found here:
>>>>>>
>>>>>>
>>>>>> https://github.com/BlueGoliath/java-nvidia-bindings/blob/master/modules/org.goliath.bindings.nvml/src/main/java/org/goliath/bindings/nvml/main/nvml_h.java#L426 
>>>>>>
>>>>>>
>>>>>>
>>>>>> and the abstraction layer here:
>>>>>>
>>>>>>
>>>>>> https://github.com/BlueGoliath/Crosspoint/tree/master/src/main/java/org/goliath/crosspoint 
>>>>>>
>>>>>>
>>>>>>
>>>>>> I'm able to read/write other structs just fine, such as:
>>>>>>
>>>>>>
>>>>>> https://github.com/BlueGoliath/java-nvidia-bindings/blob/master/modules/org.goliath.bindings.nvctrl/src/main/java/org/goliath/bindings/nvctrl/structs/NVCTRLAttributeValidValuesRec.java 
>>>>>>
>>>>>>
>>>>>>
>>>>>> and again, all byte sizes seem correct(48 bytes for the NVML 
>>>>>> struct), so I'm really lost here.
>>>>>>
>>>>>>
>>>>>>
>>>>>> [1] https://imgur.com/a/wrQtOXq
>>>>>>
>>>>>> [2] 
>>>>>> https://docs.nvidia.com/deploy/nvml-api/group__nvmlGridQueries.html#group__nvmlGridQueries_1gb0ea5236f5e69e63bf53684a11c233bd
>>>>>>


More information about the panama-dev mailing list