performance and memory optimization of layouts

Thu Aug 13 15:27:30 UTC 2020

On 13/08/2020 15:53, Ty Young wrote:
>
> On 8/13/20 8:46 AM, Maurizio Cimadamore wrote:
>> I can no longer find your repository.
>>
>> I think I've suggested something in the past related to a similar
>> issue, not sure if you acted on it or not.
>>
>> Basically, the suggestion was to define a set of your own layout 
>> constants, which contained a special attribute which could be used 
>> for deciding whether something is a NativeInteger, or something else. 
>> This is the same approach used by the ABI layer and works very well.
>>
>> With something like that there is no need to do an equals() - you
>> just have to get the value of a well-known attribute (e.g. a lookup
>> in a HashMap).
>
>
> I am doing constants for layouts already.
>
>
> Regardless, doing this still generates a lot of garbage and presumably 
> isn't efficient CPU wise either since you're accessing a HashMap 
> under-the-hood. Again, this is being done in order to make sense of 
> struct fields in quick succession. Each struct field needs safety 
> attribute checks, which have to check if certain attributes of a
> ValueLayout exist (e.g. class, handle, type, etc.). Each struct is
> stored in an array and there are multiple arrays of structs.
>
>
> Without generating garbage and taking whatever CPU time the HashMap 
> accessing takes, I can't see a way of doing this without changes from 
> FMA's end. What you're suggesting, if I'm understanding correctly, can 
> only be done with the least amount of garbage and CPU time if 
> ValueLayout was extended so that an instanceof check could be used.
>
>
So, this started as - how can I avoid using equals() for layouts, since 
that's slow (and there's no way to speed that up, since it has to 
compare everything).

It seems like (but again I'm musing, since I cannot see your code) that 
you need the equality test to check that the layout is a "known" one, 
and, if so, create one wrapper or another.

By using attributes, the need for equality disappears. Will
performance still suffer? I don't know, but I doubt that you are going
to be affected by the HashMap lookup.
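
As a rough illustration of the attribute idea, here is a minimal sketch.
It deliberately does not use the real layout API: Layout, Kind, the
"example/kind" attribute name and every wrapper name other than
NativeInteger are hypothetical stand-ins, chosen only to show that one
attribute lookup plus a switch can replace a chain of equals() calls.

```java
import java.util.Map;
import java.util.Optional;

final class AttributeDispatch {
    // Hypothetical classification tag attached to each layout constant.
    enum Kind { BYTE, INT, LONG, POINTER }

    // Hypothetical stand-in for a layout: a size plus named attributes.
    static final class Layout {
        private final long bitSize;
        private final Map<String, Object> attributes;

        Layout(long bitSize, Map<String, Object> attributes) {
            this.bitSize = bitSize;
            this.attributes = attributes;
        }

        long bitSize() {
            return bitSize;
        }

        Optional<Object> attribute(String name) {
            return Optional.ofNullable(attributes.get(name));
        }
    }

    // Assumed attribute name; any well-known key would do.
    static final String KIND = "example/kind";

    // Layout constants defined once, each carrying its own classification.
    static final Layout C_INT = new Layout(32, Map.of(KIND, Kind.INT));
    static final Layout C_LONG = new Layout(64, Map.of(KIND, Kind.LONG));
    static final Layout C_POINTER = new Layout(64, Map.of(KIND, Kind.POINTER));

    // Picks a wrapper name from the attribute alone; no equals() on layouts.
    static String wrapperFor(Layout layout) {
        Object kind = layout.attribute(KIND)
                .orElseThrow(() -> new IllegalArgumentException("unclassified layout"));
        switch ((Kind) kind) {
            case BYTE:    return "NativeByte";
            case INT:     return "NativeInteger";
            case LONG:    return "NativeLong";
            case POINTER: return "NativePointer";
            default:      throw new AssertionError(kind);
        }
    }

    public static void main(String[] args) {
        System.out.println(wrapperFor(C_INT));
        System.out.println(wrapperFor(C_POINTER));
    }
}
```

Note that C_LONG and C_POINTER have the same size here and are told apart
only by the attribute, which is the point: the classification travels
with the constant instead of being rediscovered by comparison.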

As for the garbage, I think you are perhaps giving too much importance
to it. Yes, if the API returns Optional (curious, in another thread you
suggested changing a lookup function to return Optional :-) ), there
will be some allocation. But the GC is typically very (very very very)
smart about getting rid of objects that are discarded soon after they
are created. So, don't assume that every object being allocated will
affect the performance of your application in the same way. In fact, I'd
be surprised if performance were affected at all in this particular case.

Profilers like the one you are using are not always the best tool to
measure performance; they are good at finding obvious issues (and in
this case, perhaps, the repeated calls to equals() are such an issue),
but the information they report should always be taken with a pinch of
salt. Time and again I have fixed what looked like an obvious
performance pothole in JVisualVM just to see that, after the fix, the
numbers were unaffected (or not _as affected_ as the profiler was
suggesting).

That said, stepping back, if you need performance to be truly great,
you need to rethink the API to minimize the amount of guessing that goes
on every time a native object is to be created. Going straight from
layout to native object, which has been the approach you have been
pursuing since the start, has the obvious issue that, in order to create
a native object for a structured layout, you need to inspect the entire
layout and "classify" it. While this is possible, of course performance
isn't going to be phenomenal.

It seems to me that you need to separate the high-level API (native
objects) more cleanly from the low-level API (memory access), so that
complex native objects can be constructed with builders (w/o guessing).
Underneath, these objects will have some layout or segment associated,
but that doesn't have to be the front door by which your objects are
created.

But (also IIRC), your API is intrinsically megamorphic - e.g. there's
one common base class for all structs, and all accesses to fields happen
by doing pseudo-reflective lookups on the layout object. This way, the
code is almost guaranteed not to perform optimally; the sweet spot
would be for each native struct object to have its own class, with a
static layout and a set of accessor methods, where each accessor
method boils down to a simple VarHandle call (and the VarHandles for
the various fields are also stored as constants in the class). But I
don't think you are doing that, so I don't see how, even past the layout
attribute/equals() issue that you have now, the access performance
provided by your API can be considered acceptable (it might be in your
particular use case, but it is certainly not the case in general).
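
The "one class per struct, VarHandles as constants" shape can be sketched
as follows. This is only an illustration under stated assumptions: the
ProcessInfo struct, its two fields and their offsets are hypothetical,
and a plain ByteBuffer stands in for a memory segment so that the example
runs on a stock JDK (MethodHandles.byteBufferViewVarHandle is a standard
java.lang.invoke method, not part of the foreign memory API).

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical struct { int pid; long usedMemory; }, long aligned at offset 8.
final class ProcessInfo {
    static final int SIZE = 16;                     // assumed total size in bytes
    private static final int PID_OFFSET = 0;         // assumed field offsets
    private static final int USED_MEMORY_OFFSET = 8;

    // One VarHandle per field type, created once and kept as constants.
    private static final VarHandle PID =
            MethodHandles.byteBufferViewVarHandle(int[].class, ByteOrder.nativeOrder());
    private static final VarHandle USED_MEMORY =
            MethodHandles.byteBufferViewVarHandle(long[].class, ByteOrder.nativeOrder());

    // Backing memory; a ByteBuffer stands in for a memory segment here.
    private final ByteBuffer memory;

    ProcessInfo(ByteBuffer memory) {
        if (memory.capacity() < SIZE) {
            throw new IllegalArgumentException("buffer too small for ProcessInfo");
        }
        this.memory = memory;
    }

    // Each accessor is a single VarHandle call at a constant byte offset.
    int pid() {
        return (int) PID.get(memory, PID_OFFSET);
    }

    void pid(int value) {
        PID.set(memory, PID_OFFSET, value);
    }

    long usedMemory() {
        return (long) USED_MEMORY.get(memory, USED_MEMORY_OFFSET);
    }

    void usedMemory(long value) {
        USED_MEMORY.set(memory, USED_MEMORY_OFFSET, value);
    }

    public static void main(String[] args) {
        ProcessInfo info = new ProcessInfo(ByteBuffer.allocate(SIZE));
        info.pid(1234);
        info.usedMemory(512L * 1024 * 1024);
        System.out.println(info.pid() + " uses " + info.usedMemory() + " bytes");
    }
}
```

Nothing in the accessors inspects a layout at run time: the field
offsets and handles are fixed when the class is loaded, so there is no
per-access classification left to do.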

Maurizio

>
>
>
>
>>
>> Maurizio
>>
>> On 13/08/2020 14:06, Ty Young wrote:
>>> Hi,
>>>
>>>
>>> I took a little time to look into optimizing the performance of my 
>>> abstraction layer as FMA hasn't changed in any radical, breaking way 
>>> and I'm happy with the overall design of my abstraction layer.
>>>
>>>
>>> In order to look into what could be optimized, I set the number of
>>> worker threads in my JavaFX application to 1 so that Nvidia
>>> attribute updates are done in a linear fashion, which makes it
>>> easier to reason about how much of a performance impact any given
>>> one has and why. I then used NetBeans' built-in profiler to view
>>> where the CPU time was being taken. Runnables to be updated are
>>> given to the worker thread pool every 500 ms.
>>>
>>>
>>> Unsurprisingly to me, besides PCIe TX/RX attributes which supposedly
>>> are hung up within NVML itself, the attribute that represents GPU
>>> processes is the worst by far (see img1). This attribute is actually
>>> multiple native function calls jammed into one attribute, all of
>>> which utilize arrays of structs.
>>>
>>>
>>> Viewing the call tree (see img2) shows that a major contributor to
>>> this time is ValueLayout.equals(), but there is
>>> some self-time in the upper NativeObject.getNativeObject() and
>>> NativeValue.ofUnsafeValueeLayout calls as well. ValueLayout.equals()
>>> is used in an if-else chain because you need to know which
>>> NativeValue implementation should be returned. If the layout is an
>>> integer then return NativeInteger, for example. It is maybe possible
>>> to order this if-else chain in a way that may return faster results
>>> without hitting every else-if (e.g. bytes first, then integers, then
>>> longs, etc.) but that's always going to be a presumptuous, arbitrary
>>> order that may not actually be faster in some situations.
>>>
>>>
>>> What could be done to improve this? I can't think of any absolute 
>>> fixes but an improvement would be to extend the ValueLayout so that 
>>> you have a NumberLayout and a PointerLayout. You could then use 
>>> instanceof to presumably filter things faster and more cheaply so 
>>> that the mentioned else-if chain does not need to check for a 
>>> pointer layout. The PointerLayout specific checks could be moved to 
>>> its own static method. It's a small change, but it's presumably an 
>>> improvement even if small.
>>>
>>>
>>> Unfortunately I can't do this myself because of sealed types so here 
>>> I am.
>>>
>>>
>>> Another thing that needs optimizing is the memory allocation waste
>>> of getting an attribute. Every call to attribute(String name)
>>> allocates a new Optional instance, which is often used by my
>>> abstraction for a check and then immediately discarded. I wanted to
>>> do a bunch of layout checks to make sure that the MemoryLayout is
>>> valid, but after seeing the amount of garbage being generated
>>> stand out like a sore thumb, I decided to remove those
>>> checks (they are really important too). The amount of memory wasted
>>> wasn't worth it. The answer to this is presumably going to be value
>>> types, but it isn't clear when they are going to be delivered.
>>>
>>>
>>> Once again, if MemoryLayout and its extensions weren't sealed I
>>> could do things to improve both performance and memory waste as well
>>> as fix other issues, like attributes being factored into equality
>>> checks when that isn't wanted. Yes, I realize I'm beating a dead
>>> horse at this point but that dead horse is still causing issues.
>>>
>>>
>>> Could the suggested ValueLayout changes be done, at the very least?
>>> Or maybe some kind of equals() performance optimization or something?
>>>
>>>