performance and memory optimization of layouts

Thu Aug 13 16:56:27 UTC 2020

On 8/13/20 10:27 AM, Maurizio Cimadamore wrote:
>
> On 13/08/2020 15:53, Ty Young wrote:
>>
>> On 8/13/20 8:46 AM, Maurizio Cimadamore wrote:
>>> I can no longer find your repository.
>>>
>>> I think I've suggested something in the past related to a similar 
>>> issue, not sure if you acted on it or not.
>>>
>>> Basically, the suggestion was to define a set of your own layout 
>>> constants, which contained a special attribute which could be used 
>>> for deciding whether something is a NativeInteger, or something 
>>> else. This is the same approach used by the ABI layer and works very 
>>> well.
>>>
>>> With something like that there is no need to do an equals() - you 
>>> just have to get the value of a well-known attribute (e.g. a lookup 
>>> in a HashMap).
>>
>>
>> I am doing constants for layouts already.
>>
>>
>> Regardless, doing this still generates a lot of garbage and 
>> presumably isn't efficient CPU-wise either, since you're accessing a 
>> HashMap under the hood. Again, this is being done in order to make 
>> sense of struct fields in quick succession. Each struct field needs 
>> safety attribute checks, which have to check whether certain 
>> attributes of a ValueLayout exist (e.g. class, handle, type, etc.). 
>> Each struct is stored in an array and there are multiple arrays of 
>> structs.
>>
>>
>> Without generating garbage and taking whatever CPU time the HashMap 
>> accessing takes, I can't see a way of doing this without changes from 
>> FMA's end. What you're suggesting, if I'm understanding correctly, 
>> can only be done with the least amount of garbage and CPU time if 
>> ValueLayout was extended so that an instanceof check could be used.
>>
>>
> So, this started as - how can I avoid using equals() for layouts, 
> since that's slow (and there's no way to speed that up, since it has 
> to compare everything).
>
> It seems like (but again I'm musing, since I cannot see your code) 
> that you need the equality test to check that the layout is a "known" 
> one, and, if so, create one wrapper or another.
>
> By using attributes, the need for equality disappears. Will 
> performance still suffer? I don't know, but I doubt that you are 
> going to be affected by the HashMap lookup.
>
> As for the garbage, I think you are perhaps giving too much importance 
> to it. Yes, if the API returns Optional (curious, in another thread 
> you suggested to change a lookup function to return optional :-) ), 
> there will be some allocation. But the GC is typically very (very very 
> very) smart about getting rid of objects that are discarded soon after 
> they are created. So, don't assume that every object being allocated 
> will affect the performance of your application in the same way. In 
> fact, I'd be surprised if performance were affected at all in this 
> particular case.
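
For reference, the attribute approach described in the quoted text could be sketched roughly as below. This is only an illustration under stated assumptions: Layout, Kind, the "example.kind" attribute name and the wrapper names returned as strings are all hypothetical stand-ins, not the FMA API and not code from either party in this thread. The point is only that a single attribute lookup on a pre-tagged constant can replace a chain of equals() comparisons.

```java
import java.util.Map;
import java.util.Optional;

final class AttributeDispatch {
    // Hypothetical kinds an abstraction layer might distinguish.
    enum Kind { BYTE, INT, LONG, POINTER }

    // Hypothetical stand-in for a layout that carries named attributes.
    static final class Layout {
        private final long bitSize;
        private final Map<String, Object> attributes;

        Layout(long bitSize, Map<String, Object> attributes) {
            this.bitSize = bitSize;
            this.attributes = attributes;
        }

        long bitSize() {
            return bitSize;
        }

        // Mirrors the "returns Optional" shape discussed in the thread.
        Optional<Object> attribute(String name) {
            return Optional.ofNullable(attributes.get(name));
        }
    }

    // Assumed attribute name, chosen for this sketch only.
    static final String KIND = "example.kind";

    // Layout constants are tagged once, up front.
    static final Layout C_INT = new Layout(32, Map.of(KIND, Kind.INT));
    static final Layout C_LONG = new Layout(64, Map.of(KIND, Kind.LONG));
    // Same size as C_LONG; only the attribute tells the two apart.
    static final Layout C_POINTER = new Layout(64, Map.of(KIND, Kind.POINTER));

    // One attribute lookup decides which wrapper to create.
    static String wrapperFor(Layout layout) {
        Kind kind = (Kind) layout.attribute(KIND)
                .orElseThrow(() -> new IllegalArgumentException("untagged layout"));
        switch (kind) {
            case BYTE: return "NativeByte";
            case INT: return "NativeInteger";
            case LONG: return "NativeLong";
            case POINTER: return "NativePointer";
            default: throw new AssertionError(kind);
        }
    }

    public static void main(String[] args) {
        System.out.println(wrapperFor(C_INT));
        System.out.println(wrapperFor(C_POINTER));
    }
}
```

Whether the Optional allocated per lookup matters in practice is exactly the open question in this thread; the sketch does not settle it.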

The Optional suggestion was in the context of something that is most 
likely done once per application start, not a continuous operation. I'm 
not flip-flopping.

I don't have a particularly positive opinion of Java's garbage 
collectors in the context of desktop applications. I don't want to get 
into a verbal fight or go off topic, but if you think continuously 
expanding the heap size for no real reason, as ZGC or Shenandoah does, 
instead of doing a GC is "smart", then we must have polar opposite 
ideas as to what "smart" means. Even G1, as I've found out recently, 
does things that are incredibly "smart" under some conditions and/or 
JRE builds (server vs. client). I'd love to get answers as to why they 
are so "smart", to be frank, but I don't know where to ask.

(For desktop applications they really are terrible. I'm sorry if it 
offends anyone, but they *really* are.)

Whatever, off-topic. I'd rather not cross my fingers and hope that 
things beyond my control magically work perfectly. Hopefully that's 
somewhat understandable.

>
> Profilers like the one you are using are not always the best tool to 
> measure performance; they are good at finding obvious issues (and in 
> this case, perhaps, the repeated calls to equals() are such an issue), 
> but the information they report should always be taken with a pinch of 
> salt. It happened to me time and again to fix what looked like an 
> obvious performance pothole in JVisualVM just to see that, after the 
> fix, the numbers were unaffected (or not _as affected_ as the profiler 
> was suggesting).

What tools do JDK developers use then? How do you know code you write 
hits every JVM optimization technique? How do you verify that you 
actually hit those optimizations?

>
> That said, stepping back, if you need performance to be truly great, 
> you need to rethink the API to minimize the amount of guessing that 
> goes on every time a native object is to be created. Going straight 
> from layout to native object, which has been the approach you have 
> been pursuing since the start, has the obvious issue that, in order to 
> create a native object for a structured layout, you need to inspect 
> the entire layout and "classify" it. While this is possible, of course 
> performance isn't going to be phenomenal.
>
> It seems to me that you need to separate the high-level API (native 
> objects) more clearly from the low-level API (memory access), so that 
> maybe complex native objects can be constructed with builders (w/o 
> guessing). Underneath, these objects will have some layouts or 
> segments associated, but that doesn't have to be the front door by 
> which your objects are created.
>
> But (also IIRC), your API is intrinsically megamorphic - e.g. there's 
> one common base class for all structs, and all accesses to fields 
> happen by doing pseudo-reflective lookups on the layout object. This 
> way, the code is almost guaranteed not to perform optimally; the 
> sweet spot would be for each native struct object to have its own 
> class, and have a static layout, as well as a set of accessor methods, 
> where each accessor method boils down to a simple VarHandle call 
> (where the VarHandles for the various fields are also stored as 
> constants in the class). But I don't think you are doing that, so I 
> don't see how, even past the layout attribute/equals() issue that you 
> have now, the access performance provided by your API can be 
> considered acceptable (it might be in your particular use case, but it 
> is certainly not the case in general).
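
To make the "one class per struct, VarHandles as constants" shape concrete, here is a minimal sketch. It assumes a hypothetical two-field struct (an int pid followed by a long usedMemory, padded to 16 bytes) and uses the standard java.lang.invoke byte-buffer view VarHandles as a stand-in for handles derived from an FMA layout, so the names, offsets and the ByteBuffer backing are assumptions made for illustration, not anyone's actual code.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical struct: struct process_info { int pid; long usedMemory; }
final class ProcessInfo {
    // Static "layout": field offsets and total size in bytes (assumed).
    static final int PID_OFFSET = 0;
    static final int USED_MEMORY_OFFSET = 8; // 4 bytes of padding after pid
    static final int SIZE = 16;

    // VarHandles stored as constants so the JIT can treat them as such.
    private static final VarHandle INT_HANDLE =
            MethodHandles.byteBufferViewVarHandle(int[].class, ByteOrder.nativeOrder());
    private static final VarHandle LONG_HANDLE =
            MethodHandles.byteBufferViewVarHandle(long[].class, ByteOrder.nativeOrder());

    private final ByteBuffer memory;
    private final int base; // byte offset of this struct inside the buffer

    ProcessInfo(ByteBuffer memory, int base) {
        this.memory = memory;
        this.base = base;
    }

    // Each accessor is a single VarHandle call; no layout inspection at runtime.
    int pid() {
        return (int) INT_HANDLE.get(memory, base + PID_OFFSET);
    }

    void pid(int value) {
        INT_HANDLE.set(memory, base + PID_OFFSET, value);
    }

    long usedMemory() {
        return (long) LONG_HANDLE.get(memory, base + USED_MEMORY_OFFSET);
    }

    void usedMemory(long value) {
        LONG_HANDLE.set(memory, base + USED_MEMORY_OFFSET, value);
    }

    public static void main(String[] args) {
        // An "array" of two structs in one buffer; view the second element.
        ByteBuffer buffer = ByteBuffer.allocateDirect(2 * SIZE);
        ProcessInfo second = new ProcessInfo(buffer, SIZE);
        second.pid(1234);
        second.usedMemory(1L << 33);
        System.out.println(second.pid() + " " + second.usedMemory());
    }
}
```

An array of structs then becomes one buffer plus a base offset per element, with no per-field classification at access time.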

I wouldn't ever recommend something like what I made for any 
performance-critical use case, or claim it was ever good for it. Use 
cases like that probably have situations where performance can and 
should be improved on a case-by-case basis.

That said, it doesn't mean things can't be optimized as much as 
possible. I hope the logic of "<X> will never be as good as <Y> so why 
bother trying?" isn't being used here. Many Java language features would 
surely never be implemented if this was the mentality, yeah?

I vaguely remember it being said that FMA is being built so that 
abstraction layers such as mine are able to exist, *presumably* with 
reasonable performance given the purpose of the abstraction layers. Is 
this unreasonable or something? If there is a better way that I haven't 
thought of, I'd love to hear it. I've got nothing.

>
> Maurizio
>
>
>>
>>
>>
>>
>>>
>>> Maurizio
>>>
>>> On 13/08/2020 14:06, Ty Young wrote:
>>>> Hi,
>>>>
>>>>
>>>> I took a little time to look into optimizing the performance of my 
>>>> abstraction layer as FMA hasn't changed in any radical, breaking 
>>>> way and I'm happy with the overall design of my abstraction layer.
>>>>
>>>>
>>>> In order to look into what could be optimized, I set the number of 
>>>> worker threads in my JavaFX application to 1 so that Nvidia 
>>>> attribute updates are done in a linear fashion, which makes it 
>>>> easier to reason about how much of a performance impact any given 
>>>> one has and why. I then used NetBeans' built-in profiler to view 
>>>> where the CPU time was being taken. Runnables to be updated are 
>>>> given to the worker thread pool every 500 ms.
>>>>
>>>>
>>>> Unsurprisingly to me, besides the PCIe TX/RX attributes, which 
>>>> supposedly are hung up within NVML itself, the attribute that 
>>>> represents GPU processes is the worst by far (see img1). This 
>>>> attribute is actually multiple native function calls jammed into 
>>>> one attribute, all of which utilize arrays of structs.
>>>>
>>>>
>>>> Viewing the call tree (see img2) shows that a major contributor to 
>>>> this is ValueLayout.equals(), but there is some self-time in the 
>>>> upper NativeObject.getNativeObject() and 
>>>> NativeValue.ofUnsafeValueeLayout calls as well. 
>>>> ValueLayout.equals() is used in an if-else chain because you need 
>>>> to know which NativeValue implementation should be returned. If the 
>>>> layout is an integer then return NativeInteger, for example. It may 
>>>> be possible to order this if-else chain so that it returns faster 
>>>> without hitting every else-if (e.g. bytes first, then integers, 
>>>> then longs, etc.), but that's always going to be a presumptuous, 
>>>> arbitrary order that may not actually be faster in some situations.
>>>>
>>>>
>>>> What could be done to improve this? I can't think of any absolute 
>>>> fixes but an improvement would be to extend the ValueLayout so that 
>>>> you have a NumberLayout and a PointerLayout. You could then use 
>>>> instanceof to presumably filter things faster and more cheaply so 
>>>> that the mentioned else-if chain does not need to check for a 
>>>> pointer layout. The PointerLayout specific checks could be moved to 
>>>> its own static method. It's a small change, but presumably still 
>>>> an improvement.
>>>>
>>>>
>>>> Unfortunately I can't do this myself because of sealed types so 
>>>> here I am.
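
The instanceof idea in the quoted paragraph above could look roughly like the sketch below. NumberLayout and PointerLayout are hypothetical subtypes that do not exist in the API discussed here; they are declared locally only to show how the pointer case is filtered out by type before any size-based classification, with the pointer-specific checks moved to their own static method. The wrapper names are returned as strings purely for illustration.

```java
final class InstanceofDispatch {
    // Local stand-ins for the proposed (hypothetical) layout subtypes.
    interface ValueLayout {
        long bitSize();
    }

    static final class NumberLayout implements ValueLayout {
        private final long bitSize;

        NumberLayout(long bitSize) {
            this.bitSize = bitSize;
        }

        @Override
        public long bitSize() {
            return bitSize;
        }
    }

    static final class PointerLayout implements ValueLayout {
        @Override
        public long bitSize() {
            return 64; // assumed pointer width for this sketch
        }
    }

    static String wrapperFor(ValueLayout layout) {
        // Type test first: pointers never reach the numeric chain.
        if (layout instanceof PointerLayout) {
            return pointerWrapperFor((PointerLayout) layout);
        }
        if (layout instanceof NumberLayout) {
            switch ((int) layout.bitSize()) {
                case 8: return "NativeByte";
                case 32: return "NativeInteger";
                case 64: return "NativeLong";
                default:
                    throw new IllegalArgumentException("unsupported size: " + layout.bitSize());
            }
        }
        throw new IllegalArgumentException("unknown layout type");
    }

    // Pointer-specific checks would live here, in their own static method.
    static String pointerWrapperFor(PointerLayout layout) {
        return "NativePointer";
    }

    public static void main(String[] args) {
        System.out.println(wrapperFor(new NumberLayout(32)));
        System.out.println(wrapperFor(new PointerLayout()));
    }
}
```
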
>>>>
>>>>
>>>> Another thing that needs optimizing is the memory allocation waste 
>>>> of getting an attribute. Every call to attribute(String name) 
>>>> allocates a new Optional instance, which is often used by my 
>>>> abstraction for a check and then immediately discarded. I wanted to 
>>>> do a bunch of layout checks to make sure that the MemoryLayout is 
>>>> valid, but after seeing the amount of garbage being generated 
>>>> stand out like a sore thumb, I decided to remove those checks 
>>>> (they are really important too). The amount of memory wasted 
>>>> wasn't worth it. The answer to this is presumably going to be value 
>>>> types, but it isn't clear when they are going to be delivered.
>>>>
>>>>
>>>> Once again, if MemoryLayout and its extensions weren't sealed I 
>>>> could do things to improve both performance and memory waste, as 
>>>> well as fix other issues, like attributes being factored into 
>>>> equality checks when that isn't wanted. Yes, I realize I'm beating 
>>>> a dead horse at this point, but that dead horse is still causing 
>>>> issues.
>>>>
>>>>
>>>> Could the suggested ValueLayout changes be done, at the very least? 
>>>> Or maybe some kind of equals() performance optimization or something?
>>>>