performance and memory optimization of layouts

Thu Aug 13 17:22:34 UTC 2020

>
> The Optional suggestion was in the context of something that is most 
> likely done once per application start, not a continuous operation. I'm 
> not flip-flopping.
>
>
> I don't have a particularly positive opinion of Java's garbage 
> collectors in the context of desktop applications. I don't want to 
> get into a verbal fight or go off topic, but if you think continuously 
> expanding the heap size for no real reason, as ZGC or Shenandoah do, 
> instead of doing a GC is "smart", then we must have polar opposite 
> ideas as to what "smart" means. Even G1, as I've found out recently, 
> does things that are incredibly "smart" under some conditions and/or 
> JRE builds (server vs. client). I'd love to get answers as to why they 
> are so "smart", to be frank, but I don't know where to ask.
>
>
> (For desktop applications they really are terrible. I'm sorry if it 
> offends anyone, but they *really* are.)
>
>
> Whatever, off-topic. I'd rather not cross my fingers and hope that 
> things beyond my control magically work perfectly. Hopefully that's 
> somewhat understandable.

The real problem in this discussion is that you are failing to provide 
evidence for any of the big claims you are making here.

Without going off topic, do you have any evidence that GC is the culprit 
of your performance issue? If so, can you please share?

>
>
>>
>> Profilers like the one you are using are not always the best tool to 
>> measure performance; they are good at finding obvious issues (and in 
>> this case, perhaps, the repeated calls to .equals are such an issue), 
>> but the information they report should always be taken with a pinch 
>> of salt. It has happened to me time and again that I fixed what 
>> looked like an obvious performance pothole in JVisualVM, only to see 
>> that, after the fix, the numbers were unaffected (or not _as 
>> affected_ as the profiler was suggesting).
>
>
> What tools do JDK developers use then? How do you know code you write 
> hits every JVM optimization technique? How do you verify that you 
> actually hit those optimizations?

I find JMH and its profilers (if you are on Linux, run with the option 
`-prof perfasm`) to be more reliable when I really want to see what's 
happening with my code at the VM level. It also has a nice `-prof gc` 
mode, which shows how much "garbage" is generated and how much time is 
spent on GC. I've seen benchmarks generating gigabytes of stuff, and yet 
GC time was pegged at zero (because of what I told you before: if an 
object is truly garbage, as in created, accessed once and then destroyed 
immediately, in the same call stack, I'm very skeptical that it 
contributes to real performance issues).
>
>
>>
>> That said, stepping back, if you need performance to be truly great, 
>> you need to rethink the API to minimize the amount of guessing that 
>> goes on every time a native object is to be created. Going straight 
>> from layout to native object, which has been the approach you have 
>> been pursuing since the start, has the obvious issue that, in order 
>> to create a native object for a structured layout, you need to 
>> inspect the entire layout and "classify" it. While this is possible, 
>> performance is of course not going to be phenomenal.
>>
>> It seems to me that you need to separate the high-level API (native 
>> objects) more clearly from the low-level API (memory access), so 
>> that complex native objects can perhaps be constructed with builders 
>> (w/o guessing). Underneath, these objects will have some layout or 
>> segment associated with them, but that doesn't have to be the front 
>> door by which your objects are created.
>>
>> But (also IIRC), your API is intrinsically megamorphic - e.g. there's 
>> one common base class for all structs, and all accesses to fields 
>> happen by doing pseudo-reflective lookups on the layout object. This 
>> way, the code is almost guaranteed not to perform optimally; the 
>> sweet spot would be for each native struct object to have its own 
>> class with a static layout, as well as a set of accessor methods, 
>> where each accessor method boils down to a simple VarHandle call 
>> (with the VarHandles for the various fields also stored as constants 
>> in the class). But I don't think you are doing that, so I don't see 
>> how, even past the layout attribute/equals() issue that you have 
>> now, the access performance provided by your API can be considered 
>> acceptable (it might be in your particular use case, but it is 
>> certainly not the case in general).
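
As a rough illustration of the "one class per struct, VarHandles as 
constants" shape described above, here is a minimal sketch. It is not 
taken from either API discussed in this thread: the ProcessInfo class, 
its two fields and their offsets are made up for the example, and a 
direct ByteBuffer stands in for a memory segment so that the code runs 
on a stock JDK (9 or later) without the incubating foreign-memory 
module.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical struct wrapper: { int pid; /* 4 bytes padding */ long usedMemory; }
final class ProcessInfo {
    // The "layout" is static: size and field offsets are compile-time constants.
    static final int BYTE_SIZE = 16;
    private static final int PID_OFFSET = 0;
    private static final int USED_MEMORY_OFFSET = 8;

    // One VarHandle per field, held in static finals so the JIT can treat
    // them as constants. The coordinates are (ByteBuffer, int byteIndex).
    private static final VarHandle PID =
            MethodHandles.byteBufferViewVarHandle(int[].class, ByteOrder.nativeOrder());
    private static final VarHandle USED_MEMORY =
            MethodHandles.byteBufferViewVarHandle(long[].class, ByteOrder.nativeOrder());

    // Stand-in for the off-heap memory backing the struct (an assumption
    // of this sketch; a real binding would wrap a memory segment instead).
    private final ByteBuffer memory;

    private ProcessInfo(ByteBuffer memory) {
        this.memory = memory;
    }

    static ProcessInfo allocate() {
        return new ProcessInfo(ByteBuffer.allocateDirect(BYTE_SIZE));
    }

    // Each accessor is a single VarHandle call: no layout lookup, no equals().
    int pid() {
        return (int) PID.get(memory, PID_OFFSET);
    }

    void pid(int value) {
        PID.set(memory, PID_OFFSET, value);
    }

    long usedMemory() {
        return (long) USED_MEMORY.get(memory, USED_MEMORY_OFFSET);
    }

    void usedMemory(long value) {
        USED_MEMORY.set(memory, USED_MEMORY_OFFSET, value);
    }
}
```

Because every call site sees one concrete class and constant handles, 
there is nothing left to "classify" at access time; the cost of that 
design is one class per struct, which is the kind of code a generator 
would normally write for you.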
>
>
> I wouldn't ever recommend something like what I made for any 
> performance-critical use case, or claim it was ever good for it. Use 
> cases like that probably have situations where performance can and 
> should be improved on a case-by-case basis.
>
>
> That said, it doesn't mean things can't be optimized as much as 
> possible. I hope the logic of "<X> will never be as good as <Y> so why 
> bother trying?" isn't being used here. Many Java language features 
> would surely never be implemented if this was the mentality, yeah?
>
>
> I vaguely remember it being said that FMA is being built so that 
> abstraction layers such as mine are able to exist, *presumably* with 
> reasonable performance given the purpose of the abstraction layers. Is 
> this unreasonable or something? If there is a better way that I 
> haven't thought of, I'd love to hear it. I've got nothing.

There is a better way to do what you want to do: stop resisting and use 
jextract (which you categorically refuse to do, based on other unproven 
assumptions/claims) :-) :-) :-)

Rolling your own framework as your first move is almost never the way to 
go - one typically gets better at working at the meta level as the 
number of use cases grows - then you can decide what level of 
abstraction should be provided by your framework, etc. But, if you 
really want to go meta and write your own abstraction, you need to 
approach it the right way; my general sense is that in this thread you 
are complaining about _consequences_ (e.g. equals() being slow, layouts 
being sealed, ...) and not about the _root causes_ (e.g. API choices 
that are questionable, like using layouts as your currency for 
_semantics_). But again, without looking at the code, this is mostly 
speculation (I vaguely remember how your API looked when I had to debug 
some issue a few months ago).

Maurizio

>>>> On 13/08/2020 14:06, Ty Young wrote:
>>>>> Hi,
>>>>>
>>>>>
>>>>> I took a little time to look into optimizing the performance of my 
>>>>> abstraction layer as FMA hasn't changed in any radical, breaking 
>>>>> way and I'm happy with the overall design of my abstraction layer.
>>>>>
>>>>>
>>>>> In order to look into what could be optimized, I set the number of 
>>>>> worker threads in my JavaFX application to 1 so that Nvidia 
>>>>> attribute updates are done in a linear fashion, which makes it 
>>>>> easier to reason about how much of a performance impact any given 
>>>>> one has and why. I then used NetBeans' built-in profiler to view 
>>>>> where the CPU time was being spent. Runnables to be updated are 
>>>>> given to the worker thread pool every 500 ms.
>>>>>
>>>>>
>>>>> Unsurprisingly to me, besides the PCIe TX/RX attributes, which 
>>>>> are supposedly hung up within NVML itself, the attribute that 
>>>>> represents GPU processes is the worst by far (see img1). This 
>>>>> attribute is actually multiple native function calls jammed into 
>>>>> one attribute, all of which utilize arrays of structs.
>>>>>
>>>>>
>>>>> Viewing the call tree (see img2) shows that a major contributor to 
>>>>> this is ValueLayout.equals(), but there is some self-time in the 
>>>>> upper NativeObject.getNativeObject() and 
>>>>> NativeValue.ofUnsafeValueeLayout calls as well. 
>>>>> ValueLayout.equals() is used in an if-else chain because you need 
>>>>> to know which NativeValue implementation should be returned. If 
>>>>> the layout is an integer then return NativeInteger, for example. 
>>>>> It may be possible to order this if-else chain so that it returns 
>>>>> faster without hitting every else-if (e.g. bytes first, then 
>>>>> integers, then longs, etc.), but that's always going to be a 
>>>>> presumptuous, arbitrary order that may not actually be faster in 
>>>>> some situations.
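
One way to avoid depending on the order of such an if-else chain is to 
replace it with a single map lookup. The sketch below only illustrates 
that idea and is not code from the abstraction layer or from the 
foreign-memory API: LayoutKey is a hypothetical stand-in for a value 
layout, reduced to a kind and a bit size, and the factory returns class 
names as strings where a real implementation would construct 
NativeByte, NativeInteger and so on.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for a value layout: just a kind and a bit size.
final class LayoutKey {
    enum Kind { INTEGRAL, FLOATING, POINTER }

    final Kind kind;
    final int bitSize;

    LayoutKey(Kind kind, int bitSize) {
        this.kind = kind;
        this.bitSize = bitSize;
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof LayoutKey)) {
            return false;
        }
        LayoutKey that = (LayoutKey) other;
        return kind == that.kind && bitSize == that.bitSize;
    }

    @Override
    public int hashCode() {
        return 31 * kind.ordinal() + bitSize;
    }
}

final class NativeValueFactory {
    // Built once; every later lookup costs one hashCode() and (normally)
    // one equals(), no matter how many value kinds are registered.
    private static final Map<LayoutKey, Supplier<String>> FACTORIES = new HashMap<>();

    static {
        FACTORIES.put(new LayoutKey(LayoutKey.Kind.INTEGRAL, 8), () -> "NativeByte");
        FACTORIES.put(new LayoutKey(LayoutKey.Kind.INTEGRAL, 32), () -> "NativeInteger");
        FACTORIES.put(new LayoutKey(LayoutKey.Kind.INTEGRAL, 64), () -> "NativeLong");
        FACTORIES.put(new LayoutKey(LayoutKey.Kind.FLOATING, 64), () -> "NativeDouble");
        FACTORIES.put(new LayoutKey(LayoutKey.Kind.POINTER, 64), () -> "NativePointer");
    }

    private NativeValueFactory() {
    }

    // Returns the name of the wrapper a real implementation would create.
    static String create(LayoutKey key) {
        Supplier<String> factory = FACTORIES.get(key);
        if (factory == null) {
            throw new IllegalArgumentException(
                    "Unsupported layout: " + key.kind + "/" + key.bitSize);
        }
        return factory.get();
    }
}
```

Whether this is actually faster than a short, well-ordered chain is 
something only a measurement can tell; the lookup still pays for one 
hash and one comparison, so a slow equals() is reduced, not removed.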
>>>>>
>>>>>
>>>>> What could be done to improve this? I can't think of any absolute 
>>>>> fixes but an improvement would be to extend the ValueLayout so 
>>>>> that you have a NumberLayout and a PointerLayout. You could then 
>>>>> use instanceof to presumably filter things faster and more cheaply 
>>>>> so that the mentioned else-if chain does not need to check for a 
>>>>> pointer layout. The PointerLayout specific checks could be moved 
>>>>> to its own static method. It's a small change, but it's presumably 
>>>>> an improvement even if small.
>>>>>
>>>>>
>>>>> Unfortunately I can't do this myself because of sealed types so 
>>>>> here I am.
>>>>>
>>>>>
>>>>> Another thing that needs optimizing is the memory allocation waste 
>>>>> of getting an attribute. Every call to attribute(String name) 
>>>>> allocates a new Optional instance, which was often used by my 
>>>>> abstraction for a check and then immediately discarded. I wanted 
>>>>> to do a bunch of layout checks to make sure that the MemoryLayout 
>>>>> is valid, but after seeing the amount of garbage being generated 
>>>>> standing out like a sore thumb, I decided to remove those checks 
>>>>> (they are really important too). The amount of memory wasted 
>>>>> wasn't worth it. The answer to this is presumably going to be 
>>>>> value types, but it isn't clear when they are going to be delivered.
>>>>>
>>>>>
>>>>> Once again, if MemoryLayout and its extensions weren't sealed I 
>>>>> could do things to improve both performance and memory waste as 
>>>>> well as fix other issues, like attributes being factored into 
>>>>> equality checks when that isn't wanted. Yes, I realize I'm beating a 
>>>>> dead horse at this point but that dead horse is still causing issues.
>>>>>
>>>>>
>>>>> Could the suggested ValueLayout changes be done, at the very 
>>>>> least? Or maybe some kind of equals() performance optimizations or 
>>>>> something?