performance and memory optimization of layouts
Ty Young
youngty1997 at gmail.com
Thu Aug 13 21:45:49 UTC 2020
On 8/13/20 12:22 PM, Maurizio Cimadamore wrote:
>
>>
>> The Optional suggestion was in the context of something that is most
>> likely done once per application start, not a continuous operation.
>> I'm not flip-flopping.
>>
>>
>> I don't have a particularly positive opinion of Java's garbage
>> collectors in the context of desktop applications. I don't want to
>> get into a verbal fight or go off topic, but if you think
>> continuously expanding the heap size for no real reason, as ZGC or
>> Shenandoah do instead of doing a GC, is smart, then we must have
>> polar opposite ideas as to what "smart" means. Even G1, as I've
>> found out recently, does things that are incredibly "smart" under
>> some conditions and/or JRE builds (server vs. client). I'd love to
>> get answers as to why they are so "smart", to be frank, but I don't
>> know where to ask.
>>
>>
>> (For desktop applications they really are terrible. I'm sorry if it
>> offends anyone, but they *really* are.)
>>
>>
>> Whatever, off-topic. I'd rather not cross my fingers and hope that
>> things beyond my control magically work perfectly. Hopefully that's
>> somewhat understandable.
>
> The real problem in this discussion is that you are failing to provide
> evidence to any of the big claims you are making here.
>
> Without going off topic, do you have any evidence that GC is the
> culprit of your performance issue? If so, can you please share?
I don't know what you're referring to specifically. I didn't say GC *is*
a performance problem, but rather that *it could be* one. I'm viewing
NetBeans' object profiler and it's telling me a few megabytes worth of
Optional objects are being allocated, only to be garbage collected.
Turns out I didn't delete *all* of the layout attribute checks (thought
I did, *sigh*). I'm using Optional elsewhere too, so it's hard to see
where every object type is being allocated from.
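For context, the allocation pattern the profiler was flagging looks roughly like this (hypothetical names, not the real FMA API): an Optional-returning attribute lookup called on every access, where resolving it once and caching the nullable result avoids the per-call garbage.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for a layout with named attributes. The real
// MemoryLayout.attribute(String) in the incubator API also returns Optional.
class Layout {
    private final Map<String, Object> attributes;
    Layout(Map<String, Object> attributes) { this.attributes = attributes; }

    // Allocates a fresh Optional on every call - fine once at startup,
    // wasteful when called on every attribute update.
    Optional<Object> attribute(String name) {
        return Optional.ofNullable(attributes.get(name));
    }
}

// Resolve the lookup once and keep the nullable result, so the Optional is
// allocated a single time instead of per access.
class CachedAttribute {
    private final Object value;

    CachedAttribute(Layout layout, String name) {
        this.value = layout.attribute(name).orElse(null);
    }

    boolean isPresent() { return value != null; }
    Object get() { return value; }
}
```

This is just the shape of the problem, not my actual code; the point is that the Optional churn comes from calling the lookup in a hot path rather than from the lookup existing at all.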
>
>
>>
>>
>>>
>>> Profilers like the one you are using are not always the best tool to
>>> measure performances; they are good at finding obvious issues (and
>>> in this case, perhaps, the repeated call to .equals are such an
>>> issue), but the information they report should always be taken with
>>> a pinch of salt. It happened to me time and again to fix what looked
>>> like an obvious performance pothole in JVisualVM just to see that,
>>> after the fix, the numbers were unaffected (or not _as affected_ as
>>> the profiler was suggesting).
>>
>>
>> What tools do JDK developers use then? How do you know code you write
>> hits every JVM optimization technique? How do you verify that you
>> actually hit those optimizations?
> I find JMH and its profiler (if you are on linux, run with option
> -prof perfasm) to be more reliable, when I really want to see what's
> happening with my code at the VM level. It also has a nice `-prof gc`
> mode, which shows how much "garbage" is generated, and how much time
> is spent on GC. I've seen benchmarks generating gigabytes of stuff,
> and yet GC time was pegged at zero (because of what I told you before
> - if an object is truly garbage, as in created, accessed once, and
> then destroyed immediately in the same callstack, I'm very skeptical
> that this contributes to real performance issues).
Thanks. If NetBeans' profiler really is that inaccurate then I'll give
that a shot.
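In the meantime, a rough way to sanity-check the "short-lived garbage is cheap" claim without JMH is to read the JVM's own GC counters around an allocation-heavy loop. This is a sketch using the plain JDK management beans (the class name and loop count are mine, not from any benchmark):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Quick-and-dirty check: allocate a pile of short-lived objects and see how
// much time the collectors actually report spending. Not a substitute for
// JMH's -prof gc, just a sanity check.
public class GcCost {

    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += gc.getCollectionTime(); // cumulative ms spent in this collector
        }
        return total;
    }

    public static void main(String[] args) {
        long before = totalGcTimeMillis();
        long sum = 0;
        for (long i = 0; i < 10_000_000; i++) {
            // Boxing past the Long cache allocates; each box is garbage immediately.
            sum += Long.valueOf(i).hashCode();
        }
        long after = totalGcTimeMillis();
        System.out.println("checksum = " + sum);
        System.out.println("reported GC time: " + (after - before) + " ms");
    }
}
```

If the reported delta stays near zero even for millions of boxed allocations, that supports the point about truly short-lived garbage being cheap; JMH's -prof gc is still the more trustworthy tool.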
>>
>>
>>>
>>> That said, stepping back, if you need performances to be truly
>>> great, you need to rethink the API to minimize the amount of
>>> guessing that goes on every time a native object is to be created.
>>> Going straight from layout to native object, which has been the
>>> approach you have been pursuing since the start, has the obvious
>>> issue that, in order to create a native object for a structured
>>> layout, you need to inspect the entire layout and "classify" it.
>>> While this is possible, of course performances aren't going to be
>>> phenomenal.
>>>
>>> It seems to me that you need to separate more the high level API
>>> (native objects) from the low level API (memory access), so that
>>> maybe complex native objects can be constructed with builders (w/o
>>> guessing). Underneath, these objects will have some layouts or
>>> segment associated, but that doesn't have to be the front door by
>>> which your objects are created.
>>>
>>> But (also IIRC), your API is intrinsically megamorphic - e.g.
>>> there's one common base class for all structs, and all accesses to
>>> fields happen by doing pseudo-reflective lookups on the layout
>>> object. This way, the code is almost guaranteed not to perform
>>> optimally; the best sweet spot would be for each native struct
>>> object to have its own class, and have a static layout, as well as a
>>> set of accessor methods, where each accessor method boils down to a
>>> simple VarHandle call (where the VarHandle for the various fields
>>> are also stored as constants in the class). But I don't think you
>>> are doing that, so I don't see how, even past the layout
>>> attribute/equals() issue that you have now, the access performances
>>> provided by your API can be considered acceptable (it might be in
>>> your particular use case, but it is certainly not the case in general).
>>
>>
>> I wouldn't ever recommend something like I made for any performance
>> critical use case or claim it was ever good for it. Use cases like
>> that probably have situations where performance can and should be
>> improved on a case by case bases.
>>
>>
>> That said, it doesn't mean things can't be optimized as much as
>> possible. I hope the logic of "<X> will never be as good as <Y> so
>> why bother trying?" isn't being used here. Many Java language
>> features would surely never be implemented if this was the mentality,
>> yeah?
>>
>>
>> I vaguely remember it being said that FMA is being built so that
>> abstraction layers such as mine are able to exist, *presumably* with
>> reasonable performance gjven the purpose of the abstraction layers.
>> Is this unreasonable or something? If there is a better way that I
>> haven't thought of I'd love to hear it. I got nothing.
>
> There is a better way to do what you want to do: stop resisting and
> use jextract (which you categorically refuse to do, based on other
> unproven assumptions/claims) :-) :-) :-)
>
> Rolling your own framework as your first move is almost never the way
> to go - one typically gets better at going at the meta level as the
> number of use cases grow - then you can decide what level of
> abstraction should be provided by your framework, etc. But, if you
> really want to go meta, and write your own abstraction, you need to
> approach it the right way; my general sense is that in this thread you
> are complaining about _consequences_ (e.g. equals() being slow,
> layouts are sealed, ...) and not about the _root causes_ (e.g. API
> choices that are questionable, like using layout as your currency for
> _semantics_). But again, without looking at the code, this is mostly
> speculation (I vaguely remember how your API looked when I had to
> debug some issue few months ago).
I'm sorry, but no. I 100% understand that the approach I take has *many*
flaws and that I am wrong in some cases, but I refuse to accept the
idea, or even the implication, that jextract/raw FMA is somehow perfect,
that my abstraction layer is terrible, and/or that jextract will fix all
my problems. The bindings jextract creates are *unsafe*. You have *zero*
checks to validate that the incoming MemoryAddress is of the correct
size, alignment, signedness, etc. As far as I can tell, you're instead
crossing your fingers and hoping that people will somehow associate a
non-helpfully named file, the generated *constants* class file (not
*layouts*), as the provider of the appropriate ValueLayouts. Here is a
generated NVML function binding from jextract:
public static MethodHandle nvmlDeviceGetCount_v2$MH() {
    return nvml_h$constants.nvmlDeviceGetCount_v2$MH();
}

public static int nvmlDeviceGetCount_v2(jdk.incubator.foreign.MemoryAddress deviceCount) {
    try {
        return (int) nvml_h$constants.nvmlDeviceGetCount_v2$MH().invokeExact(deviceCount);
    } catch (Throwable ex) {
        throw new AssertionError(ex);
    }
}
(The JDK version I'm using is a bit old, but nothing that matters here
has changed, AFAIK.)
So you have an overloaded object type (MemoryAddress) with zero checks
and no documentation as to where the appropriate
MemoryLayout (ValueLayout, actually) needed to create it is to be found.
If this function accepted an enum value, you'd be forced, unless
something's changed, to plug in numbers that aren't understandable
without documentation. You're already required to do that for the return
value.
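For contrast, here is a rough sketch of the kind of typed handle my abstraction is built around (hypothetical names, plain Java rather than the incubator API): the raw address is only reachable through a type that records what it points at, so a mismatch fails loudly at the binding boundary instead of silently corrupting memory.

```java
// Hypothetical stand-in for jdk.incubator.foreign.MemoryAddress: a bare
// address with no information about what it points at.
final class RawAddress {
    final long value;
    RawAddress(long value) { this.value = value; }
}

// A handle that remembers the kind of value behind the address, so a binding
// can validate its input instead of blindly invoking the native function.
final class TypedAddress {
    enum Kind { INT32, INT64 }

    private final RawAddress address;
    private final Kind kind;

    private TypedAddress(RawAddress address, Kind kind) {
        this.address = address;
        this.kind = kind;
    }

    static TypedAddress ofInt32(RawAddress a) { return new TypedAddress(a, Kind.INT32); }
    static TypedAddress ofInt64(RawAddress a) { return new TypedAddress(a, Kind.INT64); }

    // A binding expecting, say, an unsigned int* checks the kind up front.
    RawAddress requireKind(Kind expected) {
        if (kind != expected) {
            throw new IllegalArgumentException("expected " + expected + " but got " + kind);
        }
        return address;
    }
}
```

The real interfaces carry more than a two-value enum, obviously; the sketch is only meant to show why a constrained input type catches mistakes that a bare MemoryAddress cannot.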
Sure, my bindings using my abstraction layer still have issues, like
telling whether a NativeValue<Long> is signed, but that's a *huge*
reduction in the space of possible inputs compared to a MemoryAddress.
Without creating a new implementation that violates the point of the
interface, you cannot, at the very least, plug in a NativeStruct or a
NativeArray abstraction. Some of these issues aren't even my fault.
And these bindings are tied to the platform they are generated for, even
if the library is cross-platform. Cross-platform JavaFX applications
using jextract, like mine, are, at the very least, more work to make
than should be required. There is good reason for this, I know, but if
the solution is to create a plugin because of a minor disagreement then
I'll just not use it to begin with.
The very idea that jextract is somehow this perfect solution that'll fix
my problem(s) is a red herring. You want to talk about root causes?
Fine, let's talk about the root cause of why the abstraction exists in
the first place and is still being used: jextract wasn't in a state to
do what I wanted it to do at the time, in the way I wanted, and it still
isn't*. Going by everything said, it never will be either, so the only
other alternative is to waste time creating a plugin that just does what
I could do by hand, even if doing it by hand is risky. If jextract had
the ability to spit out layout information for structs, making bindings
by hand would be less risky, but it looks like that too must be done via
a plugin.
*referring to method naming and forcing API users to go through the
Stream API to access attribute names for a given layout, in addition to
everything else mentioned.
jextract couldn't even handle nested structs/unions until recently, on
top of the issues mentioned above, and you want me to abandon my
abstraction layer, which did and still can do things jextract couldn't?
I understand that this is a process and things aren't even close to
being finished, but I'm also being told to use it despite those facts.
I'm not trying to sound overly negative or ungrateful (I very much am
grateful), but this is a red herring.
>
> Maurizio
>
>>
>>
>>
>>>
>>> Maurizio
>>>
>>>
>>>>
>>>>
>>>>
>>>>
>>>>>
>>>>> Maurizio
>>>>>
>>>>> On 13/08/2020 14:06, Ty Young wrote:
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>> I took a little time to look into optimizing the performance of
>>>>>> my abstraction layer as FMA hasn't changed in any radical,
>>>>>> breaking way and I'm happy with the overall design of my
>>>>>> abstraction layer.
>>>>>>
>>>>>>
>>>>>> In order to look into what could be optimized, I set the number
>>>>>> of worker threads in my JavaFX application to 1 so that Nvidia
>>>>>> attribute updates are done in a linear fashion and can be more
>>>>>> easily reasoned about in terms of how much of a performance
>>>>>> impact any given one has and why. I then used NetBeans' built-in
>>>>>> profiler to see where the CPU time was being spent. Runnables to
>>>>>> be updated are given to the worker thread pool every 500 ms.
>>>>>>
>>>>>>
>>>>>> Unsurprisingly to me, besides PCIe TX/RX attributes which
>>>>>> supposedly are hung up within NVML itself, the attribute that
>>>>>> represents GPU processes is by far the worst (see img1). This
>>>>>> attribute is actually multiple native function calls jammed into
>>>>>> one attribute, all of which utilize arrays of structs.
>>>>>>
>>>>>>
>>>>>> Viewing the call tree (see img2) shows that a major contributor
>>>>>> to this is ValueLayout.equals(), but there is some self-time in
>>>>>> the upper NativeObject.getNativeObject() and
>>>>>> NativeValue.ofUnsafeValueLayout calls as well.
>>>>>> ValueLayout.equals() is used in an if-else chain because you need
>>>>>> to know which NativeValue implementation should be returned. If
>>>>>> the layout is an integer then return NativeInteger, for example.
>>>>>> It may be possible to order this if-else chain in a way that
>>>>>> returns results faster without hitting every else-if (e.g. bytes
>>>>>> first, then integers, then longs, etc.) but that's always going
>>>>>> to be a presumptuous, arbitrary order that may not actually be
>>>>>> faster in some situations.
>>>>>>
>>>>>>
>>>>>> What could be done to improve this? I can't think of any absolute
>>>>>> fixes but an improvement would be to extend the ValueLayout so
>>>>>> that you have a NumberLayout and a PointerLayout. You could then
>>>>>> use instanceof to presumably filter things faster and more
>>>>>> cheaply so that the mentioned else-if chain does not need to
>>>>>> check for a pointer layout. The PointerLayout specific checks
>>>>>> could be moved to its own static method. It's a small change,
>>>>>> but presumably still an improvement.
>>>>>>
>>>>>>
>>>>>> Unfortunately I can't do this myself because of sealed types so
>>>>>> here I am.
>>>>>>
>>>>>>
>>>>>> Another thing that needs optimizing is the memory allocation
>>>>>> waste of getting an attribute. Every call to attribute(String
>>>>>> name) allocated a new Optional instance, which was often used by
>>>>>> my abstraction for a check and then immediately discarded. I
>>>>>> wanted to do a bunch of layout checks to make sure that the
>>>>>> MemoryLayout is valid, but after seeing the amount of garbage
>>>>>> being generated stand out like a sore thumb, I decided to remove
>>>>>> those checks (and they are really important too). The amount
>>>>>> of memory wasted wasn't worth it. The answer to this is
>>>>>> presumably going to be value types, but it isn't clear when it's
>>>>>> going to be delivered.
>>>>>>
>>>>>>
>>>>>> Once again, if MemoryLayout and its extensions weren't sealed I
>>>>>> could do things to improve both performance and memory waste as
>>>>>> well as fix other issues, like attributes being factored into
>>>>>> equality checks when that isn't wanted. Yes, I realize I'm beating
>>>>>> a dead horse at this point but that dead horse is still causing
>>>>>> issues.
>>>>>>
>>>>>>
>>>>>> Could the suggested ValueLayout changes be done, at the very
>>>>>> least? Or maybe some kind of equals() performance optimization
>>>>>> or something?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>