Valhalla, startup, performance of interpreter, and vwithfield
karen.kinnear at oracle.com
Wed May 15 13:49:11 UTC 2019
I discussed this with Frederic, and between MVT and LW1 he had improved the interpreter overhead of the withfield bytecode.
He pointed out that the measurement we are looking for is slightly different - at least than my understanding of what
The question is about the cost of inline class creation vs. identity class creation, not a single operation default or withfield.
So - could you take a small inline class - say one with 4 fields each containing an int - and compare
the cost of a method that constructs the inline class using 4 withfields, vs. the cost of a constructor
for a comparable identity class?
The theory is that for the inline class, there would be 4 withfields, each with an allocation step
(for the interpreter, and possibly for C1). So the cost of construction would be much higher than the
equivalent identity class constructor. For those not in the nest, there would be the need to call the method
that creates the inline class; whereas the identity class could be created by anyone - so my mental model
is that both examples would have a call overhead in them.
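To put rough numbers on that theory, here is a back-of-the-envelope sketch using the interpreter figures from Sergey's quoted message below (~200ns per single vdefault or vwithfield, ~230ns for a whole classic object creation, 2.2GHz clock). The class and method names are made up for illustration, and the model assumes the per-operation costs simply add up (one vdefault plus one withfield per field), which is an assumption to be checked, not a measurement.

```java
public class InterpreterCostModel {
    // Figures taken from the quoted message; treat them as approximate.
    static final double CLOCK_GHZ = 2.2;
    static final double CLASSIC_CREATION_NS = 230.0; // allocation + field init
    static final double SINGLE_VALUE_OP_NS = 200.0;  // one vdefault or one vwithfield

    // ASSUMPTION: costs are additive, i.e. one vdefault followed by
    // one withfield per field, each paying the full per-operation cost.
    static double inlineCreationNs(int fields) {
        return (1 + fields) * SINGLE_VALUE_OP_NS;
    }

    // Convert nanoseconds to (rounded) CPU cycles at the assumed clock.
    static long toCycles(double ns) {
        return Math.round(ns * CLOCK_GHZ);
    }

    public static void main(String[] args) {
        int fields = 4;
        double inlineNs = inlineCreationNs(fields);
        System.out.printf("classic object: %.0f ns (%d cycles)%n",
                CLASSIC_CREATION_NS, toCycles(CLASSIC_CREATION_NS));
        System.out.printf("inline, %d fields: %.0f ns (%d cycles)%n",
                fields, inlineNs, toCycles(inlineNs));
        System.out.printf("ratio: %.2fx%n", inlineNs / CLASSIC_CREATION_NS);
    }
}
```

Under that additive assumption a 4-field inline class would come out at roughly four times the classic constructor in the interpreter, which is why the whole-object comparison matters more than the cost of a single withfield.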
Does that make sense to you?
Would that be something you could measure?
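As a sketch of the shape of the comparison, something like the following could serve as a starting point. It is plain Java, so it only emulates the inline class: EmulatedInlineQuad and its withA..withD methods are hypothetical stand-ins in which each "withfield" step allocates a fresh copy, which is the behavior the theory above attributes to the interpreter. A real measurement would use an actual inline class on a Valhalla build and a proper harness such as JMH; the nanoTime loop here is only illustrative.

```java
public class CreationCostSketch {

    // Ordinary identity class: one allocation, then four field writes.
    static final class IdentityQuad {
        final int a, b, c, d;
        IdentityQuad(int a, int b, int c, int d) {
            this.a = a; this.b = b; this.c = c; this.d = d;
        }
    }

    // HYPOTHETICAL stand-in for an inline class: every withX allocates a
    // new instance, mimicking a withfield that performs an allocation step.
    static final class EmulatedInlineQuad {
        final int a, b, c, d;
        private EmulatedInlineQuad(int a, int b, int c, int d) {
            this.a = a; this.b = b; this.c = c; this.d = d;
        }
        static EmulatedInlineQuad defaultValue() { return new EmulatedInlineQuad(0, 0, 0, 0); }
        EmulatedInlineQuad withA(int v) { return new EmulatedInlineQuad(v, b, c, d); }
        EmulatedInlineQuad withB(int v) { return new EmulatedInlineQuad(a, v, c, d); }
        EmulatedInlineQuad withC(int v) { return new EmulatedInlineQuad(a, b, v, d); }
        EmulatedInlineQuad withD(int v) { return new EmulatedInlineQuad(a, b, c, v); }

        // Factory: one default value followed by four withfield-like steps.
        static EmulatedInlineQuad make(int a, int b, int c, int d) {
            return defaultValue().withA(a).withB(b).withC(c).withD(d);
        }
    }

    // Factory for the identity class, so both paths include one call.
    static IdentityQuad makeIdentity(int a, int b, int c, int d) {
        return new IdentityQuad(a, b, c, d);
    }

    static long sink; // results are consumed so the loops are not trivially dead

    static double nsPerIdentity(int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            IdentityQuad q = makeIdentity(i, i + 1, i + 2, i + 3);
            sink += q.a + q.d;
        }
        return (System.nanoTime() - start) / (double) iterations;
    }

    static double nsPerEmulatedInline(int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            EmulatedInlineQuad q = EmulatedInlineQuad.make(i, i + 1, i + 2, i + 3);
            sink += q.a + q.d;
        }
        return (System.nanoTime() - start) / (double) iterations;
    }

    public static void main(String[] args) {
        int iterations = 200_000;
        System.out.printf("identity constructor:      %.1f ns/op%n", nsPerIdentity(iterations));
        System.out.printf("emulated 4x withfield:     %.1f ns/op%n", nsPerEmulatedInline(iterations));
    }
}
```

Running it with -Xint keeps execution in the interpreter, which is the case under discussion; without that flag the JIT will likely remove most of the difference.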
I think we have alternative approaches which would not require each field setting to perform an allocation step.
> On May 13, 2019, at 5:26 PM, Sergey Kuksenko <sergey.kuksenko at oracle.com> wrote:
> On 5/13/19 7:00 AM, Brian Goetz wrote:
>> This is good news. I want to ask further about the numbers you cite here. You compare value creation to classic object creation, but obviously we want value creation to be faster.
> In the interpreter? I am afraid that value creation in the interpreter can't be faster than classic object creation. Interpreting value types is still slower than interpreting equivalent classic objects, but the difference has been reduced drastically. Also, I didn't find any scenario where interpreter performance has a significant impact on startup time. The first execution, which implies class loading, verification, etc., is 500x slower than subsequent executions in the interpreter (for both classic objects and value types).
>> When you say it is comparable to classic object creation costs, I assume that you are not including the allocation cost, and comparing only the field write costs?
> No. It includes allocation cost. Don't forget - I am talking about the interpreter performance. Here is some decomposition.
> 1. Classic object creation: ~230ns (500 cycles) for the whole object creation. It can be split into ~200ns (440 cycles) for object allocation and ~30ns (60 cycles) for field initialization.
> 2. Value type creation. Any single vdefault or vwithfield operation costs ~200ns (440 cycles). That is on par with (even slightly better than) full object creation. And it looks normal, because a single vdefault or vwithfield operation "creates" an object (or something similar to it). Of course, the more fields we have, the higher the cost in the interpreter to assemble the full object.
> As for compiled code - after C2 we have the following numbers:
> e.g. (two-field class)
> 1. Classic object creation: 14.9ns (total cost) (G1GC)
> 1.1 Classic object creation - only fields write cost: 0.99ns
> 2. Value type (full creation): 0.97ns (slightly better than just fields write cost in case of classic object).
> Note: all examples here were measured with all data fitting perfectly into CPU caches, even for classic objects. All value type benefits due to better cache locality were intentionally excluded.
>>> I did a quick evaluation of startup and interpreter performance cost. I have to take back my words that "vwithfield is major contributor to the interpreter speed and merged (or fused) vwithfield could improve interpreter performance". It was quite a long time ago that I last looked into the interpreter's performance. I have to say that a huge amount of work has been done on the interpreter since then, and I no longer consider the interpreter's performance an issue. As for vwithfield, the cost of a single vwithfield (in the interpreter) is now approximately 200ns (at 2.2GHz). It is neither a big nor a small value. If we compare the cost of value creation with the cost of creating a similar classic Java object (simple writes), then a single vwithfield costs ~7%-10% of the whole object creation. So I am guessing that if you have a value with 10 fields (and 10 vwithfield operations), you may double the value creation cost, but it will have a minor impact on the whole execution.
>>> Also, I have to say that if we look at startup for the first execution of code, the interpreter takes less than 1%. All other actions (classloading, verification, etc.) take much more time. As for "time to performance" - I haven't evaluated it yet. The interpreter's impact could be higher in that case. At the same time, a working TieredCompilation will improve "time to performance" much more than any interpreter tuning.