Valhalla, startup, performance of interpreter, and vwithfield

Mon May 13 21:26:11 UTC 2019

On 5/13/19 7:00 AM, Brian Goetz wrote:
> This is good news.  I want to ask further about the numbers you cite here.  You compare value creation to classic object creation, but obviously we want value creation to be faster.

In the interpreter? I am afraid that value creation in the interpreter
can't be faster than classic object creation. Interpreting value types
is still slower than interpreting equivalent classic objects, but the
difference has been reduced drastically. Also, I didn't find any
scenario where interpreter performance has a significant impact on
startup time. The first execution, which includes class loading,
verification, etc., is about 500 times slower than subsequent
executions in the interpreter (for both classic objects and value types).

>   When you say it is comparable to classic object creation costs, I assume that you are not including the allocation cost, and comparing only the field write costs?

No. It includes allocation cost. Don't forget - I am talking about the 
interpreter performance. Here is some decomposition.

1. Classic object creation: ~230ns (500 cycles) for the whole object
creation. It can be split into ~200ns (440 cycles) for object allocation
and ~30ns (60 cycles) for field initialization.

2. Value type creation: any single vdefault or vwithfield operation
costs ~200ns (440 cycles). That is on par with (even slightly better
than) full classic object creation. This looks normal, because a single
vdefault or vwithfield operation "creates" an object (or something
similar to one). Of course, the more fields a value has, the more it
costs in the interpreter to assemble the full object.
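
The interpreter decomposition above can be sketched as a small calculation. This is only an illustration, not a measurement: the class and method names are hypothetical, the 2.2GHz clock is the one mentioned in the quoted earlier message, and the linear model (one vdefault plus one vwithfield per field, each at ~200ns) is an assumption layered on top of the figures in this message.

```java
public class InterpreterCostSketch {
    // Approximate interpreter figures quoted in this message.
    static final double CLOCK_GHZ = 2.2;              // from the quoted earlier message
    static final double CLASSIC_ALLOC_NS = 200.0;     // classic object allocation
    static final double CLASSIC_FIELD_INIT_NS = 30.0; // classic field initialization
    static final double VALUE_OP_NS = 200.0;          // one vdefault or one vwithfield

    // Convert nanoseconds to CPU cycles at the assumed clock frequency.
    static long toCycles(double ns) {
        return Math.round(ns * CLOCK_GHZ);
    }

    // Whole classic object creation = allocation + field initialization.
    static double classicCreationNs() {
        return CLASSIC_ALLOC_NS + CLASSIC_FIELD_INIT_NS;
    }

    // ASSUMPTION: building a value takes one vdefault plus one vwithfield
    // per field, and the per-operation costs simply add up.
    static double valueCreationNs(int fields) {
        return VALUE_OP_NS * (1 + fields);
    }

    public static void main(String[] args) {
        System.out.printf("classic: %.0f ns (%d cycles)%n",
                classicCreationNs(), toCycles(classicCreationNs()));
        for (int fields = 1; fields <= 4; fields++) {
            System.out.printf("value, %d field(s): %.0f ns (%d cycles)%n",
                    fields, valueCreationNs(fields), toCycles(valueCreationNs(fields)));
        }
    }
}
```

Under this assumed model the interpreter cost of a value grows with the number of fields, while every individual operation stays close to the cost of one classic object creation.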

As for compiled code - after C2 we have the following numbers:

e.g. (two-field class)

1. Classic object creation: 14.9ns (total cost) (G1GC)

1.1 Classic object creation - field write cost only: 0.99ns

2. Value type (full creation): 0.97ns   (slightly better than just the
field write cost for a classic object).

Note: all examples here were measured with all data fitting perfectly
into CPU caches, even for classic objects. All value type benefits due
to better cache locality were intentionally excluded.
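
To put the C2 numbers side by side, here is a short arithmetic sketch. It only recombines the three figures listed above for the two-field class; the class and method names are hypothetical, and the derived values are not separate measurements.

```java
public class CompiledCostSketch {
    // C2 figures quoted above for a two-field class (G1GC, cache-resident data).
    static final double CLASSIC_TOTAL_NS = 14.9;        // classic object, total cost
    static final double CLASSIC_FIELD_WRITES_NS = 0.99; // classic object, field writes only
    static final double VALUE_TOTAL_NS = 0.97;          // value type, full creation

    // Part of the classic creation cost that is not field writes
    // (allocation and everything that comes with it).
    static double classicNonWriteNs() {
        return CLASSIC_TOTAL_NS - CLASSIC_FIELD_WRITES_NS;
    }

    // How many times cheaper full value creation is than full classic creation.
    static double valueSpeedup() {
        return CLASSIC_TOTAL_NS / VALUE_TOTAL_NS;
    }

    public static void main(String[] args) {
        System.out.printf("classic, excluding field writes: %.2f ns%n", classicNonWriteNs());
        System.out.printf("value creation vs classic creation: %.1fx cheaper%n", valueSpeedup());
    }
}
```

In other words, once C2 has compiled the code, almost all of the classic creation cost sits outside the field writes, and full value creation costs about as much as the field writes alone.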

>> I did a quick evaluation of startup and interpreter performance cost. I have to take back my words that "vwithfield is a major contributor to the interpreter speed and merged (or fused) vwithfield could improve interpreter performance". It was quite a long time ago when I last looked into the interpreter's performance. I have to say that a huge amount of work has been done on the interpreter since then, and now I don't consider the interpreter's performance an issue. As for vwithfield, the cost of a single vwithfield (in the interpreter) is now approximately 200ns (at 2.2GHz). It is neither a big nor a small value. If we compare the cost of value creation with the cost of creating a similar classic Java object (simple writes), then a single vwithfield costs ~7%-10% of the whole object creation. So I am guessing that if you have a value with 10 fields (and 10 vwithfield operations), you may double the value creation cost, but it will have a minor impact on the whole execution.
>>
>> Also I have to say that if we look at startup for the first execution of code, the interpreter takes less than 1%. All other actions (class loading, verification, etc.) take much more time. As for "time to performance", I haven't evaluated it yet. The interpreter's impact could be higher in that case. At the same time, a working TieredCompilation will improve "time to performance" much more than any interpreter tuning.