Profiling
Edward Nevill
ed at camswl.com
Thu Feb 19 16:51:05 PST 2009
Hi folks,
I have been doing some profing on the vm using gprof with some interesting
results. To do this I set LINK_INTO=AOUT to generate a static 'gamma' launcher
with the appropriate -pg flags set.
However, I have not been able to do this for the other libraries (eg libjava.so)
only for libjvm.so so we do not get any profiling information on the native
libraries which is what I really wanted.
The problem is that unlike libjvm.so libjava.so and friends are not linked in,
instead they are opened and referenced using dll_open and friends. So I cannot
just statically link in libjava.so.
Does anyone know of a way to get gprof information out of libjava.so.
Anyway, on to the results I got.
The benchmarks I used were CaffeineMark, EEMBC and ThinkFreeOffic (a pure Java
office suite). I ran each benchmark with the original build, the split
interpreter loop (in C), and the split interpreter loop with the split
recoded in ARM asm.
The complete results are available here
http://camswl.com/openjdk/profiles/flat_ecm_original
http://camswl.com/openjdk/profiles/flat_ecm_opt
http://camswl.com/openjdk/profiles/flat_ecm_asm
http://camswl.com/openjdk/profiles/flat_eembc_original
http://camswl.com/openjdk/profiles/flat_eembc_opt
http://camswl.com/openjdk/profiles/flat_eembc_asm
http://camswl.com/openjdk/profiles/flat_office_original
http://camswl.com/openjdk/profiles/flat_office_opt
http://camswl.com/openjdk/profiles/flat_office_asm
Here is a summary, with some discussion. The main point I think is that we
are wasting our time optimising the VM for 'semi real world' applications
like Think Free Office. That doesn't mean it is not worthwhile impriving the
VM, but there are classes of application for which it will make little/no
difference.
--- CaffeineMark --- Original
% cumulative self self total
time seconds seconds calls s/call s/call name
84.12 8.53 8.53 2033574 0.00 0.00 BytecodeInterpreter::run(BytecodeInterpreter*)
2.27 8.76 0.23 486 0.00 0.00 CppInterpreter::main_loop(int, Thread*)
-----------------------------
As expected almost all the time spent in 'run'
--- CaffeineMark --- Split interpreter loop (run_opt in C)
% cumulative self self total
time seconds seconds calls s/call s/call name
76.80 7.47 7.47 2093698 0.00 0.00 BytecodeInterpreter::run_opt(BytecodeInterpreter*)
4.63 7.92 0.45 1872003 0.00 0.00 BytecodeInterpreter::run(BytecodeInterpreter*)
4.01 8.31 0.39 489 0.00 0.00 CppInterpreter::main_loop(int, Thread*)
-----------------------------
OK, this looks good. 76% of the time is now spent in the simple interpreter
loop, ready for optimisation.
Note: you will notice that the time in seconds remains approx the same
in CaffeineMark because CaffeineMark runs for a fixed time rather that
a fixed number of iterations.
--- CaffeineMark --- Asm interpreter loop (run_opt in ARM asm)
% cumulative self self total
time seconds seconds calls ms/call ms/call name
77.15 7.70 7.70 BytecodeInterpreter::run_opt(BytecodeInterpreter*)
5.71 8.27 0.57 2136095 0.00 0.00 BytecodeInterpreter::run(BytecodeInterpreter*)
2.51 8.52 0.25 490 0.51 1.65 CppInterpreter::main_loop(int, Thread*)
------------------------------
Beginning to flatten somewhat, but still mainly in 'run_opt'. Still plenty of
scope for optimisation.
Note: The Caffeinemark score for asm was approx 1.9X original.
Now, lookup at EEMBC
--- EEMBC --- Original
% cumulative self self total
time seconds seconds calls s/call s/call name
59.07 73.29 73.29 100435660 0.00 0.00 BytecodeInterpreter::run(BytecodeInterpreter*)
11.29 87.29 14.01 1850 0.01 0.02 CppInterpreter::main_loop(int, Thread*)
4.55 92.94 5.65 47872597 0.00 0.00 methodOopDesc::result_type() const
3.83 97.69 4.75 47882070 0.00 0.00 SignatureIterator::iterate_returntype()
3.32 101.82 4.13 47096159 0.00 0.00 InterpreterFrame::build(ZeroStack*, methodOopDesc*, JavaThread*)
2.80 105.30 3.48 94194346 0.00 0.00 os::vm_page_size()
2.36 108.23 2.93 46326866 0.00 0.00 CppInterpreter::normal_entry(methodOopDesc*, int, Thread*)
---------------------------
60% in 'run', note the relatively large % in 'main_loop'. This is because each
invoke and return actually returns (in C) to 'main_loop', the invoke or return
is then handled in 'main_loop' which then calls 'run' again. Hence the
huge number of calls to 'run'.
--- EEMBC --- Split interpreter loop (run_opt in C)
% cumulative self self total
time seconds seconds calls s/call s/call name
31.33 39.03 39.03 113138096 0.00 0.00 BytecodeInterpreter::run_opt(BytecodeInterpreter*)
22.84 67.48 28.45 100435423 0.00 0.00 BytecodeInterpreter::run(BytecodeInterpreter*)
12.78 83.39 15.92 1842 0.01 0.03 CppInterpreter::main_loop(int, Thread*)
5.31 90.00 6.61 47872423 0.00 0.00 methodOopDesc::result_type() const
4.43 95.52 5.52 47096055 0.00 0.00 InterpreterFrame::build(ZeroStack*, methodOopDesc*, JavaThread*)
3.77 100.22 4.70 47881896 0.00 0.00 SignatureIterator::iterate_returntype()
2.99 103.95 3.73 47897743 0.00 0.00 SignatureIterator::parse_type()
2.50 107.07 3.12 94194122 0.00 0.00 os::vm_page_size()
2.20 109.81 2.75 46326782 0.00 0.00 CppInterpreter::normal_entry(methodOopDesc*, int, Thread*)
1.88 112.16 2.35 74343657 0.00 0.00 ConstantPoolCacheEntry::is_resolved(Bytecodes::Code) const
----------------------------
Flattening out with 54% spent in 'run' and 'run_opt'. Note the oddity
'vm_page_size()'. What is making almost 1E8 calls to vm_page_size()???
I don't think it is going to change since the last time they called it.
--- EEMBC --- Asm interpreter loop (run_opt in ARM asm)
% cumulative self self total
time seconds seconds calls s/call s/call name
25.74 28.29 28.29 BytecodeInterpreter::run_opt(BytecodeInterpreter*)
25.02 55.77 27.49 100435695 0.00 0.00 BytecodeInterpreter::run(BytecodeInterpreter*)
13.82 70.96 15.19 1842 0.01 0.02 CppInterpreter::main_loop(int, Thread*)
5.48 76.98 6.02 47872566 0.00 0.00 methodOopDesc::result_type() const
4.51 81.93 4.96 47096184 0.00 0.00 InterpreterFrame::build(ZeroStack*, methodOopDesc*, JavaThread*)
4.20 86.54 4.61 47882020 0.00 0.00 SignatureIterator::iterate_returntype()
2.99 89.83 3.29 46326896 0.00 0.00 CppInterpreter::normal_entry(methodOopDesc*, int, Thread*)
2.89 93.01 3.18 94194380 0.00 0.00 os::vm_page_size()
2.17 95.39 2.38 47897875 0.00 0.00 SignatureIterator::parse_type()
1.80 97.37 1.98 74343844 0.00 0.00 ConstantPoolCacheEntry::is_resolved(Bytecodes::Code) const
1.73 99.27 1.90 47884816 0.00 0.00 SignatureIterator::SignatureIterator(symbolHandle)
----------------------------------
OK. So now just 50% in 'run' and 'run_opt' with only 25% in 'run_opt'. This
severly limits further optimisation by improving 'run_opt'. Even if we made
run_opt go 10 X faster it would only improve performance by about 20%.
--- ThinkFreeOffice --- Original
% cumulative self self total
time seconds seconds calls s/call s/call name
42.12 21.35 21.35 33855747 0.00 0.00 BytecodeInterpreter::run(BytecodeInterpreter*)
9.70 26.27 4.92 34204 0.00 0.00 CppInterpreter::main_loop(int, Thread*)
5.17 28.89 2.62 17783120 0.00 0.00 methodOopDesc::result_type() const
4.54 31.19 2.30 16484931 0.00 0.00 InterpreterFrame::build(ZeroStack*, methodOopDesc*, JavaThread*)
4.48 33.46 2.27 17847099 0.00 0.00 SignatureIterator::iterate_returntype()
2.49 34.72 1.26 15239483 0.00 0.00 CppInterpreter::normal_entry(methodOopDesc*, int, Thread*)
2.41 35.94 1.22 18044932 0.00 0.00 SignatureIterator::parse_type()
2.05 36.98 1.04 33005883 0.00 0.00 os::vm_page_size()
1.47 37.72 0.75 1245317 0.00 0.00 CppInterpreter::native_entry(methodOopDesc*, int, Thread*)
1.44 38.45 0.73 17891344 0.00 0.00 SignatureIterator::SignatureIterator(symbolHandle)
1.09 39.00 0.55 1773458 0.00 0.00 CppInterpreter::accessor_entry(methodOopDesc*, int, Thread*)
--------------------------------
Note how relatively flat it is to start with. Only 42% in 'run'. And this is
not the whole truth. Think Office is spending a lot of time in native methods
(look at the number of calls to 'native_entry') but these are not shown in the
profile because I cant link against libjava.so. So the 42% of the time is
42% of the time spent in the JVM itself, not in the whole of openjdk.
Look at native_entry. 2.4% of the time spent in native_entry itself but
37 seconds attributable to native_entry & all its descendants.
--- ThinkFreeOffice --- Split interpreter loop (run_opt in C)
% cumulative self self total
time seconds seconds calls s/call s/call name
23.24 12.92 12.92 34292092 0.00 0.00 BytecodeInterpreter::run(BytecodeInterpreter*)
18.08 22.96 10.05 33659712 0.00 0.00 BytecodeInterpreter::run_opt(BytecodeInterpreter*)
10.66 28.89 5.93 34512 0.00 0.00 CppInterpreter::main_loop(int, Thread*)
5.90 32.17 3.28 17997457 0.00 0.00 methodOopDesc::result_type() const
4.14 34.47 2.30 16685449 0.00 0.00 InterpreterFrame::build(ZeroStack*, methodOopDesc*, JavaThread*)
4.07 36.73 2.26 18064129 0.00 0.00 SignatureIterator::iterate_returntype()
3.17 38.49 1.76 18260032 0.00 0.00 SignatureIterator::parse_type()
2.23 39.73 1.24 15425660 0.00 0.00 CppInterpreter::normal_entry(methodOopDesc*, int, Thread*)
1.96 40.82 1.09 23829897 0.00 0.00 ConstantPoolCacheEntry::is_resolved(Bytecodes::Code) const
1.75 41.79 0.97 33407124 0.00 0.00 os::vm_page_size()
1.40 42.57 0.78 18106431 0.00 0.00 SignatureIterator::SignatureIterator(symbolHandle)
1.26 43.27 0.70 1259656 0.00 0.00 CppInterpreter::native_entry(methodOopDesc*, int, Thread*)
---------------------------------------------------------
Note interestingly only 18% of the time spent in run_opt with 23% spent in run
because of the huge number of invokes and returns (presumeable to those native
methods).
--- ThinkFreeOffice --- Asm interpreter loop (run_opt in ARM asm)
% cumulative self self total
time seconds seconds calls s/call s/call name
24.38 12.23 12.23 34428228 0.00 0.00 BytecodeInterpreter::run(BytecodeInterpreter*)
14.14 19.32 7.09 BytecodeInterpreter::run_opt(BytecodeInterpreter*)
11.36 25.01 5.70 34395 0.00 0.00 CppInterpreter::main_loop(int, Thread*)
6.08 28.06 3.05 18104834 0.00 0.00 methodOopDesc::result_type() const
4.99 30.56 2.50 18170808 0.00 0.00 SignatureIterator::iterate_returntype()
4.15 32.64 2.08 16765945 0.00 0.00 InterpreterFrame::build(ZeroStack*, methodOopDesc*, JavaThread*)
2.57 33.93 1.29 18369885 0.00 0.00 SignatureIterator::parse_type()
2.53 35.20 1.27 23874942 0.00 0.00 ConstantPoolCacheEntry::is_resolved(Bytecodes::Code) const
2.29 36.35 1.15 33568034 0.00 0.00 os::vm_page_size()
2.05 37.38 1.03 15479805 0.00 0.00 CppInterpreter::normal_entry(methodOopDesc*, int, Thread*)
1.46 38.11 0.73 18214073 0.00 0.00 SignatureIterator::SignatureIterator(symbolHandle)
1.39 38.81 0.70 1286010 0.00 0.00 CppInterpreter::native_entry(methodOopDesc*, int, Thread*)
------------------------------
Note much scope for improvement with only 14% spent in run_opt.
What I would really like to do is profile libjava.so so I can find out what
it is doing in those native methods. Any help on how to do this much
appreciated.
Regards,
Ed.
More information about the zero-dev
mailing list