Profiling

Thu Feb 19 16:51:05 PST 2009

Hi folks,

I have been doing some profing on the vm using gprof with some interesting
results. To do this I set LINK_INTO=AOUT to generate a static 'gamma' launcher
with the appropriate -pg flags set.

However, I have not been able to do this for the other libraries (eg libjava.so)
only for libjvm.so so we do not get any profiling information on the native
libraries which is what I really wanted.

The problem is that unlike libjvm.so libjava.so and friends are not linked in,
instead they are opened and referenced using dll_open and friends. So I cannot
just statically link in libjava.so.

Does anyone know of a way to get gprof information out of libjava.so.

Anyway, on to the results I got.

The benchmarks I used were CaffeineMark, EEMBC and ThinkFreeOffic (a pure Java
office suite). I ran each benchmark with the original build, the split
interpreter loop (in C), and the split interpreter loop with the split
recoded in ARM asm.

The complete results are available here

http://camswl.com/openjdk/profiles/flat_ecm_original
http://camswl.com/openjdk/profiles/flat_ecm_opt
http://camswl.com/openjdk/profiles/flat_ecm_asm
http://camswl.com/openjdk/profiles/flat_eembc_original
http://camswl.com/openjdk/profiles/flat_eembc_opt
http://camswl.com/openjdk/profiles/flat_eembc_asm
http://camswl.com/openjdk/profiles/flat_office_original
http://camswl.com/openjdk/profiles/flat_office_opt
http://camswl.com/openjdk/profiles/flat_office_asm

Here is a summary, with some discussion. The main point I think is that we
are wasting our time optimising the VM for 'semi real world' applications
like Think Free Office. That doesn't mean it is not worthwhile impriving the
VM, but there are classes of application for which it will make little/no
difference.

--- CaffeineMark --- Original

  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 84.12      8.53     8.53  2033574     0.00     0.00  BytecodeInterpreter::run(BytecodeInterpreter*)
  2.27      8.76     0.23      486     0.00     0.00  CppInterpreter::main_loop(int, Thread*)
-----------------------------

As expected almost all the time spent in 'run'

--- CaffeineMark --- Split interpreter loop (run_opt in C)

  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 76.80      7.47     7.47  2093698     0.00     0.00  BytecodeInterpreter::run_opt(BytecodeInterpreter*)
  4.63      7.92     0.45  1872003     0.00     0.00  BytecodeInterpreter::run(BytecodeInterpreter*)
  4.01      8.31     0.39      489     0.00     0.00  CppInterpreter::main_loop(int, Thread*)
-----------------------------

OK, this looks good. 76% of the time is now spent in the simple interpreter
loop, ready for optimisation.

Note: you will notice that the time in seconds remains approx the same
in CaffeineMark because CaffeineMark runs for a fixed time rather that
a fixed number of iterations.

--- CaffeineMark --- Asm interpreter loop (run_opt in ARM asm)

  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 77.15      7.70     7.70                             BytecodeInterpreter::run_opt(BytecodeInterpreter*)
  5.71      8.27     0.57  2136095     0.00     0.00  BytecodeInterpreter::run(BytecodeInterpreter*)
  2.51      8.52     0.25      490     0.51     1.65  CppInterpreter::main_loop(int, Thread*)
------------------------------

Beginning to flatten somewhat, but still mainly in 'run_opt'. Still plenty of
scope for optimisation.

Note: The Caffeinemark score for asm was approx 1.9X original.

Now, lookup at EEMBC

--- EEMBC --- Original

  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 59.07     73.29    73.29 100435660     0.00     0.00  BytecodeInterpreter::run(BytecodeInterpreter*)
 11.29     87.29    14.01     1850     0.01     0.02  CppInterpreter::main_loop(int, Thread*)
  4.55     92.94     5.65 47872597     0.00     0.00  methodOopDesc::result_type() const
  3.83     97.69     4.75 47882070     0.00     0.00  SignatureIterator::iterate_returntype()
  3.32    101.82     4.13 47096159     0.00     0.00  InterpreterFrame::build(ZeroStack*, methodOopDesc*, JavaThread*)
  2.80    105.30     3.48 94194346     0.00     0.00  os::vm_page_size()
  2.36    108.23     2.93 46326866     0.00     0.00  CppInterpreter::normal_entry(methodOopDesc*, int, Thread*)
---------------------------

60% in 'run', note the relatively large % in 'main_loop'. This is because each
invoke and return actually returns (in C) to 'main_loop', the invoke or return
is then handled in 'main_loop' which then calls 'run' again. Hence the
huge number of calls to 'run'.

--- EEMBC --- Split interpreter loop (run_opt in C)

  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 31.33     39.03    39.03 113138096     0.00     0.00  BytecodeInterpreter::run_opt(BytecodeInterpreter*)
 22.84     67.48    28.45 100435423     0.00     0.00  BytecodeInterpreter::run(BytecodeInterpreter*)
 12.78     83.39    15.92     1842     0.01     0.03  CppInterpreter::main_loop(int, Thread*)
  5.31     90.00     6.61 47872423     0.00     0.00  methodOopDesc::result_type() const
  4.43     95.52     5.52 47096055     0.00     0.00  InterpreterFrame::build(ZeroStack*, methodOopDesc*, JavaThread*)
  3.77    100.22     4.70 47881896     0.00     0.00  SignatureIterator::iterate_returntype()
  2.99    103.95     3.73 47897743     0.00     0.00  SignatureIterator::parse_type()
  2.50    107.07     3.12 94194122     0.00     0.00  os::vm_page_size()
  2.20    109.81     2.75 46326782     0.00     0.00  CppInterpreter::normal_entry(methodOopDesc*, int, Thread*)
  1.88    112.16     2.35 74343657     0.00     0.00  ConstantPoolCacheEntry::is_resolved(Bytecodes::Code) const
----------------------------

Flattening out with 54% spent in 'run' and 'run_opt'. Note the oddity
'vm_page_size()'. What is making almost 1E8 calls to vm_page_size()???
I don't think it is going to change since the last time they called it.

--- EEMBC --- Asm interpreter loop (run_opt in ARM asm)

  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 25.74     28.29    28.29                             BytecodeInterpreter::run_opt(BytecodeInterpreter*)
 25.02     55.77    27.49 100435695     0.00     0.00  BytecodeInterpreter::run(BytecodeInterpreter*)
 13.82     70.96    15.19     1842     0.01     0.02  CppInterpreter::main_loop(int, Thread*)
  5.48     76.98     6.02 47872566     0.00     0.00  methodOopDesc::result_type() const
  4.51     81.93     4.96 47096184     0.00     0.00  InterpreterFrame::build(ZeroStack*, methodOopDesc*, JavaThread*)
  4.20     86.54     4.61 47882020     0.00     0.00  SignatureIterator::iterate_returntype()
  2.99     89.83     3.29 46326896     0.00     0.00  CppInterpreter::normal_entry(methodOopDesc*, int, Thread*)
  2.89     93.01     3.18 94194380     0.00     0.00  os::vm_page_size()
  2.17     95.39     2.38 47897875     0.00     0.00  SignatureIterator::parse_type()
  1.80     97.37     1.98 74343844     0.00     0.00  ConstantPoolCacheEntry::is_resolved(Bytecodes::Code) const
  1.73     99.27     1.90 47884816     0.00     0.00  SignatureIterator::SignatureIterator(symbolHandle)
----------------------------------

OK. So now just 50% in 'run' and 'run_opt' with only 25% in 'run_opt'. This
severly limits further optimisation by improving 'run_opt'. Even if we made
run_opt go 10 X faster it would only improve performance by about 20%.

--- ThinkFreeOffice --- Original

  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 42.12     21.35    21.35 33855747     0.00     0.00  BytecodeInterpreter::run(BytecodeInterpreter*)
  9.70     26.27     4.92    34204     0.00     0.00  CppInterpreter::main_loop(int, Thread*)
  5.17     28.89     2.62 17783120     0.00     0.00  methodOopDesc::result_type() const
  4.54     31.19     2.30 16484931     0.00     0.00  InterpreterFrame::build(ZeroStack*, methodOopDesc*, JavaThread*)
  4.48     33.46     2.27 17847099     0.00     0.00  SignatureIterator::iterate_returntype()
  2.49     34.72     1.26 15239483     0.00     0.00  CppInterpreter::normal_entry(methodOopDesc*, int, Thread*)
  2.41     35.94     1.22 18044932     0.00     0.00  SignatureIterator::parse_type()
  2.05     36.98     1.04 33005883     0.00     0.00  os::vm_page_size()
  1.47     37.72     0.75  1245317     0.00     0.00  CppInterpreter::native_entry(methodOopDesc*, int, Thread*)
  1.44     38.45     0.73 17891344     0.00     0.00  SignatureIterator::SignatureIterator(symbolHandle)
  1.09     39.00     0.55  1773458     0.00     0.00  CppInterpreter::accessor_entry(methodOopDesc*, int, Thread*)
--------------------------------

Note how relatively flat it is to start with. Only 42% in 'run'. And this is
not the whole truth. Think Office is spending a lot of time in native methods
(look at the number of calls to 'native_entry') but these are not shown in the
profile because I cant link against libjava.so. So the 42% of the time is
42% of the time spent in the JVM itself, not in the whole of openjdk.

Look at native_entry. 2.4% of the time spent in native_entry itself but
37 seconds attributable to native_entry & all its descendants.

--- ThinkFreeOffice --- Split interpreter loop (run_opt in C)

  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 23.24     12.92    12.92 34292092     0.00     0.00  BytecodeInterpreter::run(BytecodeInterpreter*)
 18.08     22.96    10.05 33659712     0.00     0.00  BytecodeInterpreter::run_opt(BytecodeInterpreter*)
 10.66     28.89     5.93    34512     0.00     0.00  CppInterpreter::main_loop(int, Thread*)
  5.90     32.17     3.28 17997457     0.00     0.00  methodOopDesc::result_type() const
  4.14     34.47     2.30 16685449     0.00     0.00  InterpreterFrame::build(ZeroStack*, methodOopDesc*, JavaThread*)
  4.07     36.73     2.26 18064129     0.00     0.00  SignatureIterator::iterate_returntype()
  3.17     38.49     1.76 18260032     0.00     0.00  SignatureIterator::parse_type()
  2.23     39.73     1.24 15425660     0.00     0.00  CppInterpreter::normal_entry(methodOopDesc*, int, Thread*)
  1.96     40.82     1.09 23829897     0.00     0.00  ConstantPoolCacheEntry::is_resolved(Bytecodes::Code) const
  1.75     41.79     0.97 33407124     0.00     0.00  os::vm_page_size()
  1.40     42.57     0.78 18106431     0.00     0.00  SignatureIterator::SignatureIterator(symbolHandle)
  1.26     43.27     0.70  1259656     0.00     0.00  CppInterpreter::native_entry(methodOopDesc*, int, Thread*)
---------------------------------------------------------

Note interestingly only 18% of the time spent in run_opt with 23% spent in run
because of the huge number of invokes and returns (presumeable to those native
methods).

--- ThinkFreeOffice --- Asm interpreter loop (run_opt in ARM asm)

  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 24.38     12.23    12.23 34428228     0.00     0.00  BytecodeInterpreter::run(BytecodeInterpreter*)
 14.14     19.32     7.09                             BytecodeInterpreter::run_opt(BytecodeInterpreter*)
 11.36     25.01     5.70    34395     0.00     0.00  CppInterpreter::main_loop(int, Thread*)
  6.08     28.06     3.05 18104834     0.00     0.00  methodOopDesc::result_type() const
  4.99     30.56     2.50 18170808     0.00     0.00  SignatureIterator::iterate_returntype()
  4.15     32.64     2.08 16765945     0.00     0.00  InterpreterFrame::build(ZeroStack*, methodOopDesc*, JavaThread*)
  2.57     33.93     1.29 18369885     0.00     0.00  SignatureIterator::parse_type()
  2.53     35.20     1.27 23874942     0.00     0.00  ConstantPoolCacheEntry::is_resolved(Bytecodes::Code) const
  2.29     36.35     1.15 33568034     0.00     0.00  os::vm_page_size()
  2.05     37.38     1.03 15479805     0.00     0.00  CppInterpreter::normal_entry(methodOopDesc*, int, Thread*)
  1.46     38.11     0.73 18214073     0.00     0.00  SignatureIterator::SignatureIterator(symbolHandle)
  1.39     38.81     0.70  1286010     0.00     0.00  CppInterpreter::native_entry(methodOopDesc*, int, Thread*)
------------------------------

Note much scope for improvement with only 14% spent in run_opt.

What I would really like to do is profile libjava.so so I can find out what
it is doing in those native methods. Any help on how to do this much
appreciated.

Regards,
Ed.