Improving the performance of OpenJDK

Wed Feb 18 02:43:35 PST 2009

Hi Ed,

I haven't looked at the code in any detail -- it's pretty difficult
to locate your stuff in that massive patch -- but here are my initial
thoughts.

Edward Nevill wrote:
> Splitting the loop like this improves the code generated by gcc in a
> number of ways. Firstly it improves register allocation because the
> compiler is not trying to allocate registers across complex
> code. This code is infrequently executed, but the compiler has no
> way of knowing, and tends to give the complex code more priority for
> register allocations (since it is the deepest, most nested piece of
> code, it must be the most frequently executed, right? Wrong!!!).

I don't know if this would make a huge difference, but there's a
conditional, LOTS_OF_REGS, defined in bytecodeInterpreter_zero.hpp,
that, when set, adds the register keyword to several variables in
the bytecode interpreter's main loop.  It might be worth turning it
on for ARM and seeing if it has an effect.
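
For anyone who hasn't seen this kind of conditional, its general shape
is roughly as follows.  This is only a sketch, not the actual HotSpot
source: the REG_HINT macro, the function and the variable names are all
made up for illustration.  Note that register is only a hint which
compilers are free to ignore, so the effect would have to be measured.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical: a port's header would define LOTS_OF_REGS when the target
// has enough registers to make the hint worthwhile. It is left undefined
// here so the sketch also builds as C++17, which removed 'register'.
// #define LOTS_OF_REGS

#ifdef LOTS_OF_REGS
#define REG_HINT register
#else
#define REG_HINT
#endif

// Toy stand-in for an interpreter loop: adds up opcodes until it reaches
// the end of the buffer or an opcode of 0.
int run_toy_loop(const uint8_t* code, size_t len) {
  REG_HINT const uint8_t* pc = code;         // hot: current bytecode pointer
  REG_HINT const uint8_t* end = code + len;  // hot: end of the buffer
  REG_HINT int acc = 0;                      // hot: running result
  while (pc < end && *pc != 0) {
    acc += *pc++;
  }
  return acc;
}
```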

> The interpreter (as is) has two modes of operation, TaggedStacks and
> not Tagged. A TaggedStack is one where in addition to the data on
> the stack a tag is stored with each datum to say what type it is
> (the main types we are interested in are 'a' and non 'a'). This
> means that each stack element is 8 bytes.  The TaggedStack (as I
> understand it) is only used by certain garbage collectors to
> identify what elements on the stack are references and it is not the
> default.

As I understand it, the tagged stack interpreter was written because
some applications had such complex code that locating the objects on
the stack was taking a huge amount of time.  It was a particular
problem with automatically generated code, from JSPs for example.
I hear it didn't work particularly well, and is pretty much out of
favour now as the initial problem was worked around by some other
means.  I'm not sure it even works correctly in the C++ interpreter,
and Zero certainly doesn't support it.  It may be that we can just
strip it out...
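
To make the size cost concrete, here is a rough sketch of a tagged
versus an untagged stack element.  The type names and tag values are
hypothetical, not HotSpot's; the only things taken from the description
above are that a tag accompanies each datum and that the tag
distinguishes references from everything else.

```cpp
#include <cstdint>

// Hypothetical tag values: reference ('a') versus anything else.
enum class SlotTag : uintptr_t { NotReference = 0, Reference = 1 };

// Untagged: one machine word per stack element.
struct PlainSlot {
  intptr_t value;
};

// Tagged: a tag word stored alongside each value, so two machine words
// per element (8 bytes on a 32-bit machine).
struct TaggedSlot {
  SlotTag tag;
  intptr_t value;
};

static_assert(sizeof(TaggedSlot) == 2 * sizeof(PlainSlot),
              "a tagged element takes twice the space of an untagged one");

// With tags, a stack walker can pick out references by inspecting the
// stack alone, without consulting any other type information.
inline int count_references(const TaggedSlot* stack, int depth) {
  int refs = 0;
  for (int i = 0; i < depth; i++) {
    if (stack[i].tag == SlotTag::Reference) {
      refs++;
    }
  }
  return refs;
}
```

The trade-off is visible in the static_assert: every push and pop
touches twice as much memory in exchange for cheaper stack scanning.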

> get_native_u2() and get_Java_u2() ... This seems to be a misguided
> attempt by the original authors to optimise the reading of halfwords
> (judging by the comment immediately preceding the code).

It's not an optimization; it's there to do unaligned reads on hardware
that doesn't support them.  I'm guessing ARM does allow unaligned access,
given that your code didn't fault instantly ;)  We should
probably optimize this for machines that allow it, given that it has
the performance impact you describe.  Does anyone know which machines
do and do not allow it?  AFAIK x86, x86_64 yes; ppc, ppc64 no; others?
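
To illustrate the distinction, here is a rough sketch of the two
styles of halfword read.  The function names are hypothetical and this
is not the actual get_native_u2()/get_Java_u2() code; it only shows a
byte-at-a-time read that is safe at any alignment next to a read that a
compiler can usually reduce to one load where the hardware permits it.

```cpp
#include <cstdint>
#include <cstring>

// Portable: assemble a big-endian (Java order) u2 one byte at a time.
// Safe at any alignment on any machine, at the cost of two byte loads
// plus a shift and an or.
inline uint16_t read_java_u2_bytewise(const uint8_t* p) {
  return static_cast<uint16_t>((p[0] << 8) | p[1]);
}

// Native-order u2 read via memcpy, which is well defined for any
// alignment. On machines that permit unaligned loads, compilers will
// typically turn this into a single halfword load.
inline uint16_t read_native_u2(const uint8_t* p) {
  uint16_t v;
  std::memcpy(&v, p, sizeof v);
  return v;
}

// What to avoid on strict-alignment hardware: casting a possibly odd
// address to uint16_t* and dereferencing it. Depending on the machine
// that may trap or silently return the wrong value, and it is undefined
// behaviour in C++ regardless.
```

A per-platform switch could then select the faster variant only on
machines known to tolerate unaligned access.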

Cheers,
Gary

-- 
http://gbenson.net/