From gbenson at redhat.com Sun Feb 8 06:49:10 2009 From: gbenson at redhat.com (Gary Benson) Date: Sun, 8 Feb 2009 06:49:10 -0800 (PST) Subject: Zero build passes TCK Message-ID: <20090208144910.20AED2C434@rexford.dreamhost.com> For the past three months the OpenJDK team at Red Hat has been working on a project to bring the Zero-assembler port of HotSpot to the point where IcedTea builds using Zero are capable of passing the Java SE 6 TCK. As a result of this work I am pleased to announce that the latest OpenJDK packages included in Fedora 10 for 32- and 64-bit PowerPC have passed the Java SE 6 TCK and are compatible with the Java SE 6 platform. This work was funded by Red Hat. From Edward.Nevill at arm.com Tue Feb 17 08:13:19 2009 From: Edward.Nevill at arm.com (Edward Nevill) Date: Tue, 17 Feb 2009 16:13:19 -0000 Subject: Optimised ARM assembler loop Message-ID: <757128096FE8CD4B9EC01C3B29CFB8431251CB@ZIPPY.Emea.Arm.com> Hi all, I have now completed converting the bytecodeInterpreterOpt.cpp in my previous email into hand crafted ARM assembler and uploaded to my PPA. This gives approx 100% performance improvement over the original Zero. I have conditionalised both sets of optimisations (the generic C optimisations and the Asm optimisations) so they only build on ARM. The conditionalisation is in ports/hotspot/build/linux/makefiles/zero.make # Not included in includeDB because it has no dependencies ifdef ICEDTEA_ZERO_BUILD # ECN - For the time being HOTSPOT_OPT only enabled for ARM # CFLAGS += -DHOTSPOT_OPT ifeq ($(ZERO_LIBARCH),arm) Obj_Files += bytecodeInterpreter_arm.o CFLAGS += -DHOTSPOT_OPT -DHOTSPOT_ASM endif endif If you want to try out the generic optimisations uncomment the CFLAGS += -DHOTSPOT_OPT line. Note: You will also need to update debian/rules or you won't even get the patches. Unconditionalise the lines ifneq (,$(filter $(DEB_HOST_ARCH),armel i386)) DISTRIBUTION_PATCHES += \ debian/patches/zero-port-opt-new.diff \ debian/patches/zero-hotspot-opt-new.diff endif (Of course, if you are running on an ARM platform you don't need to do this). The main change was the addition of the asm loop; however, there were a few minor changes.... (note these are additional changes to those described in my previous document). - Added a rule to openjdk/hotspot/make/linux/makefiles/rules.make o This allows compilation of .S to .o - In openjdk/hotspot/src/share/vm/interpreter/bytecodeInterpreter.cpp o Added conditionalisation to turn HOTSPOT_OPT off when building jvmti o Changed run_opt so it no longer takes a strange 'bmb' argument and always returns 0. The action is now always to re-execute the bytecode regardless of why it failed (eg exception etc). bytecodeInterpreterOpt.cpp is responsible for putting the world back in a state where the bytecode can be re-executed. Sources available in... https://launchpad.net/~enevill/+archive/ppa openjdk-6-6b14-0ubuntu14~ed01 Binaries available RSN, Regards, Ed. -- IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you. -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/zero-dev/attachments/20090217/351932ac/attachment.html From Edward.Nevill at arm.com Tue Feb 17 08:40:16 2009 From: Edward.Nevill at arm.com (Edward Nevill) Date: Tue, 17 Feb 2009 16:40:16 -0000 Subject: Improving the performance of OpenJDK Message-ID: <757128096FE8CD4B9EC01C3B29CFB8431251CD@ZIPPY.Emea.Arm.com> This seems to have got lost in the ether, ... resending From: Edward Nevill Sent: 17 February 2009 15:25 To: 'zero-dev at openjdk.java.net' Subject: Improving the performance of OpenJDK Hi All, I have been looking at ways of improving the performance of Zero I understand that performance was not one of the original goals of Zero, however... The performance of Zero is absolutely, incredibly, dire. To give an example, I tried an application ThinkFreeOffice (another office suite). To open a new blank word document by clicking on the 'New Document' icon (having already loaded ThinkFreeOffice), took ... 1 Minute
....
18 Seconds
... that's about as long as it takes the Stig to do a lap. So... I would like to open the discussion on improving Zero performance. Attached is an HTML document describing some simple changes I have made which provide ~50% improvement. Regards, Ed. -- IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/zero-dev/attachments/20090217/1e7e5933/attachment.html -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/zero-dev/attachments/20090217/1e7e5933/attachment.htm From aph at redhat.com Tue Feb 17 09:05:50 2009 From: aph at redhat.com (Andrew Haley) Date: Tue, 17 Feb 2009 17:05:50 +0000 Subject: Optimised ARM assembler loop In-Reply-To: <757128096FE8CD4B9EC01C3B29CFB8431251CB@ZIPPY.Emea.Arm.com> References: <757128096FE8CD4B9EC01C3B29CFB8431251CB@ZIPPY.Emea.Arm.com> Message-ID: <499AEE6E.5060706@redhat.com> Edward Nevill wrote: > I have now completed converting the bytecodeInterpreterOpt.cpp in my previous email into hand crafted ARM assembler and uploaded to my PPA. > This gives approx 100% performance improvement over the original Zero. Excellent. Can I make one tiny, teeny request? The first time I read that I assumed that 100% of the time was saved, and thus it was infinitely fast. :-) OK, I know that's not what it means. I presume it's now twice as fast as Zero. Andrew. From gbenson at redhat.com Wed Feb 18 02:43:35 2009 From: gbenson at redhat.com (Gary Benson) Date: Wed, 18 Feb 2009 10:43:35 +0000 Subject: Improving the performance of OpenJDK In-Reply-To: <757128096FE8CD4B9EC01C3B29CFB8431251CD@ZIPPY.Emea.Arm.com> References: <757128096FE8CD4B9EC01C3B29CFB8431251CD@ZIPPY.Emea.Arm.com> Message-ID: <20090218104335.GB3213@redhat.com> Hi Ed, I haven't looked into the code particularly -- it's pretty difficult to locate your stuff in that massive patch -- but here are my initial thoughts. Edward Nevill wrote: > Splitting the loop like this improves the code generated by gcc in a > number of ways. Firstly it improves register allocation because the > compiler is not trying to allocate registers across complex > code. This code is infrequently executed, but the compiler has no > way of knowing, and tends to give the complex code more priority for > register allocations (since it is the deepest, most nested piece of > code, it must be the most frequently executed, right? Wrong!!!). I don't know if this would make a huge difference, but there's a conditional, LOTS_OF_REGS, defined in bytecodeInterpreter_zero.hpp, that specifies register keywords for several variables in the bytecode interpreter's main loop. It might be worth turning it on for ARM and seeing if it has an effect. > The interpreter (as is) has two modes of operation, TaggedStacks and > not Tagged. A TaggedStack is one where in addition to the data on > the stack a tag is stored with each datum to say what type it is > (the main types we are interested in are 'a' and non 'a'). This > means that each stack element is 8 bytes.
The TaggedStack (as I > understand it) is only used by certain garbage collectors to > identify what elements on the stack are references and it is not the > default. As I understand it, the tagged stack interpreter was written because some applications had such complex code that locating the objects on the stack was taking a huge amount of time. It was a particular problem with automatically generated code, from JSPs for example. I hear it didn't particularly work well, and is pretty much out of favour now as the initial problem was worked around by some other means. I'm not sure it even works correctly in the C++ interpreter, and Zero certainly doesn't support it. It may be that we can just strip it out... > get_native_u2() and get_Java_u2() ... This seems to be a misguided > attempt of the original authors to optimised reading of halfwords > (judging by the comment immediate preceding the code). It's not an optimization, it's to do unaligned access on hardware that doesn't support it. I'm guessing ARM does allow unaligned access by the fact that your code didn't segfault instantly ;) We should probably optimize this for machines that allow it, given that it has the performance impact you describe. Does anyone know which machines do and do not allow it? AFAIK x86, x86_64 yes; ppc, ppc64 no; others? Cheers, Gary -- http://gbenson.net/ From aph at redhat.com Wed Feb 18 02:57:43 2009 From: aph at redhat.com (Andrew Haley) Date: Wed, 18 Feb 2009 10:57:43 +0000 Subject: Improving the performance of OpenJDK In-Reply-To: <20090218104335.GB3213@redhat.com> References: <757128096FE8CD4B9EC01C3B29CFB8431251CD@ZIPPY.Emea.Arm.com> <20090218104335.GB3213@redhat.com> Message-ID: <499BE9A7.3020405@redhat.com> Gary Benson wrote: > Hi Ed, > > I haven't looked into the code particularly -- it's pretty difficult > to locate your stuff in that massive patch -- but here are my initial > thoughts. > > Edward Nevill wrote: >> Splitting the loop like this improves the code generated by gcc in a >> number of ways. Firstly it improves register allocation because the >> compiler is not trying to allocate registers across complex >> code. This code is infrequently executed, but the compiler has no >> way of knowing, and tends to give the complex code more priority for >> register allocations (since it is the deepest, most nested piece of >> code, it must be the most frequently executed, right? Wrong!!!). > > I don't know if this would make a huge difference, but there's a > conditional, LOTS_OF_REGS, defined in bytecodeInterpreter_zero.hpp, > that specifies register keywords for several variables in the > bytecode interpreter's main loop. It might be worth turning it on > for ARM and seeing if it has an effect. I suspect it'd make things worse. ARM has only 16 registers, and some of those are fixed by the ABI. The idea of separating frequently-executed code from stuff that is only used occasionally is a good one. Every compiler, and certainly gc, finds it difficult to do a good job of allocating registers in a large routine. It's especially hard for ARM, which is register-starved. >> get_native_u2() and get_Java_u2() ... This seems to be a misguided >> attempt of the original authors to optimised reading of halfwords >> (judging by the comment immediate preceding the code). > > It's not an optimization, it's to do unaligned access on hardware that > doesn't support it. I'm guessing ARM does allow unaligned access by > the fact that your code didn't segfault instantly ;) ARM doesn't support unaligned loads. 
The new ARM code as posted is ldrsb r0, [java_pc, #0] ldrb r1, [java_pc, #1] orr r1, r1, r0, lsl #8 i.e two byte loads. Andrew. From gbenson at redhat.com Wed Feb 18 03:08:51 2009 From: gbenson at redhat.com (Gary Benson) Date: Wed, 18 Feb 2009 11:08:51 +0000 Subject: Improving the performance of OpenJDK In-Reply-To: <499BE9A7.3020405@redhat.com> References: <757128096FE8CD4B9EC01C3B29CFB8431251CD@ZIPPY.Emea.Arm.com> <20090218104335.GB3213@redhat.com> <499BE9A7.3020405@redhat.com> Message-ID: <20090218110851.GD3213@redhat.com> Andrew Haley wrote: > Gary Benson wrote: > > Edward Nevill wrote: > > > get_native_u2() and get_Java_u2() ... This seems to be a > > > misguided attempt of the original authors to optimised reading > > > of halfwords (judging by the comment immediate preceding the > > > code). > > > > It's not an optimization, it's to do unaligned access on hardware > > that doesn't support it. I'm guessing ARM does allow unaligned > > access by the fact that your code didn't segfault instantly ;) > > ARM doesn't support unaligned loads. The new ARM code as posted is > > ldrsb r0, [java_pc, #0] > ldrb r1, [java_pc, #1] > orr r1, r1, r0, lsl #8 > > i.e two byte loads. Ah, I didn't realise. Which is good, it means this optimization is generic :) Cheers, Gary -- http://gbenson.net/ From aph at redhat.com Wed Feb 18 03:19:15 2009 From: aph at redhat.com (Andrew Haley) Date: Wed, 18 Feb 2009 11:19:15 +0000 Subject: Improving the performance of OpenJDK In-Reply-To: <20090218110851.GD3213@redhat.com> References: <757128096FE8CD4B9EC01C3B29CFB8431251CD@ZIPPY.Emea.Arm.com> <20090218104335.GB3213@redhat.com> <499BE9A7.3020405@redhat.com> <20090218110851.GD3213@redhat.com> Message-ID: <499BEEB3.3010904@redhat.com> Gary Benson wrote: > Andrew Haley wrote: >> Gary Benson wrote: >>> Edward Nevill wrote: >>>> get_native_u2() and get_Java_u2() ... This seems to be a >>>> misguided attempt of the original authors to optimised reading >>>> of halfwords (judging by the comment immediate preceding the >>>> code). >>> It's not an optimization, it's to do unaligned access on hardware >>> that doesn't support it. I'm guessing ARM does allow unaligned >>> access by the fact that your code didn't segfault instantly ;) >> ARM doesn't support unaligned loads. The new ARM code as posted is >> >> ldrsb r0, [java_pc, #0] >> ldrb r1, [java_pc, #1] >> orr r1, r1, r0, lsl #8 >> >> i.e two byte loads. > > Ah, I didn't realise. Which is good, it means this optimization is > generic :) Right. The whole idea of the way it's don ATM is bonkers: do a byte- at-a-time unaligned load into machine order, then reverse the bytes. Maybe the hope was that the compiler would see all this cruft and silently convert it into an efficient form, but, er, no. :-( Andrew. From gbenson at redhat.com Wed Feb 18 03:37:46 2009 From: gbenson at redhat.com (Gary Benson) Date: Wed, 18 Feb 2009 11:37:46 +0000 Subject: Improving the performance of OpenJDK In-Reply-To: <499BEEB3.3010904@redhat.com> References: <757128096FE8CD4B9EC01C3B29CFB8431251CD@ZIPPY.Emea.Arm.com> <20090218104335.GB3213@redhat.com> <499BE9A7.3020405@redhat.com> <20090218110851.GD3213@redhat.com> <499BEEB3.3010904@redhat.com> Message-ID: <20090218113746.GE3213@redhat.com> Andrew Haley wrote: > Right. The whole idea of the way it's don ATM is bonkers: do a > byte- at-a-time unaligned load into machine order, then reverse the > bytes. Maybe the hope was that the compiler would see all this > cruft and silently convert it into an efficient form, but, er, no. 
> :-( How does it work for longer types? For a 64-bit value, for instance, is it better to always do 8 individual loads, or might it be better to try and optimize like they have done? Cheers, Gary -- http://gbenson.net/ From aph at redhat.com Wed Feb 18 04:37:19 2009 From: aph at redhat.com (Andrew Haley) Date: Wed, 18 Feb 2009 12:37:19 +0000 Subject: Improving the performance of OpenJDK In-Reply-To: <20090218113746.GE3213@redhat.com> References: <757128096FE8CD4B9EC01C3B29CFB8431251CD@ZIPPY.Emea.Arm.com> <20090218104335.GB3213@redhat.com> <499BE9A7.3020405@redhat.com> <20090218110851.GD3213@redhat.com> <499BEEB3.3010904@redhat.com> <20090218113746.GE3213@redhat.com> Message-ID: <499C00FF.8010203@redhat.com> Gary Benson wrote: > Andrew Haley wrote: >> Right. The whole idea of the way it's don ATM is bonkers: do a >> byte- at-a-time unaligned load into machine order, then reverse the >> bytes. Maybe the hope was that the compiler would see all this >> cruft and silently convert it into an efficient form, but, er, no. >> :-( > > How does it work for longer types? For a 64-bit value, for instance, > is it better to always do 8 individual loads, or might it be better > to try and optimize like they have done? It depends on the frequency of execution. If the alignment is no better than random with a uniform distribution, then half the time you'll be looking at an address that is not aligned for any type larger than a byte. If so, there's no point checking for special cases. It is, however, worth avoiding 64-bit operations on 32-bit platforms, so this is probably the best way to do a 64-bit big-endian load, at least on gcc: unsigned long long foo6 (unsigned char *p) { unsigned long u1; u1 = ((unsigned long)p[0] << 24 | (unsigned long)p[1] << 16 | (unsigned long)p[2] << 8 | p[3]); unsigned long u2; u2 = ((unsigned long)p[4] << 24 | (unsigned long)p[5] << 16 | (unsigned long)p[6] << 8 | p[7]); return ((unsigned long long)u1<<32 | u2); } The code generated here may be significantly better on 32-bit platforms than the equivalent that uses unsigned long long, and not significantly worse on 64-bit platforms. Andrew. From Edward.Nevill at arm.com Wed Feb 18 05:48:09 2009 From: Edward.Nevill at arm.com (Edward Nevill) Date: Wed, 18 Feb 2009 13:48:09 -0000 Subject: backedge checks Message-ID: <757128096FE8CD4B9EC01C3B29CFB8431251E6@ZIPPY.Emea.Arm.com> Hi folks, In bytecodeInterpreter.c is this wonderful macro #define DO_BACKEDGE_CHECKS(skip, branch_pc) \ if ((skip) <= 0) { \ if (UseCompiler && UseLoopCounter) { \ bool do_OSR = UseOnStackReplacement; \ BACKEDGE_COUNT->increment(); \ if (do_OSR) do_OSR = BACKEDGE_COUNT->reached_InvocationLimit(); \ if (do_OSR) { \ nmethod* osr_nmethod; \ OSR_REQUEST(osr_nmethod, branch_pc); \ if (osr_nmethod != NULL && osr_nmethod->osr_entry_bci() != InvalidOSREntryBci) { \ intptr_t* buf; \ CALL_VM(buf=SharedRuntime::OSR_migration_begin(THREAD), handle_exception); \ istate->set_msg(do_osr); \ istate->set_osr_buf((address)buf); \ istate->set_osr_entry(osr_nmethod->osr_entry()); \ return; \ } \ } else { \ INCR_INVOCATION_COUNT; \ SAFEPOINT; \ } \ } /* UseCompiler ... */ \ INCR_INVOCATION_COUNT; \ SAFEPOINT; \ } This macro is invoked in every branch (although the body is only executed on backwards branches). Firstly, is it just me, or does it do INCR_INVOCATION_COUNT; SAFEPOINT twice on every backwards branch. Surely a mistake? Secondly, can someone tell me under what circumstances in zero UseCompiler would ever be true (and no, it doesn't resolve to a constant. 
Thirdly, it should be possible to avoid the SAFEPOINT checks. SAFEPOINT does... #define SAFEPOINT \ if ( SafepointSynchronize::is_synchronizing()) { \ { \ /* zap freed handles rather than GC'ing them */ \ HandleMarkCleaner __hmc(THREAD); \ } \ CALL_VM(SafepointSynchronize::block(THREAD), handle_exception); \ } However, there is an upcall notice_safepoints() which safepoint.cpp calls whenever it is setting is_synchronizing(). This upcall is there to notify the interpreter that it needs to run to a GC safepoint. A corresponding call ignore_safepoints() is called when the interpreter can safely ignore safepoints. So there should be no need for the interpreter to continually check for safepoints. notice_safepoints() and ignore_safepoints() in the template interpreter do indeed do something sensible. However, in the bytecode Interpreter the implementation is just {} The way it would work is... When compiling gcc bytecodeInterpreter.cpp uses a dispatch table rather than a switch statement. Currently this is defined as const static void* const opclabels_data[256] = { /* 0x00 */ &&opc_nop, &&opc_aconst_null,&&opc_iconst_m1,&&opc_iconst_0, ... }; register uintptr_t *dispatch_table = (uintptr_t*)&opclabels_data[0]; We would change this to const static void* opclabels_data[256] = { /* 0x00 */ &&opc_nop, &&opc_aconst_null,&&opc_iconst_m1,&&opc_iconst_0, ... }; IE. I have just removed the 'const' because we are going to change this dynamically We would then define two sets of handlers for branch instructions struct branch_dispatch { int bytecode; /* The bytecode */ void *handler; /* The handler for it */ }; typedef struct branch_dispatch branch_dispatch; branch_dispatch safe_branch_dispatch_table[] = { { Bytecodes::_ifle, safe_ifle_handler }, { Bytecodes::_ifgt, safe_ifgt_handler }, ... }; branch_dispatch unsafe_branch_dispatch_table[] = { { Bytecodes::_ifle, unsafe_ifle_handler }, { Bytecodes::_ifgt, unsafe_ifgt_handler }, ... } notice_safepoints() and ignore_safepoints() then become void update_table(branch_dispatch *p, branch_dispatch *pend) { do { opclabels_data[p->bytecode] = p->handler; } while (p++ < pend); } notice_safepoints() { update_table(safe_branch_dispatch_table, safe_branch_dispatch_table + sizeof(safe_branch_dispatch_table)/sizeof(branch_dispatch)); } ignore_safepoint() { update_table(unsafe_branch_dispatch_table, unsafe_branch_dispatch_table + sizeof(unsafe_branch_dispatch_table)/sizeof(branch_dispatch)); } Finally, can someone enlighten me as to the purpose of INCR_INVOCATION_COUNT, if we can get rid of that we could get rid of all backedge checks. Regards, Ed. -- IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/zero-dev/attachments/20090218/3cf42a48/attachment.html From aph at redhat.com Wed Feb 18 06:16:23 2009 From: aph at redhat.com (Andrew Haley) Date: Wed, 18 Feb 2009 14:16:23 +0000 Subject: backedge checks In-Reply-To: <757128096FE8CD4B9EC01C3B29CFB8431251E6@ZIPPY.Emea.Arm.com> References: <757128096FE8CD4B9EC01C3B29CFB8431251E6@ZIPPY.Emea.Arm.com> Message-ID: <499C1837.2070509@redhat.com> Hi Ed, Your mail looked like it went through a rubbish compactor. I reformatted it. 
------------------------------------------------------------------------------------------------------- Hi folks, In bytecodeInterpreter.c is this wonderful macro #define DO_BACKEDGE_CHECKS(skip, branch_pc)\ if ((skip) <= 0) {\ if (UseCompiler && UseLoopCounter) {\ bool do_OSR = UseOnStackReplacement;\ BACKEDGE_COUNT->increment();\ if (do_OSR) do_OSR = BACKEDGE_COUNT->reached_InvocationLimit();\ if (do_OSR) {\ nmethod* osr_nmethod;\ OSR_REQUEST(osr_nmethod, branch_pc);\ if (osr_nmethod != NULL && osr_nmethod->osr_entry_bci() !=InvalidOSREntryBci) { \ intptr_t* buf;\ CALL_VM(buf=SharedRuntime::OSR_migration_begin(THREAD), handle_exception); \ istate->set_msg(do_osr);\ istate->set_osr_buf((address)buf);\ istate->set_osr_entry(osr_nmethod->osr_entry());\ return;\ }\ } else {\ INCR_INVOCATION_COUNT;\ SAFEPOINT;\ }\ } /* UseCompiler ... */\ INCR_INVOCATION_COUNT;\ SAFEPOINT;\ } This macro is invoked in every branch (although the body is only executed on backwards branches). Firstly, is it just me, or does it do INCR_INVOCATION_COUNT; SAFEPOINT twice on every backwards branch. Surely a mistake? Secondly, can someone tell me under what circumstances in zero UseCompiler would ever be true (and no, it doesn't resolve to a constant. Thirdly, it should be possible to avoid the SAFEPOINT checks. SAFEPOINT does... #define SAFEPOINT \ if ( SafepointSynchronize::is_synchronizing()) { \ { \ /* zap freed handles rather than GC'ing them */ \ HandleMarkCleaner __hmc(THREAD); \ } \ CALL_VM(SafepointSynchronize::block(THREAD), handle_exception); \ } However, there is an upcall notice_safepoints() which safepoint.cpp calls whenever it is setting is_synchronizing(). This upcall is there to notify the interpreter that it needs to run to a GC safepoint. A corresponding call ignore_safepoints() is called when the interpreter can safely ignore safepoints. So there should be no need for the interpreter to continually check for safepoints. notice_safepoints() and ignore_safepoints() in the template interpreter do indeed do something sensible. However, in the bytecode Interpreter the implementation is just {} The way it would work is... When compiling gcc bytecodeInterpreter.cpp uses a dispatch table rather than a switch statement. Currently this is defined as const static void* const opclabels_data[256] = { /* 0x00 */ &&opc_nop, &&opc_aconst_null,&&opc_iconst_m1,&&opc_iconst_0, ... }; register uintptr_t *dispatch_table = (uintptr_t*)&opclabels_data[0]; We would change this to const static void* opclabels_data[256] = { /* 0x00 */ &&opc_nop, &&opc_aconst_null,&&opc_iconst_m1,&&opc_iconst_0, ... }; IE. I have just removed the 'const' because we are going to change this dynamically We would then define two sets of handlers for branch instructions struct branch_dispatch { int bytecode; /* The bytecode */ void *handler; /* The handler for it */ }; typedef struct branch_dispatch branch_dispatch; branch_dispatch safe_branch_dispatch_table[] = { { Bytecodes::_ifle,safe_ifle_handler }, { Bytecodes::_ifgt,safe_ifgt_handler }, ... }; branch_dispatch unsafe_branch_dispatch_table[] = { { Bytecodes::_ifle,unsafe_ifle_handler }, { Bytecodes::_ifgt,unsafe_ifgt_handler }, ... 
} notice_safepoints() and ignore_safepoints() then become void update_table(branch_dispatch *p, branch_dispatch *pend) { do { opclabels_data[p->bytecode] = p->handler; } while (p++ < pend); } notice_safepoints() { update_table(safe_branch_dispatch_table, safe_branch_dispatch_table + sizeof(safe_branch_dispatch_table)/sizeof(branch_dispatch)); } ignore_safepoint() { update_table(unsafe_branch_dispatch_table, unsafe_branch_dispatch_table + sizeof(unsafe_branch_dispatch_table)/sizeof(branch_dispatch)); } Finally, can someone enlighten me as to the purpose of INCR_INVOCATION_COUNT, if we can get rid of that we could get rid of all backedge checks. Regards, Ed. -- IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you. From aph at redhat.com Wed Feb 18 09:24:50 2009 From: aph at redhat.com (Andrew Haley) Date: Wed, 18 Feb 2009 17:24:50 +0000 Subject: backedge checks In-Reply-To: <499C1837.2070509@redhat.com> References: <757128096FE8CD4B9EC01C3B29CFB8431251E6@ZIPPY.Emea.Arm.com> <499C1837.2070509@redhat.com> Message-ID: <499C4462.4080809@redhat.com> > Hi folks, > > In bytecodeInterpreter.c is this wonderful macro > > #define DO_BACKEDGE_CHECKS(skip, branch_pc)\ > if ((skip) <= 0) {\ > if (UseCompiler && UseLoopCounter) {\ > bool do_OSR = UseOnStackReplacement;\ > BACKEDGE_COUNT->increment();\ > if (do_OSR) do_OSR = BACKEDGE_COUNT->reached_InvocationLimit();\ > if (do_OSR) {\ > nmethod* osr_nmethod;\ > OSR_REQUEST(osr_nmethod, branch_pc);\ > if (osr_nmethod != NULL && osr_nmethod->osr_entry_bci() !=InvalidOSREntryBci) { \ > intptr_t* buf;\ > CALL_VM(buf=SharedRuntime::OSR_migration_begin(THREAD), handle_exception); \ > istate->set_msg(do_osr);\ > istate->set_osr_buf((address)buf);\ > istate->set_osr_entry(osr_nmethod->osr_entry());\ > return;\ > }\ > } else {\ > INCR_INVOCATION_COUNT;\ > SAFEPOINT;\ > }\ > } /* UseCompiler ... */\ > INCR_INVOCATION_COUNT;\ > SAFEPOINT;\ > } > > > This macro is invoked in every branch (although the body is only > executed on backwards branches). > > Firstly, is it just me, or does it do INCR_INVOCATION_COUNT; SAFEPOINT > twice on every backwards branch. Surely a mistake? It looks like that to me. > Secondly, can someone tell me under what circumstances in zero > UseCompiler would ever be true (and no, it doesn't resolve to a > constant. This stuff is AFAIK used to decide when to compile a method with Shark. > Thirdly, it should be possible to avoid the SAFEPOINT checks. SAFEPOINT > does... > > #define SAFEPOINT \ > if ( SafepointSynchronize::is_synchronizing()) { \ > { \ > /* zap freed handles rather than GC'ing them */ \ > HandleMarkCleaner __hmc(THREAD); \ > } \ > CALL_VM(SafepointSynchronize::block(THREAD), handle_exception); \ > } > > > However, there is an upcall notice_safepoints() which safepoint.cpp > calls whenever it is setting is_synchronizing(). This upcall is there to > notify the interpreter that it needs to run to a GC safepoint. A > corresponding call ignore_safepoints() is called when the interpreter > can safely ignore safepoints. So there should be no need for the > interpreter to continually check for safepoints. > > notice_safepoints() and ignore_safepoints() in the template interpreter > do indeed do something sensible. 
However, in the bytecode Interpreter > the implementation is just {} > > The way it would work is... > > When compiling gcc bytecodeInterpreter.cpp uses a dispatch table rather > than a switch statement. Currently this is defined as > > const static void* const opclabels_data[256] = { > /* 0x00 */ &&opc_nop, > &&opc_aconst_null,&&opc_iconst_m1,&&opc_iconst_0, > ... > }; > register uintptr_t *dispatch_table = (uintptr_t*)&opclabels_data[0]; > > We would change this to > > const static void* opclabels_data[256] = { > /* 0x00 */ &&opc_nop, > &&opc_aconst_null,&&opc_iconst_m1,&&opc_iconst_0, > ... > }; > > IE. I have just removed the 'const' because we are going to change this > dynamically > We would then define two sets of handlers for branch instructions > > struct branch_dispatch { > int bytecode; /* The bytecode */ > void *handler; /* The handler for it */ > }; > typedef struct branch_dispatch branch_dispatch; > > branch_dispatch safe_branch_dispatch_table[] = { > { Bytecodes::_ifle,safe_ifle_handler }, > { Bytecodes::_ifgt,safe_ifgt_handler }, > ... > }; > > branch_dispatch unsafe_branch_dispatch_table[] = { > { Bytecodes::_ifle,unsafe_ifle_handler }, > { Bytecodes::_ifgt,unsafe_ifgt_handler }, > ... > } > > notice_safepoints() and ignore_safepoints() then become > > void update_table(branch_dispatch *p, branch_dispatch *pend) > { > do { opclabels_data[p->bytecode] = p->handler; } while (p++ < pend); > } > > notice_safepoints() > { > update_table(safe_branch_dispatch_table, > safe_branch_dispatch_table + > sizeof(safe_branch_dispatch_table)/sizeof(branch_dispatch)); > } > > ignore_safepoint() > { > update_table(unsafe_branch_dispatch_table, > unsafe_branch_dispatch_table + > sizeof(unsafe_branch_dispatch_table)/sizeof(branch_dispatch)); > } Hmm, this doesn't look like it's thread safe to me. It would make more sense to have a pointer to the branch dispatch table that's updated atomically. > Finally, can someone enlighten me as to the purpose of > INCR_INVOCATION_COUNT, if we can get rid of that we could get rid of all > backedge checks. Surely the compile broker uses that count. Andrew. From gbenson at redhat.com Thu Feb 19 06:38:01 2009 From: gbenson at redhat.com (Gary Benson) Date: Thu, 19 Feb 2009 14:38:01 +0000 Subject: Improving the performance of OpenJDK In-Reply-To: <20090218110851.GD3213@redhat.com> References: <757128096FE8CD4B9EC01C3B29CFB8431251CD@ZIPPY.Emea.Arm.com> <20090218104335.GB3213@redhat.com> <499BE9A7.3020405@redhat.com> <20090218110851.GD3213@redhat.com> Message-ID: <20090219143801.GB31986@redhat.com> Gary Benson wrote: > Andrew Haley wrote: > > Gary Benson wrote: > > > Edward Nevill wrote: > > > > get_native_u2() and get_Java_u2() ... This seems to be a > > > > misguided attempt of the original authors to optimised reading > > > > of halfwords (judging by the comment immediate preceding the > > > > code). > > > > > > It's not an optimization, it's to do unaligned access on > > > hardware that doesn't support it. I'm guessing ARM does allow > > > unaligned access by the fact that your code didn't segfault > > > instantly ;) > > > > ARM doesn't support unaligned loads. The new ARM code as posted > > is > > > > ldrsb r0, [java_pc, #0] > > ldrb r1, [java_pc, #1] > > orr r1, r1, r0, lsl #8 > > > > i.e two byte loads. > > Ah, I didn't realise. Which is good, it means this optimization is > generic :) So I did a quick SPECjvm98 run of this change on amd64. A couple of the times are improved, by 3-10%, but one of the times is *slower* by 11%. See attached. 
I'm not sure what to make of that... Cheers, Gary -- http://gbenson.net/ -------------- next part -------------- ,times/zero-92c4cc753f06,times/zero-92c4cc753f06-u2 compress,231.02,255.489 jess,70.363,64.099 db,116.143,112.242 javac,72.624,69.744 mpegaudio,229.505,226.654 mtrt,59.302,59.274 jack,49.195,44.505 -------------- next part -------------- A non-text attachment was scrubbed... Name: result.png Type: image/png Size: 2155 bytes Desc: not available Url : http://mail.openjdk.java.net/pipermail/zero-dev/attachments/20090219/15362c33/attachment.png From gbenson at redhat.com Thu Feb 19 07:02:56 2009 From: gbenson at redhat.com (Gary Benson) Date: Thu, 19 Feb 2009 15:02:56 +0000 Subject: Improving the performance of OpenJDK In-Reply-To: <20090219143801.GB31986@redhat.com> References: <757128096FE8CD4B9EC01C3B29CFB8431251CD@ZIPPY.Emea.Arm.com> <20090218104335.GB3213@redhat.com> <499BE9A7.3020405@redhat.com> <20090218110851.GD3213@redhat.com> <20090219143801.GB31986@redhat.com> Message-ID: <20090219150256.GC31986@redhat.com> Gary Benson wrote: ... > > > > Edward Nevill wrote: > > > > > get_native_u2() and get_Java_u2() ... This seems to be a > > > > > misguided attempt of the original authors to optimised > > > > > reading of halfwords (judging by the comment immediate > > > > > preceding the code). ... > So I did a quick SPECjvm98 run of this change on amd64. A couple of > the times are improved, by 3-10%, but one of the times is *slower* > by 11%. See attached. I'm not sure what to make of that... I'm wondering if rewriting get_Java_u2() to read directly rather than read and swap is speeding it up, but removing the optimization from get_native_u2() is slowing it down. I'm going to try this with the original get_native_u2() and with get_Java_u2() just a copy of the big-endian get_native_u2(). Cheers, Gary -- http://gbenson.net/ From ed at camswl.com Thu Feb 19 11:27:06 2009 From: ed at camswl.com (Edward Nevill) Date: Thu, 19 Feb 2009 19:27:06 GMT Subject: backedge checks Message-ID: <200902191927.n1JJR6H9003614@parsley.camswl.com> Hi, Apologies for the legal notices that appeared in my previous emails. I am now doing this from my home machine over a wet piece of string. Email address is now either ed at camswl.com or ed at starksfield.org (both resolve to the same machine). >> INCR_INVOCATION_COUNT;\ >> SAFEPOINT;\ >> }\ >> } /* UseCompiler ... */\ >> INCR_INVOCATION_COUNT;\ >> SAFEPOINT;\ >> } >> >> >> This macro is invoked in every branch (although the body is only >> executed on backwards branches). >> >> Firstly, is it just me, or does it do INCR_INVOCATION_COUNT; SAFEPOINT >> twice on every backwards branch. Surely a mistake? > >It looks like that to me. OK. Can someone with committal rights remove the first instance. >> Secondly, can someone tell me under what circumstances in zero >> UseCompiler would ever be true (and no, it doesn't resolve to a >> constant. > >This stuff is AFAIK used to decide when to compile a method with Shark. OK. So Shark is using the bytecode interpreter rather than the template interpreter? Is this correct Gary? In that case it is even more important we do something about the dire performance. Regardless, when we are building zero (interpreted) this should be defined to be constant false. At the moment it is a global. Aside: In general access to globals is very expensive (more expensive than you might believe in a PIC world). If we need to do checks on globals can we at least do.. { type *UseCompiler_ptr = &UseCompiler; ... if (*UseCompiler_ptr) { ... 
} } This is partly doing the compiler's work for it, but sometimes it needs help. >> notice_safepoints() >> { >> update_table(safe_branch_dispatch_table, >> safe_branch_dispatch_table + >> sizeof(safe_branch_dispatch_table)/sizeof(branch_dispatch)); >> } >> >> ignore_safepoint() >> { >> update_table(unsafe_branch_dispatch_table, >> unsafe_branch_dispatch_table + >> sizeof(unsafe_branch_dispatch_table)/sizeof(branch_dispatch)); >> } > Hmm, this doesn't look like it's thread safe to me. It would make more > sense to have a pointer to the branch dispatch table that's updated > atomically. There is a problem with updating the pointer to the table rather than the contents of the table. Hopefully, the pointer to the table is residing in an ARM register (or MIPS or ...). register uintptr_t *dispatch_table = ... In order to update the pointer we would first of all need to remove the register specifier, then we need to make it a static global, otherwise we can't access it in 'notice_safepoints'. For good measure we need to declare it volatile to make sure the compiler can't do anything clever. Yeuch (remember this pointer is accessed for every single bytecode). I believe the above is thread safe. In 'notice_safepoints' the VM state is 'synchronizing'. On return from 'notice_safepoints' the VM state is changed to 'synchronized' (provided this was the last thread to synchronise). Until the VM state is 'synchronized', the VM cannot make any assumptions as to whether the interpreter is holding pointers to objects in registers or local storage. Therefore unsafe operations such as GC are prohibited until the VM state is 'synchronized'. During execution of 'notice_safepoints' half the handlers may point to safe handlers, half may point to unsafe handlers. However, this will not cause any problems. The safe and unsafe handlers are compatible in operation; it's just that the unsafe handlers don't call 'SafepointSynchronize::block(THREAD)'. Now assume there is a pre-emptive thread swap while in the middle of 'notice_safepoints'. The new thread then tries an unsafe operation (eg a GC request) which results in 'notice_safepoints' being called again. The result is that the table may end up getting updated twice. In the really paranoid case every single thread may be sitting in 'notice_safepoints'. notice_safepoints() { /* <- thread swap occurs here */ } This is no different from the case where 'notice_safepoints' actually updates the table. The VM is still not synchronised, and we have to either wait until control returns or until someone else calls notice_safepoints. Remember the state is not set to synchronised until all threads have synchronised. On exit, before calling ignore_safepoints the VM is placed in an unsynchronised state. Therefore it doesn't matter whether we call the safe or unsafe branch handler. The safe branch handler simply does an extra check on 'is_synchronizing' which returns false. The only thing that could really ruin your day is if word write is not atomic, and I think there is a lot of code that will break if word write is not atomic. >> Finally, can someone enlighten me as to the purpose of >> INCR_INVOCATION_COUNT, if we can get rid of that we could get rid of all >> backedge checks. > Surely the compile broker uses that count. Yes, is the compiler the only thing that is using this? Is it used for profiling, or ...
We need to move to a world where we are not doing all these redundant checks with all the attendant crud when the check trips to a world where we have a simple VM which just handles the common default case. If someone does require all this crud then we simply direct them to BytecodeInterpreter::run_with_all_unnecessary_crud(..) Regards, Ed From gbenson at redhat.com Thu Feb 19 07:49:17 2009 From: gbenson at redhat.com (Gary Benson) Date: Thu, 19 Feb 2009 15:49:17 +0000 Subject: backedge checks In-Reply-To: <499C1837.2070509@redhat.com> References: <757128096FE8CD4B9EC01C3B29CFB8431251E6@ZIPPY.Emea.Arm.com> <499C1837.2070509@redhat.com> Message-ID: <20090219154917.GD31986@redhat.com> Edward Nevill wrote: > In bytecodeInterpreter.c is this wonderful macro > > #define DO_BACKEDGE_CHECKS(skip, branch_pc)\ > if ((skip) <= 0) {\ > if (UseCompiler && UseLoopCounter) {\ > bool do_OSR = UseOnStackReplacement;\ > BACKEDGE_COUNT->increment();\ > if (do_OSR) do_OSR = BACKEDGE_COUNT->reached_InvocationLimit();\ > if (do_OSR) {\ > nmethod* osr_nmethod;\ > OSR_REQUEST(osr_nmethod, branch_pc);\ > if (osr_nmethod != NULL && osr_nmethod->osr_entry_bci() !=InvalidOSREntryBci) { \ > intptr_t* buf;\ > CALL_VM(buf=SharedRuntime::OSR_migration_begin(THREAD), handle_exception); \ > istate->set_msg(do_osr);\ > istate->set_osr_buf((address)buf);\ > istate->set_osr_entry(osr_nmethod->osr_entry());\ > return;\ > }\ > } else {\ > INCR_INVOCATION_COUNT;\ > SAFEPOINT;\ > }\ > } /* UseCompiler ... */\ > INCR_INVOCATION_COUNT;\ > SAFEPOINT;\ > } > > > This macro is invoked in every branch (although the body is only > executed on backwards branches). > > Firstly, is it just me, or does it do INCR_INVOCATION_COUNT; > SAFEPOINT twice on every backwards branch. Surely a mistake? It looks that way. I'll take it out... > Secondly, can someone tell me under what circumstances in zero > UseCompiler would ever be true (and no, it doesn't resolve to a > constant. It's true whenever you're using Shark. Shark doesn't do OSR at the moment though, so most of that macro is unentered. > Thirdly, it should be possible to avoid the SAFEPOINT > checks. SAFEPOINT does... > > #define SAFEPOINT \ > if ( SafepointSynchronize::is_synchronizing()) { \ > { \ > /* zap freed handles rather than GC'ing them */ \ > HandleMarkCleaner __hmc(THREAD); \ > } \ > CALL_VM(SafepointSynchronize::block(THREAD), handle_exception); \ > } > > However, there is an upcall notice_safepoints() which safepoint.cpp > calls whenever it is setting is_synchronizing(). This upcall is > there to notify the interpreter that it needs to run to a GC > safepoint. A corresponding call ignore_safepoints() is called when > the interpreter can safely ignore safepoints. So there should be no > need for the interpreter to continually check for safepoints. > > notice_safepoints() and ignore_safepoints() in the template > interpreter do indeed do something sensible. However, in the > bytecode Interpreter the implementation is just {} Sadly the one person who could answer that question categorically is no longer with us, so it'll be trial and error. If you make a patch I can run it past the TCK for you, if you like... > Finally, can someone enlighten me as to the purpose of > INCR_INVOCATION_COUNT, if we can get rid of that we could get rid of > all backedge checks. It's how methods are selected for compilation, when you're running with a JIT. Methods are considered for compilation when their invocation count passes a certain threshold. 
Cheers, Gary -- http://gbenson.net/ From aph at redhat.com Thu Feb 19 08:14:55 2009 From: aph at redhat.com (Andrew Haley) Date: Thu, 19 Feb 2009 16:14:55 +0000 Subject: backedge checks In-Reply-To: <200902191927.n1JJR6H9003614@parsley.camswl.com> References: <200902191927.n1JJR6H9003614@parsley.camswl.com> Message-ID: <499D857F.5090905@redhat.com> Edward Nevill wrote: > Apologies for the legal notices that appeared in my previous emails. I am now doing this > from my home machine over a wet piece of string. Email address is now either ed at camswl.com > or ed at starksfield.org (both resolve to the same machine). OK. >>> INCR_INVOCATION_COUNT;\ >>> SAFEPOINT;\ >>> }\ >>> } /* UseCompiler ... */\ >>> INCR_INVOCATION_COUNT;\ >>> SAFEPOINT;\ >>> } >>> >>> >>> This macro is invoked in every branch (although the body is only >>> executed on backwards branches). >>> >>> Firstly, is it just me, or does it do INCR_INVOCATION_COUNT; SAFEPOINT >>> twice on every backwards branch. Surely a mistake? >> It looks like that to me. > > OK. Can someone with committal rights remove the first instance. >>> Secondly, can someone tell me under what circumstances in zero >>> UseCompiler would ever be true (and no, it doesn't resolve to a >>> constant. >> This stuff is AFAIK used to decide when to compile a method with Shark. > > OK. So Shark is using the bytecode interpreter rather than the template interpreter? > Is this correct Gary? > In that case it is even more important we do something about the dire performance. > > Regardless, when we are building zero (interpreted) this should be defined to be constant > false. At the moment it is a global. In purely interpreted mode, yes. > This is no different from the case where 'notice_safepoints' actually updates the table. > The VM is still not synchronised, and we have to either wait until control returns or > until someone else calls notice_safepoints. Remember the state is not set to synchronised > until all threads have synchronised. > > On exit, before calling ignore_safepoints the VM is places in an unsynchronised state. > Therefore it doesn't matter whether we call the safe or unsafe branch handler. The > safe branch handler simply does an extra check on 'is_synchronizing' which returns false. > > The only thing that could really ruin your day is if word write is not atomic, and I think > there is a lot of code that will break is word write is not atomic. Yes, right. This sounds like it'll work. I admit that this sounds wrong, when it could be done simply by switching a single pointer, but a word read from memory to get the address of the table on every bytecode isn't such a great idea either. >>> Finally, can someone enlighten me as to the purpose of >>> INCR_INVOCATION_COUNT, if we can get rid of that we could get rid of >> all >>> backedge checks. >> Surely the compile broker uses that count. > > Yes, is the compiler the only thing that is using this? Is it used for profiling, or ... > > We need to move to a world where we are not doing all these redundant checks with all the > attendant crud when the check trips to a world where we have a simple VM which just > handles the common default case. > > If someone does require all this crud then we simply direct them to > > BytecodeInterpreter::run_with_all_unnecessary_crud(..) Yes, but it would surely be nice to have a decent fast interpreter when we're running with Shark. There ought to be some reasonable compromise. Andrew. 
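As a rough illustration of the two alternatives weighed in this exchange -- patching the branch entries of a writable dispatch table in place (Ed's scheme) versus atomically swapping a single table pointer (Andrew's suggestion) -- here is a minimal sketch in modern C++. The names (opclabels_data, safe_table, unsafe_table, dispatch_table_ptr) and the use of std::atomic are illustrative assumptions, not the actual HotSpot declarations; the real interpreter dispatches through computed-goto labels inside BytecodeInterpreter::run().

    // Sketch only: illustrative names, not the actual HotSpot code.
    #include <atomic>

    typedef void* handler_t;                   // stands in for a computed-goto label address

    static handler_t opclabels_data[256];      // Ed's scheme: a writable table whose entries are patched
    static handler_t safe_table[256];          // branch handlers that perform the SAFEPOINT check
    static handler_t unsafe_table[256];        // branch handlers without the check

    // Ed's proposal: notice_safepoints()/ignore_safepoints() rewrite only the
    // branch-bytecode slots of the live table, relying on word stores being atomic.
    static void patch_entry(int bytecode, handler_t handler) {
      opclabels_data[bytecode] = handler;
    }

    // Andrew's alternative: keep both tables immutable and swap one pointer.
    static std::atomic<handler_t*> dispatch_table_ptr{unsafe_table};

    static void notice_safepoints_alt() { dispatch_table_ptr.store(safe_table, std::memory_order_release); }
    static void ignore_safepoints_alt() { dispatch_table_ptr.store(unsafe_table, std::memory_order_release); }

    // The cost Andrew points out: the dispatch loop must re-load the pointer on
    // every bytecode, e.g.
    //   handler_t* table = dispatch_table_ptr.load(std::memory_order_acquire);
    //   goto *table[*pc];                     // GCC computed goto, as in bytecodeInterpreter.cpp

The pointer swap is trivially thread safe but adds one load per dispatched bytecode; in-place patching keeps the table address in a register but relies on aligned word stores being atomic and visible, which is the point contested below.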
From gbenson at redhat.com Thu Feb 19 08:34:47 2009 From: gbenson at redhat.com (Gary Benson) Date: Thu, 19 Feb 2009 16:34:47 +0000 Subject: Improving the performance of OpenJDK In-Reply-To: <20090219150256.GC31986@redhat.com> References: <757128096FE8CD4B9EC01C3B29CFB8431251CD@ZIPPY.Emea.Arm.com> <20090218104335.GB3213@redhat.com> <499BE9A7.3020405@redhat.com> <20090218110851.GD3213@redhat.com> <20090219143801.GB31986@redhat.com> <20090219150256.GC31986@redhat.com> Message-ID: <20090219163447.GA30223@redhat.com> Gary Benson wrote: > Gary Benson wrote: > ... > > > > > Edward Nevill wrote: > > > > > > get_native_u2() and get_Java_u2() ... This seems to be a > > > > > > misguided attempt of the original authors to optimised > > > > > > reading of halfwords (judging by the comment immediate > > > > > > preceding the code). > ... > > So I did a quick SPECjvm98 run of this change on amd64. A couple > > of the times are improved, by 3-10%, but one of the times is > > *slower* by 11%. See attached. I'm not sure what to make of > > that... > > I'm wondering if rewriting get_Java_u2() to read directly rather > than read and swap is speeding it up, but removing the optimization > from get_native_u2() is slowing it down. I'm going to try this with > the original get_native_u2() and with get_Java_u2() just a copy of > the big-endian get_native_u2(). So this way around it's a little more encouraging; two of the times are 8-10% faster, one is 5% faster. Some of the times are still slower though, though not by as much, maybe 1-2%. It's still disturbingly ambiguous though... thoughts? Cheers, Gary -- http://gbenson.net/ From aph at redhat.com Thu Feb 19 08:40:10 2009 From: aph at redhat.com (Andrew Haley) Date: Thu, 19 Feb 2009 16:40:10 +0000 Subject: backedge checks In-Reply-To: <200902191927.n1JJR6H9003614@parsley.camswl.com> References: <200902191927.n1JJR6H9003614@parsley.camswl.com> Message-ID: <499D8B6A.80809@redhat.com> Edward Nevill wrote: >> Hmm, this doesn't look like it's thread safe to me. It would make more >> sense to have a pointer to the branch dispatch table that's updated >> atomically. > > There is a problem with updating the pointer to the table rather than the contents > of the table. > > Hopefully, the pointer to the table is residing in an ARM register (or MIPS or ...). > > register uintptr_t *dispatch_table = ... > > In order to update the pointer we would first of all need to remove the register specifier, > then we need to may it a static global, otherwise we cant access it in 'notice_safepoints'. > For good measure we need to declare it volatile to make sure the compiler cant do anything > clever. Yeuch (remember this pointer is accessed for every single bytecode). > > I believe the above is thread safe. > > In 'notice_safepoints' the VM state is 'synchronizing'. On return from 'notice_safepoints' > the VM state is changed to 'synchronized' (provided this was the last thread to synchronise). > > Until the VM state is 'synchronized', the VM cannot make any assumptions as to whether > the interpreter is holding pointer to objects in registers or local storage. > > Therefore unsafe operations such as GC are prohibited until the VM state is 'synchronized'. > > During execution of 'notice_safepoints' half the handlers may point to safe handlers, half > may point to unsafe handlers. However, this will not cause any problems. The safe and > unsafe handlers are compatible in operation, its just the unsafe handlers dont call > 'SafepointSynchronize::block(THREAD)'. 
> > Now assume there is a pre-emptive thread swap while in the middle of 'notice_safepoints'. > The new thread then tries an unsafe operation (eg a GC request) which results in > 'notice_safepoints' being called again. > > The result is that the table may end up getting updated twice. In the really paranoid case > every single thread may be sitting in 'notice_safepoints'. > > notice_safepoints() { > /* <- thread swap occurs here */ > } > > This is no different from the case where 'notice_safepoints' actually updates the table. > The VM is still not synchronised, and we have to either wait until control returns or > until someone else calls notice_safepoints. Remember the state is not set to synchronised > until all threads have synchronised. > > On exit, before calling ignore_safepoints the VM is places in an unsynchronised state. > Therefore it doesn't matter whether we call the safe or unsafe branch handler. The > safe branch handler simply does an extra check on 'is_synchronizing' which returns false. > > The only thing that could really ruin your day is if word write is not atomic, and I think > there is a lot of code that will break is word write is not atomic. I knew this wasn't right, and I just realized why. Every processor has its own data cache, and these caches may not be coherent. You could potentially wait for ever before a processor would read the changed dispatch table. I think this could probably be solved by sending a signal to every thread that is supposed to move to a safepoint. If there's any additional processing needed the signal handler can do it. Also, the thread updating the table would have to flush its own caches/write buffers/etc. Finally, unless the table itself is marked as volatile there's no guarantee that it would be written before the flush. Andrew. From aph at redhat.com Thu Feb 19 08:47:55 2009 From: aph at redhat.com (Andrew Haley) Date: Thu, 19 Feb 2009 16:47:55 +0000 Subject: Improving the performance of OpenJDK In-Reply-To: <20090219163447.GA30223@redhat.com> References: <757128096FE8CD4B9EC01C3B29CFB8431251CD@ZIPPY.Emea.Arm.com> <20090218104335.GB3213@redhat.com> <499BE9A7.3020405@redhat.com> <20090218110851.GD3213@redhat.com> <20090219143801.GB31986@redhat.com> <20090219150256.GC31986@redhat.com> <20090219163447.GA30223@redhat.com> Message-ID: <499D8D3B.3010302@redhat.com> Gary Benson wrote: > Gary Benson wrote: >> Gary Benson wrote: >> ... >>>>>> Edward Nevill wrote: >>>>>>> get_native_u2() and get_Java_u2() ... This seems to be a >>>>>>> misguided attempt of the original authors to optimised >>>>>>> reading of halfwords (judging by the comment immediate >>>>>>> preceding the code). >> ... >>> So I did a quick SPECjvm98 run of this change on amd64. A couple >>> of the times are improved, by 3-10%, but one of the times is >>> *slower* by 11%. See attached. I'm not sure what to make of >>> that... >> I'm wondering if rewriting get_Java_u2() to read directly rather >> than read and swap is speeding it up, but removing the optimization >> from get_native_u2() is slowing it down. I'm going to try this with >> the original get_native_u2() and with get_Java_u2() just a copy of >> the big-endian get_native_u2(). > > So this way around it's a little more encouraging; two of the times > are 8-10% faster, one is 5% faster. Some of the times are still > slower though, though not by as much, maybe 1-2%. It's still > disturbingly ambiguous though... thoughts? Big test runs like SPECjvm98 are too coarse for this kind of thing. 
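(To make the "too coarse" point concrete, here is a self-contained sketch of the sort of focused test that could isolate just this change: timing a byte-at-a-time big-endian 16-bit read against a native load plus byte swap over deliberately unaligned offsets. The helpers below are illustrative stand-ins, not the actual get_Java_u2()/get_native_u2() from HotSpot, and the absolute numbers would be compiler- and machine-specific.)

    // Illustrative microbenchmark only; not the HotSpot helpers.
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <cstring>
    #include <vector>

    static inline uint16_t read_u2_bytewise(const uint8_t* p) {
      return (uint16_t)((p[0] << 8) | p[1]);            // big-endian result, two byte loads
    }

    static inline uint16_t read_u2_swap(const uint8_t* p) {
      uint16_t v;
      std::memcpy(&v, p, sizeof v);                     // native (possibly unaligned) halfword load
      return (uint16_t)((v >> 8) | (v << 8));           // byte swap, assuming a little-endian host
    }

    template <uint16_t (*Read)(const uint8_t*)>
    static uint64_t run(const std::vector<uint8_t>& buf, int iters) {
      uint64_t sum = 0;
      for (int i = 0; i < iters; i++)
        for (size_t off = 0; off + 2 <= buf.size(); off++)   // hits odd (unaligned) offsets on purpose
          sum += Read(&buf[off]);
      return sum;
    }

    int main() {
      std::vector<uint8_t> buf(4096);
      for (size_t i = 0; i < buf.size(); i++) buf[i] = (uint8_t)i;
      for (int pass = 0; pass < 2; pass++) {
        auto t0 = std::chrono::steady_clock::now();
        uint64_t s = pass ? run<read_u2_swap>(buf, 20000) : run<read_u2_bytewise>(buf, 20000);
        auto t1 = std::chrono::steady_clock::now();
        long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        std::printf("%s: %lld ms (checksum %llu)\n",
                    pass ? "read-and-swap" : "byte-wise", ms, (unsigned long long)s);
      }
      return 0;
    }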
It's pretty unbelievable that with the original get_native_u2() and the new, definitely faster, get_Java_u2() anything can actually get slower. I'd bet you're looking at noise. 1-2% is very hard to measure reliably on a multi-user system. Andrew. From gbenson at redhat.com Thu Feb 19 09:43:46 2009 From: gbenson at redhat.com (Gary Benson) Date: Thu, 19 Feb 2009 17:43:46 +0000 Subject: Improving the performance of OpenJDK In-Reply-To: <499D8D3B.3010302@redhat.com> References: <757128096FE8CD4B9EC01C3B29CFB8431251CD@ZIPPY.Emea.Arm.com> <20090218104335.GB3213@redhat.com> <499BE9A7.3020405@redhat.com> <20090218110851.GD3213@redhat.com> <20090219143801.GB31986@redhat.com> <20090219150256.GC31986@redhat.com> <20090219163447.GA30223@redhat.com> <499D8D3B.3010302@redhat.com> Message-ID: <20090219174346.GC30223@redhat.com> Andrew Haley wrote: > Gary Benson wrote: > > So this way around it's a little more encouraging; two of the > > times are 8-10% faster, one is 5% faster. Some of the times are > > still slower though, though not by as much, maybe 1-2%. It's > > still disturbingly ambiguous though... thoughts? > > Big test runs like SPECjvm98 are too coarse for this kind of thing. Is there something else I could try instead? I was thinking about trying it with DaCapo tomorrow, that's a little more real-world IMO. > It's pretty unbelievable that with the original get_native_u2() and > the new, definitely faster, get_Java_u2() anything can actually get > slower. I'd bet you're looking at noise. 1-2% is very hard to > measure reliably on a multi-user system. True. I'm just wary of declaring it "definitely faster" if, when measured, it isn't obviously so.
It may be that the optimization of > doing aligned loads 50% of the time and then swapping is actually > faster than reading the bytes individually... Hard to say. The real way to find out is to write a test that *only* does this thing. Andrew. From ed at camswl.com Thu Feb 19 16:51:05 2009 From: ed at camswl.com (Edward Nevill) Date: Fri, 20 Feb 2009 00:51:05 GMT Subject: Profiling Message-ID: <200902200051.n1K0p5YK004736@parsley.camswl.com> Hi folks, I have been doing some profing on the vm using gprof with some interesting results. To do this I set LINK_INTO=AOUT to generate a static 'gamma' launcher with the appropriate -pg flags set. However, I have not been able to do this for the other libraries (eg libjava.so) only for libjvm.so so we do not get any profiling information on the native libraries which is what I really wanted. The problem is that unlike libjvm.so libjava.so and friends are not linked in, instead they are opened and referenced using dll_open and friends. So I cannot just statically link in libjava.so. Does anyone know of a way to get gprof information out of libjava.so. Anyway, on to the results I got. The benchmarks I used were CaffeineMark, EEMBC and ThinkFreeOffic (a pure Java office suite). I ran each benchmark with the original build, the split interpreter loop (in C), and the split interpreter loop with the split recoded in ARM asm. The complete results are available here http://camswl.com/openjdk/profiles/flat_ecm_original http://camswl.com/openjdk/profiles/flat_ecm_opt http://camswl.com/openjdk/profiles/flat_ecm_asm http://camswl.com/openjdk/profiles/flat_eembc_original http://camswl.com/openjdk/profiles/flat_eembc_opt http://camswl.com/openjdk/profiles/flat_eembc_asm http://camswl.com/openjdk/profiles/flat_office_original http://camswl.com/openjdk/profiles/flat_office_opt http://camswl.com/openjdk/profiles/flat_office_asm Here is a summary, with some discussion. The main point I think is that we are wasting our time optimising the VM for 'semi real world' applications like Think Free Office. That doesn't mean it is not worthwhile impriving the VM, but there are classes of application for which it will make little/no difference. --- CaffeineMark --- Original % cumulative self self total time seconds seconds calls s/call s/call name 84.12 8.53 8.53 2033574 0.00 0.00 BytecodeInterpreter::run(BytecodeInterpreter*) 2.27 8.76 0.23 486 0.00 0.00 CppInterpreter::main_loop(int, Thread*) ----------------------------- As expected almost all the time spent in 'run' --- CaffeineMark --- Split interpreter loop (run_opt in C) % cumulative self self total time seconds seconds calls s/call s/call name 76.80 7.47 7.47 2093698 0.00 0.00 BytecodeInterpreter::run_opt(BytecodeInterpreter*) 4.63 7.92 0.45 1872003 0.00 0.00 BytecodeInterpreter::run(BytecodeInterpreter*) 4.01 8.31 0.39 489 0.00 0.00 CppInterpreter::main_loop(int, Thread*) ----------------------------- OK, this looks good. 76% of the time is now spent in the simple interpreter loop, ready for optimisation. Note: you will notice that the time in seconds remains approx the same in CaffeineMark because CaffeineMark runs for a fixed time rather that a fixed number of iterations. 
--- CaffeineMark --- Asm interpreter loop (run_opt in ARM asm) % cumulative self self total time seconds seconds calls ms/call ms/call name 77.15 7.70 7.70 BytecodeInterpreter::run_opt(BytecodeInterpreter*) 5.71 8.27 0.57 2136095 0.00 0.00 BytecodeInterpreter::run(BytecodeInterpreter*) 2.51 8.52 0.25 490 0.51 1.65 CppInterpreter::main_loop(int, Thread*) ------------------------------ Beginning to flatten somewhat, but still mainly in 'run_opt'. Still plenty of scope for optimisation. Note: The Caffeinemark score for asm was approx 1.9X original. Now, lookup at EEMBC --- EEMBC --- Original % cumulative self self total time seconds seconds calls s/call s/call name 59.07 73.29 73.29 100435660 0.00 0.00 BytecodeInterpreter::run(BytecodeInterpreter*) 11.29 87.29 14.01 1850 0.01 0.02 CppInterpreter::main_loop(int, Thread*) 4.55 92.94 5.65 47872597 0.00 0.00 methodOopDesc::result_type() const 3.83 97.69 4.75 47882070 0.00 0.00 SignatureIterator::iterate_returntype() 3.32 101.82 4.13 47096159 0.00 0.00 InterpreterFrame::build(ZeroStack*, methodOopDesc*, JavaThread*) 2.80 105.30 3.48 94194346 0.00 0.00 os::vm_page_size() 2.36 108.23 2.93 46326866 0.00 0.00 CppInterpreter::normal_entry(methodOopDesc*, int, Thread*) --------------------------- 60% in 'run', note the relatively large % in 'main_loop'. This is because each invoke and return actually returns (in C) to 'main_loop', the invoke or return is then handled in 'main_loop' which then calls 'run' again. Hence the huge number of calls to 'run'. --- EEMBC --- Split interpreter loop (run_opt in C) % cumulative self self total time seconds seconds calls s/call s/call name 31.33 39.03 39.03 113138096 0.00 0.00 BytecodeInterpreter::run_opt(BytecodeInterpreter*) 22.84 67.48 28.45 100435423 0.00 0.00 BytecodeInterpreter::run(BytecodeInterpreter*) 12.78 83.39 15.92 1842 0.01 0.03 CppInterpreter::main_loop(int, Thread*) 5.31 90.00 6.61 47872423 0.00 0.00 methodOopDesc::result_type() const 4.43 95.52 5.52 47096055 0.00 0.00 InterpreterFrame::build(ZeroStack*, methodOopDesc*, JavaThread*) 3.77 100.22 4.70 47881896 0.00 0.00 SignatureIterator::iterate_returntype() 2.99 103.95 3.73 47897743 0.00 0.00 SignatureIterator::parse_type() 2.50 107.07 3.12 94194122 0.00 0.00 os::vm_page_size() 2.20 109.81 2.75 46326782 0.00 0.00 CppInterpreter::normal_entry(methodOopDesc*, int, Thread*) 1.88 112.16 2.35 74343657 0.00 0.00 ConstantPoolCacheEntry::is_resolved(Bytecodes::Code) const ---------------------------- Flattening out with 54% spent in 'run' and 'run_opt'. Note the oddity 'vm_page_size()'. What is making almost 1E8 calls to vm_page_size()??? I don't think it is going to change since the last time they called it. 
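On the vm_page_size() oddity: if the value genuinely cannot change after startup, the repeated out-of-line calls could be paid for once and then become a plain load. A self-contained sketch only, using sysconf() as a stand-in for os::vm_page_size(); where the calls actually come from, and whether the cost is the call itself rather than the work inside it, is an assumption here, not a diagnosis.

#include <unistd.h>

// Stand-in for os::vm_page_size(): an out-of-line query whose result is
// fixed for the life of the process.
static long query_page_size() { return sysconf(_SC_PAGESIZE); }

// Hot-path version: the query happens once; every later call is a load.
static long cached_page_size() {
  static const long page_size = query_page_size();
  return page_size;
}

The same effect can be had by hoisting the call into a local variable at whichever call site the callgraph profile eventually points to.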
--- EEMBC --- Asm interpreter loop (run_opt in ARM asm) % cumulative self self total time seconds seconds calls s/call s/call name 25.74 28.29 28.29 BytecodeInterpreter::run_opt(BytecodeInterpreter*) 25.02 55.77 27.49 100435695 0.00 0.00 BytecodeInterpreter::run(BytecodeInterpreter*) 13.82 70.96 15.19 1842 0.01 0.02 CppInterpreter::main_loop(int, Thread*) 5.48 76.98 6.02 47872566 0.00 0.00 methodOopDesc::result_type() const 4.51 81.93 4.96 47096184 0.00 0.00 InterpreterFrame::build(ZeroStack*, methodOopDesc*, JavaThread*) 4.20 86.54 4.61 47882020 0.00 0.00 SignatureIterator::iterate_returntype() 2.99 89.83 3.29 46326896 0.00 0.00 CppInterpreter::normal_entry(methodOopDesc*, int, Thread*) 2.89 93.01 3.18 94194380 0.00 0.00 os::vm_page_size() 2.17 95.39 2.38 47897875 0.00 0.00 SignatureIterator::parse_type() 1.80 97.37 1.98 74343844 0.00 0.00 ConstantPoolCacheEntry::is_resolved(Bytecodes::Code) const 1.73 99.27 1.90 47884816 0.00 0.00 SignatureIterator::SignatureIterator(symbolHandle) ---------------------------------- OK. So now just 50% in 'run' and 'run_opt' with only 25% in 'run_opt'. This severly limits further optimisation by improving 'run_opt'. Even if we made run_opt go 10 X faster it would only improve performance by about 20%. --- ThinkFreeOffice --- Original % cumulative self self total time seconds seconds calls s/call s/call name 42.12 21.35 21.35 33855747 0.00 0.00 BytecodeInterpreter::run(BytecodeInterpreter*) 9.70 26.27 4.92 34204 0.00 0.00 CppInterpreter::main_loop(int, Thread*) 5.17 28.89 2.62 17783120 0.00 0.00 methodOopDesc::result_type() const 4.54 31.19 2.30 16484931 0.00 0.00 InterpreterFrame::build(ZeroStack*, methodOopDesc*, JavaThread*) 4.48 33.46 2.27 17847099 0.00 0.00 SignatureIterator::iterate_returntype() 2.49 34.72 1.26 15239483 0.00 0.00 CppInterpreter::normal_entry(methodOopDesc*, int, Thread*) 2.41 35.94 1.22 18044932 0.00 0.00 SignatureIterator::parse_type() 2.05 36.98 1.04 33005883 0.00 0.00 os::vm_page_size() 1.47 37.72 0.75 1245317 0.00 0.00 CppInterpreter::native_entry(methodOopDesc*, int, Thread*) 1.44 38.45 0.73 17891344 0.00 0.00 SignatureIterator::SignatureIterator(symbolHandle) 1.09 39.00 0.55 1773458 0.00 0.00 CppInterpreter::accessor_entry(methodOopDesc*, int, Thread*) -------------------------------- Note how relatively flat it is to start with. Only 42% in 'run'. And this is not the whole truth. Think Office is spending a lot of time in native methods (look at the number of calls to 'native_entry') but these are not shown in the profile because I cant link against libjava.so. So the 42% of the time is 42% of the time spent in the JVM itself, not in the whole of openjdk. Look at native_entry. 2.4% of the time spent in native_entry itself but 37 seconds attributable to native_entry & all its descendants. 
--- ThinkFreeOffice --- Split interpreter loop (run_opt in C) % cumulative self self total time seconds seconds calls s/call s/call name 23.24 12.92 12.92 34292092 0.00 0.00 BytecodeInterpreter::run(BytecodeInterpreter*) 18.08 22.96 10.05 33659712 0.00 0.00 BytecodeInterpreter::run_opt(BytecodeInterpreter*) 10.66 28.89 5.93 34512 0.00 0.00 CppInterpreter::main_loop(int, Thread*) 5.90 32.17 3.28 17997457 0.00 0.00 methodOopDesc::result_type() const 4.14 34.47 2.30 16685449 0.00 0.00 InterpreterFrame::build(ZeroStack*, methodOopDesc*, JavaThread*) 4.07 36.73 2.26 18064129 0.00 0.00 SignatureIterator::iterate_returntype() 3.17 38.49 1.76 18260032 0.00 0.00 SignatureIterator::parse_type() 2.23 39.73 1.24 15425660 0.00 0.00 CppInterpreter::normal_entry(methodOopDesc*, int, Thread*) 1.96 40.82 1.09 23829897 0.00 0.00 ConstantPoolCacheEntry::is_resolved(Bytecodes::Code) const 1.75 41.79 0.97 33407124 0.00 0.00 os::vm_page_size() 1.40 42.57 0.78 18106431 0.00 0.00 SignatureIterator::SignatureIterator(symbolHandle) 1.26 43.27 0.70 1259656 0.00 0.00 CppInterpreter::native_entry(methodOopDesc*, int, Thread*) --------------------------------------------------------- Note, interestingly, only 18% of the time is spent in run_opt, with 23% spent in run because of the huge number of invokes and returns (presumably to those native methods). --- ThinkFreeOffice --- Asm interpreter loop (run_opt in ARM asm) % cumulative self self total time seconds seconds calls s/call s/call name 24.38 12.23 12.23 34428228 0.00 0.00 BytecodeInterpreter::run(BytecodeInterpreter*) 14.14 19.32 7.09 BytecodeInterpreter::run_opt(BytecodeInterpreter*) 11.36 25.01 5.70 34395 0.00 0.00 CppInterpreter::main_loop(int, Thread*) 6.08 28.06 3.05 18104834 0.00 0.00 methodOopDesc::result_type() const 4.99 30.56 2.50 18170808 0.00 0.00 SignatureIterator::iterate_returntype() 4.15 32.64 2.08 16765945 0.00 0.00 InterpreterFrame::build(ZeroStack*, methodOopDesc*, JavaThread*) 2.57 33.93 1.29 18369885 0.00 0.00 SignatureIterator::parse_type() 2.53 35.20 1.27 23874942 0.00 0.00 ConstantPoolCacheEntry::is_resolved(Bytecodes::Code) const 2.29 36.35 1.15 33568034 0.00 0.00 os::vm_page_size() 2.05 37.38 1.03 15479805 0.00 0.00 CppInterpreter::normal_entry(methodOopDesc*, int, Thread*) 1.46 38.11 0.73 18214073 0.00 0.00 SignatureIterator::SignatureIterator(symbolHandle) 1.39 38.81 0.70 1286010 0.00 0.00 CppInterpreter::native_entry(methodOopDesc*, int, Thread*) ------------------------------ Not much scope for improvement with only 14% spent in run_opt. What I would really like to do is profile libjava.so so I can find out what it is doing in those native methods. Any help on how to do this much appreciated. Regards, Ed. From ed at camswl.com Thu Feb 19 17:39:10 2009 From: ed at camswl.com (Edward Nevill) Date: Fri, 20 Feb 2009 01:39:10 GMT Subject: Improving the performance of OpenJDK Message-ID: <200902200139.n1K1dAst004899@parsley.camswl.com> > Gary Benson wrote: > > Gary Benson wrote: > > ... > > > > > > Edward Nevill wrote: > > > > > > > get_native_u2() and get_Java_u2() ... This seems to be a > > > > > > > misguided attempt of the original authors to optimised > > > > > > > reading of halfwords (judging by the comment immediate > > > > > > > preceding the code). > > ... > > > So I did a quick SPECjvm98 run of this change on amd64. A couple > > > of the times are improved, by 3-10%, but one of the times is > > > *slower* by 11%. See attached. I'm not sure what to make of > > > that...
> > > > I'm wondering if rewriting get_Java_u2() to read directly rather > > than read and swap is speeding it up, but removing the optimization > > from get_native_u2() is slowing it down. I'm going to try this with > > the original get_native_u2() and with get_Java_u2() just a copy of > > the big-endian get_native_u2(). > > So this way around it's a little more encouraging; two of the times > are 8-10% faster, one is 5% faster. Some of the times are still > slower though, though not by as much, maybe 1-2%. It's still > disturbingly ambiguous though... thoughts? I can believe on some architectures that get_native_u2() is faster. However, x86 supports unaligned accesses, so the code should do something like: #ifdef TARGET_SUPPORTS_UNALIGNED_ACCESSES return *p; #else /* Do current crud, or something better */ #endif The main killer is the get_Java_u2(). The stupid unsigned hw load, followed by the sign extension, followed by a full word bytesex reversal. Question: Why is the class data in the wrong endianness in the first place? The data affected is the data embedded in the Java bytecode and the data in the constant pool. Why does the classfile loader not just correct the endianness on load? It has to do verification anyway, so it has to trawl through the classfile; it might as well correct the endianness. Regards, Ed. From ed at camswl.com Thu Feb 19 17:46:59 2009 From: ed at camswl.com (Edward Nevill) Date: Fri, 20 Feb 2009 01:46:59 GMT Subject: backedge checks Message-ID: <200902200146.n1K1kxWK004933@parsley.camswl.com> > > I knew this wasn't right, and I just realized why. Every processor has its own > data cache, and these caches may not be coherent. You could potentially wait > for ever before a processor would read the changed dispatch table. > > I think this could probably be solved by sending a signal to every thread that > is supposed to move to a safepoint. If there's any additional processing needed > the signal handler can do it. > > Also, the thread updating the table would have to flush its own caches/write buffers/etc. > > Finally, unless the table itself is marked as volatile there's no guarantee that > it would be written before the flush. If you seriously think any of this code is MP safe at the thread level then I suspect you have been doing some recreational pharmaceuticals :-) Think about it: any update to the thread list needs to be MP safe. Even the setting of the _state variable needs to be MP safe, and it isn't; it just does _state = _synchronizing; Exactly the same problems apply to this, it may not be updated in the cache of another processor. Even the kernel's user level atomic compare and swap is not MP safe (it may be MP safe on some implementations but it is not guaranteed so), and you are going to have a hard job making your program MP safe if your atomic operations are not MP safe. Regards, Ed. From ed at camswl.com Fri Feb 20 06:57:32 2009 From: ed at camswl.com (Edward Nevill) Date: Fri, 20 Feb 2009 14:57:32 GMT Subject: Bytecode profiling Message-ID: <200902201457.n1KEvWQt007786@parsley.camswl.com> Hi folks, Another day, more profiling. I have persuaded gprof to tell me the % time spent executing each bytecode in my ASM loop. This is more useful than the simple bytecode counter in the CPP interpreter. Some interesting results. I have just included summaries here down to the 1% level. If anyone wants the full profile I can send it. At the bottom of each bytecode profile I have included other functions down to the 1% level for comparison. First CaffeineMark.
Fairly much as expected. I'm surprised it spends quite that amount of time in getfield. I'll have another look at that. Also notice the high % in iload_0. This is probably a lie. It is probably aload_0, the asm VM does not distinguish aload_0, it just goes to iload_0. So the sequence is probably aload_0, getfield. ----- ECM --- Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ms/call ms/call name 11.99 1.21 1.21 do_getfield 7.19 1.94 0.73 do_iload 4.96 3.06 0.50 do_iaload 4.56 3.52 0.46 do_ifne 4.56 3.98 0.46 do_iload_0 4.36 4.42 0.44 do_iconst_1 3.32 4.75 0.34 do_iload_2 2.97 5.05 0.30 do_if_icmplt 2.78 5.33 0.28 do_istore 2.58 5.87 0.26 do_ixor 2.38 6.11 0.24 do_laload 1.78 6.48 0.18 do_iadd 1.73 6.66 0.18 do_iinc 1.68 7.00 0.17 do_iload_1 1.54 7.15 0.16 do_iload_3 1.49 7.45 0.15 do_iconst_0 1.44 7.60 0.15 do_if_icmpgt 1.39 7.74 0.14 do_if_icmpge 1.19 7.86 0.12 do_dadd 1.19 7.98 0.12 do_ifle 1.19 8.10 0.12 do_istore_2 1.14 8.21 0.12 do_lastore 1.09 8.43 0.11 do_lload 6.14 2.56 0.62 2136180 0.00 0.00 BytecodeInterpreter::run(BytecodeInterpreter*) 2.78 5.61 0.28 490 0.57 1.85 CppInterpreter::main_loop(int, Thread*) 1.88 6.30 0.19 1099047 0.00 0.00 InterpreterFrame::build(ZeroStack*, methodOopDesc*, JavaThread*) 1.68 6.83 0.17 1218147 0.00 0.00 methodOopDesc::result_type() const 1.49 7.30 0.15 1224036 0.00 0.00 SignatureIterator::iterate_returntype() 1.09 8.32 0.11 18512 0.01 0.01 typeArrayKlass::allocate(int, Thread*) ---------------------------- Next, EEMBC. Only 5 bytecodes above the 1% level. Notice the time spent in os::vm_page_size(), 2.93%, more than any bytecode apart from do_getfield. --- EEMBC ----- Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 3.54 61.03 3.84 do_getfield 3.27 64.57 3.54 run_opt_entry 1.80 79.09 1.95 do_iload 1.58 80.80 1.72 do_iload_0 1.16 82.06 1.26 do_iload_1 1.06 83.21 1.15 do_if_icmplt 24.60 26.67 26.67 100435622 0.00 0.00 BytecodeInterpreter::run(BytecodeInterpreter*) 13.87 41.70 15.03 1847 0.01 0.02 CppInterpreter::main_loop(int, Thread*) 5.77 47.95 6.25 47872572 0.00 0.00 methodOopDesc::result_type() const 4.58 52.91 4.97 47096141 0.00 0.00 InterpreterFrame::build(ZeroStack*, methodOopDesc*, JavaThread*) 3.95 57.19 4.28 47882026 0.00 0.00 SignatureIterator::iterate_returntype() 2.93 67.75 3.18 94194307 0.00 0.00 os::vm_page_size() 2.64 70.61 2.86 46326855 0.00 0.00 CppInterpreter::normal_entry(methodOopDesc*, int, Thread*) 2.06 72.84 2.23 47897882 0.00 0.00 SignatureIterator::parse_type() 2.03 75.04 2.20 74343829 0.00 0.00 ConstantPoolCacheEntry::is_resolved(Bytecodes::Code) const 1.94 77.14 2.10 47884823 0.00 0.00 SignatureIterator::SignatureIterator(symbolHandle) ------------------------------ Finally, Think Office. Only 1 bytecode above the 1% level. And it spends more time doing os::vm_page_size() than any bytecode. I must get the callgraph profile and find out who is calling it 33E6 times. Note the time in run_opt_entry. This is because of all the invokes and returns. Perhaps I have just deoptimised the VM for Think Office and it is just spending its time bouncing between 'run' and 'run_opt'. --- Office --- Flat profile: Each sample counts as 0.01 seconds.
% cumulative self self total time seconds seconds calls s/call s/call name 2.12 31.06 1.07 run_opt_entry 1.39 34.18 0.70 do_getfield 25.50 12.88 12.88 34862862 0.00 0.00 BytecodeInterpreter::run(BytecodeInterpreter*) 11.26 18.57 5.69 35015 0.00 0.00 CppInterpreter::main_loop(int, Thread*) 6.36 21.78 3.21 18326576 0.00 0.00 methodOopDesc::result_type() const 4.59 24.10 2.32 16971754 0.00 0.00 InterpreterFrame::build(ZeroStack*, methodOopDesc*, JavaThread*) 4.30 26.27 2.17 18391356 0.00 0.00 SignatureIterator::iterate_returntype() 2.67 27.62 1.35 18587041 0.00 0.00 SignatureIterator::parse_type() 2.36 28.81 1.19 24191043 0.00 0.00 ConstantPoolCacheEntry::is_resolved(Bytecodes::Code) const 2.34 29.99 1.18 15669590 0.00 0.00 CppInterpreter::normal_entry(methodOopDesc*, int, Thread*) 1.94 32.04 0.98 33980388 0.00 0.00 os::vm_page_size() 1.46 32.77 0.74 1302034 0.00 0.00 CppInterpreter::native_entry(methodOopDesc*, int, Thread*) 1.41 33.48 0.71 18435409 0.00 0.00 SignatureIterator::SignatureIterator(symbolHandle) -------------------- For now.... Ed From aph at redhat.com Fri Feb 20 03:10:52 2009 From: aph at redhat.com (Andrew Haley) Date: Fri, 20 Feb 2009 11:10:52 +0000 Subject: backedge checks In-Reply-To: <200902200146.n1K1kxWK004933@parsley.camswl.com> References: <200902200146.n1K1kxWK004933@parsley.camswl.com> Message-ID: <499E8FBC.7020805@redhat.com> Edward Nevill wrote: >> I knew this wasn't right, and I just realized why. Every processor has its own >> data cache, and these caches may not be coherent. You could potentially wait >> for ever before a processor would read the changed dispatch table. >> >> I think this could probably be solved by sending a signal to every thread that >> is supposed to move to a safepoint. If there's any additional processing needed >> the signal handler can do it. >> >> Also, the thread updating the table would have to flush its own caches/write buffers/etc. >> >> Finally, unless the table itself is marked as volatile there's no guarantee that >> it would be written before the flush. > > If you seriously think any of this code is MP safe at the thread level > then I suspect you have been doing some recreational pharmaceuticals:-) > > Think about is, any update to the thread list needs to be MP safe. > Even the setting of the _state variable needs to me MP safe and its isn't > it just does > > _state = _synchronizing; It wouldn't work if that were all it did: there are memory barriers there too. But anyway, it looks like the template interpreter's notice_safepoints() does simply: static inline void copy_table(address* from, address* to, int size) { // Copy non-overlapping tables. The copy has to occur word wise for MT safety. while (size-- > 0) *to++ = *from++; } void TemplateInterpreter::notice_safepoints() { if (!_notice_safepoints) { // switch to safepoint dispatch table _notice_safepoints = true; copy_table((address*)&_safept_table, (address*)&_active_table, sizeof(_active_table) / sizeof(address)); } } so I can't see any reason for the C++ interpreter not to do the same. I do wonder how this works in general, though. It does seem like the code assumes memory semantics stronger than those guaranteed by POSIX threads. Maybe the template interpreter in hotspot isn't used on boxes with weakly consistent caches? That seems unlikely, though. I'll ask. > Exactly the same problems apply to this, it may not be updated in the cache > of another processor. 
> > Even the kernels user level atomic compare and swap is not MP safe (it may > be MP safe on some implementations but it is not guaranteed so), and you > are going to have a hard job making your program MP safe if your atomic > operations are not MP safe. The atomic operations are MP safe by definition. Andrew. From ed at camswl.com Fri Feb 20 07:23:03 2009 From: ed at camswl.com (Edward Nevill) Date: Fri, 20 Feb 2009 15:23:03 GMT Subject: Optimising invoke and return Message-ID: <200902201523.n1KFN3gx007900@parsley.camswl.com> Hi again folks, The benchmarking so far has shown me that to get any real improvement on (semi) real world applications like Think Free Office I must look at a number of things... 1) Optimisation of invoke and return The way invoke and return is done at the moment in the CC_INTERP is disgusting. When 'run' encounters an invoke or return it does some limited processing, sets the VM state to 'method_entry' or 'returning' and then returns from 'run' to 'main_loop' which then handles the invoke or return and then calls 'run' again. Now with the 'optimised' interpreter, when it encounters an invoke or return it has to thread its way back from run_opt to main_loop and then back down to run_opt. Preoptimisation handling of invoke / return: main_loop -> run -> main_loop -> run Post 'optimisation': main_loop -> run -> run_opt -> run -> main_loop -> run -> run_opt. Oops. The way around this is to try and flatten main_loop into run so that it is all handled in 'run'. Then the code for invoke/return can be migrated down to run_opt (at least for the simple cases where it is resolved, not synchronised?, no exceptions). However this will involve large scale munging of the codebase, which I wanted to avoid. Increasingly I am feeling that I am flogging a dead horse with the CC_INTERP. I had a look at the template interpreter over the past few days. It doesn't actually look too bad. It all seems quite neatly architected (at least to the depth I examined it). 2) Optimisation of native invoke and return 3) Optimisation of the native libraries. My pointy-haired boss is in town today and I think my proposal to him will be that we do the template interpreter. Based on some benchmarking I did on the PC comparing zero with -Xint this should give us 6 X performance increase over bare bones zero (4 X over my current optimised implementation). Doing the template interpreter would also serve as a useful 1st step to doing hotspot (should we decide to go that way). Gary: Does the Hotspot Compiler use the template interpreter to do its codegen, or does it have its own codegen and only use the template interpreter to interpret uncompiled code? (I could check this myself in the source, but I am lazy). Regards, Ed. From aph at redhat.com Fri Feb 20 03:58:50 2009 From: aph at redhat.com (Andrew Haley) Date: Fri, 20 Feb 2009 11:58:50 +0000 Subject: Improving the performance of OpenJDK In-Reply-To: <200902200139.n1K1dAst004899@parsley.camswl.com> References: <200902200139.n1K1dAst004899@parsley.camswl.com> Message-ID: <499E9AFA.3010502@redhat.com> Edward Nevill wrote: > > Question: Why is the class data in the wrong endianness in the first place? > The data affected is the data embedded in the Java bytecode and the data > in the constant pool. Why does the classfile loader not just correct > the endianness on load? It has to do verification anyway so it has to > trawl through the classfile? It might as well correct the endianness. No, I don't understand this either.
However, there are other places than the interpreter that read the bytecode, so it's not just a matter of fixing it there. All accesses to the constant pool seem to read the operand in native byte order, and then constantPoolOopDesc::field_or_method_at() does: // Takes either a constant pool cache index in possibly byte-swapped // byte order (which comes from the bytecodes after rewriting) or, // if "uncached" is true, a vanilla constant pool index jint field_or_method_at(int which, bool uncached) { int i = -1; if (uncached || cache() == NULL) { i = which; } else { // change byte-ordering and go via cache i = cache()->entry_at(Bytes::swap_u2(which))->constant_pool_index(); } assert(tag_at(i).is_field_or_method(), "Corrupted constant pool"); return *int_at_addr(i); } In other words: the operands are rewritten but left in Java byte order. Andrew. From gbenson at redhat.com Fri Feb 20 04:02:15 2009 From: gbenson at redhat.com (Gary Benson) Date: Fri, 20 Feb 2009 12:02:15 +0000 Subject: Optimising invoke and return In-Reply-To: <200902201523.n1KFN3gx007900@parsley.camswl.com> References: <200902201523.n1KFN3gx007900@parsley.camswl.com> Message-ID: <20090220120215.GA3277@redhat.com> Edward Nevill wrote: > My poiny haired boss is in town today and I think my proposal to > him will be that we do the template interpreter. Based on some > benchmarking I did on the PC comparing zero with -Xint this should > give us 6 X performance increase over bare bones zero (4 X over my > current optimised implementation). > > Doing the template interpreter would also serve as a useful 1st step > to doing hotspot (should we decide to go that way). > > Gary: Does the Hotspt Compiler use the template interpreter to do > its codegen, or does it have its own codegen and only use the > template interpreter to interpret uncompiled code. (I could check > this myself in the source, but I am lazy). The two compilers (client and server) and the template interpreter are essentially independent. There's shared code, of course, but they're essentially different. Whether doing the template interpreter is a good idea depends on what you want, and when you want it. If your goal is a high-performance interpreter only solution then I'd say do the template interpreter. The work involved is probably something like 6-9 man months to get it mostly working, and maybe another 3 months to pass the TCK if that's what you want. If your goal is a high-performance interpreter+JIT combo, then adding the client JIT to an existing template interpreter is probably 3-6 man months, with maybe another 3 months for the TCK. I'm guessing here, I haven't done a client JIT interpretation myself. I'm also assuming you'd want client rather than server (it's probably better suited for ARM). If you want a JIT but don't have a 12-18 months to do it in and can accept good but not ultimate performance, then maybe Zero and Shark are worth looking into. Shark is not bad now, and it could be in something like a production ready state in 3-6 months. That's my two cents... Cheers, Gary -- http://gbenson.net/ From ed at camswl.com Fri Feb 20 08:11:36 2009 From: ed at camswl.com (Edward Nevill) Date: Fri, 20 Feb 2009 16:11:36 GMT Subject: Improving the performance of OpenJDK Message-ID: <200902201611.n1KGBark008090@parsley.camswl.com> > > Question: Why is the class data in the wrong endianness in the first place? > > The data affected is the data embeded in the Java bytecode and the data > > in the constant pool. 
Why does the classfile loader not just correct > the endianness on load? It has to do verification anyway so it has to > trawl through the classfile? It might as well correct the endianness. > > No I don't understand this either. However, there are other places than > the interpreter that read the bytecode, so it's not just a matter of > fixing it there. Ah yes, but presumably all those other places in the interpreter will use the wonderfully written and highly optimised get_Java_u2() and friends so we don't need to trawl through the interpreter. I mean that's why they were written in the first place. Regards, Ed. From xerxes at zafena.se Fri Feb 20 04:59:27 2009 From: xerxes at zafena.se (Xerxes Rånby) Date: Fri, 20 Feb 2009 13:59:27 +0100 Subject: Optimising invoke and return - importing phoneME ARM JIT to Icedtea? In-Reply-To: <20090220120215.GA3277@redhat.com> References: <200902201523.n1KFN3gx007900@parsley.camswl.com> <20090220120215.GA3277@redhat.com> Message-ID: <499EA92F.4030508@zafena.se> Hello, I have been following the recent days' optimization discussion with great interest and wanted to fill in some thoughts. Gary Benson wrote: > Edward Nevill wrote: > >> My pointy-haired boss is in town today and I think my proposal to >> him will be that we do the template interpreter. Based on some >> benchmarking I did on the PC comparing zero with -Xint this should >> give us 6 X performance increase over bare bones zero (4 X over my >> current optimised implementation). >> >> Doing the template interpreter would also serve as a useful 1st step >> to doing hotspot (should we decide to go that way). >> >> Gary: Does the Hotspot Compiler use the template interpreter to do >> its codegen, or does it have its own codegen and only use the >> template interpreter to interpret uncompiled code? (I could check >> this myself in the source, but I am lazy). >> > > The two compilers (client and server) and the template interpreter > are essentially independent. There's shared code, of course, but > they're essentially different. > > Whether doing the template interpreter is a good idea depends on what > you want, and when you want it. If your goal is a high-performance > interpreter only solution then I'd say do the template interpreter. > The work involved is probably something like 6-9 man months to get it > mostly working, and maybe another 3 months to pass the TCK if that's > what you want. >
> A question that have tickeled my mind since fosdem are if the phoneME and OpenJDK jvm codebase share any similarity, if so then perhaps a allready implemented ARM JIT from the phoneME project can be patched up to support 1.5 bytecodes and imported into a GPL only fork of OpenJDK by dropping all classpath exceptions. For me it is impossible to judge the effort needed to do such a suggested merge. > If you want a JIT but don't have a 12-18 months to do it in and can > accept good but not ultimate performance, then maybe Zero and Shark > are worth looking into. Shark is not bad now, and it could be in > something like a production ready state in 3-6 months. > > That's my two cents... > > Cheers, > Gary > Cheers and have a great day! Xerxes From gbenson at redhat.com Tue Feb 24 04:00:41 2009 From: gbenson at redhat.com (Gary Benson) Date: Tue, 24 Feb 2009 12:00:41 +0000 Subject: Profiling In-Reply-To: <200902200051.n1K0p5YK004736@parsley.camswl.com> References: <200902200051.n1K0p5YK004736@parsley.camswl.com> Message-ID: <20090224120041.GA3260@redhat.com> Hi Ed, Edward Nevill wrote: > --- ThinkFreeOffice --- Original > > % cumulative self self total > time seconds seconds calls s/call s/call name > 42.12 21.35 21.35 33855747 0.00 0.00 BytecodeInterpreter::run(BytecodeInterpreter*) > 9.70 26.27 4.92 34204 0.00 0.00 CppInterpreter::main_loop(int, Thread*) > 5.17 28.89 2.62 17783120 0.00 0.00 methodOopDesc::result_type() const > 4.54 31.19 2.30 16484931 0.00 0.00 InterpreterFrame::build(ZeroStack*, methodOopDesc*, JavaThread*) > 4.48 33.46 2.27 17847099 0.00 0.00 SignatureIterator::iterate_returntype() > 2.49 34.72 1.26 15239483 0.00 0.00 CppInterpreter::normal_entry(methodOopDesc*, int, Thread*) > 2.41 35.94 1.22 18044932 0.00 0.00 SignatureIterator::parse_type() > 2.05 36.98 1.04 33005883 0.00 0.00 os::vm_page_size() > 1.47 37.72 0.75 1245317 0.00 0.00 CppInterpreter::native_entry(methodOopDesc*, int, Thread*) > 1.44 38.45 0.73 17891344 0.00 0.00 SignatureIterator::SignatureIterator(symbolHandle) > 1.09 39.00 0.55 1773458 0.00 0.00 CppInterpreter::accessor_entry(methodOopDesc*, int, Thread*) This one I find really interesting. Is libffi showing up in your profile at all? Andrew Haley and I have often wondered what performance impace using libffi has, and Andrew has some ideas for speeding up libffi if it turns out to be a bottleneck. If your profile is excluding libffi, it'd be really interesting to see what it looked like with it in. Cheers, Gary -- http://gbenson.net/ From ed at camswl.com Wed Feb 25 06:01:23 2009 From: ed at camswl.com (Edward Nevill) Date: Wed, 25 Feb 2009 14:01:23 GMT Subject: Profiling Message-ID: <200902251401.n1PE1N44017335@parsley.camswl.com> > >This one I find really interesting. Is libffi showing up in your >profile at all? Andrew Haley and I have often wondered what >performance impace using libffi has, and Andrew has some ideas for >speeding up libffi if it turns out to be a bottleneck. If your >profile is excluding libffi, it'd be really interesting to see >what it looked like with it in. > >Cheers, >Gary Hi Gary, The previous profiles did not include libffi because I was using the shared library. I have redone it with a static libffi. Here are the top 50 results. libffi accounts for 0.5% on this particular app (Think Free Office). I also did CaffeineMark and EEMBC but the results on those were negligible (<0.10%). Regards, Ed. Flat profile: Each sample counts as 0.01 seconds. 
% cumulative self self total time seconds seconds calls s/call s/call name 23.81 11.98 11.98 33820691 0.00 0.00 BytecodeInterpreter::run(BytecodeInterpreter*) 10.99 17.51 5.53 33987 0.00 0.00 CppInterpreter::main_loop(int, Thread*) 5.96 20.51 3.00 17775389 0.00 0.00 methodOopDesc::result_type() const 4.79 22.92 2.41 17840744 0.00 0.00 SignatureIterator::iterate_returntype() 3.90 24.88 1.96 16469679 0.00 0.00 InterpreterFrame::build(ZeroStack*, methodOopDesc*, JavaThread*) 2.48 26.13 1.25 15215340 0.00 0.00 CppInterpreter::normal_entry(methodOopDesc*, int, Thread*) 2.44 27.36 1.23 23488682 0.00 0.00 ConstantPoolCacheEntry::is_resolved(Bytecodes::Code) const 2.38 28.56 1.20 32975124 0.00 0.00 os::vm_page_size() 2.29 29.71 1.15 18024369 0.00 0.00 SignatureIterator::parse_type() 2.13 30.78 1.07 run_opt_entry 1.85 31.71 0.93 1254203 0.00 0.00 CppInterpreter::native_entry(methodOopDesc*, int, Thread*) 1.61 32.52 0.81 17883786 0.00 0.00 SignatureIterator::SignatureIterator(symbolHandle) 1.25 33.15 0.63 do_getfield_state1 0.85 33.58 0.43 207980 0.00 0.00 typeArrayKlass::allocate(int, Thread*) 0.81 33.99 0.41 1777040 0.00 0.00 CppInterpreter::accessor_entry(methodOopDesc*, int, Thread*) 0.81 34.40 0.41 run_opt_exit 0.70 34.75 0.35 6219137 0.00 0.00 BytecodeInterpreter::astore(int*, int, int*, int) 0.68 35.09 0.34 do_iload_state1 0.64 35.41 0.32 15293338 0.00 0.00 ThreadShadow::clear_pending_exception() 0.62 35.72 0.31 16657203 0.00 0.00 os::current_stack_pointer() 0.56 36.00 0.28 2348866 0.00 0.00 void oop_store(oopDesc**, oopDesc*) 0.56 36.28 0.28 304045 0.00 0.00 SymbolTable::lookup_only(char const*, int, unsigned int&) 0.48 36.52 0.24 17819959 0.00 0.00 ResultTypeFinder::set(int, BasicType) 0.48 36.76 0.24 4007672 0.00 0.00 SignatureInfo::do_void() 0.44 36.98 0.22 5069507 0.00 0.00 SignatureInfo::do_object(int, int) 0.44 37.20 0.22 do_aload_0_state0 0.42 37.41 0.21 do_iload_state0 0.42 37.62 0.21 ffi_call 0.40 37.82 0.20 44574 0.00 0.00 ClassFileParser::parse_method(constantPoolHandle, bool, AccessFlags*, typeArrayHandle*, typeArrayHandle*, typeArrayHandle*, Thread*) 0.36 38.00 0.18 1326820 0.00 0.00 HandleMarkCleaner::~HandleMarkCleaner() 0.32 38.16 0.16 1123793 0.00 0.00 Klass::is_subtype_of(klassOopDesc*) const 0.30 38.31 0.15 4430397 0.00 0.00 SignatureInfo::do_int() 0.30 38.46 0.15 2539725 0.00 0.00 SignatureInfo::do_bool() 0.30 38.61 0.15 do_aload_0_state1 0.30 38.76 0.15 do_goto_state0 0.30 38.91 0.15 do_iload_2_state1 0.28 39.05 0.14 43798 0.00 0.00 klassVtable::update_super_vtable(instanceKlass*, methodHandle, int, bool, Thread*) 0.26 39.18 0.13 9376 0.00 0.00 constantPoolKlass::oop_follow_contents(oopDesc*) 0.26 39.31 0.13 do_if_icmpge_state1 0.26 39.44 0.13 ffi_prep_args 0.24 39.56 0.12 128869 0.00 0.00 instanceKlass::uncached_lookup_method(symbolOopDesc*, symbolOopDesc*) const 0.24 39.68 0.12 do_aload_1_state0 0.24 39.80 0.12 do_bipush_state1 0.22 39.91 0.11 16 0.01 0.01 ContiguousSpace::prepare_for_compaction(CompactPoint*) 0.22 40.02 0.11 do_if_icmpne_state1 From aph at redhat.com Wed Feb 25 02:41:19 2009 From: aph at redhat.com (Andrew Haley) Date: Wed, 25 Feb 2009 10:41:19 +0000 Subject: Profiling In-Reply-To: <200902251401.n1PE1N44017335@parsley.camswl.com> References: <200902251401.n1PE1N44017335@parsley.camswl.com> Message-ID: <49A5204F.3010108@redhat.com> Edward Nevill wrote: >> This one I find really interesting. Is libffi showing up in your >> profile at all? 
Andrew Haley and I have often wondered what >> performance impace using libffi has, and Andrew has some ideas for >> speeding up libffi if it turns out to be a bottleneck. If your >> profile is excluding libffi, it'd be really interesting to see >> what it looked like with it in. > The previous profiles did not include libffi because I was using the > shared library. I have redone it with a static libffi. Here are the > top 50 results. libffi accounts for 0.5% on this particular app > (Think Free Office). I also did CaffeineMark and EEMBC but the > results on those were negligible (<0.10%). Good news, thanks. FYI Ed, you might want to consider oprofile: it profiles everything that's running and it's easy to use. (The docs are kinda horrible and make it look difficult, but really it's not.) Cheers, Andrew. From ed at camswl.com Thu Feb 26 07:45:24 2009 From: ed at camswl.com (Edward Nevill) Date: Thu, 26 Feb 2009 15:45:24 GMT Subject: get_native_u2 & friends Message-ID: <200902261545.n1QFjOY3022789@parsley.camswl.com> Hi all, I think the best thing to do with get_native_u2 and friends is to let the compiler decide how to access unaligned data. Most modern compilers have some facility for doing this. In gcc you can use __attribute__((packed)) as follows. typedef union unaligned { unsigned u; unsigned short us; unsigned long long ul; } __attribute__((packed)) unaligned; unsigned short get_native_u2(unaligned *p) { return p->us; } unsigned get_native_u4(unaligned *p) { return p->u; } unsigned long long get_native_u8(unaligned *p) { return p->ul; } Below is the code generated for ARM and X86. Note that in the X86 case it just does the access since X86 allows unaligned accesses whereas for ARM it goes ahead and doesbyte loads. If on some architechture it is better to test the alignment and then do word/halfword loads if the pointer is aligned and byte loads otherwise then hopefully the compiler will know the best code to generate rarther than us trying to second guess what is best on individual architectures. Also, in many case these functions are called when it is known that the data is aligned as in this example from _tableswitch... CASE(_tableswitch): { jint* lpc = (jint*)VMalignWordUp(pc+1); int32_t key = STACK_INT(-1); int32_t low = Bytes::get_Java_u4((address)&lpc[1]); int32_t high = Bytes::get_Java_u4((address)&lpc[2]); Maybe it is worth having get_Java_u4() and get_Java_u4_unaligned()? Regards, Ed. 
--- x86.s -------------------------------------------------------- .file "test.c" .text .p2align 4,,15 .globl get_native_u2 .type get_native_u2, @function get_native_u2: pushl %ebp movl %esp, %ebp movl 8(%ebp), %eax popl %ebp movzwl (%eax), %eax ret .size get_native_u2, .-get_native_u2 .p2align 4,,15 .globl get_native_u4 .type get_native_u4, @function get_native_u4: pushl %ebp movl %esp, %ebp movl 8(%ebp), %eax popl %ebp movl (%eax), %eax ret .size get_native_u4, .-get_native_u4 .p2align 4,,15 .globl get_native_u8 .type get_native_u8, @function get_native_u8: pushl %ebp movl %esp, %ebp movl 8(%ebp), %eax popl %ebp movl 4(%eax), %edx movl (%eax), %eax ret .size get_native_u8, .-get_native_u8 .ident "GCC: (GNU) 4.2.4 (Ubuntu 4.2.4-1ubuntu3)" .section .note.GNU-stack,"", at progbits --- arm.s ------------------------------------------------------------- .arch armv5t .fpu softvfp .eabi_attribute 20, 1 .eabi_attribute 21, 1 .eabi_attribute 23, 3 .eabi_attribute 24, 1 .eabi_attribute 25, 1 .eabi_attribute 26, 2 .eabi_attribute 30, 2 .eabi_attribute 18, 4 .file "test.c" .text .align 2 .global get_native_u2 .type get_native_u2, %function get_native_u2: @ args = 0, pretend = 0, frame = 0 @ frame_needed = 0, uses_anonymous_args = 0 @ link register save eliminated. ldrb r3, [r0, #1] @ zero_extendqisi2 ldrb r0, [r0, #0] @ zero_extendqisi2 orr r0, r0, r3, asl #8 bx lr .size get_native_u2, .-get_native_u2 .align 2 .global get_native_u4 .type get_native_u4, %function get_native_u4: @ args = 0, pretend = 0, frame = 0 @ frame_needed = 0, uses_anonymous_args = 0 @ link register save eliminated. ldrb r1, [r0, #1] @ zero_extendqisi2 ldrb r3, [r0, #0] @ zero_extendqisi2 ldrb r2, [r0, #2] @ zero_extendqisi2 ldrb r0, [r0, #3] @ zero_extendqisi2 orr r3, r3, r1, asl #8 orr r3, r3, r2, asl #16 orr r0, r3, r0, asl #24 bx lr .size get_native_u4, .-get_native_u4 .align 2 .global get_native_u8 .type get_native_u8, %function get_native_u8: @ args = 0, pretend = 0, frame = 0 @ frame_needed = 0, uses_anonymous_args = 0 @ link register save eliminated. stmfd sp!, {r4, r5, r6} ldrb r5, [r0, #1] @ zero_extendqisi2 ldrb r6, [r0, #5] @ zero_extendqisi2 ldrb r3, [r0, #0] @ zero_extendqisi2 ldrb ip, [r0, #2] @ zero_extendqisi2 ldrb r1, [r0, #4] @ zero_extendqisi2 ldrb r2, [r0, #6] @ zero_extendqisi2 ldrb r4, [r0, #7] @ zero_extendqisi2 ldrb r0, [r0, #3] @ zero_extendqisi2 orr r3, r3, r5, asl #8 orr r1, r1, r6, asl #8 orr r3, r3, ip, asl #16 orr r1, r1, r2, asl #16 orr r0, r3, r0, asl #24 orr r1, r1, r4, asl #24 ldmfd sp!, {r4, r5, r6} bx lr .size get_native_u8, .-get_native_u8 .ident "GCC: (Ubuntu 4.3.3-1ubuntu1) 4.3.3" .section .note.GNU-stack,"",%progbits From aph at redhat.com Thu Feb 26 10:37:31 2009 From: aph at redhat.com (Andrew Haley) Date: Thu, 26 Feb 2009 18:37:31 +0000 Subject: Shark: JVMTI support for profiling, etc Message-ID: <49A6E16B.1060205@redhat.com> This patch adds support for Shark profiling. 
Running SPECjvm98 gives you a profile like this, showing the time spent in Shark-compiled code: samples % image name app name symbol name 168440 5.0754 no-vmlinux no-vmlinux /no-vmlinux 125195 3.7724 9541.jo java spec.benchmarks._209_db.Database::shell_sort 87596 2.6394 libjvm.so libjvm.so SignatureIterator::iterate_returntype() 80076 2.4128 9541.jo java java.lang.String::compareTo 70082 2.1117 9541.jo java java.util.Vector::elementAt 65929 1.9866 9541.jo java spec.benchmarks._201_compress.Compressor::compress 62324 1.8779 libjvm.so libjvm.so SharedHeap::is_in_permanent(void const*) const 59044 1.7791 libjvm.so libjvm.so CppInterpreter::accessor_entry(methodOopDesc*, long, Thread*) 58508 1.7630 libjvm.so libjvm.so Hashtable::oops_do(OopClosure*) 43226 1.3025 9541.jo java spec.benchmarks._222_mpegaudio.q::l 42005 1.2657 9541.jo java java.util.Vector::elementData 39031 1.1761 9541.jo java spec.benchmarks._205_raytrace.OctNode::Intersect 38677 1.1654 9541.jo java spec.benchmarks._201_compress.Decompressor::decompress ... I committed this. Andrew. 2009-02-26 Andrew Haley * patches/openjdk/hotspot/src/share/vm/prims/jvmtiEnv.cpp: New file. * Makefile.am (ICEDTEA_PATCHES): Add icedtea-jvmtiEnv.patch. * ports/hotspot/src/share/vm/shark/sharkFunction.cpp (SharkFunction::initialize): Use real name, not "func". Pass "none" to "-debug-only=". Register generated code for profiling, etc. * ports/hotspot/src/share/vm/shark/sharkEntry.hpp (class SharkEntry): Remove #ifndef PRODUCT. * ports/hotspot/src/share/vm/shark/sharkBuilder.hpp (SharkBuilder::CreateFunction): Use real name, not "func". * ports/hotspot/src/share/vm/shark/sharkBuilder.cpp (SharkBuilder::CreateFunction): Use real name, not "func". (MyJITMemoryManager::endFunctionBody): Remove #ifndef PRODUCT. --- openjdk/hotspot/src/share/vm/prims/jvmtiEnv.cpp.old 2009-02-26 17:18:35.000000000 +0000 +++ openjdk/hotspot/src/share/vm/prims/jvmtiEnv.cpp 2009-02-26 17:16:59.000000000 +0000 @@ -2702,6 +2702,9 @@ (*entry_count_ptr) = num_entries; (*table_ptr) = jvmti_table; + if (num_entries == 0) + return JVMTI_ERROR_ABSENT_INFORMATION; + return JVMTI_ERROR_NONE; } /* end GetLineNumberTable */ diff -r 90de0ba94422 Makefile.am --- a/Makefile.am Thu Feb 26 17:34:19 2009 +0000 +++ b/Makefile.am Thu Feb 26 18:32:17 2009 +0000 @@ -541,7 +541,8 @@ patches/icedtea-sunsrc.patch \ patches/icedtea-libraries.patch \ patches/icedtea-javafiles.patch \ - patches/icedtea-core-build.patch + patches/icedtea-core-build.patch \ + patches/icedtea-jvmtiEnv.patch if WITH_ALT_HSBUILD ICEDTEA_PATCHES += \ diff -r 90de0ba94422 ports/hotspot/src/share/vm/shark/sharkBuilder.cpp --- a/ports/hotspot/src/share/vm/shark/sharkBuilder.cpp Thu Feb 26 17:34:19 2009 +0000 +++ b/ports/hotspot/src/share/vm/shark/sharkBuilder.cpp Thu Feb 26 18:29:12 2009 +0000 @@ -97,12 +97,12 @@ module()->getOrInsertFunction("llvm.memory.barrier", type)); } -Function *SharkBuilder::CreateFunction() +Function *SharkBuilder::CreateFunction(const char *name) { Function *function = Function::Create( SharkType::entry_point_type(), GlobalVariable::InternalLinkage, - "func"); + name); module()->getFunctionList().push_back(function); return function; } @@ -180,13 +180,12 @@ void SharkBuilder::MyJITMemoryManager::endFunctionBody (const llvm::Function *F, unsigned char *FunctionStart, - unsigned char *FunctionEnd) + unsigned char *FunctionEnd) { mm->endFunctionBody(F, FunctionStart, FunctionEnd); -#ifndef PRODUCT + SharkEntry *e = sharkEntry[F]; if (e) e->setBounds(FunctionStart, FunctionEnd); -#endif // !PRODUCT } diff 
-r 90de0ba94422 ports/hotspot/src/share/vm/shark/sharkBuilder.hpp --- a/ports/hotspot/src/share/vm/shark/sharkBuilder.hpp Thu Feb 26 17:34:19 2009 +0000 +++ b/ports/hotspot/src/share/vm/shark/sharkBuilder.hpp Thu Feb 26 18:29:12 2009 +0000 @@ -109,7 +109,7 @@ // Function creation public: - llvm::Function *CreateFunction(); + llvm::Function *CreateFunction(const char *name = "func"); // Helpers for accessing structures and arrays public: diff -r 90de0ba94422 ports/hotspot/src/share/vm/shark/sharkEntry.hpp --- a/ports/hotspot/src/share/vm/shark/sharkEntry.hpp Thu Feb 26 17:34:19 2009 +0000 +++ b/ports/hotspot/src/share/vm/shark/sharkEntry.hpp Thu Feb 26 18:29:12 2009 +0000 @@ -46,8 +46,6 @@ public: void print_statistics(const char* name) const PRODUCT_RETURN; -#ifndef PRODUCT - private: address code_start() const { return start; @@ -66,6 +64,4 @@ start = (address)FunctionStart; limit = (address)FunctionEnd; } - -#endif // !PRODUCT }; diff -r 90de0ba94422 ports/hotspot/src/share/vm/shark/sharkFunction.cpp --- a/ports/hotspot/src/share/vm/shark/sharkFunction.cpp Thu Feb 26 17:34:19 2009 +0000 +++ b/ports/hotspot/src/share/vm/shark/sharkFunction.cpp Thu Feb 26 18:29:12 2009 +0000 @@ -37,7 +37,7 @@ masm()->advance(sizeof(SharkEntry)); // Create the function - _function = builder()->CreateFunction(); + _function = builder()->CreateFunction(name()); entry->set_llvm_function(function()); #ifndef PRODUCT // FIXME: there should be a mutex when updating sharkEntry in case @@ -142,7 +142,7 @@ // target-specific. Args.push_back("-debug-only=" "x86-emitter"); else - Args.push_back("-debug-only="); + Args.push_back("-debug-only=" "none"); Args.push_back(0); // Null terminator. cl::ParseCommandLineOptions(Args.size()-1, (char**)&Args[0]); #endif @@ -150,6 +150,13 @@ // Compile to native code void *code = builder()->execution_engine()->getPointerToFunction(function()); + + // Register generated code for profiling, etc + if (JvmtiExport::should_post_dynamic_code_generated()) { + JvmtiExport::post_dynamic_code_generated + (name(), entry->code_start(), entry->code_limit()); + } + entry->set_entry_point((ZeroEntry::method_entry_t) code); if (SharkTraceInstalls) entry->print_statistics(name());
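One way to watch the registration added above in action is a small JVMTI agent that listens for the DynamicCodeGenerated events the VM posts. This is only an illustration of the consumer side, written against the standard JVMTI API; it is not part of the patch, and the file and symbol names are invented.

#include <string.h>
#include <stdio.h>
#include <jvmti.h>

// Called once per piece of dynamically generated code, e.g. each
// Shark-compiled method registered by post_dynamic_code_generated().
static void JNICALL dynamic_code_generated(jvmtiEnv *jvmti, const char *name,
                                           const void *address, jint length) {
  fprintf(stderr, "dynamic code: %s at %p (%d bytes)\n",
          name, address, (int) length);
}

extern "C" JNIEXPORT jint JNICALL
Agent_OnLoad(JavaVM *vm, char *options, void *reserved) {
  jvmtiEnv *jvmti = NULL;
  if (vm->GetEnv((void **) &jvmti, JVMTI_VERSION_1_0) != JNI_OK)
    return JNI_ERR;

  jvmtiEventCallbacks callbacks;
  memset(&callbacks, 0, sizeof(callbacks));
  callbacks.DynamicCodeGenerated = dynamic_code_generated;
  jvmti->SetEventCallbacks(&callbacks, sizeof(callbacks));
  jvmti->SetEventNotificationMode(JVMTI_ENABLE,
                                  JVMTI_EVENT_DYNAMIC_CODE_GENERATED, NULL);
  return JNI_OK;
}

Built as a shared library and loaded with -agentpath, it should print one line per registered method; profilers that understand JIT-generated code hook the same events through their own agents.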