From gbenson at redhat.com Sun Feb 8 06:49:10 2009 From: gbenson at redhat.com (Gary Benson) Date: Sun, 8 Feb 2009 06:49:10 -0800 (PST) Subject: Zero build passes TCK Message-ID: <20090208144910.20AED2C434@rexford.dreamhost.com> For the past three months the OpenJDK team at Red Hat has been working on a project to bring the Zero-assembler port of HotSpot to the point where IcedTea builds using Zero are capable of passing the Java SE 6 TCK. As a result of this work I am pleased to announce that the latest OpenJDK packages included in Fedora 10 for 32- and 64-bit PowerPC have passed the Java SE 6 TCK and are compatible with the Java SE 6 platform. This work was funded by Red Hat. From Edward.Nevill at arm.com Tue Feb 17 08:13:19 2009 From: Edward.Nevill at arm.com (Edward Nevill) Date: Tue, 17 Feb 2009 16:13:19 -0000 Subject: Optimised ARM assembler loop Message-ID: <757128096FE8CD4B9EC01C3B29CFB8431251CB@ZIPPY.Emea.Arm.com> Hi all, I have now completed converting the bytecodeInterpreterOpt.cpp in my previous email into hand crafted ARM assembler and uploaded to my PPA. This gives approx 100% performance improvement over the original Zero. I have conditionalised both sets of optimisations (the generic C optimisations and the Asm optimisations) so they only build on ARM. The conditionalisation is in ports/hotspot/build/linux/makefiles/zero.make # Not included in includeDB because it has no dependencies ifdef ICEDTEA_ZERO_BUILD # ECN - For the time being HOTSPOT_OPT only enabled for ARM # CFLAGS += -DHOTSPOT_OPT ifeq ($(ZERO_LIBARCH),arm) Obj_Files += bytecodeInterpreter_arm.o CFLAGS += -DHOTSPOT_OPT -DHOTSPOT_ASM endif endif If you want to try out the generic optimisations uncomment the CFLAGS += -DHOTSPOT_OPT line. Note: You will also need to update debian/rules or you won't even get the patches. Unconditionalise the lines ifneq (,$(filter $(DEB_HOST_ARCH),armel i386)) DISTRIBUTION_PATCHES += \ debian/patches/zero-port-opt-new.diff \ debian/patches/zero-hotspot-opt-new.diff endif (Of course, if you are running on an ARM platform you don't need to do this). The main change was the addition of the asm loop; however, there were a few minor changes.... (note these are additional changes to those described in my previous document). - Added a rule to openjdk/hotspot/make/linux/makefiles/rules.make o This allows compilation of .S to .o - In openjdk/hotspot/src/share/vm/interpreter/bytecodeInterpreter.cpp o Added conditionalisation to turn HOTSPOT_OPT off when building jvmti o Changed run_opt so it no longer takes a strange 'bmb' argument and always returns 0. The action is now always to re-execute the bytecode regardless of why it failed (eg exception etc). bytecodeInterpreterOpt.cpp is responsible for putting the world back in a state where the bytecode can be re-executed. Sources available in... https://launchpad.net/~enevill/+archive/ppa openjdk-6-6b14-0ubuntu14~ed01 Binaries available RSN, Regards, Ed. -- IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you. -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/zero-dev/attachments/20090217/351932ac/attachment.html From Edward.Nevill at arm.com Tue Feb 17 08:40:16 2009 From: Edward.Nevill at arm.com (Edward Nevill) Date: Tue, 17 Feb 2009 16:40:16 -0000 Subject: Improving the performance of OpenJDK Message-ID: <757128096FE8CD4B9EC01C3B29CFB8431251CD@ZIPPY.Emea.Arm.com> This seems to have got lost in the ether, ... resending From: Edward Nevill Sent: 17 February 2009 15:25 To: 'zero-dev at openjdk.java.net' Subject: Improving the performance of OpenJDK Hi All, I have been looking at ways of improving the performance of Zero I understand that performance was not one of the original goals of Zero, however... The performance of Zero is absolutely, incredibly, dire. To give an example, I tried an application ThinkFreeOffice (another office suite). To open a new blank word document by clicking on the 'New Document' icon (having already loaded ThinkFreeOffice), took ... 1 Minute
....
18 Seconds
... that's about as long as it takes the Stig to do a lap. So... I would like to open the discussion on improving Zero performance. Attached is an HTML document describing some simple changes I have made which provide ~50% improvement. Regards, Ed. -- IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/zero-dev/attachments/20090217/1e7e5933/attachment.html -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/zero-dev/attachments/20090217/1e7e5933/attachment.htm From aph at redhat.com Tue Feb 17 09:05:50 2009 From: aph at redhat.com (Andrew Haley) Date: Tue, 17 Feb 2009 17:05:50 +0000 Subject: Optimised ARM assembler loop In-Reply-To: <757128096FE8CD4B9EC01C3B29CFB8431251CB@ZIPPY.Emea.Arm.com> References: <757128096FE8CD4B9EC01C3B29CFB8431251CB@ZIPPY.Emea.Arm.com> Message-ID: <499AEE6E.5060706@redhat.com> Edward Nevill wrote: > I have now completed converting the bytecodeInterpreterOpt.cpp in my previous email into hand crafted ARM assembler and uploaded to my PPA. > This gives approx 100% performance improvement over the original Zero. Excellent. Can I make one tiny, teeny request? The first time I read that I assumed that 100% of the time was saved, and thus it was infinitely fast. :-) OK, I know that's not what it means. I presume it's now twice as fast as Zero. Andrew. From gbenson at redhat.com Wed Feb 18 02:43:35 2009 From: gbenson at redhat.com (Gary Benson) Date: Wed, 18 Feb 2009 10:43:35 +0000 Subject: Improving the performance of OpenJDK In-Reply-To: <757128096FE8CD4B9EC01C3B29CFB8431251CD@ZIPPY.Emea.Arm.com> References: <757128096FE8CD4B9EC01C3B29CFB8431251CD@ZIPPY.Emea.Arm.com> Message-ID: <20090218104335.GB3213@redhat.com> Hi Ed, I haven't looked into the code particularly -- it's pretty difficult to locate your stuff in that massive patch -- but here are my initial thoughts. Edward Nevill wrote: > Splitting the loop like this improves the code generated by gcc in a > number of ways. Firstly it improves register allocation because the > compiler is not trying to allocate registers across complex > code. This code is infrequently executed, but the compiler has no > way of knowing, and tends to give the complex code more priority for > register allocations (since it is the deepest, most nested piece of > code, it must be the most frequently executed, right? Wrong!!!). I don't know if this would make a huge difference, but there's a conditional, LOTS_OF_REGS, defined in bytecodeInterpreter_zero.hpp, that specifies register keywords for several variables in the bytecode interpreter's main loop. It might be worth turning it on for ARM and seeing if it has an effect. > The interpreter (as is) has two modes of operation, TaggedStacks and > not Tagged. A TaggedStack is one where in addition to the data on > the stack a tag is stored with each datum to say what type it is > (the main types we are interested in are 'a' and non 'a'). This > means that each stack element is 8 bytes.
The TaggedStack (as I > understand it) is only used by certain garbage collectors to > identify what elements on the stack are references and it is not the > default. As I understand it, the tagged stack interpreter was written because some applications had such complex code that locating the objects on the stack was taking a huge amount of time. It was a particular problem with automatically generated code, from JSPs for example. I hear it didn't particularly work well, and is pretty much out of favour now as the initial problem was worked around by some other means. I'm not sure it even works correctly in the C++ interpreter, and Zero certainly doesn't support it. It may be that we can just strip it out... > get_native_u2() and get_Java_u2() ... This seems to be a misguided > attempt of the original authors to optimised reading of halfwords > (judging by the comment immediate preceding the code). It's not an optimization, it's to do unaligned access on hardware that doesn't support it. I'm guessing ARM does allow unaligned access by the fact that your code didn't segfault instantly ;) We should probably optimize this for machines that allow it, given that it has the performance impact you describe. Does anyone know which machines do and do not allow it? AFAIK x86, x86_64 yes; ppc, ppc64 no; others? Cheers, Gary -- http://gbenson.net/ From aph at redhat.com Wed Feb 18 02:57:43 2009 From: aph at redhat.com (Andrew Haley) Date: Wed, 18 Feb 2009 10:57:43 +0000 Subject: Improving the performance of OpenJDK In-Reply-To: <20090218104335.GB3213@redhat.com> References: <757128096FE8CD4B9EC01C3B29CFB8431251CD@ZIPPY.Emea.Arm.com> <20090218104335.GB3213@redhat.com> Message-ID: <499BE9A7.3020405@redhat.com> Gary Benson wrote: > Hi Ed, > > I haven't looked into the code particularly -- it's pretty difficult > to locate your stuff in that massive patch -- but here are my initial > thoughts. > > Edward Nevill wrote: >> Splitting the loop like this improves the code generated by gcc in a >> number of ways. Firstly it improves register allocation because the >> compiler is not trying to allocate registers across complex >> code. This code is infrequently executed, but the compiler has no >> way of knowing, and tends to give the complex code more priority for >> register allocations (since it is the deepest, most nested piece of >> code, it must be the most frequently executed, right? Wrong!!!). > > I don't know if this would make a huge difference, but there's a > conditional, LOTS_OF_REGS, defined in bytecodeInterpreter_zero.hpp, > that specifies register keywords for several variables in the > bytecode interpreter's main loop. It might be worth turning it on > for ARM and seeing if it has an effect. I suspect it'd make things worse. ARM has only 16 registers, and some of those are fixed by the ABI. The idea of separating frequently-executed code from stuff that is only used occasionally is a good one. Every compiler, and certainly gc, finds it difficult to do a good job of allocating registers in a large routine. It's especially hard for ARM, which is register-starved. >> get_native_u2() and get_Java_u2() ... This seems to be a misguided >> attempt of the original authors to optimised reading of halfwords >> (judging by the comment immediate preceding the code). > > It's not an optimization, it's to do unaligned access on hardware that > doesn't support it. I'm guessing ARM does allow unaligned access by > the fact that your code didn't segfault instantly ;) ARM doesn't support unaligned loads. 
The new ARM code as posted is ldrsb r0, [java_pc, #0] ldrb r1, [java_pc, #1] orr r1, r1, r0, lsl #8 i.e two byte loads. Andrew. From gbenson at redhat.com Wed Feb 18 03:08:51 2009 From: gbenson at redhat.com (Gary Benson) Date: Wed, 18 Feb 2009 11:08:51 +0000 Subject: Improving the performance of OpenJDK In-Reply-To: <499BE9A7.3020405@redhat.com> References: <757128096FE8CD4B9EC01C3B29CFB8431251CD@ZIPPY.Emea.Arm.com> <20090218104335.GB3213@redhat.com> <499BE9A7.3020405@redhat.com> Message-ID: <20090218110851.GD3213@redhat.com> Andrew Haley wrote: > Gary Benson wrote: > > Edward Nevill wrote: > > > get_native_u2() and get_Java_u2() ... This seems to be a > > > misguided attempt of the original authors to optimised reading > > > of halfwords (judging by the comment immediate preceding the > > > code). > > > > It's not an optimization, it's to do unaligned access on hardware > > that doesn't support it. I'm guessing ARM does allow unaligned > > access by the fact that your code didn't segfault instantly ;) > > ARM doesn't support unaligned loads. The new ARM code as posted is > > ldrsb r0, [java_pc, #0] > ldrb r1, [java_pc, #1] > orr r1, r1, r0, lsl #8 > > i.e two byte loads. Ah, I didn't realise. Which is good, it means this optimization is generic :) Cheers, Gary -- http://gbenson.net/ From aph at redhat.com Wed Feb 18 03:19:15 2009 From: aph at redhat.com (Andrew Haley) Date: Wed, 18 Feb 2009 11:19:15 +0000 Subject: Improving the performance of OpenJDK In-Reply-To: <20090218110851.GD3213@redhat.com> References: <757128096FE8CD4B9EC01C3B29CFB8431251CD@ZIPPY.Emea.Arm.com> <20090218104335.GB3213@redhat.com> <499BE9A7.3020405@redhat.com> <20090218110851.GD3213@redhat.com> Message-ID: <499BEEB3.3010904@redhat.com> Gary Benson wrote: > Andrew Haley wrote: >> Gary Benson wrote: >>> Edward Nevill wrote: >>>> get_native_u2() and get_Java_u2() ... This seems to be a >>>> misguided attempt of the original authors to optimised reading >>>> of halfwords (judging by the comment immediate preceding the >>>> code). >>> It's not an optimization, it's to do unaligned access on hardware >>> that doesn't support it. I'm guessing ARM does allow unaligned >>> access by the fact that your code didn't segfault instantly ;) >> ARM doesn't support unaligned loads. The new ARM code as posted is >> >> ldrsb r0, [java_pc, #0] >> ldrb r1, [java_pc, #1] >> orr r1, r1, r0, lsl #8 >> >> i.e two byte loads. > > Ah, I didn't realise. Which is good, it means this optimization is > generic :) Right. The whole idea of the way it's don ATM is bonkers: do a byte- at-a-time unaligned load into machine order, then reverse the bytes. Maybe the hope was that the compiler would see all this cruft and silently convert it into an efficient form, but, er, no. :-( Andrew. From gbenson at redhat.com Wed Feb 18 03:37:46 2009 From: gbenson at redhat.com (Gary Benson) Date: Wed, 18 Feb 2009 11:37:46 +0000 Subject: Improving the performance of OpenJDK In-Reply-To: <499BEEB3.3010904@redhat.com> References: <757128096FE8CD4B9EC01C3B29CFB8431251CD@ZIPPY.Emea.Arm.com> <20090218104335.GB3213@redhat.com> <499BE9A7.3020405@redhat.com> <20090218110851.GD3213@redhat.com> <499BEEB3.3010904@redhat.com> Message-ID: <20090218113746.GE3213@redhat.com> Andrew Haley wrote: > Right. The whole idea of the way it's don ATM is bonkers: do a > byte- at-a-time unaligned load into machine order, then reverse the > bytes. Maybe the hope was that the compiler would see all this > cruft and silently convert it into an efficient form, but, er, no. 
> :-( How does it work for longer types? For a 64-bit value, for instance, is it better to always do 8 individual loads, or might it be better to try and optimize like they have done? Cheers, Gary -- http://gbenson.net/ From aph at redhat.com Wed Feb 18 04:37:19 2009 From: aph at redhat.com (Andrew Haley) Date: Wed, 18 Feb 2009 12:37:19 +0000 Subject: Improving the performance of OpenJDK In-Reply-To: <20090218113746.GE3213@redhat.com> References: <757128096FE8CD4B9EC01C3B29CFB8431251CD@ZIPPY.Emea.Arm.com> <20090218104335.GB3213@redhat.com> <499BE9A7.3020405@redhat.com> <20090218110851.GD3213@redhat.com> <499BEEB3.3010904@redhat.com> <20090218113746.GE3213@redhat.com> Message-ID: <499C00FF.8010203@redhat.com> Gary Benson wrote: > Andrew Haley wrote: >> Right. The whole idea of the way it's don ATM is bonkers: do a >> byte- at-a-time unaligned load into machine order, then reverse the >> bytes. Maybe the hope was that the compiler would see all this >> cruft and silently convert it into an efficient form, but, er, no. >> :-( > > How does it work for longer types? For a 64-bit value, for instance, > is it better to always do 8 individual loads, or might it be better > to try and optimize like they have done? It depends on the frequency of execution. If the alignment is no better than random with a uniform distribution, then half the time you'll be looking at an address that is not aligned for any type larger than a byte. If so, there's no point checking for special cases. It is, however, worth avoiding 64-bit operations on 32-bit platforms, so this is probably the best way to do a 64-bit big-endian load, at least on gcc: unsigned long long foo6 (unsigned char *p) { unsigned long u1; u1 = ((unsigned long)p[0] << 24 | (unsigned long)p[1] << 16 | (unsigned long)p[2] << 8 | p[3]); unsigned long u2; u2 = ((unsigned long)p[4] << 24 | (unsigned long)p[5] << 16 | (unsigned long)p[6] << 8 | p[7]); return ((unsigned long long)u1<<32 | u2); } The code generated here may be significantly better on 32-bit platforms than the equivalent that uses unsigned long long, and not significantly worse on 64-bit platforms. Andrew. From Edward.Nevill at arm.com Wed Feb 18 05:48:09 2009 From: Edward.Nevill at arm.com (Edward Nevill) Date: Wed, 18 Feb 2009 13:48:09 -0000 Subject: backedge checks Message-ID: <757128096FE8CD4B9EC01C3B29CFB8431251E6@ZIPPY.Emea.Arm.com> Hi folks, In bytecodeInterpreter.c is this wonderful macro #define DO_BACKEDGE_CHECKS(skip, branch_pc) \ if ((skip) <= 0) { \ if (UseCompiler && UseLoopCounter) { \ bool do_OSR = UseOnStackReplacement; \ BACKEDGE_COUNT->increment(); \ if (do_OSR) do_OSR = BACKEDGE_COUNT->reached_InvocationLimit(); \ if (do_OSR) { \ nmethod* osr_nmethod; \ OSR_REQUEST(osr_nmethod, branch_pc); \ if (osr_nmethod != NULL && osr_nmethod->osr_entry_bci() != InvalidOSREntryBci) { \ intptr_t* buf; \ CALL_VM(buf=SharedRuntime::OSR_migration_begin(THREAD), handle_exception); \ istate->set_msg(do_osr); \ istate->set_osr_buf((address)buf); \ istate->set_osr_entry(osr_nmethod->osr_entry()); \ return; \ } \ } else { \ INCR_INVOCATION_COUNT; \ SAFEPOINT; \ } \ } /* UseCompiler ... */ \ INCR_INVOCATION_COUNT; \ SAFEPOINT; \ } This macro is invoked in every branch (although the body is only executed on backwards branches). Firstly, is it just me, or does it do INCR_INVOCATION_COUNT; SAFEPOINT twice on every backwards branch. Surely a mistake? Secondly, can someone tell me under what circumstances in zero UseCompiler would ever be true (and no, it doesn't resolve to a constant. 
Thirdly, it should be possible to avoid the SAFEPOINT checks. SAFEPOINT does... #define SAFEPOINT \ if ( SafepointSynchronize::is_synchronizing()) { \ { \ /* zap freed handles rather than GC'ing them */ \ HandleMarkCleaner __hmc(THREAD); \ } \ CALL_VM(SafepointSynchronize::block(THREAD), handle_exception); \ } However, there is an upcall notice_safepoints() which safepoint.cpp calls whenever it is setting is_synchronizing(). This upcall is there to notify the interpreter that it needs to run to a GC safepoint. A corresponding call ignore_safepoints() is called when the interpreter can safely ignore safepoints. So there should be no need for the interpreter to continually check for safepoints. notice_safepoints() and ignore_safepoints() in the template interpreter do indeed do something sensible. However, in the bytecode Interpreter the implementation is just {} The way it would work is... When compiling gcc bytecodeInterpreter.cpp uses a dispatch table rather than a switch statement. Currently this is defined as const static void* const opclabels_data[256] = { /* 0x00 */ &&opc_nop, &&opc_aconst_null,&&opc_iconst_m1,&&opc_iconst_0, ... }; register uintptr_t *dispatch_table = (uintptr_t*)&opclabels_data[0]; We would change this to const static void* opclabels_data[256] = { /* 0x00 */ &&opc_nop, &&opc_aconst_null,&&opc_iconst_m1,&&opc_iconst_0, ... }; IE. I have just removed the 'const' because we are going to change this dynamically We would then define two sets of handlers for branch instructions struct branch_dispatch { int bytecode; /* The bytecode */ void *handler; /* The handler for it */ }; typedef struct branch_dispatch branch_dispatch; branch_dispatch safe_branch_dispatch_table[] = { { Bytecodes::_ifle, safe_ifle_handler }, { Bytecodes::_ifgt, safe_ifgt_handler }, ... }; branch_dispatch unsafe_branch_dispatch_table[] = { { Bytecodes::_ifle, unsafe_ifle_handler }, { Bytecodes::_ifgt, unsafe_ifgt_handler }, ... } notice_safepoints() and ignore_safepoints() then become void update_table(branch_dispatch *p, branch_dispatch *pend) { do { opclabels_data[p->bytecode] = p->handler; } while (p++ < pend); } notice_safepoints() { update_table(safe_branch_dispatch_table, safe_branch_dispatch_table + sizeof(safe_branch_dispatch_table)/sizeof(branch_dispatch)); } ignore_safepoint() { update_table(unsafe_branch_dispatch_table, unsafe_branch_dispatch_table + sizeof(unsafe_branch_dispatch_table)/sizeof(branch_dispatch)); } Finally, can someone enlighten me as to the purpose of INCR_INVOCATION_COUNT, if we can get rid of that we could get rid of all backedge checks. Regards, Ed. -- IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/zero-dev/attachments/20090218/3cf42a48/attachment.html From aph at redhat.com Wed Feb 18 06:16:23 2009 From: aph at redhat.com (Andrew Haley) Date: Wed, 18 Feb 2009 14:16:23 +0000 Subject: backedge checks In-Reply-To: <757128096FE8CD4B9EC01C3B29CFB8431251E6@ZIPPY.Emea.Arm.com> References: <757128096FE8CD4B9EC01C3B29CFB8431251E6@ZIPPY.Emea.Arm.com> Message-ID: <499C1837.2070509@redhat.com> Hi Ed, Your mail looked like it went through a rubbish compactor. I reformatted it. 
------------------------------------------------------------------------------------------------------- Hi folks, In bytecodeInterpreter.c is this wonderful macro #define DO_BACKEDGE_CHECKS(skip, branch_pc)\ if ((skip) <= 0) {\ if (UseCompiler && UseLoopCounter) {\ bool do_OSR = UseOnStackReplacement;\ BACKEDGE_COUNT->increment();\ if (do_OSR) do_OSR = BACKEDGE_COUNT->reached_InvocationLimit();\ if (do_OSR) {\ nmethod* osr_nmethod;\ OSR_REQUEST(osr_nmethod, branch_pc);\ if (osr_nmethod != NULL && osr_nmethod->osr_entry_bci() !=InvalidOSREntryBci) { \ intptr_t* buf;\ CALL_VM(buf=SharedRuntime::OSR_migration_begin(THREAD), handle_exception); \ istate->set_msg(do_osr);\ istate->set_osr_buf((address)buf);\ istate->set_osr_entry(osr_nmethod->osr_entry());\ return;\ }\ } else {\ INCR_INVOCATION_COUNT;\ SAFEPOINT;\ }\ } /* UseCompiler ... */\ INCR_INVOCATION_COUNT;\ SAFEPOINT;\ } This macro is invoked in every branch (although the body is only executed on backwards branches). Firstly, is it just me, or does it do INCR_INVOCATION_COUNT; SAFEPOINT twice on every backwards branch. Surely a mistake? Secondly, can someone tell me under what circumstances in zero UseCompiler would ever be true (and no, it doesn't resolve to a constant. Thirdly, it should be possible to avoid the SAFEPOINT checks. SAFEPOINT does... #define SAFEPOINT \ if ( SafepointSynchronize::is_synchronizing()) { \ { \ /* zap freed handles rather than GC'ing them */ \ HandleMarkCleaner __hmc(THREAD); \ } \ CALL_VM(SafepointSynchronize::block(THREAD), handle_exception); \ } However, there is an upcall notice_safepoints() which safepoint.cpp calls whenever it is setting is_synchronizing(). This upcall is there to notify the interpreter that it needs to run to a GC safepoint. A corresponding call ignore_safepoints() is called when the interpreter can safely ignore safepoints. So there should be no need for the interpreter to continually check for safepoints. notice_safepoints() and ignore_safepoints() in the template interpreter do indeed do something sensible. However, in the bytecode Interpreter the implementation is just {} The way it would work is... When compiling gcc bytecodeInterpreter.cpp uses a dispatch table rather than a switch statement. Currently this is defined as const static void* const opclabels_data[256] = { /* 0x00 */ &&opc_nop, &&opc_aconst_null,&&opc_iconst_m1,&&opc_iconst_0, ... }; register uintptr_t *dispatch_table = (uintptr_t*)&opclabels_data[0]; We would change this to const static void* opclabels_data[256] = { /* 0x00 */ &&opc_nop, &&opc_aconst_null,&&opc_iconst_m1,&&opc_iconst_0, ... }; IE. I have just removed the 'const' because we are going to change this dynamically We would then define two sets of handlers for branch instructions struct branch_dispatch { int bytecode; /* The bytecode */ void *handler; /* The handler for it */ }; typedef struct branch_dispatch branch_dispatch; branch_dispatch safe_branch_dispatch_table[] = { { Bytecodes::_ifle,safe_ifle_handler }, { Bytecodes::_ifgt,safe_ifgt_handler }, ... }; branch_dispatch unsafe_branch_dispatch_table[] = { { Bytecodes::_ifle,unsafe_ifle_handler }, { Bytecodes::_ifgt,unsafe_ifgt_handler }, ... 
} notice_safepoints() and ignore_safepoints() then become void update_table(branch_dispatch *p, branch_dispatch *pend) { do { opclabels_data[p->bytecode] = p->handler; } while (p++ < pend); } notice_safepoints() { update_table(safe_branch_dispatch_table, safe_branch_dispatch_table + sizeof(safe_branch_dispatch_table)/sizeof(branch_dispatch)); } ignore_safepoint() { update_table(unsafe_branch_dispatch_table, unsafe_branch_dispatch_table + sizeof(unsafe_branch_dispatch_table)/sizeof(branch_dispatch)); } Finally, can someone enlighten me as to the purpose of INCR_INVOCATION_COUNT, if we can get rid of that we could get rid of all backedge checks. Regards, Ed. -- IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you. From aph at redhat.com Wed Feb 18 09:24:50 2009 From: aph at redhat.com (Andrew Haley) Date: Wed, 18 Feb 2009 17:24:50 +0000 Subject: backedge checks In-Reply-To: <499C1837.2070509@redhat.com> References: <757128096FE8CD4B9EC01C3B29CFB8431251E6@ZIPPY.Emea.Arm.com> <499C1837.2070509@redhat.com> Message-ID: <499C4462.4080809@redhat.com> > Hi folks, > > In bytecodeInterpreter.c is this wonderful macro > > #define DO_BACKEDGE_CHECKS(skip, branch_pc)\ > if ((skip) <= 0) {\ > if (UseCompiler && UseLoopCounter) {\ > bool do_OSR = UseOnStackReplacement;\ > BACKEDGE_COUNT->increment();\ > if (do_OSR) do_OSR = BACKEDGE_COUNT->reached_InvocationLimit();\ > if (do_OSR) {\ > nmethod* osr_nmethod;\ > OSR_REQUEST(osr_nmethod, branch_pc);\ > if (osr_nmethod != NULL && osr_nmethod->osr_entry_bci() !=InvalidOSREntryBci) { \ > intptr_t* buf;\ > CALL_VM(buf=SharedRuntime::OSR_migration_begin(THREAD), handle_exception); \ > istate->set_msg(do_osr);\ > istate->set_osr_buf((address)buf);\ > istate->set_osr_entry(osr_nmethod->osr_entry());\ > return;\ > }\ > } else {\ > INCR_INVOCATION_COUNT;\ > SAFEPOINT;\ > }\ > } /* UseCompiler ... */\ > INCR_INVOCATION_COUNT;\ > SAFEPOINT;\ > } > > > This macro is invoked in every branch (although the body is only > executed on backwards branches). > > Firstly, is it just me, or does it do INCR_INVOCATION_COUNT; SAFEPOINT > twice on every backwards branch. Surely a mistake? It looks like that to me. > Secondly, can someone tell me under what circumstances in zero > UseCompiler would ever be true (and no, it doesn't resolve to a > constant. This stuff is AFAIK used to decide when to compile a method with Shark. > Thirdly, it should be possible to avoid the SAFEPOINT checks. SAFEPOINT > does... > > #define SAFEPOINT \ > if ( SafepointSynchronize::is_synchronizing()) { \ > { \ > /* zap freed handles rather than GC'ing them */ \ > HandleMarkCleaner __hmc(THREAD); \ > } \ > CALL_VM(SafepointSynchronize::block(THREAD), handle_exception); \ > } > > > However, there is an upcall notice_safepoints() which safepoint.cpp > calls whenever it is setting is_synchronizing(). This upcall is there to > notify the interpreter that it needs to run to a GC safepoint. A > corresponding call ignore_safepoints() is called when the interpreter > can safely ignore safepoints. So there should be no need for the > interpreter to continually check for safepoints. > > notice_safepoints() and ignore_safepoints() in the template interpreter > do indeed do something sensible. 
However, in the bytecode Interpreter > the implementation is just {} > > The way it would work is... > > When compiling gcc bytecodeInterpreter.cpp uses a dispatch table rather > than a switch statement. Currently this is defined as > > const static void* const opclabels_data[256] = { > /* 0x00 */ &&opc_nop, > &&opc_aconst_null,&&opc_iconst_m1,&&opc_iconst_0, > ... > }; > register uintptr_t *dispatch_table = (uintptr_t*)&opclabels_data[0]; > > We would change this to > > const static void* opclabels_data[256] = { > /* 0x00 */ &&opc_nop, > &&opc_aconst_null,&&opc_iconst_m1,&&opc_iconst_0, > ... > }; > > IE. I have just removed the 'const' because we are going to change this > dynamically > We would then define two sets of handlers for branch instructions > > struct branch_dispatch { > int bytecode; /* The bytecode */ > void *handler; /* The handler for it */ > }; > typedef struct branch_dispatch branch_dispatch; > > branch_dispatch safe_branch_dispatch_table[] = { > { Bytecodes::_ifle,safe_ifle_handler }, > { Bytecodes::_ifgt,safe_ifgt_handler }, > ... > }; > > branch_dispatch unsafe_branch_dispatch_table[] = { > { Bytecodes::_ifle,unsafe_ifle_handler }, > { Bytecodes::_ifgt,unsafe_ifgt_handler }, > ... > } > > notice_safepoints() and ignore_safepoints() then become > > void update_table(branch_dispatch *p, branch_dispatch *pend) > { > do { opclabels_data[p->bytecode] = p->handler; } while (p++ < pend); > } > > notice_safepoints() > { > update_table(safe_branch_dispatch_table, > safe_branch_dispatch_table + > sizeof(safe_branch_dispatch_table)/sizeof(branch_dispatch)); > } > > ignore_safepoint() > { > update_table(unsafe_branch_dispatch_table, > unsafe_branch_dispatch_table + > sizeof(unsafe_branch_dispatch_table)/sizeof(branch_dispatch)); > } Hmm, this doesn't look like it's thread safe to me. It would make more sense to have a pointer to the branch dispatch table that's updated atomically. > Finally, can someone enlighten me as to the purpose of > INCR_INVOCATION_COUNT, if we can get rid of that we could get rid of all > backedge checks. Surely the compile broker uses that count. Andrew. From gbenson at redhat.com Thu Feb 19 06:38:01 2009 From: gbenson at redhat.com (Gary Benson) Date: Thu, 19 Feb 2009 14:38:01 +0000 Subject: Improving the performance of OpenJDK In-Reply-To: <20090218110851.GD3213@redhat.com> References: <757128096FE8CD4B9EC01C3B29CFB8431251CD@ZIPPY.Emea.Arm.com> <20090218104335.GB3213@redhat.com> <499BE9A7.3020405@redhat.com> <20090218110851.GD3213@redhat.com> Message-ID: <20090219143801.GB31986@redhat.com> Gary Benson wrote: > Andrew Haley wrote: > > Gary Benson wrote: > > > Edward Nevill wrote: > > > > get_native_u2() and get_Java_u2() ... This seems to be a > > > > misguided attempt of the original authors to optimised reading > > > > of halfwords (judging by the comment immediate preceding the > > > > code). > > > > > > It's not an optimization, it's to do unaligned access on > > > hardware that doesn't support it. I'm guessing ARM does allow > > > unaligned access by the fact that your code didn't segfault > > > instantly ;) > > > > ARM doesn't support unaligned loads. The new ARM code as posted > > is > > > > ldrsb r0, [java_pc, #0] > > ldrb r1, [java_pc, #1] > > orr r1, r1, r0, lsl #8 > > > > i.e two byte loads. > > Ah, I didn't realise. Which is good, it means this optimization is > generic :) So I did a quick SPECjvm98 run of this change on amd64. A couple of the times are improved, by 3-10%, but one of the times is *slower* by 11%. See attached. 
I'm not sure what to make of that... Cheers, Gary -- http://gbenson.net/ -------------- next part -------------- ,times/zero-92c4cc753f06,times/zero-92c4cc753f06-u2 compress,231.02,255.489 jess,70.363,64.099 db,116.143,112.242 javac,72.624,69.744 mpegaudio,229.505,226.654 mtrt,59.302,59.274 jack,49.195,44.505 -------------- next part -------------- A non-text attachment was scrubbed... Name: result.png Type: image/png Size: 2155 bytes Desc: not available Url : http://mail.openjdk.java.net/pipermail/zero-dev/attachments/20090219/15362c33/attachment.png From gbenson at redhat.com Thu Feb 19 07:02:56 2009 From: gbenson at redhat.com (Gary Benson) Date: Thu, 19 Feb 2009 15:02:56 +0000 Subject: Improving the performance of OpenJDK In-Reply-To: <20090219143801.GB31986@redhat.com> References: <757128096FE8CD4B9EC01C3B29CFB8431251CD@ZIPPY.Emea.Arm.com> <20090218104335.GB3213@redhat.com> <499BE9A7.3020405@redhat.com> <20090218110851.GD3213@redhat.com> <20090219143801.GB31986@redhat.com> Message-ID: <20090219150256.GC31986@redhat.com> Gary Benson wrote: ... > > > > Edward Nevill wrote: > > > > > get_native_u2() and get_Java_u2() ... This seems to be a > > > > > misguided attempt of the original authors to optimised > > > > > reading of halfwords (judging by the comment immediate > > > > > preceding the code). ... > So I did a quick SPECjvm98 run of this change on amd64. A couple of > the times are improved, by 3-10%, but one of the times is *slower* > by 11%. See attached. I'm not sure what to make of that... I'm wondering if rewriting get_Java_u2() to read directly rather than read and swap is speeding it up, but removing the optimization from get_native_u2() is slowing it down. I'm going to try this with the original get_native_u2() and with get_Java_u2() just a copy of the big-endian get_native_u2(). Cheers, Gary -- http://gbenson.net/ From ed at camswl.com Thu Feb 19 11:27:06 2009 From: ed at camswl.com (Edward Nevill) Date: Thu, 19 Feb 2009 19:27:06 GMT Subject: backedge checks Message-ID: <200902191927.n1JJR6H9003614@parsley.camswl.com> Hi, Apologies for the legal notices that appeared in my previous emails. I am now doing this from my home machine over a wet piece of string. Email address is now either ed at camswl.com or ed at starksfield.org (both resolve to the same machine). >> INCR_INVOCATION_COUNT;\ >> SAFEPOINT;\ >> }\ >> } /* UseCompiler ... */\ >> INCR_INVOCATION_COUNT;\ >> SAFEPOINT;\ >> } >> >> >> This macro is invoked in every branch (although the body is only >> executed on backwards branches). >> >> Firstly, is it just me, or does it do INCR_INVOCATION_COUNT; SAFEPOINT >> twice on every backwards branch. Surely a mistake? > >It looks like that to me. OK. Can someone with committal rights remove the first instance. >> Secondly, can someone tell me under what circumstances in zero >> UseCompiler would ever be true (and no, it doesn't resolve to a >> constant. > >This stuff is AFAIK used to decide when to compile a method with Shark. OK. So Shark is using the bytecode interpreter rather than the template interpreter? Is this correct Gary? In that case it is even more important we do something about the dire performance. Regardless, when we are building zero (interpreted) this should be defined to be constant false. At the moment it is a global. Aside: In general access to globals is very expensive (more expensive than you might believe in a PIC world). If we need to do checks on globals can we at least do.. { type *UseCompiler_ptr = &UseCompiler; ... if (*UseCompiler_ptr) { ... 
} } This is partly doing the compiler's work for it, but sometimes it needs help. >> notice_safepoints() >> { >> update_table(safe_branch_dispatch_table, >> safe_branch_dispatch_table + >> sizeof(safe_branch_dispatch_table)/sizeof(branch_dispatch)); >> } >> >> ignore_safepoint() >> { >> update_table(unsafe_branch_dispatch_table, >> unsafe_branch_dispatch_table + >> sizeof(unsafe_branch_dispatch_table)/sizeof(branch_dispatch)); >> } > Hmm, this doesn't look like it's thread safe to me. It would make more > sense to have a pointer to the branch dispatch table that's updated > atomically. There is a problem with updating the pointer to the table rather than the contents of the table. Hopefully, the pointer to the table is residing in an ARM register (or MIPS or ...). register uintptr_t *dispatch_table = ... In order to update the pointer we would first of all need to remove the register specifier, then we need to make it a static global, otherwise we can't access it in 'notice_safepoints'. For good measure we need to declare it volatile to make sure the compiler can't do anything clever. Yeuch (remember this pointer is accessed for every single bytecode). I believe the above is thread safe. In 'notice_safepoints' the VM state is 'synchronizing'. On return from 'notice_safepoints' the VM state is changed to 'synchronized' (provided this was the last thread to synchronise). Until the VM state is 'synchronized', the VM cannot make any assumptions as to whether the interpreter is holding pointers to objects in registers or local storage. Therefore unsafe operations such as GC are prohibited until the VM state is 'synchronized'. During execution of 'notice_safepoints' half the handlers may point to safe handlers, half may point to unsafe handlers. However, this will not cause any problems. The safe and unsafe handlers are compatible in operation; it's just that the unsafe handlers don't call 'SafepointSynchronize::block(THREAD)'. Now assume there is a pre-emptive thread swap while in the middle of 'notice_safepoints'. The new thread then tries an unsafe operation (eg a GC request) which results in 'notice_safepoints' being called again. The result is that the table may end up getting updated twice. In the really paranoid case every single thread may be sitting in 'notice_safepoints'. notice_safepoints() { /* <- thread swap occurs here */ } This is no different from the case where 'notice_safepoints' actually updates the table. The VM is still not synchronised, and we have to either wait until control returns or until someone else calls notice_safepoints. Remember the state is not set to synchronised until all threads have synchronised. On exit, before calling ignore_safepoints the VM is placed in an unsynchronised state. Therefore it doesn't matter whether we call the safe or unsafe branch handler. The safe branch handler simply does an extra check on 'is_synchronizing' which returns false. The only thing that could really ruin your day is if word write is not atomic, and I think there is a lot of code that will break if word write is not atomic. >> Finally, can someone enlighten me as to the purpose of >> INCR_INVOCATION_COUNT, if we can get rid of that we could get rid of all >> backedge checks. > Surely the compile broker uses that count. Yes, is the compiler the only thing that is using this? Is it used for profiling, or ...
We need to move to a world where we are not doing all these redundant checks with all the attendant crud when the check trips to a world where we have a simple VM which just handles the common default case. If someone does require all this crud then we simply direct them to BytecodeInterpreter::run_with_all_unnecessary_crud(..) Regards, Ed From gbenson at redhat.com Thu Feb 19 07:49:17 2009 From: gbenson at redhat.com (Gary Benson) Date: Thu, 19 Feb 2009 15:49:17 +0000 Subject: backedge checks In-Reply-To: <499C1837.2070509@redhat.com> References: <757128096FE8CD4B9EC01C3B29CFB8431251E6@ZIPPY.Emea.Arm.com> <499C1837.2070509@redhat.com> Message-ID: <20090219154917.GD31986@redhat.com> Edward Nevill wrote: > In bytecodeInterpreter.c is this wonderful macro > > #define DO_BACKEDGE_CHECKS(skip, branch_pc)\ > if ((skip) <= 0) {\ > if (UseCompiler && UseLoopCounter) {\ > bool do_OSR = UseOnStackReplacement;\ > BACKEDGE_COUNT->increment();\ > if (do_OSR) do_OSR = BACKEDGE_COUNT->reached_InvocationLimit();\ > if (do_OSR) {\ > nmethod* osr_nmethod;\ > OSR_REQUEST(osr_nmethod, branch_pc);\ > if (osr_nmethod != NULL && osr_nmethod->osr_entry_bci() !=InvalidOSREntryBci) { \ > intptr_t* buf;\ > CALL_VM(buf=SharedRuntime::OSR_migration_begin(THREAD), handle_exception); \ > istate->set_msg(do_osr);\ > istate->set_osr_buf((address)buf);\ > istate->set_osr_entry(osr_nmethod->osr_entry());\ > return;\ > }\ > } else {\ > INCR_INVOCATION_COUNT;\ > SAFEPOINT;\ > }\ > } /* UseCompiler ... */\ > INCR_INVOCATION_COUNT;\ > SAFEPOINT;\ > } > > > This macro is invoked in every branch (although the body is only > executed on backwards branches). > > Firstly, is it just me, or does it do INCR_INVOCATION_COUNT; > SAFEPOINT twice on every backwards branch. Surely a mistake? It looks that way. I'll take it out... > Secondly, can someone tell me under what circumstances in zero > UseCompiler would ever be true (and no, it doesn't resolve to a > constant. It's true whenever you're using Shark. Shark doesn't do OSR at the moment though, so most of that macro is unentered. > Thirdly, it should be possible to avoid the SAFEPOINT > checks. SAFEPOINT does... > > #define SAFEPOINT \ > if ( SafepointSynchronize::is_synchronizing()) { \ > { \ > /* zap freed handles rather than GC'ing them */ \ > HandleMarkCleaner __hmc(THREAD); \ > } \ > CALL_VM(SafepointSynchronize::block(THREAD), handle_exception); \ > } > > However, there is an upcall notice_safepoints() which safepoint.cpp > calls whenever it is setting is_synchronizing(). This upcall is > there to notify the interpreter that it needs to run to a GC > safepoint. A corresponding call ignore_safepoints() is called when > the interpreter can safely ignore safepoints. So there should be no > need for the interpreter to continually check for safepoints. > > notice_safepoints() and ignore_safepoints() in the template > interpreter do indeed do something sensible. However, in the > bytecode Interpreter the implementation is just {} Sadly the one person who could answer that question categorically is no longer with us, so it'll be trial and error. If you make a patch I can run it past the TCK for you, if you like... > Finally, can someone enlighten me as to the purpose of > INCR_INVOCATION_COUNT, if we can get rid of that we could get rid of > all backedge checks. It's how methods are selected for compilation, when you're running with a JIT. Methods are considered for compilation when their invocation count passes a certain threshold. 
Cheers, Gary -- http://gbenson.net/ From aph at redhat.com Thu Feb 19 08:14:55 2009 From: aph at redhat.com (Andrew Haley) Date: Thu, 19 Feb 2009 16:14:55 +0000 Subject: backedge checks In-Reply-To: <200902191927.n1JJR6H9003614@parsley.camswl.com> References: <200902191927.n1JJR6H9003614@parsley.camswl.com> Message-ID: <499D857F.5090905@redhat.com> Edward Nevill wrote: > Apologies for the legal notices that appeared in my previous emails. I am now doing this > from my home machine over a wet piece of string. Email address is now either ed at camswl.com > or ed at starksfield.org (both resolve to the same machine). OK. >>> INCR_INVOCATION_COUNT;\ >>> SAFEPOINT;\ >>> }\ >>> } /* UseCompiler ... */\ >>> INCR_INVOCATION_COUNT;\ >>> SAFEPOINT;\ >>> } >>> >>> >>> This macro is invoked in every branch (although the body is only >>> executed on backwards branches). >>> >>> Firstly, is it just me, or does it do INCR_INVOCATION_COUNT; SAFEPOINT >>> twice on every backwards branch. Surely a mistake? >> It looks like that to me. > > OK. Can someone with committal rights remove the first instance. >>> Secondly, can someone tell me under what circumstances in zero >>> UseCompiler would ever be true (and no, it doesn't resolve to a >>> constant. >> This stuff is AFAIK used to decide when to compile a method with Shark. > > OK. So Shark is using the bytecode interpreter rather than the template interpreter? > Is this correct Gary? > In that case it is even more important we do something about the dire performance. > > Regardless, when we are building zero (interpreted) this should be defined to be constant > false. At the moment it is a global. In purely interpreted mode, yes. > This is no different from the case where 'notice_safepoints' actually updates the table. > The VM is still not synchronised, and we have to either wait until control returns or > until someone else calls notice_safepoints. Remember the state is not set to synchronised > until all threads have synchronised. > > On exit, before calling ignore_safepoints the VM is places in an unsynchronised state. > Therefore it doesn't matter whether we call the safe or unsafe branch handler. The > safe branch handler simply does an extra check on 'is_synchronizing' which returns false. > > The only thing that could really ruin your day is if word write is not atomic, and I think > there is a lot of code that will break is word write is not atomic. Yes, right. This sounds like it'll work. I admit that this sounds wrong, when it could be done simply by switching a single pointer, but a word read from memory to get the address of the table on every bytecode isn't such a great idea either. >>> Finally, can someone enlighten me as to the purpose of >>> INCR_INVOCATION_COUNT, if we can get rid of that we could get rid of >> all >>> backedge checks. >> Surely the compile broker uses that count. > > Yes, is the compiler the only thing that is using this? Is it used for profiling, or ... > > We need to move to a world where we are not doing all these redundant checks with all the > attendant crud when the check trips to a world where we have a simple VM which just > handles the common default case. > > If someone does require all this crud then we simply direct them to > > BytecodeInterpreter::run_with_all_unnecessary_crud(..) Yes, but it would surely be nice to have a decent fast interpreter when we're running with Shark. There ought to be some reasonable compromise. Andrew. 
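As a rough illustration of the two alternatives weighed in this exchange -- patching the branch entries of a writable dispatch table in place (Ed's scheme) versus atomically swapping a single table pointer (Andrew's suggestion) -- here is a minimal sketch in modern C++. The names (opclabels_data, safe_table, unsafe_table, dispatch_table_ptr) and the use of std::atomic are illustrative assumptions, not the actual HotSpot declarations; the real interpreter dispatches through computed-goto labels inside BytecodeInterpreter::run().

    // Sketch only: illustrative names, not the actual HotSpot code.
    #include <atomic>

    typedef void* handler_t;                   // stands in for a computed-goto label address

    static handler_t opclabels_data[256];      // Ed's scheme: a writable table whose entries are patched
    static handler_t safe_table[256];          // branch handlers that perform the SAFEPOINT check
    static handler_t unsafe_table[256];        // branch handlers without the check

    // Ed's proposal: notice_safepoints()/ignore_safepoints() rewrite only the
    // branch-bytecode slots of the live table, relying on word stores being atomic.
    static void patch_entry(int bytecode, handler_t handler) {
      opclabels_data[bytecode] = handler;
    }

    // Andrew's alternative: keep both tables immutable and swap one pointer.
    static std::atomic<handler_t*> dispatch_table_ptr{unsafe_table};

    static void notice_safepoints_alt() { dispatch_table_ptr.store(safe_table, std::memory_order_release); }
    static void ignore_safepoints_alt() { dispatch_table_ptr.store(unsafe_table, std::memory_order_release); }

    // The cost Andrew points out: the dispatch loop must re-load the pointer on
    // every bytecode, e.g.
    //   handler_t* table = dispatch_table_ptr.load(std::memory_order_acquire);
    //   goto *table[*pc];                     // GCC computed goto, as in bytecodeInterpreter.cpp

The pointer swap is trivially thread safe but adds one load per dispatched bytecode; in-place patching keeps the table address in a register but relies on aligned word stores being atomic and visible, which is the point contested below.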
From gbenson at redhat.com Thu Feb 19 08:34:47 2009 From: gbenson at redhat.com (Gary Benson) Date: Thu, 19 Feb 2009 16:34:47 +0000 Subject: Improving the performance of OpenJDK In-Reply-To: <20090219150256.GC31986@redhat.com> References: <757128096FE8CD4B9EC01C3B29CFB8431251CD@ZIPPY.Emea.Arm.com> <20090218104335.GB3213@redhat.com> <499BE9A7.3020405@redhat.com> <20090218110851.GD3213@redhat.com> <20090219143801.GB31986@redhat.com> <20090219150256.GC31986@redhat.com> Message-ID: <20090219163447.GA30223@redhat.com> Gary Benson wrote: > Gary Benson wrote: > ... > > > > > Edward Nevill wrote: > > > > > > get_native_u2() and get_Java_u2() ... This seems to be a > > > > > > misguided attempt of the original authors to optimised > > > > > > reading of halfwords (judging by the comment immediate > > > > > > preceding the code). > ... > > So I did a quick SPECjvm98 run of this change on amd64. A couple > > of the times are improved, by 3-10%, but one of the times is > > *slower* by 11%. See attached. I'm not sure what to make of > > that... > > I'm wondering if rewriting get_Java_u2() to read directly rather > than read and swap is speeding it up, but removing the optimization > from get_native_u2() is slowing it down. I'm going to try this with > the original get_native_u2() and with get_Java_u2() just a copy of > the big-endian get_native_u2(). So this way around it's a little more encouraging; two of the times are 8-10% faster, one is 5% faster. Some of the times are still slower though, though not by as much, maybe 1-2%. It's still disturbingly ambiguous though... thoughts? Cheers, Gary -- http://gbenson.net/ From aph at redhat.com Thu Feb 19 08:40:10 2009 From: aph at redhat.com (Andrew Haley) Date: Thu, 19 Feb 2009 16:40:10 +0000 Subject: backedge checks In-Reply-To: <200902191927.n1JJR6H9003614@parsley.camswl.com> References: <200902191927.n1JJR6H9003614@parsley.camswl.com> Message-ID: <499D8B6A.80809@redhat.com> Edward Nevill wrote: >> Hmm, this doesn't look like it's thread safe to me. It would make more >> sense to have a pointer to the branch dispatch table that's updated >> atomically. > > There is a problem with updating the pointer to the table rather than the contents > of the table. > > Hopefully, the pointer to the table is residing in an ARM register (or MIPS or ...). > > register uintptr_t *dispatch_table = ... > > In order to update the pointer we would first of all need to remove the register specifier, > then we need to may it a static global, otherwise we cant access it in 'notice_safepoints'. > For good measure we need to declare it volatile to make sure the compiler cant do anything > clever. Yeuch (remember this pointer is accessed for every single bytecode). > > I believe the above is thread safe. > > In 'notice_safepoints' the VM state is 'synchronizing'. On return from 'notice_safepoints' > the VM state is changed to 'synchronized' (provided this was the last thread to synchronise). > > Until the VM state is 'synchronized', the VM cannot make any assumptions as to whether > the interpreter is holding pointer to objects in registers or local storage. > > Therefore unsafe operations such as GC are prohibited until the VM state is 'synchronized'. > > During execution of 'notice_safepoints' half the handlers may point to safe handlers, half > may point to unsafe handlers. However, this will not cause any problems. The safe and > unsafe handlers are compatible in operation, its just the unsafe handlers dont call > 'SafepointSynchronize::block(THREAD)'. 
> > Now assume there is a pre-emptive thread swap while in the middle of 'notice_safepoints'. > The new thread then tries an unsafe operation (eg a GC request) which results in > 'notice_safepoints' being called again. > > The result is that the table may end up getting updated twice. In the really paranoid case > every single thread may be sitting in 'notice_safepoints'. > > notice_safepoints() { > /* <- thread swap occurs here */ > } > > This is no different from the case where 'notice_safepoints' actually updates the table. > The VM is still not synchronised, and we have to either wait until control returns or > until someone else calls notice_safepoints. Remember the state is not set to synchronised > until all threads have synchronised. > > On exit, before calling ignore_safepoints the VM is places in an unsynchronised state. > Therefore it doesn't matter whether we call the safe or unsafe branch handler. The > safe branch handler simply does an extra check on 'is_synchronizing' which returns false. > > The only thing that could really ruin your day is if word write is not atomic, and I think > there is a lot of code that will break is word write is not atomic. I knew this wasn't right, and I just realized why. Every processor has its own data cache, and these caches may not be coherent. You could potentially wait for ever before a processor would read the changed dispatch table. I think this could probably be solved by sending a signal to every thread that is supposed to move to a safepoint. If there's any additional processing needed the signal handler can do it. Also, the thread updating the table would have to flush its own caches/write buffers/etc. Finally, unless the table itself is marked as volatile there's no guarantee that it would be written before the flush. Andrew. From aph at redhat.com Thu Feb 19 08:47:55 2009 From: aph at redhat.com (Andrew Haley) Date: Thu, 19 Feb 2009 16:47:55 +0000 Subject: Improving the performance of OpenJDK In-Reply-To: <20090219163447.GA30223@redhat.com> References: <757128096FE8CD4B9EC01C3B29CFB8431251CD@ZIPPY.Emea.Arm.com> <20090218104335.GB3213@redhat.com> <499BE9A7.3020405@redhat.com> <20090218110851.GD3213@redhat.com> <20090219143801.GB31986@redhat.com> <20090219150256.GC31986@redhat.com> <20090219163447.GA30223@redhat.com> Message-ID: <499D8D3B.3010302@redhat.com> Gary Benson wrote: > Gary Benson wrote: >> Gary Benson wrote: >> ... >>>>>> Edward Nevill wrote: >>>>>>> get_native_u2() and get_Java_u2() ... This seems to be a >>>>>>> misguided attempt of the original authors to optimised >>>>>>> reading of halfwords (judging by the comment immediate >>>>>>> preceding the code). >> ... >>> So I did a quick SPECjvm98 run of this change on amd64. A couple >>> of the times are improved, by 3-10%, but one of the times is >>> *slower* by 11%. See attached. I'm not sure what to make of >>> that... >> I'm wondering if rewriting get_Java_u2() to read directly rather >> than read and swap is speeding it up, but removing the optimization >> from get_native_u2() is slowing it down. I'm going to try this with >> the original get_native_u2() and with get_Java_u2() just a copy of >> the big-endian get_native_u2(). > > So this way around it's a little more encouraging; two of the times > are 8-10% faster, one is 5% faster. Some of the times are still > slower though, though not by as much, maybe 1-2%. It's still > disturbingly ambiguous though... thoughts? Big test runs like SPECjvm98 are too coarse for this kind of thing. 
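(To make the "too coarse" point concrete, here is a self-contained sketch of the sort of focused test that could isolate just this change: timing a byte-at-a-time big-endian 16-bit read against a native load plus byte swap over deliberately unaligned offsets. The helpers below are illustrative stand-ins, not the actual get_Java_u2()/get_native_u2() from HotSpot, and the absolute numbers would be compiler- and machine-specific.)

    // Illustrative microbenchmark only; not the HotSpot helpers.
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <cstring>
    #include <vector>

    static inline uint16_t read_u2_bytewise(const uint8_t* p) {
      return (uint16_t)((p[0] << 8) | p[1]);            // big-endian result, two byte loads
    }

    static inline uint16_t read_u2_swap(const uint8_t* p) {
      uint16_t v;
      std::memcpy(&v, p, sizeof v);                     // native (possibly unaligned) halfword load
      return (uint16_t)((v >> 8) | (v << 8));           // byte swap, assuming a little-endian host
    }

    template <uint16_t (*Read)(const uint8_t*)>
    static uint64_t run(const std::vector<uint8_t>& buf, int iters) {
      uint64_t sum = 0;
      for (int i = 0; i < iters; i++)
        for (size_t off = 0; off + 2 <= buf.size(); off++)   // hits odd (unaligned) offsets on purpose
          sum += Read(&buf[off]);
      return sum;
    }

    int main() {
      std::vector<uint8_t> buf(4096);
      for (size_t i = 0; i < buf.size(); i++) buf[i] = (uint8_t)i;
      for (int pass = 0; pass < 2; pass++) {
        auto t0 = std::chrono::steady_clock::now();
        uint64_t s = pass ? run<read_u2_swap>(buf, 20000) : run<read_u2_bytewise>(buf, 20000);
        auto t1 = std::chrono::steady_clock::now();
        long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        std::printf("%s: %lld ms (checksum %llu)\n",
                    pass ? "read-and-swap" : "byte-wise", ms, (unsigned long long)s);
      }
      return 0;
    }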
It's pretty unbelievable that with the original get_native_u2() and the new, definitely faster, get_Java_u2() anything can actually get slower. I'd bet you're looking at noise. 1-2% is very hard to measure reliably on a multi-user system. Andrew. From gbenson at redhat.com Thu Feb 19 09:43:46 2009 From: gbenson at redhat.com (Gary Benson) Date: Thu, 19 Feb 2009 17:43:46 +0000 Subject: Improving the performance of OpenJDK In-Reply-To: <499D8D3B.3010302@redhat.com> References: <757128096FE8CD4B9EC01C3B29CFB8431251CD@ZIPPY.Emea.Arm.com> <20090218104335.GB3213@redhat.com> <499BE9A7.3020405@redhat.com> <20090218110851.GD3213@redhat.com> <20090219143801.GB31986@redhat.com> <20090219150256.GC31986@redhat.com> <20090219163447.GA30223@redhat.com> <499D8D3B.3010302@redhat.com> Message-ID: <20090219174346.GC30223@redhat.com> Andrew Haley wrote: > Gary Benson wrote: > > So this way around it's a little more encouraging; two of the > > times are 8-10% faster, one is 5% faster. Some of the times are > > still slower though, though not by as much, maybe 1-2%. It's > > still disturbingly ambiguous though... thoughts? > > Big test runs like SPECjvm98 are too coarse for this kind of thing. Is there something else I could try instead? I was thinking about trying it with DaCapo tomorrow, that's a little more real-world IMO. > It's pretty unbelievable that with the original get_native_u2() and > the new, definitely faster, get_Java_u2() anything can actually get > slower. I'd bet you're looking at noise. 1-2% is very hard to > measure reliably on a multi-user system. True. I'm just wary of declaring it "definitely faster" if, when measured, it isn't obviously so.
It may be that the optimization of > doing aligned loads 50% of the time and then swapping is actually > faster than reading the bytes individually... Hard to say. The real way to find out is to write a test that *only* does this thing. Andrew. From ed at camswl.com Thu Feb 19 16:51:05 2009 From: ed at camswl.com (Edward Nevill) Date: Fri, 20 Feb 2009 00:51:05 GMT Subject: Profiling Message-ID: <200902200051.n1K0p5YK004736@parsley.camswl.com> Hi folks, I have been doing some profing on the vm using gprof with some interesting results. To do this I set LINK_INTO=AOUT to generate a static 'gamma' launcher with the appropriate -pg flags set. However, I have not been able to do this for the other libraries (eg libjava.so) only for libjvm.so so we do not get any profiling information on the native libraries which is what I really wanted. The problem is that unlike libjvm.so libjava.so and friends are not linked in, instead they are opened and referenced using dll_open and friends. So I cannot just statically link in libjava.so. Does anyone know of a way to get gprof information out of libjava.so. Anyway, on to the results I got. The benchmarks I used were CaffeineMark, EEMBC and ThinkFreeOffic (a pure Java office suite). I ran each benchmark with the original build, the split interpreter loop (in C), and the split interpreter loop with the split recoded in ARM asm. The complete results are available here http://camswl.com/openjdk/profiles/flat_ecm_original http://camswl.com/openjdk/profiles/flat_ecm_opt http://camswl.com/openjdk/profiles/flat_ecm_asm http://camswl.com/openjdk/profiles/flat_eembc_original http://camswl.com/openjdk/profiles/flat_eembc_opt http://camswl.com/openjdk/profiles/flat_eembc_asm http://camswl.com/openjdk/profiles/flat_office_original http://camswl.com/openjdk/profiles/flat_office_opt http://camswl.com/openjdk/profiles/flat_office_asm Here is a summary, with some discussion. The main point I think is that we are wasting our time optimising the VM for 'semi real world' applications like Think Free Office. That doesn't mean it is not worthwhile impriving the VM, but there are classes of application for which it will make little/no difference. --- CaffeineMark --- Original % cumulative self self total time seconds seconds calls s/call s/call name 84.12 8.53 8.53 2033574 0.00 0.00 BytecodeInterpreter::run(BytecodeInterpreter*) 2.27 8.76 0.23 486 0.00 0.00 CppInterpreter::main_loop(int, Thread*) ----------------------------- As expected almost all the time spent in 'run' --- CaffeineMark --- Split interpreter loop (run_opt in C) % cumulative self self total time seconds seconds calls s/call s/call name 76.80 7.47 7.47 2093698 0.00 0.00 BytecodeInterpreter::run_opt(BytecodeInterpreter*) 4.63 7.92 0.45 1872003 0.00 0.00 BytecodeInterpreter::run(BytecodeInterpreter*) 4.01 8.31 0.39 489 0.00 0.00 CppInterpreter::main_loop(int, Thread*) ----------------------------- OK, this looks good. 76% of the time is now spent in the simple interpreter loop, ready for optimisation. Note: you will notice that the time in seconds remains approx the same in CaffeineMark because CaffeineMark runs for a fixed time rather that a fixed number of iterations. 
--- CaffeineMark --- Asm interpreter loop (run_opt in ARM asm) % cumulative self self total time seconds seconds calls ms/call ms/call name 77.15 7.70 7.70 BytecodeInterpreter::run_opt(BytecodeInterpreter*) 5.71 8.27 0.57 2136095 0.00 0.00 BytecodeInterpreter::run(BytecodeInterpreter*) 2.51 8.52 0.25 490 0.51 1.65 CppInterpreter::main_loop(int, Thread*) ------------------------------ Beginning to flatten somewhat, but still mainly in 'run_opt'. Still plenty of scope for optimisation. Note: The Caffeinemark score for asm was approx 1.9X original. Now, lookup at EEMBC --- EEMBC --- Original % cumulative self self total time seconds seconds calls s/call s/call name 59.07 73.29 73.29 100435660 0.00 0.00 BytecodeInterpreter::run(BytecodeInterpreter*) 11.29 87.29 14.01 1850 0.01 0.02 CppInterpreter::main_loop(int, Thread*) 4.55 92.94 5.65 47872597 0.00 0.00 methodOopDesc::result_type() const 3.83 97.69 4.75 47882070 0.00 0.00 SignatureIterator::iterate_returntype() 3.32 101.82 4.13 47096159 0.00 0.00 InterpreterFrame::build(ZeroStack*, methodOopDesc*, JavaThread*) 2.80 105.30 3.48 94194346 0.00 0.00 os::vm_page_size() 2.36 108.23 2.93 46326866 0.00 0.00 CppInterpreter::normal_entry(methodOopDesc*, int, Thread*) --------------------------- 60% in 'run', note the relatively large % in 'main_loop'. This is because each invoke and return actually returns (in C) to 'main_loop', the invoke or return is then handled in 'main_loop' which then calls 'run' again. Hence the huge number of calls to 'run'. --- EEMBC --- Split interpreter loop (run_opt in C) % cumulative self self total time seconds seconds calls s/call s/call name 31.33 39.03 39.03 113138096 0.00 0.00 BytecodeInterpreter::run_opt(BytecodeInterpreter*) 22.84 67.48 28.45 100435423 0.00 0.00 BytecodeInterpreter::run(BytecodeInterpreter*) 12.78 83.39 15.92 1842 0.01 0.03 CppInterpreter::main_loop(int, Thread*) 5.31 90.00 6.61 47872423 0.00 0.00 methodOopDesc::result_type() const 4.43 95.52 5.52 47096055 0.00 0.00 InterpreterFrame::build(ZeroStack*, methodOopDesc*, JavaThread*) 3.77 100.22 4.70 47881896 0.00 0.00 SignatureIterator::iterate_returntype() 2.99 103.95 3.73 47897743 0.00 0.00 SignatureIterator::parse_type() 2.50 107.07 3.12 94194122 0.00 0.00 os::vm_page_size() 2.20 109.81 2.75 46326782 0.00 0.00 CppInterpreter::normal_entry(methodOopDesc*, int, Thread*) 1.88 112.16 2.35 74343657 0.00 0.00 ConstantPoolCacheEntry::is_resolved(Bytecodes::Code) const ---------------------------- Flattening out with 54% spent in 'run' and 'run_opt'. Note the oddity 'vm_page_size()'. What is making almost 1E8 calls to vm_page_size()??? I don't think it is going to change since the last time they called it. 
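On the vm_page_size() oddity: if the value genuinely cannot change after startup, the repeated out-of-line calls could be paid for once and then become a plain load. A self-contained sketch only, using sysconf() as a stand-in for os::vm_page_size(); where the calls actually come from, and whether the cost is the call itself rather than the work inside it, is an assumption here, not a diagnosis.

#include <unistd.h>

// Stand-in for os::vm_page_size(): an out-of-line query whose result is
// fixed for the life of the process.
static long query_page_size() { return sysconf(_SC_PAGESIZE); }

// Hot-path version: the query happens once; every later call is a load.
static long cached_page_size() {
  static const long page_size = query_page_size();
  return page_size;
}

The same effect can be had by hoisting the call into a local variable at whichever call site the callgraph profile eventually points to.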
--- EEMBC --- Asm interpreter loop (run_opt in ARM asm) % cumulative self self total time seconds seconds calls s/call s/call name 25.74 28.29 28.29 BytecodeInterpreter::run_opt(BytecodeInterpreter*) 25.02 55.77 27.49 100435695 0.00 0.00 BytecodeInterpreter::run(BytecodeInterpreter*) 13.82 70.96 15.19 1842 0.01 0.02 CppInterpreter::main_loop(int, Thread*) 5.48 76.98 6.02 47872566 0.00 0.00 methodOopDesc::result_type() const 4.51 81.93 4.96 47096184 0.00 0.00 InterpreterFrame::build(ZeroStack*, methodOopDesc*, JavaThread*) 4.20 86.54 4.61 47882020 0.00 0.00 SignatureIterator::iterate_returntype() 2.99 89.83 3.29 46326896 0.00 0.00 CppInterpreter::normal_entry(methodOopDesc*, int, Thread*) 2.89 93.01 3.18 94194380 0.00 0.00 os::vm_page_size() 2.17 95.39 2.38 47897875 0.00 0.00 SignatureIterator::parse_type() 1.80 97.37 1.98 74343844 0.00 0.00 ConstantPoolCacheEntry::is_resolved(Bytecodes::Code) const 1.73 99.27 1.90 47884816 0.00 0.00 SignatureIterator::SignatureIterator(symbolHandle) ---------------------------------- OK. So now just 50% in 'run' and 'run_opt' with only 25% in 'run_opt'. This severly limits further optimisation by improving 'run_opt'. Even if we made run_opt go 10 X faster it would only improve performance by about 20%. --- ThinkFreeOffice --- Original % cumulative self self total time seconds seconds calls s/call s/call name 42.12 21.35 21.35 33855747 0.00 0.00 BytecodeInterpreter::run(BytecodeInterpreter*) 9.70 26.27 4.92 34204 0.00 0.00 CppInterpreter::main_loop(int, Thread*) 5.17 28.89 2.62 17783120 0.00 0.00 methodOopDesc::result_type() const 4.54 31.19 2.30 16484931 0.00 0.00 InterpreterFrame::build(ZeroStack*, methodOopDesc*, JavaThread*) 4.48 33.46 2.27 17847099 0.00 0.00 SignatureIterator::iterate_returntype() 2.49 34.72 1.26 15239483 0.00 0.00 CppInterpreter::normal_entry(methodOopDesc*, int, Thread*) 2.41 35.94 1.22 18044932 0.00 0.00 SignatureIterator::parse_type() 2.05 36.98 1.04 33005883 0.00 0.00 os::vm_page_size() 1.47 37.72 0.75 1245317 0.00 0.00 CppInterpreter::native_entry(methodOopDesc*, int, Thread*) 1.44 38.45 0.73 17891344 0.00 0.00 SignatureIterator::SignatureIterator(symbolHandle) 1.09 39.00 0.55 1773458 0.00 0.00 CppInterpreter::accessor_entry(methodOopDesc*, int, Thread*) -------------------------------- Note how relatively flat it is to start with. Only 42% in 'run'. And this is not the whole truth. Think Office is spending a lot of time in native methods (look at the number of calls to 'native_entry') but these are not shown in the profile because I cant link against libjava.so. So the 42% of the time is 42% of the time spent in the JVM itself, not in the whole of openjdk. Look at native_entry. 2.4% of the time spent in native_entry itself but 37 seconds attributable to native_entry & all its descendants. 
--- ThinkFreeOffice --- Split interpreter loop (run_opt in C) % cumulative self self total time seconds seconds calls s/call s/call name 23.24 12.92 12.92 34292092 0.00 0.00 BytecodeInterpreter::run(BytecodeInterpreter*) 18.08 22.96 10.05 33659712 0.00 0.00 BytecodeInterpreter::run_opt(BytecodeInterpreter*) 10.66 28.89 5.93 34512 0.00 0.00 CppInterpreter::main_loop(int, Thread*) 5.90 32.17 3.28 17997457 0.00 0.00 methodOopDesc::result_type() const 4.14 34.47 2.30 16685449 0.00 0.00 InterpreterFrame::build(ZeroStack*, methodOopDesc*, JavaThread*) 4.07 36.73 2.26 18064129 0.00 0.00 SignatureIterator::iterate_returntype() 3.17 38.49 1.76 18260032 0.00 0.00 SignatureIterator::parse_type() 2.23 39.73 1.24 15425660 0.00 0.00 CppInterpreter::normal_entry(methodOopDesc*, int, Thread*) 1.96 40.82 1.09 23829897 0.00 0.00 ConstantPoolCacheEntry::is_resolved(Bytecodes::Code) const 1.75 41.79 0.97 33407124 0.00 0.00 os::vm_page_size() 1.40 42.57 0.78 18106431 0.00 0.00 SignatureIterator::SignatureIterator(symbolHandle) 1.26 43.27 0.70 1259656 0.00 0.00 CppInterpreter::native_entry(methodOopDesc*, int, Thread*) --------------------------------------------------------- Note, interestingly, only 18% of the time is spent in run_opt, with 23% spent in run because of the huge number of invokes and returns (presumably to those native methods). --- ThinkFreeOffice --- Asm interpreter loop (run_opt in ARM asm) % cumulative self self total time seconds seconds calls s/call s/call name 24.38 12.23 12.23 34428228 0.00 0.00 BytecodeInterpreter::run(BytecodeInterpreter*) 14.14 19.32 7.09 BytecodeInterpreter::run_opt(BytecodeInterpreter*) 11.36 25.01 5.70 34395 0.00 0.00 CppInterpreter::main_loop(int, Thread*) 6.08 28.06 3.05 18104834 0.00 0.00 methodOopDesc::result_type() const 4.99 30.56 2.50 18170808 0.00 0.00 SignatureIterator::iterate_returntype() 4.15 32.64 2.08 16765945 0.00 0.00 InterpreterFrame::build(ZeroStack*, methodOopDesc*, JavaThread*) 2.57 33.93 1.29 18369885 0.00 0.00 SignatureIterator::parse_type() 2.53 35.20 1.27 23874942 0.00 0.00 ConstantPoolCacheEntry::is_resolved(Bytecodes::Code) const 2.29 36.35 1.15 33568034 0.00 0.00 os::vm_page_size() 2.05 37.38 1.03 15479805 0.00 0.00 CppInterpreter::normal_entry(methodOopDesc*, int, Thread*) 1.46 38.11 0.73 18214073 0.00 0.00 SignatureIterator::SignatureIterator(symbolHandle) 1.39 38.81 0.70 1286010 0.00 0.00 CppInterpreter::native_entry(methodOopDesc*, int, Thread*) ------------------------------ Not much scope for improvement with only 14% spent in run_opt. What I would really like to do is profile libjava.so so I can find out what it is doing in those native methods. Any help on how to do this much appreciated. Regards, Ed. From ed at camswl.com Thu Feb 19 17:39:10 2009 From: ed at camswl.com (Edward Nevill) Date: Fri, 20 Feb 2009 01:39:10 GMT Subject: Improving the performance of OpenJDK Message-ID: <200902200139.n1K1dAst004899@parsley.camswl.com> > Gary Benson wrote: > > Gary Benson wrote: > > ... > > > > > > Edward Nevill wrote: > > > > > > > get_native_u2() and get_Java_u2() ... This seems to be a > > > > > > > misguided attempt of the original authors to optimised > > > > > > > reading of halfwords (judging by the comment immediate > > > > > > > preceding the code). > > ... > > > So I did a quick SPECjvm98 run of this change on amd64. A couple > > > of the times are improved, by 3-10%, but one of the times is > > > *slower* by 11%. See attached. I'm not sure what to make of > > > that...
> > > > I'm wondering if rewriting get_Java_u2() to read directly rather > > than read and swap is speeding it up, but removing the optimization > > from get_native_u2() is slowing it down. I'm going to try this with > > the original get_native_u2() and with get_Java_u2() just a copy of > > the big-endian get_native_u2(). > > So this way around it's a little more encouraging; two of the times > are 8-10% faster, one is 5% faster. Some of the times are still > slower though, though not by as much, maybe 1-2%. It's still > disturbingly ambiguous though... thoughts? I can believe on some architectures that get_native_u2() is faster. However, x86 supports unaligned accesses, so the code should do something like: #ifdef TARGET_SUPPORTS_UNALIGNED_ACCESSES return *p; #else /* Do current crud, or something better */ #endif The main killer is the get_Java_u2(). The stupid unsigned hw load, followed by the sign extension, followed by a full word bytesex reversal. Question: Why is the class data in the wrong endianness in the first place? The data affected is the data embedded in the Java bytecode and the data in the constant pool. Why does the classfile loader not just correct the endianness on load? It has to do verification anyway, so it has to trawl through the classfile; it might as well correct the endianness. Regards, Ed. From ed at camswl.com Thu Feb 19 17:46:59 2009 From: ed at camswl.com (Edward Nevill) Date: Fri, 20 Feb 2009 01:46:59 GMT Subject: backedge checks Message-ID: <200902200146.n1K1kxWK004933@parsley.camswl.com> > > I knew this wasn't right, and I just realized why. Every processor has its own > data cache, and these caches may not be coherent. You could potentially wait > for ever before a processor would read the changed dispatch table. > > I think this could probably be solved by sending a signal to every thread that > is supposed to move to a safepoint. If there's any additional processing needed > the signal handler can do it. > > Also, the thread updating the table would have to flush its own caches/write buffers/etc. > > Finally, unless the table itself is marked as volatile there's no guarantee that > it would be written before the flush. If you seriously think any of this code is MP safe at the thread level then I suspect you have been doing some recreational pharmaceuticals :-) Think about it: any update to the thread list needs to be MP safe. Even the setting of the _state variable needs to be MP safe, and it isn't; it just does _state = _synchronizing; Exactly the same problems apply to this, it may not be updated in the cache of another processor. Even the kernel's user level atomic compare and swap is not MP safe (it may be MP safe on some implementations but it is not guaranteed so), and you are going to have a hard job making your program MP safe if your atomic operations are not MP safe. Regards, Ed. From ed at camswl.com Fri Feb 20 06:57:32 2009 From: ed at camswl.com (Edward Nevill) Date: Fri, 20 Feb 2009 14:57:32 GMT Subject: Bytecode profiling Message-ID: <200902201457.n1KEvWQt007786@parsley.camswl.com> Hi folks, Another day, more profiling. I have persuaded gprof to tell me the % time spent executing each bytecode in my ASM loop. This is more useful than the simple bytecode counter in the CPP interpreter. Some interesting results. I have just included summaries here down to the 1% level. If anyone wants the full profile I can send it. At the bottom of each bytecode profile I have included other functions down to the 1% level for comparison. First CaffeineMark.
Fairly much as expected. I'm surprised it spends quite that amount of time in getfield. I'll have another look at that. Also notice the high % in iload_0. This is probably a lie. It is probably aload_0, the asm VM does not distinguish aload_0, it just goes to iload_0. So the sequence is probably aload_0, getfield. ----- ECM --- Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ms/call ms/call name 11.99 1.21 1.21 do_getfield 7.19 1.94 0.73 do_iload 4.96 3.06 0.50 do_iaload 4.56 3.52 0.46 do_ifne 4.56 3.98 0.46 do_iload_0 4.36 4.42 0.44 do_iconst_1 3.32 4.75 0.34 do_iload_2 2.97 5.05 0.30 do_if_icmplt 2.78 5.33 0.28 do_istore 2.58 5.87 0.26 do_ixor 2.38 6.11 0.24 do_laload 1.78 6.48 0.18 do_iadd 1.73 6.66 0.18 do_iinc 1.68 7.00 0.17 do_iload_1 1.54 7.15 0.16 do_iload_3 1.49 7.45 0.15 do_iconst_0 1.44 7.60 0.15 do_if_icmpgt 1.39 7.74 0.14 do_if_icmpge 1.19 7.86 0.12 do_dadd 1.19 7.98 0.12 do_ifle 1.19 8.10 0.12 do_istore_2 1.14 8.21 0.12 do_lastore 1.09 8.43 0.11 do_lload 6.14 2.56 0.62 2136180 0.00 0.00 BytecodeInterpreter::run(BytecodeInterpreter*) 2.78 5.61 0.28 490 0.57 1.85 CppInterpreter::main_loop(int, Thread*) 1.88 6.30 0.19 1099047 0.00 0.00 InterpreterFrame::build(ZeroStack*, methodOopDesc*, JavaThread*) 1.68 6.83 0.17 1218147 0.00 0.00 methodOopDesc::result_type() const 1.49 7.30 0.15 1224036 0.00 0.00 SignatureIterator::iterate_returntype() 1.09 8.32 0.11 18512 0.01 0.01 typeArrayKlass::allocate(int, Thread*) ---------------------------- Next, EEMBC. Only 5 bytecodes above the 1% level. Notice the time spent in os::vm_page_size(), 2.93%, more than any bytecode apart from do_getfield. --- EEMBC ----- Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 3.54 61.03 3.84 do_getfield 3.27 64.57 3.54 run_opt_entry 1.80 79.09 1.95 do_iload 1.58 80.80 1.72 do_iload_0 1.16 82.06 1.26 do_iload_1 1.06 83.21 1.15 do_if_icmplt 24.60 26.67 26.67 100435622 0.00 0.00 BytecodeInterpreter::run(BytecodeInterpreter*) 13.87 41.70 15.03 1847 0.01 0.02 CppInterpreter::main_loop(int, Thread*) 5.77 47.95 6.25 47872572 0.00 0.00 methodOopDesc::result_type() const 4.58 52.91 4.97 47096141 0.00 0.00 InterpreterFrame::build(ZeroStack*, methodOopDesc*, JavaThread*) 3.95 57.19 4.28 47882026 0.00 0.00 SignatureIterator::iterate_returntype() 2.93 67.75 3.18 94194307 0.00 0.00 os::vm_page_size() 2.64 70.61 2.86 46326855 0.00 0.00 CppInterpreter::normal_entry(methodOopDesc*, int, Thread*) 2.06 72.84 2.23 47897882 0.00 0.00 SignatureIterator::parse_type() 2.03 75.04 2.20 74343829 0.00 0.00 ConstantPoolCacheEntry::is_resolved(Bytecodes::Code) const 1.94 77.14 2.10 47884823 0.00 0.00 SignatureIterator::SignatureIterator(symbolHandle) ------------------------------ Finally, Think Office. Only 1 bytecode above the 1% level. And it spends more time doing os::vm_page_size() than any bytecode. I must get the callgraph profile and find out who is calling it 33E6 times. Note the time in run_opt_entry. This is because of all the invokes and returns. Perhaps I have just deoptimised the VM for Think Office and it is just spending its time bouncing between 'run' and 'run_opt'. --- Office --- Flat profile: Each sample counts as 0.01 seconds.
% cumulative self self total time seconds seconds calls s/call s/call name 2.12 31.06 1.07 run_opt_entry 1.39 34.18 0.70 do_getfield 25.50 12.88 12.88 34862862 0.00 0.00 BytecodeInterpreter::run(BytecodeInterpreter*) 11.26 18.57 5.69 35015 0.00 0.00 CppInterpreter::main_loop(int, Thread*) 6.36 21.78 3.21 18326576 0.00 0.00 methodOopDesc::result_type() const 4.59 24.10 2.32 16971754 0.00 0.00 InterpreterFrame::build(ZeroStack*, methodOopDesc*, JavaThread*) 4.30 26.27 2.17 18391356 0.00 0.00 SignatureIterator::iterate_returntype() 2.67 27.62 1.35 18587041 0.00 0.00 SignatureIterator::parse_type() 2.36 28.81 1.19 24191043 0.00 0.00 ConstantPoolCacheEntry::is_resolved(Bytecodes::Code) const 2.34 29.99 1.18 15669590 0.00 0.00 CppInterpreter::normal_entry(methodOopDesc*, int, Thread*) 1.94 32.04 0.98 33980388 0.00 0.00 os::vm_page_size() 1.46 32.77 0.74 1302034 0.00 0.00 CppInterpreter::native_entry(methodOopDesc*, int, Thread*) 1.41 33.48 0.71 18435409 0.00 0.00 SignatureIterator::SignatureIterator(symbolHandle) -------------------- For now.... Ed From aph at redhat.com Fri Feb 20 03:10:52 2009 From: aph at redhat.com (Andrew Haley) Date: Fri, 20 Feb 2009 11:10:52 +0000 Subject: backedge checks In-Reply-To: <200902200146.n1K1kxWK004933@parsley.camswl.com> References: <200902200146.n1K1kxWK004933@parsley.camswl.com> Message-ID: <499E8FBC.7020805@redhat.com> Edward Nevill wrote: >> I knew this wasn't right, and I just realized why. Every processor has its own >> data cache, and these caches may not be coherent. You could potentially wait >> for ever before a processor would read the changed dispatch table. >> >> I think this could probably be solved by sending a signal to every thread that >> is supposed to move to a safepoint. If there's any additional processing needed >> the signal handler can do it. >> >> Also, the thread updating the table would have to flush its own caches/write buffers/etc. >> >> Finally, unless the table itself is marked as volatile there's no guarantee that >> it would be written before the flush. > > If you seriously think any of this code is MP safe at the thread level > then I suspect you have been doing some recreational pharmaceuticals:-) > > Think about is, any update to the thread list needs to be MP safe. > Even the setting of the _state variable needs to me MP safe and its isn't > it just does > > _state = _synchronizing; It wouldn't work if that were all it did: there are memory barriers there too. But anyway, it looks like the template interpreter's notice_safepoints() does simply: static inline void copy_table(address* from, address* to, int size) { // Copy non-overlapping tables. The copy has to occur word wise for MT safety. while (size-- > 0) *to++ = *from++; } void TemplateInterpreter::notice_safepoints() { if (!_notice_safepoints) { // switch to safepoint dispatch table _notice_safepoints = true; copy_table((address*)&_safept_table, (address*)&_active_table, sizeof(_active_table) / sizeof(address)); } } so I can't see any reason for the C++ interpreter not to do the same. I do wonder how this works in general, though. It does seem like the code assumes memory semantics stronger than those guaranteed by POSIX threads. Maybe the template interpreter in hotspot isn't used on boxes with weakly consistent caches? That seems unlikely, though. I'll ask. > Exactly the same problems apply to this, it may not be updated in the cache > of another processor. 
> > Even the kernels user level atomic compare and swap is not MP safe (it may > be MP safe on some implementations but it is not guaranteed so), and you > are going to have a hard job making your program MP safe if your atomic > operations are not MP safe. The atomic operations are MP safe by definition. Andrew. From ed at camswl.com Fri Feb 20 07:23:03 2009 From: ed at camswl.com (Edward Nevill) Date: Fri, 20 Feb 2009 15:23:03 GMT Subject: Optimising invoke and return Message-ID: <200902201523.n1KFN3gx007900@parsley.camswl.com> Hi again folks, The benchmarking so far has shown me that to get any real improvement on (semi) real world applications like Think Free Office I must look at a number of things... 1) Optimisation of invoke and return The way invoke and return is done at the moment in the CC_INTERP is disgusting. When 'run' encounters an invoke or return it does some limited processing, sets the VM state to 'method_entry' or 'returning' and then returns from 'run' to 'main_loop' which then handles the invoke or return and then calls 'run' again. Now with the 'optimised' interpreter, when it encounters an invoke or return it has to thread its way back from run_opt to main_loop and then back down to run_opt. Preoptimisation handling of invoke / return: main_loop -> run -> main_loop -> run Post 'optimisation': main_loop -> run -> run_opt -> run -> main_loop -> run -> run_opt. Oops. The way around this is to try and flatten main_loop into run so that it is all handled in 'run'. Then the code for invoke/return can be migrated down to run_opt (at least for the simple cases where it is resolved, not synchronised?, no exceptions). However this will involve large scale munging of the codebase, which I wanted to avoid. Increasingly I am feeling that I am flogging a dead horse with the CC_INTERP. I had a look at the template interpreter over the past few days. It doesn't actually look too bad. It all seems quite neatly architected (at least to the depth I examined it). 2) Optimisation of native invoke and return 3) Optimisation of the native libraries. My pointy-haired boss is in town today and I think my proposal to him will be that we do the template interpreter. Based on some benchmarking I did on the PC comparing zero with -Xint this should give us 6 X performance increase over bare bones zero (4 X over my current optimised implementation). Doing the template interpreter would also serve as a useful 1st step to doing hotspot (should we decide to go that way). Gary: Does the Hotspot Compiler use the template interpreter to do its codegen, or does it have its own codegen and only use the template interpreter to interpret uncompiled code? (I could check this myself in the source, but I am lazy). Regards, Ed. From aph at redhat.com Fri Feb 20 03:58:50 2009 From: aph at redhat.com (Andrew Haley) Date: Fri, 20 Feb 2009 11:58:50 +0000 Subject: Improving the performance of OpenJDK In-Reply-To: <200902200139.n1K1dAst004899@parsley.camswl.com> References: <200902200139.n1K1dAst004899@parsley.camswl.com> Message-ID: <499E9AFA.3010502@redhat.com> Edward Nevill wrote: > > Question: Why is the class data in the wrong endianness in the first place? > The data affected is the data embedded in the Java bytecode and the data > in the constant pool. Why does the classfile loader not just correct > the endianness on load? It has to do verification anyway so it has to > trawl through the classfile? It might as well correct the endianness. No, I don't understand this either.
However, there are other places than the interpreter that read the bytecode, so it's not just a matter of fixing it there. All accesses to the constant pool seem to read the operand in native byte order, and then constantPoolOopDesc::field_or_method_at() does: // Takes either a constant pool cache index in possibly byte-swapped // byte order (which comes from the bytecodes after rewriting) or, // if "uncached" is true, a vanilla constant pool index jint field_or_method_at(int which, bool uncached) { int i = -1; if (uncached || cache() == NULL) { i = which; } else { // change byte-ordering and go via cache i = cache()->entry_at(Bytes::swap_u2(which))->constant_pool_index(); } assert(tag_at(i).is_field_or_method(), "Corrupted constant pool"); return *int_at_addr(i); } In other words: the operands are rewritten but left in Java byte order. Andrew. From gbenson at redhat.com Fri Feb 20 04:02:15 2009 From: gbenson at redhat.com (Gary Benson) Date: Fri, 20 Feb 2009 12:02:15 +0000 Subject: Optimising invoke and return In-Reply-To: <200902201523.n1KFN3gx007900@parsley.camswl.com> References: <200902201523.n1KFN3gx007900@parsley.camswl.com> Message-ID: <20090220120215.GA3277@redhat.com> Edward Nevill wrote: > My poiny haired boss is in town today and I think my proposal to > him will be that we do the template interpreter. Based on some > benchmarking I did on the PC comparing zero with -Xint this should > give us 6 X performance increase over bare bones zero (4 X over my > current optimised implementation). > > Doing the template interpreter would also serve as a useful 1st step > to doing hotspot (should we decide to go that way). > > Gary: Does the Hotspt Compiler use the template interpreter to do > its codegen, or does it have its own codegen and only use the > template interpreter to interpret uncompiled code. (I could check > this myself in the source, but I am lazy). The two compilers (client and server) and the template interpreter are essentially independent. There's shared code, of course, but they're essentially different. Whether doing the template interpreter is a good idea depends on what you want, and when you want it. If your goal is a high-performance interpreter only solution then I'd say do the template interpreter. The work involved is probably something like 6-9 man months to get it mostly working, and maybe another 3 months to pass the TCK if that's what you want. If your goal is a high-performance interpreter+JIT combo, then adding the client JIT to an existing template interpreter is probably 3-6 man months, with maybe another 3 months for the TCK. I'm guessing here, I haven't done a client JIT interpretation myself. I'm also assuming you'd want client rather than server (it's probably better suited for ARM). If you want a JIT but don't have a 12-18 months to do it in and can accept good but not ultimate performance, then maybe Zero and Shark are worth looking into. Shark is not bad now, and it could be in something like a production ready state in 3-6 months. That's my two cents... Cheers, Gary -- http://gbenson.net/ From ed at camswl.com Fri Feb 20 08:11:36 2009 From: ed at camswl.com (Edward Nevill) Date: Fri, 20 Feb 2009 16:11:36 GMT Subject: Improving the performance of OpenJDK Message-ID: <200902201611.n1KGBark008090@parsley.camswl.com> > > Question: Why is the class data in the wrong endianness in the first place? > > The data affected is the data embeded in the Java bytecode and the data > > in the constant pool. 
Why does the classfile loader not just correct > the endianness on load? It has to do verification anyway so it has to > trawl through the classfile? It might as well correct the endianness. > > No I don't understand this either. However, there are other places than > the interpreter that read the bytecode, so it's not just a matter of > fixing it there. Ah yes, but presumably all those other places in the interpreter will use the wonderfully written and highly optimised get_Java_u2() and friends so we don't need to trawl through the interpreter. I mean that's why they were written in the first place. Regards, Ed. From xerxes at zafena.se Fri Feb 20 04:59:27 2009 From: xerxes at zafena.se (Xerxes Rånby) Date: Fri, 20 Feb 2009 13:59:27 +0100 Subject: Optimising invoke and return - importing phoneME ARM JIT to Icedtea? In-Reply-To: <20090220120215.GA3277@redhat.com> References: <200902201523.n1KFN3gx007900@parsley.camswl.com> <20090220120215.GA3277@redhat.com> Message-ID: <499EA92F.4030508@zafena.se> Hello, I have been following the recent days' optimization discussion with great interest and wanted to fill in some thoughts. Gary Benson wrote: > Edward Nevill wrote: > >> My pointy-haired boss is in town today and I think my proposal to >> him will be that we do the template interpreter. Based on some >> benchmarking I did on the PC comparing zero with -Xint this should >> give us 6 X performance increase over bare bones zero (4 X over my >> current optimised implementation). >> >> Doing the template interpreter would also serve as a useful 1st step >> to doing hotspot (should we decide to go that way). >> >> Gary: Does the Hotspot Compiler use the template interpreter to do >> its codegen, or does it have its own codegen and only use the >> template interpreter to interpret uncompiled code? (I could check >> this myself in the source, but I am lazy). >> > > The two compilers (client and server) and the template interpreter > are essentially independent. There's shared code, of course, but > they're essentially different. > > Whether doing the template interpreter is a good idea depends on what > you want, and when you want it. If your goal is a high-performance > interpreter only solution then I'd say do the template interpreter. > The work involved is probably something like 6-9 man months to get it > mostly working, and maybe another 3 months to pass the TCK if that's > what you want. >
> A question that have tickeled my mind since fosdem are if the phoneME and OpenJDK jvm codebase share any similarity, if so then perhaps a allready implemented ARM JIT from the phoneME project can be patched up to support 1.5 bytecodes and imported into a GPL only fork of OpenJDK by dropping all classpath exceptions. For me it is impossible to judge the effort needed to do such a suggested merge. > If you want a JIT but don't have a 12-18 months to do it in and can > accept good but not ultimate performance, then maybe Zero and Shark > are worth looking into. Shark is not bad now, and it could be in > something like a production ready state in 3-6 months. > > That's my two cents... > > Cheers, > Gary > Cheers and have a great day! Xerxes From gbenson at redhat.com Tue Feb 24 04:00:41 2009 From: gbenson at redhat.com (Gary Benson) Date: Tue, 24 Feb 2009 12:00:41 +0000 Subject: Profiling In-Reply-To: <200902200051.n1K0p5YK004736@parsley.camswl.com> References: <200902200051.n1K0p5YK004736@parsley.camswl.com> Message-ID: <20090224120041.GA3260@redhat.com> Hi Ed, Edward Nevill wrote: > --- ThinkFreeOffice --- Original > > % cumulative self self total > time seconds seconds calls s/call s/call name > 42.12 21.35 21.35 33855747 0.00 0.00 BytecodeInterpreter::run(BytecodeInterpreter*) > 9.70 26.27 4.92 34204 0.00 0.00 CppInterpreter::main_loop(int, Thread*) > 5.17 28.89 2.62 17783120 0.00 0.00 methodOopDesc::result_type() const > 4.54 31.19 2.30 16484931 0.00 0.00 InterpreterFrame::build(ZeroStack*, methodOopDesc*, JavaThread*) > 4.48 33.46 2.27 17847099 0.00 0.00 SignatureIterator::iterate_returntype() > 2.49 34.72 1.26 15239483 0.00 0.00 CppInterpreter::normal_entry(methodOopDesc*, int, Thread*) > 2.41 35.94 1.22 18044932 0.00 0.00 SignatureIterator::parse_type() > 2.05 36.98 1.04 33005883 0.00 0.00 os::vm_page_size() > 1.47 37.72 0.75 1245317 0.00 0.00 CppInterpreter::native_entry(methodOopDesc*, int, Thread*) > 1.44 38.45 0.73 17891344 0.00 0.00 SignatureIterator::SignatureIterator(symbolHandle) > 1.09 39.00 0.55 1773458 0.00 0.00 CppInterpreter::accessor_entry(methodOopDesc*, int, Thread*) This one I find really interesting. Is libffi showing up in your profile at all? Andrew Haley and I have often wondered what performance impace using libffi has, and Andrew has some ideas for speeding up libffi if it turns out to be a bottleneck. If your profile is excluding libffi, it'd be really interesting to see what it looked like with it in. Cheers, Gary -- http://gbenson.net/ From ed at camswl.com Wed Feb 25 06:01:23 2009 From: ed at camswl.com (Edward Nevill) Date: Wed, 25 Feb 2009 14:01:23 GMT Subject: Profiling Message-ID: <200902251401.n1PE1N44017335@parsley.camswl.com> > >This one I find really interesting. Is libffi showing up in your >profile at all? Andrew Haley and I have often wondered what >performance impace using libffi has, and Andrew has some ideas for >speeding up libffi if it turns out to be a bottleneck. If your >profile is excluding libffi, it'd be really interesting to see >what it looked like with it in. > >Cheers, >Gary Hi Gary, The previous profiles did not include libffi because I was using the shared library. I have redone it with a static libffi. Here are the top 50 results. libffi accounts for 0.5% on this particular app (Think Free Office). I also did CaffeineMark and EEMBC but the results on those were negligible (<0.10%). Regards, Ed. Flat profile: Each sample counts as 0.01 seconds. 
% cumulative self self total time seconds seconds calls s/call s/call name 23.81 11.98 11.98 33820691 0.00 0.00 BytecodeInterpreter::run(BytecodeInterpreter*) 10.99 17.51 5.53 33987 0.00 0.00 CppInterpreter::main_loop(int, Thread*) 5.96 20.51 3.00 17775389 0.00 0.00 methodOopDesc::result_type() const 4.79 22.92 2.41 17840744 0.00 0.00 SignatureIterator::iterate_returntype() 3.90 24.88 1.96 16469679 0.00 0.00 InterpreterFrame::build(ZeroStack*, methodOopDesc*, JavaThread*) 2.48 26.13 1.25 15215340 0.00 0.00 CppInterpreter::normal_entry(methodOopDesc*, int, Thread*) 2.44 27.36 1.23 23488682 0.00 0.00 ConstantPoolCacheEntry::is_resolved(Bytecodes::Code) const 2.38 28.56 1.20 32975124 0.00 0.00 os::vm_page_size() 2.29 29.71 1.15 18024369 0.00 0.00 SignatureIterator::parse_type() 2.13 30.78 1.07 run_opt_entry 1.85 31.71 0.93 1254203 0.00 0.00 CppInterpreter::native_entry(methodOopDesc*, int, Thread*) 1.61 32.52 0.81 17883786 0.00 0.00 SignatureIterator::SignatureIterator(symbolHandle) 1.25 33.15 0.63 do_getfield_state1 0.85 33.58 0.43 207980 0.00 0.00 typeArrayKlass::allocate(int, Thread*) 0.81 33.99 0.41 1777040 0.00 0.00 CppInterpreter::accessor_entry(methodOopDesc*, int, Thread*) 0.81 34.40 0.41 run_opt_exit 0.70 34.75 0.35 6219137 0.00 0.00 BytecodeInterpreter::astore(int*, int, int*, int) 0.68 35.09 0.34 do_iload_state1 0.64 35.41 0.32 15293338 0.00 0.00 ThreadShadow::clear_pending_exception() 0.62 35.72 0.31 16657203 0.00 0.00 os::current_stack_pointer() 0.56 36.00 0.28 2348866 0.00 0.00 void oop_store(oopDesc**, oopDesc*) 0.56 36.28 0.28 304045 0.00 0.00 SymbolTable::lookup_only(char const*, int, unsigned int&) 0.48 36.52 0.24 17819959 0.00 0.00 ResultTypeFinder::set(int, BasicType) 0.48 36.76 0.24 4007672 0.00 0.00 SignatureInfo::do_void() 0.44 36.98 0.22 5069507 0.00 0.00 SignatureInfo::do_object(int, int) 0.44 37.20 0.22 do_aload_0_state0 0.42 37.41 0.21 do_iload_state0 0.42 37.62 0.21 ffi_call 0.40 37.82 0.20 44574 0.00 0.00 ClassFileParser::parse_method(constantPoolHandle, bool, AccessFlags*, typeArrayHandle*, typeArrayHandle*, typeArrayHandle*, Thread*) 0.36 38.00 0.18 1326820 0.00 0.00 HandleMarkCleaner::~HandleMarkCleaner() 0.32 38.16 0.16 1123793 0.00 0.00 Klass::is_subtype_of(klassOopDesc*) const 0.30 38.31 0.15 4430397 0.00 0.00 SignatureInfo::do_int() 0.30 38.46 0.15 2539725 0.00 0.00 SignatureInfo::do_bool() 0.30 38.61 0.15 do_aload_0_state1 0.30 38.76 0.15 do_goto_state0 0.30 38.91 0.15 do_iload_2_state1 0.28 39.05 0.14 43798 0.00 0.00 klassVtable::update_super_vtable(instanceKlass*, methodHandle, int, bool, Thread*) 0.26 39.18 0.13 9376 0.00 0.00 constantPoolKlass::oop_follow_contents(oopDesc*) 0.26 39.31 0.13 do_if_icmpge_state1 0.26 39.44 0.13 ffi_prep_args 0.24 39.56 0.12 128869 0.00 0.00 instanceKlass::uncached_lookup_method(symbolOopDesc*, symbolOopDesc*) const 0.24 39.68 0.12 do_aload_1_state0 0.24 39.80 0.12 do_bipush_state1 0.22 39.91 0.11 16 0.01 0.01 ContiguousSpace::prepare_for_compaction(CompactPoint*) 0.22 40.02 0.11 do_if_icmpne_state1 From aph at redhat.com Wed Feb 25 02:41:19 2009 From: aph at redhat.com (Andrew Haley) Date: Wed, 25 Feb 2009 10:41:19 +0000 Subject: Profiling In-Reply-To: <200902251401.n1PE1N44017335@parsley.camswl.com> References: <200902251401.n1PE1N44017335@parsley.camswl.com> Message-ID: <49A5204F.3010108@redhat.com> Edward Nevill wrote: >> This one I find really interesting. Is libffi showing up in your >> profile at all? 
Andrew Haley and I have often wondered what >> performance impace using libffi has, and Andrew has some ideas for >> speeding up libffi if it turns out to be a bottleneck. If your >> profile is excluding libffi, it'd be really interesting to see >> what it looked like with it in. > The previous profiles did not include libffi because I was using the > shared library. I have redone it with a static libffi. Here are the > top 50 results. libffi accounts for 0.5% on this particular app > (Think Free Office). I also did CaffeineMark and EEMBC but the > results on those were negligible (<0.10%). Good news, thanks. FYI Ed, you might want to consider oprofile: it profiles everything that's running and it's easy to use. (The docs are kinda horrible and make it look difficult, but really it's not.) Cheers, Andrew. From ed at camswl.com Thu Feb 26 07:45:24 2009 From: ed at camswl.com (Edward Nevill) Date: Thu, 26 Feb 2009 15:45:24 GMT Subject: get_native_u2 & friends Message-ID: <200902261545.n1QFjOY3022789@parsley.camswl.com> Hi all, I think the best thing to do with get_native_u2 and friends is to let the compiler decide how to access unaligned data. Most modern compilers have some facility for doing this. In gcc you can use __attribute__((packed)) as follows. typedef union unaligned { unsigned u; unsigned short us; unsigned long long ul; } __attribute__((packed)) unaligned; unsigned short get_native_u2(unaligned *p) { return p->us; } unsigned get_native_u4(unaligned *p) { return p->u; } unsigned long long get_native_u8(unaligned *p) { return p->ul; } Below is the code generated for ARM and X86. Note that in the X86 case it just does the access since X86 allows unaligned accesses whereas for ARM it goes ahead and doesbyte loads. If on some architechture it is better to test the alignment and then do word/halfword loads if the pointer is aligned and byte loads otherwise then hopefully the compiler will know the best code to generate rarther than us trying to second guess what is best on individual architectures. Also, in many case these functions are called when it is known that the data is aligned as in this example from _tableswitch... CASE(_tableswitch): { jint* lpc = (jint*)VMalignWordUp(pc+1); int32_t key = STACK_INT(-1); int32_t low = Bytes::get_Java_u4((address)&lpc[1]); int32_t high = Bytes::get_Java_u4((address)&lpc[2]); Maybe it is worth having get_Java_u4() and get_Java_u4_unaligned()? Regards, Ed. 
--- x86.s -------------------------------------------------------- .file "test.c" .text .p2align 4,,15 .globl get_native_u2 .type get_native_u2, @function get_native_u2: pushl %ebp movl %esp, %ebp movl 8(%ebp), %eax popl %ebp movzwl (%eax), %eax ret .size get_native_u2, .-get_native_u2 .p2align 4,,15 .globl get_native_u4 .type get_native_u4, @function get_native_u4: pushl %ebp movl %esp, %ebp movl 8(%ebp), %eax popl %ebp movl (%eax), %eax ret .size get_native_u4, .-get_native_u4 .p2align 4,,15 .globl get_native_u8 .type get_native_u8, @function get_native_u8: pushl %ebp movl %esp, %ebp movl 8(%ebp), %eax popl %ebp movl 4(%eax), %edx movl (%eax), %eax ret .size get_native_u8, .-get_native_u8 .ident "GCC: (GNU) 4.2.4 (Ubuntu 4.2.4-1ubuntu3)" .section .note.GNU-stack,"", at progbits --- arm.s ------------------------------------------------------------- .arch armv5t .fpu softvfp .eabi_attribute 20, 1 .eabi_attribute 21, 1 .eabi_attribute 23, 3 .eabi_attribute 24, 1 .eabi_attribute 25, 1 .eabi_attribute 26, 2 .eabi_attribute 30, 2 .eabi_attribute 18, 4 .file "test.c" .text .align 2 .global get_native_u2 .type get_native_u2, %function get_native_u2: @ args = 0, pretend = 0, frame = 0 @ frame_needed = 0, uses_anonymous_args = 0 @ link register save eliminated. ldrb r3, [r0, #1] @ zero_extendqisi2 ldrb r0, [r0, #0] @ zero_extendqisi2 orr r0, r0, r3, asl #8 bx lr .size get_native_u2, .-get_native_u2 .align 2 .global get_native_u4 .type get_native_u4, %function get_native_u4: @ args = 0, pretend = 0, frame = 0 @ frame_needed = 0, uses_anonymous_args = 0 @ link register save eliminated. ldrb r1, [r0, #1] @ zero_extendqisi2 ldrb r3, [r0, #0] @ zero_extendqisi2 ldrb r2, [r0, #2] @ zero_extendqisi2 ldrb r0, [r0, #3] @ zero_extendqisi2 orr r3, r3, r1, asl #8 orr r3, r3, r2, asl #16 orr r0, r3, r0, asl #24 bx lr .size get_native_u4, .-get_native_u4 .align 2 .global get_native_u8 .type get_native_u8, %function get_native_u8: @ args = 0, pretend = 0, frame = 0 @ frame_needed = 0, uses_anonymous_args = 0 @ link register save eliminated. stmfd sp!, {r4, r5, r6} ldrb r5, [r0, #1] @ zero_extendqisi2 ldrb r6, [r0, #5] @ zero_extendqisi2 ldrb r3, [r0, #0] @ zero_extendqisi2 ldrb ip, [r0, #2] @ zero_extendqisi2 ldrb r1, [r0, #4] @ zero_extendqisi2 ldrb r2, [r0, #6] @ zero_extendqisi2 ldrb r4, [r0, #7] @ zero_extendqisi2 ldrb r0, [r0, #3] @ zero_extendqisi2 orr r3, r3, r5, asl #8 orr r1, r1, r6, asl #8 orr r3, r3, ip, asl #16 orr r1, r1, r2, asl #16 orr r0, r3, r0, asl #24 orr r1, r1, r4, asl #24 ldmfd sp!, {r4, r5, r6} bx lr .size get_native_u8, .-get_native_u8 .ident "GCC: (Ubuntu 4.3.3-1ubuntu1) 4.3.3" .section .note.GNU-stack,"",%progbits From aph at redhat.com Thu Feb 26 10:37:31 2009 From: aph at redhat.com (Andrew Haley) Date: Thu, 26 Feb 2009 18:37:31 +0000 Subject: Shark: JVMTI support for profiling, etc Message-ID: <49A6E16B.1060205@redhat.com> This patch adds support for Shark profiling. 
Running SPECjvm98 gives you a profile like this, showing the time spent in Shark-compiled code: samples % image name app name symbol name 168440 5.0754 no-vmlinux no-vmlinux /no-vmlinux 125195 3.7724 9541.jo java spec.benchmarks._209_db.Database::shell_sort 87596 2.6394 libjvm.so libjvm.so SignatureIterator::iterate_returntype() 80076 2.4128 9541.jo java java.lang.String::compareTo 70082 2.1117 9541.jo java java.util.Vector::elementAt 65929 1.9866 9541.jo java spec.benchmarks._201_compress.Compressor::compress 62324 1.8779 libjvm.so libjvm.so SharedHeap::is_in_permanent(void const*) const 59044 1.7791 libjvm.so libjvm.so CppInterpreter::accessor_entry(methodOopDesc*, long, Thread*) 58508 1.7630 libjvm.so libjvm.so Hashtable::oops_do(OopClosure*) 43226 1.3025 9541.jo java spec.benchmarks._222_mpegaudio.q::l 42005 1.2657 9541.jo java java.util.Vector::elementData 39031 1.1761 9541.jo java spec.benchmarks._205_raytrace.OctNode::Intersect 38677 1.1654 9541.jo java spec.benchmarks._201_compress.Decompressor::decompress ... I committed this. Andrew. 2009-02-26 Andrew Haley * patches/openjdk/hotspot/src/share/vm/prims/jvmtiEnv.cpp: New file. * Makefile.am (ICEDTEA_PATCHES): Add icedtea-jvmtiEnv.patch. * ports/hotspot/src/share/vm/shark/sharkFunction.cpp (SharkFunction::initialize): Use real name, not "func". Pass "none" to "-debug-only=". Register generated code for profiling, etc. * ports/hotspot/src/share/vm/shark/sharkEntry.hpp (class SharkEntry): Remove #ifndef PRODUCT. * ports/hotspot/src/share/vm/shark/sharkBuilder.hpp (SharkBuilder::CreateFunction): Use real name, not "func". * ports/hotspot/src/share/vm/shark/sharkBuilder.cpp (SharkBuilder::CreateFunction): Use real name, not "func". (MyJITMemoryManager::endFunctionBody): Remove #ifndef PRODUCT. --- openjdk/hotspot/src/share/vm/prims/jvmtiEnv.cpp.old 2009-02-26 17:18:35.000000000 +0000 +++ openjdk/hotspot/src/share/vm/prims/jvmtiEnv.cpp 2009-02-26 17:16:59.000000000 +0000 @@ -2702,6 +2702,9 @@ (*entry_count_ptr) = num_entries; (*table_ptr) = jvmti_table; + if (num_entries == 0) + return JVMTI_ERROR_ABSENT_INFORMATION; + return JVMTI_ERROR_NONE; } /* end GetLineNumberTable */ diff -r 90de0ba94422 Makefile.am --- a/Makefile.am Thu Feb 26 17:34:19 2009 +0000 +++ b/Makefile.am Thu Feb 26 18:32:17 2009 +0000 @@ -541,7 +541,8 @@ patches/icedtea-sunsrc.patch \ patches/icedtea-libraries.patch \ patches/icedtea-javafiles.patch \ - patches/icedtea-core-build.patch + patches/icedtea-core-build.patch \ + patches/icedtea-jvmtiEnv.patch if WITH_ALT_HSBUILD ICEDTEA_PATCHES += \ diff -r 90de0ba94422 ports/hotspot/src/share/vm/shark/sharkBuilder.cpp --- a/ports/hotspot/src/share/vm/shark/sharkBuilder.cpp Thu Feb 26 17:34:19 2009 +0000 +++ b/ports/hotspot/src/share/vm/shark/sharkBuilder.cpp Thu Feb 26 18:29:12 2009 +0000 @@ -97,12 +97,12 @@ module()->getOrInsertFunction("llvm.memory.barrier", type)); } -Function *SharkBuilder::CreateFunction() +Function *SharkBuilder::CreateFunction(const char *name) { Function *function = Function::Create( SharkType::entry_point_type(), GlobalVariable::InternalLinkage, - "func"); + name); module()->getFunctionList().push_back(function); return function; } @@ -180,13 +180,12 @@ void SharkBuilder::MyJITMemoryManager::endFunctionBody (const llvm::Function *F, unsigned char *FunctionStart, - unsigned char *FunctionEnd) + unsigned char *FunctionEnd) { mm->endFunctionBody(F, FunctionStart, FunctionEnd); -#ifndef PRODUCT + SharkEntry *e = sharkEntry[F]; if (e) e->setBounds(FunctionStart, FunctionEnd); -#endif // !PRODUCT } diff 
-r 90de0ba94422 ports/hotspot/src/share/vm/shark/sharkBuilder.hpp --- a/ports/hotspot/src/share/vm/shark/sharkBuilder.hpp Thu Feb 26 17:34:19 2009 +0000 +++ b/ports/hotspot/src/share/vm/shark/sharkBuilder.hpp Thu Feb 26 18:29:12 2009 +0000 @@ -109,7 +109,7 @@ // Function creation public: - llvm::Function *CreateFunction(); + llvm::Function *CreateFunction(const char *name = "func"); // Helpers for accessing structures and arrays public: diff -r 90de0ba94422 ports/hotspot/src/share/vm/shark/sharkEntry.hpp --- a/ports/hotspot/src/share/vm/shark/sharkEntry.hpp Thu Feb 26 17:34:19 2009 +0000 +++ b/ports/hotspot/src/share/vm/shark/sharkEntry.hpp Thu Feb 26 18:29:12 2009 +0000 @@ -46,8 +46,6 @@ public: void print_statistics(const char* name) const PRODUCT_RETURN; -#ifndef PRODUCT - private: address code_start() const { return start; @@ -66,6 +64,4 @@ start = (address)FunctionStart; limit = (address)FunctionEnd; } - -#endif // !PRODUCT }; diff -r 90de0ba94422 ports/hotspot/src/share/vm/shark/sharkFunction.cpp --- a/ports/hotspot/src/share/vm/shark/sharkFunction.cpp Thu Feb 26 17:34:19 2009 +0000 +++ b/ports/hotspot/src/share/vm/shark/sharkFunction.cpp Thu Feb 26 18:29:12 2009 +0000 @@ -37,7 +37,7 @@ masm()->advance(sizeof(SharkEntry)); // Create the function - _function = builder()->CreateFunction(); + _function = builder()->CreateFunction(name()); entry->set_llvm_function(function()); #ifndef PRODUCT // FIXME: there should be a mutex when updating sharkEntry in case @@ -142,7 +142,7 @@ // target-specific. Args.push_back("-debug-only=" "x86-emitter"); else - Args.push_back("-debug-only="); + Args.push_back("-debug-only=" "none"); Args.push_back(0); // Null terminator. cl::ParseCommandLineOptions(Args.size()-1, (char**)&Args[0]); #endif @@ -150,6 +150,13 @@ // Compile to native code void *code = builder()->execution_engine()->getPointerToFunction(function()); + + // Register generated code for profiling, etc + if (JvmtiExport::should_post_dynamic_code_generated()) { + JvmtiExport::post_dynamic_code_generated + (name(), entry->code_start(), entry->code_limit()); + } + entry->set_entry_point((ZeroEntry::method_entry_t) code); if (SharkTraceInstalls) entry->print_statistics(name());
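One way to watch the registration added above in action is a small JVMTI agent that listens for the DynamicCodeGenerated events the VM posts. This is only an illustration of the consumer side, written against the standard JVMTI API; it is not part of the patch, and the file and symbol names are invented.

#include <string.h>
#include <stdio.h>
#include <jvmti.h>

// Called once per piece of dynamically generated code, e.g. each
// Shark-compiled method registered by post_dynamic_code_generated().
static void JNICALL dynamic_code_generated(jvmtiEnv *jvmti, const char *name,
                                           const void *address, jint length) {
  fprintf(stderr, "dynamic code: %s at %p (%d bytes)\n",
          name, address, (int) length);
}

extern "C" JNIEXPORT jint JNICALL
Agent_OnLoad(JavaVM *vm, char *options, void *reserved) {
  jvmtiEnv *jvmti = NULL;
  if (vm->GetEnv((void **) &jvmti, JVMTI_VERSION_1_0) != JNI_OK)
    return JNI_ERR;

  jvmtiEventCallbacks callbacks;
  memset(&callbacks, 0, sizeof(callbacks));
  callbacks.DynamicCodeGenerated = dynamic_code_generated;
  jvmti->SetEventCallbacks(&callbacks, sizeof(callbacks));
  jvmti->SetEventNotificationMode(JVMTI_ENABLE,
                                  JVMTI_EVENT_DYNAMIC_CODE_GENERATED, NULL);
  return JNI_OK;
}

Built as a shared library and loaded with -agentpath, it should print one line per registered method; profilers that understand JIT-generated code hook the same events through their own agents.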