Native wrapper optimization

Sun Nov 20 20:53:49 PST 2011

Hi all,

(Just in case my company email strips attachments again, I'm replying from
my personal email.)

I've got the patch ported to 32-bit x86. See attachment.

Additional comments about the patch:
Joseph's original patch moves the IC miss jump out of line, but on x64 with
compressed oops that doesn't actually save space in the unverified entry
point (UEP) code sequence, because of the 8-byte alignment. See examples in [1].
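A minimal sketch of the alignment arithmetic, assuming that what follows the unverified entry point (UEP) must start on an 8-byte boundary. The helper names and byte counts are hypothetical placeholders, not measurements of HotSpot output; the only point is that trimming a few bytes from the UEP saves nothing unless the total crosses an alignment boundary.

```cpp
#include <cassert>
#include <cstddef>

// Round `size` up to the next multiple of `alignment` (a power of two).
constexpr std::size_t align_up(std::size_t size, std::size_t alignment) {
  return (size + alignment - 1) & ~(alignment - 1);
}

// Hypothetical model: bytes taken by the UEP once the padding up to the
// aligned entry that follows it is counted. `check_bytes` covers the klass
// load and compare, `branch_bytes` the inline branch(es) for the IC miss.
constexpr std::size_t padded_uep_size(std::size_t check_bytes,
                                      std::size_t branch_bytes,
                                      std::size_t alignment = 8) {
  return align_up(check_bytes + branch_bytes, alignment);
}
```

For example, with a made-up 12-byte klass check, `padded_uep_size(12, 7)` and `padded_uep_size(12, 6)` both evaluate to 24, so shaving one byte off the inline branches leaves the padded UEP the same size.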

The version attached to this mail uses jump_cc() inline instead of jcc()
plus an out-of-line jump(). The UEP code generated by both C1 and C2 uses
the same pattern.
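The two layouts can be compared with a toy x86 encoder. It uses the standard encodings (Jcc rel32 is 0x0F, 0x80+cc, disp32; JMP rel32 is 0xE9, disp32), and it assumes that the forward jcc() to the local miss label is emitted in its 6-byte long form and that jump_cc() emits a direct jne rel32 to the IC miss stub. The helper names are hypothetical and are not HotSpot's Assembler API.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// x86 condition-code nibble for "not equal".
enum Cond : std::uint8_t { kNotEqual = 0x5 };

using Code = std::vector<std::uint8_t>;

// Append a little-endian 32-bit displacement.
void put_rel32(Code& code, std::int32_t disp) {
  std::uint32_t u = static_cast<std::uint32_t>(disp);
  for (int i = 0; i < 4; ++i) code.push_back(static_cast<std::uint8_t>(u >> (8 * i)));
}

// Jcc rel32: 0x0F, 0x80+cc, disp32 (6 bytes).
void jcc_long(Code& code, Cond cc, std::int32_t disp) {
  code.push_back(0x0F);
  code.push_back(static_cast<std::uint8_t>(0x80 + cc));
  put_rel32(code, disp);
}

// JMP rel32: 0xE9, disp32 (5 bytes).
void jmp_long(Code& code, std::int32_t disp) {
  code.push_back(0xE9);
  put_rel32(code, disp);
}

struct Layout {
  Code inline_part;  // bytes inside the UEP itself
  Code out_of_line;  // bytes emitted elsewhere in the wrapper
};

// jcc() + out-of-line jump(): jne to a local label that jumps to the stub.
Layout out_of_line_miss(std::int32_t to_local, std::int32_t to_miss_stub) {
  Layout l;
  jcc_long(l.inline_part, kNotEqual, to_local);  // jne L_miss
  jmp_long(l.out_of_line, to_miss_stub);         // L_miss: jmp ic_miss_stub
  return l;
}

// jump_cc(): a single jne straight to the IC miss stub.
Layout inline_jump_cc(std::int32_t to_miss_stub) {
  Layout l;
  jcc_long(l.inline_part, kNotEqual, to_miss_stub);  // jne ic_miss_stub
  return l;
}
```

Under these assumptions both variants put 6 bytes into the UEP; the jump_cc() form simply drops the 5-byte out-of-line trampoline.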

There's a similar pattern in generate_i2c2i_adapters() that could also use
jump_cc() to reach the ic_miss_stub. But the gain doesn't look significant
enough, so I didn't modify it.

Another note:
In x64's version of SharedRuntime::generate_dtrace_nmethod(), the IC check
isn't using load_klass().

__ verify_oop(receiver);
__ cmpl(ic_reg, Address(receiver, oopDesc::klass_offset_in_bytes()));
__ jcc(Assembler::equal, hit);

Is this correct, or should it be modified to use load_klass(), too? My take
is the latter.

load_klass() was introduced in [2], and later, generate_dtrace_nmethod()
was introduced in [3]. I think [3] missed the compressed oops changes.
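A simplified model of why the raw 32-bit compare is suspect under compressed oops. It assumes the klass field holds a narrow value encoded as (address - heap_base) >> 3, and that ic_reg holds the full 64-bit klass address; the base and addresses are made up, and the struct and function names are hypothetical. In the unscaled mode (heap below 4 GB, no base, no shift) the narrow value equals the address and the raw compare would happen to work, so the example deliberately uses a non-zero base.

```cpp
#include <cassert>
#include <cstdint>

// Simplified heap-based compressed oops: narrow = (addr - base) >> shift.
struct CompressedOops {
  std::uint64_t base;
  unsigned shift;  // 3 for 8-byte object alignment

  std::uint32_t encode(std::uint64_t addr) const {
    return static_cast<std::uint32_t>((addr - base) >> shift);
  }
  std::uint64_t decode(std::uint32_t narrow) const {
    return base + (static_cast<std::uint64_t>(narrow) << shift);
  }
};

// What `cmpl(ic_reg, Address(receiver, klass_offset))` checks: the low
// 32 bits of ic_reg against the 32-bit value stored in the klass field.
bool raw_cmpl_hit(std::uint64_t ic_reg, std::uint32_t klass_field) {
  return static_cast<std::uint32_t>(ic_reg) == klass_field;
}

// A load_klass()-style check: decode the narrow field first, then compare
// the full pointer.
bool load_klass_hit(const CompressedOops& oops, std::uint64_t ic_reg,
                    std::uint32_t klass_field) {
  return oops.decode(klass_field) == ic_reg;
}
```

With a base of 0x800000000 and a klass at 0x800001000, the stored narrow value is 0x200: the decoded compare hits, while the raw compare sees 0x1000 against 0x200 and reports a miss for a receiver of the right class.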

Regards,
Kris Mok
Software Engineer, Taobao (http://www.taobao.com)

[1]: https://gist.github.com/1380416#file_notes.md
[2]: http://hg.openjdk.java.net/hsx/hotspot-main/hotspot/file/ba764ed4b6f2/src/cpu/x86/vm/sharedRuntime_x86_64.cpp
[3]: http://hg.openjdk.java.net/hsx/hotspot-main/hotspot/file/018d5b58dd4f/src/cpu/x86/vm/sharedRuntime_x86_64.cpp

On 2011/11/18, changren <changren at taobao.com> wrote:

> OK, Kris will help port it to 32-bit.
> Thanks,
> Joseph
>
> On 2011-11-18 17:09, Christian Thalinger wrote:
> > Looks like a good patch to me.  What about 32-bit x86?
> >
> > -- Chris
> >
> > On Nov 18, 2011, at 7:39 AM, changren wrote:
> >
> >> Hi, all
> >> The attached patch (diff against hsx20) is supposed to speed up native
> >> invocation. It rearranges the compiled-to-native wrapper code to
> >> straighten branches, which improves spatial locality. A micro
> >> benchmark (500M consecutive JNI invocations, with warm-up) shows that
> >> stalled CPU cycles caused by instruction fetch on L1 ICache misses
> >> decreased by 3.4% on the Intel Nehalem microarchitecture and by 9.6% on
> >> the Core microarchitecture. The real execution time of the micro
> >> benchmark also decreased by 5-10% respectively, which reflects the
> >> improvement.
> >> Thanks,
> >> Joseph
> >>
> >> <JNIWrapperOpt.patch>
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: JNI_wrapper_ver2.patch
Type: application/octet-stream
Size: 7936 bytes
Desc: not available
Url : http://mail.openjdk.java.net/pipermail/hotspot-dev/attachments/20111121/68f7eb42/JNI_wrapper_ver2.patch