New portion of improvements for Dual-Pivot Quicksort
Osvaldo Doederlein
opinali at gmail.com
Mon Jun 21 14:18:20 UTC 2010
Hi Vladimir,
2010/6/19 Vladimir Iaroslavski <iaroslavski at mail.ru>
> Hello Osvaldo,
>
> I've prepared simple test which scans an array and does assignments for
> each element,
> see attached Test class:
>
> a[k] = a[less];
> a[less++] = 0; // or a[less] = 0; less++;
>
> The result of running "java -client Test" is:
>
> a[less], less++; Time: 6998
> a[less++]; Time: 8416
>
> It is much more than 1%. Is it bug in JVM? Note that under server VM
>
The amount of diff surely depends on the benchmark; your bench should "zoom"
the problem by not having much other work polluting the measurement. But in
my own test with b98 (32-bit), Q6600 CPU, I've got 5065/5079, so the diff is
< 1%. The excessive disadvantage you report may be something bad in your
older b84.
Anyway I investigated the JIT-compiled code, details in the end (I've split
the benchmark in two classes just to make the comparison simpler). The
problem with "a[less++]" is that "less++" will first increment "less",
_then_ it will use the old value (not-incremented) to index "a". C1
generates code that is equivalent to:
int less_incremented = less + 1;
a[less] = 0;
less = less_incremented;
...which is a 1-to-1 translation of the IR coming off the bytecode. C1 is
not smart enough to see that the increment can be reordered after the
indexing, maybe because there's a data dependency as the indexing uses
"less"; but due to the semantics of postfix "++" this dependency is for the
before-increment value, so the reordering would be safe. Maybe that's just
some simple missing heuristics that could be easily added?
there is no difference between "a[less++]" and "a[less], less++".
>
C2 generates completely different code,with 16X unrolling - this is the
inner loop:
0x026a6e40: pxor %xmm0,%xmm0 ;*aload_0
; - Test1::sort1 at 9 (line 23)
0x026a6e44: movq %xmm0,0xc(%ecx,%esi,4)
0x026a6e4a: movq %xmm0,0x14(%ecx,%esi,4)
0x026a6e50: movq %xmm0,0x1c(%ecx,%esi,4)
0x026a6e56: movq %xmm0,0x24(%ecx,%esi,4)
0x026a6e5c: movq %xmm0,0x2c(%ecx,%esi,4)
0x026a6e62: movq %xmm0,0x34(%ecx,%esi,4)
0x026a6e68: movq %xmm0,0x3c(%ecx,%esi,4)
0x026a6e6e: movq %xmm0,0x44(%ecx,%esi,4) ;*iastore
; - Test1::sort1 at 21 (line 24)
0x026a6e74: add $0x10,%esi ;*iinc
; - Test1::sort1 at 22 (line 22)
0x026a6e77: cmp %ebp,%esi
0x026a6e79: jl 0x026a6e44
There is some extra slow-path code to fill the remaining 1...15 elements if
the loop length is not multiple of 16, and that's all. C2 detects the
redundancy between the "k" and "less" vars, and kills the also-redundant
"a[k] = a[less]" assignment so the net result is a simple zero-fill of the
array. Maybe a different benchmark without these redundancies would make
easier to see that C2 doesn't have a problem with the postfix "++", but if
it had, I think it wouldn't reach the excellent result above.
A+
Osvaldo
> I'm using JDK 7 on Windows XP:
>
> java version "1.7.0-ea"
> Java(TM) SE Runtime Environment (build 1.7.0-ea-b84)
> Java HotSpot(TM) Client VM (build 17.0-b09, mixed mode, sharing)
>
> Thanks,
> Vladimir
>
>
This is the C1 code for sort2()'s loop:
0x0251c1dc: cmp 0x8(%ecx),%esi ; implicit exception: dispatches to
0x0251c21e
;; 30 branch [AE] [RangeCheckStub: 0x454c640] [bci:13]
0x0251c1df: jae 0x0251c24a
0x0251c1e5: mov 0xc(%ecx,%esi,4),%ebx ;*iaload: %ebx = a[less];
; - Test2::sort2 at 13 (line 23)
0x0251c1e9: cmp 0x8(%ecx),%edi
;; 36 branch [AE] [RangeCheckStub: 0x454c7e0] [bci:14]
0x0251c1ec: jae 0x0251c263
0x0251c1f2: mov %ebx,0xc(%ecx,%edi,4) ;*iastore: a[k] = %ebx;
; - Test2::sort2 at 14 (line 23)
(sort1/sort2 start to differ here)
0x0251c1f6: cmp 0x8(%ecx),%esi
;; 42 branch [AE] [RangeCheckStub: 0x454c980] [bci:18]
0x0251c1f9: jae 0x0251c27c
0x0251c1ff: movl $0x0,0xc(%ecx,%esi,4) ;*iastore: a[less] = 0;
; - Test2::sort2 at 18 (line 24)
0x0251c207: inc %esi ; ++less;
0x0251c208: inc %edi ; OopMap{ecx=Oop off=73}
;*goto: for k++
; - Test2::sort2 at 25 (line 22)
0x0251c209: test %eax,0x1a0100 ;*goto
; - Test2::sort2 at 25 (line 22)
; {poll}
;; block B1 [4, 6]
0x0251c20f: cmp %edx,%edi
;; 22 branch [LT] [B2]
0x0251c211: jl 0x0251c1dc ;*if_icmpge: for k < right
; - Test2::sort2 at 6 (line 22)
The code looks OK; C1 doesn't do much optimization - no unrolling, bounds
check elimination - but it's otherwise just about the code you would expect
for a simple JITting.
Now, C1 code for sort1()'s loop:
0x024bc21c: cmp 0x8(%ecx),%esi ; implicit exception: dispatches to
0x024bc262
;; 30 branch [AE] [RangeCheckStub: 0x44ee3b0] [bci:13]
0x024bc21f: jae 0x024bc28e
0x024bc225: mov 0xc(%ecx,%esi,4),%ebx ;*iaload: %ebx = a[less];
; - Test1::sort1 at 13 (line 23)
0x024bc229: cmp 0x8(%ecx),%edi
;; 36 branch [AE] [RangeCheckStub: 0x44ee550] [bci:14]
0x024bc22c: jae 0x024bc2a7
0x024bc232: mov %ebx,0xc(%ecx,%edi,4) ;*iastore: a[k] = %ebx;
; - Test1::sort1 at 14 (line 23)
(sort1/sort2 start to differ here)
0x024bc236: mov %esi,%ebx ; Crap! C1 temps 'less' into %ebx
0x024bc238: inc %ebx ; ++less; (for the temp "less from
future")
0x024bc239: cmp 0x8(%ecx),%esi ; %esi is still the "old less"....
;; 46 branch [AE] [RangeCheckStub: 0x44ee7b8] [bci:21]
0x024bc23c: jae 0x024bc2c0
0x024bc242: movl $0x0,0xc(%ecx,%esi,4) ;*iastore: a[less++] = 0;
; - Test1::sort1 at 21 (line 24)
0x024bc24a: inc %edi ; OopMap{ecx=Oop off=75}
;*goto: for k++
; - Test1::sort1 at 25 (line 22)
0x024bc24b: test %eax,0x1a0100 ; {poll}
0x024bc251: mov %ebx,%esi ;*goto
; - Test1::sort1 at 25 (line 22):
for...
;; block B1 [4, 6]
0x024bc253: cmp %edx,%edi
;; 22 branch [LT] [B2]
0x024bc255: jl 0x024bc21c ;*if_icmpge
; - Test1::sort1 at 6 (line 22): for...
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.openjdk.java.net/pipermail/core-libs-dev/attachments/20100621/9ba8d864/attachment.html>
More information about the core-libs-dev
mailing list