New portion of improvements for Dual-Pivot Quicksort

Tue Jun 22 12:15:56 UTC 2010

Hello,

I tried with the latest JDK, build 98 and see different behaviour
on two computers: 7570 / 8318 and 8560 / 8590, but sorting method
works slower with a[less++] instruction on both computers:

                  first               second
           a[less] = ak; less++; / (a[less++] = ak;

       random: 16371 / 16696        14018 / 14326
    ascendant:  2706 /  2762         2884 /  2897
   descendant:  2994 /  3108         3170 /  3258
  organ pipes:  7073 /  7296         6939 /  7090
   stagger(2):  7765 /  8069         7531 /  7731
  period(1,2):   653 /   743          719 /   753
random(1..4):  2152 /  2234         1567 /  1591

If I change Test class and populate the array with descendant
values, I still see difference in times: 4793 / 5718
see updated attached Test class.

Also I'm attaching the latest version of DualPivotQuicksort
with minor format changes. If you don't have comments, I'll
ask to to integrate the code at the end of this week.

Thank you,
Vladimir

Osvaldo Doederlein wrote:
> Hi Vladimir,
> 
> 2010/6/19 Vladimir Iaroslavski <iaroslavski at mail.ru 
> <mailto:iaroslavski at mail.ru>>
> 
>     Hello Osvaldo,
> 
>     I've prepared simple test which scans an array and does assignments
>     for each element,
>     see attached Test class:
> 
>     a[k] = a[less];
>     a[less++] = 0; // or a[less] = 0; less++;
> 
>     The result of running "java -client Test" is:
> 
>     a[less], less++;   Time: 6998
>     a[less++];         Time: 8416
> 
>     It is much more than 1%. Is it bug in JVM? Note that under server VM
> 
> The amount of diff surely depends on the benchmark; your bench should 
> "zoom" the problem by not having much other work polluting the 
> measurement. But in my own test with b98 (32-bit), Q6600 CPU, I've got 
> 5065/5079, so the diff is < 1%. The excessive disadvantage you report 
> may be something bad in your older b84.
> 
> Anyway I investigated the JIT-compiled code, details in the end (I've 
> split the benchmark in two classes just to make the comparison simpler). 
> The problem with "a[less++]" is that "less++" will first increment 
> "less", _then_ it will use the old value (not-incremented) to index "a". 
> C1 generates code that is equivalent to:
> 
> int less_incremented = less + 1;
> a[less] = 0;
> less = less_incremented;
> 
> ...which is a 1-to-1 translation of the IR coming off the bytecode. C1 
> is not smart enough to see that the increment can be reordered  after 
> the indexing, maybe because there's a data dependency as the indexing 
> uses "less"; but due to the semantics of postfix "++" this dependency is 
> for the before-increment value, so the reordering would be safe. Maybe 
> that's just some simple missing heuristics that could be easily added?
> 
>     there is no difference between "a[less++]" and "a[less], less++".
> 
> C2 generates completely different code,with 16X unrolling - this is the 
> inner loop:
> 
>   0x026a6e40: pxor   %xmm0,%xmm0        ;*aload_0
>                                         ; - Test1::sort1 at 9 (line 23)
>   0x026a6e44: movq   %xmm0,0xc(%ecx,%esi,4)
>   0x026a6e4a: movq   %xmm0,0x14(%ecx,%esi,4)
>   0x026a6e50: movq   %xmm0,0x1c(%ecx,%esi,4)
>   0x026a6e56: movq   %xmm0,0x24(%ecx,%esi,4)
>   0x026a6e5c: movq   %xmm0,0x2c(%ecx,%esi,4)
>   0x026a6e62: movq   %xmm0,0x34(%ecx,%esi,4)
>   0x026a6e68: movq   %xmm0,0x3c(%ecx,%esi,4)
>   0x026a6e6e: movq   %xmm0,0x44(%ecx,%esi,4)  ;*iastore
>                                         ; - Test1::sort1 at 21 (line 24)
>   0x026a6e74: add    $0x10,%esi         ;*iinc
>                                         ; - Test1::sort1 at 22 (line 22)
>   0x026a6e77: cmp    %ebp,%esi
>   0x026a6e79: jl     0x026a6e44
> 
> There is some extra slow-path code to fill the remaining 1...15 elements 
> if the loop length is not multiple of 16, and that's all. C2 detects the 
> redundancy between the "k" and "less" vars, and kills the also-redundant 
> "a[k] = a[less]" assignment so the net result is a simple zero-fill of 
> the array. Maybe a different benchmark without these redundancies would 
> make easier to see that C2 doesn't have a problem with the postfix "++", 
> but if it had, I think it wouldn't reach the excellent result above.
> 
> A+
> Osvaldo
>  
> 
>     I'm using JDK 7 on Windows XP:
> 
>     java version "1.7.0-ea"
>     Java(TM) SE Runtime Environment (build 1.7.0-ea-b84)
>     Java HotSpot(TM) Client VM (build 17.0-b09, mixed mode, sharing)
> 
>     Thanks,
>     Vladimir
> 
> 
> 
> This is the C1 code for sort2()'s loop:
> 
>   0x0251c1dc: cmp    0x8(%ecx),%esi     ; implicit exception: dispatches 
> to 0x0251c21e
>   ;;   30 branch [AE] [RangeCheckStub: 0x454c640] [bci:13]
>   0x0251c1df: jae    0x0251c24a
>   0x0251c1e5: mov    0xc(%ecx,%esi,4),%ebx  ;*iaload: %ebx = a[less];
>                                         ; - Test2::sort2 at 13 (line 23)
>   0x0251c1e9: cmp    0x8(%ecx),%edi
>   ;;   36 branch [AE] [RangeCheckStub: 0x454c7e0] [bci:14]
>   0x0251c1ec: jae    0x0251c263
>   0x0251c1f2: mov    %ebx,0xc(%ecx,%edi,4)  ;*iastore: a[k] = %ebx;
>                                         ; - Test2::sort2 at 14 (line 23)
>  
> (sort1/sort2 start to differ here)
>  
>   0x0251c1f6: cmp    0x8(%ecx),%esi
>   ;;   42 branch [AE] [RangeCheckStub: 0x454c980] [bci:18]
>   0x0251c1f9: jae    0x0251c27c
>   0x0251c1ff: movl   $0x0,0xc(%ecx,%esi,4)  ;*iastore: a[less] = 0;
>                                         ; - Test2::sort2 at 18 (line 24)
>   0x0251c207: inc    %esi               ; ++less;
>   0x0251c208: inc    %edi               ; OopMap{ecx=Oop off=73}
>                                         ;*goto: for k++
>                                         ; - Test2::sort2 at 25 (line 22)
>   0x0251c209: test   %eax,0x1a0100      ;*goto
>                                         ; - Test2::sort2 at 25 (line 22)
>                                         ;   {poll}
>   ;;  block B1 [4, 6]
> 
>   0x0251c20f: cmp    %edx,%edi
>   ;;   22 branch [LT] [B2]
>   0x0251c211: jl     0x0251c1dc         ;*if_icmpge: for k < right
>                                         ; - Test2::sort2 at 6 (line 22)
> 
> The code looks OK; C1 doesn't do much optimization - no unrolling, 
> bounds check elimination - but it's otherwise just about the code you 
> would expect for a simple JITting.
>                    
> Now, C1 code for sort1()'s loop:
> 
>   0x024bc21c: cmp    0x8(%ecx),%esi     ; implicit exception: dispatches 
> to 0x024bc262
>   ;;   30 branch [AE] [RangeCheckStub: 0x44ee3b0] [bci:13]
>   0x024bc21f: jae    0x024bc28e
>   0x024bc225: mov    0xc(%ecx,%esi,4),%ebx  ;*iaload: %ebx = a[less];
>                                         ; - Test1::sort1 at 13 (line 23)
>   0x024bc229: cmp    0x8(%ecx),%edi
>   ;;   36 branch [AE] [RangeCheckStub: 0x44ee550] [bci:14]
>   0x024bc22c: jae    0x024bc2a7
>   0x024bc232: mov    %ebx,0xc(%ecx,%edi,4)  ;*iastore: a[k] = %ebx;
>                                         ; - Test1::sort1 at 14 (line 23)
> 
> (sort1/sort2 start to differ here)
> 
>   0x024bc236: mov    %esi,%ebx          ; Crap! C1 temps 'less' into %ebx
>   0x024bc238: inc    %ebx               ; ++less; (for the temp "less 
> from future")
>  
>   0x024bc239: cmp    0x8(%ecx),%esi     ; %esi is still the "old less"....
>   ;;   46 branch [AE] [RangeCheckStub: 0x44ee7b8] [bci:21]
>   0x024bc23c: jae    0x024bc2c0
>   0x024bc242: movl   $0x0,0xc(%ecx,%esi,4)  ;*iastore: a[less++] = 0;
>                                         ; - Test1::sort1 at 21 (line 24)
>   0x024bc24a: inc    %edi               ; OopMap{ecx=Oop off=75}
>                                         ;*goto: for k++
>                                         ; - Test1::sort1 at 25 (line 22)
>   0x024bc24b: test   %eax,0x1a0100      ;   {poll}
>   0x024bc251: mov    %ebx,%esi          ;*goto
>                                         ; - Test1::sort1 at 25 (line 22): 
> for...
>   ;;  block B1 [4, 6]
> 
>   0x024bc253: cmp    %edx,%edi
>   ;;   22 branch [LT] [B2]
>   0x024bc255: jl     0x024bc21c         ;*if_icmpge
>                                         ; - Test1::sort1 at 6 (line 22): for...
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: Test.java
URL: <http://mail.openjdk.java.net/pipermail/core-libs-dev/attachments/20100622/99c34452/Test.java>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: DualPivotQuicksort.java
URL: <http://mail.openjdk.java.net/pipermail/core-libs-dev/attachments/20100622/99c34452/DualPivotQuicksort.java>