New portion of improvements for Dual-Pivot Quicksort

Tue Jun 22 13:24:08 UTC 2010

both computers with Intel processor:

Pentium, Intel, 4 CPU, 3.2 GHz, 2 Gb RAM
Pentium, Intel, 4 CPU, 2.8 GHz, 2 Gb RAM

Osvaldo Pinali Doederlein wrote:
>  Hi Vladimir,
> 
>> Hello,
>>
>> I tried with the latest JDK, build 98 and see different behaviour
>> on two computers: 7570 / 8318 and 8560 / 8590, but sorting method
>> works slower with a[less++] instruction on both computers:
> 
> Is the first computer (with larger performance diff) an AMD by any 
> chance? It's odd that just the a[less++] would make the code so much 
> slower. Perhaps the worst compiled code has additional costs in some 
> CPUs, e.g. spoiling branch prediction for the bounds checking guards.
> 
> Anyway the change is risk-free (including performance) and the advantage 
> is important at least in some CPUs, so go ahead (FWIW wrt my 
> opinion...). I just wish that C1 would be fixed to not need this kind of 
> hacking - as I categorize this as a hack, not even as a "normal" 
> low-level Java code optimization - because there are certainly millions 
> of uses of the idiom "array[index++/--]" in JavaSE APIs and elsewhere. 
> But I'm not familiar with the C1 source so I don't know if this is some 
> low hanging fruit that could be addressed quickly (any HotSpot engineers 
> there to comment?).
> 
> A+
> Osvaldo
> 
>>                  first               second
>>           a[less] = ak; less++; / (a[less++] = ak;
>>
>>       random: 16371 / 16696        14018 / 14326
>>    ascendant:  2706 /  2762         2884 /  2897
>>   descendant:  2994 /  3108         3170 /  3258
>>  organ pipes:  7073 /  7296         6939 /  7090
>>   stagger(2):  7765 /  8069         7531 /  7731
>>  period(1,2):   653 /   743          719 /   753
>> random(1..4):  2152 /  2234         1567 /  1591
>>
>> If I change Test class and populate the array with descendant
>> values, I still see difference in times: 4793 / 5718
>> see updated attached Test class.
>>
>> Also I'm attaching the latest version of DualPivotQuicksort
>> with minor format changes. If you don't have comments, I'll
>> ask to to integrate the code at the end of this week.
>>
>> Thank you,
>> Vladimir
>>
>> Osvaldo Doederlein wrote:
>>> Hi Vladimir,
>>>
>>> 2010/6/19 Vladimir Iaroslavski <iaroslavski at mail.ru 
>>> <mailto:iaroslavski at mail.ru>>
>>>
>>>     Hello Osvaldo,
>>>
>>>     I've prepared simple test which scans an array and does assignments
>>>     for each element,
>>>     see attached Test class:
>>>
>>>     a[k] = a[less];
>>>     a[less++] = 0; // or a[less] = 0; less++;
>>>
>>>     The result of running "java -client Test" is:
>>>
>>>     a[less], less++;   Time: 6998
>>>     a[less++];         Time: 8416
>>>
>>>     It is much more than 1%. Is it bug in JVM? Note that under server VM
>>>
>>> The amount of diff surely depends on the benchmark; your bench should 
>>> "zoom" the problem by not having much other work polluting the 
>>> measurement. But in my own test with b98 (32-bit), Q6600 CPU, I've 
>>> got 5065/5079, so the diff is < 1%. The excessive disadvantage you 
>>> report may be something bad in your older b84.
>>>
>>> Anyway I investigated the JIT-compiled code, details in the end (I've 
>>> split the benchmark in two classes just to make the comparison 
>>> simpler). The problem with "a[less++]" is that "less++" will first 
>>> increment "less", _then_ it will use the old value (not-incremented) 
>>> to index "a". C1 generates code that is equivalent to:
>>>
>>> int less_incremented = less + 1;
>>> a[less] = 0;
>>> less = less_incremented;
>>>
>>> ...which is a 1-to-1 translation of the IR coming off the bytecode. 
>>> C1 is not smart enough to see that the increment can be reordered  
>>> after the indexing, maybe because there's a data dependency as the 
>>> indexing uses "less"; but due to the semantics of postfix "++" this 
>>> dependency is for the before-increment value, so the reordering would 
>>> be safe. Maybe that's just some simple missing heuristics that could 
>>> be easily added?
>>>
>>>     there is no difference between "a[less++]" and "a[less], less++".
>>>
>>> C2 generates completely different code,with 16X unrolling - this is 
>>> the inner loop:
>>>
>>>   0x026a6e40: pxor   %xmm0,%xmm0        ;*aload_0
>>>                                         ; - Test1::sort1 at 9 (line 23)
>>>   0x026a6e44: movq   %xmm0,0xc(%ecx,%esi,4)
>>>   0x026a6e4a: movq   %xmm0,0x14(%ecx,%esi,4)
>>>   0x026a6e50: movq   %xmm0,0x1c(%ecx,%esi,4)
>>>   0x026a6e56: movq   %xmm0,0x24(%ecx,%esi,4)
>>>   0x026a6e5c: movq   %xmm0,0x2c(%ecx,%esi,4)
>>>   0x026a6e62: movq   %xmm0,0x34(%ecx,%esi,4)
>>>   0x026a6e68: movq   %xmm0,0x3c(%ecx,%esi,4)
>>>   0x026a6e6e: movq   %xmm0,0x44(%ecx,%esi,4)  ;*iastore
>>>                                         ; - Test1::sort1 at 21 (line 24)
>>>   0x026a6e74: add    $0x10,%esi         ;*iinc
>>>                                         ; - Test1::sort1 at 22 (line 22)
>>>   0x026a6e77: cmp    %ebp,%esi
>>>   0x026a6e79: jl     0x026a6e44
>>>
>>> There is some extra slow-path code to fill the remaining 1...15 
>>> elements if the loop length is not multiple of 16, and that's all. C2 
>>> detects the redundancy between the "k" and "less" vars, and kills the 
>>> also-redundant "a[k] = a[less]" assignment so the net result is a 
>>> simple zero-fill of the array. Maybe a different benchmark without 
>>> these redundancies would make easier to see that C2 doesn't have a 
>>> problem with the postfix "++", but if it had, I think it wouldn't 
>>> reach the excellent result above.
>>>
>>> A+
>>> Osvaldo
>>>
>>>
>>>     I'm using JDK 7 on Windows XP:
>>>
>>>     java version "1.7.0-ea"
>>>     Java(TM) SE Runtime Environment (build 1.7.0-ea-b84)
>>>     Java HotSpot(TM) Client VM (build 17.0-b09, mixed mode, sharing)
>>>
>>>     Thanks,
>>>     Vladimir
>>>
>>>
>>>
>>> This is the C1 code for sort2()'s loop:
>>>
>>>   0x0251c1dc: cmp    0x8(%ecx),%esi     ; implicit exception: 
>>> dispatches to 0x0251c21e
>>>   ;;   30 branch [AE] [RangeCheckStub: 0x454c640] [bci:13]
>>>   0x0251c1df: jae    0x0251c24a
>>>   0x0251c1e5: mov    0xc(%ecx,%esi,4),%ebx  ;*iaload: %ebx = a[less];
>>>                                         ; - Test2::sort2 at 13 (line 23)
>>>   0x0251c1e9: cmp    0x8(%ecx),%edi
>>>   ;;   36 branch [AE] [RangeCheckStub: 0x454c7e0] [bci:14]
>>>   0x0251c1ec: jae    0x0251c263
>>>   0x0251c1f2: mov    %ebx,0xc(%ecx,%edi,4)  ;*iastore: a[k] = %ebx;
>>>                                         ; - Test2::sort2 at 14 (line 23)
>>>
>>> (sort1/sort2 start to differ here)
>>>
>>>   0x0251c1f6: cmp    0x8(%ecx),%esi
>>>   ;;   42 branch [AE] [RangeCheckStub: 0x454c980] [bci:18]
>>>   0x0251c1f9: jae    0x0251c27c
>>>   0x0251c1ff: movl   $0x0,0xc(%ecx,%esi,4)  ;*iastore: a[less] = 0;
>>>                                         ; - Test2::sort2 at 18 (line 24)
>>>   0x0251c207: inc    %esi               ; ++less;
>>>   0x0251c208: inc    %edi               ; OopMap{ecx=Oop off=73}
>>>                                         ;*goto: for k++
>>>                                         ; - Test2::sort2 at 25 (line 22)
>>>   0x0251c209: test   %eax,0x1a0100      ;*goto
>>>                                         ; - Test2::sort2 at 25 (line 22)
>>>                                         ;   {poll}
>>>   ;;  block B1 [4, 6]
>>>
>>>   0x0251c20f: cmp    %edx,%edi
>>>   ;;   22 branch [LT] [B2]
>>>   0x0251c211: jl     0x0251c1dc         ;*if_icmpge: for k < right
>>>                                         ; - Test2::sort2 at 6 (line 22)
>>>
>>> The code looks OK; C1 doesn't do much optimization - no unrolling, 
>>> bounds check elimination - but it's otherwise just about the code you 
>>> would expect for a simple JITting.
>>>                    Now, C1 code for sort1()'s loop:
>>>
>>>   0x024bc21c: cmp    0x8(%ecx),%esi     ; implicit exception: 
>>> dispatches to 0x024bc262
>>>   ;;   30 branch [AE] [RangeCheckStub: 0x44ee3b0] [bci:13]
>>>   0x024bc21f: jae    0x024bc28e
>>>   0x024bc225: mov    0xc(%ecx,%esi,4),%ebx  ;*iaload: %ebx = a[less];
>>>                                         ; - Test1::sort1 at 13 (line 23)
>>>   0x024bc229: cmp    0x8(%ecx),%edi
>>>   ;;   36 branch [AE] [RangeCheckStub: 0x44ee550] [bci:14]
>>>   0x024bc22c: jae    0x024bc2a7
>>>   0x024bc232: mov    %ebx,0xc(%ecx,%edi,4)  ;*iastore: a[k] = %ebx;
>>>                                         ; - Test1::sort1 at 14 (line 23)
>>>
>>> (sort1/sort2 start to differ here)
>>>
>>>   0x024bc236: mov    %esi,%ebx          ; Crap! C1 temps 'less' into 
>>> %ebx
>>>   0x024bc238: inc    %ebx               ; ++less; (for the temp "less 
>>> from future")
>>>
>>>   0x024bc239: cmp    0x8(%ecx),%esi     ; %esi is still the "old 
>>> less"....
>>>   ;;   46 branch [AE] [RangeCheckStub: 0x44ee7b8] [bci:21]
>>>   0x024bc23c: jae    0x024bc2c0
>>>   0x024bc242: movl   $0x0,0xc(%ecx,%esi,4)  ;*iastore: a[less++] = 0;
>>>                                         ; - Test1::sort1 at 21 (line 24)
>>>   0x024bc24a: inc    %edi               ; OopMap{ecx=Oop off=75}
>>>                                         ;*goto: for k++
>>>                                         ; - Test1::sort1 at 25 (line 22)
>>>   0x024bc24b: test   %eax,0x1a0100      ;   {poll}
>>>   0x024bc251: mov    %ebx,%esi          ;*goto
>>>                                         ; - Test1::sort1 at 25 (line 
>>> 22): for...
>>>   ;;  block B1 [4, 6]
>>>
>>>   0x024bc253: cmp    %edx,%edi
>>>   ;;   22 branch [LT] [B2]
>>>   0x024bc255: jl     0x024bc21c         ;*if_icmpge
>>>                                         ; - Test1::sort1 at 6 (line 22): 
>>> for...