New portion of improvements for Dual-Pivot Quicksort
Vladimir Iaroslavski
iaroslavski at mail.ru
Tue Jun 22 13:24:08 UTC 2010
both computers with Intel processor:
Pentium, Intel, 4 CPU, 3.2 GHz, 2 Gb RAM
Pentium, Intel, 4 CPU, 2.8 GHz, 2 Gb RAM
Osvaldo Pinali Doederlein wrote:
> Hi Vladimir,
>
>> Hello,
>>
>> I tried with the latest JDK, build 98 and see different behaviour
>> on two computers: 7570 / 8318 and 8560 / 8590, but sorting method
>> works slower with a[less++] instruction on both computers:
>
> Is the first computer (with larger performance diff) an AMD by any
> chance? It's odd that just the a[less++] would make the code so much
> slower. Perhaps the worst compiled code has additional costs in some
> CPUs, e.g. spoiling branch prediction for the bounds checking guards.
>
> Anyway the change is risk-free (including performance) and the advantage
> is important at least in some CPUs, so go ahead (FWIW wrt my
> opinion...). I just wish that C1 would be fixed to not need this kind of
> hacking - as I categorize this as a hack, not even as a "normal"
> low-level Java code optimization - because there are certainly millions
> of uses of the idiom "array[index++/--]" in JavaSE APIs and elsewhere.
> But I'm not familiar with the C1 source so I don't know if this is some
> low hanging fruit that could be addressed quickly (any HotSpot engineers
> there to comment?).
>
> A+
> Osvaldo
>
>> first second
>> a[less] = ak; less++; / (a[less++] = ak;
>>
>> random: 16371 / 16696 14018 / 14326
>> ascendant: 2706 / 2762 2884 / 2897
>> descendant: 2994 / 3108 3170 / 3258
>> organ pipes: 7073 / 7296 6939 / 7090
>> stagger(2): 7765 / 8069 7531 / 7731
>> period(1,2): 653 / 743 719 / 753
>> random(1..4): 2152 / 2234 1567 / 1591
>>
>> If I change Test class and populate the array with descendant
>> values, I still see difference in times: 4793 / 5718
>> see updated attached Test class.
>>
>> Also I'm attaching the latest version of DualPivotQuicksort
>> with minor format changes. If you don't have comments, I'll
>> ask to to integrate the code at the end of this week.
>>
>> Thank you,
>> Vladimir
>>
>> Osvaldo Doederlein wrote:
>>> Hi Vladimir,
>>>
>>> 2010/6/19 Vladimir Iaroslavski <iaroslavski at mail.ru
>>> <mailto:iaroslavski at mail.ru>>
>>>
>>> Hello Osvaldo,
>>>
>>> I've prepared simple test which scans an array and does assignments
>>> for each element,
>>> see attached Test class:
>>>
>>> a[k] = a[less];
>>> a[less++] = 0; // or a[less] = 0; less++;
>>>
>>> The result of running "java -client Test" is:
>>>
>>> a[less], less++; Time: 6998
>>> a[less++]; Time: 8416
>>>
>>> It is much more than 1%. Is it bug in JVM? Note that under server VM
>>>
>>> The amount of diff surely depends on the benchmark; your bench should
>>> "zoom" the problem by not having much other work polluting the
>>> measurement. But in my own test with b98 (32-bit), Q6600 CPU, I've
>>> got 5065/5079, so the diff is < 1%. The excessive disadvantage you
>>> report may be something bad in your older b84.
>>>
>>> Anyway I investigated the JIT-compiled code, details in the end (I've
>>> split the benchmark in two classes just to make the comparison
>>> simpler). The problem with "a[less++]" is that "less++" will first
>>> increment "less", _then_ it will use the old value (not-incremented)
>>> to index "a". C1 generates code that is equivalent to:
>>>
>>> int less_incremented = less + 1;
>>> a[less] = 0;
>>> less = less_incremented;
>>>
>>> ...which is a 1-to-1 translation of the IR coming off the bytecode.
>>> C1 is not smart enough to see that the increment can be reordered
>>> after the indexing, maybe because there's a data dependency as the
>>> indexing uses "less"; but due to the semantics of postfix "++" this
>>> dependency is for the before-increment value, so the reordering would
>>> be safe. Maybe that's just some simple missing heuristics that could
>>> be easily added?
>>>
>>> there is no difference between "a[less++]" and "a[less], less++".
>>>
>>> C2 generates completely different code,with 16X unrolling - this is
>>> the inner loop:
>>>
>>> 0x026a6e40: pxor %xmm0,%xmm0 ;*aload_0
>>> ; - Test1::sort1 at 9 (line 23)
>>> 0x026a6e44: movq %xmm0,0xc(%ecx,%esi,4)
>>> 0x026a6e4a: movq %xmm0,0x14(%ecx,%esi,4)
>>> 0x026a6e50: movq %xmm0,0x1c(%ecx,%esi,4)
>>> 0x026a6e56: movq %xmm0,0x24(%ecx,%esi,4)
>>> 0x026a6e5c: movq %xmm0,0x2c(%ecx,%esi,4)
>>> 0x026a6e62: movq %xmm0,0x34(%ecx,%esi,4)
>>> 0x026a6e68: movq %xmm0,0x3c(%ecx,%esi,4)
>>> 0x026a6e6e: movq %xmm0,0x44(%ecx,%esi,4) ;*iastore
>>> ; - Test1::sort1 at 21 (line 24)
>>> 0x026a6e74: add $0x10,%esi ;*iinc
>>> ; - Test1::sort1 at 22 (line 22)
>>> 0x026a6e77: cmp %ebp,%esi
>>> 0x026a6e79: jl 0x026a6e44
>>>
>>> There is some extra slow-path code to fill the remaining 1...15
>>> elements if the loop length is not multiple of 16, and that's all. C2
>>> detects the redundancy between the "k" and "less" vars, and kills the
>>> also-redundant "a[k] = a[less]" assignment so the net result is a
>>> simple zero-fill of the array. Maybe a different benchmark without
>>> these redundancies would make easier to see that C2 doesn't have a
>>> problem with the postfix "++", but if it had, I think it wouldn't
>>> reach the excellent result above.
>>>
>>> A+
>>> Osvaldo
>>>
>>>
>>> I'm using JDK 7 on Windows XP:
>>>
>>> java version "1.7.0-ea"
>>> Java(TM) SE Runtime Environment (build 1.7.0-ea-b84)
>>> Java HotSpot(TM) Client VM (build 17.0-b09, mixed mode, sharing)
>>>
>>> Thanks,
>>> Vladimir
>>>
>>>
>>>
>>> This is the C1 code for sort2()'s loop:
>>>
>>> 0x0251c1dc: cmp 0x8(%ecx),%esi ; implicit exception:
>>> dispatches to 0x0251c21e
>>> ;; 30 branch [AE] [RangeCheckStub: 0x454c640] [bci:13]
>>> 0x0251c1df: jae 0x0251c24a
>>> 0x0251c1e5: mov 0xc(%ecx,%esi,4),%ebx ;*iaload: %ebx = a[less];
>>> ; - Test2::sort2 at 13 (line 23)
>>> 0x0251c1e9: cmp 0x8(%ecx),%edi
>>> ;; 36 branch [AE] [RangeCheckStub: 0x454c7e0] [bci:14]
>>> 0x0251c1ec: jae 0x0251c263
>>> 0x0251c1f2: mov %ebx,0xc(%ecx,%edi,4) ;*iastore: a[k] = %ebx;
>>> ; - Test2::sort2 at 14 (line 23)
>>>
>>> (sort1/sort2 start to differ here)
>>>
>>> 0x0251c1f6: cmp 0x8(%ecx),%esi
>>> ;; 42 branch [AE] [RangeCheckStub: 0x454c980] [bci:18]
>>> 0x0251c1f9: jae 0x0251c27c
>>> 0x0251c1ff: movl $0x0,0xc(%ecx,%esi,4) ;*iastore: a[less] = 0;
>>> ; - Test2::sort2 at 18 (line 24)
>>> 0x0251c207: inc %esi ; ++less;
>>> 0x0251c208: inc %edi ; OopMap{ecx=Oop off=73}
>>> ;*goto: for k++
>>> ; - Test2::sort2 at 25 (line 22)
>>> 0x0251c209: test %eax,0x1a0100 ;*goto
>>> ; - Test2::sort2 at 25 (line 22)
>>> ; {poll}
>>> ;; block B1 [4, 6]
>>>
>>> 0x0251c20f: cmp %edx,%edi
>>> ;; 22 branch [LT] [B2]
>>> 0x0251c211: jl 0x0251c1dc ;*if_icmpge: for k < right
>>> ; - Test2::sort2 at 6 (line 22)
>>>
>>> The code looks OK; C1 doesn't do much optimization - no unrolling,
>>> bounds check elimination - but it's otherwise just about the code you
>>> would expect for a simple JITting.
>>> Now, C1 code for sort1()'s loop:
>>>
>>> 0x024bc21c: cmp 0x8(%ecx),%esi ; implicit exception:
>>> dispatches to 0x024bc262
>>> ;; 30 branch [AE] [RangeCheckStub: 0x44ee3b0] [bci:13]
>>> 0x024bc21f: jae 0x024bc28e
>>> 0x024bc225: mov 0xc(%ecx,%esi,4),%ebx ;*iaload: %ebx = a[less];
>>> ; - Test1::sort1 at 13 (line 23)
>>> 0x024bc229: cmp 0x8(%ecx),%edi
>>> ;; 36 branch [AE] [RangeCheckStub: 0x44ee550] [bci:14]
>>> 0x024bc22c: jae 0x024bc2a7
>>> 0x024bc232: mov %ebx,0xc(%ecx,%edi,4) ;*iastore: a[k] = %ebx;
>>> ; - Test1::sort1 at 14 (line 23)
>>>
>>> (sort1/sort2 start to differ here)
>>>
>>> 0x024bc236: mov %esi,%ebx ; Crap! C1 temps 'less' into
>>> %ebx
>>> 0x024bc238: inc %ebx ; ++less; (for the temp "less
>>> from future")
>>>
>>> 0x024bc239: cmp 0x8(%ecx),%esi ; %esi is still the "old
>>> less"....
>>> ;; 46 branch [AE] [RangeCheckStub: 0x44ee7b8] [bci:21]
>>> 0x024bc23c: jae 0x024bc2c0
>>> 0x024bc242: movl $0x0,0xc(%ecx,%esi,4) ;*iastore: a[less++] = 0;
>>> ; - Test1::sort1 at 21 (line 24)
>>> 0x024bc24a: inc %edi ; OopMap{ecx=Oop off=75}
>>> ;*goto: for k++
>>> ; - Test1::sort1 at 25 (line 22)
>>> 0x024bc24b: test %eax,0x1a0100 ; {poll}
>>> 0x024bc251: mov %ebx,%esi ;*goto
>>> ; - Test1::sort1 at 25 (line
>>> 22): for...
>>> ;; block B1 [4, 6]
>>>
>>> 0x024bc253: cmp %edx,%edi
>>> ;; 22 branch [LT] [B2]
>>> 0x024bc255: jl 0x024bc21c ;*if_icmpge
>>> ; - Test1::sort1 at 6 (line 22):
>>> for...
More information about the core-libs-dev
mailing list