New portion of improvements for Dual-Pivot Quicksort
Osvaldo Doederlein
opinali at gmail.com
Tue Jul 6 18:42:14 PDT 2010
The ironic thing is that these operators were originally introduced in the C
language, at least according to well-known folklore, as an optimization -
taking advantage of inc/dec instructions from the PDP arch, in the old times
when (I suppose) compilers wouldn't do even basic strength-reduction
optimizations like x + 1 => inc x. But this works well for pre-inc/dec, but
post-increment and post-decrement have a more complex semantics that
requires inducing an extra temporary if the compiler is not smart enough.
My personal coding style, since my days of C, enforces pre- instead of post-
operators; I will always write ++i, never i++, when both choices have the
same effect. This is exactly because it's the post- operators that have
confusing and error-prone semantics, as soon as the inc/dec operation is
used as RHS of a bigger expression (or used a parameter etc.), and not as an
isolated statement. I will use post- operators only when I really need its
semantics to make my code a little extra terse. And now we discover another
gotcha of post- operators: potentially less efficient code generation. It's
worth noticing this, because I the post- operators seem to be the preferred
style for most people, they can even be found in many Java style
guidelines... I wonder why.
Perhaps it's just the name of the C++ language that has taught a generation
to like this wrong style? In that case, the blame goes to Stroustrup, he
should have named it "++C" instead. That would get extra nerd points too ;-)
A+
Osvaldo
2010/7/6 David Holmes <David.Holmes at oracle.com>
> I asked someone to take a look at this and it seems that the problem is
> that a[less++] requires introduction of a temporary to store the
> pre-incremented index value which in turn increases register pressure and
> can lead to a register spill. This seems to not only be architecture
> dependent but even variable on the same architecture (as observed below).
>
> As Osvaldo noted C1 does not attempt to decide whether the increment can be
> reordered with the access and so allow removal of the temporary. You'd need
> the compiler folk (C1/C2) to comment on whether or not it would be
> reasonable/possible/practical for C1 to do that - so I've cc'ed them (though
> I'm not a member of their list so it may be a little while before it gets
> pushed through to them.)
>
> David Holmes
>
> Vladimir Iaroslavski said the following on 06/22/10 23:24:
>
> both computers with Intel processor:
>>
>> Pentium, Intel, 4 CPU, 3.2 GHz, 2 Gb RAM
>> Pentium, Intel, 4 CPU, 2.8 GHz, 2 Gb RAM
>>
>> Osvaldo Pinali Doederlein wrote:
>>
>>> Hi Vladimir,
>>>
>>> Hello,
>>>>
>>>> I tried with the latest JDK, build 98 and see different behaviour
>>>> on two computers: 7570 / 8318 and 8560 / 8590, but sorting method
>>>> works slower with a[less++] instruction on both computers:
>>>>
>>>
>>> Is the first computer (with larger performance diff) an AMD by any
>>> chance? It's odd that just the a[less++] would make the code so much slower.
>>> Perhaps the worst compiled code has additional costs in some CPUs, e.g.
>>> spoiling branch prediction for the bounds checking guards.
>>>
>>> Anyway the change is risk-free (including performance) and the advantage
>>> is important at least in some CPUs, so go ahead (FWIW wrt my opinion...). I
>>> just wish that C1 would be fixed to not need this kind of hacking - as I
>>> categorize this as a hack, not even as a "normal" low-level Java code
>>> optimization - because there are certainly millions of uses of the idiom
>>> "array[index++/--]" in JavaSE APIs and elsewhere. But I'm not familiar with
>>> the C1 source so I don't know if this is some low hanging fruit that could
>>> be addressed quickly (any HotSpot engineers there to comment?).
>>>
>>> A+
>>> Osvaldo
>>>
>>> first second
>>>> a[less] = ak; less++; / (a[less++] = ak;
>>>>
>>>> random: 16371 / 16696 14018 / 14326
>>>> ascendant: 2706 / 2762 2884 / 2897
>>>> descendant: 2994 / 3108 3170 / 3258
>>>> organ pipes: 7073 / 7296 6939 / 7090
>>>> stagger(2): 7765 / 8069 7531 / 7731
>>>> period(1,2): 653 / 743 719 / 753
>>>> random(1..4): 2152 / 2234 1567 / 1591
>>>>
>>>> If I change Test class and populate the array with descendant
>>>> values, I still see difference in times: 4793 / 5718
>>>> see updated attached Test class.
>>>>
>>>> Also I'm attaching the latest version of DualPivotQuicksort
>>>> with minor format changes. If you don't have comments, I'll
>>>> ask to to integrate the code at the end of this week.
>>>>
>>>> Thank you,
>>>> Vladimir
>>>>
>>>> Osvaldo Doederlein wrote:
>>>>
>>>>> Hi Vladimir,
>>>>>
>>>>> 2010/6/19 Vladimir Iaroslavski <iaroslavski at mail.ru <mailto:
>>>>> iaroslavski at mail.ru>>
>>>>>
>>>>> Hello Osvaldo,
>>>>>
>>>>> I've prepared simple test which scans an array and does assignments
>>>>> for each element,
>>>>> see attached Test class:
>>>>>
>>>>> a[k] = a[less];
>>>>> a[less++] = 0; // or a[less] = 0; less++;
>>>>>
>>>>> The result of running "java -client Test" is:
>>>>>
>>>>> a[less], less++; Time: 6998
>>>>> a[less++]; Time: 8416
>>>>>
>>>>> It is much more than 1%. Is it bug in JVM? Note that under server VM
>>>>>
>>>>> The amount of diff surely depends on the benchmark; your bench should
>>>>> "zoom" the problem by not having much other work polluting the measurement.
>>>>> But in my own test with b98 (32-bit), Q6600 CPU, I've got 5065/5079, so the
>>>>> diff is < 1%. The excessive disadvantage you report may be something bad in
>>>>> your older b84.
>>>>>
>>>>> Anyway I investigated the JIT-compiled code, details in the end (I've
>>>>> split the benchmark in two classes just to make the comparison simpler). The
>>>>> problem with "a[less++]" is that "less++" will first increment "less",
>>>>> _then_ it will use the old value (not-incremented) to index "a". C1
>>>>> generates code that is equivalent to:
>>>>>
>>>>> int less_incremented = less + 1;
>>>>> a[less] = 0;
>>>>> less = less_incremented;
>>>>>
>>>>> ...which is a 1-to-1 translation of the IR coming off the bytecode. C1
>>>>> is not smart enough to see that the increment can be reordered after the
>>>>> indexing, maybe because there's a data dependency as the indexing uses
>>>>> "less"; but due to the semantics of postfix "++" this dependency is for the
>>>>> before-increment value, so the reordering would be safe. Maybe that's just
>>>>> some simple missing heuristics that could be easily added?
>>>>>
>>>>> there is no difference between "a[less++]" and "a[less], less++".
>>>>>
>>>>> C2 generates completely different code,with 16X unrolling - this is the
>>>>> inner loop:
>>>>>
>>>>> 0x026a6e40: pxor %xmm0,%xmm0 ;*aload_0
>>>>> ; - Test1::sort1 at 9 (line 23)
>>>>> 0x026a6e44: movq %xmm0,0xc(%ecx,%esi,4)
>>>>> 0x026a6e4a: movq %xmm0,0x14(%ecx,%esi,4)
>>>>> 0x026a6e50: movq %xmm0,0x1c(%ecx,%esi,4)
>>>>> 0x026a6e56: movq %xmm0,0x24(%ecx,%esi,4)
>>>>> 0x026a6e5c: movq %xmm0,0x2c(%ecx,%esi,4)
>>>>> 0x026a6e62: movq %xmm0,0x34(%ecx,%esi,4)
>>>>> 0x026a6e68: movq %xmm0,0x3c(%ecx,%esi,4)
>>>>> 0x026a6e6e: movq %xmm0,0x44(%ecx,%esi,4) ;*iastore
>>>>> ; - Test1::sort1 at 21 (line 24)
>>>>> 0x026a6e74: add $0x10,%esi ;*iinc
>>>>> ; - Test1::sort1 at 22 (line 22)
>>>>> 0x026a6e77: cmp %ebp,%esi
>>>>> 0x026a6e79: jl 0x026a6e44
>>>>>
>>>>> There is some extra slow-path code to fill the remaining 1...15
>>>>> elements if the loop length is not multiple of 16, and that's all. C2
>>>>> detects the redundancy between the "k" and "less" vars, and kills the
>>>>> also-redundant "a[k] = a[less]" assignment so the net result is a simple
>>>>> zero-fill of the array. Maybe a different benchmark without these
>>>>> redundancies would make easier to see that C2 doesn't have a problem with
>>>>> the postfix "++", but if it had, I think it wouldn't reach the excellent
>>>>> result above.
>>>>>
>>>>> A+
>>>>> Osvaldo
>>>>>
>>>>>
>>>>> I'm using JDK 7 on Windows XP:
>>>>>
>>>>> java version "1.7.0-ea"
>>>>> Java(TM) SE Runtime Environment (build 1.7.0-ea-b84)
>>>>> Java HotSpot(TM) Client VM (build 17.0-b09, mixed mode, sharing)
>>>>>
>>>>> Thanks,
>>>>> Vladimir
>>>>>
>>>>>
>>>>>
>>>>> This is the C1 code for sort2()'s loop:
>>>>>
>>>>> 0x0251c1dc: cmp 0x8(%ecx),%esi ; implicit exception: dispatches
>>>>> to 0x0251c21e
>>>>> ;; 30 branch [AE] [RangeCheckStub: 0x454c640] [bci:13]
>>>>> 0x0251c1df: jae 0x0251c24a
>>>>> 0x0251c1e5: mov 0xc(%ecx,%esi,4),%ebx ;*iaload: %ebx = a[less];
>>>>> ; - Test2::sort2 at 13 (line 23)
>>>>> 0x0251c1e9: cmp 0x8(%ecx),%edi
>>>>> ;; 36 branch [AE] [RangeCheckStub: 0x454c7e0] [bci:14]
>>>>> 0x0251c1ec: jae 0x0251c263
>>>>> 0x0251c1f2: mov %ebx,0xc(%ecx,%edi,4) ;*iastore: a[k] = %ebx;
>>>>> ; - Test2::sort2 at 14 (line 23)
>>>>>
>>>>> (sort1/sort2 start to differ here)
>>>>>
>>>>> 0x0251c1f6: cmp 0x8(%ecx),%esi
>>>>> ;; 42 branch [AE] [RangeCheckStub: 0x454c980] [bci:18]
>>>>> 0x0251c1f9: jae 0x0251c27c
>>>>> 0x0251c1ff: movl $0x0,0xc(%ecx,%esi,4) ;*iastore: a[less] = 0;
>>>>> ; - Test2::sort2 at 18 (line 24)
>>>>> 0x0251c207: inc %esi ; ++less;
>>>>> 0x0251c208: inc %edi ; OopMap{ecx=Oop off=73}
>>>>> ;*goto: for k++
>>>>> ; - Test2::sort2 at 25 (line 22)
>>>>> 0x0251c209: test %eax,0x1a0100 ;*goto
>>>>> ; - Test2::sort2 at 25 (line 22)
>>>>> ; {poll}
>>>>> ;; block B1 [4, 6]
>>>>>
>>>>> 0x0251c20f: cmp %edx,%edi
>>>>> ;; 22 branch [LT] [B2]
>>>>> 0x0251c211: jl 0x0251c1dc ;*if_icmpge: for k < right
>>>>> ; - Test2::sort2 at 6 (line 22)
>>>>>
>>>>> The code looks OK; C1 doesn't do much optimization - no unrolling,
>>>>> bounds check elimination - but it's otherwise just about the code you would
>>>>> expect for a simple JITting.
>>>>> Now, C1 code for sort1()'s loop:
>>>>>
>>>>> 0x024bc21c: cmp 0x8(%ecx),%esi ; implicit exception: dispatches
>>>>> to 0x024bc262
>>>>> ;; 30 branch [AE] [RangeCheckStub: 0x44ee3b0] [bci:13]
>>>>> 0x024bc21f: jae 0x024bc28e
>>>>> 0x024bc225: mov 0xc(%ecx,%esi,4),%ebx ;*iaload: %ebx = a[less];
>>>>> ; - Test1::sort1 at 13 (line 23)
>>>>> 0x024bc229: cmp 0x8(%ecx),%edi
>>>>> ;; 36 branch [AE] [RangeCheckStub: 0x44ee550] [bci:14]
>>>>> 0x024bc22c: jae 0x024bc2a7
>>>>> 0x024bc232: mov %ebx,0xc(%ecx,%edi,4) ;*iastore: a[k] = %ebx;
>>>>> ; - Test1::sort1 at 14 (line 23)
>>>>>
>>>>> (sort1/sort2 start to differ here)
>>>>>
>>>>> 0x024bc236: mov %esi,%ebx ; Crap! C1 temps 'less' into
>>>>> %ebx
>>>>> 0x024bc238: inc %ebx ; ++less; (for the temp "less
>>>>> from future")
>>>>>
>>>>> 0x024bc239: cmp 0x8(%ecx),%esi ; %esi is still the "old
>>>>> less"....
>>>>> ;; 46 branch [AE] [RangeCheckStub: 0x44ee7b8] [bci:21]
>>>>> 0x024bc23c: jae 0x024bc2c0
>>>>> 0x024bc242: movl $0x0,0xc(%ecx,%esi,4) ;*iastore: a[less++] = 0;
>>>>> ; - Test1::sort1 at 21 (line 24)
>>>>> 0x024bc24a: inc %edi ; OopMap{ecx=Oop off=75}
>>>>> ;*goto: for k++
>>>>> ; - Test1::sort1 at 25 (line 22)
>>>>> 0x024bc24b: test %eax,0x1a0100 ; {poll}
>>>>> 0x024bc251: mov %ebx,%esi ;*goto
>>>>> ; - Test1::sort1 at 25 (line 22):
>>>>> for...
>>>>> ;; block B1 [4, 6]
>>>>>
>>>>> 0x024bc253: cmp %edx,%edi
>>>>> ;; 22 branch [LT] [B2]
>>>>> 0x024bc255: jl 0x024bc21c ;*if_icmpge
>>>>> ; - Test1::sort1 at 6 (line 22):
>>>>> for...
>>>>>
>>>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/attachments/20100706/975f9c80/attachment.html
More information about the hotspot-compiler-dev
mailing list