New portion of improvements for Dual-Pivot Quicksort
Tom Rodriguez
tom.rodriguez at oracle.com
Tue Jul 6 20:27:22 PDT 2010
I actually looked into this in some detail and the reason for the difference in C1 is two fold. First the state of the JVM at array load is slightly different for a[i++] vs. a[i], i++. In the first case the top state has a copy of the original i but the local for i has i+1. In the second there is only i. This increases the register pressure since we C1 currently keeps all the locals live in the JVM state for the possible exception for the range check. C1 doesn't always need to keep those values live and there is a workspace where where we are more aggressive about killing unneeded locals in the states used for exceptions. Those changes are probably (hopefully) coming to hs19.
Unfortunately there's another reason for this specific example that we force both i and i + 1 to be live simultaneously. C1 doesn't do any cross block scheduling so values which are computed in one block but only used in a later one are evaluated in their original bytecode order relative to other bytecodes which could cause exceptions. This again means we evaluated i + 1 earlier than we absolutely have to since i + 1 is live out of the block into a phi. This is relatively easy to fix by replacing the pin_for_linear_scan logic with an explicit walk of the final ValueStack to evaluate anything which is live out but not yet evaluated. That way values which are live out but unused within the block are evaluated last. This gives you code that's exactly the same for both variants.
So that's the long explanation. My recommendation is to continue to write the code in a natural way.
tom
On Jul 6, 2010, at 6:42 PM, Osvaldo Doederlein wrote:
> The ironic thing is that these operators were originally introduced in the C language, at least according to well-known folklore, as an optimization - taking advantage of inc/dec instructions from the PDP arch, in the old times when (I suppose) compilers wouldn't do even basic strength-reduction optimizations like x + 1 => inc x. But this works well for pre-inc/dec, but post-increment and post-decrement have a more complex semantics that requires inducing an extra temporary if the compiler is not smart enough.
>
> My personal coding style, since my days of C, enforces pre- instead of post- operators; I will always write ++i, never i++, when both choices have the same effect. This is exactly because it's the post- operators that have confusing and error-prone semantics, as soon as the inc/dec operation is used as RHS of a bigger expression (or used a parameter etc.), and not as an isolated statement. I will use post- operators only when I really need its semantics to make my code a little extra terse. And now we discover another gotcha of post- operators: potentially less efficient code generation. It's worth noticing this, because I the post- operators seem to be the preferred style for most people, they can even be found in many Java style guidelines... I wonder why.
>
> Perhaps it's just the name of the C++ language that has taught a generation to like this wrong style? In that case, the blame goes to Stroustrup, he should have named it "++C" instead. That would get extra nerd points too ;-)
>
> A+
> Osvaldo
>
> 2010/7/6 David Holmes <David.Holmes at oracle.com>
> I asked someone to take a look at this and it seems that the problem is that a[less++] requires introduction of a temporary to store the pre-incremented index value which in turn increases register pressure and can lead to a register spill. This seems to not only be architecture dependent but even variable on the same architecture (as observed below).
>
> As Osvaldo noted C1 does not attempt to decide whether the increment can be reordered with the access and so allow removal of the temporary. You'd need the compiler folk (C1/C2) to comment on whether or not it would be reasonable/possible/practical for C1 to do that - so I've cc'ed them (though I'm not a member of their list so it may be a little while before it gets pushed through to them.)
>
> David Holmes
>
> Vladimir Iaroslavski said the following on 06/22/10 23:24:
>
> both computers with Intel processor:
>
> Pentium, Intel, 4 CPU, 3.2 GHz, 2 Gb RAM
> Pentium, Intel, 4 CPU, 2.8 GHz, 2 Gb RAM
>
> Osvaldo Pinali Doederlein wrote:
> Hi Vladimir,
>
> Hello,
>
> I tried with the latest JDK, build 98 and see different behaviour
> on two computers: 7570 / 8318 and 8560 / 8590, but sorting method
> works slower with a[less++] instruction on both computers:
>
> Is the first computer (with larger performance diff) an AMD by any chance? It's odd that just the a[less++] would make the code so much slower. Perhaps the worst compiled code has additional costs in some CPUs, e.g. spoiling branch prediction for the bounds checking guards.
>
> Anyway the change is risk-free (including performance) and the advantage is important at least in some CPUs, so go ahead (FWIW wrt my opinion...). I just wish that C1 would be fixed to not need this kind of hacking - as I categorize this as a hack, not even as a "normal" low-level Java code optimization - because there are certainly millions of uses of the idiom "array[index++/--]" in JavaSE APIs and elsewhere. But I'm not familiar with the C1 source so I don't know if this is some low hanging fruit that could be addressed quickly (any HotSpot engineers there to comment?).
>
> A+
> Osvaldo
>
> first second
> a[less] = ak; less++; / (a[less++] = ak;
>
> random: 16371 / 16696 14018 / 14326
> ascendant: 2706 / 2762 2884 / 2897
> descendant: 2994 / 3108 3170 / 3258
> organ pipes: 7073 / 7296 6939 / 7090
> stagger(2): 7765 / 8069 7531 / 7731
> period(1,2): 653 / 743 719 / 753
> random(1..4): 2152 / 2234 1567 / 1591
>
> If I change Test class and populate the array with descendant
> values, I still see difference in times: 4793 / 5718
> see updated attached Test class.
>
> Also I'm attaching the latest version of DualPivotQuicksort
> with minor format changes. If you don't have comments, I'll
> ask to to integrate the code at the end of this week.
>
> Thank you,
> Vladimir
>
> Osvaldo Doederlein wrote:
> Hi Vladimir,
>
> 2010/6/19 Vladimir Iaroslavski <iaroslavski at mail.ru <mailto:iaroslavski at mail.ru>>
>
> Hello Osvaldo,
>
> I've prepared simple test which scans an array and does assignments
> for each element,
> see attached Test class:
>
> a[k] = a[less];
> a[less++] = 0; // or a[less] = 0; less++;
>
> The result of running "java -client Test" is:
>
> a[less], less++; Time: 6998
> a[less++]; Time: 8416
>
> It is much more than 1%. Is it bug in JVM? Note that under server VM
>
> The amount of diff surely depends on the benchmark; your bench should "zoom" the problem by not having much other work polluting the measurement. But in my own test with b98 (32-bit), Q6600 CPU, I've got 5065/5079, so the diff is < 1%. The excessive disadvantage you report may be something bad in your older b84.
>
> Anyway I investigated the JIT-compiled code, details in the end (I've split the benchmark in two classes just to make the comparison simpler). The problem with "a[less++]" is that "less++" will first increment "less", _then_ it will use the old value (not-incremented) to index "a". C1 generates code that is equivalent to:
>
> int less_incremented = less + 1;
> a[less] = 0;
> less = less_incremented;
>
> ...which is a 1-to-1 translation of the IR coming off the bytecode. C1 is not smart enough to see that the increment can be reordered after the indexing, maybe because there's a data dependency as the indexing uses "less"; but due to the semantics of postfix "++" this dependency is for the before-increment value, so the reordering would be safe. Maybe that's just some simple missing heuristics that could be easily added?
>
> there is no difference between "a[less++]" and "a[less], less++".
>
> C2 generates completely different code,with 16X unrolling - this is the inner loop:
>
> 0x026a6e40: pxor %xmm0,%xmm0 ;*aload_0
> ; - Test1::sort1 at 9 (line 23)
> 0x026a6e44: movq %xmm0,0xc(%ecx,%esi,4)
> 0x026a6e4a: movq %xmm0,0x14(%ecx,%esi,4)
> 0x026a6e50: movq %xmm0,0x1c(%ecx,%esi,4)
> 0x026a6e56: movq %xmm0,0x24(%ecx,%esi,4)
> 0x026a6e5c: movq %xmm0,0x2c(%ecx,%esi,4)
> 0x026a6e62: movq %xmm0,0x34(%ecx,%esi,4)
> 0x026a6e68: movq %xmm0,0x3c(%ecx,%esi,4)
> 0x026a6e6e: movq %xmm0,0x44(%ecx,%esi,4) ;*iastore
> ; - Test1::sort1 at 21 (line 24)
> 0x026a6e74: add $0x10,%esi ;*iinc
> ; - Test1::sort1 at 22 (line 22)
> 0x026a6e77: cmp %ebp,%esi
> 0x026a6e79: jl 0x026a6e44
>
> There is some extra slow-path code to fill the remaining 1...15 elements if the loop length is not multiple of 16, and that's all. C2 detects the redundancy between the "k" and "less" vars, and kills the also-redundant "a[k] = a[less]" assignment so the net result is a simple zero-fill of the array. Maybe a different benchmark without these redundancies would make easier to see that C2 doesn't have a problem with the postfix "++", but if it had, I think it wouldn't reach the excellent result above.
>
> A+
> Osvaldo
>
>
> I'm using JDK 7 on Windows XP:
>
> java version "1.7.0-ea"
> Java(TM) SE Runtime Environment (build 1.7.0-ea-b84)
> Java HotSpot(TM) Client VM (build 17.0-b09, mixed mode, sharing)
>
> Thanks,
> Vladimir
>
>
>
> This is the C1 code for sort2()'s loop:
>
> 0x0251c1dc: cmp 0x8(%ecx),%esi ; implicit exception: dispatches to 0x0251c21e
> ;; 30 branch [AE] [RangeCheckStub: 0x454c640] [bci:13]
> 0x0251c1df: jae 0x0251c24a
> 0x0251c1e5: mov 0xc(%ecx,%esi,4),%ebx ;*iaload: %ebx = a[less];
> ; - Test2::sort2 at 13 (line 23)
> 0x0251c1e9: cmp 0x8(%ecx),%edi
> ;; 36 branch [AE] [RangeCheckStub: 0x454c7e0] [bci:14]
> 0x0251c1ec: jae 0x0251c263
> 0x0251c1f2: mov %ebx,0xc(%ecx,%edi,4) ;*iastore: a[k] = %ebx;
> ; - Test2::sort2 at 14 (line 23)
>
> (sort1/sort2 start to differ here)
>
> 0x0251c1f6: cmp 0x8(%ecx),%esi
> ;; 42 branch [AE] [RangeCheckStub: 0x454c980] [bci:18]
> 0x0251c1f9: jae 0x0251c27c
> 0x0251c1ff: movl $0x0,0xc(%ecx,%esi,4) ;*iastore: a[less] = 0;
> ; - Test2::sort2 at 18 (line 24)
> 0x0251c207: inc %esi ; ++less;
> 0x0251c208: inc %edi ; OopMap{ecx=Oop off=73}
> ;*goto: for k++
> ; - Test2::sort2 at 25 (line 22)
> 0x0251c209: test %eax,0x1a0100 ;*goto
> ; - Test2::sort2 at 25 (line 22)
> ; {poll}
> ;; block B1 [4, 6]
>
> 0x0251c20f: cmp %edx,%edi
> ;; 22 branch [LT] [B2]
> 0x0251c211: jl 0x0251c1dc ;*if_icmpge: for k < right
> ; - Test2::sort2 at 6 (line 22)
>
> The code looks OK; C1 doesn't do much optimization - no unrolling, bounds check elimination - but it's otherwise just about the code you would expect for a simple JITting.
> Now, C1 code for sort1()'s loop:
>
> 0x024bc21c: cmp 0x8(%ecx),%esi ; implicit exception: dispatches to 0x024bc262
> ;; 30 branch [AE] [RangeCheckStub: 0x44ee3b0] [bci:13]
> 0x024bc21f: jae 0x024bc28e
> 0x024bc225: mov 0xc(%ecx,%esi,4),%ebx ;*iaload: %ebx = a[less];
> ; - Test1::sort1 at 13 (line 23)
> 0x024bc229: cmp 0x8(%ecx),%edi
> ;; 36 branch [AE] [RangeCheckStub: 0x44ee550] [bci:14]
> 0x024bc22c: jae 0x024bc2a7
> 0x024bc232: mov %ebx,0xc(%ecx,%edi,4) ;*iastore: a[k] = %ebx;
> ; - Test1::sort1 at 14 (line 23)
>
> (sort1/sort2 start to differ here)
>
> 0x024bc236: mov %esi,%ebx ; Crap! C1 temps 'less' into %ebx
> 0x024bc238: inc %ebx ; ++less; (for the temp "less from future")
>
> 0x024bc239: cmp 0x8(%ecx),%esi ; %esi is still the "old less"....
> ;; 46 branch [AE] [RangeCheckStub: 0x44ee7b8] [bci:21]
> 0x024bc23c: jae 0x024bc2c0
> 0x024bc242: movl $0x0,0xc(%ecx,%esi,4) ;*iastore: a[less++] = 0;
> ; - Test1::sort1 at 21 (line 24)
> 0x024bc24a: inc %edi ; OopMap{ecx=Oop off=75}
> ;*goto: for k++
> ; - Test1::sort1 at 25 (line 22)
> 0x024bc24b: test %eax,0x1a0100 ; {poll}
> 0x024bc251: mov %ebx,%esi ;*goto
> ; - Test1::sort1 at 25 (line 22): for...
> ;; block B1 [4, 6]
>
> 0x024bc253: cmp %edx,%edi
> ;; 22 branch [LT] [B2]
> 0x024bc255: jl 0x024bc21c ;*if_icmpge
> ; - Test1::sort1 at 6 (line 22): for...
>
More information about the hotspot-compiler-dev
mailing list