Question on sorting

Wed Aug 4 13:43:23 UTC 2010

Hello Dmytro,

Thank you for investigation, the results are interesting.

I prepared simpler test [1], which has no recursion, it just
sorts 5 candidates and compare all elements with first (or second)
candidate. I run on array of 2000000 elements and result is:

e1/e2 267/460 on random data
e1/e2 193/200 on sorted array

and interesting fact: if I change comparison from "a[i] < pivot"
to "a[i] == pivot", it shows 189 for e1 and e2, for random and
sorted data.

I tried new schema [1/2, 1/4, 1/4], but it shows almost the same
time as [1/3, 1/3, 1/3], no improvements.

Thank you,
Vladimir

[1] ---------------------------------------------------------------
public static void sort(int[] a) {
     int step = a.length >>> 3;
     int e1 =      step;
     int e2 = e1 + step;
     int e3 = e2 + step;
     int e4 = e3 + step;
     int e5 = e4 + step;

     if (a[e1] > a[e2]) { int t = a[e1]; a[e1] = a[e2]; a[e2] = t; }
     if (a[e4] > a[e5]) { int t = a[e4]; a[e4] = a[e5]; a[e5] = t; }
     if (a[e1] > a[e3]) { int t = a[e1]; a[e1] = a[e3]; a[e3] = t; }
     if (a[e2] > a[e3]) { int t = a[e2]; a[e2] = a[e3]; a[e3] = t; }
     if (a[e1] > a[e4]) { int t = a[e1]; a[e1] = a[e4]; a[e4] = t; }
     if (a[e3] > a[e4]) { int t = a[e3]; a[e3] = a[e4]; a[e4] = t; }
     if (a[e2] > a[e5]) { int t = a[e2]; a[e2] = a[e5]; a[e5] = t; }
     if (a[e2] > a[e3]) { int t = a[e2]; a[e2] = a[e3]; a[e3] = t; }
     if (a[e4] > a[e5]) { int t = a[e4]; a[e4] = a[e5]; a[e5] = t; }

     int pivot = a[e1]; // or a[e2]

     for (int i = 0; i < a.length; i++) {
         if (a[i] < pivot);
     }
}
[/1] ---------------------------------------------------------------

Dmytro Sheyko wrote:
> Hi Vladimir,
> 
> There could be many reasons for this.
> The verisimilar ones are imprecise time measurement with highly 
> dispersed results and biased samples.
> 
> Another reason is that an attempt to divide whole array into equal-size 
> partition not always gives us the best result. And hence choosing
> "wrong" pivots could make partitions balanced slightly better.
> 
> Let me clarify this counter-intuitive statement regarding not-equal 
> partitioning.
> 
> Consider following quite straightforward dual pivot quicksort.
> 
> sort(a[]) {
>     pivot1, pivot2 = choosePivots(a);
> 
>     // partitioning
>     forall (a[k] in a) {
>         if (a[k] < pivot1) {
>             a[k] goes to the left partition
>         } else if (a[k] > pivot2) {
>             a[k] goes to the right partition
>         } else {
>             a[k] goes to the middle partition
>         }
>     }
> 
>     sort(left partition);
>     sort(middle partition);
>     sort(right partition);
> }
> 
> Here you can see that during partitioning every item is compared with 
> one or two pivots. In our case item is compared with
> the second pivot only if it greater than the first one. So what is the 
> average number of comparison during partitioning?
> If we succeed to choose pivots so that they divide whole array into 3 
> equal partitions we have 1*(1/3) + 2*(1/3) + 2*(1/3) = 5/3 per item.
> Is this ideal? What if pivots divide array in following proportions 1/2 
> 1/4 1/4? Then we have 1*(1/2) + 2*(1/4) + 2*(1/4) = 3/2.
> 3/2 is less than 5/3.
> 
> Let's find now ideal proportions of partitions taking into account 
> number of comparison of sorting in whole.
> 
> Assume that number of comparison is A*n*ln(n) + B*n + o(n) and we divide 
> whole array in following proportions
> [alpha, (1 - alpha)/2, (1 - alpha)/2], where 0 < alpha < 1.
> 
> A*n*ln(n) + B*n =
>  = n * (alpha + 2*(1 - alpha)) { partitioning }
>   + A * alpha*n * ln(alpha*n) + B * alpha*n { sorting left partition }
>   + 2 * A * (1-alpha)*n/2 * ln((1-alpha)*n/2) + 2 * B * (1-alpha)*n/2 { 
> sorting middle and right partitions }
> 
> A*n*ln(n) + B*n =
>  = A*alpha*n*ln(n) + A*(1-alpha)*n*ln(n) +
>  + B*alpha*n + B*(1-alpha)*n +
>   + n * (alpha + 2*(1-alpha))
>   + A*alpha*n*ln(alpha) + A*(1-alpha)*n*ln((1-alpha)/2)
> 
> 0 = (alpha + 2*(1-alpha)) + A*alpha*ln(alpha) + A*(1-alpha)*ln((1-alpha)/2)
> 
> A = (alpha - 2) / (alpha*ln(alpha) + (1-alpha)*ln((1-alpha)/2))
> 
> alpha    A
>  1/12    2.078316236
>  2/12    1.783079278
>  3/12    1.617083005
>  4/12    1.517065378
>  5/12    1.461274369
>  6/12    1.442695041    !!!
>  7/12    1.463491681
>  8/12    1.536871653
>  9/12    1.699242413
> 10/12    2.060936332
> 11/12    3.143757518
> 
> It appears that the best alpha is about 1/2. Thus ideal partition is 
> something like [1/2, 1/4, 1/4].
> 
> Of course, these consideration does not apply to the real DPQ 
> completely. This is because in real DPQ every item is not compared with 
> pivots in well defined order and
> real DPQ contains numerous special cases, which make it harder to analyze.
> 
> Regards,
> Dmytro Sheyko
> 
>  > From: iaroslavski at mail.ru
>  > To: core-libs-dev at openjdk.java.net
>  > Subject: Question on sorting
>  > Date: Fri, 30 Jul 2010 22:55:00 +0400
>  > CC: iaroslavski at mail.ru
>  >
>  > Hello,
>  >
>  > I have performance question in sorting.
>  >
>  > In an implementation of Dual-Pivot Quicksort 5 elements
>  >
>  > a[e1], a[e2], a[e3], a[e4], a[e5]
>  >
>  > where e1 = len/6, e2 = len/3, e3 = len/2, e4 = len*2/3, e5 = len*5/6,
>  > are sorted and then elements a[e2], a[e4] are taken as pivots.
>  >
>  > but if I take a[e1] and a[e3] as pivots, it works 2-3% faster on
>  > *random* inputs.
>  >
>  > I tried different sorting for these 5 elements: network, bubble,
>  > insertion - with a[e1] and a[e3] pivots it is faster than with
>  > a[e2] and a[e4].
>  >
>  > If I sort these 5 elements in descending order, it works faster
>  > with a[e5] and a[e3] pivots.
>  >
>  > Do you have any idea why it happens?
>  >
>  > Thank you,
>  > Vladimir