Question on sorting

Tue Aug 3 15:25:49 UTC 2010

Hi Vladimir,

There could be many reasons for this.
The verisimilar ones are imprecise time measurement with highly dispersed results and biased samples.

Another reason is that an attempt to divide whole array into equal-size partition not always gives us the best result.
And hence choosing "wrong" pivots could make partitions balanced slightly better.

Let me clarify this counter-intuitive statement regarding not-equal partitioning.

Consider following quite straightforward dual pivot quicksort.

sort(a[]) {
    pivot1, pivot2 = choosePivots(a);

    // partitioning
    forall (a[k] in a) {
        if (a[k] < pivot1) {
            a[k] goes to the left partition
        } else if (a[k] > pivot2) {
            a[k] goes to the right partition
        } else {
            a[k] goes to the middle partition
        }
    }

    sort(left partition);
    sort(middle partition);
    sort(right partition);
}

Here you can see that during partitioning every item is compared with one or two pivots. In our case item is compared with
the second pivot only if it greater than the first one. So what is the average number of comparison during partitioning?
If we succeed to choose pivots so that they divide whole array into 3 equal partitions we have 1*(1/3) + 2*(1/3) + 2*(1/3) = 5/3 per item.
Is this ideal? What if pivots divide array in following proportions 1/2 1/4 1/4? Then we have 1*(1/2) + 2*(1/4) + 2*(1/4) = 3/2.
3/2 is less than 5/3.

Let's find now ideal proportions of partitions taking into account number of comparison of sorting in whole.

Assume that number of comparison is A*n*ln(n) + B*n + o(n) and we divide whole array in following proportions
[alpha, (1 - alpha)/2, (1 - alpha)/2], where 0 < alpha < 1.

A*n*ln(n) + B*n =
 = n * (alpha + 2*(1 - alpha)) { partitioning }
  + A * alpha*n * ln(alpha*n) + B * alpha*n { sorting left partition }
  + 2 * A * (1-alpha)*n/2 * ln((1-alpha)*n/2) + 2 * B * (1-alpha)*n/2 { sorting middle and right partitions }

A*n*ln(n) + B*n =
 = A*alpha*n*ln(n) + A*(1-alpha)*n*ln(n) + 
 + B*alpha*n + B*(1-alpha)*n +
  + n * (alpha + 2*(1-alpha))
  + A*alpha*n*ln(alpha) + A*(1-alpha)*n*ln((1-alpha)/2)

0 = (alpha + 2*(1-alpha)) + A*alpha*ln(alpha) + A*(1-alpha)*ln((1-alpha)/2)

A = (alpha - 2) / (alpha*ln(alpha) + (1-alpha)*ln((1-alpha)/2))

alpha    A
1/12    2.078316236
2/12    1.783079278
3/12    1.617083005
4/12    1.517065378
5/12    1.461274369
6/12    1.442695041    !!!
7/12    1.463491681
8/12    1.536871653
9/12    1.699242413
10/12    2.060936332
11/12    3.143757518

It appears that the best alpha is about 1/2. Thus ideal partition is something like [1/2, 1/4, 1/4].

Of course, these consideration does not apply to the real DPQ completely. This is because in real DPQ every item is not compared with pivots in well defined order and
real DPQ contains numerous special cases, which make it harder to analyze.

Regards,
Dmytro Sheyko

> From: iaroslavski at mail.ru
> To: core-libs-dev at openjdk.java.net
> Subject: Question on sorting
> Date: Fri, 30 Jul 2010 22:55:00 +0400
> CC: iaroslavski at mail.ru
> 
> Hello,
> 
> I have performance question in sorting.
> 
> In an implementation of Dual-Pivot Quicksort 5 elements
> 
> a[e1], a[e2], a[e3], a[e4], a[e5]
> 
> where e1 = len/6, e2 = len/3, e3 = len/2, e4 = len*2/3, e5 = len*5/6,
> are sorted and then elements a[e2], a[e4] are taken as pivots.
> 
> but if I take a[e1] and a[e3] as pivots, it works 2-3% faster on
> *random* inputs.
> 
> I tried different sorting for these 5 elements: network, bubble,
> insertion - with a[e1] and a[e3] pivots it is faster than with
> a[e2] and a[e4].
> 
> If I sort these 5 elements in descending order, it works faster
> with a[e5] and a[e3] pivots.
> 
> Do you have any idea why it happens?
> 
> Thank you,
> Vladimir

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.openjdk.java.net/pipermail/core-libs-dev/attachments/20100803/64d69766/attachment.html>