Observations on sorting values
Doug Lea
dl at cs.oswego.edu
Mon Apr 15 13:57:15 UTC 2019
I've been working sporadically on sorting algorithms expected to work
well with value types. This post is mainly about sequential sorting
issues. The measurements here are not even in Java: I've been using
plain C with various odd macros to compare cases for which Java
support does not yet exist, so the results are more informative than I
could otherwise arrange. The current best candidate is good at
conserving and localizing Comparisons, Loads, and Stores (although at
the price of being harder to optimize).
The expected time to sort random permutations is approximately:
T(N) = 1.004 * N * lgN * (C + 0.56L + 0.44S + P), where L is the cost
to load an element (in most cases, at least the fields needed for the
compare), C is the cost of the comparator and its (usually poorly
predictable) data-dependent branches, S is the cost to store (all
fields of) an element, and P covers other sub-NlgN overheads.
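As a sanity check of the model against the table below (the function
and the chosen numbers here are my illustration, not part of the
measured harness): the ns/NlgN column for V8I converges to about 3.17
for large N, i.e. 1.004 * (C + 0.56L + 0.44S + P) is about 3.17ns,
which reproduces the bottom-row time:

#include <math.h>
#include <stdio.h>

/* Evaluate T(N) = 1.004 * N * lgN * (C + 0.56L + 0.44S + P), with the
   whole per-element bracket passed in as one value in nanoseconds. */
static double model_seconds(double n, double bracket_ns) {
    return 1.004 * n * log2(n) * bracket_ns * 1e-9;
}

int main(void) {
    double n = 1073741824.0;  /* 2^30, the table's largest V8I entry */
    /* ns/NlgN of 3.17 implies a bracket of 3.17/1.004 ns per element */
    printf("%.1f sec\n", model_seconds(n, 3.17 / 1.004));
    /* prints about 102.1 sec, matching the measured 102.16857 */
    return 0;
}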
The C, L, and S factors are not necessarily constants, or even simple
functions, across different modes of use. Here's data from some
experiments in which every sort does basically the same thing: sort
uniformly randomly permuted ids held in a 64-bit (8-byte) "key" field
of a C struct. (Random permutations are among the slowest cases for
sorting, mainly because of terrible branch prediction; also, more
theoretical results are known about this case than about others.) The
modes vary across:
1. Directly sorting ("V") values (arrays of structs) vs ("P") pointers
(arrays of pointers to structs)
2. Size: number of bytes in the struct (minimally 8 for the key; the
others unused)
3. For Pointers, whether they are ("L") linearized, initially assigned
to adjacent structs (maintaining an initial correlation between pointer
and pointee), or are ("R") randomly permuted before sorting
4. Whether the comparator is ("I") inlined or ("C") called.
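Concretely (in a sketch of my own; the actual harness generates these
variants with macros), the modes correspond to layouts and compare
styles along these lines, shown for the 16-byte struct size:

#include <stdint.h>
#include <stddef.h>

typedef struct { uint64_t key; uint64_t pad; } Elem16;  /* size "16" */

enum { COUNT = 1024 };
static Elem16  vals[COUNT];  /* "V": sort the structs themselves */
static Elem16 *ptrs[COUNT];  /* "P": sort pointers to structs; "L"
                                starts with ptrs[i] = &vals[i]
                                (linearized), "R" permutes ptrs
                                randomly before sorting */

/* "C": comparator behind a call (qsort-style); for "P" modes the key
   load is a second, dependent, poorly-localized read. */
static int cmp_ptr(const void *a, const void *b) {
    uint64_t x = (*(Elem16 *const *)a)->key;
    uint64_t y = (*(Elem16 *const *)b)->key;
    return (x > y) - (x < y);
}

/* "I": the key compare expands in-line in the sort loop, as in this
   insertion step over the by-value array; the comments mark where the
   L, C, and S costs of the formula above land. */
static void insert_step(Elem16 *a, size_t j) {
    Elem16 t = a[j];                       /* L: load element */
    while (j > 0 && a[j-1].key > t.key) {  /* C: compare keys */
        a[j] = a[j-1];                     /* S: store whole struct */
        j--;
    }
    a[j] = t;
}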
The table reports sorts of arrays of sizes from 128 to a billion
elements, with associated times from about 10 microseconds to many
minutes. This was run on an Intel E5-2699 with 128GB RAM, Ubuntu
16.04, gcc 5.4 -O3. Some cases of very large arrays of very large
structs could not be run for lack of memory. Tests on other Intel and
AMD machines and with other versions of Linux and gcc give generally
similar results, although with some still-undiagnosed variation. The
first two columns show V8I (one 8-byte field, by value, inlined
comparator), basically equivalent to sorting Java "long"s, as a time
(averaged over 33 minus lgN samples per cell) and as the ratio of the
time in nanoseconds to NlgN. Ideally, this ratio should converge to a
constant, which it seems to do; entries up to about 64K show the
decreasing impact of the sub-NlgN overhead. The other columns report
the slowdown (ratio) of each other case relative to V8I. The table is
also at:
https://docs.google.com/spreadsheets/d/1iGleQPQt_SCjH6OOyJrZfq4p6Ywi13pNyUCo1ufV0xw/edit?usp=sharing
         N  V8I(sec) ns/NlgN  V16I  V32I  V64I V128I V256I   V8C  V16C  V32C  V64C V128C V256C  P8RI  P8LI  P8RC  P8LC
       128   0.00001   12.06  1.18  1.36  1.63  2.62  6.18  1.52  1.61  2.35  3.34  5.34 12.47  1.13  1.12  1.68  1.62
       256   0.00002   11.25  1.10  1.34  1.64  2.61  6.23  1.49  1.61  2.38  3.42  5.65 12.95  1.12  1.09  1.61  1.58
       512   0.00005   10.97  1.10  1.38  1.66  2.67  6.00  1.42  1.64  2.42  3.45  5.41 12.37  1.14  1.10  1.48  1.62
      1024   0.00010    9.69  1.20  1.39  1.82  2.73  6.28  1.54  1.74  2.59  3.76  5.80 12.40  1.23  1.20  1.60  1.76
      2048   0.00020    8.80  1.22  1.40  1.83  2.86  6.04  1.50  1.84  2.67  3.79  5.74 11.34  1.24  1.27  1.56  1.81
      4096   0.00039    7.87  1.29  1.34  1.86  2.82  5.56  1.44  1.84  2.64  3.61  5.40 10.53  1.29  1.31  1.50  1.89
      8192   0.00070    6.59  1.37  1.31  1.92  2.79  5.68  1.39  1.93  2.67  3.55  5.32 12.38  1.36  1.37  1.47  1.98
     16384   0.00125    5.47  1.41  1.41  1.87  2.81  6.78  1.54  1.89  2.64  3.69  6.25 14.67  1.36  1.40  1.76  2.05
     32768   0.00267    5.42  1.18  1.41  1.68  2.77  6.28  1.55  1.73  2.59  3.69  6.28  9.17  1.25  1.21  1.87  1.82
     65536   0.00565    5.39  1.14  1.37  1.68  2.77  4.16  1.33  1.73  2.60  3.49  4.09  9.19  1.34  1.24  1.26  1.88
    131072   0.00727    3.26  1.87  1.43  2.43  3.01  6.86  1.56  2.43  2.84  3.75  6.34 15.20  2.08  2.09  2.16  2.96
    262144   0.01530    3.24  1.35  1.43  1.73  2.98  7.15  1.55  1.74  2.64  3.75  6.66 15.44  1.45  1.36  2.23  2.03
    524288   0.03217    3.23  1.22  1.43  1.69  2.91  7.38  1.55  1.74  2.64  3.76  6.54 15.55  1.47  1.33  2.27  2.08
   1048576   0.06744    3.22  1.19  1.43  1.72  2.94  7.50  1.56  1.74  2.64  3.77  6.41 15.80  1.56  1.38  2.37  2.13
   2097152   0.14111    3.20  1.15  1.45  1.76  2.99  7.61  1.56  1.75  2.66  3.81  6.45 15.78  1.74  1.46  2.51  2.21
   4194304   0.29483    3.20  1.14  1.46  1.77  3.01  7.68  1.56  1.75  2.66  3.81  6.45 16.41  1.96  1.57  2.75  2.33
   8388608   0.61583    3.19  1.15  1.47  1.78  3.04  7.75  1.56  1.76  2.67  3.82  6.47 15.89  2.48  1.84  3.51  2.66
  16777216   1.28588    3.19  1.15  1.49  1.79  3.06  7.78  1.56  1.76  2.68  3.83  6.46 15.89  3.03  2.15  4.34  3.12
  33554432   2.68097    3.20  1.15  1.47  1.79  3.07  7.80  1.56  1.76  2.67  3.83  6.44 15.88  3.39  2.40  4.84  3.45
  67108864   5.55497    3.18  1.15  1.48  1.81  3.10  7.88  1.56  1.76  2.69  3.84  6.54 15.96  3.67  2.63  5.26  3.80
 134217728  11.54670    3.19  1.15  1.48  1.81  3.11  7.90  1.55  1.76  2.68  3.85  6.45 15.92  3.82  2.79  5.48  4.04
 268435456  23.89779    3.18  1.15  1.48  1.81  3.13  7.96  1.56  1.76  2.70  3.84  6.48 16.05  3.97  2.98  5.73  4.33
 536870912  49.40996    3.17  1.16  1.49  1.82  3.15 (OOM)  1.56  1.77  2.70  3.85  6.49 (OOM)  4.18  3.20  6.01  4.63
1073741824 102.16857    3.17  1.16  1.49  1.83 (OOM) (OOM)  1.57  1.77  2.70  3.85 (OOM) (OOM)  4.70  3.57  6.66  5.08
Some observations, none of which are too surprising:
1. For small structs/values, the results show the usual pointer vs
value effects: Pointers require twice the number of reads (the pointer
and then the pointee key) per access, and the second of these reads
has little predictability/locality, which leads to cache/memory
traffic overhead that increases with array size rather than staying
constant as with Values. For example, the bottom entry for P8LI is
analogous to Java sorting of a billion Long objects becoming
increasingly slower than sorting longs as N increases (here by a
factor of 3.57). Results for pointers to larger structs are not shown,
because they don't vary much from the "P8R" cases.
2. For Values, there is little enough writing overall in this
algorithm (as opposed to some others) that adding fields that must be
moved per swap adds only sub-linear cost until structs reach hundreds
of bytes, where performance starts falling off a cliff, and it usually
becomes preferable to take the indirection cost on reads for the sake
of only one write per array move; compare the swap costs sketched
below. (Among the open algorithmic issues is that it is sometimes
faster to allow more writes.)
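To make the write-cost tradeoff in point 2 concrete, here is a sketch
(mine, with an assumed 256-byte element) of what a single exchange
costs in each representation:

#include <stdint.h>

typedef struct { uint64_t key; uint64_t pad[31]; } Elem256; /* 256B */

/* By-value exchange: three 256-byte struct copies. */
static inline void swap_val(Elem256 *a, Elem256 *b) {
    Elem256 t = *a; *a = *b; *b = t;
}

/* By-pointer exchange: three 8-byte copies regardless of struct size,
   at the price of an extra indirection on every subsequent read. */
static inline void swap_ptr(Elem256 **a, Elem256 **b) {
    Elem256 *t = *a; *a = *b; *b = t;
}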
3. In part because there is no common API to send a comparator only
the key fields, or a by-ref handle, inlining comparators has impact
beyond the savings in call/return mechanics. It can represent the
difference between the use of Values being Really Good vs Really Bad
for performance, since it allows the comparison of a key field to
proceed without reading and passing in unrelated fields. For Pointers,
the main impact is to clue in the compiler to prefetch across an
indirection. It would be nice to somehow present a better set of
options to the programmer. (Short of that, in some cases, it might be
a good idea to create and sort another Value/struct with only a key
and a pointer to the full value/struct, hitting the V16 case here, at
the price of O(n) extra memory; this is sketched below.)
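That key-plus-pointer idea sketched concretely (the names and the
single 64-bit key are my assumptions, matching these tests): one
sequential pass builds the compact array, which can then be sorted
with the inlined key compare.

#include <stdint.h>
#include <stdlib.h>

typedef struct { uint64_t key; /* ... other, non-key fields ... */ } Elem;
typedef struct { uint64_t key; Elem *ref; } KeyRef;  /* the V16 shape */

/* O(n) extra memory; returns NULL on allocation failure. */
static KeyRef *make_keyrefs(Elem *a, size_t n) {
    KeyRef *kr = malloc(n * sizeof *kr);
    if (kr != NULL)
        for (size_t i = 0; i < n; i++) {
            kr[i].key = a[i].key;  /* key copy keeps compares local */
            kr[i].ref = &a[i];     /* handle back to the full element */
        }
    return kr;  /* sort by .key; follow .ref afterward as needed */
}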
4. People who never sort more than, say, a thousand elements, each of
which has only a few fields, can take comfort that most of these
issues don't matter a lot.
If you are interested in further experimenting with underdocumented
in-progress code, see http://gee.cs.oswego.edu/dl/wwwtmp/ctpq.tar
Someday when I have good data, I'll post something about how similar
issues surrounding keys and hashcodes apply to other Collections etc.
Also stay tuned for more on parallelism.
-Doug
PS: People betting on whether I'd ever send mail with an embedded
HTML table can collect now. The table is too hard to read as plain
text.