Extending Java Arrays/Collection Sort API

Thu Nov 29 01:12:50 UTC 2018

As you noticed, I jumped into the discussion mid-stream.

I've been itching to get index-based sorting into JDK ever since
I tried to code Burrows-Wheeler on the JDK and saw a huge footprint
overhead due to the need for a proxy object for each byte position
in the file to be compressed.

Here are some more direct responses, in the spirit of Brian's suggestion
that we are in an exploration stage.

> On Nov 14, 2018, at 12:01 AM, Laurent Bourgès <bourges.laurent at gmail.com> wrote:
> 
> Hi,
> 
> I am a scientist and the author of the Marlin renderer, integrated in
> OpenJDK.
> 
> In this renderer, I needed a fast sort algorithm on small almost sorted
> arrays, but more important, to deal with 2 arrays: x values and their
> corresponding edge indices (pointer like). I derived my MergeSort class to
> perform 2 swaps (x & edge ptr) as Arrays.sort() or TimSort() did not
> provide such API.

There are two distinct requirements here:  1. Selecting or tweaking a sorting
algorithms for particular expected patterns in the data set (almost-sorted).
2. Sorting two arrays in tandem, with one array supplying the sort key,
and the other serving as a passively permuted payload ("PPP").

Re 1: Another pattern sometimes expected is range-bounded values,
which may prompt consideration of bin-sort.  Perhaps unique values.
Permutation vectors have both properties.

Re 2: In some cases, it's not two arrays that need the work.  Sometimes
it is one array of keys, plus an index set; this can be reduced to two arrays
one of which is an int-array carrying indexes, but in some proposals you see
such an int-array appearing only as an output, with an implicit input of
the iota-vector for the other array's indexes.

More deeply, sometimes the index array is concrete, but the array of keys
is virtualized.  Canonical example (for me) is a set of indexes into a data
set such as a file.  (These indexes are dense for compression algorithms
like B-W, or could be sparse for lexical analysis of bulk text in situ.)  To
handle this properly, a good old index array is just fine, as long as the
sort algorithm *also* accepts an int-comparator.

There is no strong need to generalize to three arrays in tandem, or to
have separate algorithms for treating additional arrays as containing
secondary keys instead of PPP.  This is because given a two-array tandem
sort with PPP, plus an int-array sort with int-comparator, you can compose
many other shapes at a modest footprint cost by adding two index vector
to control the various sorts.

Simplified sketch of generalized tandem array sort:

1. Let L be the common length of the N arrays, A[N][L].
2. Create a new int index array P initialized to iota(L).
3. Create an int-comparator C over 0<=i,j<L which compares A[*][i] to A[*][j].
4. Sort P using C.  Now P is a permutation vector to be applied to each A[*].
5. Create a second index array Q such that Q[P[i]] = i for all i.  Q is P's inverse.
6. Tandem-sort each Q[*] (as a PPP) using a fresh copy of Q for each array.

There is more than one way to do this; the above sketch is intended to be
straightforward.  The temp. space required is two length-L int arrays.

My point is that the "missing primitives" are something like a. reordering
a PPP using a different array (either indexes or keys) for steering, and
b. reordering an index vector (or other array of primitives) using a user
supplied comparator.

An iota generator would be nice too, as well as a permutation inverter.
The tandem sort could be restricted to a "permute this array" operation,
giving an index vector, although I don't think this simplifies the implementation,
and it removes some use cases.  But those are minor points.

> 1. I started discussing about extending the Java Sort API with Vladimir
> Yaroslavskiy, the author of the famous Java DualPivotQuickSort, as he is
> currently improving it.
> 
> His coming DPQS 2018 class provides lot of sort algorithms, I propose to
> make them public API through a facade method in Arrays class:
> insertionSort(), heapSort(), mergeSort(), quickSort()... Of course, his
> current code is dedicated to DPQS, but his algorithm implementation could
> be generalized to propose the Best implementations as a public Java API
> (stable & unstable sorts)
> 
> For example in Marlin, I need trivial insertionSort() in the Helpers class,
> my MergeSort could be replaced by another best implementation.
> Let's not reinvent the wheel…

Indeed, let's not.  But remember that the JDK does not claim to be (and
cannot be) a collection of all the "wheels" we might need.

Here's an extra thought or two on tweakable algorithms:  One algorithm doesn't
fit all data sets although it's amazing how far we've gone in optimizing the JDK's
only sort algorithm.  The JDK can't support many new algorithms to go with sort
(and, ahem, parallelSort), as insertionSort, mergeSort, etc., etc.  Doing this would
amount to having a bunch of partially usable sort routines, more like "tryThiseSort"
"orTryThisIfItDidNotWork", or "dontUseThisSort".  Actually documenting and testing
a multi-kinded API for {insertion,merge,parallel,plain}Sort would be a nightmare,
IMO, since the properties of sort algorithms are hard to explain and verify.

It would be better if we could define an API for sorting algorithms, that a provider
could set up.  Then the various API points in Arrays.*sort(*) could be documented
as sugar for specific standard providers, via the API.  The provider API could
include not only first-class entry points for sorting various kinds of arrays or
tandem array pairs, but also provide queries for a user who wanted to be smart
about which provider to choose.

> 2. New Sort API to deal with indices...
> 
> For now, there is no mean to obtain the indices sorted on another value,
> see my introduction: edge indices sorted by their x positions.
> In data science, this is very useful and general.
> 
> What about an Arrays API extension:
> - Arrays.sort(int[] a, int[] b, low, high) or
> Indices int[] Arrays.sortIndices(int[] a, low high), a[] left untouched

Yes, good.  I don't think sortIndices buys much, since the internal
algorithm will almost certainly create an iota vector and sort it instead
of a.  I'd prefer an iota function such that this composition works
like sortIndices:

   int[] indices = indexRange(low, high);
   sort(indices, (i, j) -> Integer.compare(a[i], a[j]));
   return indices;

> - same with preallocated buffers & runs arrays

I don't follow:  Is this a request for tandem sorting?  Off-heap sorting?
I think the above primitives are likely to cover.

(Exception:  Sorting more that 2 billion things is desirable, but will
bust out of Java arrays.  That's why I'm not proposing sorting index
arrays of type long, at present.  We need an array-like container
of more than 2 billion entries before we graduate to long indexes.)

> - alternatively, it could be expressed as Comparator<Primitive>
> Arrays.sort(int[] a, low, high, Comparator<int, int> cmp) that sorts array
> a using the given comparator. For example, a[] could be edge indices sorted
> according to their edge x position...
> This comparator is specific as it compares anything (external storage) at
> given indices i,j.

Exactly.  Except note the limit of 2 billion.

> Finally Marlin renderer would benefit from Value types (array of structs),
> as edge data are currently packed in int[] arrays, but this new Sort API
> seems to me general & needed enough to be discussed here.

Yes, value types will give better memory locality; today you have to use
tandem sorts.  Also, value types will more naturally work with comparators,
where we have to do something different with int-arrays (IntComparator).

Valhalla is aiming at generic specialization also.  That, I believe, is the
best future for sorting algorithms (and other algorithms too).  A specializable
template class could contain *one copy* of a full sort algorithm, which would
then be specialized to arrays of a reference, primitive, or value type.  The
suggestion of a "sort provider API" would mesh nicely with this:  The template
class would implement that API.  The index type (currently hardwired as "int")
could be a type parameter, instantly releasing us from the 2-billion limit.
Tweakable algorithm parameters (such as "likely run size" or "maximum value")
could be hardwired into the template (as if by a C #ifdef) simply by defining
them as template parameters (a la C++ non-type template parameters).

A user-requested specialization of a template algorithm class would be a strong
signal to the JVM that some "customized" code needs to be generated.  Currently
we have no such signals, leading to performance losses when heroic inlining
fails to fully connect the user code to the sort algorithm code.  I call this the
"loop customization problem".

— John

P.S. Loop customization problem:
http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2012-September/008381.html
http://mail.openjdk.java.net/pipermail/valhalla-dev/2016-June/001953.html
Each "loop syndrome" for a sort or BLAS algorithm would be a distinct
template class instance.

P.P.S. Such a template algorithm class would work naturally with Arrays 2.0,
which will (someday, in my dreams!) define template interfaces for extending
Java arrays in new directions, notably with index types beyond int.

Independently of sort algorithms, I expect an Arrays 2.0 array could be truly
2-D if its index space were pairs of ints (a value type).  It might internally store
its data in some blocked cache-friendly form (a square sub-array in each cache
line, maybe).  I wonder if there is some useful generalization of linear-ordered
sorting which applies to 2-D arrays?  Clearly there are many interesting algorithms
on multi-D arrays which could be instantiated and tweaked using template algorithm
classes.