Proposal: Large arrays (take two)

james lowden jl0235 at yahoo.com
Tue Mar 24 12:02:22 PDT 2009


Proposal: Large arrays

AUTHOR(S):

J. Lowden

VERSION:

1.1    Revised to remove distinction between classic/new arrays


OVERVIEW

FEATURE SUMMARY:

Java arrays are accessed via 32-bit ints, resulting in a maximum theoretical array size of 2147483647 elements.  While this is enough for most uses, some applications that need to handle large sequential data sets in memory would benefit from the ability to work with arrays indexed using 64-bit indices.  I propose an extension to current array syntax to allow longs to be used as array indices and sizes.  For example,

int [] foo = new int [3000000000L]; 
foo [2999999999L] = 500;

does not currently compile, but would under this proposal result in an array of length 3000000000.

Currently, the length of the array is accessed via a length field, which is an int.  Instead of changing this to a long, which would potentially break a lot of extant code, I propose deprecating the length field and adding a size() method to arrays, which returns a long indicating the actual size.  Checking the original length field would behave as expected for arrays of less than 2^31 elements, and would return Integer.MAX_VALUE for all larger arrays.


MAJOR ADVANTAGE:

Java gains the ability to easily work with large sequential data sets.


MAJOR BENEFIT:

Declaring extremely large arrays simply isn't possible right now; in cases where such a structure is desirable, something has to be hacked together.  For applications dealing with large data sets, the current upper limit on array size is constricting, and will only grow more so as 64-bit systems continue to become more common and RAM less expensive.


MAJOR DISADVANTAGE:

This can't be implemented well solely via a compiler patch; existing VMs would likely need to be changed to support this concept natively; new VM instructions would likely be required.  Modifications to array-oriented library code such as java.util.Arrays and the collections API would also be highly desirable in order to make this change useful.


ALTERNATIVES:

Build your own class for storing long-indexes sequences, possibly as an array-or-arrays.


EXAMPLES

SIMPLE EXAMPLE:

// calculate the first 3 billion fibonacci numbers and store them in an array
// this doesn't *actually* work, as the 3 billionth fibonacci number is way
// way too big to store in any java primitive type


long [] fib = new long [3000000000L];
fib [0] = 1;
fib [1] = 1;
long length = fib.size ();
for (long i = 2; i < length; i++)
  fib [i] = fib [i - 1] + fib [i - 2];


DETAILS

SPECIFICATION:

Syntax-wise, the following expressions become legal:

foo = new SomeType [some_long_value];   // instantiates an array of length
					// some_long_value, which is either a long
					// or some type that can upconvert or unbox
					// to a long

long some_long = foo.size ();   // arrays will have a "size" method, which returns
				// a long indicating the number of elements in the array

At the VM level, we could use the "wide" instruction in combination with the current group of array instructions to provide instructions for access to and manipulation of long-indexed arrays.  The existing int-indexed array instructions would continue to behave as expected, operating on the first 2^31 - 1 elements of arrays, with the exception of "arraylength" which would, in the case of an array with more than 2^31 - 1 elements, return Integer.MAX_VALUE.


COMPILATION:

Compilation of array operations would be changed to call the new "wide" + array_instruction VM instructions for all array access.


TESTING:

Any test sufficient to test support for current arrays should work for this as well.  The previous sentence may be hopelessly naive.


LIBRARY SUPPORT:

Existing array APIs would have to be modified to support this change.  In particular, the methods of java.util.Arrays would have to be altered to work with long-indexed arrays.  I propose using the following pattern for such changes (both in java.util.Arrays and elsewhere where applicable):

1) If the method neither takes in nor returns an int representing an array length or array index, it retains its current public signature and the implementation is changed to work with large arrays.

2) If the method takes in at least one int parameter representing a array length or array index but does not return such an int, a parallel version will need to be created that takes in long parameters instead of ints.  The implementation of the original version may also need to be altered to work with large arrays.

3) If the method returns an int representing an array length or index, it should be modified to work as expected with arrays that are within current size limits and break as cleanly as possible for larger arrays.  It should also be deprecated in favor of an equivalent method that returns a long and operates properly on large arrays.

Additionally, there are extant interfaces (java.util.Collection) that are often used in tandem with arrays where it might be beneficial to make parellel changes.  Collection, for example, has a size () method that returns an int.  It isn't possible to change this to return a long, or add an alternate size method that returns a long, without breaking every extant implementation of this interface.   I'm thinking that the best non-breaking way to handle this would be to create a new subinterface of Collection that defined a new getSize () method that returned a long, and deprecate the extant size () method.  The existing collection implementations could then be retrofit to implement the new interface without breaking code that was dependent on the current implementation of Collection. 

Alternately, a completely new collections library (javax.collections or something) could be built based on 64-bit sizes, possibly incorporating other features that haven't been added to Collections over the years out of backward compatibility concerns.


REFLECTIVE APIS:

java.lang.reflect.Array would need additional getXXX (Object array, long index) methods parallel to the current getXXX (Object array, int index) methods--the current getXXX methods would still work.  Additional newInstance methods taking in long sizes would also be required.  The current getLength () methods would be deprecated in favor of a new getSize () method returning a long.


OTHER CHANGES:

VM needs to support large arrays.


MIGRATION:

Existing "hacks" to provide this kind of behavior could be replaced with large arrays.


COMPATIBILITY

BREAKING CHANGES:

Existing code that wasn't expecting a large array that was passed an array with 2^31 or more elements would only "see" the first 2^31 - 1 elements.


EXISTING PROGRAMS:

No changes.


REFERENCES

EXISTING BUGS:

4880587 (64-Bit indexing for arrays)


URL FOR PROTOTYPE (optional):

None.


      



More information about the coin-dev mailing list