A List implementation backed by multiple small arrays rather than the traditional single large array.

Tue Mar 30 13:30:57 UTC 2010

The VM-based arraylet implementation is by design minimalistic: it only
splits large arrays into smaller ones, nothing more. You must still wrap
primitive arrays by collection APIs uif you want anything else, including
dynamic size. But the opportunity to get some extra VM help for dynamic
sizing is obvious. Consider C's realloc() function. The Java language
doesn't currently have a realloc()-like API, because it's generally useless
in a garbage-collected heap that most often does not use free lists, and
most often allows compaction (realloc() is largely a clever,
statistically-efficient trick to gain some performance back from fragmented
heaps).

Now, arraylets would enable a special kind of safe realloc() operation that
makes sense for primitive arrays: it would always return a new array, in the
sense that the "root" array (pointers to slices) is new; but sharing most of
the slices with the original array. So if you have a 100K-element array
(that needs a 100-element root array for 1K slices), and grow it into
150K-position, we only need to allocate new slices for the extra 50K
positions, plus the new 150-element root array. And we only need to copy
data from the 100 positions of the old root array to the new one (and maybe,
from a single slice in the end of the original array, if its size didn't
match the maximum slice size - but then the collections growing algorithm
could easily avoid this). Array shrinking is even easier.

Notice that the old root array is still a live object, and now its slices
are aliased by a new root array, but this is only potentially confusing,
it's not unsafe. Collections would encapsulate these arrays and not expose
any aliasing or sharing (by not keeping any reference to the original array
after resize operations).

A+
Osvaldo

2010/3/29 Kevin L. Stern <kevin.l.stern at gmail.com>

> One advantage of this approach over the VM approach is that no data copy is
> necessary when the capacity of the data structure is expanded (new arrays
> are tacked on to the end of the top level array of references) or contracted
> (arrays of null are removed from the top level array of references) aside
> from the (any?) copy of the top level array of references.  One way to
> address your concern, though, is to create a ChunkedArray class that simply
> wraps an array and provides expand and contract functionality.  This could
> be reused in any/all collections.
>
>
> On Mon, Mar 29, 2010 at 10:08 AM, Osvaldo Doederlein <opinali at gmail.com>wrote:
>
>> Initially, it would be good enough to replace only java.util.ArrayList
>> with minimal overhead. ArrayList does not support efficient add-at-front or
>> other enhancements of ArrayDeque; but ArrayList is still a much more
>> important and popular collection, it's the primary "straight replacement for
>> primitive arrrays" and I guess it should continue with that role.
>>
>> One problem of both ArrayList and primitive arrays is that they're not
>> GC-friendly; huge arrays suck for GC. IBM's realtime Metronome collector
>> uses the "arraylet" structure for primitive arrays, so there is a hard
>> upper-limit on object size (well, at least as long as apps don't define
>> classes with thousands of fields, I guess). This avoids the whole issue of
>> "large objects" which permits a simpler heap layout, better incremental GC,
>> etc. There are two tradeoffs. First, some overhead for all array operations
>> - but this is the least important, remarkably as the arraylet trick is
>> implement at the VM level so we can rely on the JIT to perform extra
>> optimizations (e.g., unrolling and other loop optimizations; bounds-check
>> elimination and other array opts, may be arraylet-aware so most overhead is
>> cancelled or at least lifted out of loop bodies and hot paths.) Second, no
>> support at all for huge arrays is incompatible with native code that expects
>> a continuous layout, e.g. for the byte[]s inside Images - so all these uses
>> must be identified and fixed somehow, e.g using DirectBuffers, or changing
>> the native layer to understand arraylets (image libaries may be OK with
>> banding), or in the worst case just copy the data to/from a continuous,
>> native array (in most cases I think this copy already happens for other
>> reasons, so there's no extra copy, just a slightly more expensive copy).
>>
>> Now we're talking about some big VM change of course, but HotSpot would
>> not be the first production VM to do this so maybe it's a viable project for
>> the future, remarkably as Sun plans to keep raising the bar in
>> incremental/realtime GC (G1 may already be a great step forward, but huge
>> arrays will always spoil the fun for many apps).
>>
>> In summary I think the ChunkedArrayList would serve only as a stopgap
>> solution, with extremely limited benefits unless it's sufficiently good so
>> like Martin says, we can replace more List implementations. And I'll even
>> add, replace many other collections too - e.g. a giant HashMap will contain
>> a giant Entry[] array inside it, I want this array to be chunked too
>> (ConcurrentHashMap already is, but it's tuned up differently, for concurrent
>> usage - and that's just one example anyway). And by "replace" I further mean
>> "change the implementation of all existing collections that are
>> array-backed", not "offer new collections" as the latter will only be
>> heavily used ten years from today when JavaSE7 is considered the minimum
>> JavaSE release to be supported by apps/libraries/frameworks/containers/etc.
>> Even then, the benefits will be clearly inferior to what can be achireved by
>> VM-level arraylets.
>>
>> A+
>> Osvaldo
>>
>>
>> 2010/3/29 Martin Buchholz <martinrb at google.com>
>>
>>  On Sun, Mar 28, 2010 at 04:55, Kevin L. Stern <kevin.l.stern at gmail.com>
>>> wrote:
>>> > I put together the following class, ChunkedArrayList, in response to
>>> > Martin's request (excerpted from an earlier conversation on this web
>>> board)
>>> > below.
>>> >
>>> >
>>> https://docs.google.com/leaf?id=0B6brz3MPBDdhMGNiNGIwMTQtMTgxMi00ODlmLTk4ZGYtOWY2NDE0M2E5M2Zl&sort=name&layout=list&num=50
>>> >
>>> > Thoughts?
>>>
>>> This class is well on the way to what I was thinking of,
>>> but my bar for acceptance is a little higher.
>>> In particular, I don't want to add yet another class
>>> that is can replace some, but not all of existing
>>> list implementations.
>>>
>>> Most obviously, I don't want to lose the ability,
>>> introduced in ArrayDeque, of having O(1) insertion
>>> at the front and end of the collection.
>>> Perhaps you can do this by having one "arraylet"
>>> always be shared by both ends, which
>>> grow towards each other in circular fashion.
>>>
>>> I also think we should shrink the array when
>>> necessary, so that occupancy never drops
>>> below, say 50%.
>>>
>>> Perhaps we should also have amortized O(1)
>>> insertion in the middle by using a "gap array".
>>> Probably more important for byte/char collections
>>> like StringBuilder...
>>>
>>> I believe there are more complicated implementations
>>> that permit O(1) insertions at the ends, and only
>>> O(sqrt(N)) space overhead.
>>>
>>> ....
>>>
>>> E.g. Use your favorite search engine to do
>>> some research on:
>>> Resizable arrays in optimal time and space
>>> Succinct dynamic data structures
>>>
>>> Meta-comment: there is not enough transfer of
>>> academic research results into practice; I would think this
>>> is one of the responsibilities of the researchers.
>>>
>>> I presume you'd be willing to sign a
>>> contributor agreement to get your changes into
>>> the JDK someday.
>>>
>>> Martin
>>>
>>> > Regards,
>>> >
>>> > Kevin
>>> >
>>> >
>>> > On Tue, Mar 9, 2010 at 3:15 PM, Martin Buchholz <martinrb at google.com>
>>> > wrote:
>>> >
>>> >     It surely is not a good idea to use a single backing array
>>> >     for huge arrays.  As you point out, it's up to 32GB
>>> >     for just one object.  But the core JDK
>>> >     doesn't offer a suitable alternative for users who need very
>>> >     large collections.
>>> >
>>> >     It would have been more in the spirit of Java to have a
>>> >     collection class instead of ArrayList that was not fastest at
>>> >     any particular operation, but had excellent asymptotic behaviour,
>>> >     based on backing arrays containing backing arrays.
>>> >     But:
>>> >     - no such excellent class has been written yet
>>> >      (or please point me to such a class)
>>> >     - even if it were, such a best-of-breed-general-purpose
>>> >      List implementation would probably need to be introduced as a
>>> >      separate class, because of the performance expectations of
>>> >      existing implementations.
>>> >
>>> >     In the meantime, we have to maintain what we got,
>>> >     and that includes living with arrays and classes that wrap them.
>>> >
>>> >     Changing the spec is unlikely to succeed..
>>> >
>>> >     Martin
>>> >
>>>
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.openjdk.java.net/pipermail/core-libs-dev/attachments/20100330/1e7e1ded/attachment.html>