[External] : Re: Trimming ArrayList and HashMap During GC Copy

Nathan Reynolds numeralnathan at gmail.com
Tue Feb 7 15:13:22 UTC 2023


Did I miss any replies?

On Thu, Feb 2, 2023 at 1:42 PM Nathan Reynolds <numeralnathan at gmail.com>
wrote:

> *Trim During Relocating*
>
> > If the bound used in the loop is a small constant, let’s say 4, then the
> JIT can also unroll the body of the loop 4 times so that the generated
> machine code doesn’t even have a loop any more, like this:
>
> I thought about loop unrolling, but left it out.  Thank you for including
> it because it helped me realize some more complexity.
>
> I still don't have solutions to dealing with JIT optimizations and loading
> elements from the array.
>
> *Shrink-at-Threshold Policy*
>
> > Copying tends to be faster than people think. Even when their mindset
> is that it’s cheap, it still tends to surprise. And if I recall your data
> correctly, these things tend to be rather small, no? ;-)
> > So I’d really try out the shrink-at-threshold policy, using a similar
> approach to the growing strategy, and run some programs using that, before
> drawing any conclusions about how it might perform. It just might be that
> it will work just fine. And if not, perhaps the policy could be refined to
> be more clever.
>
> Good point.  We shouldn't dismiss a potentially viable solution merely out
> of fear that it might perform poorly.  Fortunately, there are some really
> good performance tests that can verify whether the shrink-at-threshold
> policy has any impact.  For example, SPECjbb and SPECjEnterprise could show
> that most programs will be fine.
>
> There probably are programs with collection(s) that shouldn't shrink; each
> such collection instance can provide a configuration option to tune or
> disable the shrink-at-threshold policy.  We also need a configuration
> option to turn off the policy altogether, so that affected programs can
> keep decent performance until the problematic collections are individually
> configured.  Maybe the shrink-at-threshold policy can be smart enough to
> back off if the array is growing and shrinking too often.  If so, then all
> (?) programs would be fine.
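>
> As a concrete illustration, here is a minimal sketch of what a
> shrink-at-threshold policy could look like, mirroring ArrayList's
> grow-by-half strategy.  The class name, thresholds, and the half-capacity
> shrink target are all assumptions for illustration, not a proposal for the
> exact JDK change.

```java
import java.util.Arrays;

// Hypothetical sketch of a shrink-at-threshold policy.  Grows by half,
// like ArrayList; shrinks by half whenever the array falls below 1/4 full.
// Shrinking to half (not to size) leaves slack so that add/remove near the
// threshold doesn't thrash.
class ShrinkingList {
    private static final int MIN_CAPACITY   = 8;
    private static final int SHRINK_DIVISOR = 4;   // shrink when < 1/4 full

    private Object[] elementData = new Object[MIN_CAPACITY];
    private int size;

    void add(Object item) {
        if (size == elementData.length) {
            // Same growth factor as ArrayList: old + old/2
            elementData = Arrays.copyOf(elementData, size + (size >> 1));
        }
        elementData[size++] = item;
    }

    Object remove() {
        Object item = elementData[--size];
        elementData[size] = null;
        // Shrink-at-threshold: halve the array when it is < 1/4 full.
        if (elementData.length > MIN_CAPACITY
                && size < elementData.length / SHRINK_DIVISOR) {
            elementData = Arrays.copyOf(elementData,
                    Math.max(MIN_CAPACITY, elementData.length / 2));
        }
        return item;
    }

    int size()     { return size; }
    int capacity() { return elementData.length; }
}
```

> A policy like this is exactly what the performance tests above would need
> to evaluate, since the shrink copies cost CPU on every threshold crossing.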
>
> *Flagging*
>
> This approach is gradual and perhaps doesn't involve GC at all.  Each
> collection opts in by implementing a new method, private void
> markToTrim().  To avoid collisions with existing methods of the same
> signature, the method must be annotated with a new annotation,
> TrimMarkable.  A flagging mechanism (GC or something else) can then call
> this method.
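>
> To make the opt-in concrete, here is a hedged sketch of what the
> TrimMarkable annotation and an opted-in collection might look like.  The
> retention policy and target are my assumptions (chosen so the flagging
> mechanism could discover opted-in methods reflectively); nothing here is
> an existing JDK API.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical declaration of the proposed annotation.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface TrimMarkable {
}

// A collection opts in by declaring the annotated method.
class MyTrimmableCollection {
    private int size;

    @TrimMarkable
    private void markToTrim() {
        // One possible implementation: flip the sign bit of the size field.
        if (size > 0) size = -size;
    }
}
```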
>
> This solution has an advantage over the Trim During Relocation solution:
> other library collections can implement markToTrim(), so the solution
> isn't limited to the JDK collections.  markToTrim() also makes life easier
> for the flagging mechanism, since each collection takes care of its own
> implementation of trim marking.  Each collection is responsible for its
> own behavior, and the flagging mechanism is not tied to the collection
> except through the markToTrim() specification.
>
> One implementation of markToTrim() could use the sign bit in the
> collection's size field.  When the sign bit is set, that means the
> collection should be trimmed.  Later, when a program's thread uses the
> collection, the program's thread wipes out the negative sign and calls
> trimToSize().  Here is a possible implementation of markToTrim().
>
> private static final VarHandle SIZE_HANDLE;   // a static initializer
>                                               // connects this to the
>                                               // collection's size field
>
> private void markToTrim()
> {
>    int expect = SIZE_HANDLE.getOpaque(this);
>
>    if (expect > 0)          // Don't call size().  See size() later.
>    {
>       if (isTrimmable())
>       {
>          SIZE_HANDLE.compareAndSet(this, expect, -expect);
>       }
>    }
> }
>
> getOpaque() ensures that JIT won't cache the size field's value in a
> register or on the stack.
>
> The check that the size value is positive ensures that the sign bit is not
> already set.  If it is, there is no point in changing it.
>
> The compareAndSet() ensures that any changes by the Java thread(s) are not
> lost.  This allows markToTrim() to run concurrently with the Java threads.
> However, the effect of the compareAndSet() could be lost if a Java thread
> reads the size field before markToTrim() and then sets the size field
> after markToTrim().  Losing the effect of the compareAndSet() is actually
> desirable: it means a Java thread is changing the size of the collection,
> and hence trimming would probably be a poor choice.  For the same reason,
> looping to ensure the sign flag sticks is not desirable.
>
> isTrimmable() ensures that markToTrim() doesn't waste time setting the
> size sign bit for a collection that is already as compact as possible.
> This also eliminates branch mispredictions (see size() later) dealing with
> the sign bit in the size field if there is nothing to do.  isTrimmable()
> would look like this for ArrayList.
>
> private boolean isTrimmable()
> {
>    return size < elementData.length;     // Don't call size()
> }
>
> isTrimmable() for HashMap would look like the following.  Because
> elementData.length is a power of 2, the table can't be trimmed unless all
> the elements fit in half the table size times the load factor.
>
> private boolean isTrimmable()
> {
>    return size < elementData.length / 2 * loadFactor;   // Don't call size()
> }
>
> isTrimmable() is a separate method to make it clear that markToTrim() is
> fairly generic.  The real implementation probably wouldn't bother having
> isTrimmable().
>
> The collection's size() needs to change to the following.
>
> public int size()
> {
>    int result;
>
>    result = size;
>
>    return result >= 0 ? result : trimThenSize();
> }
>
> private int trimThenSize()
> {
>   int result;
>
>   result = Math.abs(size);
>   size   = result;
>
>   trimToSize();
>
>   return result;
> }
>
> trimToSize() is another method that the collection implements.  To see an
> example, see ArrayList.trimToSize().
>
> size() simply returns the size field's value if the sign bit is not set.
> Otherwise, it calls trimThenSize(), which does the heavy lifting of fixing
> the size field and doing the trim.  This allows JIT to inline size() but
> not trimThenSize(), for better performance.
>
> Copying the size field to the local variable result ensures that any
> concurrent changes to the size field from trimToSize() cannot cause a
> problem.  We only need to make sure that JIT doesn't get rid of the result
> local variable and reload the size field instead.  Since JIT is unlikely
> to reload the size field anyway, this is probably a non-issue.  Strictly
> speaking, though, we would need something like the opposite of an opaque
> access: a way to force JIT to *not* reload the size field.
>
> Ignoring the overhead of calling a method and returning a value, the
> original implementation of size() only has a memory load instruction to get
> the size field's value for a total cost of 3 cycles (on Intel x86).  The
> new implementation has the same memory load, a comparison, and a branch.
> This will add a total cost of 2 cycles and 3 bytes (on Intel x86).  For a
> loop, JIT will probably hoist the call to size() so the additional 2 cycles
> is insignificant compared to the rest of the loop's execution.  We can
> ignore branch misprediction penalties since the sign bit is rarely set.  I
> doubt anyone will notice the performance impact of any of these changes.
>
> The rest of the collection's code needs to call size() and not directly
> read the size field.  size() will be used in all (?) of the collection's
> methods.  This makes size() the ideal place to consider trimming the
> collection.
>
> *Flagging Mechanism*
>
> The flagging mechanism runs in Java code.  However, it needs a list of
> collections to flag.  This is similar to ReferenceQueue, Cleaner, String
> deduplication, and the dreaded finalize.
>
> I only know of one entity that has that information: the GC.
> Alternatively, if the JVM had some pauseless heap-walking API, a JVM
> thread could walk the heap and, as it finds each collection, tell a Java
> thread about it.
>
> We still want the 2 modes: off and idle.  Here are some other modes we may
> want to entertain.
>
> Idle mode: Something determines that the application has gone idle, kicks
> off a full GC, and returns memory to the OS.  That same thing could kick
> off a heap walk to find the collections.  However, since GC is going to
> walk the entire heap anyway, it would make sense to have GC provide the
> list of collections to the flagging mechanism.
>
> Every GC mode: Instead of an always mode, we may want an every-GC mode.
> Something can monitor for GCs and then start the heap walking, GC could
> kick off the heap walking, or we can piggyback on GC: whenever it
> discovers a collection, GC lets the flagging mechanism know.
>
> Promotion mode: Whenever GC promotes a collection from young to old, it
> informs the flagging mechanism of the collection.  GC only tells the
> flagging mechanism about the collection.  It doesn't do anything more.
>
> Relocation mode: Whenever GC relocates a collection, it informs the
> flagging mechanism of the collection.  Again, GC only tells the flagging
> mechanism about the collection.  It doesn't do anything more.
>
> We may want some combination of idle, promotion, and relocation.
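>
> To make the plumbing concrete, here is a rough sketch of how the flagging
> mechanism's Java side could look: whatever discovers collections (GC, a
> heap walker) enqueues them, and a Java thread drains the queue and calls
> markToTrim(), analogous to ReferenceQueue/Cleaner.  All names and the
> queue-based shape are assumptions for illustration.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// The proposed modes, as an enum for illustration.
enum TrimMode { OFF, IDLE, EVERY_GC, PROMOTION, RELOCATION }

// Stand-in for a collection that implements the annotated markToTrim().
interface TrimMarkableCollection {
    void markToTrim();
}

class TrimFlagger {
    private final TrimMode mode;
    private final Queue<TrimMarkableCollection> pending = new ArrayDeque<>();

    TrimFlagger(TrimMode mode) { this.mode = mode; }

    // Called by whatever discovers collections (GC, heap walk, ...).
    void flag(TrimMarkableCollection c) {
        if (mode != TrimMode.OFF) pending.add(c);
    }

    // Drained by a Java thread; returns how many collections were flagged.
    int drain() {
        int flagged = 0;
        TrimMarkableCollection c;
        while ((c = pending.poll()) != null) {
            c.markToTrim();
            flagged++;
        }
        return flagged;
    }
}
```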
>
> Thank you for reading this and my previous emails.  Please ask
> clarification questions, poke holes, etc.
>
> On Tue, Jan 31, 2023 at 10:11 PM Erik Osterlund <erik.osterlund at oracle.com>
> wrote:
>
>>
>>
>> On 31 Jan 2023, at 22:52, Nathan Reynolds <numeralnathan at gmail.com>
>> wrote:
>>
>>
>> Let's see if I understand by writing a concrete example.  Here is some
>> pseudo ArrayList.get() code.  The Object[] access bounds checking are
>> explicitly written to make it easier to discuss.
>>
>> public E get(int index)
>> {
>>    if ((index < 0) || (index >= size))
>>    {
>>       throw new IndexOutOfBoundsException();
>>    }
>>
>>    if ((index < 0) || (index >= elementData.length))   // Array bounds
>> checking
>>    {
>>       throw new ArrayIndexOutOfBoundsException();
>>    }
>>
>>    return (E) elementData[index];
>> }
>>
>> Let's say we have the following loop.
>>
>> int index;
>>
>> for (index = 0; index < myList.size(); index++)
>> {
>>    System.out.println(myList.get(index));
>> }
>>
>> Here is the loop code with ArrayList.size() and ArrayList.get() inlined.
>>
>> int index;
>>
>> for (index = 0; index < myList.size; index++)   // Note: size() was
>> replaced with the field size
>> {
>>    if ((index < 0) || (index >= myList.size))
>>    {
>>       throw new IndexOutOfBoundsException();
>>    }
>>
>>    if ((index < 0) || (index >= myList.elementData.length))   // Array
>> bounds checking
>>    {
>>       throw new ArrayIndexOutOfBoundsException();
>>    }
>>
>>    System.out.println((E) myList.elementData[index]);
>> }
>>
>> I am no JIT expert and what I say may be wrong...
>>
>>    - JIT can get rid of both index < 0 since JIT can prove that index is
>>    always >= 0.
>>    - JIT can get rid of index >= myList.size condition since JIT can
>>    prove that index must always be < size due to the for loop's condition.
>>    - JIT can store myList.size in a local variable.  This makes the
>>    local variable constant.
>>    - JIT can store myList.elementData in a local variable.  This makes
>>    the local variable constant.
>>    - With both size and elementData in local variables, this allows JIT
>>    to hoist the final array-bound checking condition out of the loop.
>>
>>
>> Object elementData[];
>> int size, index;
>>
>> size        = myList.size;
>> elementData = myList.elementData;
>>
>> if (size >= elementData.length)      // Array bounds checking
>> {
>>    // TODO Execute the code to print indexes 0 to elementData.length - 1
>>
>>    throw new ArrayIndexOutOfBoundsException();
>> }
>>
>> for (index = 0; index < size; index++)
>> {
>>    System.out.println((E) elementData[index]);
>> }
>>
>> For well-behaved Java code, ArrayList.size <= elementData.length.  Thus,
>> when GC trims elementData[], the new array will still be long enough.  The
>> JVM won't crash.
>>
>> For poorly behaved code, Thread A could be executing the above loop while
>> Thread B decrements ArrayList.size.  Thus, when GC trims elementData[],
>> the new array won't be long enough.  Thread A walks beyond the end of the
>> array and will corrupt memory or, more likely, crash the JVM.  Is this
>> what you are pointing out?
>>
>>
>> Yes, indeed. If the bound used in the loop is a small constant, let’s say
>> 4, then the JIT can also unroll the body of the loop 4 times so that the
>> generated machine code doesn’t even have a loop any more, like this:
>>
>> Object elementData[];
>> int size;
>>
>> size        = myList.size;
>> elementData = myList.elementData;
>>
>> if (size < 4 || elementData.length < 4)
>> {
>>    throw new ArrayIndexOutOfBoundsException();
>> }
>>
>> System.out.println(elementData[0]);
>> System.out.println(elementData[1]);
>> System.out.println(elementData[2]);
>> System.out.println(elementData[3]);
>>
>> For HashMap, the problem is more difficult to solve.  Let's say a thread
>> is iterating over the table and nodes.  If the nodes change bucket
>> locations, then the thread won't iterate over all the nodes.
>>
>>
>> Right.
>>
>> I need some time to think about these problems and come up with a
>> solution.  I figured I would get this email out there to take a step
>> forward while I think.
>>
>>
>> Okay.
>>
>> > You say taming growth isn’t the issue. So is it the lack of a shrinking
>> policy when removing elements perhaps? Whatever it is, I think it’s best
>> fixed explicitly at the library level, and not by hacking the entire JVM
>> around this issue.
>>
>> Yes, the problem is the lack of a shrinking policy when removing elements.
>>
>> We could trim the Object[] with every remove.  This would provide the
>> maximum memory savings, but really hurt performance if the ArrayList is
>> bouncing around in size.
>>
>> We could trim the Object[] when a lower threshold is reached.  This would
>> provide some memory savings and somewhat hurt performance if the ArrayList
>> is bouncing around in size between the lower threshold and the Object[]
>> length.
>>
>>
>> Copying tends to be faster than people think. Even when their mindset is
>> that it’s cheap, it still tends to surprise. And if I recall your data
>> correctly, these things tend to be rather small, no? ;-)
>>
>> So I’d really try out the shrink-at-threshold policy, using a similar
>> approach to the growing strategy, and run some programs using that, before
>> drawing any conclusions about how it might perform. It just might be that
>> it will work just fine. And if not, perhaps the policy could be refined to
>> be more clever.
>>
>> /Erik
>>
>> This is why I suggest we only trim when the program transitions to idle.
>> The trimming will achieve maximum memory savings and only hurt performance
>> when the program wakes up.  When it does wake up, many collections won't
>> change size and so there won't be any impact.  The collections that do
>> change size will experience a negative impact.  Servers usually wake up
>> slowly so the impact will be amortized over time.  Desktop programs do wake
>> up quickly.  However, the user probably won't notice.  The user will
>> probably like that the program is consuming the minimal amount of memory
>> while it idles.
>>
>> On Sat, Jan 28, 2023 at 12:06 AM Erik Osterlund <
>> erik.osterlund at oracle.com> wrote:
>>
>>> Hi Nathan,
>>>
>>> Unfortunately, that section did not cover it. That section talks about
>>> add, and source code level range checks in the code. I’m talking about get
>>> and JVM level range checks that happen on array loads. While the array list
>>> level source code range checks are interesting in their own right, they can
>>> be fiddled with when for example racingly using the ArrayList from
>>> different threads, or for example using reflection, class redefinition or
>>> instrumentation on the ArrayList. While that seems dumb, we can never allow
>>> the JVM to crash, even when processing dumb java code. That’s something
>>> that we gladly would walk through fire to ensure if we have to, and will
>>> never compromise on. Therefore the integrity of array accesses has to
>>> stand on their own feet, and not rely on some array list properties that
>>> dumb code violates.
>>>
>>> To ensure that, when you read elements[index], the JVM has to emit a
>>> JVM-internal range check to prevent erroneous code from accessing data
>>> outside of the array. So to be crystal clear - there is a separate source
>>> code level bounds check in the ArrayList against the size - that is not the
>>> one I am referring to. That one is *not* enough to guarantee the integrity
>>> of the array isn’t violated, and hence that we don’t crash the JVM.
>>>
>>> My main point was that taming the JVM internal range check of the array
>>> accesses, will likely come at a big cost. Both engineering wise and
>>> performance wise. The JIT believes the array length is truly immutable
>>> (because it is), and hence doesn’t believe any form of fencing will be used
>>> to guard updates to this field; it only applies to mutable fields, and the
>>> array length simply isn’t mutable, and is treated as such. This assumption
>>> is ingrained in the very reasoning of JVM internal range check and counted
>>> loop optimizations (loops that run a bounded number of times, usually
>>> driving performance optimizations). Teaching the JIT that the length field
>>> of arrays is in fact mutable, would ruin many such optimizations, which are
>>> crucial for performance. That’s what I’m saying is a hard one to sell.
>>>
>>> You say the benefit of solving this problem in the GC is that
>>> synchronization is free. But it isn’t free though when we have to tame the
>>> JIT and break loop optimizations to make it work. It seems to me like
>>> buying a free McDonald's lunch voucher for $1000.
>>>
>>> I think you are on to something in the problem domain, I just think the
>>> solution domain isn’t focused where it ought to be. If these data
>>> structures got bloated in an unwanted way, I think it would help to
>>> identify how they got there, and then try to improve the sizing heuristics
>>> in the libraries instead. You say taming growth isn’t the issue. So is it
>>> the lack of a shrinking policy when removing elements perhaps? Whatever it
>>> is, I think it’s best fixed explicitly at the library level, and not by
>>> hacking the entire JVM around this issue.
>>>
>>> /Erik
>>>
>>> On 28 Jan 2023, at 00:16, Nathan Reynolds <numeralnathan at gmail.com>
>>> wrote:
>>>
>>>
>>> > But first of all, you still didn’t address my concern that out of
>>> bounds accesses into ArrayList may sometimes crash the JVM, instead of just
>>> throwing innocent exceptions. The aaload does a range check to know if it
>>> should throw, that gets aggressively moved as early as possible.
>>>
>>> I thought I covered your concern in the section "Loop Hoisting
>>> Performance Concern".  I covered hoisting the if statement and range/bounds
>>> checking.  If you still think I haven't covered your concern please
>>> elaborate.
>>>
>>> On Thu, Jan 26, 2023 at 1:04 PM Erik Osterlund <
>>> erik.osterlund at oracle.com> wrote:
>>>
>>>> There is a lot to say here. But first of all, you still didn’t address
>>>> my concern that out of bounds accesses into ArrayList may sometimes crash
>>>> the JVM, instead of just throwing innocent exceptions. The aaload does a
>>>> range check to know if it should throw, that gets aggressively moved as
>>>> early as possible. Destroying array range check optimizations is going to
>>>> be a difficult sell as they are crucial for good performance, and with
>>>> those optimizations in place, shrinking arrays is not valid, as it allows
>>>> accesses outside the bounds of the array.
>>>>
>>>> /Erik
>>>>
>>>> On 26 Jan 2023, at 20:55, Nathan Reynolds <numeralnathan at gmail.com>
>>>> wrote:
>>>>
>>>>
>>>> > Perhaps the reason for the power of two sizing there, is because then
>>>> going from a hashCode to a bucket can be done quickly using bitwise and,
>>>> instead of expensive modulo. I suspect that changing that policy would come
>>>> at a significant CPU overhead. Also, when elements are scattered randomly
>>>> in the backing array, if you don’t rehash the table, you would only be able
>>>> to remove a couple of trailing nulls at the end of the array, while many
>>>> would remain in the middle between the elements. And rehashing might run
>>>> into collisions which requires allocating tree nodes - something we can’t
>>>> do from a GC context.
>>>>
>>>> You are correct.  Changing away from a power of 2 table size is going
>>>> to be a hard sell.  Perhaps there are workloads where the memory savings is
>>>> worth it, but I don't have any data to support or reject such a proposal.
>>>> I consider the non-power of 2 table sizes outside my proposal.  I am only
>>>> proposing trimming from one larger power of 2 to a smaller power of 2 as
>>>> long as the HashMap's load factor is adhered to.
>>>>
>>>> > Having said that, perhaps if a HashTable is too sparse, a smaller
>>>> default size of the table and more dense load factor, would save some
>>>> memory.
>>>>
>>>> A smaller default table size will save 0.5% heap space for the average
>>>> program.  We would need tests to make sure a smaller default size isn't
>>>> going to cause a problem.
>>>>
>>>> As for a more dense load factor, this drastically impacts the
>>>> performance of HashMap.  Perhaps some workloads can benefit from this.  We
>>>> could make the default loading factor a System property.  On the other
>>>> hand, if someone is going to set the System property, then they probably
>>>> already thought about setting the loading factor in the HashMap
>>>> constructor.  We would need more data to be able to figure this out.
>>>>
>>>> *Pivot*
>>>>
>>>> Okay.  I am pivoting the design a bit in order to address some
>>>> concerns.  I am still proposing off, idle, and always modes.  I still have
>>>> the caveat that the always mode may impact the Java application performance
>>>> too much.
>>>>
>>>> For Serial and Parallel GC, we can trim ArrayLists and HashMaps at any
>>>> time while the Java application is paused.
>>>>
>>>> For G1, the application is paused during the evacuation phase.  We can
>>>> trim during the evacuation since the Java application is paused.
>>>>
>>>> For ZGC, we can trim during the relocation operation.
>>>>
>>>> Here's what I understand about how ZGC works.
>>>>
>>>>    1. Right before relocation, the good color for pointers is changed
>>>>       to remap.
>>>>    2. There is a race between multiple Java and (multiple?) GC threads
>>>>       to change all the pointers to the remap color.
>>>>    3. The racing threads check if the pointer is destined for
>>>>       relocation.  If not, the threads change the color of the pointer
>>>>       and return.
>>>>    4. The threads check if the forwarding table already has a new
>>>>       address.  If so, the threads change the pointer to the new
>>>>       address and return.
>>>>    5. All threads allocate their own new space.
>>>>    6. All threads copy the object's contents to their own new space.
>>>>    7. The threads use CAS to attempt to insert their copy's address
>>>>       into the forwarding table.
>>>>    8. The winner's copy becomes the official copy of the object.
>>>>    9. The losers deallocate their own new space and use the official
>>>>       address for the object.
>>>>
>>>> At some point, ZGC selects which objects will be relocated.  ZGC then
>>>> executes step #1.  At the time step #1 happens, all objects become
>>>> temporarily immutable.  For objects that will stay in-place, the Java and
>>>> GC threads detect this and simply change the pointers' color.  The object
>>>> is no longer immutable.  For objects that will move, the Java and GC
>>>> threads copy from the old location into a new location(s).  The old
>>>> location is effectively immutable because the Java threads must copy the
>>>> object before any mutation can happen.
>>>>
>>>> Here's what needs to change to support trimming.  This only happens if
>>>> trimming is enabled for this GC collection.  If idle mode is selected, then
>>>> most GC collections won't have trimming enabled.
>>>>
>>>>    1. No change
>>>>    2. No change
>>>>    3. No change
>>>>    4. No change
>>>>    5. All threads allocate their own new space for the ArrayList and
>>>>    its trimmed Object[].
>>>>    6. All threads copy the ArrayList's contents, set the Object[]'s
>>>>    length in the new location to the ArrayList's size, and copy the Object[]'s
>>>>    elements up to size.
>>>>    7. No change
>>>>    8. No change
>>>>    9. No change
>>>>
>>>> When an object must be relocated, none of the threads can modify the
>>>> old location.  The existing and new relocation algorithms don't allow for
>>>> it.  This means the relocation and trimming does not have a race condition
>>>> even though threads are competing to see which can do the copy first.  For
>>>> example, let's say Thread A is on step #5, #6, or #7.  (The other steps
>>>> don't matter.)  Let's say Thread B wins the race and then executes
>>>> ArrayList's Java code, modifying the size or contents, or calling
>>>> grow() or a shrink.  This modification happens in the new location.
>>>> Thread A won't see it because Thread A only looks at the old location,
>>>> which the algorithm does not allow to be modified.  Thread A will
>>>> detect that it lost the race and throw away its *stale* new copy.
>>>> Data integrity is preserved.
>>>>
>>>> As for HashMap, here are steps #5 and #6.
>>>>
>>>> 5. All threads allocate their own new space for the HashMap, its new
>>>> power-of-2-length Object[], and all the HashMap Nodes.
>>>> 6. All threads copy the HashMap's contents and the HashMap Nodes' key,
>>>> value and hash into their new location (and not the next pointer).  All
>>>> threads rebuild their own copy of the hashtable.  The threads only use the
>>>> Nodes' hash value and hence don't need to compute hashes, check for
>>>> equality, or execute Java code.
>>>>
>>>> Don't we need to check if two Nodes have equal keys?  No.  The equality
>>>> check happened when the keys were put into the HashMap in Java code.  They
>>>> weren't equal then and so must remain as 2 different Nodes.
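>>>>
>>>> To illustrate step 6, here is a sketch of rebuilding a smaller
>>>> power-of-2 table using only each Node's cached hash, with no calls
>>>> into hashCode() or equals().  The Node shape mirrors HashMap.Node; the
>>>> relinking helper is illustrative only (the real GC-side copy would
>>>> clone the nodes into the new space rather than relink them in place).

```java
class TableRebuild {
    static class Node {
        final int hash;
        final Object key, value;
        Node next;
        Node(int hash, Object key, Object value) {
            this.hash = hash; this.key = key; this.value = value;
        }
    }

    static Node[] rebuild(Node[] oldTable, int newLength) {
        Node[] newTable = new Node[newLength];     // newLength is a power of 2
        for (Node bucket : oldTable) {
            for (Node n = bucket; n != null; ) {
                Node next = n.next;
                int i = n.hash & (newLength - 1);  // bitwise AND, not modulo
                n.next = newTable[i];              // push onto the new bucket
                newTable[i] = n;
                n = next;
            }
        }
        return newTable;
    }
}
```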
>>>>
>>>> *Java Code Changes*
>>>>
>>>> All the GC algorithms require some help from ArrayList and HashMap
>>>> code.  Let's consider this pseudo code for ArrayList.add().
>>>>
>>>> public boolean add(E item)
>>>> {
>>>>    int index, length;
>>>>
>>>>    // Potential safepoint #1
>>>>
>>>>    index = size++;
>>>>
>>>>    // Potential safepoint #2
>>>>    // TODO Add fence
>>>>    // Potential safepoint #3
>>>>
>>>>    length = elementData.length;
>>>>
>>>>    // Potential safepoint #4
>>>>
>>>>    if (size > length)
>>>>    {
>>>>       // Potential safepoint #5
>>>>       grow();
>>>>       // Potential safepoint #6
>>>>    }
>>>>
>>>>    // Potential safepoint #7
>>>>
>>>>    elementData[index] = item;
>>>>
>>>>    // Potential safepoint #8
>>>>
>>>>    return true;
>>>> }
>>>>
>>>> In order to ensure this code works with GC's trimming, we need to add a
>>>> fence between incrementing the size field and reading elementData.length.
>>>> This fence ensures that the trimming code sees the need for at least the
>>>> new size of elements in the new Object[].  The fence prevents JIT from
>>>> reordering the store to the size field after reading elementData's length.
>>>>
>>>> We only need an opaque fence.  We don't need a full or acquire fence
>>>> that impacts CPU behavior and hence performance.  GC pauses ensure adequate
>>>> CPU fencing.  JIT can put a safepoint at any of the potential safepoint
>>>> places and it will work.  Of course, potential safepoints #1 and #8 can
>>>> also be located outside add().  Potential safepoints #5 and #6 can be
>>>> located anywhere inside grow().
>>>>
>>>> Potential safepoint #1: Sadly, the trim operation will see the smaller
>>>> size and change elementData's length to match the current size.  When the
>>>> Java thread wakes up, the Java thread will call grow().  Since trimming
>>>> only happens when the program is transitioning to idle, the extra CPU time
>>>> to trim and regrow is inconsequential.  (This scenario begs the question if
>>>> the idle transition logic is correct because there is an active Java
>>>> thread.)
>>>>
>>>> Potential safepoint #2: The trim operation will see the larger size and
>>>> change elementData's length to match the larger size.  When the Java thread
>>>> wakes up, the Java thread will see that the elementData's length has room
>>>> for 1 more element.  The Java thread will not call grow() and simply
>>>> execute elementData[index] = item.
>>>>
>>>> Potential safepoint #3: Same as #2
>>>>
>>>> Potential safepoint #4: The trim operation will see the larger size and
>>>> change elementData's length to match the larger size.  When the Java thread
>>>> wakes up, the length local variable will not match elementData's length.
>>>> This is okay.  Here are 2 scenarios.  1) The length local variable
>>>> could be ≤ size.  Sadly, the Java thread will call grow(), obviating
>>>> the trim operation's work.  2) The length local variable could be >
>>>> size.  The Java code won't call grow() and will execute
>>>> elementData[index] = item.  Setting the item parameter into the array
>>>> will work because the trim operation left space at index.
>>>>
>>>> Potential safepoint #5: The trim operation will see the larger size and
>>>> change elementData's length to match the larger size.  Sadly, when the Java
>>>> thread wakes up, it will call grow() obviating the trim operation's work.
>>>>
>>>> Potential safepoint #6: The trim operation will see the larger size and
>>>> change elementData's length to match the larger size.  When the Java thread
>>>> wakes up, it will execute elementData[index] = item.  Setting the item
>>>> parameter into the array will work because the trim operation left space at
>>>> index.
>>>>
>>>> Potential safepoint #7: Same as #6.
>>>>
>>>> Potential safepoint #8: The Java thread finished adding the item
>>>> parameter to elementData before the trim operation.  Thus, the trim
>>>> operation will make elementData's length == size.
>>>>
>>>> For remove operations, the size change must come after
>>>> elementData[index] = null.  We need another opaque fence to ensure size
>>>> changes after elementData[index] = null.
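>>>>
>>>> Here is a sketch of that remove-side ordering, using a VarHandle
>>>> opaque store in place of the fence: the slot is nulled first, and only
>>>> then is the smaller size published.  The class and method names are
>>>> assumptions for illustration.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

class TrimSafeList {
    private Object[] elementData = new Object[8];
    private int size;

    private static final VarHandle SIZE;
    static {
        try {
            SIZE = MethodHandles.lookup()
                    .findVarHandle(TrimSafeList.class, "size", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    void add(Object item) { elementData[size++] = item; }

    Object removeLast() {
        int index = size - 1;
        Object item = elementData[index];
        // Null the slot first...
        elementData[index] = null;
        // ...then publish the smaller size with opaque ordering, so a trim
        // at a safepoint in between never sees the new size before the
        // null store.
        SIZE.setOpaque(this, index);
        return item;
    }

    int size() { return (int) SIZE.getOpaque(this); }
}
```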
>>>>
>>>> The same logic applies to HashMap: the size field must be incremented
>>>> before checking whether the hashtable needs to enlarge, and decremented
>>>> after removing a node from the hashtable.
>>>>
>>>> *Loop Hoisting Performance Concern*
>>>>
>>>> Let's say JIT inlines add() into a loop where the loop will add N
>>>> elements.  To hoist the if statement (and hence bounds check) out of the
>>>> loop, JIT would have to hoist size += N outside the loop along with the if
>>>> statement and the grow() call.  The opaque fence prevents JIT from hoisting
>>>> anything.  We can make JIT handle this special case and hoist the opaque
>>>> fence along with the rest of the hoisted logic.  No performance is lost.
>>>>
>>>> This violates the meaning of an opaque fence.  So, either we allow the
>>>> violation in this special case because the code still works or we create a
>>>> new fence that is weaker than opaque.  We really only need the store to the
>>>> size field to happen in program order before loading elementData.length.
>>>> All other loads and stores can move around.
>>>>
>>>> Perhaps we can use VarHandle.setOpaque(size) and
>>>> VarHandle.getOpaque(elementData) instead of a fence.  I don't know the
>>>> minutiae of how this works, but perhaps this change allows JIT to
>>>> recognize that it can hoist.
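>>>> A rough sketch of that setOpaque/getOpaque variant, using field
>>>> VarHandles (again illustrative, not the real ArrayList; whether the JIT
>>>> can hoist these opaque accesses is exactly the open question):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Illustrative sketch of using opaque access modes instead of a full fence:
// only the size store and the elementData load get opaque ordering; every
// other access stays plain.  That is the weakest ordering that still keeps
// the size store before the length check in program order.
class OpaqueAccessList {
    static final VarHandle SIZE;
    static final VarHandle ELEMENT_DATA;
    static {
        try {
            MethodHandles.Lookup l = MethodHandles.lookup();
            SIZE = l.findVarHandle(OpaqueAccessList.class, "size", int.class);
            ELEMENT_DATA = l.findVarHandle(OpaqueAccessList.class,
                                           "elementData", Object[].class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    Object[] elementData = new Object[4];
    int size;

    void add(Object item) {
        int s = size + 1;
        SIZE.setOpaque(this, s);                                  // opaque store to size
        Object[] data = (Object[]) ELEMENT_DATA.getOpaque(this);  // opaque load of the array
        if (s > data.length) {
            Object[] larger = new Object[data.length * 2];
            System.arraycopy(data, 0, larger, 0, data.length);
            elementData = larger;
            data = larger;
        }
        data[s - 1] = item;
    }
}
```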
>>>>
>>>> *Concurrent Collections*
>>>>
>>>> Concurrent collections don't need trimming to happen during GC since
>>>> the data structure already handles concurrent changes.  GC can simply
>>>> collect pointers to all the applicable concurrent collections (e.g.,
>>>> ConcurrentHashMap and CopyOnWriteArrayList).  We have a few options for
>>>> what kind of thread does the trimming.  1) The fork-join pool of
>>>> threads would take advantage of all the cores but may interfere with
>>>> the program.  2) A single dedicated Java thread can't take advantage of
>>>> multiple cores, but its impact on the program will be minimal.  Thread
>>>> priority is rarely used, but a configuration option to set this
>>>> thread's priority to the minimum would help ensure trimming never takes
>>>> CPU from the program.  3) Virtual (Loom) threads would take advantage
>>>> of all the cores, but since trimming is largely (perhaps completely)
>>>> non-blocking, they would rarely yield and would take CPU time away from
>>>> the program.  A configuration option to set these threads' priority to
>>>> low would be ideal.
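>>>> Option 2 could be sketched roughly like this.  The registry and trim
>>>> hook are hypothetical; real trimming would need cooperation from the
>>>> collection classes (or the GC) to find candidates:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of option 2: one dedicated minimum-priority daemon thread that
// periodically walks a registry of trim tasks.  Java has no true "idle"
// priority, so Thread.MIN_PRIORITY is the closest available knob.
class CollectionTrimmer {
    private final Set<Runnable> trimTasks = ConcurrentHashMap.newKeySet();

    void register(Runnable trimTask) { trimTasks.add(trimTask); }

    // One pass over every registered collection's trim hook.
    void runOnce() { trimTasks.forEach(Runnable::run); }

    Thread start(long intervalMillis) {
        Thread t = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                runOnce();
                try { Thread.sleep(intervalMillis); }
                catch (InterruptedException e) { return; }
            }
        }, "collection-trimmer");
        t.setDaemon(true);
        t.setPriority(Thread.MIN_PRIORITY);
        t.start();
        return t;
    }
}
```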
>>>>
>>>> *Notes*
>>>>
>>>> For ArrayList, the extra trimming work is negligible and probably ends
>>>> up being a net zero or faster because fewer nulls have to be copied from
>>>> the old Object[] to the new Object[] location.
>>>>
>>>> For HashMap, the extra trimming work will probably have a minuscule
>>>> impact.  However, this minuscule cost may be repaid by better overall
>>>> program performance.  Years ago, after String deduplication was
>>>> implemented, I was surprised that reducing the memory usage of programs
>>>> also improved response times.  So, the extra CPU spent on deduplication
>>>> was well spent.  In the same way, trim's CPU time may be repaid by
>>>> better overall program performance.
>>>>
>>>> We need to consider whether young collections will be trimmed.  On one
>>>> hand, trimming all objects keeps the ZGC relocation algorithm simple;
>>>> restricting trimming to old collections only costs one more if
>>>> statement in the relocation path.  On the other hand, it may be a waste
>>>> of CPU time to trim young collections since most of them will be
>>>> garbage soon.  However, if the program is transitioning to idle, then
>>>> trimming the young collections makes sense since these collections
>>>> could be around for a 4-day weekend.
>>>>
>>>> I am still only suggesting that we do trimming when the program
>>>> transitions to idle.  This way the extra CPU spent on trimming has an
>>>> inconsequential impact.  Potentially the always mode for trimming will be
>>>> fine for most workloads.  This will require performance testing.
>>>>
>>>> For always mode, perhaps we only trim old collections each GC cycle
>>>> and young collections when the program transitions to idle.  It would
>>>> be interesting to know how many old collections have large variances in
>>>> size.  Queues probably swing wildly in size, so trimming them wouldn't
>>>> be a good idea.  The data in caches probably change over time, but the
>>>> cache size is usually bounded and often effectively constant; if
>>>> constant, trimming will be of great benefit.  If we can determine which
>>>> collections swing wildly in size, then trimming can avoid those, and we
>>>> are one step closer to being able to enable the always mode for all
>>>> workloads.
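>>>> As a sketch, "swings wildly in size" might be detected like this; the
>>>> heuristic and its 25% threshold are invented for illustration:

```java
// Hypothetical heuristic for the always mode: track the minimum and maximum
// size observed since the last trim opportunity, and skip trimming when the
// spread is large relative to capacity.  The 25% threshold is invented.
class SizeVarianceTracker {
    private int minSeen = Integer.MAX_VALUE;
    private int maxSeen = -1;

    void observe(int size) {
        if (size < minSeen) minSeen = size;
        if (size > maxSeen) maxSeen = size;
    }

    // True when the size was stable enough that trimming is likely worthwhile.
    boolean shouldTrim(int capacity) {
        if (maxSeen < 0) return true;              // never observed: nothing vetoes a trim
        return (maxSeen - minSeen) * 4 < capacity; // spread under 25% of capacity
    }

    void reset() {
        minSeen = Integer.MAX_VALUE;
        maxSeen = -1;
    }
}
```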
>>>>
>>>> On Mon, Jan 23, 2023 at 10:20 PM Erik Osterlund <
>>>> erik.osterlund at oracle.com> wrote:
>>>>
>>>>>
>>>>>
>>>>> On 24 Jan 2023, at 02:33, Nathan Reynolds <numeralnathan at gmail.com>
>>>>> wrote:
>>>>>
>>>>> 
>>>>> > The array list code itself will probably blow up if for example you
>>>>> GC in the middle of let’s say the add method. It checks if we have
>>>>> capacity, then we GC draining that capacity, then add the object out of
>>>>> bounds. GC safepoints can happen anywhere in Java code, including in any
>>>>> method on the array list, which uses the length for various reasons,
>>>>> assuming it won’t change seemingly concurrently.
>>>>>
>>>>> You raise a valid point.  Thank you for catching this.  I wrote a
>>>>> response but came up with something better below.
>>>>>
>>>>> > Loads from an array can speculatively float above user control
>>>>> checks like the ArrayList internal range check. That’s fine and the user
>>>>> can’t observe the difference in behaviour. However, an aaload has its own
>>>>> range check against the actual array length. Loads into the array can not
>>>>> float above that array range check, because that could perform a load
>>>>> outside the heap boundary. Therefore, with your proposal, a read out of
>>>>> bounds of the array list would not only throw an exception, but
>>>>> occasionally also access outside the heap and hence crash the JVM, when
>>>>> accessed in a loop within which the capacity can shrink, rendering hoisted
>>>>> range checks invalid. That’s my concern here. And combating that would
>>>>> likely involve a huge deopt hammer triggered by GC.
>>>>>
>>>>> We can have GC only do the trimming operation when the only references
>>>>> to the ArrayList or HashMap are from other objects in the heap.  If the
>>>>> ArrayList or HashMap is referenced from a thread's stack or registers, then
>>>>> GC skips the ArrayList or HashMap.  Thus, a Java thread acquires a "lock"
>>>>> on the ArrayList or HashMap when it starts referencing the ArrayList or
>>>>> HashMap.  The Java thread releases a "lock" when it stops referencing the
>>>>> ArrayList or HashMap.  "lock" doesn't mean anything more than the thread
>>>>> having a reference to the ArrayList or HashMap.  It does not imply any
>>>>> synchronization mechanism that will cause a performance problem for the
>>>>> Java thread.
>>>>>
>>>>> Non-ZGC algorithms pause the application during the move operation.
>>>>> While the application is paused, GC can be sure that no Java threads are
>>>>> operating on the ArrayList or HashMap.
>>>>>
>>>>> For ZGC, a GC thread can move the ArrayList and its internal Object[]
>>>>> with trimming.  Whenever a GC thread is moving an object, Java threads that
>>>>> attempt to access the object already block and wait for the move to finish,
>>>>> if I understand correctly.  If the Java thread wants to access the
>>>>> ArrayList before a GC thread has moved it, the Java thread does the move
>>>>> operation.  Since the Java thread is executing code outside of the
>>>>> ArrayList, then the Java thread will do the trimming before the Java thread
>>>>> starts looking at Object[] bounds.
>>>>>
>>>>>
>>>>> Right. This takes us back to scanning all activation records in the
>>>>> system. As mentioned earlier, that tends to get rather expensive when you
>>>>> use virtual threads, and you can have activation records in the heap as
>>>>> well, which also similarly may have problematic references to array list
>>>>> internals. We have had to go over mechanisms that scan activation records
>>>>> in general, and rethink such algorithms to do something else, due to the
>>>>> scalability concerns.
>>>>>
>>>>> > Yes, I am talking about large object allocation. While we certainly
>>>>> could deal with it, my point was rather that it would be painful and
>>>>> require a bunch of special handling.
>>>>>
>>>>> The trimming can avoid large objects.  With that in mind, trimming
>>>>> large Object[] in ArrayList and HashMap is probably not worth the
>>>>> additional effort.  Large sparsely-used ArrayList doesn't even show up in
>>>>> the average heap dump.  Large sparsely-used HashMap only takes 0.4% of the
>>>>> heap.  Large sparsely-used ConcurrentHashMap only takes 0.1% of the heap.
>>>>> We can revisit trimming large sparsely-used ArrayLists and HashMaps when we
>>>>> are looking for heap reductions ≤ 0.5%. :)
>>>>>
>>>>>
>>>>> Fair enough.
>>>>>
>>>>>
>>>>> >> Perhaps this could be modelled as a library problem instead. Sounds
>>>>> like perhaps the growing strategy of the library is too aggressive for your
>>>>> preferences?
>>>>> > I was thinking of rather having a less aggressive growing strategy,
>>>>> or something along those lines.
>>>>>
>>>>> This won't help in HashMap's case.  The table sizes are powers of 2.
>>>>> The growth has to be a doubling.  HashMap is already growing as slowly as
>>>>> possible.
>>>>>
>>>>>
>>>>> Perhaps the reason for the power of two sizing there, is because then
>>>>> going from a hashCode to a bucket can be done quickly using bitwise and,
>>>>> instead of expensive modulo. I suspect that changing that policy would come
>>>>> at a significant CPU overhead. Also, when elements are scattered randomly
>>>>> in the backing array, if you don’t rehash the table, you would only be able
>>>>> to remove a couple of trailing nulls at the end of the array, while many
>>>>> would remain in the middle between the elements. And rehashing might run
>>>>> into collisions which requires allocating tree nodes - something we can’t
>>>>> do from a GC context.
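>>>>> Concretely, the power-of-two length is what lets the table turn a hash
>>>>> into a bucket with a single mask; this mirrors the (n - 1) & hash
>>>>> indexing in java.util.HashMap:

```java
// With a power-of-two table length n, (n - 1) is an all-ones bit mask, so
// (n - 1) & hash selects a bucket with one AND instead of a modulo.  This
// mirrors the (n - 1) & hash expression used inside java.util.HashMap.
class BucketIndex {
    static int bucket(int hash, int tableLength) {
        // Only valid when tableLength is a power of two.
        return (tableLength - 1) & hash;
    }
}
```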
>>>>>
>>>>> Having said that, if a hash table is too sparse, perhaps a smaller
>>>>> default table size and a denser load factor would save some memory.
>>>>>
>>>>> Slowing the growth rate of ArrayList will only reduce waste by < 0.3%.
>>>>>
>>>>>
>>>>> Okay.
>>>>>
>>>>> /Erik
>>>>>
>>>>> On Mon, Jan 23, 2023 at 2:34 PM Erik Osterlund <
>>>>> erik.osterlund at oracle.com> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On 23 Jan 2023, at 18:45, Nathan Reynolds <numeralnathan at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> 
>>>>>> > 1. Such a change would have user observable differences in
>>>>>> behaviour, which could introduce bugs in user code, due to the
>>>>>> optimization.
>>>>>>
>>>>>> How is this user observable?  The Object[] is buried inside an
>>>>>> ArrayList or HashMap.  This idea is not touching other Object[]'s outside a
>>>>>> collection.
>>>>>>
>>>>>> I suppose a performance impact from having to grow is somewhat
>>>>>> observable.  This was noted in my original email.  However, growing is not
>>>>>> functionally observable.
>>>>>>
>>>>>> The only way to functionally observe this is through reflection.
>>>>>> Reflection is a moot concern because there are no guarantees when using
>>>>>> reflection and switching to a new Java version.  So, any reflection into
>>>>>> ArrayList or HashMap would have to be fixed like any other functional
>>>>>> change to ArrayList or HashMap.
>>>>>>
>>>>>>
>>>>>> The array list code itself will probably blow up if for example you
>>>>>> GC in the middle of let’s say the add method. It checks if we have
>>>>>> capacity, then we GC draining that capacity, then add the object out of
>>>>>> bounds. GC safepoints can happen anywhere in Java code, including in any
>>>>>> method on the array list, which uses the length for various reasons,
>>>>>> assuming it won’t change seemingly concurrently.
>>>>>>
>>>>>> > 2. The JIT relies a lot on hoisting range checks out of loops to
>>>>>> get good performance, but we need to safepoint poll inside of loops. If we
>>>>>> take the slow path of such a safepoint poll, and perform a GC which shrinks
>>>>>> the arrays, then the range checks that were only performed before the loop,
>>>>>> are no longer valid, and we would essentially have to deoptimize all
>>>>>> activation records (frames) in the system, which with virtual threads could
>>>>>> be a lot of frames. Either that or stop hoisting range checks out of loops,
>>>>>> but that comes with its own performance problems.
>>>>>>
>>>>>> ArrayList uses its internal size field to determine what elements the
>>>>>> code looks at.  If the Object[] changes length to anything ≥ size, then
>>>>>> loops and hoisting still work.  The new Object[] length won't cause a
>>>>>> problem.
>>>>>>
>>>>>> HashMap may or may not be possible to trim.  It will take some effort
>>>>>> to examine the code and determine if a GC trim will work or the code may
>>>>>> need to be changed to be compatible.  One way to make the Java code
>>>>>> compatible would be to make the Java code set the HashMap's table field to
>>>>>> null while holding the reference to the table in a local variable.  GC will
>>>>>> then not recognize the table as belonging to HashMap and hence GC won't do
>>>>>> anything with it.
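>>>>>> That detach idea might look something like this sketch; the field and
>>>>>> method names are hypothetical, not the real java.util.HashMap:

```java
// Sketch of the "hide the table from GC" idea: while the table field is null,
// a trimming GC that discovers candidates by following the HashMap's field
// cannot see the array the thread is actively indexing into.  Names are
// hypothetical; this is not the real java.util.HashMap.
class DetachingTable {
    Object[] table = new Object[16];

    void put(int index, Object value) {
        Object[] local = table;
        table = null;             // detach: a field-following trimmer skips this map
        try {
            local[index] = value; // bounds-sensitive work against the local reference
        } finally {
            table = local;        // reattach once the work is done
        }
    }
}
```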
>>>>>>
>>>>>>
>>>>>> Loads from an array can speculatively float above user control checks
>>>>>> like the ArrayList internal range check. That’s fine and the user can’t
>>>>>> observe the difference in behaviour. However, an aaload has its own range
>>>>>> check against the actual array length. Loads into the array can not float
>>>>>> above that array range check, because that could perform a load outside the
>>>>>> heap boundary. Therefore, with your proposal, a read out of bounds of the
>>>>>> array list would not only throw an exception, but occasionally also access
>>>>>> outside the heap and hence crash the JVM, when accessed in a loop within
>>>>>> which the capacity can shrink, rendering hoisted range checks invalid.
>>>>>> That’s my concern here. And combating that would likely involve a huge
>>>>>> deopt hammer triggered by GC.
>>>>>>
>>>>>> > 3. Many GCs allocate objects that are larger than a certain
>>>>>> threshold in special storage that deals with unique concerns for larger
>>>>>> objects. If said objects shrink to below that limit, I suspect we would
>>>>>> have many interesting problems to deal with. Not impossible to fix, but
>>>>>> would probably be rather painful to deal with.
>>>>>>
>>>>>> Are you talking about large object allocation?  If so, this is beyond
>>>>>> my expertise.  However, I would assume that GC could allocate a new
>>>>>> location for the Object[], copy the data, and leave the large Object[] as
>>>>>> garbage for collection later.  This is probably worth it since the memory
>>>>>> savings could be very significant.
>>>>>>
>>>>>>
>>>>>> Yes, I am talking about large object allocation. While we certainly
>>>>>> could deal with it, my point was rather that it would be painful and
>>>>>> require a bunch of special handling.
>>>>>>
>>>>>> > Perhaps this could be modelled as a library problem instead. Sounds
>>>>>> like perhaps the growing strategy of the library is too aggressive for your
>>>>>> preferences?
>>>>>>
>>>>>> I am not sure how a Java library could implement this.  The library
>>>>>> could use reflection to get access to the Object[] in ArrayList or HashMap,
>>>>>> allocate a new Object[], copy the elements, and use reflection to store the
>>>>>> new Object[].  However, there is no synchronization mechanism to get the
>>>>>> other Java threads to see the reference to the new Object[].  Some threads
>>>>>> will continue to use the old Object[] and some threads will use the new
>>>>>> Object[].  Also, if other Java threads change the elements in the old
>>>>>> Object[] while the library is making the change, then those changes could
>>>>>> be lost.
>>>>>>
>>>>>>
>>>>>> I was thinking of rather having a less aggressive growing strategy,
>>>>>> or something along those lines.
>>>>>>
>>>>>> The beauty of trimming during GC move is that GC and Java threads
>>>>>> have already negotiated how to synchronize the object move.  In Serial GC,
>>>>>> the Java threads are paused and GC can move anything it wants.  GC changes
>>>>>> the stack (and registers?) of all the Java threads.  The Java threads have
>>>>>> no way to tell that the object moved.  In ZGC, the Java and GC threads work
>>>>>> together to perform the move.  If I understand correctly, other GC
>>>>>> algorithms execute moves like Serial or ZGC.
>>>>>>
>>>>>>
>>>>>> As I mentioned earlier, they have not negotiated when such GC can
>>>>>> shrink, if a java thread is in the middle of a method poking around at an
>>>>>> array list.
>>>>>>
>>>>>> Hope this makes sense.
>>>>>>
>>>>>> /Erik
>>>>>>
>>>>>>
>>>>>> On Mon, Jan 23, 2023 at 1:57 AM Erik Osterlund <
>>>>>> erik.osterlund at oracle.com> wrote:
>>>>>>
>>>>>>> Hi Nathan,
>>>>>>>
>>>>>>> I can think of a number of concerning aspects of doing something
>>>>>>> like this:
>>>>>>>
>>>>>>> 1. Such a change would have user observable differences in
>>>>>>> behaviour, which could introduce bugs in user code, due to the optimization.
>>>>>>>
>>>>>>> 2. The JIT relies a lot on hoisting range checks out of loops to get
>>>>>>> good performance, but we need to safepoint poll inside of loops. If we take
>>>>>>> the slow path of such a safepoint poll, and perform a GC which shrinks the
>>>>>>> arrays, then the range checks that were only performed before the loop, are
>>>>>>> no longer valid, and we would essentially have to deoptimize all activation
>>>>>>> records (frames) in the system, which with virtual threads could be a lot
>>>>>>> of frames. Either that or stop hoisting range checks out of loops, but that
>>>>>>> comes with its own performance problems.
>>>>>>>
>>>>>>> 3. Many GCs allocate objects that are larger than a certain
>>>>>>> threshold in special storage that deals with unique concerns for larger
>>>>>>> objects. If said objects shrink to below that limit, I suspect we would
>>>>>>> have many interesting problems to deal with. Not impossible to fix, but
>>>>>>> would probably be rather painful to deal with.
>>>>>>>
>>>>>>> Perhaps this could be modelled as a library problem instead. Sounds
>>>>>>> like perhaps the growing strategy of the library is too aggressive for your
>>>>>>> preferences?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> /Erik
>>>>>>>
>>>>>>> > On 21 Jan 2023, at 00:24, Nathan Reynolds <numeralnathan at gmail.com>
>>>>>>> wrote:
>>>>>>> >
>>>>>>> > ArrayList, HashMap, and other collections have internal Object[]'s
>>>>>>> to store the elements.  The Object[]'s are lazily initialized to minimize
>>>>>>> heap usage.  However, there is still quite a bit of waste in these
>>>>>>> Object[]'s.  Some collections are oversized from the beginning.  Some
>>>>>>> collections are filled and then partially or fully emptied.  For an average
>>>>>>> heap dump, 10.1% of the heap is wasted on oversized Object[]'s.
>>>>>>> >
>>>>>>> > These Object[]'s can be trimmed during GC's move operation but
>>>>>>> only when the process transitions to idle.  Since the trimming only happens
>>>>>>> when the application transitions to idle, then the extra CPU and/or pause
>>>>>>> time isn't a concern.  The performance penalty to the application to resize
>>>>>>> larger won't happen for a while since the process is idle.  With the
>>>>>>> reduction of heap usage more idle processes can run on a single VM (i.e.,
>>>>>>> the JVM will be more Docker Container friendly).
>>>>>>> >
>>>>>>> > The trimming may apply to other collections such as ArrayDeque,
>>>>>>> BitSet, HashSet, IdentityHashMap, LinkedHashMap, Stack, TreeMap, TreeSet,
>>>>>>> WeakHashMap, Hashtable, and Vector.  HashSet, IdentityHashMap,
>>>>>>> LinkedHashMap, TreeMap, and TreeSet may be implemented with HashMap so
>>>>>>> trimming HashMap will automatically make these other collections benefit.
>>>>>>> >
>>>>>>> > How did I come up with the 10.1% heap reduction?  I analyzed 650
>>>>>>> heap dumps and presented the findings in Java Memory Hogs.  By carefully
>>>>>>> adding up the right numbers, I calculate that the heap reduction will be
>>>>>>> 10.1%.  That's a big savings and it will impact the entire world of Java
>>>>>>> programs!
>>>>>>> >
>>>>>>> > I entertained the idea of having a setting to trim with every GC
>>>>>>> or when objects promote from young to old generation.  One concern is that
>>>>>>> the application would take a performance penalty to enlarge some or all the
>>>>>>> Object[]'s.  Another concern is that the extra CPU time and/or pause time
>>>>>>> will interfere with the application's performance.
>>>>>>> >
>>>>>>> > What do you think of the overall idea?  Do you have ideas of how
>>>>>>> to trim more often than when the application transitions to idle?
>>>>>>>
>>>>>>>

