[External] : Re: Trimming ArrayList and HashMap During GC Copy
Nathan Reynolds
numeralnathan at gmail.com
Thu Jan 26 19:55:20 UTC 2023
> Perhaps the reason for the power of two sizing there, is because then
going from a hashCode to a bucket can be done quickly using bitwise and,
instead of expensive modulo. I suspect that changing that policy would come
at a significant CPU overhead. Also, when elements are scattered randomly
in the backing array, if you don’t rehash the table, you would only be able
to remove a couple of trailing nulls at the end of the array, while many
would remain in the middle between the elements. And rehashing might run
into collisions which requires allocating tree nodes - something we can’t
do from a GC context.
You are correct. Changing away from a power-of-2 table size is going to be
a hard sell. Perhaps there are workloads where the memory savings are worth
it, but I don't have any data to support or reject such a proposal. I
consider non-power-of-2 table sizes outside my proposal. I am only
proposing trimming from one larger power of 2 to a smaller power of 2, as
long as the HashMap's load factor is adhered to.
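As a sketch of the arithmetic being proposed (names are mine, not JDK code), the smallest power-of-2 table length that still honors the load factor could be computed like this:

```java
public class TrimSketch {
    // Smallest power-of-2 table length such that size <= length * loadFactor,
    // capped at the current length so trimming never grows the table.
    static int trimmedTableLength(int size, float loadFactor, int currentLength) {
        int needed = (int) Math.ceil(size / loadFactor);
        int newLength = 1;
        while (newLength < needed) {
            newLength <<= 1;  // stay a power of 2
        }
        return Math.min(newLength, currentLength);
    }
}
```

For example, a table of length 64 holding 5 entries at the default 0.75 load factor could shrink to length 8.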
> Having said that, perhaps if a HashTable is too sparse, a smaller default
size of the table and more dense load factor, would save some memory.
A smaller default table size will save 0.5% heap space for the average
program. We would need tests to make sure a smaller default size isn't
going to cause a problem.
As for a denser load factor, this drastically impacts the performance of
HashMap. Perhaps some workloads can benefit from this. We could make the
default load factor a System property. On the other hand, anyone who would
set the System property has probably already thought about setting the load
factor in the HashMap constructor. We would need more data to figure this
out.
*Pivot*
Okay. I am pivoting the design a bit in order to address some concerns. I
am still proposing off, idle, and always modes. I still have the caveat
that the always mode may impact the Java application performance too much.
For Serial and Parallel GC, we can trim ArrayLists and HashMaps at any time
while the Java application is paused.
For G1, the application is paused during the evacuation phase, so we can
trim during evacuation.
For ZGC, we can trim during the relocation operation.
Here's what I understand about how ZGC works.
   1. Right before relocation, the good color for pointers is changed to
   remap.
   2. There is a race between multiple Java and (multiple?) GC threads to
   change all the pointers to the remap color.
   3. The racing threads check if the pointer is destined for relocation.
      1. If not, the threads change the color of the pointer and return.
   4. The threads check if the forwarding table already has a new address.
      1. If so, the threads change the pointer to the new address and
      return.
   5. All threads allocate their own new space.
   6. All threads copy the object's contents to their own new space.
   7. The threads use CAS to attempt to insert their copy's address into
   the forwarding table.
   8. The winner's copy becomes the official copy of the object.
   9. The losers deallocate their own new space and use the official
   address for the object.
At some point, ZGC selects which objects will be relocated. ZGC then
executes step #1. At the time step #1 happens, all objects become
temporarily immutable. For objects that will stay in-place, the Java and
GC threads detect this and simply change the pointers' color. The object
is no longer immutable. For objects that will move, the Java and GC
threads copy from the old location into a new location. The old location
is effectively immutable because the Java threads must copy the object
before any mutation can happen.
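The copy race in steps #5 through #9 can be modeled in ordinary Java with a compare-and-swap. This is an illustrative model of the protocol only, not ZGC's actual data structures:

```java
import java.util.concurrent.atomic.AtomicReference;

public class RelocateSketch {
    // One forwarding-table slot, modeled as an AtomicReference. Each racing
    // thread makes a private copy and tries to publish it; exactly one wins.
    static Object relocate(AtomicReference<Object> forwardingSlot, Object oldObject) {
        Object forwarded = forwardingSlot.get();          // step #4: already forwarded?
        if (forwarded != null) {
            return forwarded;
        }
        Object myCopy = copyOf(oldObject);                // steps #5-#6: private copy
        if (forwardingSlot.compareAndSet(null, myCopy)) { // step #7: CAS to publish
            return myCopy;                                // step #8: this thread won
        }
        return forwardingSlot.get();                      // step #9: loser uses winner's copy
    }

    static Object copyOf(Object o) {
        return new Object[] { o };  // stand-in for copying the object's contents
    }
}
```

Because only the CAS winner's copy is ever returned, a loser's copy is unreachable and simply becomes garbage, which is the behavior described in step #9.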
Here's what needs to change to support trimming. This only happens if
trimming is enabled for this GC collection. If idle mode is selected, then
most GC collections won't have trimming enabled.
1. No change
2. No change
3. No change
4. No change
5. All threads allocate their own new space for the ArrayList and its
trimmed Object[].
6. All threads copy the ArrayList's contents, set the Object[]'s length
in the new location to the ArrayList's size, and copy the Object[]'s
elements up to size.
7. No change
8. No change
9. No change
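In plain Java terms, the trimmed copy in steps #5 and #6 amounts to the following (the real work would happen in GC code, not Java; field names are illustrative):

```java
import java.util.Arrays;

public class ArrayTrimSketch {
    // What the trimming copy produces: a backing array whose length equals
    // the ArrayList's size, dropping the unused trailing slots.
    static Object[] trimmedCopy(Object[] elementData, int size) {
        return Arrays.copyOf(elementData, size);
    }
}
```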
When an object must be relocated, none of the threads can modify the old
location. The existing and new relocation algorithms don't allow for it.
This means relocation and trimming do not have a race condition even though
threads are competing to see which can do the copy first. For example,
let's say Thread A is on step #5, #6, or #7. (The other steps don't
matter.) Let's say Thread B wins the race and then executes ArrayList's
Java code and modifies the size or contents, grows, and/or shrinks the
list. This modification happens in the new location. Thread A won't see
it because Thread A only looks at the old location, which the algorithm
does not allow to be modified. Thread A will detect it lost the race and
throw away its *stale* new copy. Data integrity is preserved.
As for HashMap, here are steps #5 and #6.
5. All threads allocate their own new space for the HashMap, its new
power-of-2-length Object[], and all the HashMap Nodes.
6. All threads copy the HashMap's contents and the HashMap Nodes' key,
value and hash into their new location (and not the next pointer). All
threads rebuild their own copy of the hashtable. The threads only use the
Nodes' hash value and hence don't need to compute hashes, check for
equality, or execute Java code.
Don't we need to check whether two Nodes have equal keys? No. The equality
check happened when the keys were put into the HashMap in Java code. They
weren't equal then, so they must remain 2 different Nodes.
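The rebuild in step #6 only needs each Node's cached hash, because a bucket index in a power-of-2 table is just hash & (length - 1). Here is a sketch with a simplified node type (the real HashMap.Node differs):

```java
public class RehashSketch {
    static final class Node {
        final int hash;
        final Object key, value;
        Node next;
        Node(int hash, Object key, Object value) {
            this.hash = hash; this.key = key; this.value = value;
        }
    }

    // Copy every node's hash/key/value (but not next) into a smaller
    // power-of-2 table. No hashCode()/equals() calls, so no Java code runs.
    static Node[] rebuild(Node[] oldTable, int newLength) {
        Node[] newTable = new Node[newLength];
        for (Node n : oldTable) {
            for (; n != null; n = n.next) {
                Node copy = new Node(n.hash, n.key, n.value);
                int i = n.hash & (newLength - 1);  // power-of-2 bucket index
                copy.next = newTable[i];           // push onto the new bucket
                newTable[i] = copy;
            }
        }
        return newTable;
    }
}
```

Note this sketch ignores treeified bins; as discussed above, rehashing that needs to allocate tree nodes is exactly what can't be done from a GC context.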
*Java Code Changes*
All the GC algorithms require some help from ArrayList and HashMap code.
Let's consider this pseudo code for ArrayList.add().
public boolean add(E item)
{
    int index, length;

    // Potential safepoint #1
    index = size++;
    // Potential safepoint #2
    // TODO Add fence
    // Potential safepoint #3
    length = elementData.length;
    // Potential safepoint #4
    if (size > length)
    {
        // Potential safepoint #5
        grow();
        // Potential safepoint #6
    }
    // Potential safepoint #7
    elementData[index] = item;
    // Potential safepoint #8
    return true;
}
In order to ensure this code works with GC's trimming, we need to add a
fence between incrementing the size field and reading elementData.length.
This fence ensures that the trimming code sees the need for at least the
new size of elements in the new Object[]. The fence prevents JIT from
reordering the store to the size field after reading elementData's length.
We only need an opaque fence. We don't need a full or acquire fence that
impacts CPU behavior and hence performance. GC pauses ensure adequate CPU
fencing. JIT can put a safepoint at any of the potential safepoint places
and it will work. Of course, potential safepoints #1 and #8 can also be
located outside add(). Potential safepoints #5 and #6 can be located
anywhere inside grow().
Potential safepoint #1: Sadly, the trim operation will see the smaller size
and change elementData's length to match the current size. When the Java
thread wakes up, it will call grow(). Since trimming only happens when the
program is transitioning to idle, the extra CPU time to trim and regrow is
inconsequential. (This scenario raises the question of whether the
idle-transition logic is correct, because there is an active Java thread.)
Potential safepoint #2: The trim operation will see the larger size and
change elementData's length to match the larger size. When the Java thread
wakes up, the Java thread will see that the elementData's length has room
for 1 more element. The Java thread will not call grow() and simply
execute elementData[index] = item.
Potential safepoint #3: Same as #2
Potential safepoint #4: The trim operation will see the larger size and
change elementData's length to match the larger size. When the Java thread
wakes up, the length local variable will not match elementData's length.
This is okay. Here are 2 scenarios. 1) The length local variable could be
≤ size. Sadly, the Java thread will call grow(), obviating the trim
operation's work. 2) The length local variable could be > size. The
Java code won't call grow() and will execute elementData[index] = item.
Setting the item parameter into the array will work because the trim
operation left space at index.
Potential safepoint #5: The trim operation will see the larger size and
change elementData's length to match the larger size. Sadly, when the Java
thread wakes up, it will call grow() obviating the trim operation's work.
Potential safepoint #6: The trim operation will see the larger size and
change elementData's length to match the larger size. When the Java thread
wakes up, it will execute elementData[index] = item. Setting the item
parameter into the array will work because the trim operation left space at
index.
Potential safepoint #7: Same as #6.
Potential safepoint #8: The Java thread finished adding the item parameter
to elementData before the trim operation. Thus, the trim operation will
make elementData's length == size.
For remove operations, the size change must come after elementData[index] =
null. We need another opaque fence to enforce that ordering.
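A sketch of the remove-side ordering (heavily simplified; the comment marks where the opaque fence would go, and field names are illustrative):

```java
public class RemoveSketch<E> {
    Object[] elementData = new Object[10];
    int size;

    public void add(E item) {
        elementData[size++] = item;
    }

    // Clear the vacated slot first, then shrink size, so a trim that
    // observes the smaller size never trims away a still-live element.
    public E remove(int index) {
        @SuppressWarnings("unchecked")
        E old = (E) elementData[index];
        System.arraycopy(elementData, index + 1, elementData, index,
                         size - index - 1);
        elementData[size - 1] = null;
        // opaque fence here: the null store must happen in program order...
        size--;  // ...before the store of the smaller size
        return old;
    }
}
```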
The same logic applies to HashMap: the size field must be incremented
before checking whether the hashtable needs to enlarge, and decremented
after removing a node from the hashtable.
*Loop Hoisting Performance Concern*
Let's say JIT inlines add() into a loop where the loop will add N
elements. To hoist the if statement (and hence bounds check) out of the
loop, JIT would have to hoist size += N outside the loop along with the if
statement and the grow() call. The opaque fence prevents JIT from hoisting
anything. We could make JIT handle this special case and hoist the opaque
fence along with the rest of the hoisted logic, so no performance is lost.
However, this violates the meaning of an opaque fence. So either we allow
the violation in this special case because the code still works, or we
create a new fence that is weaker than opaque. We really only need the
store to the size field to happen in program order before loading
elementData.length. All other loads and stores can move around.
Perhaps we can use VarHandle.setOpaque(size) and
VarHandle.getOpaque(elementData) instead of a fence. I don't know the
minutiae of how this works, but perhaps this change allows JIT to recognize
that it can hoist.
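A sketch of what the VarHandle variant might look like (illustrative only; whether opaque mode is strong enough here is exactly the open question above, and grow() is inlined for brevity):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.util.Arrays;

public class OpaqueAddSketch {
    static final VarHandle SIZE;
    static {
        try {
            SIZE = MethodHandles.lookup()
                    .findVarHandle(OpaqueAddSketch.class, "size", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    Object[] elementData = new Object[4];
    int size;

    public boolean add(Object item) {
        int index = size;
        SIZE.setOpaque(this, index + 1);   // publish size in program order...
        int length = elementData.length;   // ...before reading the length
        if (index + 1 > length) {
            elementData = Arrays.copyOf(elementData, length * 2);  // grow()
        }
        elementData[index] = item;
        return true;
    }
}
```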
*Concurrent Collections*
Concurrent collections don't need trimming to happen during GC since the
data structure already handles concurrent changes. GC can simply collect
pointers to all the applicable concurrent collections (e.g.,
ConcurrentHashMap and CopyOnWriteArrayList). We have a few options for
what kind of thread does the trimming. 1) Using the fork-join pool of
threads would take advantage of all the cores but may interfere with the
program. 2) A single dedicated Java thread means the trimming won't be
able to take advantage of multiple cores, yet the impact on the program
will be minimal. I know that thread priority is rarely used, but a
configuration option to set this thread's priority to idle would ensure
trimming never takes CPU from the program. 3) Loom threads would take
advantage of all the cores but will take CPU time away from the program,
since trimming is largely (completely?) a non-blocking operation. A
configuration option to set these threads' priority to idle or low would be
ideal.
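Option 2 can be sketched with existing thread APIs. The trim action itself is a placeholder (no such hook exists today), and Thread.MIN_PRIORITY is the closest thing the platform offers to an idle priority:

```java
public class TrimmerThreadSketch {
    // One dedicated low-priority daemon thread that periodically runs a
    // trim action over whatever collections have been registered with it.
    static Thread startTrimmer(Runnable trimAll, long intervalMillis) {
        Thread t = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                trimAll.run();
                try {
                    Thread.sleep(intervalMillis);
                } catch (InterruptedException e) {
                    return;  // shut down cleanly on interrupt
                }
            }
        }, "collection-trimmer");
        t.setDaemon(true);                   // never keeps the JVM alive
        t.setPriority(Thread.MIN_PRIORITY);  // minimize interference
        t.start();
        return t;
    }
}
```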
*Notes*
For ArrayList, the extra trimming work is negligible and probably ends up
being a net zero or faster because fewer nulls have to be copied from the
old Object[] to the new Object[] location.
For HashMap, the extra trimming work will probably have a minuscule
impact. However, this minuscule impact may be offset by better overall
program performance. Years ago, after String de-duplication was
implemented, I was surprised that reducing the memory usage of programs had
the benefit of improving response times. So, the extra CPU spent to
de-duplicate is well spent in improving program performance. In the same
way, trim's CPU time may be repaid by better overall program performance.
We need to consider if young collections will be trimmed. On one hand,
trimming all objects makes the ZGC relocation algorithm easier. However,
the ZGC relocation algorithm only needs to check if the collection is old.
So, the algorithm only needs another if statement. On the other hand, it
may be a waste of CPU time to trim young collections since most of the
young collections will be garbage soon. However, if the program is
transitioning to idle, then trimming the young collections makes sense
since these collections could be around for a 4-day weekend.
I am still only suggesting that we do trimming when the program transitions
to idle. This way the extra CPU spent on trimming has an inconsequential
impact. Potentially the always mode for trimming will be fine for most
workloads. This will require performance testing.
For always mode, perhaps we only trim old collections each GC cycle and
young collections when the program transitions to idle. It would be
interesting to know how many old collections have large variances in size.
Queues probably swing wildly in size and so trimming wouldn't be a good
idea. The data in caches probably change over time. However, the cache
size is usually fixed and maybe constant. If constant, trimming will be of
great benefit. If we can determine which collections swing wildly in size,
then trimming can avoid those and we are one step closer to being able to
enable the always mode for all workloads.
On Mon, Jan 23, 2023 at 10:20 PM Erik Osterlund <erik.osterlund at oracle.com>
wrote:
>
>
> On 24 Jan 2023, at 02:33, Nathan Reynolds <numeralnathan at gmail.com> wrote:
>
>
> > The array list code itself will probably blow up if for example you GC
> in the middle of let’s say the add method. It checks if we have capacity,
> then we GC draining that capacity, then add the object out of bounds. GC
> safepoints can happen anywhere in Java code, including in any method on the
> array list, which uses the length for various reasons, assuming it won’t
> change seemingly concurrently.
>
> You raise a valid point. Thank you for catching this. I wrote a response
> but came up with something better below.
>
> > Loads from an array can speculatively float above user control checks
> like the ArrayList internal range check. That’s fine and the user can’t
> observe the difference in behaviour. However, an aaload has its own range
> check against the actual array length. Loads into the array can not float
> above that array range check, because that could perform a load outside the
> heap boundary. Therefore, with your proposal, a read out of bounds of the
> array list would not only throw an exception, but occasionally also access
> outside the heap and hence crash the JVM, when accessed in a loop within
> which the capacity can shrink, rendering hoisted range checks invalid.
> That’s my concern here. And combating that would likely involve a huge
> deopt hammer triggered by GC.
>
> We can have GC only do the trimming operation when the only references to
> the ArrayList or HashMap are from other objects in the heap. If the
> ArrayList or HashMap is referenced from a thread's stack or registers, then
> GC skips the ArrayList or HashMap. Thus, a Java thread acquires a "lock"
> on the ArrayList or HashMap when it starts referencing the ArrayList or
> HashMap. The Java thread releases a "lock" when it stops referencing the
> ArrayList or HashMap. "lock" doesn't mean anything more than the thread
> having a reference to the ArrayList or HashMap. It does not imply any
> synchronization mechanism that will cause a performance problem for the
> Java thread.
>
> Non-ZGC algorithms pause the application during the move operation. While
> the application is paused, GC can be sure that no Java threads are
> operating on the ArrayList or HashMap.
>
> For ZGC, a GC thread can move the ArrayList and its internal Object[] with
> trimming. Whenever a GC thread is moving an object, Java threads that
> attempt to access the object already block and wait for the move to finish,
> if I understand correctly. If the Java thread wants to access the
> ArrayList before a GC thread has moved it, the Java thread does the move
> operation. Since the Java thread is executing code outside of the
> ArrayList, then the Java thread will do the trimming before the Java thread
> starts looking at Object[] bounds.
>
>
> Right. This takes us back to scanning all activation records in the
> system. As mentioned earlier, that tends to get rather expensive when you
> use virtual threads, and you can have activation records in the heap as
> well, which also similarly may have problematic references to array list
> internals. We have had to go over mechanisms that scan activation records
> in general, and rethink such algorithms to do something else, due to the
> scalability concerns.
>
> > Yes, I am talking about large object allocation. While we certainly
> could deal with it, my point was rather that it would be painful and
> require a bunch of special handling.
>
> The trimming can avoid large objects. With that in mind, trimming large
> Object[] in ArrayList and HashMap is probably not worth the additional
> effort. Large sparsely-used ArrayList doesn't even show up in the average
> heap dump. Large sparsely-used HashMap only takes 0.4% of the heap. Large
> sparsely-used ConcurrentHashMap only takes 0.1% of the heap. We can
> revisit trimming large sparsely-used ArrayLists and HashMaps when we are
> looking for heap reductions ≤ 0.5%. :)
>
>
> Fair enough.
>
>
> >> Perhaps this could be modelled as a library problem instead. Sounds
> like perhaps the growing strategy of the library is too aggressive for your
> preferences?
> > I was thinking of rather having a less aggressive growing strategy, or
> something along those lines.
>
> This won't help in HashMap's case. The table sizes are powers of 2. The
> growth has to be a doubling. HashMap is already growing as slowly as
> possible.
>
>
> Perhaps the reason for the power of two sizing there, is because then
> going from a hashCode to a bucket can be done quickly using bitwise and,
> instead of expensive modulo. I suspect that changing that policy would come
> at a significant CPU overhead. Also, when elements are scattered randomly
> in the backing array, if you don’t rehash the table, you would only be able
> to remove a couple of trailing nulls at the end of the array, while many
> would remain in the middle between the elements. And rehashing might run
> into collisions which requires allocating tree nodes - something we can’t
> do from a GC context.
>
> Having said that, perhaps if a HashTable is too sparse, a smaller default
> size of the table and more dense load factor, would save some memory.
>
> Slowing the growth rate of ArrayList will only reduce waste by < 0.3%.
>
>
> Okay.
>
> /Erik
>
> On Mon, Jan 23, 2023 at 2:34 PM Erik Osterlund <erik.osterlund at oracle.com>
> wrote:
>
>>
>>
>> On 23 Jan 2023, at 18:45, Nathan Reynolds <numeralnathan at gmail.com>
>> wrote:
>>
>>
>> > 1. Such a change would have user observable differences in behaviour,
>> which could introduce bugs in user code, due to the optimization.
>>
>> How is this user observable? The Object[] is buried inside an ArrayList
>> or HashMap. This idea is not touching other Object[]'s outside a
>> collection.
>>
>> I suppose a performance impact from having to grow is somewhat
>> observable. This was noted in my original email. However, growing is not
>> functionally observable.
>>
>> The only way to functionally observe this is through reflection.
>> Reflection is a moot concern because there are no guarantees when using
>> reflection and switching to a new Java version. So, any reflection into
>> ArrayList or HashMap would have to be fixed like any other functional
>> change to ArrayList or HashMap.
>>
>>
>> The array list code itself will probably blow up if for example you GC in
>> the middle of let’s say the add method. It checks if we have capacity, then
>> we GC draining that capacity, then add the object out of bounds. GC
>> safepoints can happen anywhere in Java code, including in any method on the
>> array list, which uses the length for various reasons, assuming it won’t
>> change seemingly concurrently.
>>
>> > 2. The JIT relies a lot on hoisting range checks out of loops to get
>> good performance, but we need to safepoint poll inside of loops. If we take
>> the slow path of such a safepoint poll, and perform a GC which shrinks the
>> arrays, then the range checks that were only performed before the loop, are
>> no longer valid, and we would essentially have to deoptimize all activation
>> records (frames) in the system, which with virtual threads could be a lot
>> of frames. Either that or stop hoisting range checks out of loops, but that
>> comes with its own performance problems.
>>
>> ArrayList uses its internal size field to determine what elements the
>> code looks at. If the Object[] changes length to anything ≥ size, then
>> loops and hoisting still work. The new Object[] length won't cause a
>> problem.
>>
>> HashMap may or may not be possible to trim. It will take some effort to
>> examine the code and determine if a GC trim will work or the code may need
>> to be changed to be compatible. One way to make the Java code compatible
>> would be to make the Java code set the HashMap's table field to null while
>> holding the reference to the table in a local variable. GC will then not
>> recognize the table as belonging to HashMap and hence GC won't do anything
>> with it.
>>
>>
>> Loads from an array can speculatively float above user control checks
>> like the ArrayList internal range check. That’s fine and the user can’t
>> observe the difference in behaviour. However, an aaload has its own range
>> check against the actual array length. Loads into the array can not float
>> above that array range check, because that could perform a load outside the
>> heap boundary. Therefore, with your proposal, a read out of bounds of the
>> array list would not only throw an exception, but occasionally also access
>> outside the heap and hence crash the JVM, when accessed in a loop within
>> which the capacity can shrink, rendering hoisted range checks invalid.
>> That’s my concern here. And combating that would likely involve a huge
>> deopt hammer triggered by GC.
>>
>> > 3. Many GCs allocate objects that are larger than a certain threshold
>> in special storage that deals with unique concerns for larger objects. If
>> said objects shrink to below that limit, I suspect we would have many
>> interesting problems to deal with. Not impossible to fix, but would
>> probably be rather painful to deal with.
>>
>> Are you talking about large object allocation? If so, this is beyond my
>> expertise. However, I would assume that GC could allocate a new location
>> for the Object[], copy the data, and leave the large Object[] as garbage
>> for collection later. This is probably worth it since the memory saves
>> could be very significant.
>>
>>
>> Yes, I am talking about large object allocation. While we certainly could
>> deal with it, my point was rather that it would be painful and require a
>> bunch of special handling.
>>
>> > Perhaps this could be modelled as a library problem instead. Sounds
>> like perhaps the growing strategy of the library is too aggressive for your
>> preferences?
>>
>> I am not sure how a Java library could implement this. The library could
>> use reflection to get access to the Object[] in ArrayList or HashMap,
>> allocate a new Object[], copy the elements, and use reflection to store the
>> new Object[]. However, there is no synchronization mechanism to get the
>> other Java threads to see the reference to the new Object[]. Some threads
>> will continue to use the old Object[] and some threads will use the new
>> Object[]. Also, if other Java threads change the elements in the old
>> Object[] while the library is making the change, then those changes could
>> be lost.
>>
>>
>> I was thinking of rather having a less aggressive growing strategy, or
>> something along those lines.
>>
>> The beauty of trimming during GC move is that GC and Java threads have
>> already negotiated how to synchronize the object move. In Serial GC, the
>> Java threads are paused and GC can move anything it wants. GC changes the
>> stack (and registers?) of all the Java threads. The Java threads have no
>> way to tell that the object moved. In ZGC, the Java and GC threads work
>> together to perform the move. If I understand correctly, other GC
>> algorithms execute moves like Serial or ZGC.
>>
>>
>> As I mentioned earlier, they have not negotiated when such GC can shrink,
>> if a java thread is in the middle of a method poking around at an array
>> list.
>>
>> Hope this makes sense.
>>
>> /Erik
>>
>>
>> On Mon, Jan 23, 2023 at 1:57 AM Erik Osterlund <erik.osterlund at oracle.com>
>> wrote:
>>
>>> Hi Nathan,
>>>
>>> I can think of a number of concerning aspects of doing something like
>>> this:
>>>
>>> 1. Such a change would have user observable differences in behaviour,
>>> which could introduce bugs in user code, due to the optimization.
>>>
>>> 2. The JIT relies a lot on hoisting range checks out of loops to get
>>> good performance, but we need to safepoint poll inside of loops. If we take
>>> the slow path of such a safepoint poll, and perform a GC which shrinks the
>>> arrays, then the range checks that were only performed before the loop, are
>>> no longer valid, and we would essentially have to deoptimize all activation
>>> records (frames) in the system, which with virtual threads could be a lot
>>> of frames. Either that or stop hoisting range checks out of loops, but that
>>> comes with its own performance problems.
>>>
>>> 3. Many GCs allocate objects that are larger than a certain threshold in
>>> special storage that deals with unique concerns for larger objects. If said
>>> objects shrink to below that limit, I suspect we would have many
>>> interesting problems to deal with. Not impossible to fix, but would
>>> probably be rather painful to deal with.
>>>
>>> Perhaps this could be modelled as a library problem instead. Sounds like
>>> perhaps the growing strategy of the library is too aggressive for your
>>> preferences?
>>>
>>> Thanks,
>>> /Erik
>>>
>>> > On 21 Jan 2023, at 00:24, Nathan Reynolds <numeralnathan at gmail.com>
>>> wrote:
>>> >
>>> > ArrayList, HashMap, and other collections have internal Object[]'s to
>>> store the elements. The Object[]'s are lazily initialized to minimize heap
>>> usage. However, there is still quite a bit of waste in these Object[]'s.
>>> Some collections are oversized from the beginning. Some collections are
>>> filled and then partially or fully emptied. For an average heap dump,
>>> 10.1% of the heap is wasted on oversized Object[]'s.
>>> >
>>> > These Object[]'s can be trimmed during GC's move operation but only
>>> when the process transitions to idle. Since the trimming only happens when
>>> the application transitions to idle, then the extra CPU and/or pause time
>>> isn't a concern. The performance penalty to the application to resize
>>> larger won't happen for a while since the process is idle. With the
>>> reduction of heap usage more idle processes can run on a single VM (i.e.,
>>> the JVM will be more Docker Container friendly).
>>> >
>>> > The trimming may apply to other collections such as ArrayDeque,
>>> BitSet, HashSet, IdentityHashMap, LinkedHashMap, Stack, TreeMap, TreeSet,
>>> WeakHashMap, Hashtable, and Vector. HashSet, IdentityHashMap,
>>> LinkedHashMap, TreeMap, and TreeSet may be implemented with HashMap so
>>> trimming HashMap will automatically make these other collections benefit.
>>> >
>>> > How did I come up with the 10.1% heap reduction? I analyzed 650 heap
>>> dumps and presented the findings in Java Memory Hogs. By carefully adding
>>> up the right numbers, I calculate that the heap reduction will be 10.1%.
>>> That's a big savings and it will impact the entire world of Java programs!
>>> >
>>> > I entertained the idea of having a setting to trim with every GC or
>>> when objects promote from young to old generation. One concern is that the
>>> application would take a performance penalty to enlarge some or all the
>>> Object[]'s. Another concern is that the extra CPU time and/or pause time
>>> will interfere with the application's performance.
>>> >
>>> > What do you think of the overall idea? Do you have ideas of how to
>>> trim more often than when the application transitions to idle?
>>>
>>>