[External] : Re: Trimming ArrayList and HashMap During GC Copy

Tue Feb 7 23:45:53 UTC 2023

> Sounds to me like checking out the direct approach of having a shrinking
policy when a collection removes elements, is still out there as the more
feasible solution. I really think we ought to try that first and see how it
pans out.

Agreed.  This isn't a GC thing.  Which mailing list or which person do I
talk to?

> As far as calling Java code directly from GC threads goes, well we are
not technically equipped to run Java code from GC threads, so that is out
of the question. Could we chain together all collections in a linked list
and hand off to some other concurrent thread that calls this method?
> a) It seems counter intuitive to add fields to the collection as part of
a memory optimization when the library solution doesn’t seemingly need it.

Agreed.  Adding fields would require testing a wide variety of applications
to see if there is a memory savings.

> b) Using VarHandle on the size field to fiddle with it using atomics
requires the size field to be declared volatile, not sure the JIT will like
consequences of that.

I forgot that VarHandle target fields had to volatile.  No big deal.  All
normal accesses to the size field can use VarHandle.get() and set().  After
JIT does its optimization, this boils down to normal load or store
operations.

> c) The link for the linked list can keep collections alive artificially,
increasing the heap use instead of the other way around. So a policy for
clearing these links promptly becomes important to avoid heap bloat - the
very thing the optimization tries to fix. It isn’t clear from your
description how that would be dealt with.

If GC is only adding collections to the linked list when the collection is
promoted from young to old or when the old collection is walked, then
collections in the linked list are most likely going stick around for a
while.  So, there is a small chance that an instance could be garbage but
the linked list is holding it.  However, this is likely to be very small.
One would likely have to write a specially crafted program to do such a
thing.

I envision this to work like ReferenceQueue.  A dedicated Java thread waits
for new collections to show up.  It then executes markToTrim() on all the
collections and goes back to waiting.  markToTrim() is a very fast
operation.  I estimate it will take about 172 cycles to pop a node off the
linked list and execute markToTrim().  This number includes 100% cache
misses on the initial access to node, collection and Object[].  On a 2.7
GHz server processor, that will take 64 ns per collection.

A typical heap dump has 894,000 collections that need to be trimmed.  I
don't have any data on how many of these are in the old generation.  Even
so, old GC probably doesn't traverse the entire old generation with each
old GC.  So, the number of collections that would be flagged is much much
smaller.  I am going to guess about 20,000 collections get flagged per old
GC.

It will take 1.2 ms to process 20,000 collections.  The time between old GC
is much longer than that.  Thus, the linked list will be empty before the
next old GC kicks in.

Let's say old GC does traverse the entire heap, then it will take 57 ms to
process all 894,000 collections.  Again, the time between old GC is much
longer than that.  Thus, the linked list will be empty before the next old
GC kicks in.

The above assumes that flagging happens with every old GC.  I am proposing
that flagging happens when the program transitions to idle.  In which case,
the next old GC is not going to happen for a very long time.

> d) Long linked lists gives GC a lot of indigestion in general and
performance of walking such lists is typically terrible.

Good point.  Linked lists require pointer chasing which is very slow.  As
shown above, the linked list of flagged collections is going to be empty.
GC won't have to traverse it.

> e) The is isTrimmable() method would have to be called from a different
thread from the one that is using the data structure. These collections are
not written to be concurrency-safe, so it isn’t clear how that would pan
out concretely.

markToTrim() is called by a background Java thread similar to a
ReferenceQueue thread.  markToTrim() needs to be concurrency-safe and I may
have shown it to be so.  I added isTrimmable() to make the discussion
easier.  This method won't exist because the programmer will inline the
logic into markToTrim().

> As far as a separate thread continuously walking the entire heap to flag
things that can shrink goes, well I would be rather surprised if that would
use less CPU than simply shrinking directly when an object removes an
element. Walking the entire heap is expensive and not even the GC wants to
do that. You would come a long way just taking that CPU quota and doing
more GC to keep the size of the heap down.

Agreed.  This is why I think GC should identify the collections.  I am not
asking GC to traverse the entire heap.  I am simply asking GC to identify
the collections as GC interacts walks over them.

> As the actual shrinking is triggered not by flagging, but by subsequent
*use* of a collection, it might be that when you go into the “idle” mode
and the program is literally doing nothing, it won’t use all these
collections that we want to shrink. So you end up wasting all this memory
anyway during the idle mode, while the library solution would ensure that
the data structures have already shrunk when you get to that idle state,
and we just need to GC and be done with it.

Agreed.  I realized that the flagging and trimming would be an eventual
trimming.  The library solution would be better from that point of view.  I
am (falsely?) concerned that the library solution would impact performance
too much.  If so, the eventual trimming will be a better solution.

> In summary, it seems rather feasible and straight forward to implement a
shrinking policy in the collections if we need to keep the footprint down.
I don’t so far see any reason why this ought to be solved with JVM magic
and special GC code instead, when there is no obvious need for it, and it
isn’t clear what it achieves compared to the less magical baseline solution
of having a shrinking policy in place. Before thinking about further JVM
magic that we can use to solve this, I’d really like to see the baseline
library solution evaluated first, so we can see if there is any further
problem to solve or not. So does it make sense to try that out first,
before further discussing crazy JVM centric solutions?

Agreed.  Let's try the shrinking policy approach first.  It will be easier
to implement.  As I said at the beginning of this reply, which mailing list
or person to I reach out to?

On Tue, Feb 7, 2023 at 9:01 AM Erik Osterlund <erik.osterlund at oracle.com>
wrote:

> Hi Nathan,
>
> Sounds to me like checking out the direct approach of having a shrinking
> policy when a collection removes elements, is still out there as the more
> feasible solution. I really think we ought to try that first and see how it
> pans out.
>
> As far as calling Java code directly from GC threads goes, well we are not
> technically equipped to run Java code from GC threads, so that is out of
> the question. Could we chain together all collections in a linked list and
> hand off to some other concurrent thread that calls this method? Well
> a) It seems counter intuitive to add fields to the collection as part of a
> memory optimization when the library solution doesn’t seemingly need it.
> b) Using VarHandle on the size field to fiddle with it using atomics
> requires the size field to be declared volatile, not sure the JIT will like
> consequences of that.
> c) The link for the linked list can keep collections alive artificially,
> increasing the heap use instead of the other way around. So a policy for
> clearing these links promptly becomes important to avoid heap bloat - the
> very thing the optimization tries to fix. It isn’t clear from your
> description how that would be dealt with.
> d) Long linked lists gives GC a lot of indigestion in general and
> performance of walking such lists is typically terrible.
> e) The is isTrimmable() method would have to be called from a different
> thread from the one that is using the data structure. These collections are
> not written to be concurrency-safe, so it isn’t clear how that would pan
> out concretely.
>
> As far as a separate thread continuously walking the entire heap to flag
> things that can shrink goes, well I would be rather surprised if that would
> use less CPU than simply shrinking directly when an object removes an
> element. Walking the entire heap is expensive and not even the GC wants to
> do that. You would come a long way just taking that CPU quota and doing
> more GC to keep the size of the heap down.
>
> As the actual shrinking is triggered not by flagging, but by subsequent
> *use* of a collection, it might be that when you go into the “idle” mode
> and the program is literally doing nothing, it won’t use all these
> collections that we want to shrink. So you end up wasting all this memory
> anyway during the idle mode, while the library solution would ensure that
> the data structures have already shrunk when you get to that idle state,
> and we just need to GC and be done with it.
>
> In summary, it seems rather feasible and straight forward to implement a
> shrinking policy in the collections if we need to keep the footprint down.
> I don’t so far see any reason why this ought to be solved with JVM magic
> and special GC code instead, when there is no obvious need for it, and it
> isn’t clear what it achieves compared to the less magical baseline solution
> of having a shrinking policy in place. Before thinking about further JVM
> magic that we can use to solve this, I’d really like to see the baseline
> library solution evaluated first, so we can see if there is any further
> problem to solve or not. So does it make sense to try that out first,
> before further discussing crazy JVM centric solutions?
>
> /Erik
>
> On 2 Feb 2023, at 22:42, Nathan Reynolds <numeralnathan at gmail.com> wrote:
>
> *Trim During Relocating*
>
> > If the bound used in the loop is a small constant, let’s say 4, then the
> JIT can also unroll the body of the loop 4 times so that the generated
> machine code doesn’t even have a loop any more, like this:
>
> I thought about loop unrolling, but left it out.  Thank you for including
> it because it helped me realize some more complexity.
>
> I still don't have solutions to dealing with JIT optimizations and loading
> elements from the array.
>
> *Shrink-at-Threshold Policy*
>
> > Copying tends to be faster than people think. Even when their mind set
> is that it’s cheap, it still tends to surprise. And if I recall your data
> correctly, these things tend to be rather small, no? ;-)
> > So I’d really try out the shrink-at-threshold policy, using a similar
> approach to the growing strategy, and run some programs using that, before
> drawing any conclusions about how it might perform. It just might be that
> it will work just fine. And if not, perhaps the policy could be refined to
> be more clever.
>
> Good point.  We are scared about a potentially viable solution simply
> because it might perform poorly.  Fortunately, there are some really good
> performance tests that can verify if the shrink-at-threshold policy is
> going to have any impact.  For example, SPECjbb and SPECjEnterprise could
> show that most programs will be fine.
>
> There probably are programs with collection(s) that shouldn't shrink, then
> each collection instance can provide a configuration option to tune or
> disable the shrink-at-threshold policy.  We also need to provide a
> configuration option to turn off the policy altogether so that some
> programs can have decent performance until they can individually configure
> the problematic collections.  Maybe the shrink-at-threshold policy can be
> smarter to backoff if the array is growing and shrinking too often.  If
> this were the case, then all (?) programs would be fine.
>
> *Flagging*
>
> This approach is a gradual approach and perhaps doesn't involve GC at
> all.  Each collection opts in with a new method called private void
> markToTrim().  To avoid collisions with already existing methods of the
> same signature, this method must be annotated with a new annotation called
> TrimMarkable.  A flagging mechanism (GC or something else) can call this
> method.
>
> This solution has an advantage over the Trim During Relocation solution.
> Other library collections can implement markToTrim().  This solution isn't
> only for the JDK collections.  markToTrim() makes life easier for the
> flagging mechanism since each collection takes care of its own
> implementation of trim marking.  This makes each collection responsible for
> its own behavior and does not tie the flagging mechanism to the collection
> except through markToTrim() specification.
>
> One implementation of markToTrim() could use the sign bit in the
> collection's size field.  When the sign bit is set, that means the
> collection should be trimmed.  Later, when a program's thread uses the
> collection, the program's thread wipes out the negative sign and calls
> trimToSize().  Here is a possible implementation of markToTrim().
>
> private static final VarHandle SIZE_HANDLE;   // static initializer
> connects this to the collection's size field
>
> private void markToTrim()
> {
>    int expect;
>
>    expect = SIZE_HANDLE.getOpaque(this);
>
>    if (size > 0)           // Don't call size().  See size() later.
>        if (isTrimmable())
>       {
>          SIZE_HANDLE.compareAndSet(this, expect, -expect);
>       }
> }
>
> getOpaque() ensures that JIT won't cache the size field's value in a
> register or on the stack.
>
> The "if (size > 0)" ensures that the sign bit is not already set.  If so,
> there is not point in changing it.
>
> The compareAndSet() ensures that any changes by the Java thread(s) are not
> lost.  This allows markToTrim() to run concurrently with the Java threads.
> However, the result of the compareAndSet() operation could be lost if the
> Java thread reads the size field before markToTrim() and then set the size
> field after markToTrim().  Losing the result of the compareAndSet()
> operation is actually desirable.  This means a Java thread is changing the
> size of the collection and hence trimming would probably be a poor choice.
> Furthermore, looping to ensure the size sign flag is changed is not
> desirable.
>
> isTrimmable() ensures that markToTrim() doesn't waste time setting the
> size sign bit for a collection that is already as compact as possible.
> This also eliminates branch mispredictions (see size() later) dealing with
> the sign bit in the size field if there is nothing to do.  isTrimmable()
> would look like this for ArrayList.
>
> private boolean isTrimmable()
> {
>    return size < elementData.length;     // Don't call size()
> }
>
> isTrimmable() for HashMap would look like this.  This is because
> elementData.length is a power of 2 and it can't be trimmed unless all the
> elements can fit in half the table size times the load factor.
>
> private boolean isTrimmable()
> {
>    return size < elementData.length / 2 * loadFactor;   // Don't call
> size()
> }
>
> isTrimmable() is a separate method to make it clear that markToTrim() is
> fairly generic.  The real implementation probably wouldn't bother having
> isTrimmable()
>
> The collection's size() needs to change to the following.
>
> public int size()
> {
>    int result;
>
>    result = size;
>
>    return result >= 0 ? result : trimThenSize();
> }
>
> private int trimThenSize()
> {
>   int result;
>
>   result = Math.abs(size);
>   size   = result;
>
>   trimToSize();
>
>   return result;
> }
>
> trimToSize() is another method that the collection implements.  To see an
> example, see ArrayList.trimToSize().
>
> size() simply returns the size field value if the sign bit is not set.  It
> calls trimThenSize() which does the heavy lifting of fixing the size field
> and doing the trim.  This allows JIT to inline size() but not inline
> trimThenSize() for better performance.
>
> Copying the size field to the local variable result ensures that any
> concurrent changes to the size field from trimToSize() cannot cause a
> problem.  We only need to make sure that JIT doesn't get rid of the result
> local variable and instead reloads the size field.  A mild opaque fence
> would do this; however, this is probably overkill since JIT is unlikely to
> reload the size field.  In other words, we need a fence opposite of opaque
> that forces JIT to not reload the size field.
>
> Ignoring the overhead of calling a method and returning a value, the
> original implementation of size() only has a memory load instruction to get
> the size field's value for a total cost of 3 cycles (on Intel x86).  The
> new implementation has the same memory load, a comparison, and a branch.
> This will add a total cost of 2 cycles and 3 bytes (on Intel x86).  For a
> loop, JIT will probably hoist the call to size() so the additional 2 cycles
> is insignificant compared to the rest of the loop's execution.  We can
> ignore branch misprediction penalties since the sign bit is rarely set.  I
> doubt anyone will notice the performance impact of any of these changes.
>
> The rest of the collection's code needs to call size() and not directly
> read the size field.  size() will be used in all (?) of the collection's
> methods.  This makes size() the ideal place to consider trimming the
> collection.
>
> *Flagging Mechanism*
>
> The flagging mechanism runs in Java code.  However, it needs a list of
> collections to flag.  This is similar to ReferenceQueue, Cleaner, String
> deduplication, and the dreaded finalize.
>
> I only know of 1 entity that has that information and that is GC.  If
> there is some pauseless heap walking API in the JVM, then some thread in
> the JVM can walk the heap and as it finds a collection, it tells the Java
> thread about the collection.
>
> We still want the 2 modes: off and idle.  Here are some other modes we may
> want to entertain.
>
> Idle mode: Something determines that the application has gone idle and
> kicks in a full GC and return memory back to the OS.  That same thing could
> kick off a heap walk to find the collections.  However, since GC is going
> to walk the entire heap already, it would make sense to have GC provide the
> list of collections to flag to the flagging mechanism.
>
> Every GC mode: Instead of an always mode, we may want an every GC mode.
> Something can monitor for GCs and then start the heap walking, GC could
> kick off the heap walking, or we can piggy back off of GC and when it
> discovers a collection, GC lets the flagging mechanism know.
>
> Promotion mode: Whenever GC promotes a collection from young to old, it
> informs the flagging mechanism of the collection.  GC only tells the
> flagging mechanism about the collection.  It doesn't do anything more.
>
> Relocation mode: Whenever GC relocates a collection, it informs the
> flagging mechanism of the collection.  Again, GC only tells the flagging
> mechanism about the collection.  It doesn't do anything more.
>
> We may want some combination of idle, promotion, and relocation.
>
> Thank you for reading this and my previous emails.  Please ask
> clarification questions, poke holes, etc.
>
> On Tue, Jan 31, 2023 at 10:11 PM Erik Osterlund <erik.osterlund at oracle.com>
> wrote:
>
>>
>>
>> On 31 Jan 2023, at 22:52, Nathan Reynolds <numeralnathan at gmail.com>
>> wrote:
>>
>> 
>> Let's see if I understand by writing a concrete example.  Here is some
>> pseudo ArrayList.get() code.  The Object[] access bounds checking are
>> explicitly written to make it easier to discuss.
>>
>> public E get(int index)
>> {
>>    if ((index < 0) || (index >= size))
>>    {
>>       throw new IndexOutOfBoundsException();
>>    }
>>
>>    if ((index < 0) || (index >= elementData.length))   // Array bounds
>> checking
>>    {
>>       throw new ArrayIndexOutOfBoundsException();
>>    }
>>
>>    return (E) elementData[index];
>> }
>>
>> Let's say we have the following loop.
>>
>> int index;
>>
>> for (index = 0; index < myList.size(); index++)
>> {
>>    System.out.println(myList.get(index));
>> }
>>
>> Here is the loop code with ArrayList.size() and ArrayList.get() inlined.
>>
>> int index;
>>
>> for (index = 0; index < myList.size; index++)   // Note: size() was
>> replaced with the field size
>> {
>>    if ((index < 0) || (index >= myList.size))
>>    {
>>       throw new IndexOutOfBoundsException();
>>    }
>>
>>    if ((index < 0) || (index >= myList.elementData.length))   // Array
>> bounds checking
>>    {
>>       throw new ArrayIndexOutOfBoundsException();
>>    }
>>
>>    System.out.println((E) myList.elementData[index]);
>> }
>>
>> I am no JIT expert and what I say may be wrong...
>>
>>    - JIT can get rid of both index < 0 since JIT can prove that index is
>>    always >= 0.
>>    - JIT can get rid of index >= myList.size condition since JIT can
>>    prove that index must always be < size due to the for loop's condition.
>>    - JIT can store myList.size in a local variable.  This makes the
>>    local variable constant.
>>    - JIT can store myList.elementData in a local variable.  This makes
>>    the local variable constant.
>>    - With both size and elementData in local variables, this allows JIT
>>    to hoist the final array-bound checking condition out of the loop.
>>
>>
>> Object elementData[];
>> int size, index;
>>
>> size        = myList.size;
>> elementData = myList.elementData;
>>
>> if (size >= elementData.length)      // Array bounds checking
>> {
>>    // TODO Execute the code to print indexes 0 to elementData.length - 1
>>
>>    throw new ArrayIndexOutOfBoundsException();
>> }
>>
>> for (index = 0; index < size; index++)
>> {
>>    System.out.println((E) elementData[index]);
>> }
>>
>> For well behaving Java code, ArrayList.size <= elementData.length.  Thus,
>> when GC trims elementData[], the new array will still be long enough.  The
>> JVM won't crash.
>>
>> For poorly behaving code, Thread A could be executing the above loop.
>> Thread B could decrement ArrayList.size.  Thus, when GC trims
>> elementData[], the new array won't be long enough.  Thread A walks beyond
>> the end of the array and will corrupt or more likely crash the JVM.  Is
>> this what you are pointing out?
>>
>>
>> Yes, indeed. If the bound used in the loop is a small constant, let’s say
>> 4, then the JIT can also unroll the body of the loop 4 times so that the
>> generated machine code doesn’t even have a loop any more, like this:
>>
>> Object elementData[];
>> int size;
>>
>> size        = myList.size;
>> elementData = myList.elementData;
>>
>> if (size < 4 || elementData.length < 4)
>> {
>>    throw new ArrayIndexOutOfBoundsException();
>> }
>>
>> System.out.println(elementData[0]);
>> System.out.println(elementData[1]);
>> System.out.println(elementData[2]);
>> System.out.println(elementData[3]);
>>
>> For HashMap, the problem is more difficult to solve.  Let's say a thread
>> is iterating over the table and nodes.  If the node change bucket
>> locations, then the thread won't iterate over all the nodes.
>>
>>
>> Right.
>>
>> I need some time to think about these problems and come up with a
>> solution.  I figured I would get this email out there to take a step
>> forward while I think.
>>
>>
>> Okay.
>>
>> > You say taming growth isn’t the issue. So is it the lack of a shrinking
>> policy when removing elements perhaps? Whatever it is, I think it’s best
>> fixed explicitly at the library level, and not by hacking the entire JVM
>> around this issue.
>>
>> Yes, the problem is the lack of a shrinking policy when removing elements.
>>
>> We could trim the Object[] with every remove.  This would provide the
>> maximum memory savings, but really hurt performance if the ArrayList is
>> bouncing around in size.
>>
>> We could trim the Object[] when a lower threshold is reached.  This would
>> provide some memory savings and somewhat hurt performance if the ArrayList
>> is bouncing around in size between the lower threshold and the Object[]
>> length.
>>
>>
>> Copying tends to be faster than people think. Even when their mind set is
>> that it’s cheap, it still tends to surprise. And if I recall your data
>> correctly, these things tend to be rather small, no? ;-)
>>
>> So I’d really try out the shrink-at-threshold policy, using a similar
>> approach to the growing strategy, and run some programs using that, before
>> drawing any conclusions about how it might perform. It just might be that
>> it will work just fine. And if not, perhaps the policy could be refined to
>> be more clever.
>>
>> /Erik
>>
>> This is why I suggest we only trim when the program transitions to idle.
>> The trimming will achieve maximum memory savings and only hurt performance
>> when the program wakes up.  When it does wake up, many collections won't
>> change size and so there won't be any impact.  The collections that do
>> change size will experience a negative impact.  Servers usually wake up
>> slowly so the impact will be amortized over time.  Desktop programs do wake
>> up quickly.  However, the user probably won't notice.  The user will
>> probably like that the program is consuming the minimal amount of memory
>> while it idles.
>>
>> On Sat, Jan 28, 2023 at 12:06 AM Erik Osterlund <
>> erik.osterlund at oracle.com> wrote:
>>
>>> Hi Nathan,
>>>
>>> Unfortunately, that section did not cover it. That section talks about
>>> add, and source code level range checks in the code. I’m talking about get
>>> and JVM level range checks that happen on array loads. While the array list
>>> level source code range checks are interesting in their own right, they can
>>> be fiddled with when for example racingly using the ArrayList from
>>> different threads, or for example using reflection, class redefinition or
>>> instrumentation on the ArrayList. While that seems dumb, we can never allow
>>> the JVM to crash, even when processing dumb java code. That’s something
>>> that we gladly would walk through fire to ensure if we have to, and will
>>> never compromise on. Therefore the integrity of array accesses have to
>>> stand on their own feet, and not rely on some array list properties that
>>> dumb code violates.
>>>
>>> To ensure that, when you read elements[index], the JVM has to emit a
>>> JVM-internal range check to prevent erroneous code from accessing data
>>> outside of the array. So to be crystal clear - there is a separate source
>>> code level bounds check in the ArrayList against the size - that is not the
>>> one I am referring to. That one is *not* enough to guarantee the integrity
>>> of the array isn’t violated, and hence that we don’t crash the JVM.
>>>
>>> My main point was that taming the JVM internal range check of the array
>>> accesses, will likely come at a big cost. Both engineering wise and
>>> performance wise. The JIT believes the array length is truly immutable
>>> (because it is), and hence doesn’t believe any form of fencing will be used
>>> to guard updates to this field; it only applies to mutable fields, and the
>>> array length simply isn’t mutable, and is treated as such. This assumption
>>> is ingrained in the very reasoning of JVM internal range check and counted
>>> loop optimizations (loops that run a bounded number of times, usually
>>> driving performance optimizations). Teaching the JIT that the length field
>>> of arrays is in fact mutable, would ruin many such optimizations, which are
>>> crucial for performance. That’s what I’m saying is a hard one to sell.
>>>
>>> You say the benefit of solving this problem in the GC is that
>>> synchronization is free. But it isn’t free though when we have to tame the
>>> JIT and break loop optimizations to make it work. It seems to me like
>>> buying a free Mc Donalds lunch voucher for $1000.
>>>
>>> I think you are on to something in the problem domain, I just think the
>>> solution domain isn’t focused where it ought to be. If these data
>>> structures got bloated in an unwanted way, I think it would help to
>>> identify how they got there, and then try to improve the sizing heuristics
>>> in the libraries instead. You say taming growth isn’t the issue. So is it
>>> the lack of a shrinking policy when removing elements perhaps? Whatever it
>>> is, I think it’s best fixed explicitly at the library level, and not by
>>> hacking the entire JVM around this issue.
>>>
>>> /Erik
>>>
>>> On 28 Jan 2023, at 00:16, Nathan Reynolds <numeralnathan at gmail.com>
>>> wrote:
>>>
>>> 
>>> > But first of all, you still didn’t address my concern that out of
>>> bounds accesses into ArrayList may sometimes crash the JVM, instead of just
>>> throwing innocent exceptions. The aaload does a range check to know if it
>>> should throw, that gets aggressively moved as early as possible.
>>>
>>> I thought I covered your concern in the section "Loop Hoisting
>>> Performance Concern".  I covered hoisting the if statement and range/bounds
>>> checking.  If you still think I haven't covered your concern please
>>> elaborate.
>>>
>>> On Thu, Jan 26, 2023 at 1:04 PM Erik Osterlund <
>>> erik.osterlund at oracle.com> wrote:
>>>
>>>> There is a lot to say here. But first of all, you still didn’t address
>>>> my concern that out of bounds accesses into ArrayList may sometimes crash
>>>> the JVM, instead of just throwing innocent exceptions. The aaload does a
>>>> range check to know if it should throw, that gets aggressively moved as
>>>> early as possible. Destroying array range check optimizations is going to
>>>> be a difficult sell as they are crucial for good performance, and with
>>>> those optimizations in place, shrinking arrays is not valid, as it allows
>>>> accesses outside the bounds of the array.
>>>>
>>>> /Erik
>>>>
>>>> On 26 Jan 2023, at 20:55, Nathan Reynolds <numeralnathan at gmail.com>
>>>> wrote:
>>>>
>>>> 
>>>> > Perhaps the reason for the power of two sizing there, is because then
>>>> going from a hashCode to a bucket can be done quickly using bitwise and,
>>>> instead of expensive modulo. I suspect that changing that policy would come
>>>> at a significant CPU overhead. Also, when elements are scattered randomly
>>>> in the backing array, if you don’t rehash the table, you would only be able
>>>> to remove a couple of trailing nulls at the end of the array, while many
>>>> would remain in the middle between the elements. And rehashing might run
>>>> into collisions which requires allocating tree nodes - something we can’t
>>>> do from a GC context.
>>>>
>>>> You are correct.  Changing away from a power of 2 table size is going
>>>> to be a hard sell.  Perhaps there are workloads where the memory savings is
>>>> worth it, but I don't have any data to support or reject such a proposal.
>>>> I consider the non-power of 2 table sizes outside my proposal.  I am only
>>>> proposing trimming from one larger power of 2 to a smaller power of 2 as
>>>> long as the HashMap's load factor is adhered to.
>>>>
>>>> > Having said that, perhaps if a HashTable is too sparse, a smaller
>>>> default size of the table and more dense load factor, would save some
>>>> memory.
>>>>
>>>> A smaller default table size will save 0.5% heap space for the average
>>>> program.  We would need tests to make sure a smaller default size isn't
>>>> going to cause a problem.
>>>>
>>>> As for a more dense load factor, this drastically impacts the
>>>> performance of HashMap.  Perhaps some workloads can benefit from this.  We
>>>> could make the default loading factor a System property.  On the other
>>>> hand, if someone is going to set the System property, then they probably
>>>> already thought about setting the loading factor in the HashMap
>>>> constructor.  We would need more data to be able to figure this out.
>>>>
>>>> *Pivot*
>>>>
>>>> Okay.  I am pivoting the design a bit in order to address some
>>>> concerns.  I am still proposing off, idle, and always modes.  I still have
>>>> the caveat that the always mode may impact the Java application performance
>>>> too much.
>>>>
>>>> For Serial and Parallel GC, we can trim ArrayLists and HashMaps at any
>>>> time while the Java application is paused.
>>>>
>>>> For G1, the application is paused during the evacuation phase.  We can
>>>> trim during the evacuation since the Java application is paused.
>>>>
>>>> For ZGC, we can trim during the relocation operation.
>>>>
>>>> Here's what I understand about how ZGC works.
>>>>
>>>>    1. Right before relocation, the good color for pointers is changed
>>>>    to remap.
>>>>    2. There is a race between multiple Java and (multiple?) GC threads
>>>>    to change all the pointers to the remap pointer.
>>>>    3. The racing threads check if the pointer is destined for
>>>>    relocation.
>>>>       1. If not, the threads change the color of the pointer and
>>>>       return.
>>>>       4. The threads check if the forwarding table already has a new
>>>>    address.
>>>>       1. If so, the threads change the pointer to the new address and
>>>>       return.
>>>>       5. All threads allocate their own new space.
>>>>    6. All threads copy the object's contents to their own new space.
>>>>    7. The threads use CAS to attempt to insert their copy's address
>>>>    into the forwarding table.
>>>>    8. The winner's copy becomes the official copy of the object.
>>>>    9. The losers deallocate their own new space and use the official
>>>>    address for the object.
>>>>
>>>> At some point, ZGC selects which objects will be relocated.  ZGC then
>>>> executes step #1.  At the time step #1 happens, all objects become
>>>> temporarily immutable.  For objects that will stay in-place, the Java and
>>>> GC threads detect this and simply change the pointers' color.  The object
>>>> is no longer immutable.  For objects that will move, the Java and GC
>>>> threads copy from the old location into a new location(s).  The old
>>>> location is effectively immutable because the Java threads must copy the
>>>> object before any mutation can happen.
>>>>
>>>> Here's what needs to change to support trimming.  This only happens if
>>>> trimming is enabled for this GC collection.  If idle mode is selected, then
>>>> most GC collections won't have trimming enabled.
>>>>
>>>>    1. No change
>>>>    2. No change
>>>>    3. No change
>>>>    4. No change
>>>>    5. All threads allocate their own new space for the ArrayList and
>>>>    its trimmed Object[].
>>>>    6. All threads copy the ArrayList's contents, set the Object[]'s
>>>>    length in the new location to the ArrayList's size, and copy the Object[]'s
>>>>    elements up to size.
>>>>    7. No change
>>>>    8. No change
>>>>    9. No change
>>>>
>>>> When an object must be relocated, none of the threads can modify the
>>>> old location.  The existing and new relocation algorithms don't allow for
>>>> it.  This means the relocation and trimming does not have a race condition
>>>> even though threads are competing to see which can do the copy first.  For
>>>> example, let's say Thread A is on step #5, #6, or #7.  (The other steps
>>>> don't matter.)  Let's say Thread B wins the race and then executes
>>>> ArrayList's Java code and modifies the size, content, grow(), and/or
>>>> shrink.  This modification happens in the new location.  Thread A won't see
>>>> it because Thread A only looks at the old location which the algorithm does
>>>> not allow to be modified.  Thread A will detect it lost the race and throw
>>>> away its *stale* new copy.  Data integrity is preserved.
>>>>
>>>> As for HashMap, here are steps #5 and #6.
>>>>
>>>> 5. All threads allocate their own new space for the HashMap, its new
>>>> power-of-2-length Object[], and all the HashMap Nodes.
>>>> 6. All threads copy the HashMap's contents and the HashMap Nodes' key,
>>>> value and hash into their new location (and not the next pointer).  All
>>>> threads rebuild their own copy of the hashtable.  The threads only use the
>>>> Nodes' hash value and hence don't need to compute hashes, check for
>>>> equality, or execute Java code.
>>>>
>>>> Don't we need to check if two Nodes have equal keys?  No.  The equality
>>>> check happened when the keys were put into the HashMap in Java code.  They
>>>> weren't equal then and so must remain as 2 different Nodes.
>>>>
>>>> *Java Code Changes*
>>>>
>>>> All the GC algorithms require some help from ArrayList and HashMap
>>>> code.  Let's consider this pseudo code for ArrayList.add().
>>>>
>>>> public boolean add(E item)
>>>> {
>>>>    int index, length;
>>>>
>>>>    // Potential safepoint #1
>>>>
>>>>    index = size++;
>>>>
>>>>    // Potential safepoint #2
>>>>    // TODO Add fence
>>>>    // Potential safepoint #3
>>>>
>>>>    length = elementData.length;
>>>>
>>>>    // Potential safepoint #4
>>>>
>>>>    if (size > length)
>>>>    {
>>>>       // Potential safepoint #5
>>>>       grow();
>>>>       // Potential safepoint #6
>>>>    }
>>>>
>>>>    // Potential safepoint #7
>>>>
>>>>    elementData[index] = item;
>>>>
>>>>    // Potential safepoint #8
>>>>
>>>>    return true;
>>>> }
>>>>
>>>> In order to ensure this code works with GC's trimming, we need to add a
>>>> fence between incrementing the size field and reading elementData.length.
>>>> This fence ensures that the trimming code sees the need for at least the
>>>> new size of elements in the new Object[].  The fence prevents JIT from
>>>> reordering the store to the size field after reading elementData's length.
>>>>
>>>> We only need an opaque fence.  We don't need a full or acquire fence
>>>> that impacts CPU behavior and hence performance.  GC pauses ensure adequate
>>>> CPU fencing.  JIT can put a safepoint at any of the potential safepoint
>>>> places and it will work.  Of course, potential safepoints #1 and #8 can
>>>> also be located outside add().  Potential safepoints #5 and #6 can be
>>>> located anywhere inside grow().
>>>>
>>>> Potential safepoint #1: Sadly, the trim operation will see the smaller
>>>> size and change elementData's length to match the current size.  When the
>>>> Java thread wakes up, the Java thread will call grow().  Since trimming
>>>> only happens when the program is transitioning to idle, the extra CPU time
>>>> to trim and regrow is inconsequential.  (This scenario begs the question if
>>>> the idle transition logic is correct because there is an active Java
>>>> thread.)
>>>>
>>>> Potential safepoint #2: The trim operation will see the larger size and
>>>> change elementData's length to match the larger size.  When the Java thread
>>>> wakes up, the Java thread will see that the elementData's length has room
>>>> for 1 more element.  The Java thread will not call grow() and simply
>>>> execute elementData[index] = item.
>>>>
>>>> Potential safepoint #3: Same as #2
>>>>
>>>> Potential safepoint #4: The trim operation will see the larger size and
>>>> change elementData's length to match the larger size.  When the Java thread
>>>> wakes up, the length local variable will not match elementData's length.
>>>> This is okay.  Here are 2 scenarios.  1) The length local variable could be
>>>> ≤ than size.  Sadly, the Java thread will call grow() obviating the trim
>>>> operation's work.  2) The length local variable could be > than size.  The
>>>> Java code won't call grow() and will execute elementData[index] = item.
>>>> Setting the item parameter into the array will work because the trim
>>>> operation left space at index.
>>>>
>>>> Potential safepoint #5: The trim operation will see the larger size and
>>>> change elementData's length to match the larger size.  Sadly, when the Java
>>>> thread wakes up, it will call grow() obviating the trim operation's work.
>>>>
>>>> Potential safepoint #6: The trim operation will see the larger size and
>>>> change elementData's length to match the larger size.  When the Java thread
>>>> wakes up, it will execute elementData[index] = item.  Setting the item
>>>> parameter into the array will work because the trim operation left space at
>>>> index.
>>>>
>>>> Potential safepoint #7: Same as #6.
>>>>
>>>> Potential safepoint #8: The Java thread finished adding the item
>>>> parameter to elementData before the trim operation.  Thus, the trim
>>>> operation will make elementData's length == size.
>>>>
>>>> For remove operations, the size change must come after
>>>> elementData[index] = null.  We need another opaque fence to ensure size
>>>> changes after elementData[index] = null.
>>>>
>>>> The same logic applies for HashMap, size field must be incremented
>>>> before checking if the hashtable needs to enlarge.  The size field must be
>>>> decremented after removing a node from the hashtable.
>>>>
>>>> *Loop Hoisting Performance Concern*
>>>>
>>>> Let's say JIT inlines add() into a loop where the loop will add N
>>>> elements.  To hoist the if statement (and hence bounds check) out of the
>>>> loop, JIT would have to hoist size += N outside the loop along with the if
>>>> statement and the grow() call.  The opaque fence prevents JIT from hoisting
>>>> anything.  We can make JIT handle this special case and hoist the opaque
>>>> fence along with the rest of the hoisted logic.  No performance is lost.
>>>>
>>>> This violates the meaning of an opaque fence.  So, either we allow the
>>>> violation in this special case because the code still works or we create a
>>>> new fence that is weaker than opaque.  We really only need the store to the
>>>> size field to happen in program order before loading elementData.length.
>>>> All other loads and stores can move around.
>>>>
>>>> Perhaps we can use VarHandle.setOpaque(size) and
>>>> VarHandle.getOpaque(elementData) instead of a fence.  I don't know the
>>>> minutia on how this works, but perhaps this change allows JIT to recognize
>>>> it can hoist.
>>>>
>>>> *Concurrent Collections*
>>>>
>>>> Concurrent collections don't need trimming to happen during GC since
>>>> the data structure already handles concurrent changes.  GC can simply
>>>> collect pointers to all the applicable concurrent collections (i.e.,
>>>> ConcurrentHashMap, and CopyOnWriteArrayList).  We have a few options on
>>>> what kind of thread can do the trimming.  1) Using the fork-join pool of
>>>> threads would take advantage of all the cores but may interfere with the
>>>> program.  2) A single dedicated Java thread means the trimming won't be
>>>> able to take advantage of multiple cores yet the impact will be minimal to
>>>> the program.  I know that thread priority is rarely used, but allowing a
>>>> configuration option to set this thread's priority to idle would ensure
>>>> trimming never takes CPU from the program.  3) Loom threads would take
>>>> advantage of all the cores but will take CPU time away from the program
>>>> since trimming is largely (completely?) a non-blocking operation.  A
>>>> configuration option to set these threads' priority to idle or low would be
>>>> ideal.
>>>>
>>>> *Notes*
>>>>
>>>> For ArrayList, the extra trimming work is negligible and probably ends
>>>> up being a net zero or faster because fewer nulls have to be copied from
>>>> the old Object[] to the new Object[] location.
>>>>
>>>> For HashMap, the extra trimming work will probably have a minuscule
>>>> impact.  However, this minuscule impact may be replaced with better overall
>>>> program performance.  Years ago after String de-duplication was
>>>> implemented, I was surprised that reducing the memory usage of programs had
>>>> the benefit of improving response times.  So, the extra CPU spent to
>>>> de-duplicate is well spent in improving program performance.  In the
>>>> future, trim's CPU time will be compensated by better overall program
>>>> performance.
>>>>
>>>> We need to consider if young collections will be trimmed.  On one hand,
>>>> trimming all objects makes the ZGC relocation algorithm easier.  However,
>>>> the ZGC relocation algorithm only needs to check if the collection is old.
>>>> So, the algorithm only needs another if statement.  On the other hand, it
>>>> may be a waste of CPU time to trim young collections since most of the
>>>> young collections will be garbage soon.  However, if the program is
>>>> transitioning to idle, then trimming the young collections makes sense
>>>> since these collections could be around for a 4-day weekend.
>>>>
>>>> I am still only suggesting that we do trimming when the program
>>>> transitions to idle.  This way the extra CPU spent on trimming has an
>>>> inconsequential impact.  Potentially the always mode for trimming will be
>>>> fine for most workloads.  This will require performance testing.
>>>>
>>>> For always mode, perhaps we only trim old collections each GC cycle and
>>>> young collections when the program transitions to idle.  It would be
>>>> interesting to know how many old collections have large variances in size.
>>>> Queues probably swing wildly in size and so trimming wouldn't be a good
>>>> idea.  The data in caches probably change over time.  However, the cache
>>>> size is usually fixed and maybe constant.  If constant, trimming will be of
>>>> great benefit.  If we can determine which collections swing wildly in size,
>>>> then trimming can avoid those and we are one step closer to being able to
>>>> enable the always mode for all workloads.
>>>>
>>>> On Mon, Jan 23, 2023 at 10:20 PM Erik Osterlund <
>>>> erik.osterlund at oracle.com> wrote:
>>>>
>>>>>
>>>>>
>>>>> On 24 Jan 2023, at 02:33, Nathan Reynolds <numeralnathan at gmail.com>
>>>>> wrote:
>>>>>
>>>>> 
>>>>> > The array list code itself will probably blow up if for example you
>>>>> GC in the middle of let’s say the add method. It checks if we have
>>>>> capacity, then we GC draining that capacity, then add the object out of
>>>>> bounds. GC safepoints can happen anywhere in Java code, including in any
>>>>> method on the array list, which uses the length for various reasons,
>>>>> assuming it won’t change seemingly concurrently.
>>>>>
>>>>> You raise a valid point.  Thank you for catching this.  I wrote a
>>>>> response but came up with something better below.
>>>>>
>>>>> > Loads from an array can speculatively float above user control
>>>>> checks like the ArrayList internal range check. That’s fine and the user
>>>>> can’t observe the difference in behaviour. However, an aaload has its own
>>>>> range check against the actual array length. Loads into the array can not
>>>>> float above that array range check, because that could perform a load
>>>>> outside the heap boundary. Therefore, with your proposal, a read out of
>>>>> bounds of the array list would not only throw an exception, but
>>>>> occasionally also access outside the heap and hence crash the JVM, when
>>>>> accessed in a loop within which the capacity can shrink, rendering hoisted
>>>>> range checks invalid. That’s my concern here. And combating that would
>>>>> likely involve a huge deopt hammer triggered by GC.
>>>>>
>>>>> We can have GC only do the trimming operation when the only references
>>>>> to the ArrayList or HashMap are from other objects in the heap.  If the
>>>>> ArrayList or HashMap is referenced from a thread's stack or registers, then
>>>>> GC skips the ArrayList or HashMap.  Thus, a Java thread acquires a "lock"
>>>>> on the ArrayList or HashMap when it starts referencing the ArrayList or
>>>>> HashMap.  The Java thread releases a "lock" when it stops referencing the
>>>>> ArrayList or HashMap.  "lock" doesn't mean anything more than the thread
>>>>> having a reference to the ArrayList or HashMap.  It does not imply any
>>>>> synchronization mechanism that will cause a performance problem for the
>>>>> Java thread.
>>>>>
>>>>> Non-ZGC algorithms pause the application during the move operation.
>>>>> While the application is paused, GC can be sure that no Java threads are
>>>>> operating on the ArrayList or HashMap.
>>>>>
>>>>> For ZGC, a GC thread can move the ArrayList and its internal Object[]
>>>>> with trimming.  Whenever a GC thread is moving an object, Java threads that
>>>>> attempt to access the object already block and wait for the move to finish,
>>>>> if I understand correctly.  If the Java thread wants to access the
>>>>> ArrayList before a GC thread has moved it, the Java thread does the move
>>>>> operation.  Since the Java thread is executing code outside of the
>>>>> ArrayList, then the Java thread will do the trimming before the Java thread
>>>>> starts looking at Object[] bounds.
>>>>>
>>>>>
>>>>> Right. This takes us back to scanning all activation records in the
>>>>> system. As mentioned earlier, that tends to get rather expensive when you
>>>>> use virtual threads, and you can have activation records in the heap as
>>>>> well, which also similarly may have problematic references to array list
>>>>> internals. We have had to go over mechanisms that scan activation records
>>>>> in general, and rethink such algorithms to do something else, due to the
>>>>> scalability concerns.
>>>>>
>>>>> > Yes, I am talking about large object allocation. While we certainly
>>>>> could deal with it, my point was rather that it would be painful and
>>>>> require a bunch of special handling.
>>>>>
>>>>> The trimming can avoid large objects.  With that in mind, trimming
>>>>> large Object[] in ArrayList and HashMap is probably not worth the
>>>>> additional effort.  Large sparsely-used ArrayList doesn't even show up in
>>>>> the average heap dump.  Large sparsely-used HashMap only takes 0.4% of the
>>>>> heap.  Large sparsely-used ConcurrentHashMap only takes 0.1% of the heap.
>>>>> We can revisit trimming large sparsely-used ArrayLists and HashMaps when we
>>>>> are looking for heap reductions ≤ 0.5%. :)
>>>>>
>>>>>
>>>>> Fair enough.
>>>>>
>>>>>
>>>>> >> Perhaps this could be modelled as a library problem instead. Sounds
>>>>> like perhaps the growing strategy of the library is too aggressive for your
>>>>> preferences?
>>>>> > I was thinking of rather having a less aggressive growing strategy,
>>>>> or something along those lines.
>>>>>
>>>>> This won't help in HashMap's case.  The table sizes are powers of 2.
>>>>> The growth has to be a doubling.  HashMap is already growing as slowly as
>>>>> possible.
>>>>>
>>>>>
>>>>> Perhaps the reason for the power of two sizing there, is because then
>>>>> going from a hashCode to a bucket can be done quickly using bitwise and,
>>>>> instead of expensive modulo. I suspect that changing that policy would come
>>>>> at a significant CPU overhead. Also, when elements are scattered randomly
>>>>> in the backing array, if you don’t rehash the table, you would only be able
>>>>> to remove a couple of trailing nulls at the end of the array, while many
>>>>> would remain in the middle between the elements. And rehashing might run
>>>>> into collisions which requires allocating tree nodes - something we can’t
>>>>> do from a GC context.
>>>>>
>>>>> Having said that, perhaps if a HashTable is too sparse, a smaller
>>>>> default size of the table and more dense load factor, would save some
>>>>> memory.
>>>>>
>>>>> Slowing the growth rate of ArrayList will only reduce waste by < 0.3%.
>>>>>
>>>>>
>>>>> Okay.
>>>>>
>>>>> /Erik
>>>>>
>>>>> On Mon, Jan 23, 2023 at 2:34 PM Erik Osterlund <
>>>>> erik.osterlund at oracle.com> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On 23 Jan 2023, at 18:45, Nathan Reynolds <numeralnathan at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> 
>>>>>> > 1. Such a change would have user observable differences in
>>>>>> behaviour, which could introduce bugs in user code, due to the
>>>>>> optimization.
>>>>>>
>>>>>> How is this user observable?  The Object[] is buried inside an
>>>>>> ArrayList or HashMap.  This idea is not touching other Object[]'s outside a
>>>>>> collection.
>>>>>>
>>>>>> I suppose a performance impact from having to grow is somewhat
>>>>>> observable.  This was noted in my original email.  However, growing is not
>>>>>> functionally observable.
>>>>>>
>>>>>> The only way to functionally observe this is through reflection.
>>>>>> Reflection is a moot concern because there are no guarantees when using
>>>>>> reflection and switching to a new Java version.  So, any reflection into
>>>>>> ArrayList or HashMap would have to be fixed like any other functional
>>>>>> change to ArrayList or HashMap.
>>>>>>
>>>>>>
>>>>>> The array list code itself will probably blow up if for example you
>>>>>> GC in the middle of let’s say the add method. It checks if we have
>>>>>> capacity, then we GC draining that capacity, then add the object out of
>>>>>> bounds. GC safepoints can happen anywhere in Java code, including in any
>>>>>> method on the array list, which uses the length for various reasons,
>>>>>> assuming it won’t change seemingly concurrently.
>>>>>>
>>>>>> > 2. The JIT relies a lot on hoisting range checks out of loops to
>>>>>> get good performance, but we need to safepoint poll inside of loops. If we
>>>>>> take the slow path of such a safepoint poll, and perform a GC which shrinks
>>>>>> the arrays, then the range checks that were only performed before the loop,
>>>>>> are no longer valid, and we would essentially have to deoptimize all
>>>>>> activation records (frames) in the system, which with virtual threads could
>>>>>> be a lot of frames. Either that or stop hoisting range checks out of loops,
>>>>>> but that comes with its own performance problems.
>>>>>>
>>>>>> ArrayList uses its internal size field to determine what elements the
>>>>>> code looks at.  If the Object[] changes length to anything ≥ size, then
>>>>>> loops and hoisting still work.  The new Object[] length won't cause a
>>>>>> problem.
>>>>>>
>>>>>> HashMap may or may not be possible to trim.  It will take some effort
>>>>>> to examine the code and determine if a GC trim will work or the code may
>>>>>> need to be changed to be compatible.  One way to make the Java code
>>>>>> compatible would be to make the Java code set the HashMap's table field to
>>>>>> null while holding the reference to the table in a local variable.  GC will
>>>>>> then not recognize the table as belonging to HashMap and hence GC won't do
>>>>>> anything with it.
>>>>>>
>>>>>>
>>>>>> Loads from an array can speculatively float above user control checks
>>>>>> like the ArrayList internal range check. That’s fine and the user can’t
>>>>>> observe the difference in behaviour. However, an aaload has its own range
>>>>>> check against the actual array length. Loads into the array can not float
>>>>>> above that array range check, because that could perform a load outside the
>>>>>> heap boundary. Therefore, with your proposal, a read out of bounds of the
>>>>>> array list would not only throw an exception, but occasionally also access
>>>>>> outside the heap and hence crash the JVM, when accessed in a loop within
>>>>>> which the capacity can shrink, rendering hoisted range checks invalid.
>>>>>> That’s my concern here. And combating that would likely involve a huge
>>>>>> deopt hammer triggered by GC.
>>>>>>
>>>>>> > 3. Many GCs allocate objects that are larger than a certain
>>>>>> threshold in special storage that deals with unique concerns for larger
>>>>>> objects. If said objects shrink to below that limit, I suspect we would
>>>>>> have many interesting problems to deal with. Not impossible to fix, but
>>>>>> would probably be rather painful to deal with.
>>>>>>
>>>>>> Are you talking about large object allocation?  If so, this is beyond
>>>>>> my expertise.  However, I would assume that GC could allocate a new
>>>>>> location for the Object[], copy the data, and leave the large Object[] as
>>>>>> garbage for collection later.  This is probably worth it since the memory
>>>>>> saves could be very significant.
>>>>>>
>>>>>>
>>>>>> Yes, I am talking avout large object allocation. While we certainly
>>>>>> could deal with it, my point was rather that it would be painful and
>>>>>> require a bunch of special handling.
>>>>>>
>>>>>> > Perhaps this could be modelled as a library problem instead. Sounds
>>>>>> like perhaps the growing strategy of the library is too aggressive for your
>>>>>> preferences?
>>>>>>
>>>>>> I am not sure how a Java library could implement this.  The library
>>>>>> could use reflection to get access to the Object[] in ArrayList or HashMap,
>>>>>> allocate a new Object[], copy the elements, and use reflection to store the
>>>>>> new Object[].  However, there is no synchronization mechanism to get the
>>>>>> other Java threads to see the reference to the new Object[].  Some threads
>>>>>> will continue to use the old Object[] and some threads will use the new
>>>>>> Object[].  Also, if other Java threads change the elements in the old
>>>>>> Object[] while the library is making the change, then those changes could
>>>>>> be lost.
>>>>>>
>>>>>>
>>>>>> I was thinking of rather having a less aggressive growing strategy,
>>>>>> or something along those lines.
>>>>>>
>>>>>> The beauty of trimming during GC move is that GC and Java threads
>>>>>> have already negotiated how to synchronize the object move.  In Serial GC,
>>>>>> the Java threads are paused and GC can move anything it wants.  GC changes
>>>>>> the stack (and registers?) of all the Java threads.  The Java threads have
>>>>>> no way to tell that the object moved.  In ZGC, the Java and GC threads work
>>>>>> together to perform the move.  If I understand correctly, other GC
>>>>>> algorithms execute moves like Serial or ZGC.
>>>>>>
>>>>>>
>>>>>> As I mentioned earlier, they have not negotiated when such GC can
>>>>>> shrink, if a java thread is in the middle of a method poking around at an
>>>>>> array list.
>>>>>>
>>>>>> Hope this makes sense.
>>>>>>
>>>>>> /Erik
>>>>>>
>>>>>>
>>>>>> On Mon, Jan 23, 2023 at 1:57 AM Erik Osterlund <
>>>>>> erik.osterlund at oracle.com> wrote:
>>>>>>
>>>>>>> Hi Nathan,
>>>>>>>
>>>>>>> I can think of a number of concerning aspects of doing something
>>>>>>> like this:
>>>>>>>
>>>>>>> 1. Such a change would have user observable differences in
>>>>>>> behaviour, which could introduce bugs in user code, due to the optimization.
>>>>>>>
>>>>>>> 2. The JIT relies a lot on hoisting range checks out of loops to get
>>>>>>> good performance, but we need to safepoint poll inside of loops. If we take
>>>>>>> the slow path of such a safepoint poll, and perform a GC which shrinks the
>>>>>>> arrays, then the range checks that were only performed before the loop, are
>>>>>>> no longer valid, and we would essentially have to deoptimize all activation
>>>>>>> records (frames) in the system, which with virtual threads could be a lot
>>>>>>> of frames. Either that or stop hoisting range checks out of loops, but that
>>>>>>> comes with its own performance problems.
>>>>>>>
>>>>>>> 3. Many GCs allocate objects that are larger than a certain
>>>>>>> threshold in special storage that deals with unique concerns for larger
>>>>>>> objects. If said objects shrink to below that limit, I suspect we would
>>>>>>> have many interesting problems to deal with. Not impossible to fix, but
>>>>>>> would probably be rather painful to deal with.
>>>>>>>
>>>>>>> Perhaps this could be modelled as a library problem instead. Sounds
>>>>>>> like perhaps the growing strategy of the library is too aggressive for your
>>>>>>> preferences?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> /Erik
>>>>>>>
>>>>>>> > On 21 Jan 2023, at 00:24, Nathan Reynolds <numeralnathan at gmail.com>
>>>>>>> wrote:
>>>>>>> >
>>>>>>> > ArrayList, HashMap, and other collections have internal Object[]'s
>>>>>>> to store the elements.  The Object[]'s are lazily initialized to minimize
>>>>>>> heap usage.  However, there is still quite a bit of waste in these
>>>>>>> Object[]'s.  Some collections are oversized from the beginning.  Some
>>>>>>> collections are filled and then partially or fully emptied.  For an average
>>>>>>> heap dump, 10.1% of the heap is wasted on oversized Object[]'s.
>>>>>>> >
>>>>>>> > These Object[]'s can be trimmed during GC's move operation but
>>>>>>> only when the process transitions to idle.  Since the trimming only happens
>>>>>>> when the application transitions to idle, then the extra CPU and/or pause
>>>>>>> time isn't a concern.  The performance penalty to the application to resize
>>>>>>> larger won't happen for  a while since the process is idle.  With the
>>>>>>> reduction of heap usage more idle processes can run on a single VM (i.e.,
>>>>>>> the JVM will be more Docker Container friendly).
>>>>>>> >
>>>>>>> > The trimming may apply to other collections such as ArrayDeque,
>>>>>>> BitSet, HashSet, IdentityHashMap, LinkedHashMap, Stack, TreeMap, TreeSet,
>>>>>>> WeakHashMap, Hashtable, and Vector.  HashSet, IdentityHashMap,
>>>>>>> LinkedHashMap, TreeMap, and TreeSet may be implemented with HashMap so
>>>>>>> trimming HashMap will automatically make these other collections benefit.
>>>>>>> >
>>>>>>> > How did I come up with the 10.1% heap reduction?  I analyzed 650
>>>>>>> heap dumps and presented the findings in Java Memory Hogs.  By carefully
>>>>>>> adding up the right numbers, I calculate that the heap reduction will be
>>>>>>> 10.1%.  That's a big savings and it will impact the entire world of Java
>>>>>>> programs!
>>>>>>> >
>>>>>>> > I entertained the idea of having a setting to trim with every GC
>>>>>>> or when objects promote from young to old generation.  One concern is that
>>>>>>> the application would take a performance penalty to enlarge some or all the
>>>>>>> Object[]'s.  Another concern is that the extra CPU time and/or pause time
>>>>>>> will interfere with the application's performance.
>>>>>>> >
>>>>>>> > What do you think of the overall idea?  Do you have ideas of how
>>>>>>> to trim more often than when the application transitions to idle?
>>>>>>>
>>>>>>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/hotspot-gc-dev/attachments/20230207/76e9b8cc/attachment-0001.htm>