Removing G1 Reference Post Write Barrier StoreLoad Barrier
Thomas Schatzl
thomas.schatzl at oracle.com
Fri Jan 9 19:00:54 UTC 2015
Hi Erik,
I finally had time to answer your questions and give some feedback
about the current state of benchmarking the change.
On Tue, 2014-12-23 at 13:23 +0000, Erik Österlund wrote:
> Hi Thomas,
>
> Thanks for the comments.
>
> > On 22 Dec 2014, at 22:19, Thomas Schatzl <thomas.schatzl at oracle.com> wrote:
> > On Mon, 2014-12-22 at 20:30 +0000, Erik Österlund wrote:
>
> >> And reference writes are something I consider a fast path. Although
> >> the frequency of inter regional pointer writes is different in
> >> different applications, I think that by having a storeload fence
> >> in this G1 barrier, it gives rise to some awkward cases like sorting
> >> large linked lists where performance becomes suboptimal, so it
> >> would be neat to get rid of it and get more consistent and
> >> resilient performance numbers.
> >
> > Sorting linked lists is suboptimal with current G1 with or without the
> > change as every reference write potentially creates a card entry. I
> > guess most time will be spent in actual refinement in this case anyway.
>
> Yeah, true, but it would get better anyway: it’s possible that the concurrent
> refinement threads can take much of the hit if resources are available for
> that, and even if not, many fences might trigger on the same card - it
> will currently fence whether the card was dirty or not. Then of course
> there’s no point speculating too much - I think just benchmarking it would
> be better so it can be quantified.
Over the winter holidays I have been running the standard benchmarks we use
for performance work: basically specjbb2000/2005/2013, specjvm98/2008
and some known problematic things (like the program at
https://bugs.openjdk.java.net/browse/JDK-8062128 :)).
Hardware has been two-socket/32-thread x86 machines and a 64-thread
SPARC T4 (I "ported" your change there, which was trivial).
I asked other teams to do runs on other benchmarks and hardware;
unfortunately a build issue prevented them from running the build I gave
them (some files were missing, i.e. I messed up). This will take a while
longer.
As far as these benchmarks are concerned, I made the following
observations based on the results:
Except for specjvm2008 xml, in particular the xml.validation
sub-benchmark, there is no significant difference in throughput/scores in
any of these benchmarks.
On SPARC, xml.validation improves by 20%, on x86 by 5%. This gives a
significant improvement for the entire specjvm2008 xml suite on SPARC,
but the improvement on xml.validation is too small in total to give a
significant improvement on x86.
The micro-benchmark from JDK-8062128 gives a 20% improvement too.
In general the memory barrier does not seem to be the real bottleneck
for many applications: the filters (the cross-region and the in-young
tests) remove the need for executing the memory barrier in most cases.
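
For readers following along, here is a rough, self-contained sketch of
that filtering; the constants and helper names are made up for
illustration, the real code lives in HotSpot's G1 barrier implementation:

  // Rough sketch of the G1 post write barrier for "field = new_val".
  #include <atomic>
  #include <cstdint>

  static const int    kLogRegionSize = 20;  // assume 1M regions
  static const int    kLogCardSize   = 9;   // 512-byte cards
  static const int8_t kYoungCard     = 8;   // placeholder card values
  static const int8_t kDirtyCard     = 0;

  extern int8_t* card_table_base;           // assumed to exist elsewhere
  void enqueue_dirty_card(int8_t* card);    // hands the card to refinement

  void g1_post_write_barrier(void** field, void* new_val) {
    // Filter 1: a write that stays within one region needs no entry.
    if ((((uintptr_t)field ^ (uintptr_t)new_val) >> kLogRegionSize) == 0)
      return;
    // Filter 2: null stores need no entry either.
    if (new_val == nullptr)
      return;
    int8_t* card = card_table_base + ((uintptr_t)field >> kLogCardSize);
    // Filter 3: writes into young regions are handled at GC time anyway.
    if (*card == kYoungCard)
      return;
    // The StoreLoad fence this thread of discussion is about removing.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    // Filter 4: only the first write to a not-yet-dirty card enqueues it.
    if (*card != kDirtyCard) {
      *card = kDirtyCard;
      enqueue_dirty_card(card);
    }
  }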
I will keep on asking for runs with your change on more
applications/benchmarks, also on different hardware.
Btw, about half a year ago we did an analysis of the bottlenecks in the
write barrier, and the following things seemed to be most problematic
iirc:
- the (code) size of the barrier causes significantly worse code
generation and code execution/caching, e.g. for applications that run
completely in the young gen.
- the actual pushing of entries onto the refinement queues is really slow
(a sketch of this enqueue path follows below). As soon as a significant
number of entries is generated, throughput decreases significantly. This
may be because the refinement threads start working and take cpu time.
Of course, the storeload does not exactly help.
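
As promised above, here is one possible shape of the enqueue path behind
the enqueue_dirty_card() stub from the earlier sketch; names and sizes
are illustrative, not the actual HotSpot ones:

  #include <cstddef>
  #include <cstdint>

  // Each mutator thread owns a small buffer of card pointers and hands it
  // off to the refinement machinery when it fills up.
  struct DirtyCardQueue {
    int8_t** buf      = nullptr;
    size_t   index    = 0;
    size_t   capacity = 256;   // cf. G1UpdateBufferSize
  };

  static thread_local DirtyCardQueue tl_queue;

  void hand_off_full_buffer(DirtyCardQueue* q);  // feeds/wakes refinement (not shown)

  void allocate_new_buffer(DirtyCardQueue* q) {
    q->buf   = new int8_t*[q->capacity];
    q->index = 0;
  }

  void enqueue_dirty_card(int8_t* card) {
    DirtyCardQueue* q = &tl_queue;
    if (q->buf == nullptr || q->index == q->capacity) {
      if (q->buf != nullptr)
        hand_off_full_buffer(q);   // this hand-off is the slow part
      allocate_new_buffer(q);
    }
    q->buf[q->index++] = card;
  }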
> > Mprotect will flush all store buffers of all processors every time. So
> > you charge everyone (also not only the VM; consider running multiple VMs
> > on a system). This is what Jon has been concerned about, how scalable is
> > this.
>
> Are you sure about this? Since it uses IPI for flushing the store buffers,
> I was quite certain it sends it only to those CPUs the process is
> scheduled to run on. I would be very surprised if that was not the case
> as it would be a huge blunder in the implementation of mprotect, but I
> haven’t checked the source code on every platform out there.
I did some brief checking. Here are some results, including some
comments from internal discussion (a sketch of the mprotect mechanism
under discussion follows the list):
- the IPI needs to go to all cpus that might have matching TLBs, not
only the CPUs the process is currently scheduled to run on.
- on OSes the IPI is often "a highly serialized function"; from a
short look at the Linux source code, it seems that there can be only one
cross-cpu call at a time (I admit that I might be completely
wrong, but there seems to be some global per-cpu locking going on).
That might prevent scaling if many of those are issued in quick
succession.
- it does not work on non-TSO systems.
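
For reference, here is a minimal sketch of the kind of mprotect-based
global fence being discussed, loosely modeled on the memory-serialize-page
trick HotSpot already uses elsewhere; it is not Erik's actual patch, and
the names are invented:

  #include <sys/mman.h>
  #include <unistd.h>
  #include <cassert>

  // One dummy page; dropping its write permission forces a TLB shootdown,
  // and the resulting IPIs act as a serializing event on the CPUs that
  // currently run threads of this process.
  static void* dummy_page = nullptr;
  static long  page_size  = 0;

  void global_fence_init() {
    page_size  = sysconf(_SC_PAGESIZE);
    dummy_page = mmap(nullptr, page_size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    assert(dummy_page != MAP_FAILED);
  }

  // Issued by the refinement side before it reads a batch of cards,
  // instead of every mutator paying a StoreLoad per reference write.
  void global_fence() {
    mprotect(dummy_page, page_size, PROT_READ);               // shootdown
    mprotect(dummy_page, page_size, PROT_READ | PROT_WRITE);  // restore
  }

The concerns above apply to exactly this kind of call: the shootdown may
hit more CPUs than strictly necessary, and the cross-cpu call path in the
kernel may serialize.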
> > There is a chance that a (potentially much more local) StoreLoad is much
> > less expensive than mprotect with a modestly large system overall.
There is apparently ongoing work on making memory barriers less
expensive in the CPUs too, and that will likely continue.
> >> If I understand you correctly, the frequency of invoking card
> >> refinement from mutators might have to increase by giving them
> >> smaller dirty card buffers because we can’t have too many dirty
> >> cards hanging around per mutator thread if we want to have good
> >> latency and lots of threads?
> > This is one problem; the other is that mutator threads themselves need to
> > do more refinement than they do with current settings. Which means more
> > full buffers with more frequent mprotect calls. There may be some
> > possibility to increase the buffer size, but 128 cards seems already
> > somewhat large (I have not measured it, do not even know the current
> > buffer size).
>
> I measured it in the benchmarks I ran and it was 2048 cards per
> buffer processed. So if the batching is the same as the buffer size,
> 2048, then even on a system with 1024 cores, it would be half a memory
Just fyi, the number of cards per buffer by default is 256
(G1UpdateBufferSize) - maybe you mixed this number up with the total
size of a queue in bytes on 64 bit systems (which is 2048).
> fence per card, which is arguably still always going to be strictly
> less than it is now. But then if somebody had even more cores, like
> a billion or something, it would be possible to make a more advanced
> variant where not all refinement threads make the global fence, instead
> the global fences are timestamped and if one refinement thread issues a
> global fence, the others won’t have to for cards cleaned before the
> timestamp. That way, the punishment on the other cores in such a
> massive system could be controlled and dealt with, without causing too
> much trouble really.
True.
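
To make the quoted idea a bit more concrete, here is a rough sketch of
how such timestamped fences could look (names invented; the race where
two threads both end up fencing is harmless, just redundant):

  #include <atomic>
  #include <cstdint>

  extern void global_fence();   // e.g. the mprotect-based fence sketched above

  // Monotonically increasing count of global fences issued so far.
  static std::atomic<uint64_t> fence_epoch{0};

  // Record the epoch when a refinement thread finishes cleaning its batch.
  uint64_t note_batch_cleaned() {
    return fence_epoch.load(std::memory_order_acquire);
  }

  // Before processing a batch cleaned at 'seen_epoch': if someone has
  // issued a global fence since then, it already covers our cards and we
  // can skip ours; otherwise issue one and advance the epoch.
  void fence_for_batch(uint64_t seen_epoch) {
    if (fence_epoch.load(std::memory_order_acquire) != seen_epoch)
      return;
    global_fence();
    fence_epoch.fetch_add(1, std::memory_order_acq_rel);
  }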
> >> In that case, the minimum size of
> >> mutator dirty card buffers could impact the batch size so the
> >> constants matter here. But 128 seems like a rather small constant,
> >> do you think we would run into a situation where that matters?
> >
> > No, not particularly in that situation, see above. 128 cards may already
> > be quite a lot of work to do though.
>
> If the buffer size is 2048 (don’t know if that changes dynamically?)
> then it shouldn’t be too much.
>
> But the whole argument that we need smaller buffers if we have more
> cores and threads I do not quite understand. It’s understood that means
The main goal is to reduce the number of buffers that need to be
processed by the gc threads, to keep latency down. Another is getting to
a safepoint asap; processing 256 entries might take a while.
I do not remember saying that if we have more threads, we _definitely_
need smaller buffers, sorry for any misunderstanding. This needs to be
investigated in detail first :)
My main point was that there will most likely be more refinement done by
the mutator threads in the future (on large workloads), because
currently, in some applications, the way buffers are scheduled between
refinement threads and mutator threads is simply not suitable.
More processing work done by the mutator means more IPIs, and more
concern about getting to a safepoint quickly.
I agree that something could be done about these problems, sure.
> a lot of threads will have a lot of buffers hanging around but if we
> have many cores we also have more threads processing them in the
> safepoints to reduce the pause time, right? So it seems that it’s when
> the ratio of threads to the number of cores available gets nasty that
> the buffer size should reduce to retain the same pause times. And even
> then, since the number of cores is constant, the rate at which card
> entries are produced is the same regardless of the number of threads
> over-saturating the system (roughly speaking), so even if private local
> card buffers per mutator get smaller to reduce pause times, they could
> be processed at the same speed as before and hence have the same batch
> size. By having smaller private buffers sticking to the threads and
> joining them together to make larger public batches to be processed,
> this should be dealt with in a scalable fashion I believe.
In theory, yes. However, this is not true in common applications due to
the way the card filtering/dirtying of the card table interacts with the
number of created cards. The observation is that the faster you process
the cards, the more cards you generate in the queues at mutator time,
because the card table entries get cleared faster.
Creating an entry is a lot faster than processing it :)
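
A sketch of the refinement side shows why, reusing the simplified card
values from the barrier sketch above (again just an illustration, not the
HotSpot code): the card is cleared before the covered heap range is
scanned, so the next cross-region write into that 512-byte range passes
the mutator's "already dirty" filter again and enqueues the card once
more.

  #include <atomic>
  #include <cstdint>

  static const int8_t kCleanCard = 1;          // placeholder, as above

  void scan_heap_range_for_card(int8_t* card); // ~512 bytes of heap per card

  void refine_card(int8_t* card) {
    *card = kCleanCard;                                   // re-arm the filter
    std::atomic_thread_fence(std::memory_order_seq_cst);  // order clear vs. scan
    scan_heap_range_for_card(card);
  }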
Also, throwing more threads/cores at the problem yields diminishing
returns: there are bottlenecks in the VM (e.g. remembered set
processing), and hardware bottlenecks. Memory bandwidth just does not
scale as much as available CPU, and a single card results in processing
another 512 bytes of memory in the heap.
I will report back as soon as I have more results.
Thanks,
Thomas