Removing G1 Reference Post Write Barrier StoreLoad Barrier

Erik Österlund erik.osterlund at lnu.se
Fri Jan 9 21:02:29 UTC 2015


Hi Thomas,

Thank you for taking my patch for a spin in your infrastructure. :)

On 09/01/15 20:01, Thomas Schatzl wrote:

> Over the winter holidays I have been running standard benchmarks we use
> for performance work: basically specjbb2000/2005/2013, specjvm98/2008
> and some known problematic things (like the program at
> https://bugs.openjdk.java.net/browse/JDK-8062128 :)).
>
> Hardware has been two-socket/32-thread x86 machines, and a 64-thread
> SPARC T4 (I "ported" your change there, which was trivial).
>
> I asked other teams to run on other benchmarks and hardware, but
> unfortunately a build issue prevented them from starting the build I gave
> them (some files were missing, i.e. I messed up). This will take a while
> longer.
>
> As far as these benchmarks are concerned, I made the following
> observations based on the results:
>
> Except for specjvm2008 xml, in particular the xml.validation
> sub-benchmark, there is no significant difference in throughput/scores in
> any of these benchmarks.
> On SPARC, xml.validation improves by 20%, on x86 by 5%. This gives the
> entire specjvm2008 xml benchmark a significant improvement on SPARC, but
> the improvement on xml.validation is too small in total to give a
> significant improvement on x86.
>
> The micro-benchmark from JDK-8062128 gives a 20% improvement too.
>
> In general the memory barrier does not seem to be the real bottleneck
> for many applications. The filters (cross-region, and the in-young test)
> remove the need for executing the memory barrier.

Very interesting! It's fortunate that most general applications are fine 
even with the fence in the write barrier. I guess applications commonly 
write into young regions. I did not expect a huge change in general, but 
merely to improve the worst cases. And also, I just really don't like 
fences and get a bit angry when I see them haha!

> I will keep on asking for runs with your change on more
> applications/benchmarks, also on different hardware.

Thank you. :)

> Btw, about half a year ago we did an analysis of the bottlenecks in the
> write barrier and the following things seemed to be most problematic
> iirc:
>
>   - the (code) size of the barrier causes significantly worse code
> generation, and code execution/caching. E.g. for applications that you
> run completely in the young gen.
>   - actual pushing of entries onto the refinement queues is really slow.
> As soon as a significant number of entries is generated, throughput
> decreases significantly. This may be because the refinement threads start
> working, taking CPU time.
>
> Of course, the storeload does not exactly help.

That is a very interesting and important finding, because part of the 
barrier is larger and more complex than it needs to be, solely to reduce 
how often the storeload fence is hit. And that fence can now be removed. 
In pseudo code, part of it looked like this:

if (*card == YOUNG) goto done
storeload()
if (*card == DIRTY) goto done

My patch removed the fence and hence changed it to this:

if (*card == YOUNG) goto done
if (*card == DIRTY) goto done

But now without the fence I think it can be further optimized to this:

if (*card != CLEAN) goto done

So by removing the fence, the two cases YOUNG || DIRTY can be joined to 
reduce some branching and code size (and of course the fence too). I 
don't expect this to have a huge impact, but every instruction counts, 
especially since code size and complexity of the barrier were determined 
to be a bottleneck. ;)
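
For concreteness, here is a rough C++ sketch of what the filtering path 
could look like with the fence gone and the two checks folded together. 
All helper names and card values here are stand-ins for illustration, 
not the exact HotSpot identifiers:

#include <cstddef>

typedef signed char jbyte;
const jbyte CLEAN = -1, DIRTY = 0;         // illustrative card values
bool same_region(void* p, void* q);        // hypothetical helper
volatile jbyte* card_for(void* p);         // hypothetical helper
void enqueue_card(volatile jbyte* card);   // hypothetical helper

// Sketch only: post write barrier for a store "*field = new_val".
void post_write_barrier(void** field, void* new_val) {
  // Cross-region filter: same-region stores need no remembered set entry.
  if (same_region(field, new_val)) return;
  if (new_val == NULL) return;             // null stores need no entry

  volatile jbyte* card = card_for(field);
  // One filter replacing the separate YOUNG and DIRTY checks, now that
  // no fence needs to sit between them: anything but CLEAN means "skip".
  if (*card != CLEAN) return;

  *card = DIRTY;                           // dirty the card and enqueue it
  enqueue_card(card);
}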

> I did some brief checking. Here are some results, including some
> comments internally on this though:
>
>   - the IPI needs to go to all cpus that might have matching TLBs, not
> only the CPUs that the process is currently scheduled to run on.

But those CPUs that are not involved with the process wouldn't have TLB 
entries for those special pages, right?

>   - on OSes IPI is often "a highly serialized function", e.g. from a
> short look on Linux source code, it seems that there can be only one
> cross-cpu call at the same time. (I admit that I might be completely
> wrong, but there seems to be some global per-cpu locking going on).
> That might prevent scaling if you have many of those issued quickly in
> succession.

Even if that is the case, I don't think it will necessarily prevent 
scaling. We only need one CPU to perform the cross call. If two threads 
both try to serialize at the same time in an overlapping fashion, it 
suffices that one of them does it, since the fence is global. My current 
implementation isn't sophisticated enough to do this yet, but I don't 
think it would be too hard to fix.

>   - it does not work on non-TSO systems.

You might be right there; I haven't spent time thinking about non-TSO 
systems yet. But I'm curious: why do you think it would not work on 
non-TSO systems? I mean, we should still be able to replace the 
storeload barrier with this technique, and the other reorderings are 
already handled properly, right?

>>> There is a chance that a (potentially much more local) StoreLoad is much
>>> less expensive than mprotect on a modestly large system overall.
>
> There is apparently ongoing work on making memory barriers less
> expensive in the CPU too, that will likely continue.

I don't doubt it. :)

> Just fyi, the number of cards per buffer by default is 256
> (G1UpdateBufferSize) - maybe you mixed this number up with the total
> size of a queue in 64 bit systems (which is 2048).

Not impossible! Thanks.

> The main goal is to reduce the number of buffers that need to be
> processed by the gc threads, to keep latency down. Another is getting to a
> safepoint asap; processing 256 entries might take a while.

Hmm, the implementation I sent you batched cards to be processed. But it 
didn't filter, after the global fencing (after cleaning the cards), 
whether the cards had become dirty again before processing them. Perhaps 
this could reduce the number of cards processed by GC threads. I read a 
comment in the code saying it is very likely a card gets dirtied again 
after being processed.

> I do not remember saying that if we have more threads, we _definitely_
> need smaller buffers, sorry for any misunderstanding. This needs to be
> investigated in detail first :)
>
> My main point was that there will most likely be more refinement done by
> the mutator threads in the future (on large workloads). Because
> currently, in some applications, the way buffers are scheduled between
> refinement threads and mutator threads is simply not suitable.

I can imagine!

> More processing work done by the mutator means more IPIs, and more
> concern about getting to a safepoint quickly.
>
> I agree that something could be done about these problems, sure.

Yeah, by sharing global fences and timestamping them, I'm pretty sure 
this can be handled well.
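
Roughly, I imagine something like this (a simplified sketch, not code 
from my patch; issue_global_fence stands in for the mprotect/IPI cross 
call, and memory ordering details are glossed over). Each completed 
global fence bumps a counter, and a thread only issues its own cross 
call if no fence that started after its request has finished yet:

#include <atomic>
#include <cstdint>
#include <mutex>

void issue_global_fence();                 // stand-in for the cross call

static std::atomic<uint64_t> fences_started(0);
static std::atomic<uint64_t> fences_finished(0);
static std::mutex fence_lock;

void shared_global_fence() {
  // Any fence that starts after this point also covers our earlier
  // stores, so fence number (fences_started + 1) or later suffices.
  uint64_t need = fences_started.load() + 1;
  std::lock_guard<std::mutex> guard(fence_lock);
  if (fences_finished.load() >= need)
    return;                                // someone fenced on our behalf
  fences_started.fetch_add(1);
  issue_global_fence();
  fences_finished.fetch_add(1);
}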

> In theory, yes. However, this is not true in common applications, due to
> the way the card filtering/dirtying of the card table interacts with the
> number of created cards. The observation is that the faster you process
> the cards, the more cards you generate in the queues at mutator time
> because the card table entries get cleared faster.
>
> Creating an entry is a lot faster than processing it :)

Good, because by batching, processing is further delayed, and I think 
the cards can therefore be filtered better and more efficiently: after 
cleaning N cards and globally fencing, some of the cards to be processed 
may quite likely have been dirtied again, and can then be ignored.

Unfortunately, I didn't implement this in the patch I sent to you; I 
just got rid of the fence.
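
To make that concrete, here is roughly what I have in mind (a sketch 
only, not code from the patch; global_fence and refine_card are 
stand-ins, and jbyte/CLEAN/DIRTY are the illustrative names from the 
barrier sketch above):

void global_fence();                       // stand-in, as above
void refine_card(volatile jbyte* card);    // hypothetical processing step

// Sketch only: process a batch of n enqueued cards with one global fence.
void process_card_batch(volatile jbyte** cards, int n) {
  for (int i = 0; i < n; i++)
    *cards[i] = CLEAN;                     // clean the whole batch first
  global_fence();                          // one cross call covers all n
  for (int i = 0; i < n; i++) {
    // Re-filter: a card dirtied again was re-enqueued by a mutator and
    // will be processed later anyway, so it can simply be skipped here.
    if (*cards[i] == DIRTY)
      continue;
    refine_card(cards[i]);
  }
}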

> Also, throwing more threads/cores at the problem yields diminishing
> returns: there are bottlenecks in the VM (e.g. remembered set
> processing), and hardware bottlenecks. Memory bandwidth just does not
> scale as much as available CPU, and a single card results in processing
> another 512 bytes of memory in the heap.

Hmm, I had this idea before that perhaps you could have two dirty card 
values: MILDLY_DIRTY, where only a single address is enqueued, and 
REALLY_DIRTY, where the whole card is enqueued for refinement.

So if the card is clean, the barrier changes it to MILDLY_DIRTY and 
enqueues the single reference that needs processing. And if it was 
already MILDLY_DIRTY, we give up and mark the whole card for refinement. 
Is this viable? I haven't thought too much about it yet, and it wouldn't 
surprise me if it has already been considered. If not, it could probably 
improve situations where locality is poor and only a single reference is 
changed, causing all references on the card to be processed. What do you 
think?
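
In pseudo C++ the barrier slow path would look roughly like this 
(MILDLY_DIRTY and REALLY_DIRTY are made-up values, not existing G1 card 
states; the helpers are stand-ins as before):

const jbyte MILDLY_DIRTY = 1, REALLY_DIRTY = 2; // made-up card values
void enqueue_reference(void** field);           // hypothetical helper

// Sketch only: two-level dirtying in the post write barrier slow path.
void dirty_card_for(void** field) {
  volatile jbyte* card = card_for(field);
  switch (*card) {
  case CLEAN:
    *card = MILDLY_DIRTY;
    enqueue_reference(field);  // enqueue just the one updated reference
    break;
  case MILDLY_DIRTY:
    *card = REALLY_DIRTY;
    enqueue_card(card);        // give up: scan the whole 512 byte card
    break;
  default:
    break;                     // REALLY_DIRTY (or young): nothing to do
  }
}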

> I will report back as soon as I have more results.

Okay, thanks a lot for your help and the interesting discussion, Thomas. :)

/Erik


