Re: [External] : Re: Heads-up: Late G1 Barrier Expansion (Draft JEP)

Tue Feb 6 03:32:24 UTC 2024

Hi Roberto,
Thanks for the reply, which clearly answered my questions. We know that the G1 post-barrier is rather heavy compared
to that of Parallel GC. I wonder whether we could still find a way to use imprecise card marking, to get more optimization
opportunities for removing redundant card marks, which the current G1 barrier implementation does not do either.
For example, if we could generate the barrier at the MachNode level before register allocation, we might still have the object
address available for imprecise card marking.
Thanks,
Liang
------------------------------------------------------------------
From: Roberto Castaneda Lozano <roberto.castaneda.lozano at oracle.com>
Sent: Monday, February 5, 2024 17:49
To: porters-dev <porters-dev at openjdk.org>; adinn <adinn at redhat.com>; "MAO, Liang" <maoliang.ml at alibaba-inc.com>
Subject: Re: [External] : Re: Heads-up: Late G1 Barrier Expansion (Draft JEP)
Hi Liang,
Thanks for your interest! Before addressing your specific questions, let me summarize our view and strategy regarding performance work. Since the late barrier expansion model does not expose all details of the barrier operations to C2, there will always be cases where our proposal leads to slight inefficiencies at the micro level compared to the current model. Our view is that these small inefficiencies are tolerable as long as they do not translate into regressions for interesting applications, and will be outweighed anyway by barrier optimization work that this JEP enables G1 engineers to perform. With this in mind, our strategy is to take the opportunity to re-evaluate the optimizations that C2 currently applies to G1 barriers, and re-implement in the late barrier expansion model only those which have a demonstrable performance effect at the application level (see "Optimizations" subsection in the JEP). Many of these optimizations can also be performed in the late barrier expansion model, albeit in a more explicit way.
Regarding question a), our current prototype inlines all barrier checks together with the corresponding memory access operation ("fast path"), but places the runtime calls together with their prologue/epilogue code out-of-line in assembly stubs ("slow path"). We plan to re-evaluate, in the context of this JEP, the performance effect of moving more parts of the post-barrier to the stub, similarly to JDK-8225776.
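For readers less familiar with the fast path / slow path split described above, the following toy Java model may help. It is not code from the prototype: the class name, the region and card sizes, the queue length, and the exact set of checks are all assumptions chosen for illustration, and real G1 barriers involve more checks and memory ordering than shown. The filtering checks and the card update sit in `postBarrier` (standing in for the inlined fast path), while `slowPath` stands in for the out-of-line stub that would make the runtime call.

```java
public class BarrierPathSketch {
    static final int REGION_SHIFT = 20; // assumed 1 MiB heap regions (illustrative)
    static final int CARD_SHIFT = 9;    // assumed 512-byte cards (illustrative)
    static final byte CLEAN = 0, DIRTY = 1;

    final byte[] cardTable = new byte[1 << 13]; // covers a 4 MiB toy heap
    final int[] dirtyCardQueue = new int[2];    // deliberately tiny buffer
    int queued = 0;
    int slowPathCalls = 0;

    // Stand-in for the inlined checks that accompany a reference store.
    void postBarrier(long fieldAddr, long newValueAddr) {
        if (newValueAddr == 0) return;                                  // storing null
        if (((fieldAddr ^ newValueAddr) >>> REGION_SHIFT) == 0) return; // same region
        int card = (int) (fieldAddr >>> CARD_SHIFT);
        if (cardTable[card] == DIRTY) return;                           // already marked
        cardTable[card] = DIRTY;
        if (queued == dirtyCardQueue.length) {
            slowPath();                                                 // buffer full
        }
        dirtyCardQueue[queued++] = card;
    }

    // Stand-in for the out-of-line stub: here it only counts the call
    // and resets the buffer, where a real stub would call into the runtime.
    void slowPath() {
        slowPathCalls++;
        queued = 0;
    }

    public static void main(String[] args) {
        BarrierPathSketch b = new BarrierPathSketch();
        b.postBarrier(0x100008L, 0x200010L); // cross-region store: card is queued
        b.postBarrier(0x100008L, 0x100040L); // same region: filtered inline
        b.postBarrier(0x100400L, 0x200000L); // fills the buffer
        b.postBarrier(0x100800L, 0x200000L); // buffer full: slow path taken
        System.out.println("slow path calls: " + b.slowPathCalls);
    }
}
```

In this model only the last store reaches `slowPath`; moving more of the inline work (for example the card update) into that method would correspond to the kind of re-evaluation mentioned above.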
Regarding question b), yes, our current prototype always performs precise card marking. We have not yet found empirical evidence that the potential inefficiencies in the generated code translate into regressions at the application level, but are happy to reconsider this if someone knows of an interesting application-level benchmark where imprecise card marking makes a significant difference.
Hope that answers your questions!
Thanks,
Roberto
________________________________________
From: Liang Mao <maoliang.ml at alibaba-inc.com>
Sent: Sunday, February 4, 2024 7:43 AM
To: porters-dev; Roberto Castaneda Lozano; adinn
Subject: [External] : Re: Heads-up: Late G1 Barrier Expansion (Draft JEP)
Hi Roberto,
Excited to hear the news about improving the G1 barriers! I have a few questions about this proposal:
 a) ZGC uses late expansion because it has a clear fast path and a medium/slow path. The fast path
contains only one or two simple instructions, so it does not need optimization from C2. The G1 post-barrier has several
branch checks and does not have clear boundaries between fast and slow paths. There could also be optimization opportunities such
as JDK-8225776. Permanently avoiding C2 optimization might lose performance.
 b) G1 (like other card-table remembered-set GCs) uses imprecise card marking, which marks the card of the object address instead of the card of the field address.
If we use late expansion, we only have the field address there and therefore have to recompute the object address, which
needs additional instructions or registers. BTW, I didn't see the details in the prototype implementation. We can
always use precise card marking in G1 anyway. Imprecise card marking has the advantage of eliminating redundant card
marks when writing into different fields of an object, because the card mark addresses are the same. Parallel GC can perform this optimization.
Late expansion could benefit from dominance analysis to remove redundant barriers, whereas traditional Ideal optimizations can barely help
G1 barriers. Looking forward to your reply and progress!
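To make the precise versus imprecise distinction in b) concrete, here is a small Java sketch. It is only an illustration under stated assumptions: the class and method names are hypothetical, the 512-byte card size is assumed rather than read from any JVM, and addresses are plain `long` values. Precise marking dirties the card covering each written field, while imprecise marking dirties the card covering the object's base address, so two stores into the same object always hit the same card.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class CardMarkSketch {
    // Assumed card size of 512 bytes (2^9); illustrative only.
    static final int CARD_SHIFT = 9;

    static long cardIndex(long address) {
        return address >>> CARD_SHIFT;
    }

    // Precise marking: one card per written field address.
    static Set<Long> preciseCards(long objectBase, long[] fieldOffsets) {
        Set<Long> cards = new LinkedHashSet<>();
        for (long offset : fieldOffsets) {
            cards.add(cardIndex(objectBase + offset));
        }
        return cards;
    }

    // Imprecise marking: every store marks the card of the object base,
    // so repeated stores into one object collapse to a single card.
    static Set<Long> impreciseCards(long objectBase, long[] fieldOffsets) {
        Set<Long> cards = new LinkedHashSet<>();
        for (int i = 0; i < fieldOffsets.length; i++) {
            cards.add(cardIndex(objectBase));
        }
        return cards;
    }

    public static void main(String[] args) {
        long objectBase = 0x10000L + 496;  // object starts 496 bytes into a card
        long[] fieldOffsets = {8, 24};     // two hypothetical reference fields
        System.out.println("precise:   " + preciseCards(objectBase, fieldOffsets));
        System.out.println("imprecise: " + impreciseCards(objectBase, fieldOffsets));
    }
}
```

With these inputs the two fields fall on different cards under precise marking but share one card under imprecise marking, which is why the second card mark becomes redundant in the imprecise scheme.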
Thanks,
Liang
________________________________________
From: porters-dev <porters-dev-retn at openjdk.org> on behalf of Roberto Castaneda Lozano
Sent: February 2, 2024 22:37
To: Andrew Dinn <adinn at redhat.com>; porters-dev at openjdk.org
Subject: Re: [External] : Re: Heads-up: Late G1 Barrier Expansion (Draft JEP)
Hi Andrew,
Thanks for your interest! I am unfortunately not very familiar with Shenandoah and its barrier model, but in principle late barrier expansion should be applicable to any collector where barriers are tightly coupled to individual memory access operations and performance does not depend too much on exposing barrier operation details to the JIT compiler.
If it helps, our prototype is available here: https://github.com/robcasloz/jdk/tree/g1-late-barrier-expansion. Please note that this is early, experimental work and might change significantly as the JEP evolves.
Thanks,
Roberto
________________________________________
From: Andrew Dinn <adinn at redhat.com>
Sent: Friday, February 2, 2024 2:33 PM
To: Roberto Castaneda Lozano; porters-dev at openjdk.org
Subject: [External] : Re: Heads-up: Late G1 Barrier Expansion (Draft JEP)
Hi Roberto,
On 02/02/2024 13:18, Roberto Castaneda Lozano wrote:
> I have written (together with Erik Österlund) a draft JEP for
> simplifying C2's handling of G1 barriers, see
> https://bugs.openjdk.org/browse/JDK-8322295. This is a heads-up that
> the implementation of this JEP requires platform-specific support from
> all OpenJDK ports. While interpreter G1 barrier implementations are
> available for all ports and can be largely reused, the JEP
> additionally requires 1) defining G1-specific ADL instructions and 2)
> implementing platform-specific logic to support runtime calls from the
> barrier code. For ports that already support ZGC, the effort should be
> smaller, as the logic for 2) can be shared between ZGC and G1.
>
> To give a rough estimation of the required effort, the x86-64 changes
> in our prototype involve approximately 900 line insertions and 300
> line deletions over 9 files, among which approximately 300 deleted and
> inserted lines correspond to logic factored out from ZGC.
I looked at the proposal and was interested in the approach, not least because ZGC appears to have traversed the path that this JEP recommends G1 to follow.
Have you considered whether this same approach might be taken with the Shenandoah GC? Alternatively, can you declare any basic assumptions regarding how G1 operates that are needed to enable this change, and which might therefore need to be met by Shenandoah?
Of course, access to the prototype code might help answer those questions (at least it would help someone better versed in Shenandoah than me) but a high-level summary of what in the design of G1 and ZGC makes this approach work or, conversely, might make it fail would, if available, be a great help.
regards,
Andrew Dinn
-----------