Missing entry in SCC due to decomp count mismatch
During the startup of a quarkus app, I see a particular method that gets C2 compiled almost every time in the production run with the premain branch . I don't see this happening with the mainline. The reason this method caught my attention is the significant amount of memory its C2 compilation consumes (between 25-40 mb) compared to the other compilations. The method in question is jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z The assembly phase added two entries for this method in the code cache: [3.391s][info ][scc,nmethod ] 2631 (L4): Writing nmethod 'jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z' (comp level: 4, decomp: 1, has clinit barriers) to Startup Code Cache 'quarkus-getting-started.cds.code' ... [7.215s][info ][scc,nmethod ] 4354 (L4): Writing nmethod 'jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z' (comp level: 4, decomp: 1) to Startup Code Cache 'quarkus-getting-started.cds.code' In the production run the "preload" version was successfully loaded: [0.695s][info ][scc,nmethod ] 727 (L4): Preloading nmethod 'jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z' (decomp: 0, hash: 0x493f24e2, has clinit barriers) The PrintTieredEvents logs indicate this method was also sent for compilation during replay training: 0.877593: [force-compile level=4 [jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z] @-1 queues=0,0 rate=0.000000 load=0.007812 k=1.00,1.00 total=56,0 mdo=0(0),0(0) max levels=4,0 compilable=c1,c1-osr,c2,c2-osr status=idle mtd: mdo=18830(8306), 0(0), deps=0] Ideally this request should have been fulfilled by the second entry in the code cache. But instead I see this message: [0.878s][info ][scc,nmethod ] Missing entry for 'jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z' (comp_level 4, decomp: 0, hash: 0x493f24e2) This is followed by the C2 compilation of the method. It looks like the failure to find the second entry is due to a mismatch of the decomp count [0]. The decomp count is stored in the MethodData. Is it possible that the method data is not yet installed when replay training is done? If so, is that by design or a bug? [0] https://github.com/openjdk/leyden/blob/ec5eb99653624d02a923a314ce40086753b24... Thanks, - Ashutosh Mehra
Thank you for report, Ashutosh Is this with one step workflow? With one step workflow we should ignore decomp count because code is generated not during execution but based on training data in forked VM - no deoptimization happens there. `decomp count` was introduced for 5 steps workflow when we generate aot code as we execute application with idea that production run will follow the same compilation/deoptimization steps. Actually I implemented it before we start using TD to trigger compilation. May be this is the reason that 5 steps workflow is slow now when we use TD. I need to check. Thanks, Vladimir K On 7/24/24 7:54 AM, Ashutosh Mehra wrote:
During the startup of a quarkus app, I see a particular method that gets C2 compiled almost every time in the production run with the premain branch . I don't see this happening with the mainline. The reason this method caught my attention is the significant amount of memory its C2 compilation consumes (between 25-40 mb) compared to the other compilations. The method in question is jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z
The assembly phase added two entries for this method in the code cache:
[3.391s][info ][scc,nmethod ] 2631 (L4): Writing nmethod 'jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z' (comp level: 4, decomp: 1, has clinit barriers) to Startup Code Cache 'quarkus-getting-started.cds.code' ... [7.215s][info ][scc,nmethod ] 4354 (L4): Writing nmethod 'jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z' (comp level: 4, decomp: 1) to Startup Code Cache 'quarkus-getting-started.cds.code'
In the production run the "preload" version was successfully loaded:
[0.695s][info ][scc,nmethod ] 727 (L4): Preloading nmethod 'jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z' (decomp: 0, hash: 0x493f24e2, has clinit barriers)
The PrintTieredEventslogs indicate this method was also sent for compilation during replay training:
0.877593: [force-compile level=4 [jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z] @-1 queues=0,0 rate=0.000000 load=0.007812 k=1.00,1.00 total=56,0 mdo=0(0),0(0) max levels=4,0 compilable=c1,c1-osr,c2,c2-osr status=idle mtd: mdo=18830(8306), 0(0), deps=0]
Ideally this request should have been fulfilled by the second entry in the code cache. But instead I see this message:
[0.878s][info ][scc,nmethod] Missing entry for 'jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z' (comp_level 4, decomp: 0, hash: 0x493f24e2)
This is followed by the C2 compilation of the method.
It looks like the failure to find the second entry is due to a mismatch of the decomp count [0]. The decomp count is stored in the MethodData. Is it possible that the method data is not yet installed when replay training is done? If so, is that by design or a bug?
[0] https://github.com/openjdk/leyden/blob/ec5eb99653624d02a923a314ce40086753b24... <https://github.com/openjdk/leyden/blob/ec5eb99653624d02a923a314ce40086753b240fc/src/hotspot/share/code/SCCache.cpp#L938>
Thanks, - Ashutosh Mehra
Hi Vladimir,
Is this with one step workflow? With one step workflow we should ignore decomp count because code is generated not during execution but based on training data in forked VM - no deoptimization happens there.
Yes, this is with the 1-step workflow. - Ashutosh Mehra On Wed, Jul 24, 2024 at 3:56 PM Vladimir Kozlov <vladimir.kozlov@oracle.com> wrote:
Thank you for report, Ashutosh
Is this with one step workflow? With one step workflow we should ignore decomp count because code is generated not during execution but based on training data in forked VM - no deoptimization happens there.
`decomp count` was introduced for 5 steps workflow when we generate aot code as we execute application with idea that production run will follow the same compilation/deoptimization steps.
Actually I implemented it before we start using TD to trigger compilation. May be this is the reason that 5 steps workflow is slow now when we use TD. I need to check.
Thanks, Vladimir K
On 7/24/24 7:54 AM, Ashutosh Mehra wrote:
During the startup of a quarkus app, I see a particular method that gets C2 compiled almost every time in the production run with the premain branch . I don't see this happening with the mainline. The reason this method caught my attention is the significant amount of memory its C2 compilation consumes (between 25-40 mb) compared to the other compilations. The method in question is
jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z
The assembly phase added two entries for this method in the code cache:
[3.391s][info ][scc,nmethod ] 2631 (L4): Writing nmethod
'jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z' (comp
level: 4, decomp: 1, has clinit barriers) to Startup Code Cache 'quarkus-getting-started.cds.code' ... [7.215s][info ][scc,nmethod ] 4354 (L4): Writing nmethod
'jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z' (comp
level: 4, decomp: 1) to Startup Code Cache 'quarkus-getting-started.cds.code'
In the production run the "preload" version was successfully loaded:
[0.695s][info ][scc,nmethod ] 727 (L4): Preloading nmethod
'jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z' (decomp:
0, hash: 0x493f24e2, has clinit barriers)
The PrintTieredEventslogs indicate this method was also sent for compilation during replay training:
0.877593: [force-compile level=4
[jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z] @-1
queues=0,0 rate=0.000000 load=0.007812 k=1.00,1.00 total=56,0 mdo=0(0),0(0) max levels=4,0 compilable=c1,c1-osr,c2,c2-osr status=idle mtd: mdo=18830(8306), 0(0), deps=0]
Ideally this request should have been fulfilled by the second entry in the code cache. But instead I see this message:
[0.878s][info ][scc,nmethod] Missing entry for
'jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z'
(comp_level 4, decomp: 0, hash: 0x493f24e2)
This is followed by the C2 compilation of the method.
It looks like the failure to find the second entry is due to a mismatch of the decomp count [0]. The decomp count is stored in the MethodData. Is it possible that the method data is not yet installed when replay training is done? If so, is that by design or a bug?
[0]
https://github.com/openjdk/leyden/blob/ec5eb99653624d02a923a314ce40086753b24...
< https://github.com/openjdk/leyden/blob/ec5eb99653624d02a923a314ce40086753b24...
Thanks, - Ashutosh Mehra
I looked on code and for "Preloading" nmethod we ignore decompile counter. We check it only when looking for "normal" AOT code. The counter's value in "Preloading" and "Missing" messages is taking from current MDO [1] and not from the entry describing AOT code. I remember (Igor may confirm it) that CompilerCounters data is nulled for TD including _nof_decompiles value. That is why we see 0 when loading code. It seems we are "lucky" in most cases because we deoptimize (and update counter) may be level 2 (C1) AOT code or "Preloaded" code. That is what I see in my runs which has opposite issue: we record AOT code with `decomp == 0` and when we load it "preload" AOT code was deoptimized and counter updated in MDO and check failed when we tried to load "normal" AOT code. Anyway. As I said, for one-step workflow we should not use decompile counter. I may actually save only latest version of "normal" nmethod even for 5-step workflow. The question is which nmethod version corresponds to saved MDO? For on-step workflow the answer is simple and we can ignore the counter. For 5 steps I need to check corresponding MDO to save correct version. Thanks, Vladimir K [1] https://github.com/openjdk/leyden/blob/premain/src/hotspot/share/code/SCCach... [2] https://github.com/openjdk/leyden/blob/premain/src/hotspot/share/code/SCCach... On 7/24/24 3:46 PM, Ashutosh Mehra wrote:
Hi Vladimir,
Is this with one step workflow? With one step workflow we should ignore decomp count because code is generated not during execution but based on training data in forked VM - no deoptimization happens there.
Yes, this is with the 1-step workflow.
- Ashutosh Mehra
On Wed, Jul 24, 2024 at 3:56 PM Vladimir Kozlov <vladimir.kozlov@oracle.com <mailto:vladimir.kozlov@oracle.com>> wrote:
Thank you for report, Ashutosh
Is this with one step workflow? With one step workflow we should ignore decomp count because code is generated not during execution but based on training data in forked VM - no deoptimization happens there.
`decomp count` was introduced for 5 steps workflow when we generate aot code as we execute application with idea that production run will follow the same compilation/deoptimization steps.
Actually I implemented it before we start using TD to trigger compilation. May be this is the reason that 5 steps workflow is slow now when we use TD. I need to check.
Thanks, Vladimir K
On 7/24/24 7:54 AM, Ashutosh Mehra wrote: > During the startup of a quarkus app, I see a particular method that gets C2 compiled almost every time in the production > run with the premain branch . I don't see this happening with the mainline. > The reason this method caught my attention is the significant amount of memory its C2 compilation consumes (between > 25-40 mb) compared to the other compilations. > The method in question is > jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z > > The assembly phase added two entries for this method in the code cache: > > [3.391s][info ][scc,nmethod ] 2631 (L4): Writing nmethod > 'jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z' (comp > level: 4, decomp: 1, has clinit barriers) to Startup Code Cache 'quarkus-getting-started.cds.code' > ... > [7.215s][info ][scc,nmethod ] 4354 (L4): Writing nmethod > 'jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z' (comp > level: 4, decomp: 1) to Startup Code Cache 'quarkus-getting-started.cds.code' > > In the production run the "preload" version was successfully loaded: > > [0.695s][info ][scc,nmethod ] 727 (L4): Preloading nmethod > 'jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z' (decomp: > 0, hash: 0x493f24e2, has clinit barriers) > > The PrintTieredEventslogs indicate this method was also sent for compilation during replay training: > > 0.877593: [force-compile level=4 > [jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z] @-1 > queues=0,0 rate=0.000000 load=0.007812 k=1.00,1.00 total=56,0 mdo=0(0),0(0) max levels=4,0 > compilable=c1,c1-osr,c2,c2-osr status=idle mtd: mdo=18830(8306), 0(0), deps=0] > > Ideally this request should have been fulfilled by the second entry in the code cache. But instead I see this message: > > [0.878s][info ][scc,nmethod] Missing entry for > 'jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z' > (comp_level 4, decomp: 0, hash: 0x493f24e2) > > This is followed by the C2 compilation of the method. > > It looks like the failure to find the second entry is due to a mismatch of the decomp count [0]. The decomp count is > stored in the MethodData. > Is it possible that the method data is not yet installed when replay training is done? If so, is that by design or a bug? > > [0] > https://github.com/openjdk/leyden/blob/ec5eb99653624d02a923a314ce40086753b24... <https://urldefense.com/v3/__https://github.com/openjdk/leyden/blob/ec5eb99653624d02a923a314ce40086753b240fc/src/hotspot/share/code/SCCache.cpp*L938__;Iw!!ACWV5N9M2RV99hQ!MGy9Dev-630RGry8or7lUEBQ4OYoS8cYwif2Z56RKLsqk7kx5YSD65AzSK9yYycpOWGubLGwFIZSImuh3ykRKw$> > <https://github.com/openjdk/leyden/blob/ec5eb99653624d02a923a314ce40086753b24... <https://urldefense.com/v3/__https://github.com/openjdk/leyden/blob/ec5eb99653624d02a923a314ce40086753b240fc/src/hotspot/share/code/SCCache.cpp*L938__;Iw!!ACWV5N9M2RV99hQ!MGy9Dev-630RGry8or7lUEBQ4OYoS8cYwif2Z56RKLsqk7kx5YSD65AzSK9yYycpOWGubLGwFIZSImuh3ykRKw$>> > > Thanks, > - Ashutosh Mehra
The counter's value in "Preloading" and "Missing" messages is taking from current MDO [1] and not from the entry describing AOT code.
That's right.
I remember (Igor may confirm it) that CompilerCounters data is nulled for TD including _nof_decompiles value. That is why we see 0 when loading code.
For "Missing" message, the counter is obtained from Method->method_data() which would return null if the MethodData recorded in the training data is not yet installed in the Method. If Method->method_data() is null, then it prints 0 as the counter value. This is the case I am seeing if I print the method data in the "Missing" message: [0.878s][info ][scc,nmethod ] Missing entry for 'jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z' (md: (nil), comp_level 4, decomp: 0, hash: 0x493f24e2) Notice that md is null. With one step workflow we should ignore decomp count because code is
generated not during execution but based on training data in forked VM - no deoptimization happens there.
hmm, I see some entries marked as "made not entrant" in the output of PrintCompilation during the assembly phase of 1-step workflow. S131 432 % 2 jdk.internal.util.ArraysSupport::unsignedHashCode @ 8 (36 bytes) made not entrant S133 433 2 jdk.internal.util.ArraysSupport::unsignedHashCode (36 bytes) made not entrant S158 506 2 java.util.HashMap::putVal (300 bytes) made not entrant Thanks, - Ashutosh Mehra On Thu, Jul 25, 2024 at 8:18 PM Vladimir Kozlov <vladimir.kozlov@oracle.com> wrote:
I looked on code and for "Preloading" nmethod we ignore decompile counter. We check it only when looking for "normal" AOT code.
The counter's value in "Preloading" and "Missing" messages is taking from current MDO [1] and not from the entry describing AOT code.
I remember (Igor may confirm it) that CompilerCounters data is nulled for TD including _nof_decompiles value. That is why we see 0 when loading code.
It seems we are "lucky" in most cases because we deoptimize (and update counter) may be level 2 (C1) AOT code or "Preloaded" code. That is what I see in my runs which has opposite issue: we record AOT code with `decomp == 0` and when we load it "preload" AOT code was deoptimized and counter updated in MDO and check failed when we tried to load "normal" AOT code.
Anyway. As I said, for one-step workflow we should not use decompile counter. I may actually save only latest version of "normal" nmethod even for 5-step workflow. The question is which nmethod version corresponds to saved MDO? For on-step workflow the answer is simple and we can ignore the counter. For 5 steps I need to check corresponding MDO to save correct version.
Thanks, Vladimir K
[1] https://github.com/openjdk/leyden/blob/premain/src/hotspot/share/code/SCCach... [2] https://github.com/openjdk/leyden/blob/premain/src/hotspot/share/code/SCCach...
On 7/24/24 3:46 PM, Ashutosh Mehra wrote:
Hi Vladimir,
Is this with one step workflow? With one step workflow we should ignore decomp count because code is generated not during execution but based on training data in forked VM - no deoptimization happens there.
Yes, this is with the 1-step workflow.
- Ashutosh Mehra
On Wed, Jul 24, 2024 at 3:56 PM Vladimir Kozlov < vladimir.kozlov@oracle.com <mailto:vladimir.kozlov@oracle.com>> wrote:
Thank you for report, Ashutosh
Is this with one step workflow? With one step workflow we should ignore decomp count because code is generated not during execution but based on training data in forked VM - no deoptimization happens there.
`decomp count` was introduced for 5 steps workflow when we generate aot code as we execute application with idea that production run will follow the same compilation/deoptimization steps.
Actually I implemented it before we start using TD to trigger compilation. May be this is the reason that 5 steps workflow is slow now when we use TD. I need to check.
Thanks, Vladimir K
On 7/24/24 7:54 AM, Ashutosh Mehra wrote: > During the startup of a quarkus app, I see a particular method that gets C2 compiled almost every time in the production > run with the premain branch . I don't see this happening with the mainline. > The reason this method caught my attention is the significant amount of memory its C2 compilation consumes (between > 25-40 mb) compared to the other compilations. > The method in question is > jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z > > The assembly phase added two entries for this method in the code cache: > > [3.391s][info ][scc,nmethod ] 2631 (L4): Writing nmethod > 'jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z' (comp > level: 4, decomp: 1, has clinit barriers) to Startup Code Cache 'quarkus-getting-started.cds.code' > ... > [7.215s][info ][scc,nmethod ] 4354 (L4): Writing nmethod > 'jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z' (comp > level: 4, decomp: 1) to Startup Code Cache 'quarkus-getting-started.cds.code' > > In the production run the "preload" version was successfully loaded: > > [0.695s][info ][scc,nmethod ] 727 (L4): Preloading nmethod > 'jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z' (decomp: > 0, hash: 0x493f24e2, has clinit barriers) > > The PrintTieredEventslogs indicate this method was also sent for compilation during replay training: > > 0.877593: [force-compile level=4 > [jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z] @-1 > queues=0,0 rate=0.000000 load=0.007812 k=1.00,1.00 total=56,0 mdo=0(0),0(0) max levels=4,0 > compilable=c1,c1-osr,c2,c2-osr status=idle mtd: mdo=18830(8306), 0(0), deps=0] > > Ideally this request should have been fulfilled by the second entry in the code cache. But instead I see this message: > > [0.878s][info ][scc,nmethod] Missing entry for > 'jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z' > (comp_level 4, decomp: 0, hash: 0x493f24e2) > > This is followed by the C2 compilation of the method. > > It looks like the failure to find the second entry is due to a mismatch of the decomp count [0]. The decomp count is > stored in the MethodData. > Is it possible that the method data is not yet installed when replay training is done? If so, is that by design or a bug? > > [0] >
https://github.com/openjdk/leyden/blob/ec5eb99653624d02a923a314ce40086753b24... < https://urldefense.com/v3/__https://github.com/openjdk/leyden/blob/ec5eb9965...
> <
https://github.com/openjdk/leyden/blob/ec5eb99653624d02a923a314ce40086753b24... < https://urldefense.com/v3/__https://github.com/openjdk/leyden/blob/ec5eb9965...
> > Thanks, > - Ashutosh Mehra
Ashutosh, please try this patch: diff --git a/src/hotspot/share/code/SCCache.cpp b/src/hotspot/share/code/SCCache.cpp index 551f80e1411..87f2d718320 100644 --- a/src/hotspot/share/code/SCCache.cpp +++ b/src/hotspot/share/code/SCCache.cpp @@ -934,8 +934,7 @@ static bool check_entry(SCCEntry::Kind kind, uint id, uint comp_level, uint deco if (entry->kind() == kind) { assert(entry->id() == id, "sanity"); if (kind != SCCEntry::Code || (!entry->not_entrant() && !entry->has_clinit_barriers() && - entry->comp_level() == comp_level && - (comp_level == CompLevel_limited_profile || entry->decompile() == decomp))) { + (entry->comp_level() == comp_level))) { return true; // Found } } Thanks, Vladimir K On 7/26/24 8:17 AM, Ashutosh Mehra wrote:
The counter's value in "Preloading" and "Missing" messages is taking from current MDO [1] and not from the entry describing AOT code.
That's right.
I remember (Igor may confirm it) that CompilerCounters data is nulled for TD including _nof_decompiles value. That is why we see 0 when loading code.
For "Missing" message, the counter is obtained from Method->method_data() which would return null if the MethodData recorded in the training data is not yet installed in the Method. If Method->method_data() is null, then it prints 0 as the counter value. This is the case I am seeing if I print the method data in the "Missing" message:
[0.878s][info ][scc,nmethod ] Missing entry for 'jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z' (md: (nil), comp_level 4, decomp: 0, hash: 0x493f24e2)
Notice that md is null.
With one step workflow we should ignore decomp count because code is generated not during execution but based on training data in forked VM - no deoptimization happens there.
hmm, I see some entries marked as "made not entrant" in the output of PrintCompilation during the assembly phase of 1-step workflow.
S131 432 % 2 jdk.internal.util.ArraysSupport::unsignedHashCode @ 8 (36 bytes) made not entrant S133 433 2 jdk.internal.util.ArraysSupport::unsignedHashCode (36 bytes) made not entrant S158 506 2 java.util.HashMap::putVal (300 bytes) made not entrant
Thanks, - Ashutosh Mehra
On Thu, Jul 25, 2024 at 8:18 PM Vladimir Kozlov <vladimir.kozlov@oracle.com <mailto:vladimir.kozlov@oracle.com>> wrote:
I looked on code and for "Preloading" nmethod we ignore decompile counter. We check it only when looking for "normal" AOT code.
The counter's value in "Preloading" and "Missing" messages is taking from current MDO [1] and not from the entry describing AOT code.
I remember (Igor may confirm it) that CompilerCounters data is nulled for TD including _nof_decompiles value. That is why we see 0 when loading code.
It seems we are "lucky" in most cases because we deoptimize (and update counter) may be level 2 (C1) AOT code or "Preloaded" code. That is what I see in my runs which has opposite issue: we record AOT code with `decomp == 0` and when we load it "preload" AOT code was deoptimized and counter updated in MDO and check failed when we tried to load "normal" AOT code.
Anyway. As I said, for one-step workflow we should not use decompile counter. I may actually save only latest version of "normal" nmethod even for 5-step workflow. The question is which nmethod version corresponds to saved MDO? For on-step workflow the answer is simple and we can ignore the counter. For 5 steps I need to check corresponding MDO to save correct version.
Thanks, Vladimir K
[1] https://github.com/openjdk/leyden/blob/premain/src/hotspot/share/code/SCCach... <https://urldefense.com/v3/__https://github.com/openjdk/leyden/blob/premain/src/hotspot/share/code/SCCache.cpp*L2868__;Iw!!ACWV5N9M2RV99hQ!MUQ7NwYRlq4ggiM1yNmZO3xJb1OV-cpfGFcPgPZ9zA7vbDKlPtPvsVtsm-gca03z_WTzgi8AUzOWOYcRh3ntYA$> [2] https://github.com/openjdk/leyden/blob/premain/src/hotspot/share/code/SCCach... <https://urldefense.com/v3/__https://github.com/openjdk/leyden/blob/premain/src/hotspot/share/code/SCCache.cpp*L317__;Iw!!ACWV5N9M2RV99hQ!MUQ7NwYRlq4ggiM1yNmZO3xJb1OV-cpfGFcPgPZ9zA7vbDKlPtPvsVtsm-gca03z_WTzgi8AUzOWOYe9CuvRDQ$>
On 7/24/24 3:46 PM, Ashutosh Mehra wrote: > Hi Vladimir, > > Is this with one step workflow? With one step workflow we should ignore decomp count because code is generated not > during execution but based on training data in forked VM - no deoptimization happens there. > > > Yes, this is with the 1-step workflow. > > - Ashutosh Mehra > > > On Wed, Jul 24, 2024 at 3:56 PM Vladimir Kozlov <vladimir.kozlov@oracle.com <mailto:vladimir.kozlov@oracle.com> <mailto:vladimir.kozlov@oracle.com <mailto:vladimir.kozlov@oracle.com>>> wrote: > > Thank you for report, Ashutosh > > Is this with one step workflow? With one step workflow we should ignore decomp count because code is generated not > during execution but based on training data in forked VM - no deoptimization happens there. > > `decomp count` was introduced for 5 steps workflow when we generate aot code as we execute application with idea that > production run will follow the same compilation/deoptimization steps. > > Actually I implemented it before we start using TD to trigger compilation. May be this is the reason that 5 steps > workflow is slow now when we use TD. I need to check. > > Thanks, > Vladimir K > > On 7/24/24 7:54 AM, Ashutosh Mehra wrote: > > During the startup of a quarkus app, I see a particular method that gets C2 compiled almost every time in the > production > > run with the premain branch . I don't see this happening with the mainline. > > The reason this method caught my attention is the significant amount of memory its C2 compilation consumes (between > > 25-40 mb) compared to the other compilations. > > The method in question is > > jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z > > > > The assembly phase added two entries for this method in the code cache: > > > > [3.391s][info ][scc,nmethod ] 2631 (L4): Writing nmethod > > 'jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z' > (comp > > level: 4, decomp: 1, has clinit barriers) to Startup Code Cache 'quarkus-getting-started.cds.code' > > ... > > [7.215s][info ][scc,nmethod ] 4354 (L4): Writing nmethod > > 'jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z' > (comp > > level: 4, decomp: 1) to Startup Code Cache 'quarkus-getting-started.cds.code' > > > > In the production run the "preload" version was successfully loaded: > > > > [0.695s][info ][scc,nmethod ] 727 (L4): Preloading nmethod > > 'jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z' > (decomp: > > 0, hash: 0x493f24e2, has clinit barriers) > > > > The PrintTieredEventslogs indicate this method was also sent for compilation during replay training: > > > > 0.877593: [force-compile level=4 > > [jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z] @-1 > > queues=0,0 rate=0.000000 load=0.007812 k=1.00,1.00 total=56,0 mdo=0(0),0(0) max levels=4,0 > > compilable=c1,c1-osr,c2,c2-osr status=idle mtd: mdo=18830(8306), 0(0), deps=0] > > > > Ideally this request should have been fulfilled by the second entry in the code cache. But instead I see this > message: > > > > [0.878s][info ][scc,nmethod] Missing entry for > > 'jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z' > > (comp_level 4, decomp: 0, hash: 0x493f24e2) > > > > This is followed by the C2 compilation of the method. > > > > It looks like the failure to find the second entry is due to a mismatch of the decomp count [0]. The decomp count is > > stored in the MethodData. > > Is it possible that the method data is not yet installed when replay training is done? If so, is that by design > or a bug? > > > > [0] > > > https://github.com/openjdk/leyden/blob/ec5eb99653624d02a923a314ce40086753b24... <https://urldefense.com/v3/__https://github.com/openjdk/leyden/blob/ec5eb99653624d02a923a314ce40086753b240fc/src/hotspot/share/code/SCCache.cpp*L938__;Iw!!ACWV5N9M2RV99hQ!MUQ7NwYRlq4ggiM1yNmZO3xJb1OV-cpfGFcPgPZ9zA7vbDKlPtPvsVtsm-gca03z_WTzgi8AUzOWOYcE3EES5w$> <https://urldefense.com/v3/__https://github.com/openjdk/leyden/blob/ec5eb9965... <https://urldefense.com/v3/__https://github.com/openjdk/leyden/blob/ec5eb99653624d02a923a314ce40086753b240fc/src/hotspot/share/code/SCCache.cpp*L938__;Iw!!ACWV5N9M2RV99hQ!MGy9Dev-630RGry8or7lUEBQ4OYoS8cYwif2Z56RKLsqk7kx5YSD65AzSK9yYycpOWGubLGwFIZSImuh3ykRKw$>> > > > <https://github.com/openjdk/leyden/blob/ec5eb99653624d02a923a314ce40086753b24... <https://urldefense.com/v3/__https://github.com/openjdk/leyden/blob/ec5eb99653624d02a923a314ce40086753b240fc/src/hotspot/share/code/SCCache.cpp*L938__;Iw!!ACWV5N9M2RV99hQ!MUQ7NwYRlq4ggiM1yNmZO3xJb1OV-cpfGFcPgPZ9zA7vbDKlPtPvsVtsm-gca03z_WTzgi8AUzOWOYcE3EES5w$> <https://urldefense.com/v3/__https://github.com/openjdk/leyden/blob/ec5eb9965... <https://urldefense.com/v3/__https://github.com/openjdk/leyden/blob/ec5eb99653624d02a923a314ce40086753b240fc/src/hotspot/share/code/SCCache.cpp*L938__;Iw!!ACWV5N9M2RV99hQ!MGy9Dev-630RGry8or7lUEBQ4OYoS8cYwif2Z56RKLsqk7kx5YSD65AzSK9yYycpOWGubLGwFIZSImuh3ykRKw$>>> > > > > Thanks, > > - Ashutosh Mehra >
Yes, this seems to work. I don't see the "Missing entry" message for StackMapGenerator::processBlock any more, and it is now able to locate the c2 compiled version in the code cache and use it. [0.911s][info ][scc,nmethod ] 2662 (L4): Reading nmethod 'jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z' (md: 0x7f99f00c5f68, decomp: 0, hash: 0x493f24e2) Thanks, - Ashutosh Mehra On Fri, Jul 26, 2024 at 12:28 PM Vladimir Kozlov <vladimir.kozlov@oracle.com> wrote:
Ashutosh, please try this patch:
diff --git a/src/hotspot/share/code/SCCache.cpp b/src/hotspot/share/code/SCCache.cpp index 551f80e1411..87f2d718320 100644 --- a/src/hotspot/share/code/SCCache.cpp +++ b/src/hotspot/share/code/SCCache.cpp @@ -934,8 +934,7 @@ static bool check_entry(SCCEntry::Kind kind, uint id, uint comp_level, uint deco if (entry->kind() == kind) { assert(entry->id() == id, "sanity"); if (kind != SCCEntry::Code || (!entry->not_entrant() && !entry->has_clinit_barriers() && - entry->comp_level() == comp_level && - (comp_level == CompLevel_limited_profile || entry->decompile() == decomp))) { + (entry->comp_level() == comp_level))) { return true; // Found } }
Thanks, Vladimir K
On 7/26/24 8:17 AM, Ashutosh Mehra wrote:
The counter's value in "Preloading" and "Missing" messages is taking
from current MDO [1] and not from the entry
describing AOT code.
That's right.
I remember (Igor may confirm it) that CompilerCounters data is
nulled for TD including _nof_decompiles value. That is
why we see 0 when loading code.
For "Missing" message, the counter is obtained from
Method->method_data() which would return null if the MethodData
recorded in the training data is not yet installed in the Method. If Method->method_data() is null, then it prints 0 as the counter value. This is the case I am seeing if I print the method data in the "Missing" message:
[0.878s][info ][scc,nmethod ] Missing entry for
'jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z' (md:
(nil), comp_level 4, decomp: 0, hash: 0x493f24e2)
Notice that md is null.
With one step workflow we should ignore decomp count because code is generated not during execution but based on training data in forked VM - no deoptimization happens there.
hmm, I see some entries marked as "made not entrant" in the output of PrintCompilation during the assembly phase of 1-step workflow.
S131 432 % 2 jdk.internal.util.ArraysSupport::unsignedHashCode @ 8 (36 bytes) made not entrant S133 433 2 jdk.internal.util.ArraysSupport::unsignedHashCode (36 bytes) made not entrant S158 506 2 java.util.HashMap::putVal (300 bytes) made not entrant
Thanks, - Ashutosh Mehra
On Thu, Jul 25, 2024 at 8:18 PM Vladimir Kozlov < vladimir.kozlov@oracle.com <mailto:vladimir.kozlov@oracle.com>> wrote:
I looked on code and for "Preloading" nmethod we ignore decompile counter. We check it only when looking for "normal" AOT code.
The counter's value in "Preloading" and "Missing" messages is taking from current MDO [1] and not from the entry describing AOT code.
I remember (Igor may confirm it) that CompilerCounters data is nulled for TD including _nof_decompiles value. That is why we see 0 when loading code.
It seems we are "lucky" in most cases because we deoptimize (and update counter) may be level 2 (C1) AOT code or "Preloaded" code. That is what I see in my runs which has opposite issue: we record AOT code with `decomp == 0` and when we load it "preload" AOT code was deoptimized and counter updated in MDO and check failed when we tried to load "normal" AOT code.
Anyway. As I said, for one-step workflow we should not use decompile counter. I may actually save only latest version of "normal" nmethod even for 5-step workflow. The question is which nmethod version corresponds to saved MDO? For on-step workflow the answer is simple and we can ignore the counter. For 5 steps I need to check corresponding MDO to save correct version.
Thanks, Vladimir K
[1] https://github.com/openjdk/leyden/blob/premain/src/hotspot/share/code/SCCach... < https://urldefense.com/v3/__https://github.com/openjdk/leyden/blob/premain/s...
[2] https://github.com/openjdk/leyden/blob/premain/src/hotspot/share/code/SCCach... < https://urldefense.com/v3/__https://github.com/openjdk/leyden/blob/premain/s...
On 7/24/24 3:46 PM, Ashutosh Mehra wrote: > Hi Vladimir, > > Is this with one step workflow? With one step workflow we should ignore decomp count because code is generated not > during execution but based on training data in forked VM - no deoptimization happens there. > > > Yes, this is with the 1-step workflow. > > - Ashutosh Mehra > > > On Wed, Jul 24, 2024 at 3:56 PM Vladimir Kozlov < vladimir.kozlov@oracle.com <mailto:vladimir.kozlov@oracle.com> <mailto:vladimir.kozlov@oracle.com <mailto: vladimir.kozlov@oracle.com>>> wrote: > > Thank you for report, Ashutosh > > Is this with one step workflow? With one step workflow we should ignore decomp count because code is generated not > during execution but based on training data in forked VM - no deoptimization happens there. > > `decomp count` was introduced for 5 steps workflow when we generate aot code as we execute application with idea that > production run will follow the same compilation/deoptimization steps. > > Actually I implemented it before we start using TD to trigger compilation. May be this is the reason that 5 steps > workflow is slow now when we use TD. I need to check. > > Thanks, > Vladimir K > > On 7/24/24 7:54 AM, Ashutosh Mehra wrote: > > During the startup of a quarkus app, I see a particular method that gets C2 compiled almost every time in the > production > > run with the premain branch . I don't see this happening with the mainline. > > The reason this method caught my attention is the significant amount of memory its C2 compilation consumes (between > > 25-40 mb) compared to the other compilations. > > The method in question is > > jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z > > > > The assembly phase added two entries for this method in the code cache: > > > > [3.391s][info ][scc,nmethod ] 2631 (L4): Writing nmethod > >
'jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z'
> (comp > > level: 4, decomp: 1, has clinit barriers) to Startup Code
Cache 'quarkus-getting-started.cds.code'
> > ... > > [7.215s][info ][scc,nmethod ] 4354 (L4): Writing nmethod > >
'jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z'
> (comp > > level: 4, decomp: 1) to Startup Code Cache
'quarkus-getting-started.cds.code'
> > > > In the production run the "preload" version was
successfully loaded:
> > > > [0.695s][info ][scc,nmethod ] 727 (L4): Preloading nmethod > >
'jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z'
> (decomp: > > 0, hash: 0x493f24e2, has clinit barriers) > > > > The PrintTieredEventslogs indicate this method was also
sent for compilation during replay training:
> > > > 0.877593: [force-compile level=4 > >
[jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z] @-1
> > queues=0,0 rate=0.000000 load=0.007812 k=1.00,1.00
total=56,0 mdo=0(0),0(0) max levels=4,0
> > compilable=c1,c1-osr,c2,c2-osr status=idle mtd:
mdo=18830(8306), 0(0), deps=0]
> > > > Ideally this request should have been fulfilled by the
second entry in the code cache. But instead I see this
> message: > > > > [0.878s][info ][scc,nmethod] Missing entry for > >
'jdk.internal.classfile.impl.StackMapGenerator::processBlock(Ljdk/internal/classfile/impl/RawBytecodeHelper;)Z'
> > (comp_level 4, decomp: 0, hash: 0x493f24e2) > > > > This is followed by the C2 compilation of the method. > > > > It looks like the failure to find the second entry is due
to a mismatch of the decomp count [0]. The
decomp count is > > stored in the MethodData. > > Is it possible that the method data is not yet installed
when replay training is done? If so, is that by
design > or a bug? > > > > [0] > > >
https://github.com/openjdk/leyden/blob/ec5eb99653624d02a923a314ce40086753b24... < https://urldefense.com/v3/__https://github.com/openjdk/leyden/blob/ec5eb99653624d02a923a314ce40086753b240fc/src/hotspot/share/code/SCCache.cpp*L938__;Iw!!ACWV5N9M2RV99hQ!MUQ7NwYRlq4ggiM1yNmZO3xJb1OV-cpfGFcPgPZ9zA7vbDKlPtPvsVtsm-gca03z_WTzgi8AUzOWOYcE3EES5w$> < https://urldefense.com/v3/__https://github.com/openjdk/leyden/blob/ec5eb9965... < https://urldefense.com/v3/__https://github.com/openjdk/leyden/blob/ec5eb9965...
> > > < https://github.com/openjdk/leyden/blob/ec5eb99653624d02a923a314ce40086753b24... < https://urldefense.com/v3/__https://github.com/openjdk/leyden/blob/ec5eb99653624d02a923a314ce40086753b240fc/src/hotspot/share/code/SCCache.cpp*L938__;Iw!!ACWV5N9M2RV99hQ!MUQ7NwYRlq4ggiM1yNmZO3xJb1OV-cpfGFcPgPZ9zA7vbDKlPtPvsVtsm-gca03z_WTzgi8AUzOWOYcE3EES5w$> < https://urldefense.com/v3/__https://github.com/openjdk/leyden/blob/ec5eb9965... < https://urldefense.com/v3/__https://github.com/openjdk/leyden/blob/ec5eb9965...
> > > > Thanks, > > - Ashutosh Mehra >
participants (2)
-
Ashutosh Mehra
-
Vladimir Kozlov