premain: Possible solutions to use runtime G1 region grain size in AOT code (JDK-8335440)
I have prototyped two (aarch64-specific) solutions for JDK-8335440, both of which fix the G1 write barrier in AOT code to use the runtime region grain size. Both solutions make AOT code resilient to any change in max heap between assembly and production runs.

The problem arises because ergonomics uses the heap size to derive a G1 region size, and the latter size determines what shift is needed to convert a store address to a card table index. In currently generated nmethod and *stub* code the shift count is installed as an immediate operand of a generated shift instruction. In AOT code the shift count needs to be appropriate to the current runtime region size. AOT code can resolve this requirement in two ways. It can load the shift from a well-known location and supply the shift count as a register operand. Alternatively, it can employ load-time rewriting of the instruction stream to update the immediate operand.

Both current solutions rely on loading rather than instruction rewriting. The first solution installs the shift count in a (byte) field added to every Java thread. It modifies barrier generation when the SCCache is open for writing to load the shift count from the thread field. This solution requires no relocation when the AOT stub/nmethod is loaded from the cache since the load is always at a fixed offset from the thread register. If the SCCache is not open for writing the count is generated as normal, i.e. as an immediate operand.

https://github.com/adinn/leyden/compare/premain...adinn:leyden:JDK-8335440-l...

The second solution modifies barrier generation when the SCCache is open for writing to load the shift count from a runtime field, G1HeapRegion::LogHRGrainSize, i.e. the same field that determines the immediate count used for normal generation. In order to make this visible to the compilers and SCC address table the address of this field is exported via the card table. This solution requires the AOT code to reference the target address using a runtime address relocation. Once again, if the SCCache is not open for writing the count is generated as normal, i.e. as an immediate operand.

https://github.com/adinn/leyden/compare/premain...adinn:leyden:JDK-8335440-l...

I ran the javac benchmark with each of these solutions compared against the equivalent premain build. Neither indicated any noticeable change in execution time -- at least not within the margins of error of the test runs (which, on my M2 machine, were +/- 5 msecs in a 95 msec run). A better test might be to take a long-running app and see if the change to the AOT barrier code introduced any change in overall execution time.

I implemented these two solutions first because neither of them requires implementing any new relocations. There are two alternatives which would require new relocations that may still be worth investigating. Option three is to mark the shift instruction with a new relocation. Patching of the relocation address would simply require regenerating it with an immediate that matches (log2 of the) current region size.

The fourth option is to load the shift count from a data area associated with the current blob. In the case of an nmethod this would be the nmethod constants section. In the case of a generated stub this would have to be a dedicated memory address in its associated blob. Either way the data location would need to be marked with a new relocation. Patching of the relocation address would simply require copying the (log2 of the) current region size into the data area.

I'll hold off on adding these solutions (also on implementing the x86 versions -- well, more likely, letting Ashu provide them ;-) until I get some feedback on these first two. I'll also see if I can get any better indication of whether the performance of the first two solutions is an issue.

I think solution one is by far the simplest, resolving the immediate issue with least fuss (note that I poked the necessary data byte into a hole in the thread record so it has no space implications). However, if we end up having to tweak generated code to deal with other config changes, the alternatives may be worth investing in as they might scale better.

regards,

Andrew Dinn -----------
On 7/15/24 9:23 AM, Andrew Dinn wrote:
Both current solutions rely on loading rather than instruction rewriting. The first solution installs the shift count in a (byte) field added to every Java thread. It modifies barrier generation when the SCCache is open for writing to load the shift count from the thread field. This solution requires no relocation when the AOT stub/nmethod is loaded from the cache since the load is always at a fixed offset from the thread register. If the SCCache is not open for writing the count is generated as normal i.e. as an immediate operand.
https://github.com/adinn/leyden/compare/premain...adinn:leyden:JDK-8335440-l...
My main concern with the first solution is that we add a field to the `Thread` class which will be used only at startup by AOT code. It is also a load which may miss the CPU's cache. On the other hand, it is a "straight-forward" simple change.
The second solution modifies barrier generation when the SCCache is open for writing to load the shift count from a runtime field, G1HeapRegion::LogHRGrainSize i.e. the same field that determines the immediate count used for normal generation. In order to make this visible to the compilers and SCC address table the address of this field is exported via the card table. This solution requires the AOT code to reference the target address using a runtime address relocation. Once again, if the SCCache is not open for writing the count is generated as normal i.e. as an immediate operand.
https://github.com/adinn/leyden/compare/premain...adinn:leyden:JDK-8335440-l...
This is a more complicated change, which looks like a half-step toward having a separate relocation for it.
I implemented these two solutions first because neither of them requires implementing any new relocations. There are two alternatives which would require new relocations that may still be worth investigating. Option three is to mark the shift instruction with a new relocation. Patching of the relocation address would simply require regenerating it with an immediate that matches (log2 of the) current region size.
Please, try this approach (new relocation). Thanks, Vladimir K
On 7/15/24 9:23 AM, Andrew Dinn wrote:
The second solution modifies barrier generation when the SCCache is open for writing to load the shift count from a runtime field, G1HeapRegion::LogHRGrainSize i.e. the same field that determines the immediate count used for normal generation. In order to make this visible to the compilers and SCC address table the address of this field is exported via the card table. This solution requires the AOT code to reference the target address using a runtime address relocation. Once again, if the SCCache is not open for writing the count is generated as normal i.e. as an immediate operand.
Is the G1HeapRegion::LogHRGrainSize loaded with a PC offset?

ldr grain, [pc, #5678]

I suppose this requires us to put multiple copies of G1HeapRegion::LogHRGrainSize inside the AOT code, as there's a limit on the offset. But we will be patching fewer places than every site that needs to know the grain size.

Thanks
- Ioi
Hi Ioi,

On 16/07/2024 17:33, ioi.lam@oracle.com wrote:
On 7/15/24 9:23 AM, Andrew Dinn wrote:
. . . The second solution modifies barrier generation when the SCCache is open for writing to load the shift count from a runtime field, G1HeapRegion::LogHRGrainSize i.e. the same field that determines the immediate count used for normal generation. In order to make this visible to the compilers and SCC address table the address of this field is exported via the card table. This solution requires the AOT code to reference the target address using a runtime address relocation. Once again, if the SCCache is not open for writing the count is generated as normal i.e. as an immediate operand.
Is the G1HeapRegion::LogHRGrainSize loaded with PC offset?
ldr grain, [pc, #5678]
That's not what this option does. The barrier loads the grain size indirectly via a constant static field address, i.e. via address &G1HeapRegion::LogHRGrainSize (well, actually, the constant is determined by whatever address is reported by the barrier card table, but effectively it is &G1HeapRegion::LogHRGrainSize). So the barrier uses a sequence like this

movz reg #<16bit>
movk reg #<16bit>, #16
movk reg #<16bit>, #32
ldrb reg, reg
. . .
lsr reg2, reg, reg2

The 16 bit quantities compose to the address of the field. The 3 x mov sequence is marked with a runtime relocation which ensures that it is updated when generated code is restored from the SCCache. That requires the field address to be inserted in the SCC address table's list of external addresses.

This scheme requires repeating that series of 3 x mov + ldrb instructions at every object field store in a given compiled method. That also implies a runtime relocation for each such sequence when the code is restored from the SCCache.

With C2 the barrier manifests as a (Set dst con) for a special ConP value (operand con has type immRegionGrainShift) feeding a LoadB. I guess C2 might conceivably be able to optimize away some of the repeat movz/k and ldrb sequences if it is able to keep the address or byte value in a register or spill slot, but I would not expect that to be likely.
I suppose this requires us to put multiple copies of G1HeapRegion::LogHRGrainSize inside the AOT code, as there's a limit on the offset. But we will be patching fewer places than every site that needs to know the grain size.

I think what you are suggesting here is what I described as option 4, i.e. we put the grain size in the nmethod const section (or in a dedicated data location for a non-nmethod blob) and insert a pc-relative load in the barrier to feed the lsr.
With AOT code this would require a special relocation to mark the constants area slot (or the non-nmethod blob data slot), let's call it reloc_grain_shift_const. It would patch the constant to whatever value field G1HeapRegion::LogHRGrainSize has in the current runtime (or rather to whatever grain size is reported by the barrier card table). We don't have such a reloc at present. We do have an existing reloc for a runtime data address, which is why I implemented option 2 first (to work out where I would need to tweak the compilers and barrier set assemblers plus auxiliary classes).

With option 4 I believe we will only need one occurrence of the constant. On AArch64 we would use either adr or adrp + add to install a pc-relative address into a register and then an ldrb via that register.

adr reg, #<21bits>
ldrb reg, reg
...
lsr reg2, reg, reg2

or

adrp reg, #<21bits>  # selects 4K-aligned page
add reg, #<12bits>
ldrb reg, reg
...
lsr reg2, reg, reg2

The adr/adrp instructions do not need relocating, which is why scheme 4 would only require 1 relocation per nmethod (or non-nmethod blob).

Option 3 involves generating the normal barrier

lsr reg, reg, #imm

The difference is that for AOT code we would mark the instruction with a new relocation, let's call it reloc_grain_shift_immediate. Patching for this reloc would assert that the corresponding instruction is a shift and that the current GC barrier set is using a card table. It would update the immediate operand with whatever grain size shift was reported by the card table.

Like scheme 2 this would require a reloc for every object field write in an nmethod (or non-nmethod blob).

regards,

Andrew Dinn -----------
We don't have such a reloc at present..
What about section_word_Relocation, so we can put the grain value into the constants section?

Thanks, Vladimir K

On 7/17/24 3:15 AM, Andrew Dinn wrote:
. . .
On 17/07/2024 18:27, Vladimir Kozlov wrote:
We don't have such a reloc at present..
What about section_word_Relocation so we can put grain value into constants section?
I agree that when compiling an nmethod we would need to use a section_word_type reloc to mark the adrp that accesses the constant. That would ensure that the offset used by the adrp is kept consistent across buffer resizes and at install, when the displacement may change.

However, what I was talking about was a new reloc, needed only when the SCCache restores code, which would mark the constant itself. When AOT code is restored we need to ensure any such constant is rewritten using the runtime grain size.

We could attempt to do the rewrite of the constant as a side-effect of processing the section_word_type reloc during code restore. However, we would need to know for sure that the constant being accessed by the adrp was definitely the grain size. Is that what you were thinking of, Vladimir?

Of course that would not work for stubs which need to include a barrier and a reference to the barrier shift (I believe this only applies for some of the memory copy stubs). In this case we would have to load the constant from a data slot allocated in amongst the instructions. So I think we would not be able to identify the location of the constant with a section_word_type reloc.

regards,

Andrew Dinn -----------
--
regards,
Andrew Dinn
-----------
Red Hat Distinguished Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill
On 7/18/24 4:00 AM, Andrew Dinn wrote:
On 17/07/2024 18:27, Vladimir Kozlov wrote:
> We don't have such a reloc at present..
What about section_word_Relocation so we can put grain value into constants section?
I agree that when compiling an nmethod we would need to use a section_word_type reloc to mark the adrp that accesses the constant. That would ensure that the offset used by the adrp is kept consistent across buffer resizes and at install when the displacement may change.
However, what I was talking about was a new reloc, needed only when the SCCache restores code, which would mark the constant itself. When AOT code is restored we need to ensure any such constant is rewritten using the runtime grain size.
We could attempt to do the rewrite of the constant as a side-effect of processing the section_word_type reloc during code restore. However, we would need to know for sure that the constant being accessed by the adrp was definitely the grain size. Is that what you were thinking of, Vladimir?
Of course that would not work for stubs which need to include a barrier and a reference to the barrier shift (I believe this only applies for some of the memory copy stubs). In this case we would have to load the constant from a data slot allocated in amongst the instructions. So, we I think would not be able to identify the location of the constant with a section_word_type reloc.
Yes, you are right, section_word_type will not work. What about allocating a word in the CodeCache, as we do for some intrinsic stub tables? You would need to generate it only once and could use a runtime_type relocation to access it. It all comes down to loading via an existing relocation vs a specialized relocation for an immediate value (option three). I would like to see how complex option three is.

Thanks, Vladimir K
regards,
Andrew Dinn -----------
On 7/17/24 3:15 AM, Andrew Dinn wrote:
Hi Ioi,
On 16/07/2024 17:33, ioi.lam@oracle.com wrote:
On 7/15/24 9:23 AM, Andrew Dinn wrote:
. . . The second solution modifies barrier generation when the SCCache is open for writing to load the shift count from a runtime field, G1HeapRegion::LogHRGrainSize i.e. the same field that determines the immediate count used for normal generation. In order to make this visible to the compilers and SCC address table the address of this field is exported via the card table. This solution requires the AOT code to reference the target address using a runtime address relocation. Once again, if the SCCache is not open for writing the count is generated as normal i.e. as an immediate operand.
Is the G1HeapRegion::LogHRGrainSize loaded with PC offset?
ldr grain, [pc, #5678]
That's not what this option does. The barroer loads the grain size indirectly via a constant static field address, i.e. via address &G1HeapRegion::LogHRGrainSize (well, actually, the constant is determined by whatever address is reported by the barrier card table but effectively it is &G1HeapRegion::LogHRGrainSize). So the barrier includes uses a sequence like this
movz reg #<16bit> movk reg #<16bit>, #16 movk reg #<16bit>, #32 ldrb reg, reg . . . lsr reg2, reg, reg2
The 16 bit quantities compose to the address of the field. The 3 x mov sequence is marked with a runtime relocation which ensures that it is updated when generated code is restored from the SCCache. That requires the field address to be inserted in the SCC address table's list of external addresses.
This scheme requires repeating that series of 3 x mov + ldrb instructions at every object field store in a given compiled method. That also implies a runtime relocation for each such sequence when the code is restored from the SCCache.
With C2 the barrier manifests as a (Set dst con) for a special ConP value (operand con has type immRegionGrainShift) feeding a LoadB. I guess C2 might conceivably be able to optimize away some of the repeat movz/k and ldrb sequences if it is able to keep the address or byte value in a register or spill slot but I would not expect that to be likely.
I suppose this requires us to put multiple copies of G1HeapRegion::LogHRGrainSize inside the AOT code, as there's a limit on the offset. But we would be patching fewer places than every site that needs to know the grain size.

I think what you are suggesting here is what I described as option 4, i.e. we put the grain size in the nmethod const section (or in a dedicated data location for a non-nmethod blob) and insert a pc-relative load in the barrier to feed the lsr.
With AOT code this would require a special relocation to mark the constants area slot (or the non-nmethod blob data slot), let's call it reloc_grain_shift_const. It would patch the constant to whatever value field G1HeapRegion::LogHRGrainSize has in the current runtime (or rather to whatever grain size is reported by the barrier card table). We don't have such a reloc at present. We do have an existing reloc for a runtime data address, which is why I implemented option 2 first (to work out where I would need to tweak the compilers and barrier set assemblers plus auxiliary classes).
With option 4 I believe we will only need one occurrence of the constant. On AArch64 we would use either adr or adrp + add to install a pc-relative address into a register and then an ldrb via that register.
adr reg, #<21bits>
ldrb reg, [reg]
. . .
lsr reg2, reg2, reg
or
adrp reg, #<21bits>   # selects the 4KB-aligned (1 << 12) page
add reg, reg, #<12bits>
ldrb reg, [reg]
. . .
lsr reg2, reg2, reg
The adr/adrp instructions do not need relocating which is why scheme 4 would only require 1 relocation per nmethod (or non-nmethod blob).
Option 3 involves generating the normal barrier shift

lsr reg2, reg2, #imm
The difference is that for AOT code we would mark the instruction with a new relocation, let's call it reloc_grain_shift_immediate. Patching for this reloc would assert that the corresponding instruction is a shift and that the current GC barrier set is using a card table. It would update the immediate operand with whatever grain size shift was reported by the card table.
Like scheme 2 this would require a reloc for every object field write in an nmethod (or non-nmethod blob).
regards,
Andrew Dinn -----------
Hi Vladimir,

On 18/07/2024 17:15, Vladimir Kozlov wrote:
What about allocating a word in the CodeCache as we do for some intrinsics stubs tables? You will need to generate it only once and can use a runtime_type relocation to access it.
I am looking into that now. I've been working on something else that might interest you ...
It is all about loading with an existing relocation vs a specialized relocation for the immediate value (option 3). I would like to see how complex option 3 is.

I have an implementation of option 3 in my JDK-8335440-aot-reloc branch. NB it is based on a slightly out-of-date premain but the implementation indicates what is involved even without a rebase (I'll do that soon):
https://github.com/openjdk/leyden/compare/premain...adinn:leyden:JDK-8335440...

Basically this solution emits an aot_reloc for the GC barrier left shift immediate instruction when we are generating AOT code (StoreCachedCode == true). When loading AOT code (LoadCachedCode == true) any left shift immediate tagged with an aot_reloc has its operand patched with the log region grain size from the current JVM. I use a format field to identify what reloc is required so the same model will support other AOT load-time relocs if need be.

The implementation makes it fairly obvious that we could use the same technique of tagging with an aot_reloc at StoreCachedCode time and patching at LoadCachedCode time for any other instruction (or indivisible instruction sequence) which 1) encodes some runtime constant from the original JVM and 2) is able to be reset by patching it to use the value in the new JVM.

It is worth noting that for both C1 and C2 generated code I had to set a tag on the relevant node (the LIR_Op in C1, the URShiftNode in C2) in order to mark it as a relocatable instruction. Later on, in the generation phase, I detect the mark and emit a reloc to the code buffer. This works because none of the IR transforms modify the left shift node by merging it into some other operation or by merging another operation into it (I checked, but this is a contingent fact based on the current state of the code).

Clearly, in the case of barrier patching we could finesse the above problem by generating the GC barrier late enough to bypass graph normalization and back end reductions. However, if we want to use a similar technique to AOT-patch other instructions or sequences then we will need a more reliable way of ensuring that the relocatable instructions are not replaced or merged during normalization/final reduction.

regards,
Andrew Dinn
-----------
The implementation makes it fairly obvious that we could use the same technique of tagging with an aot_reloc at StoreCachedCode time and patching at LoadCachedCode time for any other instruction (or indivisible instruction sequence) which 1) encodes some runtime constant from the original JVM and 2) is able to be reset by patching it to use the value in the new JVM.
I agree. As we discussed in the meeting we may use it to patch compressed oops/klass code too.

We have the same issue with card_table_address. I actually don't know why current AOT code works??? Maybe because it is the same value in training and production runs. But we need to fix it too. It is a constant node which is embedded into an AddP node: https://github.com/openjdk/leyden/blob/premain/src/hotspot/share/gc/shared/c...

I marked it only in platform specific code: https://github.com/openjdk/leyden/blob/premain/src/hotspot/cpu/x86/gc/shared...

Be careful with using `format()` in shared code because it depends on the platform specific setting `format_width` in `relocInfo_<arch>.hpp` (32-bit arm has format_width = 0).

Thanks,
Vladimir K

On 7/23/24 9:12 AM, Andrew Dinn wrote:
On 23/07/2024 18:44, Vladimir Kozlov wrote:
I agree. As we discussed on meeting we may use it to patch compressed oops/klass code too.
Yes, we could potentially do something similar to adjust the shift and base adjustment employed when generating a narrow oop or klass encode or decode. However, that would only be useful within certain limits.

1) We can only cater for a change from one narrow-oop/klass base+shift configuration to another base+shift configuration. Patching code to switch from narrow oops/klass pointers to full width (or vice versa) would require adjusting not just the oop/klass load/store but also other instructions that hard-wire header/payload field sizes and offsets based on the narrow/full width layout assumptions. Likewise it would require resizing and repacking objects stored in the CDS mapped heap section.

2) Even if we restrict ourselves to the case where we retain narrow oops and simply allow for a change of base and/or shift (i.e. keep the object layouts the same), that will only work if pointers installed in mapped CDS heap objects are also rewritten to use the runtime encoding.

3) When generating AOT code for the encode and decode we would have to allow for the worst case, i.e. leave room (nops) for both heap-base adjustment and shift even when they are not needed at generate time.
We have the same issue with card_table_address. I actually don't know why current AOT code works??? May be because it is the same value in training and production runs. But we need to fix it too.
It is constant node which is embedded into AddP node: https://github.com/openjdk/leyden/blob/premain/src/hotspot/share/gc/shared/c...
I believe C2 is ok. On AArch64 the C2 back end has an operand called immByteMapBase that matches any use of byte_map_base as a pointer constant. A corresponding rule ensures that any attempt to load that constant pointer is done by calling __ load_byte_map_base(), and the macro assembler uses an ExternalAddress when StoreCachedCode is set.
I marked it only in platform specific code: https://github.com/openjdk/leyden/blob/premain/src/hotspot/cpu/x86/gc/shared...
I also think there is no mystery as to why this works on AArch64. The equivalent code in cardTableBarrierSetAssembler_aarch64.cpp also calls __ load_byte_map_base(). However, that said, looking at the barrier card table write it seems the card table shift is another value that the user might decide to reconfigure between assembly and production runs. The shift is derived from GCCardSizeInBytes, which is a GC global command line option and so susceptible to being reset. We could use an aot_reloc with a different format to tag card table shift lsr instructions and patch them at shared code load to use the current runtime GCCardSizeInBytes. Of course, this is not as urgent a problem as the grain size shift because the card size only ever changes thanks to an explicit command line request, whereas the grain size may be changed ergonomically when -Xmx is passed.
Be careful with using `format()` in shared code because it depends on platform specific setting `format_width` in `relocInfo_<arch>.hpp` (32-bit arm has format_width = 0).
Oh, that's a nuisance. I was concerned that I was perhaps using the wrong reloc model here. Should I be encoding info about what/how to patch the target instruction(s) using reloc data?

regards,
Andrew Dinn
-----------
Thank you for checking code for card_table_address.
Oh, that's a nuisance. I was concerned that I was perhaps using the wrong reloc model here. Should I be encoding info about what/how to patch the target instruction(s) using reloc data?
I am currently thinking about adding an indication of what a relocation is pointing at: blob, stubs, external address, string. Currently we are looking through all our address tables in SCCache to find a match. It would be nice if we could get this information from the relocation. I also thought about using format for it but realized it is PD and it reserves bits in the RelocInfo word. On the other hand, we can add another 12 (31 - 18 - 1) new relocations without using additional bits.

In short, add a new relocation if you need it.

Thanks,
Vladimir K

On 7/24/24 7:40 AM, Andrew Dinn wrote:
I now have a fourth solution to the region resizing problem which I think is probably the nicest looking one so far:

https://github.com/adinn/leyden/compare/premain...adinn:leyden:JDK-8335440-l...

Opinions welcome!

This version relies on a code cache AOT data area, generated into the initial stubs blob, which stores runtime constants that parameterize operation of AOT code. The AOT data area is initialized with the current runtime constants immediately after universe init. AOT code loads the constants indirectly via the region base address rather than embedding them into instructions. Non-AOT code can continue to use these constants as immediate values.

With this patch I have included the region grain size shift and the card table shift in the data area. So, this patch ensures that AOT code loads both grain shift and card shift counts rather than using immediates. I tested it by passing -XX:GCCardSizeInBytes=128 -Xmx4M to a production run and it works fine.

The only address that needs relocating with this patch is the AOT data area base. Note that it was useful using this to implement loading of two AOT code parameters as it highlights a general problem that is not faced with only one field.

I decided to register only the base address of the AOT data area with the code cache as a relocatable (external) address because this is more robust than requiring registration of the address of every data field embedded in the AOT data area. However, in order for the AOT cache to be able to relocate loads from the data area they need to be implemented by installing the base address in a register and then adding the offset before executing the load.

This is fine for stub code where the generator can simply plant a load effective address (lea) of a pointer constant into a register and then add in the offset. However, I know C2 Value optimizations will certainly coalesce a pointer constant load followed by an add into an updated pointer constant load.
I think this can also happen with C1 (it doesn't at the moment because in the current C1 code the grain shift load -- at offset 0 -- happens inline, while the card shift load -- at offset 4 -- only happens in a callout to stub code). I fixed this general problem by ensuring that the C1 and C2 back ends detect pointer constants that lie anywhere in the data area address *range*. For (only) those cases the lea is implemented by generating a base address load plus an offset add.

It would be nice to be able to fold the offset into the subsequent load. I'll look into that as a follow-up. Either way, I think this looks like the simplest and most robust solution to ensuring that AOT code is correctly parameterized at runtime. It is particularly nice that it doesn't need any new relocations.

regards,
Andrew Dinn
-----------

On 24/07/2024 20:43, Vladimir Kozlov wrote:
--
regards,
Andrew Dinn
-----------
Red Hat Distinguished Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill
Very nice. I like it.

Thanks,
Vladimir K

On 7/31/24 8:23 AM, Andrew Dinn wrote:
Hi Vladimir,

I have done a port of Andrew D's patch for version 4 of the solution to x86-64. Here is the link to the change sets on top of Andrew's original patch for v4:

https://github.com/openjdk/leyden/compare/premain...ashu-mehra:leyden:JDK-83...

Can you please review the patch? Any feedback is appreciated. A few things worth calling out:

1. As Andrew D mentioned in his previous mail, C2 coalesces a pointer constant load followed by an add into an updated pointer constant load. And because C2 on x86 generates an external_word reloc for every constant pointer loaded, it means we would have to add every address in the AOTRuntimeConstants range to the SCAddressTable. To avoid this, I used the same mechanism as aarch64 to identify the AOTRuntimeConstants addresses in the back end.

2. C1 doesn't support use of the 3-operand shift instructions (sarx/shlx/shrx). I extended C1 to lower shift ops using these instructions if possible. Does this make sense?

3. In G1BarrierSetAssembler::g1_write_barrier_post() the tmp2 register is not available for use. The caller G1BarrierSetAssembler::oop_store_at uses tmp2 for storing the uncompressed oop. Also, rscratch2 is not available as it is passed as tmp1, which is also in use. So we don't really have any free registers left. In my patch I resorted to using rscratch1. Is there a need to spill that register to the stack before using it?

Thanks,
- Ashutosh Mehra

On Wed, Jul 31, 2024 at 4:08 PM Vladimir Kozlov <vladimir.kozlov@oracle.com> wrote:
Very nice. I like it.
Thanks, Vladimir K
On 7/31/24 8:23 AM, Andrew Dinn wrote:
I now have a fourth solution to the region resizing problem which I think is probably the nicest looking one so far:
https://urldefense.com/v3/__https://github.com/adinn/leyden/compare/premain....
Opinions welcome!
This version relies on a code cache AOT data area, generated into the initial stubs blob, which stores runtime constants that parameterize operation of AOT code. The AOT data area is initialized with the current runtime constants immediately after universe init. AOT code loads the constants indirectly via the data area base address rather than embedding them into instructions. Non-AOT code can continue to use these constants as immediate values.
With this patch I have included the region grain size shift and the card table shift in the data area. So, this patch ensures that AOT code loads both grain shift and card shift counts rather than using immediates. I tested it passing -XX:GCCardSizeInBytes=128 -Xmx4M to a production run and it works fine.
The only address that needs relocating with this patch is the AOT data area base. Note that it was useful to use this patch to implement loading of two AOT code parameters, as it highlights a general problem that is not faced with only one field.
I decided to register only the base address of the AOT data area with the code cache as a relocatable (external) address, because this is more robust than requiring registration of the address of every data field embedded in the AOT data area. However, in order for the AOT cache to be able to relocate loads from the data area, they need to be implemented by installing the base address in a register and then adding the offset before executing the load.
This is fine for stub code where the generator can simply plant a load effective address (lea) of a pointer constant into a register and then add in the offset. However, I know C2 Value optimizations will certainly coalesce a pointer constant load followed by an add into an updated pointer constant load. I think this can also happen with C1 (it doesn't at the moment because in the current C1 code the grain shift load -- at offset 0 -- happens inline, while the card shift load -- at offset 4 -- only happens in a callout to stub code). I fixed this general problem by ensuring that the C1 and C2 back ends detect pointer constants that lie anywhere in the data area address *range*. For (only) those cases the lea is implemented by generating a base address load plus an offset add.
It would be nice to be able to fold the offset into the subsequent load. I'll look into that as a follow-up. Either way, I think this looks like the simplest and most robust solution to ensuring that AOT code is correctly parameterized at runtime. It is particularly nice that it doesn't need any new relocations.
regards,
Andrew Dinn -----------
On 24/07/2024 20:43, Vladimir Kozlov wrote:
Thank you for checking code for card_table_address.
Oh, that's a nuisance. I was concerned that I was perhaps using the wrong reloc model here. Should I be encoding info about what/how to patch the target instruction(s) using reloc data?
I am currently thinking about adding an indication of what a relocation points at: blob, stubs, external address, string. Currently we look through all our address tables in SCCache to find a match. It would be nice if we could get this information from the relocation. I also thought about using format for it, but realized it is PD and it reserves bits in the RelocInfo word. On the other hand, we can add another 12 (31 - 18 - 1) new relocations without using additional bits.
In short, add a new relocation if you need it.
Thanks, Vladimir K
On 7/24/24 7:40 AM, Andrew Dinn wrote:
On 23/07/2024 18:44, Vladimir Kozlov wrote:
I agree. As we discussed in the meeting, we may use it to patch compressed oops/klass code too.
Yes, we could potentially do something similar to adjust the shift and base adjustment employed when generating a narrow oop or klass encode or decode. However, that would only be useful within certain limits.
1) We can only cater for a change from one narrow-oop/klass base+shift configuration to another base+shift configuration. Patching code to switch from narrow oops/klass pointers to full width (or vice versa) would require adjusting not just the oop/klass load/store but also other instructions that hard-wire header/payload field sizes and offsets based on the narrow/full width layout assumptions. Likewise it would require resizing and repacking objects stored in the CDS mapped heap section.
2) Even if we restrict ourselves to the case where we retain narrow oops and simply allow for a change of base and/or shift (i.e. keep the object layouts the same), that will only work if pointers installed in mapped CDS heap objects are also rewritten to use the runtime encoding.
3) When generating AOT code for the encode and decode we would have to allow for the worst case i.e. leave room (nops) for both heap-base adjustment and shift even when they are not needed at generate time.
We have the same issue with card_table_address. I actually don't know why the current AOT code works??? Maybe because it is the same value in training and production runs. But we need to fix it too.
It is a constant node which is embedded into an AddP node:
https://urldefense.com/v3/__https://github.com/openjdk/leyden/blob/premain/s...
I believe C2 is ok. On AArch64 the C2 back end has an operand called immByteMapBase that matches any use of byte_map_base as a pointer constant. A corresponding rule ensures that any attempt to load that constant pointer is done by calling
__ load_byte_map_base()
and the macro assembler uses an ExternalAddress when StoreCachedCode is set.
I marked it only in platform specific code:
https://urldefense.com/v3/__https://github.com/openjdk/leyden/blob/premain/s...
I also think there is no mystery as to why this works on AArch64. The equivalent code in cardTableBarrierSetAssembler_aarch64.cpp also calls
__ load_byte_map_base()
However, that said, looking at the barrier card table write, it seems the card table shift is another value that the user might decide to reconfigure between assembly and production runs. The shift is derived from GCCardSizeInBytes, which is a gc global command line option and so susceptible to being reset.
We could use an aot_reloc with a different format to tag card table shift lsr instructions and patch them at shared code load to use the current runtime GCCardSizeInBytes. Of course, this is not as urgent a problem as the grain size shift, because the card size only ever changes thanks to an explicit command line request, whereas the grain size may be changed ergonomically when -Xmx is passed.
Be careful with using `format()` in shared code because it depends on platform specific setting `format_width` in `relocInfo_<arch>.hpp` (32-bit arm has format_width = 0).
Oh, that's a nuisance. I was concerned that I was perhaps using the wrong reloc model here. Should I be encoding info about what/how to patch the target instruction(s) using reloc data?
regards,
Andrew Dinn -----------
Do we really need the AOTRuntimeConstants instance to be a stub in the CodeCache? Looking at this code, I think this complicates the implementation if we use an ExternalAddress relocation anyway. Can it be a static instance in SCCache.cpp which we initialize the same way as in the current changes?
I do not understand why you have a concern about having a separate relocation for each constant in AOTRuntimeConstants. We have only 2 now. Even if they grow, I don't think we will have a lot of them. Actually, the question is: will it simplify the code if each constant's address is treated separately and not as an offset in AOTRuntimeConstants? Instead of loading the AOTRuntimeConstants address and then adding an offset, we just load the address of the constant.
I don't understand the question about C1 (I am not familiar with it). What does it generate for shifts now? Why do you need a 3-operand shift?
Code style: I noticed aarch64.ad is still using the old style of instruction definition using `enc_class`. We don't do this any more (at least in x86*.ad) when adding new instructions with one implementation. Please use `ins_encode %{ %}` in such cases.
I think, if you need an additional register, you should push/pop it to guarantee we don't stomp over some used value.
Thanks, Vladimir
On 8/14/24 12:19 PM, Ashutosh Mehra wrote:
On 15/08/2024 00:50, Vladimir Kozlov wrote:
Do we really need the AOTRuntimeConstants instance to be a stub in the CodeCache? Looking at this code, I think this complicates the implementation if we use an ExternalAddress relocation anyway. Can it be a static instance in SCCache.cpp which we initialize the same way as in the current changes?
Yes, you are right: we don't need the AOTRuntimeConstants struct to be embedded in the cache if we are using an ExternalAddress relocation. The initial idea for doing that was that it provides a 32-bit bound on the offset from any struct field to a corresponding field load (lea). That would allow us to use PC-relative loads of the field addresses -- which saves instruction footprint on AArch64, possibly also on x86? However, we can only do that if we also implement a new relocation (which seems unnecessary). So, we will redefine the AOTRuntimeConstants struct as a C++ static field.
I do not understand why you have a concern about having a separate relocation for each constant in AOTRuntimeConstants. We have only 2 now. Even if they grow, I don't think we will have a lot of them. Actually, the question is: will it simplify the code if each constant's address is treated separately and not as an offset in AOTRuntimeConstants? Instead of loading the AOTRuntimeConstants address and then adding an offset, we just load the address of the constant.
Yes, you are right again: if we register the address of every field in the struct in the SCC address table, then that will simplify the back-end code. I was not sure how much we wanted to rely on address table registration, but given that we *are* reliant on it just now, we might as well have the benefit. So, we will do the extra registration and limit the back-end changes to detecting ConP occurrences in the AOTRuntimeConstants range and loading them using an ExternalAddress reloc.
I don't understand the question about C1 (I am not familiar with it). What does it generate for shifts now? Why do you need a 3-operand shift?
I'm not sure '3-operand' is the right term to use to describe what is at stake here. It's really about the kind of operand rather than their number. The current region/card shift operations employ an immediate shift count. The AOT region/card shift operations need to be fed the shift count via a register (the target of the constant load). C1 does currently generate shifts where the shift count is in a register, but it only does so using the flavours of shift (sar/sal/shr/shl) that require the register to be ECX. It doesn't currently generate the alternative instructions (sarx/shlx/shrx) that accept a shift count in any register. I'm not 100% sure why, but I think this is because not all CPUs implement them (they require the CPUID feature BMI2), or it may be to do with them setting/not setting flags. Ashu is proposing adding support for these latter instructions to C1 so the barrier code does not have to pin the shift count to ECX. Do we care? Or can we live with the effect of pinning?
Code style: I noticed aarch64.ad is still using the old style of instruction definition using `enc_class`. We don't do this any more (at least in x86*.ad) when adding new instructions with one implementation. Please use `ins_encode %{ %}` in such cases.
Sure, we will fix that.
I think, if you need an additional register, you should push/pop it to guarantee we don't stomp over some used value.
Ok, we will play safe and do that.
One final comment: much of this discussion is going to be side-stepped by the G1 late barrier implementation that is currently working its way into mainline. That will remove the need to inject barrier IR subgraphs into the C1 and C2 IR graphs and eliminate most (all?) of the associated C2 reduction rules from the ad files. I think we should still update the patch to accommodate the feedback above. We can simplify it further once the G1 changes come in from mainline, and pursue in parallel the idea of accommodating changes to oop encodings using this model.
regards,
Andrew Dinn -----------
Ashu is proposing adding support for these latter instructions to C1 so the barrier code does not have to pin the shift count to ECX. Do we care? Or can we live with the effect of pinning?
We only care about correctness currently. Performance (register spilling) is not an issue. And I am concerned about the size and complexity of Leyden-specific changes which will eventually be pushed into mainline. Unless you see benefits for the new C1 shift instructions in mainline, I would go with using ECX. And again, do this if the code for using ECX is simpler than adding new instructions. If it is not - use the new instructions.
I was hoping that the "Late Barrier Expansion for G1" JEP might help here too, but it has only C2 changes :(
I agree with updating this patch and maybe even pushing it into premain (for testing) regardless of the "Late Barrier Expansion for G1" JEP status.
Thanks, Vladimir K
On 8/15/24 3:18 AM, Andrew Dinn wrote:
Hi Vladimir,
Here is the updated patch. We have refactored the code as per your suggestions. https://github.com/openjdk/leyden/compare/premain...ashu-mehra:leyden:JDK-83...
Let us know if you have any more suggestions.
Thanks, - Ashutosh Mehra
On Thu, Aug 15, 2024 at 11:02 AM Vladimir Kozlov <vladimir.kozlov@oracle.com> wrote:
No.

Why do you still use base+offsets for the constants' addresses? Is it for AArch64? What I want is something like this:

static address grain_shift_address() { return &_aot_runtime_constants::_grain_shift; }

And you will not need `#if INCLUDE_CDS` if you put AOTRuntimeConstants into SCCache.cpp. I don't see the need to separate the create_field_address_list() code from field_addresses_list(). Add an assert to field_addresses_list() to check that the values are initialized.

Vladimir K

On 8/15/24 1:43 PM, Ashutosh Mehra wrote:
Hi Vladimir,

Here is the updated patch. We have refactored the code as per your suggestions. https://github.com/openjdk/leyden/compare/premain...ashu-mehra:leyden:JDK-8335440-load-via-codecache?expand=1
Let us know if you have any more suggestions.
Thanks, - Ashutosh Mehra
On Thu, Aug 15, 2024 at 11:02 AM Vladimir Kozlov <vladimir.kozlov@oracle.com> wrote:
> Ashu is proposing adding support for these latter instructions to C1 so the barrier code does not have to pin the shift count to ECX. Do we care? Or can we live with the effect of pinning?
We only care about correctness currently. Performance (register spilling) is not an issue. And I am concerned about the size and complexity of the Leyden-specific changes, which will eventually be pushed into mainline. Unless you see benefits for the new C1 shift instructions in mainline, I would go with using ECX. And again, do this only if the code using ECX is simpler than adding new instructions. If it is not, use the new instructions.
I was hoping that the "Late Barrier Expansion for G1" JEP might help here too, but it has only C2 changes :(
I agree with updating this patch and maybe even pushing it into premain (for testing) regardless of the "Late Barrier Expansion for G1" JEP status.
Thanks, Vladimir K
On 8/15/24 3:18 AM, Andrew Dinn wrote:
> On 15/08/2024 00:50, Vladimir Kozlov wrote:
>> Do we really need AOTRuntimeConstants instance be a stub in CodeCache? Looking on this code and I think this complicate implementation if use ExternalAddress relocation anyway. Can it be static instance in SCCache.cpp which we initialize the same way as in current changes?
>
> Yes, you are right: we don't need the AOTRuntimeConstants struct to be embedded in the cache if we are using an ExternalRelocation. The initial idea for doing that was because it provides a 32-bit bound on the offset from any struct field to a corresponding field load (lea). That would allow us to use PC-relative loads of the field addresses -- which saves instruction footprint on AArch64, possibly also on x86? However, we can only do that if we also implement a new relocation (which seems unnecessary).
>
> So, we will redefine the AOTRuntimeConstants struct as a C++ static field.
>
>> I do not understand why you have concern about having separate relocation for each constant in AOTRuntimeConstants. We have only 2 now. Even they grow I don't think we will have a lot of them. Actually the question: will it simplify code if each constant's address is treated as separate and not offset in AOTRuntimeConstants? Instead of loading AOTRuntimeConstants address and then adding offset we just load address of constant.
>
> Yes, you are right again: if we register the address of every field in the struct in the SCC address table then that will simplify the back end code. I was not sure how much we wanted to rely on address table registration but given that we *are* reliant on it just now then we might as well have the benefit.
>
> So, we will do the extra registration and limit the back end changes to detect ConP occurrences in the AOTRuntimeConstants range and load them using an ExternalAddress reloc.
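The two addressing schemes weighed in the exchange above (load the struct base then add a field offset, versus register each field's address and load through it directly) can be contrasted in a small sketch. The struct layout, names, and values here are illustrative, not HotSpot's.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for a static AOTRuntimeConstants-style struct living in the VM.
struct Constants {
    uint64_t _grain_shift;
    uint64_t _card_shift;
};
static Constants g_constants = {27, 9};

// Scheme 1: generated code loads the struct's base address, then adds a
// per-field offset before dereferencing (load base, add, load field).
uint64_t load_via_base_plus_offset(size_t field_offset) {
    return *reinterpret_cast<uint64_t*>(
        reinterpret_cast<char*>(&g_constants) + field_offset);
}

// Scheme 2: each field's address is registered individually (e.g. in the
// SCC address table), so generated code loads straight through it with no
// separate add step.
uint64_t load_via_field_address(const uint64_t* field_address) {
    return *field_address;
}
```

Both schemes read the same value; the second trades one extra address-table entry per field for simpler back-end code, which is the trade the thread settles on.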
On 8/15/24 3:06 PM, Vladimir Kozlov wrote:
No.
Why you still use base+offsets for constants address? Is it for Aarch64?
What I want is something like this:
static address grain_shift_address() { return &_aot_runtime_constants::_grain_shift; }
And you will not need `#if INCLUDE_CDS` if you put AOTRuntimeConstants into SCCache.cpp.
I don't see the need to separate create_field_address_list() code from field_addresses_list().
Add assert to field_addresses_list() to check that values are initialized.
Actually you don't need the assert, since we need only the addresses of the fields. I don't see any use of aot_runtime_constants_at() anymore. Also, why do you need aot_runtime_constants()? Why not have _aot_runtime_constants as a static field of the AOTRuntimeConstants class, which you can initialize in (again) SCCache.cpp?

Vladimir K
On 15/08/2024 23:30, Vladimir Kozlov wrote:
Why you still use base+offsets for constants address? Is it for Aarch64?
What I want is something like this:
static address grain_shift_address() { return &_aot_runtime_constants::_grain_shift; }
Hmm, I thought I had made the changes to use a direct address but they were not in the aarch64 version I pushed. I have pushed a new version that fixes the macroassembler and g1 stubs to load the field address as a constant. I removed the xxx_offset() methods from AOTRuntimeConstants and replaced them with xxx_address() methods to make sure it can only be done that way.
And you will not need `#if INCLUDE_CDS` if you put AOTRuntimeConstants into SCCache.cpp.
Yes, SCCache.hpp/cpp is a much better place for this class than stubRoutines.hpp/cpp. The need to generate AOT code variants that employ constants from the current runtime is not specific to stubGenerator stubs. However, it always relates to use of the AOT cache.
I don't see the need to separate create_field_address_list() code from field_addresses_list().
Sorry, this was a hangover from an initial implementation. The getter now returns the array from a static field initialized in SCCache.cpp.
Add assert to field_addresses_list() to check that values are initialized.
Actually you don't need assert since we need only addresses of fields.
Left as is.
I don't see use of aot_runtime_constants_at() anymore.
It is redundant and has been removed.
Also why you need aot_runtime_constants()? Why not have _aot_runtime_constants as static field of AOTRuntimeConstants class which you can initialize in (again) SCCache.cpp?

Agreed. The static singleton is now only used internally to the class, so we no longer need aot_runtime_constants().
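A rough C++ sketch of the shape agreed here follows. The names track the thread (xxx_address() accessors, field_addresses_list(), an internal static singleton), but the initialize() helper and the values are hypothetical; in the actual patch the initialization lives in SCCache.cpp and the real fields mirror G1's grain/card shifts.

```cpp
#include <cassert>
#include <cstdint>

using address = uint8_t*;

class AOTRuntimeConstants {
    uint64_t _grain_shift;
    uint64_t _card_shift;
    // Static singleton, used only internally to the class.
    static AOTRuntimeConstants _aot_runtime_constants;
public:
    // Hypothetical stand-in for the initialization done in SCCache.cpp.
    static void initialize(uint64_t grain_shift, uint64_t card_shift) {
        _aot_runtime_constants._grain_shift = grain_shift;
        _aot_runtime_constants._card_shift = card_shift;
    }
    // Only xxx_address() accessors are exposed (the xxx_offset() methods
    // were removed), so callers can load each constant directly.
    static address grain_shift_address() {
        return reinterpret_cast<address>(&_aot_runtime_constants._grain_shift);
    }
    static address card_shift_address() {
        return reinterpret_cast<address>(&_aot_runtime_constants._card_shift);
    }
    // Field addresses for registration in the SCC address table
    // (nullptr-terminated).
    static address* field_addresses_list() {
        static address list[] = {
            grain_shift_address(), card_shift_address(), nullptr
        };
        return list;
    }
};

AOTRuntimeConstants AOTRuntimeConstants::_aot_runtime_constants;
```

With this shape, the back end only ever sees field addresses, which matches the decision to register each constant's address separately rather than expose base+offset arithmetic.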
Latest code: https://github.com/adinn/leyden/compare/premain...adinn:leyden:JDK-8335440-l...

This passes tier1 tests on aarch64 and successfully improves performance for the javac benchmark.

regards,

Andrew Dinn
-----------
Yes, this is much better.

Thanks,
Vladimir K

On 8/20/24 7:11 AM, Andrew Dinn wrote:
On 20/08/2024 18:35, Vladimir Kozlov wrote:
Yes, this is much better.
Ok, thanks. I'll push this into the premain branch.

regards,

Andrew Dinn
-----------
Red Hat Distinguished Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill
participants (4)
- Andrew Dinn
- Ashutosh Mehra
- ioi.lam@oracle.com
- Vladimir Kozlov