RFR: JDK-8261552: s390: MacroAssembler::encode_klass_not_null() may produce wrong results for non-zero values of narrow klass base

Thu Feb 18 05:39:39 UTC 2021

On Wed, 17 Feb 2021 16:53:03 GMT, Lutz Schmidt <lucy at openjdk.org> wrote:

>> If Compressed class pointer base has a non-zero value it may cause MacroAssembler::encode_klass_not_null() to encode a Klass pointer to a wrong narrow pointer.
>> 
>> This can be reproduced by starting the VM with
>> -Xshare:dump -XX:HeapBaseMinAddress=2g -Xmx128m
>> but CDS is not involved. It is only relevant insofar as this is the only way to get the following combination:
>> - heap is allocated at 0x8000_0000. It is small and ends at 0x8800_0000.
>> - class space follows at 0x8800_0000
>> - the narrow klass pointer base points to the start of the class space at 0x8800_0000.
>> 
>> In MacroAssembler::encode_klass_not_null(), there is the following section:
>> 
>>   if (base != NULL) {
>>     unsigned int base_h = ((unsigned long)base)>>32;
>>     unsigned int base_l = (unsigned int)((unsigned long)base);
>>     if ((base_h != 0) && (base_l == 0) && VM_Version::has_HighWordInstr()) {
>>       lgr_if_needed(dst, current);
>>       z_aih(dst, -((int)base_h));     // Base has no set bits in lower half.
>>     } else if ((base_h == 0) && (base_l != 0)) {   (A)
>>       lgr_if_needed(dst, current);                
>>       z_agfi(dst, -(int)base_l);                   (B)
>>     } else {
>>       load_const(Z_R0, base);
>>       lgr_if_needed(dst, current);
>>       z_sgr(dst, Z_R0);
>>     }
>>     current = dst;
>>   }
>> 
>> We enter the condition at (A) if the narrow klass pointer base is non-zero but fits into 32bit. At (B), we want to subtract the base from the Klass pointer; we do this by calculating the 32bit two's complement of the base and adding it with AGFI. AGFI sign-extends a 32bit immediate and adds it to a 64bit register. In this case, it produces the wrong result if the base is >0x8000_0000, because the negated immediate is then a positive number and the addition carries into bit 32 instead of cancelling:
>> 
>> In the case of the crash, we have:
>> base: 				  8800_0000
>> klass pointer: 			  8804_1040
>> 32bit two's complement of base:   7800_0000
>> added to the klass pointer: 	1_0004_1040
>> 
>> So the result of the "subtraction" is 1_0004_1040; it should be 4_1040, which would be the correct offset of the Klass* pointer within the ccs.
>> 
>> This bug has been dormant; it was activated by JDK-8250989, which changed the way class space reservation happens at CDS dump time. It surfaced first as a crash in a CDS-specific jtreg test (JDK-8261552).
>> 
>> ================
>> 
>> Fix:
>> 
>> I changed the AGFI instruction to a pure 32bit add (AFI). That works as long as the Klass pointer also fits into 32bit. So I narrowed the condition at (A) to only fire if it can be ensured that both narrow base and Klass* pointers fit into 32bit.
>> 
>> I also added a runtime verification in that case that any Klass pointer passed down is indeed a 32bit pointer. However, I am not really sure this is useful, or that this is the best way to do this (using TMHH and TMHL). I was looking for something like TMH or TML to check whole 32bit words but could not find any.
>> 
>> ----
>> 
>> Tests:
>> 
>> I manually tested that the crash disappears, which it does. I stepped through the encoding code and the values now look right.
>> 
>> I also built a VM with the ability to override both class space start address and the narrow klass pointer base to exact values (see https://github.com/openjdk/jdk/compare/master...tstuefe:override-ccs-start-and-base).
>> 
>> I used this method to test various combinations: 
>> - 0 < narrow klass pointer base < 4g, ccs end < 4g   (we hit our branch doing AFI)
>> - 0 < narrow klass pointer base < 4g, ccs end > 4g   (we hit the fallback doing SGR with r0)
>> - narrow klass pointer base = 0                      (we don't do anything)
>> 
>> (would this override-feature be useful? We could do better testing).
>> 
>> Thanks, Thomas
>
> src/hotspot/cpu/s390/macroAssembler_s390.cpp line 3657:
> 
>> 3655:     } else {
>> 3656:       load_const(Z_R0, base);
>> 3657:       lgr_if_needed(dst, current);
> 
> What would you think of a more general rework like this? The comments in the code should explain the intentions/assumptions/conclusions.
> 
> // Klass oop manipulations if compressed.
> void MacroAssembler::encode_klass_not_null(Register dst, Register src) {
>   Register current = (src != noreg) ? src : dst; // Klass is in dst if no src provided. (dst == src) also possible.
>   address  base    = CompressedKlassPointers::base();
>   int      shift   = CompressedKlassPointers::shift();
>   bool     need_zero_extend = false;
>   assert(UseCompressedClassPointers, "only for compressed klass ptrs");
> 
>   BLOCK_COMMENT("cKlass encoder {");
> 
> #ifdef ASSERT
>   Label ok;
>   z_tmll(current, KlassAlignmentInBytes-1); // Check alignment.
>   z_brc(Assembler::bcondAllZero, ok);
>   // The plain disassembler does not recognize illtrap. It instead displays
>   // a 32-bit value. Issuing two illtraps assures the disassembler finds
>   // the proper beginning of the next instruction.
>   z_illtrap(0xee);
>   z_illtrap(0xee);
>   bind(ok);
> #endif
> 
>   // Scale down the incoming klass pointer first.
>   // We then can be sure we calculate an offset that fits into 32 bit.
>   // More generally speaking: all subsequent calculations are purely 32-bit.
>   if (shift != 0) {
>     assert (LogKlassAlignmentInBytes == shift, "decode alg wrong");
>     z_srlg(dst, current, shift);
>     need_zero_extend = true;
>     current = dst;
>   }
> 
>   if (base != NULL) {
>     // Use scaled-down base address parts to match scaled-down klass pointer.
>     unsigned int base_h = ((unsigned long)base)>>(32+shift);
>     unsigned int base_l = (unsigned int)(((unsigned long)base)>>shift);
> 
>     // General considerations:
>     //  - when calculating (current_h - base_h), all digits must cancel (become 0).
>     //    Otherwise, we would end up with a compressed klass pointer which doesn't
>     //    fit into 32-bit.
>     //  - Only bit#33 of the difference could potentially be non-zero. For that
>     //    to happen, (current_l < base_l) must hold. In this case, the subtraction
>     //    will create a borrow out of bit#32, nicely killing bit#33.
>     //  - With the above, we only need to consider current_l and base_l to
>     //    calculate the result.
>     //  - Both values are treated as unsigned. The unsigned subtraction is
>     //    replaced by adding (unsigned) the 2's complement of the subtrahend.
> 
>     if (base_l == 0) {
>       //  - By theory, the calculation to be performed here (current_h - base_h) MUST
>       //    cancel all high-word bits. Otherwise, we would end up with an offset
>       //    (i.e. compressed klass pointer) that does not fit into 32 bit.
>       //  - current_l remains unchanged.
>       //  - Therefore, we can replace all calculation with just a
>       //    zero-extending load 32 to 64 bit.
>       //  - Even that can be replaced with a conditional load if dst != current.
>       //    (this is a local view. The shift step may have requested zero-extension).
>     } else {
>       // To begin with, we may need to copy and/or zero-extend the register operand.
>       // We have to calculate (current_l - base_l). Because there is no unsigned
>       // subtract instruction with immediate operand, we add the 2's complement of base_l.
>       if (need_zero_extend) {
>         z_llgfr(dst, current);
>         need_zero_extend = false;
>       } else {
>         llgfr_if_needed(dst, current); // zero-extension while copying comes at no extra cost.
>       }
>       current = dst;
>       z_alfi(dst, -(int)base_l);
>     }
>   }
> 
>   if (need_zero_extend) {
>     // We must zero-extend the calculated result. It may have some leftover bits in
>     // the hi-word because we only did optimized calculations.
>     z_llgfr(dst, current);
>   } else {
>     llgfr_if_needed(dst, current); // zero-extension while copying comes at no extra cost.
>   }
> 
>   BLOCK_COMMENT("} cKlass encoder");
> }

Looks nice and elegant. 

But as said offlist, I dislike the fact that this hard codes the limitation to 32bit for the narrow klass pointer range.

That restriction is artificial and we may just want to drop it. E.g. one recurring idea I have is to drop the duality in metaspace between non-class- and class-metaspace, and just store everything in class space. That would save quite a bit of memory (less overhead) and make the metaspace coding quite a bit simpler. However, in that case it could be that we exceed the current 3g limit and may even exceed 32bit. Since add+shift for decoding is universally done on all platforms at least if CDS is on, this should work out of the box. Unless of course the platforms hard-code the 32bit limitation into their encoding schemes.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2595