Revisiting Valhalla markword
Sergey Kuksenko
sergey.kuksenko at oracle.com
Mon Jun 15 06:45:43 UTC 2020
Also you may read this text here: http://cr.openjdk.java.net/~skuksenko/valhalla/markword/markword.txt
* Revisiting Valhalla markword.
I'd like to express some thoughts about Valhalla and the markword, and what to do now that biased locking is deprecated.
It is a continuation of this email: https://mail.openjdk.java.net/pipermail/valhalla-dev/2020-June/007437.html (https://bugs.openjdk.java.net/browse/JDK-8247298), but I decided to start a new thread.
What do we want to get? Talking about the markword, I'd like to separate mandatory information (functional requirements) from desirable information (performance requirements).
Information in markword:
* instance states, can't be placed anywhere else:
- age
- hash
- larval state (Valhalla)
Instance state can be mutable or immutable (set at creation and unchanged over the lifecycle of the object); all the listed states are mutable and require more careful processing.
* class states, which can be found in the Klass, but which it is desirable to place into the markword for performance:
- is_inline (Valhalla)
- is_array_flattened (Valhalla)
- is_array_nullfree (Valhalla)
<64-bit mode on>
If we look into the markword, it appears there is plenty of space: 26 unused bits.
// 64 bits:
// --------
// unused:25 hash:31 -->| unused_gap:1 age:4 biased_lock:1 lock:2 (normal object)
On the other hand, the markword can be replaced with some other pointer, and in that case we have no free space in the markword at all.
// - the two lock bits are used to describe three states: locked/unlocked and monitor.
// [ ........ | 1 | 01] biased
// [header | 0 | 01] unlocked regular object header
// [ptr | 00] locked ptr points to real header on stack
// [ptr | 10] monitor inflated lock (header is swapped out)
// [ptr | 11] marked used to mark an object
In order to understand how this works, it is important to know what kinds of pointers can be stored there. A pointer stored in the markword is always native and uncompressed (if it's an oop).
That really leaves no place for any other information except the lower ptr bits, which are free only because of the pointer's object alignment.
All current code around the markword's pointers takes care of only the two lower bits, forcing a minimal alignment of 4 bytes. It always clears only the two lower bits when extracting a pointer from the markword.
But that doesn't mean the real alignment can't be higher.
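The alignment argument can be illustrated with a small sketch (illustrative code, not HotSpot's): an 8-byte-aligned pointer always has its three low bits zero, so those bits can carry tag information and be masked off before the pointer is used.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch: an N-byte-aligned pointer has log2(N) zero bits at
// the bottom, which are free to carry tag information.
inline uintptr_t tag_pointer(void* p, uintptr_t tag) {
    uintptr_t v = reinterpret_cast<uintptr_t>(p);
    assert((v & 0x7) == 0 && tag <= 0x7);  // requires 8-byte alignment
    return v | tag;                        // 8-byte alignment -> 3 free bits
}

inline void* untag_pointer(uintptr_t tagged) {
    return reinterpret_cast<void*>(tagged & ~uintptr_t(0x7));  // clear low 3 bits
}
```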
The biased_lock bit works because the 101 bit combination never occurs together with pointers inside the markword, so it can't interfere even with 4-byte-aligned pointers.
The two lower bits being 01 means that the information in the markword is not a pointer, and any other bits may be used as desired.
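The state table above can be sketched as a small decoder in C++ (the constant names follow markWord.hpp, but this is an illustrative sketch, not HotSpot code):

```cpp
#include <cstdint>
#include <string>

// Sketch of the lock-bit encoding shown in the table above.
constexpr uintptr_t lock_mask        = 0x3;  // two lowest bits
constexpr uintptr_t biased_lock_mask = 0x7;  // three lowest bits
constexpr uintptr_t locked_value     = 0x0;  // [ptr      | 00]
constexpr uintptr_t unlocked_value   = 0x1;  // [header |0| 01]
constexpr uintptr_t biased_value     = 0x5;  // [......|1| 01]
constexpr uintptr_t monitor_value    = 0x2;  // [ptr      | 10]

std::string lock_state(uintptr_t mark) {
    if ((mark & biased_lock_mask) == biased_value) return "biased";
    switch (mark & lock_mask) {
        case locked_value:   return "locked";    // ptr to BasicLock on stack
        case unlocked_value: return "unlocked";  // original header in place
        case monitor_value:  return "monitor";   // ptr to ObjectMonitor
        default:             return "marked";    // GC/JVMTI forwarding etc.
    }
}
```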
locked(00) - ptr to a BasicLock. It is allocated on the Java stack, and its guaranteed alignment is unclear; at least it's hard to verify with a quick look into the code.
The minimal alignment is really 4 bytes, but given this comment in the code
basicLock.hpp: BasicLock _lock; // the lock, must be double word aligned
we may assume that the real alignment is 8 bytes (3 free bits). That should be confirmed by someone who knows the markword better than me.
monitor(10) - ptr to an ObjectMonitor. A large and long-lived (really, forever) native object. It has 64-byte alignment (it should be aligned to a cache line).
marked(11) - ptr to "whatever". It may even be not a pointer at all, just the two lower bits acting as a "mark". Widely used by garbage collectors, but also by JVMTI and JFR. GCs and JVMTI use it to store a forwarding pointer -
an uncompressed oop referencing the new copy of the old object. Having an oop gives us a minimal alignment of 8 bytes and 3 free bits of information. The real alignment of JFR's pointers is unknown,
but I think we may demand any desired alignment from JFR.
The nicest fact about the "marked" markword is that we don't need to care about it. We can't use the "11" bit combination ourselves, but in reality neither Valhalla code nor the runtime is ever able to see such a markword.
Mostly it is used at safepoints, and the original markword is restored before execution resumes. It may be used by a concurrent GC, but such a GC must restore it (or provide an oop with a proper markword) via a read barrier.
That is the way Shenandoah works. I didn't find any trace of the "marked" markword inside ZGC; it looks like ZGC uses its own, different forwarding technique.
This means we only have to care about the unlocked (or header), locked and monitor markword states.
Practically, since an inline type can't be transferred to the locked/monitor states, we may choose any unused bit as the inline type marker and check that bit plus the lower 01 as the whole inline type mask.
Having 01 in the lower bits, we may add any combination of the upper unused bits and get a lot of room for different states and information. That could be a problem when the information is displaced by locking,
but inline types have no synchronization.
Let's cover more technical details.
1. Ideal Solution.
If we prove that BasicLock alignment is always 8 bytes, or can force BasicLock alignment to 8 bytes, we get 3 low bits for persistent information.
One bit (which is the biased_lock bit now) is free. We also have to agree that the inline type bit is more important than potential future features of other JVM components.
We may reserve this one bit (biased_lock) for marking inline types, and the other two to encode the existing states (unlocked, locked, monitor, marked).
Having a single bit marking inline types is the way to get the theoretical minimum possible performance overhead for Valhalla.
However, it's still unclear how big the difference is between this minimum possible overhead and the current multibit mask check.
My advocacy of a single-bit check is based on two sources of overhead. The first is direct: we need more instructions. A single-bit check can be done with just one "test" instruction,
while a multibit check requires an "and" followed by a "cmp". The second overhead is indirect: a multibit check needs an additional register, which increases register allocation pressure,
causes more register spilling, and may have quite a big performance impact in tight loops.
The problem is that right now BiasedLocking is only deprecated. To use that bit we have to wait until the main JDK repository gets rid of the BiasedLocking code completely,
otherwise the process of merging with mainline will turn into a nightmare.
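For illustration, the two check styles could look like this in C++ (the bit positions are hypothetical: a dedicated freed biased_lock bit for the single-bit variant, and a 0x81-style mask for the multibit variant); a compiler would lower the first to a single "test" and the second to "and" plus "cmp":

```cpp
#include <cstdint>

// Sketch of the two check styles; bit positions are illustrative only.
constexpr uintptr_t inline_bit  = 0x4;   // hypothetical dedicated bit (ex-biased_lock)
constexpr uintptr_t inline_mask = 0x81;  // hypothetical mask: bit 7 + unlocked(01)

inline bool is_inline_single_bit(uintptr_t mark) {
    return (mark & inline_bit) != 0;             // one "test" instruction
}

inline bool is_inline_multibit(uintptr_t mark) {
    return (mark & inline_mask) == inline_mask;  // "and" + "cmp", needs a register
}
```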
2. Not so ideal solution, but it works.
This is done under the assumption that we do not interfere with the lower 3 bits of the markword.
Current header:
// --------
// unused:25 hash:31 -->| unused_gap:1 age:4 biased_lock:1 lock:2 (normal object)
Using the high markword bits for new information is undesirable. Large constants decrease code density. Moreover, x86 has no "test" and "and" instructions with an immediate larger than 32 bits,
which would increase code complexity.
There is absolutely no reason to keep the hash bits so low. It's more natural to move them to the upper part of the markword (the highest 32 bits).
Suggested header:
// 64 bits:
// --------
// <always 0>:1 hash:31 -->| unused:23 larval:1 is_inline:1 age:4 <biased_lock or unused>:1 lock:2
I am not sure if we could get benefits from having the identity hash code in the isolated high 4 bytes of the markword. Maybe. But it won't be worse than the existing placement.
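As a sketch, the suggested layout could be written down as shift/mask constants in the style of markWord.hpp (the names and positions are the proposal's, not current HotSpot's):

```cpp
#include <cstdint>

// Sketch of the suggested 64-bit layout as shift/mask constants.
constexpr int lock_bits        = 2;
constexpr int biased_lock_bits = 1;  // kept while BiasedLocking code remains
constexpr int age_bits         = 4;
constexpr int is_inline_bits   = 1;
constexpr int larval_bits      = 1;
constexpr int unused_bits      = 23;
constexpr int hash_bits        = 31;  // + 1 "always 0" bit = 64 total

constexpr int age_shift       = lock_bits + biased_lock_bits;             // 3
constexpr int is_inline_shift = age_shift + age_bits;                     // 7
constexpr int larval_shift    = is_inline_shift + is_inline_bits;         // 8
constexpr int hash_shift      = larval_shift + larval_bits + unused_bits; // 32

constexpr uintptr_t is_inline_mask_in_place = uintptr_t(1) << is_inline_shift;
constexpr uintptr_t larval_mask_in_place    = uintptr_t(1) << larval_shift;
```

With these positions the is_inline bit plus the unlocked lower bits form exactly a 0x81 mask.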
What may be done: store the identity hash code of an inline type in the header, as for normal objects. It looks like the identity hash code is currently recalculated every time,
and doesn't even depend on the inline type's instance fields (https://mail.openjdk.java.net/pipermail/valhalla-dev/2020-May/007288.html).
So we may compute the hash as a function of the primitive fields, plus the identity hash of reference fields, plus the recursive hash of inline type fields.
If an inline type instance is allocated on the heap, we compute it once and store it in the markword as usual.
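A sketch of how such a hash could be combined (all names here are hypothetical; a real implementation would walk the field layout of the inline type and dispatch per field kind):

```cpp
#include <cstdint>
#include <vector>

// Illustrative sketch of the proposed scheme: combine hashes of primitive
// fields, identity hashes of reference fields, and (recursively) hashes of
// nested inline-type fields. Not HotSpot code.
inline int32_t combine_hash(int32_t acc, int32_t h) {
    return acc * 31 + h;  // the usual polynomial combination
}

int32_t inline_type_hash(const std::vector<int32_t>& field_hashes) {
    int32_t h = 17;  // arbitrary seed
    for (int32_t fh : field_hashes) {
        // fh is a primitive value hash, an identity hash of a reference
        // field, or a recursively computed inline-type hash.
        h = combine_hash(h, fh);
    }
    return h;
}
```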
2.1 is_inline check.
When checking the is_inline bits it is necessary to care about bit number 0; bit number 1 doesn't matter, because we can't see the "marked" state (11) here.
The current two-lower-bits design is done in the following way: [<fat or thin lock>:1 <unlocked or locked>:1].
Both the fat lock (ObjectMonitor) and the thin lock (BasicLock) contain the displaced original header. So the lowest markword bit is 0 when the header is displaced and 1 when it's original.
Current code is checking only the lowest markword bit, for example (markWord.hpp):
bool has_displaced_mark_helper() const {
  return ((value() & unlocked_value) == 0); // where unlocked_value == 1
}
An inline type can't be locked, thus its header will never be displaced.
There are two ways to generate code for the is_inline check:
1. As usual:
; load markword to reg
and reg,0x81
cmp reg,0x81
jne not_inline (or je is_inline)
; here is inline
2. Split bit check
; load markword to reg
test reg,0x1 ; check if markword is displaced
jz not_inline
test reg,0x80
jz not_inline
; here is inline
The second variant has more instructions and more branches, but it saves a register. Experiments are required to check whether there is a performance difference.
Besides, the first half of the split check is aligned with the array property checks (see below).
Note: Maintaining mutable (as I called them above) instance states in the markword is a complex task. Several threads may mutate the markword, or locking may even displace the header at the same time;
there is a very high risk of a data race. For example, hash code placement into the markword has proper protection against this, and may even inflate the monitor just to store the hash properly.
It is safe to mutate the larval bit, because an inline class can never be locked and its header is never displaced.
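Since the header of an inline type is never displaced, setting the larval bit reduces to a plain CAS loop; a sketch (the bit position is hypothetical, and other mutable bits such as age or hash may still change concurrently, hence the retry):

```cpp
#include <atomic>
#include <cstdint>

// Sketch: setting the larval bit without worrying about lock inflation,
// because an inline type's header is never displaced. Mask is hypothetical.
constexpr uintptr_t larval_mask = 0x100;

void set_larval(std::atomic<uintptr_t>& mark) {
    uintptr_t old_mark = mark.load(std::memory_order_relaxed);
    // Other bits may be mutated concurrently, so retry until the CAS wins;
    // compare_exchange_weak refreshes old_mark on failure.
    while (!mark.compare_exchange_weak(old_mark, old_mark | larval_mask,
                                       std::memory_order_relaxed)) {
    }
}
```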
2.2 Array checks.
There are two kinds of possible array checks: is_array_flattened and is_array_nullfree (we may add something else; there are plenty of free bits for it).
Experiments with aaload and aastore operations (http://cr.openjdk.java.net/~skuksenko/valhalla/reports/aaload/aaload.txt and http://cr.openjdk.java.net/~skuksenko/valhalla/reports/aastore/aastore.txt)
have shown that the key performance overhead of getting properties from the Klass is uncompressing the klass pointer.
There is absolutely no way to fit that information into the markword for all kinds of markwords. But we have enough unused bits to store that information in an unlocked header.
Let's use the markword as a cache (fast path) for array properties. If the markword is displaced, we go to the Klass and load the required information from there.
Under the hypothesis that people rarely synchronize on arrays, we get quite a fast solution.
When we go to the Klass for those properties, we shouldn't extract them from "_layout_helper". We should get them from "_prototype_header", where those bits should be properly set after Klass creation.
That allows us to unify the tail part of the checks. For more cache-friendly behavior, the "_prototype_header" field should be moved from the end of the Klass structure to the beginning (before or after the "_layout_helper" field).
Array check looks like:
; load markword to reg
test reg, 0x1 ; check if markword is displaced
jz displaced ; --------> uncompress klass ptr, load "_prototype_header" into the same reg and jump back
back: test reg,<is_array_flattened_mask> ; <------------------------------
jz not_flattened
Of course, all this is necessary only when running with -XX:+UseCompressedClassPointers. Otherwise it's better to go directly to the Klass and take "_prototype_header".
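The whole fast-path scheme can be sketched in C++ as follows (names and bit positions are hypothetical; the real check would be emitted as compiled code, and the klass field here stands in for decoding the compressed klass pointer):

```cpp
#include <cstdint>

// Sketch of the markword-as-cache scheme for array properties.
constexpr uintptr_t unlocked_value          = 0x1;
constexpr uintptr_t is_array_flattened_mask = 0x200;  // hypothetical bit

struct Klass {
    uintptr_t prototype_header;  // property bits pre-set at Klass creation
};

struct Oop {
    uintptr_t mark;
    Klass*    klass;  // stands in for the (compressed) klass pointer
};

bool is_array_flattened(const Oop* obj) {
    uintptr_t mark = obj->mark;
    if ((mark & unlocked_value) == 0) {
        // Header displaced by locking: slow path, reload the prototype
        // header from the Klass instead.
        mark = obj->klass->prototype_header;
    }
    return (mark & is_array_flattened_mask) != 0;  // unified tail check
}
```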
Note 1: This scheme works only if the property is immutable. But is_array_flattened and is_array_nullfree satisfy this requirement.
Note 2: If the header is displaced, the original markword may be reached at offset zero from the displacing pointer. It's tempting to reach the original markword and avoid uncompressing the klass pointer.
I want to warn against this. First, the object may be in the INFLATING state (a.k.a. zero markword), which means a null check for the displaced ptr would have to be added.
Second, the BasicLock (allocated on the stack) may be in the process of destruction while we are trying to reach the displaced header, which may cause a data race and an incorrect header value.
<64-bit mode off>
<32-bit mode on>
Unfortunately, the 32-bit header has no free space.
At the same time there is no reason to cache array properties in the markword: the klass ptr is not compressed there, so let's just go to the Klass.
32-bit markword:
// hash:25 ------------>| age:4 biased_lock:1 lock:2 (normal object)
First of all, we have to fit the larval state into the markword; there is no other place for it. I see only two options: put it in place of the biased_lock bit, or take it from the hash (narrowing the hash to 24 bits).
is_inline check:
- we may get it from the Klass
- we may fit it into the hash or the biased_lock bit (reducing the hash width).
<32-bit mode off>