[master] RFR: Implement non-racy fast-locking [v5]

Wed Aug 3 19:40:42 UTC 2022

On Tue, 2 Aug 2022 00:01:34 GMT, John R Rose <jrose at openjdk.org> wrote:

>  I put in a query to Sandhya on the Intel Java performance team to see if they have further advice; will share if there is anything.

sandhya.viswanathan at intel.com replied (thanks Sandhya!) as follows:

> From the “Intel Software Optimization Reference Manual” at:
> <https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html>
> Section 3.5.2.3 of the manual indicates that a partial register stall occurs when an instruction refers to a register, portions of which were written previously by other instruction. This stall is much smaller on newer architectures. E.g. write to ah, bh, ch, or dh followed by a 2, 4, or 8 byte read of the same register can cause an overhead. 32-bit operation followed by a 64-bit read doesn’t incur that overhead.
> Excerpts from the manual:
> ----
> > General purpose registers can be accessed in granularities of bytes, words, doublewords; 64-bit mode also supports quadword granularity. Referencing a portion of a register is referred to as a partial register reference.
> > A partial register stall happens when an instruction refers to a register, portions of which were previously modified by other instructions. For example, partial register stalls occurs with a read to AX while previous instructions stored AL and AH, or a read to EAX while previous instruction modified AX.
> > The delay of a partial register stall is small in processors based on Intel Core microarchitecture, and in Pentium M processor (with CPUID signature family 6, model 13), Intel Core Solo, and Intel Core Duo processors. Pentium M processors (CPUID signature with family 6, model 9) and the P6 family incur a large penalty.
> > Note that in Intel 64 architecture, an update to the lower 32 bits of a 64 bit integer register is architecturally defined to zero extend the upper 32 bits. While this action may be logically viewed as a 32 bit update, it is really a 64 bit update (and therefore does not cause a partial stall).
> > Referencing partial registers frequently produces code sequences with either false or real dependencies.
> ----
> 
> So, it is recommended to use the same width operation on a GPR for up to 32bit data width. 32-bit operation followed by a 64-bit read is ok, so andl is ok.
> 
> For btc instruction REX.W indicates 64 bit. Between btcx and andx instruction andx has better throughput so better to use andx.

And here is my summary, FWIW.  I think something like this should go somewhere into `assembler.hpp`, since it is of general interest in the long term:

1. In order to change bits in the low 8 bits or low 16 bits of a 32-bit or 64-bit register, preserving the unaffected bits, use the full-sized operation, even if a narrower operation (e.g., movb, andb) seems nicer in some way.
2. If you don’t mind zeroing the high 32 bits, a 32-bit operation on a 64-bit operand is also fine, since it is architecturally a zero-extending 64-bit update.
3. Failing to follow this advice has a significant pipeline penalty on older processors, and is also likely to have at least some cost on newer processors.  (Reference: Section 3.5.2.3, Intel Software Optimization Reference Manual.)
4. Even for single-bit updates, prefer and/or/xor to btr/bts/btc in registers.  The latter instructions, when locked on memory operands, may sometimes be useful for bitwise CAS-like operations.  However, full-word cmpxchg has more reliable performance, even for single-bit updates.
5. It is true that instruction cache pressure can degrade the performance of an excessively sparse instruction workload as a whole.  But, because of deep prefetch, the encoded size of an individual instruction usually has an insignificant effect on performance.
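
To make the value-level effect of points 1, 2, and 4 concrete, here is a minimal C++ sketch.  All helper names are hypothetical and are not part of HotSpot or `assembler.hpp`.  It only models what the full-width and 32-bit operations compute; the compiler remains free to choose its own instructions, so this does not demonstrate the stall itself.  The last helper illustrates the full-word CAS alternative to a locked bts on memory, assuming a 64-bit word.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Point 1 (hypothetical helper): clear bits anywhere in the register,
// including the low 8 or 16 bits, with one full-width AND.  All bits
// outside `mask` are preserved, as with a 64-bit and instruction.
constexpr std::uint64_t clear_bits_full_width(std::uint64_t reg, std::uint64_t mask) {
  return reg & ~mask;
}

// Point 2 (hypothetical helper): a 32-bit AND on a 64-bit register.
// The result is zero-extended, so the high 32 bits are lost.
constexpr std::uint64_t and32_zero_extended(std::uint64_t reg, std::uint32_t imm) {
  return static_cast<std::uint64_t>(static_cast<std::uint32_t>(reg) & imm);
}

// Point 4 (hypothetical helper): toggle one bit with XOR instead of btc.
constexpr std::uint64_t toggle_bit(std::uint64_t reg, unsigned bit) {
  return reg ^ (std::uint64_t{1} << bit);
}

// Point 4, memory case (hypothetical helper): set one bit with a
// full-word compare-and-swap loop instead of a locked bts.
// Returns true if the bit was already set.
inline bool set_bit_cas(std::atomic<std::uint64_t>& word, unsigned bit) {
  const std::uint64_t mask = std::uint64_t{1} << bit;
  std::uint64_t old = word.load(std::memory_order_relaxed);
  while ((old & mask) == 0 &&
         !word.compare_exchange_weak(old, old | mask,
                                     std::memory_order_acq_rel,
                                     std::memory_order_relaxed)) {
    // On failure, `old` has been refreshed with the current value; retry.
  }
  return (old & mask) != 0;
}

// Compile-time checks of the value semantics described above.
static_assert(clear_bits_full_width(0xFFFFFFFFFFFFFFFFULL, 0x7ULL) == 0xFFFFFFFFFFFFFFF8ULL,
              "full-width AND preserves the unaffected bits");
static_assert(and32_zero_extended(0xAAAAAAAA000000FFULL, 0xF8u) == 0xF8ULL,
              "32-bit operation zeroes the high 32 bits");
static_assert(toggle_bit(toggle_bit(0x5ULL, 0), 0) == 0x5ULL,
              "toggling the same bit twice is the identity");
```

The first two helpers differ only in what happens to the upper half of the register, which is the trade-off point 2 describes.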

-------------

PR: https://git.openjdk.org/lilliput/pull/51