RFR: 8201318: Introduce GCThreadLocalData to abstract GC-specific data belonging to a thread

Aleksey Shipilev shade at redhat.com
Tue Apr 10 20:25:13 UTC 2018


Hi Erik,

On 04/10/2018 09:59 PM, Erik Österlund wrote:
>> I think the major concern should be the instruction size. On x86, what matters is which category
>> the immediate offset falls into. Some hand-crafted assembly:
>>
>>     0:    48 89 42 7f              mov    %rax,0x7f(%rdx)
>>     4:    48 89 82 80 00 00 00     mov    %rax,0x80(%rdx)
>>     b:    80 7a 7f 00              cmpb   $0x0,0x7f(%rdx)
>>     f:    80 ba 80 00 00 00 00     cmpb   $0x0,0x80(%rdx)
>>    16:    80 7a 7f 41              cmpb   $0x41,0x7f(%rdx)
>>    1a:    80 ba 80 00 00 00 41     cmpb   $0x41,0x80(%rdx)
>>    21:    f6 42 7f 00              testb  $0x0,0x7f(%rdx)
>>    25:    f6 82 80 00 00 00 00     testb  $0x0,0x80(%rdx)
>>    2c:    f6 42 7f 41              testb  $0x41,0x7f(%rdx)
>>    30:    f6 82 80 00 00 00 41     testb  $0x41,0x80(%rdx)
>>
>>
>> In our case, we want to pack the most used fields into the first 128 bytes. Maybe we should put the
>> polling page at offset 0, and trim GCTLD to 96 bytes?
> 
> Note that the offset will not be 0 due to the vtable. It will be 8 on 64 bit machines. 

True! Regardless, I'll take 0x8(%r15) over 0x888(%r15) any day :)
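
For reference, Erik's vtable point is easy to see in isolation. A minimal
standalone sketch (the names "PolledThread" and "_polling_page" are made up
for illustration, not HotSpot code):

  #include <cstddef>
  #include <cstdint>
  #include <cstdio>

  // A class with virtual functions carries a vptr, so on common 64-bit ABIs
  // its first data member lands at offset 8, not 0.
  class PolledThread {
  public:
    virtual ~PolledThread() {}
    ptrdiff_t polling_page_offset() const {
      return reinterpret_cast<const char*>(&_polling_page)
           - reinterpret_cast<const char*>(this);
    }
  private:
    volatile uintptr_t _polling_page;  // first declared field, right after the vptr
  };

  int main() {
    PolledThread t;
    printf("offset of first field: %td\n", t.polling_page_offset());  // typically 8
    return 0;
  }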

> I once prototyped a thread-local poll utilizing conditional branches that truly used offset 0 to
> get optimal encoding (6 bytes for the test and shortened branch - same size as the old testl
> $page encoding for global polling). I had to go down a deep rabbit hole of exposing the TLS in
> r15 at an offset into the Thread, adjusting all offsets for our generated code, and making the
> locking code deal with owners being "almost" equal, as the owner is either Thread* or an internal
> pointer into that thread, depending on what part of the locking code was being used. After a lot
> of blood, sweat and tears, my conclusion from that exercise was that it made absolutely no
> observable difference. But I got the T-shirt anyway.
Cool. For safepoint polls, I would understand if it turned out performance-neutral. I also know from
Shenandoah that even tiny codegen improvements for GC barriers (which are much more frequent than
safepoint polls, even after optimizations) do pay off. Trying to fit the hottest JavaThread fields
below 128 bytes seems much easier than pulling off the real offset-0 trick.
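
To make the size argument concrete: the rule the hand-crafted listing above
demonstrates is that x86 only gets the short 1-byte displacement when the
offset fits in a signed byte. A small sketch of that rule (not HotSpot's
assembler, just the encoding arithmetic):

  #include <cstdint>

  // x86 uses a 1-byte displacement (disp8) when the offset fits in a signed
  // byte; otherwise the instruction grows by three bytes for a full disp32.
  static inline bool fits_disp8(int64_t disp) {
    return disp >= -128 && disp <= 127;
  }

  // E.g. a register store through %rdx, as in the listing above:
  //   48 89 42 7f              mov %rax,0x7f(%rdx)   -> 4 bytes (disp8)
  //   48 89 82 80 00 00 00     mov %rax,0x80(%rdx)   -> 7 bytes (disp32)
  static inline int mov_store_size(int64_t disp) {
    return fits_disp8(disp) ? 4 : 7;
  }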

The implication for this patch is that we should probably trim the GCTLD size to below 128 bytes (from
Per's initial suggestion of 192), reserve some space for e.g. the polling page (if we want to move it
to a lower offset), and account for the vtable waste. And then add a stronger comment warning future
engineers against growing GCTLD beyond 128 bytes.

I suggest 96 bytes. G1 does not need that much anyhow, and neither does Shenandoah. ZGC does not need
that much either, right?
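
If it helps, a hedged sketch of what the 96-byte budget plus the stronger
comment could look like (the member name and the exact limit are assumptions
from this thread, not the actual change):

  class GCThreadLocalData {
  private:
    // Opaque storage; each GC overlays its own thread-local fields here.
    // Keep this small: together with the Thread vptr (and a possible
    // low-offset polling page field) it should stay within disp8 reach
    // (offsets < 128) of the thread register in generated code.
    char _data[96];
  };

  // Make growing past the budget a conscious decision, not an accident.
  static_assert(sizeof(GCThreadLocalData) <= 96,
                "growing GCThreadLocalData hurts barrier code density");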

Thanks,
-Aleksey
