RFR: 8201318: Introduce GCThreadLocalData to abstract GC-specific data belonging to a thread

Erik Österlund erik.osterlund at oracle.com
Tue Apr 10 19:59:36 UTC 2018


Hi Aleksey,

On 2018-04-10 18:50, Aleksey Shipilev wrote:
> On 04/10/2018 05:59 PM, Robbin Ehn wrote:
>> On 2018-04-10 17:47, Aleksey Shipilev wrote:
>>> On 04/10/2018 05:25 PM, Robbin Ehn wrote:
>>>> Had quick look, what I saw looked good. (not a full review)
>>>> Is there a reason for moving the gc data to 'zero offset' in Thread?
>>> Oh! I missed that, and I fully agree with this move. At least one reason I see, smaller offsets
>>> against TLS open up opportunities for denser code-generation when e.g. GC barriers poll thread-local
>>> data. Right now SATB barrier generates something like "cmpb $0x0, 0x3d8(%r15)", while it could
>>> generate just "cmpb $0x0, 0x0(%r15)" now :)
>> Yes, but it pushes down e.g.:
>>   volatile void* _polling_page;                 // Thread local polling page
>> Which may not matter, as long as it's on the first page I suppose.
> I think the major concern should be the instruction size. On x86 what matters is what category
> immediate offset falls into. Some hand-crafted assembly:
>
>     0:	48 89 42 7f          	mov    %rax,0x7f(%rdx)
>     4:	48 89 82 80 00 00 00 	mov    %rax,0x80(%rdx)
>     b:	80 7a 7f 00          	cmpb   $0x0,0x7f(%rdx)
>     f:	80 ba 80 00 00 00 00 	cmpb   $0x0,0x80(%rdx)
>    16:	80 7a 7f 41          	cmpb   $0x41,0x7f(%rdx)
>    1a:	80 ba 80 00 00 00 41 	cmpb   $0x41,0x80(%rdx)
>    21:	f6 42 7f 00          	testb  $0x0,0x7f(%rdx)
>    25:	f6 82 80 00 00 00 00 	testb  $0x0,0x80(%rdx)
>    2c:	f6 42 7f 41          	testb  $0x41,0x7f(%rdx)
>    30:	f6 82 80 00 00 00 41 	testb  $0x41,0x80(%rdx)
>
>
> In our case, we want to pack the most used fields under first 128 bytes. Maybe we should put polling
> page at offset 0, and trim GCTLD to 96 bytes?
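
The size pattern in the listing above can be reproduced with a small model.
This is only a sketch: the function names are made up for illustration, and
it covers just the plain base-plus-displacement forms shown in the listing
(base register %rdx, non-zero displacement), not the whole x86-64 encoding.

```cpp
#include <cassert>
#include <cstdint>

// Bytes taken by the displacement field of a [base + disp] operand when the
// displacement is non-zero: a signed 8-bit value fits in one byte (disp8),
// anything else needs four bytes (disp32).
constexpr int disp_bytes(std::int32_t disp) {
    return (disp >= -128 && disp <= 127) ? 1 : 4;
}

// mov %rax, disp(%rdx): REX.W prefix + opcode + ModRM + displacement.
constexpr int mov_store_len(std::int32_t disp) {
    return 3 + disp_bytes(disp);
}

// cmpb/testb $imm8, disp(%rdx): opcode + ModRM + displacement + imm8.
constexpr int byte_op_imm8_len(std::int32_t disp) {
    return 3 + disp_bytes(disp);
}

// Cross-check the model against the instruction lengths in the listing.
inline void check_against_listing() {
    assert(mov_store_len(0x7f) == 4);     // 48 89 42 7f
    assert(mov_store_len(0x80) == 7);     // 48 89 82 80 00 00 00
    assert(byte_op_imm8_len(0x7f) == 4);  // 80 7a 7f 00
    assert(byte_op_imm8_len(0x80) == 7);  // 80 ba 80 00 00 00 00
}
```

Under this model every field at an offset below 128 costs 4 bytes per access
and every field at 128 or above costs 7, which is the reason for wanting the
hot fields in the first 128 bytes.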

Note that the offset will not be 0 because of the vtable pointer. It will be
8 on 64-bit machines. I once prototyped a thread-local poll utilizing
conditional branches that truly used offset 0 to get optimal encoding (6 
bytes for the test and shortened branch - same size as the old testl 
$page encoding for global polling). I had to go down a deep rabbit hole 
of exposing the TLS in r15 at an offset into the Thread, adjusting all 
offsets for our generated code, and making the locking code deal with 
owners being "almost" equal, as the owner is either Thread* or an 
internal pointer into that thread, depending on what part of the locking 
code was being used. After a lot of blood, sweat and tears, my 
conclusion from that exercise was that it made absolutely no observable 
difference. But I got the T-shirt anyway.
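
To illustrate the vtable point, here is a minimal sketch with two made-up
structs (they are stand-ins, not the real Thread class). The C++ standard
does not fix where the vtable pointer lives, so the offset of
sizeof(void*) is an assumption about common ABIs such as the Itanium C++
ABI, where the pointer sits at the start of the object.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a Thread-like class. It is polymorphic, so
// common ABIs place a vtable pointer at the start of the object.
struct PolymorphicThread {
    virtual ~PolymorphicThread() {}
    char gc_data[96];  // hypothetical GC-specific block, first declared field
};

// The same field without any virtual functions, for comparison.
struct PlainThread {
    char gc_data[96];
};

// Byte distance from the start of the object to its gc_data field,
// measured at run time rather than with offsetof, which is only
// conditionally supported for non-standard-layout types.
template <typename T>
std::ptrdiff_t gc_data_offset(const T& obj) {
    return reinterpret_cast<const char*>(&obj.gc_data) -
           reinterpret_cast<const char*>(&obj);
}
```

On a typical 64-bit build, gc_data_offset(PlainThread{}) gives 0 while
gc_data_offset(PolymorphicThread{}) gives 8, so the first field of a
polymorphic class still gets a disp8 encoding but never offset 0.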

Thanks,
/Erik

> Thanks,
> -Aleksey


