Request for reviews (M): 7059037: Use BIS for zeroing on T4

Thu Aug 25 09:42:03 PDT 2011

Hi Vladimir,

This looks like a good starting point. Have you already seen the comments I added to bug 7059037?
I have pasted them below.

Kind regards,
Martin D

I'm aware of two ways to use block initializing instructions for TLAB
initialization that are easy to implement but problematic:

1. Use them in ClearArray. The problem here is that objects are not cache line
aligned in general, so we need to clear the slow way before (and after?) a cache line
boundary. This is not difficult to implement, but it adds considerable overhead and
does not avoid fetching cache lines from memory at the beginning (end?) of objects.

2. Use them in zero_to_words and activate -XX:+ZeroTLAB. This clears
whole TLABs when they get allocated. It doesn't perform well when TLABs get
large and cache lines get squeezed out to other levels of the memory hierarchy.
(BTW: filling with badHeapWordVal in ThreadLocalAllocBuffer::allocate breaks
the ZeroTLAB functionality in debug builds; maybe we should open a new bug for it.)
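
To make the overhead in approach 1 concrete, here is a minimal sketch of the head/middle/tail split. It is plain C++ and not HotSpot code: clear_array_sketch, block_init_line, ClearStats and the 64-byte line size are all hypothetical names and values chosen for illustration, and memset stands in for the block initializing store.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Assumed cache line size; the real value is platform specific.
constexpr std::uintptr_t kCacheLineBytes = 64;

// How many bytes went through each path (for illustration only).
struct ClearStats {
  std::size_t head_bytes;   // cleared the slow way before the first line boundary
  std::size_t block_lines;  // whole cache lines cleared by the "block" path
  std::size_t tail_bytes;   // cleared the slow way after the last line boundary
};

// Hypothetical stand-in for a block initializing store: zeroes one whole,
// cache-line-aligned line. This is a plain memset, not a real BIS instruction.
inline void block_init_line(unsigned char* line) {
  assert(reinterpret_cast<std::uintptr_t>(line) % kCacheLineBytes == 0);
  std::memset(line, 0, kCacheLineBytes);
}

// Clears [start, start + size_bytes) and reports which path handled what.
inline ClearStats clear_array_sketch(unsigned char* start, std::size_t size_bytes) {
  const std::uintptr_t begin = reinterpret_cast<std::uintptr_t>(start);
  const std::uintptr_t end = begin + size_bytes;
  // First line boundary at or after begin, last line boundary at or before end.
  const std::uintptr_t first_line =
      (begin + kCacheLineBytes - 1) & ~(kCacheLineBytes - 1);
  const std::uintptr_t last_line = end & ~(kCacheLineBytes - 1);

  ClearStats stats{0, 0, 0};
  if (first_line >= last_line) {
    // The range contains no whole cache line: everything goes the slow way.
    std::memset(start, 0, size_bytes);
    stats.head_bytes = size_bytes;
    return stats;
  }

  // Unaligned head: slow path.
  stats.head_bytes = static_cast<std::size_t>(first_line - begin);
  std::memset(start, 0, stats.head_bytes);

  // Aligned middle: one block store per cache line.
  for (std::uintptr_t addr = first_line; addr < last_line; addr += kCacheLineBytes) {
    block_init_line(start + (addr - begin));
    ++stats.block_lines;
  }

  // Unaligned tail: slow path again.
  stats.tail_bytes = static_cast<std::size_t>(end - last_line);
  std::memset(start + (last_line - begin), 0, stats.tail_bytes);
  return stats;
}
```

In this sketch a 200-byte object starting 8 bytes into a cache line gets only two of its lines through the block path. The remaining 72 bytes take the slow path, and it is those partially covered lines at the start and end of the object that can still be fetched from memory.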

My new proposal is to combine the zeroing with the prefetching. We only have to
make sure that we always clear up to some distance beyond the object being allocated.
Then we can disable the ClearArray nodes, as is done when ZeroTLAB is used. We already
have tlab_pf_top_offset, which is used with AllocatePrefetchStyle==2. Block initializing
prefetching could be implemented using the same kind of prefetch watermark. This should
be easy to implement if we align the TLABs to cache line boundaries and use a size
that is divisible by the cache line size (which shouldn't be a bad thing for any
platform).

We could use an AllocatePrefetchDistance of one cache line beyond new_eden_top, which
probably makes sense, but experimenting with it might still be interesting because some
processors use automatic hardware prefetching, which can interfere with what we're doing.
We should probably clear far enough ahead that the hardware prefetch engine doesn't overtake us.
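
A rough sketch of the watermark idea, in plain C++ rather than HotSpot code. TlabSketch, make_tlab, tlab_allocate, zero_watermark and both constants are hypothetical names and values made up for illustration; zero_watermark only plays a role analogous to the existing prefetch watermark, and memset stands in for the block initializing store. The sketch assumes the TLAB start is cache line aligned and its size is a multiple of the line size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kLineBytes = 64;     // assumed cache line size
constexpr std::size_t kZeroDistance = 64;  // assumed distance: one line ahead

// Hypothetical, simplified TLAB with a zeroing watermark.
struct TlabSketch {
  unsigned char* top;             // next free byte (bump pointer)
  unsigned char* zero_watermark;  // everything below this is already zeroed
  unsigned char* end;             // end of the buffer
};

inline TlabSketch make_tlab(unsigned char* start, std::size_t size_bytes) {
  // The scheme relies on an aligned start and a size divisible by the line size.
  assert(reinterpret_cast<std::uintptr_t>(start) % kLineBytes == 0);
  assert(size_bytes % kLineBytes == 0);
  return TlabSketch{start, start, start + size_bytes};
}

// Bump-pointer allocation that zeroes whole lines ahead of the new top, so the
// returned object is already cleared and needs no separate clearing step.
inline unsigned char* tlab_allocate(TlabSketch& tlab, std::size_t size_bytes) {
  if (size_bytes > static_cast<std::size_t>(tlab.end - tlab.top)) {
    return nullptr;  // does not fit; a real VM would take a slow path here
  }
  unsigned char* new_top = tlab.top + size_bytes;

  // Zero up to kZeroDistance past new_top, rounded up to a line boundary and
  // never past the end of the TLAB (end is line aligned, so rounding is safe).
  const std::size_t room = static_cast<std::size_t>(tlab.end - new_top);
  const std::size_t ahead = kZeroDistance < room ? kZeroDistance : room;
  const std::uintptr_t new_top_addr = reinterpret_cast<std::uintptr_t>(new_top);
  const std::uintptr_t aligned =
      (new_top_addr + ahead + kLineBytes - 1) &
      ~static_cast<std::uintptr_t>(kLineBytes - 1);
  unsigned char* target = new_top + (aligned - new_top_addr);

  // One "block store" per line; the watermark always stays line aligned.
  while (tlab.zero_watermark < target) {
    std::memset(tlab.zero_watermark, 0, kLineBytes);
    tlab.zero_watermark += kLineBytes;
  }

  unsigned char* obj = tlab.top;
  tlab.top = new_top;
  return obj;
}
```

The property that matters is that zero_watermark never falls below top after an allocation, so every object handed out lies entirely in memory that was cleared one whole line at a time. kZeroDistance is the knob to experiment with if a hardware prefetcher turns out to interfere.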

-----Original Message-----
From: hotspot-compiler-dev-bounces at openjdk.java.net [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Vladimir Kozlov
Sent: Thursday, August 25, 2011 02:52
To: hotspot compiler
Subject: Request for reviews (M): 7059037: Use BIS for zeroing on T4

http://cr.openjdk.java.net/~kvn/7059037/webrev

7059037: Use BIS for zeroing on T4

On T4, a BIS to the beginning of a cache line always zeros it. Use it for zeroing newly
allocated Java objects. The main code is in MacroAssembler::bis_zeroing() and is
used by C2 generated code (ClearArray), the runtime (Copy::fill_to_aligned_words())
and the template interpreter (TemplateTable::_new()). A new stub, zero_aligned_words,
was added for use in the runtime.

BIS is used only for objects bigger than BlkZeroingLowLimit (2 Kbytes) since it
requires a membar. 2 Kbytes was selected based on microbenchmark results.
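
The size cutoff described above can be pictured with the following sketch. It is illustrative C++, not the code in the webrev: choose_zero_path, ZeroPath and kBlkZeroingLowLimitBytes are hypothetical names, the constant merely mirrors the 2 Kbyte value mentioned above, and the strict "greater than" comparison is an assumption based on the wording "bigger than".

```cpp
#include <cstddef>

// Assumed to mirror the BlkZeroingLowLimit value mentioned in the text (2 Kbytes).
constexpr std::size_t kBlkZeroingLowLimitBytes = 2 * 1024;

// The two ways an object could be cleared in this sketch.
enum class ZeroPath {
  kPlainStores,          // ordinary stores, no memory barrier needed
  kBlockInitThenMembar,  // block initializing stores followed by a membar
};

// Picks the block path only when the object is large enough that the fixed
// cost of the trailing membar is expected to be amortized.
constexpr ZeroPath choose_zero_path(std::size_t size_bytes) {
  return size_bytes > kBlkZeroingLowLimitBytes ? ZeroPath::kBlockInitThenMembar
                                               : ZeroPath::kPlainStores;
}
```

The point is only that the membar is a fixed cost per zeroing operation, so small objects stay on the ordinary store path.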

I also added the wrasi(Reg, immI) instruction, which I used during development.
VM_Version::has_mru_blk_init() is replaced with has_blk_zeroing() since the original
was not used.
Zap the new object in CollectedHeap::allocate_from_tlab_slow() instead of zeroing it
since it will be cleared later in init_obj().
Fixed call sites of check_for_bad_heap_word_value() where the klass is not
initialized, to avoid the verification failure.