RFR (S) 8146801: Allocating short arrays of non-constant size is slow

Tue Mar 1 01:01:44 UTC 2016

Does the CPU you tested on support ERMS (fast stos)?

clear_mem() is used only in the .ad files, which are C2-only. You can put it
under #ifdef COMPILER2 and then you can access Matcher::init_array_short_size.

Why does x86_32.ad not have similar changes?

Should we really care about old CPUs (UseFastStosb == false)?

Use the short branch instructions jccb and jmpb!

movptr(Address(base, cnt, Address::times_ptr), 0); is too big: it encodes a
32-bit immediate. You have RAX for that.
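
To make the size difference concrete, here is a rough sketch of the two
encodings. It assumes base is RDI and cnt is RCX (the registers rep stos
uses; the mapping is my assumption, it is not spelled out above) and a
64-bit VM. The constant and function names are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed register assignment: base = RDI, cnt = RCX, zero source = RAX.

// mov qword ptr [rdi + rcx*8], 0
// REX.W, opcode C7 /0, ModRM, SIB, then a 4-byte immediate.
constexpr std::uint8_t kStoreImm[] = {0x48, 0xC7, 0x04, 0xCF,
                                      0x00, 0x00, 0x00, 0x00};

// mov qword ptr [rdi + rcx*8], rax
// REX.W, opcode 89 /r, ModRM, SIB, no immediate.
constexpr std::uint8_t kStoreRax[] = {0x48, 0x89, 0x04, 0xCF};

// Bytes saved per store by using the register form.
inline std::size_t bytes_saved() {
  assert(sizeof(kStoreImm) > sizeof(kStoreRax));
  return sizeof(kStoreImm) - sizeof(kStoreRax);
}
```

Under these assumptions the register form is half the size of the
immediate form, which matters inside a tight store loop.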

The label declarations (except DONE) and bind(LONG); should be inside
if (!is_large) { since they are only used there.

You have too many jumps in the code. I would suggest the following:

   Label DONE;

   xorptr(tmp, tmp);

   if (!is_large) {
     Label LOOP, LONG;
     cmpptr(cnt, InitArrayShortSize/BytesPerLong);
     jccb(Assembler::greater, LONG);

     decrement(cnt);
     jccb(Assembler::negative, DONE); // Zero length

     // convert to number of 32-bit words for 32-bit VM
     NOT_LP64(shlptr(cnt, 1);)

     BIND(LOOP);
     movptr(Address(base, cnt, Address::times_ptr), tmp);
     decrement(cnt);
     jccb(Assembler::greaterEqual, LOOP);

     BIND(LONG);
   }
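
For reference, the intended control flow can be modelled in plain C++.
This is only a sketch under assumptions: it covers the 64-bit case, memset
stands in for the rep stos path, kInitArrayShortSize is a made-up
placeholder value rather than the real default, and the short loop is
assumed to jump to DONE when it finishes (the snippet above stops before
that part). The name clear_mem_model is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Placeholder threshold in bytes; NOT the actual InitArrayShortSize default.
constexpr std::ptrdiff_t kInitArrayShortSize = 64;
constexpr std::ptrdiff_t kBytesPerLong = 8;

// Zero `cnt` 64-bit words starting at `base`.
// Returns true if the short store loop was used, false for the bulk path.
bool clear_mem_model(std::uint64_t* base, std::ptrdiff_t cnt, bool is_large) {
  assert(cnt >= 0);
  const std::uint64_t tmp = 0;  // models xorptr(tmp, tmp)

  if (!is_large && cnt <= kInitArrayShortSize / kBytesPerLong) {
    // Models decrement(cnt) plus the LOOP: store from the last word down
    // to index 0. A zero count skips the loop entirely (the DONE exit).
    for (std::ptrdiff_t i = cnt - 1; i >= 0; --i) {
      base[i] = tmp;
    }
    return true;  // assumed jump to DONE
  }

  // Models the LONG path; memset is a stand-in for rep stos.
  std::memset(base, 0, static_cast<std::size_t>(cnt) * sizeof(std::uint64_t));
  return false;
}
```

With the placeholder threshold, counts of up to 8 words take the store loop
and anything longer (or anything marked is_large) goes to the bulk path.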

I was thinking maybe we should do it in the Ideal graph instead of in
assembler. But that could trigger the Fill array or split-iterations
optimizations, which may not be good for such small arrays.

Thanks,
Vladimir

On 2/29/16 1:02 PM, Aleksey Shipilev wrote:
> Hi,
>
> Object storage zeroing uses "rep stos" instructions on x86, which are
> fast on long lengths, but have the setup penalty. We successfully avoid
> that penalty when zeroing the objects of known lengths (all objects and
> arrays of constant sizes). However, we don't do anything for arrays of
> non-constant sizes, which are very frequent.
>
> See more details here:
>    https://bugs.openjdk.java.net/browse/JDK-8146801
>
> Patch:
>    http://cr.openjdk.java.net/~shade/8146801/webrev.02/
>
> The core of the changes is at MacroAssembler::clear_mem.
>
> The rest is collateral:
>    a) pulling InitArrayShortSize from Matchers to global VM options to
> get the access to it in MacroAssembler;
>    b) dragging ClearArrayNode::_is_large when ClearArrayNode::Ideal bails
> on large constant length -- otherwise we produce effectively dead code
> for short loop in MacroAssembler, that is never taken.
>
> With this patch, the allocation performance for small arrays is improved
> 3-4x. Performance data and disassemblies:
>    http://cr.openjdk.java.net/~shade/8146801/notes.txt
>
> Testing: JPRT -testset hotspot; targeted microbenchmarks; RBT
> hotspot/test/:hotspot_all,vm.runtime.testlist,vm.compiler.testlist
>
> Cheers,
> -Aleksey
>
>