review (M) for 4809552: Optimize Arrays.fill(...)

Fri Aug 20 12:49:25 PDT 2010

It seems like a bit of a mixed bag.  Moderately sized fills are slightly slower because of the extra alignment but larger fills are faster.  I'll play with it some more.

tom

On Aug 20, 2010, at 11:46 AM, Vladimir Kozlov wrote:

> Tom Rodriguez wrote:
>>> Use next movdqa since you aligned address:
>>> +         movdqa(Address(to, 0), xtmp);
>>> +         movdqa(Address(to, 16), xtmp);
>>> 
>>> instead of
>>> +         movq(Address(to, 0), xtmp);
>>> +         movq(Address(to, 8), xtmp);
>>> +         movq(Address(to, 16), xtmp);
>>> +         movq(Address(to, 24), xtmp);
>> But it's only aligned to 8 bytes, not 16.  maybe it would be worth it to align to 16?
> 
> Sorry, you are right, it requires 16 not 8 bytes. :(
> I think it worth to align to 16 since it will benefit all x86.
> 
>      movdl(xtmp, value);
>      pshufd(xtmp, xtmp, 0);
> 
> +     // align to 16 bytes, we know we are 8 byte aligned to start
> +     Label L_skip_align16;
> +     testptr(to, 8);
> +     jccb(Assembler::zero, L_skip_align16);
> +     subl(count, 2<<shift);
> +     jcc(Assembler::below, L_copy_4_bytes); // Short arrays (< 8 bytes)
> +     movq(Address(to, 0), xtmp);
> +     addptr(to, 8);
> +     BIND(L_skip_align16);
> 
>      subl(count, 8 << shift);
>      jcc(Assembler::less, L_check_fill_8_bytes);
>      align(16);
> 
> Vladimir
> 
>> tom
>>> Vladimir
>>> 
>>> Vladimir Kozlov wrote:
>>>> Tom,
>>>> First, I would not call these changes Medium. They are Large at least.
>>>> Should we allow OptimizeFill only when UseLoopPredicate is true?
>>>> loopTransform.cpp:
>>>> In match_fill_loop() should we exclude StoreCMNode also?
>>>> RAW store check is hidden in as_AddP()->unpack_offsets(). Should we do it explicitly?
>>>> store and store_value is not set for "copy candidate":
>>>> +         if (value->is_Load() && lpt->_body.contains(value)) {
>>>> +           // tty->print_cr("possible copy candidate");
>>>> +         } else {
>>>> +           msg  = "variant store value";
>>>> +         }
>>>> Why you assume that on 'else' it is mem_phi?:
>>>> +       if (n == head->phi()) {
>>>> +         // ok
>>>> +       } else {
>>>> +         // mem_phi
>>>> +       }
>>>> Should we also skip proj node (ifFalse) or it is not part of loop body?
>>>> +     } else if (n->is_CountedLoopEnd()) {
>>>> +       // ok so skip it.
>>>> +           msg = "node used outside loop";
>>>>                      ^ is
>>>> How you translate next assert message?:
>>>> +     assert(store_value->is_Load(), "shouldn't only happen for this case");
>>>> the next dump should be under flag and 'msg' should reflect "possible copy" or set msg_node:
>>>> + #ifdef ASSERT
>>>> +     tty->print_cr("possible copy");
>>>> +     store_value->dump();
>>>> +     store->dump();
>>>> + #endif
>>>> +     msg = "variant store in loop";
>>>> For Op_LShiftX there is no check (n->in(1) == head->phi()):
>>>> +       } else if (n->Opcode() == Op_LShiftX) {
>>>> +         shift = n;
>>>> +         assert(type2aelembytes(store->as_Mem()->memory_type(), true) == 1 << shift->in(2)->get_int(), "scale should match");
>>>> s_offs already includes base_offset, see GraphKit::array_element_address():
>>>> +     aligned = ((arrayOopDesc::base_offset_in_bytes(t) + s_offs * element_size) % HeapWordSize == 0);
>>>> Also the above expression is wrong if initial index != 0.
>>>> And actually you don't need to calculate it in match_fill_loop() since
>>>> it is used only in call to StubRoutines::select_fill_function() to verify
>>>> that element type is supported.
>>>> In intrinsify_fill() initial index value is taking into account for aligned
>>>> but base_offset_in_bytes could be already part of offset and you need
>>>> to multiply by element_size only initial index:
>>>> +   if (offset != NULL && head->init_trip()->is_Con()) {
>>>> +     intptr_t offs = offset->find_intptr_t_type()->get_con() + head->init_trip()->get_int();
>>>> +     int element_size = type2aelembytes(t);
>>>> +     aligned = ((arrayOopDesc::base_offset_in_bytes(t) + offs * element_size) % HeapWordSize == 0);
>>>> +   }
>>>> stubRoutines.cpp:
>>>> why you have specialized copies for testing _jint_fill and _jbyte_fill. Is not it covered by TEST_FILL already?
>>>> stubGenerator_sparc.cpp:
>>>> +   //  Generate stub for disjoint short fill.  If "aligned" is true, the
>>>>     ^ Generate stub for array fill.
>>>> +   //      from:  O0
>>>>           ^ to
>>>> +   //      to:    O1
>>>>           ^ value
>>>> O5 is not used and not input argument:
>>>> +     const Register offset    = O5;   // offset from start of arrays
>>>> stubs are generated only for byte,short and int, so allowing bollean, char and float is wrong:
>>>> +     switch (t) {
>>>> +       case T_BOOLEAN:
>>>> +       case T_BYTE:
>>>> +         shift = 2;
>>>> +         break;
>>>> +       case T_CHAR:
>>>> +       case T_SHORT:
>>>> +         shift = 1;
>>>> +         break;
>>>> +       case T_FLOAT:
>>>> +       case T_INT:
>>>> +         shift = 0;
>>>> +         break;
>>>> +       default: ShouldNotReachHere();
>>>> +     }
>>>> The same in assembler_x86.cpp
>>>> In stubGenerator_x86_64.cpp
>>>> new fill_32_bytes_forward() is not used.
>>>> Remove commented code for T_LONG in both stubGenerator_x86_??.cpp
>>>> I did not look on assembler. May be tomorrow.
>>>> Thanks,
>>>> Vladimir
>>>> Tom Rodriguez wrote:
>>>>> 4809552: Optimize Arrays.fill(...)
>>>>> Reviewed-by:
>>>>> 
>>>>> This adds new logic to recognize fill idioms and convert them into a
>>>>> call to an optimized fill routine.  Loop predication creates easily
>>>>> matched loops that are simply replaced with calls to the new assembly
>>>>> stubs.  Currently only 1,2 and 4 byte primitive types are supported.
>>>>> Objects and longs/double will be supported in a later putback.  Tested
>>>>> with runthese, nsk and ctw plus jbb2005.
>>>>> 
>>>>> http://cr.openjdk.java.net/~never/4809552