review (M) for 4809552: Optimize Arrays.fill(...)

Fri Aug 20 11:58:48 PDT 2010

On Aug 20, 2010, at 11:46 AM, Vladimir Kozlov wrote:

> Tom Rodriguez wrote:
>>> Use next movdqa since you aligned address:
>>> +         movdqa(Address(to, 0), xtmp);
>>> +         movdqa(Address(to, 16), xtmp);
>>> 
>>> instead of
>>> +         movq(Address(to, 0), xtmp);
>>> +         movq(Address(to, 8), xtmp);
>>> +         movq(Address(to, 16), xtmp);
>>> +         movq(Address(to, 24), xtmp);
>> But it's only aligned to 8 bytes, not 16.  maybe it would be worth it to align to 16?
> 
> Sorry, you are right, it requires 16 not 8 bytes. :(
> I think it worth to align to 16 since it will benefit all x86.
> 
>      movdl(xtmp, value);
>      pshufd(xtmp, xtmp, 0);
> 
> +     // align to 16 bytes, we know we are 8 byte aligned to start
> +     Label L_skip_align16;
> +     testptr(to, 8);
> +     jccb(Assembler::zero, L_skip_align16);
> +     subl(count, 2<<shift);
> +     jcc(Assembler::below, L_copy_4_bytes); // Short arrays (< 8 bytes)
> +     movq(Address(to, 0), xtmp);
> +     addptr(to, 8);
> +     BIND(L_skip_align16);
> 
>      subl(count, 8 << shift);
>      jcc(Assembler::less, L_check_fill_8_bytes);
>      align(16);

I'll test that out.

tom

> 
> Vladimir
> 
>> tom
>>> Vladimir
>>> 
>>> Vladimir Kozlov wrote:
>>>> Tom,
>>>> First, I would not call these changes Medium. They are Large at least.
>>>> Should we allow OptimizeFill only when UseLoopPredicate is true?
>>>> loopTransform.cpp:
>>>> In match_fill_loop() should we exclude StoreCMNode also?
>>>> RAW store check is hidden in as_AddP()->unpack_offsets(). Should we do it explicitly?
>>>> store and store_value is not set for "copy candidate":
>>>> +         if (value->is_Load() && lpt->_body.contains(value)) {
>>>> +           // tty->print_cr("possible copy candidate");
>>>> +         } else {
>>>> +           msg  = "variant store value";
>>>> +         }
>>>> Why you assume that on 'else' it is mem_phi?:
>>>> +       if (n == head->phi()) {
>>>> +         // ok
>>>> +       } else {
>>>> +         // mem_phi
>>>> +       }
>>>> Should we also skip proj node (ifFalse) or it is not part of loop body?
>>>> +     } else if (n->is_CountedLoopEnd()) {
>>>> +       // ok so skip it.
>>>> +           msg = "node used outside loop";
>>>>                      ^ is
>>>> How you translate next assert message?:
>>>> +     assert(store_value->is_Load(), "shouldn't only happen for this case");
>>>> the next dump should be under flag and 'msg' should reflect "possible copy" or set msg_node:
>>>> + #ifdef ASSERT
>>>> +     tty->print_cr("possible copy");
>>>> +     store_value->dump();
>>>> +     store->dump();
>>>> + #endif
>>>> +     msg = "variant store in loop";
>>>> For Op_LShiftX there is no check (n->in(1) == head->phi()):
>>>> +       } else if (n->Opcode() == Op_LShiftX) {
>>>> +         shift = n;
>>>> +         assert(type2aelembytes(store->as_Mem()->memory_type(), true) == 1 << shift->in(2)->get_int(), "scale should match");
>>>> s_offs already includes base_offset, see GraphKit::array_element_address():
>>>> +     aligned = ((arrayOopDesc::base_offset_in_bytes(t) + s_offs * element_size) % HeapWordSize == 0);
>>>> Also the above expression is wrong if initial index != 0.
>>>> And actually you don't need to calculate it in match_fill_loop() since
>>>> it is used only in call to StubRoutines::select_fill_function() to verify
>>>> that element type is supported.
>>>> In intrinsify_fill() initial index value is taking into account for aligned
>>>> but base_offset_in_bytes could be already part of offset and you need
>>>> to multiply by element_size only initial index:
>>>> +   if (offset != NULL && head->init_trip()->is_Con()) {
>>>> +     intptr_t offs = offset->find_intptr_t_type()->get_con() + head->init_trip()->get_int();
>>>> +     int element_size = type2aelembytes(t);
>>>> +     aligned = ((arrayOopDesc::base_offset_in_bytes(t) + offs * element_size) % HeapWordSize == 0);
>>>> +   }
>>>> stubRoutines.cpp:
>>>> why you have specialized copies for testing _jint_fill and _jbyte_fill. Is not it covered by TEST_FILL already?
>>>> stubGenerator_sparc.cpp:
>>>> +   //  Generate stub for disjoint short fill.  If "aligned" is true, the
>>>>     ^ Generate stub for array fill.
>>>> +   //      from:  O0
>>>>           ^ to
>>>> +   //      to:    O1
>>>>           ^ value
>>>> O5 is not used and not input argument:
>>>> +     const Register offset    = O5;   // offset from start of arrays
>>>> stubs are generated only for byte,short and int, so allowing bollean, char and float is wrong:
>>>> +     switch (t) {
>>>> +       case T_BOOLEAN:
>>>> +       case T_BYTE:
>>>> +         shift = 2;
>>>> +         break;
>>>> +       case T_CHAR:
>>>> +       case T_SHORT:
>>>> +         shift = 1;
>>>> +         break;
>>>> +       case T_FLOAT:
>>>> +       case T_INT:
>>>> +         shift = 0;
>>>> +         break;
>>>> +       default: ShouldNotReachHere();
>>>> +     }
>>>> The same in assembler_x86.cpp
>>>> In stubGenerator_x86_64.cpp
>>>> new fill_32_bytes_forward() is not used.
>>>> Remove commented code for T_LONG in both stubGenerator_x86_??.cpp
>>>> I did not look on assembler. May be tomorrow.
>>>> Thanks,
>>>> Vladimir
>>>> Tom Rodriguez wrote:
>>>>> 4809552: Optimize Arrays.fill(...)
>>>>> Reviewed-by:
>>>>> 
>>>>> This adds new logic to recognize fill idioms and convert them into a
>>>>> call to an optimized fill routine.  Loop predication creates easily
>>>>> matched loops that are simply replaced with calls to the new assembly
>>>>> stubs.  Currently only 1,2 and 4 byte primitive types are supported.
>>>>> Objects and longs/double will be supported in a later putback.  Tested
>>>>> with runthese, nsk and ctw plus jbb2005.
>>>>> 
>>>>> http://cr.openjdk.java.net/~never/4809552