review (M) for 4809552: Optimize Arrays.fill(...)

Fri Aug 20 10:57:59 PDT 2010

Actually I think it does not make sense to use movdqu()
since it still slower on cache line boundary.
Align to 8 bytes always and use only movdqa().

Vladimir

Vladimir Kozlov wrote:
> Assembler part review.
> 
> In stubGenerator_sparc.cpp
> 
> Move next lines above 64 bit value construction:
> 
> +     __ cmp(count, 2<<shift); // Short arrays (< 8 bytes) copy by element
> +     __ brx(Assembler::lessUnsigned, false, Assembler::pn, 
> L_copy_4_bytes); // use unsigned cmp
> +     __ delayed()->nop();
> 
> In assembler_x86.cpp
> 
> You don't need next:
> +     jmpb(L_copy_4_bytes); // all dwords were copied
> 
> Use next movdqa since you aligned address:
> +         movdqa(Address(to, 0), xtmp);
> +         movdqa(Address(to, 16), xtmp);
> 
> instead of
> +         movq(Address(to, 0), xtmp);
> +         movq(Address(to, 8), xtmp);
> +         movq(Address(to, 16), xtmp);
> +         movq(Address(to, 24), xtmp);
> 
> Vladimir
> 
> Vladimir Kozlov wrote:
>> Tom,
>>
>> First, I would not call these changes Medium. They are Large at least.
>>
>> Should we allow OptimizeFill only when UseLoopPredicate is true?
>>
>> loopTransform.cpp:
>>
>> In match_fill_loop() should we exclude StoreCMNode also?
>>
>> RAW store check is hidden in as_AddP()->unpack_offsets(). Should we do 
>> it explicitly?
>>
>> store and store_value is not set for "copy candidate":
>>
>> +         if (value->is_Load() && lpt->_body.contains(value)) {
>> +           // tty->print_cr("possible copy candidate");
>> +         } else {
>> +           msg  = "variant store value";
>> +         }
>>
>> Why you assume that on 'else' it is mem_phi?:
>>
>> +       if (n == head->phi()) {
>> +         // ok
>> +       } else {
>> +         // mem_phi
>> +       }
>>
>> Should we also skip proj node (ifFalse) or it is not part of loop body?
>>
>> +     } else if (n->is_CountedLoopEnd()) {
>> +       // ok so skip it.
>>
>> +           msg = "node used outside loop";
>>                        ^ is
>>
>> How you translate next assert message?:
>> +     assert(store_value->is_Load(), "shouldn't only happen for this 
>> case");
>>
>> the next dump should be under flag and 'msg' should reflect "possible 
>> copy" or set msg_node:
>>
>> + #ifdef ASSERT
>> +     tty->print_cr("possible copy");
>> +     store_value->dump();
>> +     store->dump();
>> + #endif
>> +     msg = "variant store in loop";
>>
>> For Op_LShiftX there is no check (n->in(1) == head->phi()):
>>
>> +       } else if (n->Opcode() == Op_LShiftX) {
>> +         shift = n;
>> +         assert(type2aelembytes(store->as_Mem()->memory_type(), true) 
>> == 1 << shift->in(2)->get_int(), "scale should match");
>>
>>
>> s_offs already includes base_offset, see 
>> GraphKit::array_element_address():
>> +     aligned = ((arrayOopDesc::base_offset_in_bytes(t) + s_offs * 
>> element_size) % HeapWordSize == 0);
>> Also the above expression is wrong if initial index != 0.
>> And actually you don't need to calculate it in match_fill_loop() since
>> it is used only in call to StubRoutines::select_fill_function() to verify
>> that element type is supported.
>>
>>
>> In intrinsify_fill() initial index value is taking into account for 
>> aligned
>> but base_offset_in_bytes could be already part of offset and you need
>> to multiply by element_size only initial index:
>>
>> +   if (offset != NULL && head->init_trip()->is_Con()) {
>> +     intptr_t offs = offset->find_intptr_t_type()->get_con() + 
>> head->init_trip()->get_int();
>> +     int element_size = type2aelembytes(t);
>> +     aligned = ((arrayOopDesc::base_offset_in_bytes(t) + offs * 
>> element_size) % HeapWordSize == 0);
>> +   }
>>
>> stubRoutines.cpp:
>> why you have specialized copies for testing _jint_fill and 
>> _jbyte_fill. Is not it covered by TEST_FILL already?
>>
>> stubGenerator_sparc.cpp:
>> +   //  Generate stub for disjoint short fill.  If "aligned" is true, the
>>       ^ Generate stub for array fill.
>>
>> +   //      from:  O0
>>             ^ to
>> +   //      to:    O1
>>             ^ value
>>
>> O5 is not used and not input argument:
>> +     const Register offset    = O5;   // offset from start of arrays
>>
>> stubs are generated only for byte,short and int, so allowing bollean, 
>> char and float is wrong:
>> +     switch (t) {
>> +       case T_BOOLEAN:
>> +       case T_BYTE:
>> +         shift = 2;
>> +         break;
>> +       case T_CHAR:
>> +       case T_SHORT:
>> +         shift = 1;
>> +         break;
>> +       case T_FLOAT:
>> +       case T_INT:
>> +         shift = 0;
>> +         break;
>> +       default: ShouldNotReachHere();
>> +     }
>>
>> The same in assembler_x86.cpp
>>
>> In stubGenerator_x86_64.cpp
>> new fill_32_bytes_forward() is not used.
>>
>> Remove commented code for T_LONG in both stubGenerator_x86_??.cpp
>>
>> I did not look on assembler. May be tomorrow.
>>
>> Thanks,
>> Vladimir
>>
>>
>> Tom Rodriguez wrote:
>>> 4809552: Optimize Arrays.fill(...)
>>> Reviewed-by:
>>>
>>> This adds new logic to recognize fill idioms and convert them into a
>>> call to an optimized fill routine.  Loop predication creates easily
>>> matched loops that are simply replaced with calls to the new assembly
>>> stubs.  Currently only 1,2 and 4 byte primitive types are supported.
>>> Objects and longs/double will be supported in a later putback.  Tested
>>> with runthese, nsk and ctw plus jbb2005.
>>>
>>> http://cr.openjdk.java.net/~never/4809552