Performance of Arrays.fill for primitives and references
Marko Topolnik
marko.topolnik at gmail.com
Wed Feb 5 00:58:37 PST 2014
Hi Vladimir,
thank you for your time and very clear answers. I am learning a lot from this.
Regards,
Marko
On 4. velj. 2014., at 23:28, Vladimir Kozlov <vladimir.kozlov at oracle.com> wrote:
> On 2/4/14 1:18 PM, Marko Topolnik wrote:
>> Hi Vladimir,
>>
>> Thank you for your answer, that settles the matter. As a disclaimer, I do not have an actual use case where I would miss the performance advantage in question, and it is probably hard to come by such a use case given the other overheads of handling objects.
>>
>> I was also surprised to see an actual `call` CPU instruction instead of inlined code at that position and would be curious to know the reason. Is the `call` a necessity due to the highly custom nature of the hand-coded assembly?
>
> The code is not small, so we decided not to inline it. Look at MacroAssembler::generate_fill() in macroAssembler_x86.cpp:
>
> http://hg.openjdk.java.net/hsx/hotspot-main/hotspot/file/54f0c207dc35/src/cpu/x86/vm/macroAssembler_x86.cpp
>
> Regards,
> Vladimir
>
>>
>> Regards,
>> Marko
>>
>> On 4. velj. 2014., at 19:13, Vladimir Kozlov <vladimir.kozlov at oracle.com> wrote:
>>
>>> Hi Marko,
>>>
>>> For primitive arrays we use handwritten assembler code which uses XMM registers as vectors for initialization. For object arrays we did not optimize it because it is not a common case. We could improve it similarly to what we did for arraycopy, but we decided to leave it for now.
>>>
>>> Regards,
>>> Vladimir
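For anyone following along: as I understand it, C2 recognizes a canonical fill loop and routes it to that stub; below is a minimal sketch of the loop shape, not HotSpot source, and the -XX:OptimizeFill flag name is my assumption of what controls the replacement.

    // Sketch (plain Java, not HotSpot source) of the loop shape that, as I
    // understand it, C2's fill idiom recognition replaces with a call to the
    // stub built by MacroAssembler::generate_fill(); Arrays.fill(int[], int)
    // boils down to the same shape.
    static void fillInts(int[] a, int v) {
        for (int i = 0; i < a.length; i++) {
            a[i] = v;    // counted loop, constant-stride store of one value
        }
    }

If my reading is right, running such a loop under -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly should show the same opaque call that I quote further down.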
>>>
>>> On 2/4/14 2:08 AM, Marko Topolnik wrote:
>>>> I have encountered a large discrepancy in the performance of
>>>> Arrays.fill(int[], int) vs. Arrays.fill(Object[], Object). From jmh:
>>>>
>>>> Benchmark            Mode  Samples      Mean  Mean error  Units
>>>> fillIntArray         avgt       10   802.393      19.323  ns/op
>>>> fillReferenceArray   avgt       10  5323.516     105.982  ns/op
>>>>
>>>> The array size is 8192, which means that filling an int array runs at
>>>> above 10 slots per nanosecond (2.66 GHz Intel Core i7)---this sounds like
>>>> fantastic performance.
>>>>
>>>> My question is, what is stopping HotSpot from applying the same
>>>> optimization to a reference array?
>>>>
>>>> One guess was the maintenance of the card table, but I couldn't get
>>>> enough information to confirm that. A naive view of optimization
>>>> opportunities seems to indicate that the card table could be updated
>>>> wholesale either before or after writing out the array.
>>>>
>>>> Printing assembly code, all I can see in the fill(Object[],Object) case
>>>> is the explicit loop involving the write barrier on each slot write:
>>>>
>>>> lea eax, [edx+ebx*4+0x10]
>>>> mov [eax], edx
>>>> shr eax, 9
>>>> mov [edi+eax], ah
>>>>
>>>> whereas for fill(int[],int) I see an opaque
>>>>
>>>> mov edx, 0x0000000011145ca0
>>>> call edx
>>>>
>>>> for the whole filling operation.
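To make the card-table speculation concrete, here is a rough, self-contained simulation in plain Java (not HotSpot code) of the two strategies: the per-slot card mark that the loop above performs, and the wholesale card-range update I guessed at. The 512-byte card size matches the shr eax, 9; the 16-byte header, 4-byte compressed slots, and the cardTable and baseAddr parameters are illustrative stand-ins, not real JVM internals.

    import java.util.Arrays;

    class CardMarkSketch {
        static final int CARD_SHIFT = 9;   // 512-byte cards, matching "shr eax, 9"
        static final byte DIRTY = 0;

        // What the compiled fill(Object[], Object) loop effectively does:
        // store each reference, then dirty the card covering that slot.
        static void fillPerSlotBarrier(Object[] a, Object v, byte[] cardTable, long baseAddr) {
            for (int i = 0; i < a.length; i++) {
                a[i] = v;
                long slotAddr = baseAddr + 16 + 4L * i;              // assumed header + 4-byte slots
                cardTable[(int) (slotAddr >>> CARD_SHIFT)] = DIRTY;  // per-element card mark
            }
        }

        // The "wholesale" idea: write all the elements, then dirty the whole
        // covered card range once, before or after the stores.
        static void fillBulkCardMark(Object[] a, Object v, byte[] cardTable, long baseAddr) {
            if (a.length == 0) return;
            Arrays.fill(a, v);
            int first = (int) ((baseAddr + 16) >>> CARD_SHIFT);
            int last  = (int) ((baseAddr + 16 + 4L * (a.length - 1)) >>> CARD_SHIFT);
            Arrays.fill(cardTable, first, last + 1, DIRTY);
        }
    }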
>>>>
>>>>
>>>> I was testing on 64-Bit Server VM (build 24.0-b56, mixed mode) with
>>>> default settings.
>>>>
>>>> For reference, this is the code I have used:
>>>>
>>>> @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>>> @BenchmarkMode(Mode.AverageTime)
>>>> @Warmup(iterations = 5, time = 1)
>>>> @Measurement(iterations = 10, time = 2)
>>>> @State(Scope.Thread)
>>>> @Threads(1)
>>>> @Fork(1)
>>>> public class Writing
>>>> {
>>>>     static final int TARGET_SIZE = 1 << 13;
>>>>
>>>>     static final int[] intArray = new int[TARGET_SIZE];
>>>>     static final Object[] referenceArray = new Object[TARGET_SIZE];
>>>>
>>>>     int intVal = 1;
>>>>
>>>>     @GenerateMicroBenchmark
>>>>     public void fillIntArray() {
>>>>         Arrays.fill(intArray, intVal++);
>>>>     }
>>>>
>>>>     @GenerateMicroBenchmark
>>>>     public void fillReferenceArray() {
>>>>         Arrays.fill(referenceArray, new Object());
>>>>     }
>>>> }
>>