Blackhole.consume(Object) has different semantics to Blackhole.consume(primitive)

Wed Nov 12 12:49:17 UTC 2014

Hi Nitsan,

On 11/12/2014 01:46 AM, Nitsan Wakart wrote:
> I'm not saying one is better than the other, or that the unlikely
> deoptimization is a massive issue, but the slight semantic
> differences can lead to surprising effects caused by switching from
> one consume method to the other. Can we settle on a method? is there
> a good reason to maintain special treatment?

There is a choice between performance and consistency. Primitive
consumes avoid the writes completely, and that is their benefit.
Reference consumes cannot employ the same trick, so they fall back to
the second-best option: a PRNG-predicated heap write.
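
To make the two strategies concrete, here is a minimal Java sketch. It
is not the actual JMH source: the class name, the field names, the fixed
seed, the LCG constants, and the write counters are all assumptions
chosen for illustration. It only shows the general shape of a primitive
consume that compares against volatile reads and a reference consume
whose heap write is predicated on a cheap PRNG.

```java
final class SketchBlackhole {
    // Assumption: two volatile fields that never hold the same value,
    // so the comparison in consume(int) cannot succeed.
    volatile int i1 = 1;
    volatile int i2 = 2;
    int intSink;
    int intWrites;          // bookkeeping for this sketch only

    // Assumption: PRNG state and mask for the reference path.
    int tlr = 12345;        // arbitrary fixed seed for the sketch
    int tlrMask = 1;
    Object objSink;
    int objWrites;          // bookkeeping for this sketch only

    // Primitive path: two volatile reads and a compare, no heap write.
    void consume(int v) {
        if (v == i1 & v == i2) {
            intSink = v;    // never reached while i1 != i2
            intWrites++;
        }
    }

    // Reference path: advance a simple LCG and write to the heap only
    // when the masked PRNG value is zero. The mask widens after every
    // write, so writes become progressively rarer.
    void consume(Object o) {
        int t = (tlr = tlr * 1664525 + 1013904223);
        if ((t & tlrMask) == 0) {
            objSink = o;
            objWrites++;
            tlrMask = (tlrMask << 1) + 1;
        }
    }

    public static void main(String[] args) {
        SketchBlackhole bh = new SketchBlackhole();
        Object payload = new Object();
        for (int i = 0; i < 1_000_000; i++) {
            bh.consume(i);
            bh.consume(payload);
        }
        System.out.println("primitive-path writes: " + bh.intWrites);
        System.out.println("reference-path writes: " + bh.objWrites);
    }
}
```

In this sketch the primitive path never writes, while the reference
path still updates its PRNG state on every call and performs an
occasional heap write. That extra work is what the reference path pays
for.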

Given that the code producing primitives and the code producing
references are already quite different, it feels odd to trade away
performance for consistency that is already broken. This is how much
you would pay for consistency:

$ java -jar jmh-core-benchmarks/target/benchmarks.jar BlackholeBench -wi 5 -i 5 -f 1

x86_64, i7-4790K:

 Benchmark               Mode  Samples  Score   Error  Units
 baseline                avgt        5  0.252 ± 0.002  ns/op
 implicit_testArray      avgt        5  2.271 ± 0.049  ns/op
 implicit_testBoolean    avgt        5  0.629 ± 0.009  ns/op
 implicit_testByte       avgt        5  0.629 ± 0.005  ns/op
 implicit_testChar       avgt        5  0.631 ± 0.005  ns/op
 implicit_testDouble     avgt        5  0.747 ± 0.032  ns/op
 implicit_testFloat      avgt        5  0.678 ± 0.015  ns/op
 implicit_testInt        avgt        5  0.629 ± 0.005  ns/op
 implicit_testLong       avgt        5  0.633 ± 0.013  ns/op
 implicit_testObject     avgt        5  2.267 ± 0.037  ns/op
 implicit_testShort      avgt        5  0.629 ± 0.002  ns/op

That is, reference consumes cost 3x-4x more than primitive ones on x86.
Switching to PRNG-predicated writes in the primitive cases seems odd
with data like that.
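
The 3x-4x figure can be checked directly against the x86_64 table. The
small Java sketch below divides the raw reference scores by the raw
primitive scores, without subtracting the baseline. The class and
method names are made up for illustration; the numbers are copied from
the table.

```java
final class ConsumeCostRatio {
    // Scores in ns/op, copied from the x86_64 table above.
    static final double OBJECT = 2.267;  // implicit_testObject
    static final double ARRAY  = 2.271;  // implicit_testArray
    static final double INT    = 0.629;  // implicit_testInt (fastest group)
    static final double DOUBLE = 0.747;  // implicit_testDouble (slowest primitive)

    // How many times more expensive the reference consume is.
    static double ratio(double reference, double primitive) {
        return reference / primitive;
    }

    public static void main(String[] args) {
        System.out.printf("Object vs int:   %.2fx%n", ratio(OBJECT, INT));
        System.out.printf("Array vs double: %.2fx%n", ratio(ARRAY, DOUBLE));
    }
}
```

The two ratios come out at roughly 3.6x and 3.0x, which brackets the
range quoted above.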

ARMv7, Cortex-A9:

 Benchmark               Mode  Samples   Score   Error  Units
 baseline                avgt        5   5.291 ± 0.000  ns/op
 implicit_testArray      avgt        5  11.757 ± 0.001  ns/op
 implicit_testBoolean    avgt        5  13.524 ± 0.002  ns/op
 implicit_testByte       avgt        5  14.109 ± 0.001  ns/op
 implicit_testChar       avgt        5  13.521 ± 0.001  ns/op
 implicit_testDouble     avgt        5  14.109 ± 0.001  ns/op
 implicit_testFloat      avgt        5  14.110 ± 0.011  ns/op
 implicit_testInt        avgt        5  13.524 ± 0.002  ns/op
 implicit_testLong       avgt        5  18.815 ± 0.007  ns/op
 implicit_testObject     avgt        5  11.757 ± 0.001  ns/op
 implicit_testShort      avgt        5  14.109 ± 0.001  ns/op

ARM actually ends up more or less consistent, because the cost of the
volatile reads in the primitive cases offsets the cost of the PRNG
writes in the reference cases.

If you are concerned about the absence of volatile reads in reference
consumes, we may add a volatile "spoiler" there to get the same
effect. That would break the perceived consistency in the ARM case, but
that seems to be the lesser evil.
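
One hypothetical shape for such a spoiler is to make a field that the
reference path already reads volatile, so that every call pays for one
volatile read. The Java sketch below is only an assumption about what
that could look like, not a committed JMH change; the class name, field
names, seed, and LCG constants are made up for illustration.

```java
final class SpoiledReferenceConsume {
    int tlr = 12345;            // arbitrary fixed seed for the sketch
    // Assumption: the mask doubles as the volatile "spoiler".
    volatile int tlrMask = 1;
    Object objSink;
    int objWrites;              // bookkeeping for this sketch only

    void consume(Object o) {
        int mask = tlrMask;     // volatile read on every call
        int t = (tlr = tlr * 1664525 + 1013904223);
        if ((t & mask) == 0) {
            objSink = o;        // rare heap write
            objWrites++;
            tlrMask = (mask << 1) + 1;  // volatile write, equally rare
        }
    }
}
```

The write frequency is unchanged by this; the only addition on the
common path is the volatile read of the mask.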

Thanks,
-Aleksey.