Performance costs of untriggered UL log points?

Thu Feb 8 13:42:56 UTC 2018

Hi Thomas,

On 2018-02-07 17:23, Thomas Stüfe wrote:
> Hi guys,
>
> short question, when creating UL, did anyone measure what an untriggered
> instrumentation point does cost?

No, AFAIK no such detailed measurements were done.
Although, no regressions were found with logging disabled (or even 
enabled) on the larger benchmark suites.

>
> I was the other day comparing performance of a patch of mine with the
> unpatched jdk. To my dismay my patched version was consistently ~5-10%
> slower. But I had logging points in that function, for example:
>
> 1865 log_trace(gc, metaspace, freelist)("Coalescation (%s): created
> coalesced chunk at %p, size " SIZE_FORMAT_HEX ".",
> 1866                (is_class() ? "class space" : "metaspace"),
> 1867                 p_new_chunk, p_new_chunk->word_size() *
> sizeof(MetaWord));
>
> The function did run ~3000 times during my test - so not an indecent amount
> of times, but something which one might conceivably want to do. Logging was
> switched off. I found that commenting out this log line would make the
> difference to the unpatched jdk disappear (runtime went down from ~90ms to
> ~80ms).

I'm assuming this was on a product/fastdebug build? It sounds a bit much 
to me, but I've never really measured the overhead on that level, so I 
can't really say if it is expected or not.

>
> Of course I am happy that my patch is performing well after all, but I am
> surprised too. I would have naively assumed that tracing not switched on
> costs next to nothing.
>
> I am aware that one needs to access some global information in order to
> find out the current tracing level, but maybe there is still room for
> improvement?
>
> In our port we have a very old and simple tracing system which is triggered
> by a simple global bit array. Accessing that is quite cheap, at least I
> never noticed any negative effects like described above.
>
> Kind Regards, Thomas

Right now, an is_level() check boils down to a single compare 
instruction followed by a jump (at least on my workstation (Linux x64)). 
Different tag sets use different addresses, but otherwise it seems 
pretty optimal to me. A bitmap or something like a single global word 
with on/off status for multiple tag sets would improve cache locality, 
but it seems overly complicated IMHO.

Thanks,
Marcus