New allocation profiler

Tue Apr 14 22:45:31 UTC 2015

Hi Vladimir,

Good stuff!

I know you have already submitted an OCA, and the OpenJDK registrar should
contact you when it is processed. We can push the change only after the
OCA is in place. The merge window for the current release closes in a few
days -- would be great to push this in on Wednesday-Thursday.

On 04/12/2015 04:50 PM, Vladimir Sitnikov wrote:
> The idea is to sum allocated bytes over the threads excluding current one.
> This intentionally includes allocation made by background threads, and
> I believe that is expected.

Yes, I would expect the profiler to catch background threads.

> The attached implementation uses just two snapshots (before and
> after). I'm not sure if spinning a background thread doing snapshots
> to capture newly created/destroyed threads is worth doing.

That's the systematic deficiency of such an approach, and we can't really
help it. Even a background sampling thread can miss auxiliary threads
that come and go. Those threads that die without pushing the allocation
data back to us are especially frustrating. In other words, with no way
to be notified about thread creation and destruction, there is no way to
be absolutely accurate.
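
For the archives, here is a rough sketch of the kind of snapshot being
discussed. It is not the attached patch: the class and method names are
made up for illustration, and it assumes a HotSpot-style JVM where the
platform bean implements com.sun.management.ThreadMXBean. Note how a
thread that has already died contributes nothing, which is exactly the
blind spot described above.

```java
import java.lang.management.ManagementFactory;

public class AllocSnapshot {
    // Hypothetical helper: sums the bytes allocated so far by all live
    // threads except the calling one. Returns Double.NaN when the JVM
    // cannot provide the data.
    static double allocatedByOtherThreads() {
        java.lang.management.ThreadMXBean std = ManagementFactory.getThreadMXBean();
        if (!(std instanceof com.sun.management.ThreadMXBean)) {
            return Double.NaN; // not a HotSpot-style bean
        }
        com.sun.management.ThreadMXBean bean = (com.sun.management.ThreadMXBean) std;
        if (!bean.isThreadAllocatedMemorySupported()
                || !bean.isThreadAllocatedMemoryEnabled()) {
            return Double.NaN;
        }

        long self = Thread.currentThread().getId();
        long[] ids = std.getAllThreadIds();                 // live threads only
        long[] bytes = bean.getThreadAllocatedBytes(ids);   // -1 for threads that died meanwhile

        long total = 0;
        for (int i = 0; i < ids.length; i++) {
            if (ids[i] == self) continue;   // exclude the measuring thread
            if (bytes[i] < 0) continue;     // thread is gone, its data is lost
            total += bytes[i];
        }
        return total;
    }
}
```

Taking one such snapshot before and one after the iteration, while the
benchmark threads are still alive, and subtracting the two gives the
estimate. Any thread that starts and dies between the snapshots is
invisible to it.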

I wonder if we should fold this new profiler into -prof gc, which reports
churn rates. Users will then have a more complete picture of what is
going on:
 a) when allocation and churn rates match (even though churn rates will
have much larger errors), you can be reasonably sure about the whole thing;
 b) when allocation rate is lower than churn rate, you know some
allocations are missing from the profiling, prompting the investigation;
 c) when allocation rate is higher than churn rate, you know there is
either a memory leak, or some other kind of problem.
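
To make the three cases concrete, here is a small sketch of that
comparison. Everything in it is hypothetical: the names are invented, and
the relative tolerance is an arbitrary knob standing in for "churn rates
have much larger errors", not something the harness defines.

```java
public class RateCheck {
    enum Verdict { CONSISTENT, ALLOCATIONS_MISSED, POSSIBLE_LEAK }

    // Compares an allocation rate against a churn rate (same units).
    // relTolerance is an assumed, user-chosen relative error bound.
    static Verdict compare(double allocRate, double churnRate, double relTolerance) {
        double scale = Math.max(Math.abs(allocRate), Math.abs(churnRate));
        if (Math.abs(allocRate - churnRate) <= relTolerance * scale) {
            return Verdict.CONSISTENT;           // case a): rates match
        }
        return allocRate < churnRate
                ? Verdict.ALLOCATIONS_MISSED     // case b): profiler missed allocations
                : Verdict.POSSIBLE_LEAK;         // case c): leak or some other problem
    }
}
```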

> It looks like "single shot" mode has some issues and it reports to
> allocate 480 bytes. Does it mean profilers catch a bit of
> setup/teardown in SS mode? 

If you look at the generated stub for SingleShot mode (somewhere at
target/generated-sources/...), you will see the allocations of
RawResults, BenchmarkTaskResult, SingleShotResult. Other modes also
allocate similar things, but those minuscule allocations drown in
millions of "benchmark" allocations.

> Can this be improved?

That would require significant changes. Either minimize the allocations
in the generated stubs (that requires reflowing the stub interface and
probably rethinking the general result handling), or make the generated
code call profilers right before/after calling into the measurement stub
(that requires dealing with synchronization, abrupt exceptions, etc.)

Given SingleShot is a special non-steady state mode that already has
lots of issues for tiny benchmarks, I think fixing the allocation
profile is not worth the effort. Don't use that mode for tiny
benchmarks, and the allocation profile skew will be minimal.

-----------------------------------------------------------------------

Code comments:

 * I would think the counters should be "gc.alloc.rate" and
"gc.alloc.rate.norm". It would send the message that "norm" is a derived
value, not measured directly. It also aligns with what -prof gc does.

 * Do we actually need static initializers? That will initialize the
profiler even when we don't use it. E.g. it will be initialized on any
ProfilerFactory.getAvailableProfilers() call.

 * "// To avoid allocation while returning two values from
getTotalAllocatedMemory" comment references an old method name :) It
took me a while to realize you are using two fields to avoid
constructing a tuple-like wrapper class -- should the comment call that
out better?

 * estimateAllocatedBytes() seems to exclude the current thread from the
estimate. Why bother making the profiler code garbage-free then?

 * Returning -1 as the "absent" result. Double.NaN is the usual way to
denote cases like that.

 * "threadMxBean" and "getThreadAllocatedBytes" should be shorter and/or
capitalized?
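
To illustrate the static initializer and Double.NaN points above, here is
a minimal sketch, not a proposed patch. The class name, the demo counter,
and the method names are all hypothetical. It uses the
initialization-on-demand holder idiom so that merely loading the profiler
class (for example, to list available profilers) does not touch the
MXBean.

```java
import java.lang.management.ManagementFactory;

public class LazyAllocProfiler {
    // Demo-only counter showing how many times the lookup actually ran.
    static int lookups = 0;

    // The JVM initializes Holder on first access to BEAN, not when
    // LazyAllocProfiler itself is loaded or initialized.
    private static final class Holder {
        static final com.sun.management.ThreadMXBean BEAN = lookup();

        private static com.sun.management.ThreadMXBean lookup() {
            lookups++;
            java.lang.management.ThreadMXBean b = ManagementFactory.getThreadMXBean();
            return (b instanceof com.sun.management.ThreadMXBean)
                    ? (com.sun.management.ThreadMXBean) b
                    : null;
        }
    }

    // Cheap metadata call, standing in for what a profiler listing would need.
    static String label() {
        return "hypothetical allocation profiler";
    }

    // Returns Double.NaN, rather than -1, when no data is available.
    static double currentThreadAllocatedBytes() {
        com.sun.management.ThreadMXBean bean = Holder.BEAN;
        if (bean == null || !bean.isThreadAllocatedMemorySupported()) {
            return Double.NaN;
        }
        long bytes = bean.getThreadAllocatedBytes(Thread.currentThread().getId());
        return bytes < 0 ? Double.NaN : bytes;
    }
}
```

With NaN as the "absent" marker, a missing measurement cannot be mistaken
for a real value: any arithmetic done on it stays NaN instead of silently
producing a number.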

Thanks,
-Aleksey.