Decreased latency performance with Stack Walker API compared to sun.misc.JavaLangAccess
mandy chung
mandy.chung at oracle.com
Fri Oct 20 20:18:34 UTC 2017
I created https://bugs.openjdk.java.net/browse/JDK-8189752 to track this.
Mandy
On 10/20/17 10:35 AM, Rafael Winterhalter wrote:
> Hei Mandy,
> thank you for your quick reply.
>
> The benchmark is already only capturing the stack frame object. I ran
> the benchmark with the transformation as a comparison and it is indeed
> slower due to extra work. The JIT code suggests the native method plus
> monitor to be the major cost here but it is of course necessary to
> have it there to fetch the stack information.
>
> Without this transformation in the user thread, capturing stack frame
> objects of 100 frames is however still about 2.5 times as expensive as
> constructing a throwable as a snapshot.
>
> Just to not be misunderstood: the total cost is reduced by using the
> stack walker, even compared to using java lang secrets under Java 8
> but the inability to process the stack outside of a worker thread
> makes it inapplicable for my purposes due to latency requirements.
>
> Thank you for considering this use case and all your great work!
>
> Cheers, Rafael
>
>
> Am 20.10.2017 18:11 schrieb "mandy chung" <mandy.chung at oracle.com
> <mailto:mandy.chung at oracle.com>>:
>
> Hi Rafael,
>
> Thanks for the feedback. We did some investigation in
> understanding the overhead of Throwable if it used StackWalker API
> [1]. It did come to mind whether the StackWalker API should
> provide a way to walk the backtrace which we should do the
> investigation with JDK-8141239.
>
> The benchmark compares capturing 35 StackTraceElements for
> apple-to-apple comparison which is fair. I am curious on the perf
> difference when you capture only StackFrame objects? This would
> save the overhead to construct StackTraceElement objects (and its
> associated string objects).
>
> Mandy
> [1] https://bugs.openjdk.java.net/browse/JDK-8141239
> <https://bugs.openjdk.java.net/browse/JDK-8141239>
>
>
> On 10/20/17 6:32 AM, Rafael Winterhalter wrote:
>
> Hello,
>
> a typical patern when reading the stack of the current thread
> in tooling
> like performance monitoring used to imply the creation of an
> instance of
> Throwable and to process this instance's attached stack in
> another thread.
> The performance cost is shared about 10/90 for creating a new
> throwable
> compared to reading its frames, so this is really a worthy
> optimization.
>
> It is also common to use the JavaLangAccess API which offers
> selective
> access of single frames.
>
> This API does no longer exist as it was superseeded by the
> Stack Walker API
> which is of course much safer and even a more performant
> alternative when
> looking at the total performance. However, using a stack
> walker, it is no
> longer possible to move the stack processing out of the user
> thread but it
> must be done at the moment the snapshot of the stack is taken.
> It turns out
> that this increases latency dramatically when processing
> stacks compared to
> the asyncronous alternative.
>
> In a quick benchmark, it seems like walking 35 frames of a 100
> frames stack
> allows me 70k operations per second whereas creating a new
> throwable yields
> about 200k operations per second. Also, within a less isolated
> test, I can
> infer this additional overhead from the actual latency numbers
> of a web
> service when using the stack walker API to extract the top 35
> frames
> compared to the "old" solution using JavaLangAccess.
>
> For this reason, it seems to be the best solution to avoid the
> stack walker
> when aiming for latency at the moment if the stack is not required
> immediately and if access resources are available in other
> threads.
>
> I would therefore like to propose to extend the stack walker
> API to allow
> walking the stack of an existing throwable to allow for
> similar performance
> as with JavaLangAccess. I understand that the VM must do more work
> altogether. When receving the full stack from a throwable,
> this takes about
> three times as long. In practice, for a product I am involved
> in, this
> casues a noticable overhead when running a Java 9 VM compared
> to Java 8.
>
> Alternatively, it would of course even be better if one could
> take a
> snapshot of only the top x frames to walk on this object
> rather then a
> throwable.
>
> I have added my benchmarks (snapshot for the current user
> thread operation,
> complete for the entire processing) into this Gist:
> https://gist.github.com/raphw/96e7c81d7c719cf7991b361bb7266c70
> <https://gist.github.com/raphw/96e7c81d7c719cf7991b361bb7266c70>
>
> Thank you for any feedback on my finding!
>
> Best regards, Rafael
>
>
>
More information about the core-libs-dev
mailing list