Calls to native methods setCallSiteTargetNormal and defineAnonymousClass taking very long

Benjamin Sieffert benjamin.sieffert at metrigo.de
Fri Sep 20 00:42:43 PDT 2013


(sending this again because I wasn't subscribed the 1st time and hence it
probably got spamfiltered)

Hello everyone,

I have been sent over here from the nashorn-dev mailing list. We're having
trouble with our application getting (rather randomly) very busy with calls
to the methods mentioned in the subject. That is on jdk8; on jdk7 we
exclusively see setCallSiteTargetNormal, which probably has to do with
LambdaForms not being in use on jdk7. Anyway, we're running roughly the
following setup:

- application mantle in JRuby
- application core in Java
- at the VERY core: highly repetitive javascript run via java (switched
from rhino to nashorn recently there)

It's a kind of webserver handling about 2-3k requests per second. Each
request results in a few dozen calls into our javascript. Since we need
our response times to be as low as possible, we don't make the javascript
calls serially, but submit them all to an executor running on a number of
threads equal to the number of available cores (usually 16 or 24 on our
machines, 64-bit Linux), so that ideally we use all cores for every single
request.
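
For illustration, here is a minimal sketch of that fan-out pattern (the
class, function and variable names are made up for the example; this is
not our actual code):

    import javax.script.Invocable;
    import javax.script.ScriptEngine;
    import javax.script.ScriptEngineManager;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ScriptFanOut {
        // one pool for the whole process, sized to the machine
        private static final ExecutorService POOL =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

        private final ScriptEngine engine;

        public ScriptFanOut() throws Exception {
            engine = new ScriptEngineManager().getEngineByName("nashorn");
            // stand-in for the real (much larger, highly repetitive) script
            engine.eval("function score(item) { return item.length; }");
        }

        // fan a request's few dozen script calls out over all cores, then wait
        public List<Object> handle(List<String> items) throws Exception {
            List<Future<Object>> futures = new ArrayList<>();
            for (String item : items) {
                futures.add(POOL.submit(() ->
                        ((Invocable) engine).invokeFunction("score", item)));
            }
            List<Object> results = new ArrayList<>();
            for (Future<Object> f : futures) {
                results.add(f.get());
            }
            return results;
        }
    }

The real scripts are obviously bigger, but the threading pattern is the same.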

In summary: a high level of concurrency and a huge number of classes and
nmethods generated at runtime (both by JRuby and Nashorn).
(Some numbers after 65+ hours of uptime. jdk8: 140,000 classes live, 1M
unloaded, 800 MB of Metaspace used; jdk7: 14,000 live, 2,000 unloaded,
500 MB of PermGen used.)
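
Counts like those can be read straight off the standard ClassLoadingMXBean;
a trivial in-process probe (this is just the stock java.lang.management API,
not our monitoring setup) would be:

    import java.lang.management.ClassLoadingMXBean;
    import java.lang.management.ManagementFactory;

    public class ClassChurnProbe {
        public static void main(String[] args) {
            ClassLoadingMXBean classes = ManagementFactory.getClassLoadingMXBean();
            // "live" and "unloaded" as the JVM itself reports them
            System.out.printf("live=%d unloaded=%d total ever loaded=%d%n",
                    classes.getLoadedClassCount(),
                    classes.getUnloadedClassCount(),
                    classes.getTotalLoadedClassCount());
        }
    }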

As for our actual problem: like I said, sometimes all our worker threads
become 100% busy and, looking at the stack traces, they are all in one of
these two places:

1.
at java.lang.invoke.MethodHandleNatives.setCallSiteTargetNormal(Native Method)
at java.lang.invoke.CallSite.setTargetNormal(CallSite.java:245)
at java.lang.invoke.MutableCallSite.setTarget(MutableCallSite.java:154)
at jdk.internal.dynalink.support.AbstractRelinkableCallSite.initialize(AbstractRelinkableCallSite.java:115)
at jdk.internal.dynalink.DynamicLinker.link(DynamicLinker.java:184)
at jdk.nashorn.internal.runtime.linker.Bootstrap.bootstrap(Bootstrap.java:126)
at java.lang.invoke.LambdaForm$DMH/272755642.invokeStatic_LLLI_L(LambdaForm$DMH)
... (omitting a few invocations of LambdaForm methods)
at java.lang.invoke.LambdaForm$MH/839784351.invoke_MT(LambdaForm$MH)
at java.lang.invoke.CallSite.makeSite(CallSite.java:283)
at java.lang.invoke.MethodHandleNatives.linkCallSite(MethodHandleNatives.java:294)
at jdk.nashorn.internal.scripts.Script$\^eval\_._L196(<eval>:231)
...

2.
at sun.misc.Unsafe.defineAnonymousClass(Native Method)
at java.lang.invoke.InvokerBytecodeGenerator.loadAndInitializeInvokerClass(InvokerBytecodeGenerator.java:262)
at java.lang.invoke.InvokerBytecodeGenerator.loadMethod(InvokerBytecodeGenerator.java:250)
at java.lang.invoke.InvokerBytecodeGenerator.generateCustomizedCode(InvokerBytecodeGenerator.java:509)
at java.lang.invoke.LambdaForm.compileToBytecode(LambdaForm.java:454)
at java.lang.invoke.LambdaForm.checkInvocationCounter(LambdaForm.java:634)
... (omitting a few invocations of LambdaForm methods)
at java.lang.invoke.LambdaForm$LFI/1682860686.interpret_L(LambdaForm$LFI)
at java.lang.invoke.LambdaForm$MH/1997512181.linkToCallSite(LambdaForm$MH)
at jdk.nashorn.internal.scripts.Script$\^eval\_._L196(<eval>:232)
...


Obviously, the application is nearly unresponsive under these conditions
(i.e. it doesn't respond to actual requests from the outside, though you
can connect with jconsole / visualvm / jprofiler just fine and observe it).
CPU is also close to 100% busy with these calls.
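
For reference, the native method in trace 1 is reached through the public
MutableCallSite API. A tiny standalone example of what retargeting a call
site means (plain java.lang.invoke, nothing Nashorn-specific, just to show
the mechanism):

    import java.lang.invoke.MethodHandle;
    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.MethodType;
    import java.lang.invoke.MutableCallSite;

    public class RetargetDemo {
        static int one(int x) { return 1; }
        static int two(int x) { return 2; }

        public static void main(String[] args) throws Throwable {
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            MethodType type = MethodType.methodType(int.class, int.class);

            MutableCallSite site =
                    new MutableCallSite(lookup.findStatic(RetargetDemo.class, "one", type));
            MethodHandle invoker = site.dynamicInvoker();

            System.out.println(invoker.invoke(0));   // prints 1

            // this is the path that ends up in MethodHandleNatives.setCallSiteTargetNormal;
            // compiled code that depends on the old target has to be invalidated
            site.setTarget(lookup.findStatic(RetargetDemo.class, "two", type));

            System.out.println(invoker.invoke(0));   // prints 2
        }
    }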

Now there are a few interesting things to note that might help with
diagnosis.

Maybe most importantly, a full gc (we're using CMS, but
CMSClassUnloadingEnabled does not need to be set) stops the misbehaviour
and returns the application to its normal state. Obviously we'd rather not
run full gcs all the time, but it's also interesting that such full gcs can
be provoked (on jdk7) by setting MaxPermSize low enough, or by not setting
it at all. Apparently the garbage collector then always finds enough to
clean up in PermGen that it never needs to grow beyond some threshold,
though the graph of our PermGen consumption then looks like the one of our
Heap consumption (i.e. up and down all the time), and understandably we no
longer get good response times. So this is not an option to fix the
problem, but it might be a clue to what's going on.

An additional problem in this regard is that PermGen / Metaspace
consumption nearly stops increasing while the problematic condition
persists, which in turn makes a full gc less likely to take place.
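
In case it helps, the PermGen / Metaspace curve and the effect of a
requested full gc can be watched with the stock java.lang.management API.
The pool-name matching below is an assumption about the usual CMS / jdk8
pool names, and System.gc() is of course only a request:

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryPoolMXBean;

    public class ClassMetadataWatch {
        public static void main(String[] args) {
            // print usage of the class-metadata pool ("CMS Perm Gen" on jdk7,
            // "Metaspace" on jdk8); this is the curve that flattens out once
            // the problem starts
            for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                String name = pool.getName();
                if (name.contains("Perm") || name.contains("Metaspace")) {
                    System.out.printf("%s: used=%d max=%d%n",
                            name, pool.getUsage().getUsed(), pool.getUsage().getMax());
                }
            }
            // a full collection is what returns the app to normal for us;
            // System.gc() merely requests one (and is ignored with -XX:+DisableExplicitGC)
            System.gc();
        }
    }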

Another clue is that we have been able to make the problem happen less
often by dialing back the amount of reference replacement we do on our
Nashorn script engines. Those engines have engine-wide bindings (per the
javax.script API) which you can replace simply by calling
.put(bindingName, object) on the engine, even while other threads are in
the middle of calls into it. Apparently this causes a lot of call sites to
be relinked and thus fosters our problem.
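
To make that concrete, the replacement is roughly of this shape (names
invented for the example):

    import javax.script.ScriptEngine;
    import javax.script.ScriptEngineManager;

    public class BindingSwap {
        public static void main(String[] args) throws Exception {
            ScriptEngine engine = new ScriptEngineManager().getEngineByName("nashorn");
            engine.put("config", new Object());   // initial engine-wide binding
            engine.eval("function use() { return config.toString(); }");

            // elsewhere, worker threads keep calling use() through Invocable while a
            // management thread periodically swaps the binding like this; apparently
            // each swap forces call sites that linked against the old value to relink
            engine.put("config", new Object());
        }
    }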

Still, it's hard to tell for sure what's going on, as we have no stack
traces for what happens inside the native code. I hope this gives you some
idea.

Best regards
Benjamin