RFR(S): 8037842: Failing to allocate MethodCounters and MDO causes a serious performance drop

Coleen Phillimore coleen.phillimore at oracle.com
Tue Oct 21 19:52:24 UTC 2014


Hi,

I suggested option 1 because if we OOM trying to allocate metaspace for 
MethodCounters and MethodData, then the system is pretty close to 
running into the metaspace limit and is not going to be able to make 
very much progress anyway.

If we run out of Metaspace while allocating metadata for a class, we 
throw OOM.  In the class-loading case, a user can maybe catch the OOM 
and stop loading so many classes (???).  During method compilation this 
is more surprising, so I'm going to change my vote to #2 now.
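As a rough illustration of the class-loading case (all names here are hypothetical, not from the webrev), a user could in principle catch the OOM at a class-loading boundary and back off. Whether recovery is actually safe after a metaspace OOME depends on what else failed:

```java
// Sketch only: a user-level fallback around class loading.
public class CatchClassLoadOOM {
    // Hypothetical helper: returns null instead of propagating the error.
    static Class<?> tryLoad(String name) {
        try {
            return Class.forName(name);
        } catch (OutOfMemoryError e) {
            // The message is typically "Metaspace" when class metadata
            // allocation fails; the caller could stop loading plugins, etc.
            System.err.println("giving up on " + name + ": " + e);
            return null;
        }
    }

    public static void main(String[] args) {
        Class<?> c = tryLoad("java.util.ArrayList");
        System.out.println(c != null ? "loaded " + c.getName() : "unavailable");
    }
}
```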

Coleen

On 10/21/14, 3:46 PM, Vladimir Kozlov wrote:
> Okay, I see your and others points.
>
> Albert, I will agree with your solution 2.
>
> Thanks,
> Vladimir
>
> On 10/21/14 5:35 AM, Vitaly Davidovich wrote:
>> +1, as a user.  I think an OOME from the VM should come if something
>> critical cannot proceed due to some space exhaustion.  PermGen was a
>> hotspot implementation detail (i.e., it's not part of a Java or even a
>> JVM standard, of course), so I don't think it sets any sort of
>> precedent that needs to be followed.  A loud warning to stdout/stderr
>> indicating the issue and stating that performance may degrade is
>> sufficient.
>>
>> Sent from my phone
>> On Oct 21, 2014 5:10 AM, "Albert Noll" <albert.noll at oracle.com> wrote:
>>
>>> Hi,
>>>
>>> I agree with David.
>>>
>>> Executing a Java program with 'java myProgram' does not imply that we
>>> *must* be able to execute the program using a JIT compiler.
>>> Throwing an OOME because we are unable to compile a method (due to
>>> insufficient metaspace) can cause many more problems than it solves.
>>> For example, I tried to execute the SPECjvm2008 compiler.compiler
>>> benchmark as follows:
>>>
>>> java -XX:+ExitOnMetaSpaceAllocFail -XX:MaxMetaspaceSize=16m -jar
>>> SPECjvm2008.jar -ikv -wt 10 -i 5 -it 60 compiler.compiler
>>>
>>> If ExitOnMetaSpaceAllocFail is enabled, the JVM exits if we are
>>> unable to allocate MethodCounters and MDOs. On my laptop,
>>> compiler.compiler finishes *without a performance regression* if
>>> ExitOnMetaSpaceAllocFail is false. If ExitOnMetaSpaceAllocFail is
>>> true, the run does not finish because we are out of metaspace. Such
>>> behavior is clearly a regression. Given the large number of customers
>>> we have, it seems likely that throwing an OOME will cause trouble.
>>>
>>> In addition, I think that throwing an OOME exposes implementation
>>> details of hotspot to the user that are not easy to understand. To
>>> understand the cause of the OOME (and I think it is very important
>>> for the user to understand the behavior of the JVM), the user must
>>> know that Hotspot uses JIT compilers that store method profiles in
>>> metaspace. If we decide to throw an OOME, it will be hard to debug
>>> the program, since compilers are not deterministic in when a method
>>> is compiled. I.e., the customer can get OOMEs at random
>>> (asynchronous) places. From a serviceability point of view, throwing
>>> an OOME is a poor choice.
>>>
>>> I think we could add an option that lets the user decide on the
>>> behavior (ExitOnMetaSpaceAllocFail). However, the default behavior
>>> should be according to the Spec, i.e., ExitOnMetaSpaceAllocFail
>>> should be 'false' by default. I think it is reasonable to assume that
>>> someone who knows about -XX:MaxMetaspaceSize will know about
>>> -XX:ExitOnMetaSpaceAllocFail. The argument that performance problems
>>> are 'hidden' by not throwing an OOME is valid, but can be mitigated
>>> by such a flag.
>>>
>>> Thanks,
>>> Albert
>>>
>>>
>>> On 10/21/2014 03:17 AM, David Holmes wrote:
>>>
>>>> On 21/10/2014 5:11 AM, Vladimir Kozlov wrote:
>>>>
>>>>> Inability to allocate in metaspace is different from failing to
>>>>> allocate in the code cache.
>>>>>
>>>>> The latter is JIT-specific and needs only a warning, since the
>>>>> whole Java process is (almost) not impacted.
>>>>>
>>>>> The former should produce an OOM the same way as it did when we had
>>>>> PermGen.
>>>>>
>>>>> I think solution 1 is correct.
>>>>>
>>>>
>>>> Throwing an asynchronous exception is very bad and should always be a
>>>> last resort. Such exceptions can lead to corrupt state very easily.
>>>>
>>>> IMHO problems encountered by the JIT should not manifest as
>>>> Java-level exceptions. I would also consider them (regardless of
>>>> whether they have always been there) a violation of the spec as
>>>> quoted by Albert.
>>>>
>>>> David
>>>>
>>>>> Thanks,
>>>>> Vladimir
>>>>>
>>>>> On 10/17/14 6:18 AM, Albert Noll wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> could I get reviews for this patch:
>>>>>>
>>>>>> Bug:
>>>>>> https://bugs.openjdk.java.net/browse/JDK-8037842
>>>>>>
>>>>>> Problem:
>>>>>> If the interpreter (or the compilers) fails to allocate from
>>>>>> metaspace (e.g., to allocate an MDO), the exception is cleared
>>>>>> and, as a result, not reported to the Java application. Not
>>>>>> propagating the OOME to the Java application can lead to a serious
>>>>>> performance regression, since every attempt to allocate from
>>>>>> metaspace (if we have run out of metaspace and a full GC cannot
>>>>>> free memory) triggers another full GC. Consequently, the
>>>>>> application continues to run and schedules full GCs until (1) a
>>>>>> critical allocation (one that throws an OOME) fails, or (2) the
>>>>>> application finishes normally (successfully). Note that the VM can
>>>>>> continue to execute without allocating MethodCounters or MDOs.
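To make the cost concrete, here is a toy model (plain Java, not hotspot code; every name is made up) of the allocation path described above, where each failed metadata allocation pays for a full GC before the caller silently gives up:

```java
// Toy model of the described behavior; not hotspot code.
class MetaspaceModel {
    private int used = 0;
    private final int capacity;
    int fullGCs = 0;                        // count how often we pay for a GC

    MetaspaceModel(int capacity) { this.capacity = capacity; }

    private boolean tryAllocate(int size) {
        if (used + size > capacity) return false;
        used += size;
        return true;
    }

    // Mirrors the problem: a failed allocation triggers a full GC and one
    // retry; on failure the caller today just clears the pending OOME.
    boolean allocateMetadata(int size) {
        if (tryAllocate(size)) return true;
        fullGCs++;                          // expensive, and repeated on
        return tryAllocate(size);           // every subsequent failed attempt
    }

    public static void main(String[] args) {
        MetaspaceModel ms = new MetaspaceModel(10);
        int failures = 0;
        for (int i = 0; i < 100; i++) {     // 100 profiling allocations of size 1
            if (!ms.allocateMetadata(1)) failures++;
        }
        // Once metaspace is full, *every* further attempt pays for a GC.
        System.out.println("failures=" + failures + " fullGCs=" + ms.fullGCs);
    }
}
```

In this model the first 10 allocations succeed and each of the remaining 90 triggers its own full GC, which is the regression being described.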
>>>>>>
>>>>>> Solution 1:
>>>>>> Report the OOME to the Java application. This solution avoids
>>>>>> handling the problem (running a large number of full GCs) in the
>>>>>> VM by passing the problem over to the Java application. I.e., the
>>>>>> performance regression is solved by throwing an OOME. The only way
>>>>>> to make the application run is to re-run it with a larger (yet
>>>>>> unknown) metaspace size. However, the application could have
>>>>>> continued to run (with an undefined performance drop).
>>>>>>
>>>>>> Note that the metaspace size in the failing test case is
>>>>>> artificially small (20m). Should we change the default behavior of
>>>>>> Hotspot to fix such a corner case?
>>>>>>
>>>>>> Also, I am not sure if throwing an OOME in such a case makes Hotspot
>>>>>> conform with the Java Language Specification.
>>>>>> The Specification says:
>>>>>>
>>>>>> "Asynchronous exceptions occur only as a result of:
>>>>>>
>>>>>> An internal error or resource limitation in the Java Virtual
>>>>>> Machine that prevents it from implementing the semantics of the
>>>>>> Java programming language. In this case, the asynchronous
>>>>>> exception that is thrown is an instance of a subclass of
>>>>>> VirtualMachineError"
>>>>>>
>>>>>> An OOME is an asynchronous exception. As I understand the
>>>>>> paragraph above, we are only allowed to throw an asynchronous
>>>>>> exception if we are not able to "implement the semantics of the
>>>>>> Java programming language". Not being able to run the JIT compiler
>>>>>> does not seem to constrain the semantics of the Java language.
>>>>>>
>>>>>> Solution 2:
>>>>>> If allocation from metaspace fails, we (1) report a warning to the
>>>>>> user and (2) do not try to allocate MethodCounters and MDOs (or
>>>>>> any other non-critical metaspace allocations), and thereby avoid
>>>>>> the overhead of running full GCs. As a result, the application can
>>>>>> continue to run. I have not yet worked on such a solution; I just
>>>>>> bring this up for discussion.
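A sketch of what solution 2 might look like (assumed design, expressed as a plain Java model rather than VM code; the names and the "warn once" shape are guesses, not the actual patch): after the first failed metaspace allocation, non-critical allocations are skipped outright, so no further full GCs are triggered:

```java
// Toy sketch of solution 2 (assumed design, not the actual patch).
class NonCriticalAllocations {
    private boolean exhausted = false;      // set once, after the first failure
    private int fullGCs = 0;
    private int capacity;

    NonCriticalAllocations(int capacity) { this.capacity = capacity; }

    boolean allocateProfilingData(int size) {
        if (exhausted) return false;        // skip entirely: no GC, no retry
        if (capacity >= size) { capacity -= size; return true; }
        fullGCs++;                          // one last GC attempt...
        exhausted = true;                   // ...then give up for good
        System.err.println("warning: out of metaspace; profiling disabled");
        return false;
    }

    public static void main(String[] args) {
        NonCriticalAllocations ms = new NonCriticalAllocations(10);
        int failures = 0;
        for (int i = 0; i < 100; i++) {     // same 100 allocations as before
            if (!ms.allocateProfilingData(1)) failures++;
        }
        // Only the first failure pays for a GC; the rest are cheap no-ops.
        System.out.println("failures=" + failures + " fullGCs=" + ms.fullGCs);
    }
}
```

Compared to the current behavior, the same 90 failed allocations now cost a single full GC plus one warning, and the application keeps running interpreted/unprofiled.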
>>>>>>
>>>>>> Testing:
>>>>>> JPRT
>>>>>>
>>>>>> Webrev:
>>>>>> Here is the webrev for Solution 1. Please note that I am not
>>>>>> familiar with this part of the code.
>>>>>>
>>>>>> http://cr.openjdk.java.net/~anoll/8037842/webrev.00/
>>>>>>
>>>>>> Many thanks in advance,
>>>>>> Albert
>>>>>>
>>>>>>
>>>



More information about the hotspot-dev mailing list