bottleneck by java.lang.Class.getAnnotations() - rebased patch
Peter Levart
peter.levart at gmail.com
Fri Nov 30 18:36:09 UTC 2012
Hi Alan, David, Alexander and others,
I noticed the push of Joe Darcy's code for repeating annotations, which
makes my proposed patch not entirely compatible with the new code. So I
rebased the patch onto the current code in the jdk8/tl/jdk repository. I
had to revert my last improvement (the replacement of a Map with an
array for Class's declaredAnnotations), since repeating annotations need
per-key access to declared annotations.
I also simplified the way the caching of annotations on
Field/Method/Constructor is delegated to a root instance.
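To illustrate the delegation idea (a rough sketch with illustrative names, not the actual patch code): each copied reflection object forwards annotation lookups to the single "root" object held by its Class, so the cache is populated once per declaration rather than once per returned copy.

```java
import java.lang.annotation.Annotation;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of root-delegated annotation caching; in the real
// patch the "root" is the Field/Method/Constructor held inside Class.
class AccessorSketch {
    final AccessorSketch root;   // null if this instance is the root
    private volatile Map<Class<? extends Annotation>, Annotation> declaredAnnotations;

    AccessorSketch(AccessorSketch root) { this.root = root; }

    Map<Class<? extends Annotation>, Annotation> declaredAnnotations() {
        if (root != null) return root.declaredAnnotations(); // copies delegate
        Map<Class<? extends Annotation>, Annotation> ann = declaredAnnotations;
        if (ann == null) {
            ann = parseAnnotations();      // parse once, cache on the root
            declaredAnnotations = ann;
        }
        return ann;
    }

    // stands in for the actual parsing of annotation bytes
    Map<Class<? extends Annotation>, Annotation> parseAnnotations() {
        return new HashMap<>();
    }

    public static void main(String[] args) {
        AccessorSketch root = new AccessorSketch(null);
        AccessorSketch copy = new AccessorSketch(root);
        // both see the same cached map instance
        System.out.println(copy.declaredAnnotations() == root.declaredAnnotations());
    }
}
```
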
Here's the webrev of this patch:
http://dl.dropbox.com/u/101777488/jdk8-tl/JEP-149/webrev.01/index.html
And here are the results of the benchmarks:
https://raw.github.com/plevart/jdk8-tl/JEP-149/test/benchmark_results_i7-2600K.txt
It is interesting to compare these results with the results I got on the
codebase prior to the repeating annotations push:
https://raw.github.com/plevart/jdk8-hacks/annotation-arrays/benchmark_results_i7-2600K.txt
Test2 results for the unpatched code (first run in each file) show that
this test, running on the repeating-annotations code, is almost 4x
slower in the single-threaded case and almost 20x slower in the
heavily-threaded (128 threads on a 4-core i7) case than the
pre-repeating-annotations version. Let me restate: this is comparing
unpatched code on both sides. I must also mention that the
pre-repeating-annotations results were obtained on a JVM built from the
jdk8/jdk8 forest:
openjdk version "1.8.0-internal"
OpenJDK Runtime Environment (build
1.8.0-internal-peter_2012_11_21_08_55-b00)
OpenJDK 64-Bit Server VM (build 25.0-b09, mixed mode)
while the repeating-annotations results were obtained on today's build
of the jdk8/tl forest:
openjdk version "1.8.0-internal"
OpenJDK Runtime Environment (build
1.8.0-internal-peter_2012_11_30_14_19-b00)
OpenJDK 64-Bit Server VM (build 25.0-b09, mixed mode)
Test2 results for the patched code (second run in each file) show that
this test, running on the repeating-annotations code, is about 2x slower
in the single-threaded case and also 2x slower in the heavily-threaded
(128 threads on a 4-core i7) case.
I haven't yet established why this is so. It might be that code
differences (for example, another method call in the code path with
repeating annotations) cause different optimizations by the JIT. It is
also interesting to note that with repeating annotations, Test2 is even
less scalable for the unpatched code, while the patched code shows the
same scalability.
Test3 shows the same speed for the unpatched variants (pre-repeating
vs. repeating annotations), while the patched code shows a 2x increase
in speed. The scalability of Test3 is about the same, pre-repeating
vs. repeating annotations.
Test1 is not directly comparable, since it only demonstrates the
cache-hit improvement from keeping the annotation caches on the root
instances of Field/Constructor/Method instead of on the copied instances.
So, what do you think? What kinds of tests should I prepare in addition
to those 3 so that the patch might get consideration?
Regards, Peter
P.S.
The source for tests is here:
https://raw.github.com/plevart/jdk8-tl/JEP-149/test/src/test/ReflectionTest.java
On 11/23/2012 06:09 PM, Peter Levart wrote:
> Hi again,
>
> I have found an inconsistency in the last patch I posted. Specifically,
> Method.getAnnotation(Class) and Constructor.getAnnotation(Class) did
> not delegate to the root instance as .getAnnotations() did. I corrected
> that in a new revision of the patch:
>
> http://dl.dropbox.com/u/101777488/jdk8-hacks/JEP-149/webrev.03/index.html
>
> I also modified the benchmarks somewhat, so that they now exercise the
> .getAnnotation(Class) methods on various objects instead of the bulk
> getAnnotations() methods:
>
> https://raw.github.com/plevart/jdk8-hacks/annotation-arrays/src/test/ReflectionTest.java
>
> I got results for 2 different machines:
>
> Desktop PC (one socket i7-2600K, 4 cores), Linux 3.6 64bit:
> https://raw.github.com/plevart/jdk8-hacks/annotation-arrays/benchmark_results_i7-2600K.txt
>
> and:
>
> Sun Blade T6320 (one socket T2, 8 cores), Solaris 10 64bit:
> https://raw.github.com/plevart/jdk8-hacks/annotation-arrays/benchmark_results_T2.txt
>
> On the T2, the concurrency bottleneck is even more visible.
>
>
> Regards, Peter
>
>
> On 11/23/2012 12:08 PM, Peter Levart wrote:
>> Hi David, Alan, Alexander and others,
>>
>> Here's a refinement of a patch I proposed in this thread a couple of
>> weeks ago:
>>
>> http://dl.dropbox.com/u/101777488/jdk8-hacks/JEP-149/webrev.02/index.html
>>
>> The changed sources can be browsed here:
>>
>> https://github.com/plevart/jdk8-hacks/tree/annotation-arrays/src/java/lang
>>
>> The main improvement is the replacement of the field:
>>
>> Map<Class<? extends Annotation>, Annotation> declaredAnnotations;
>>
>> with plain:
>>
>> Annotation[] declaredAnnotations;
>>
>> ...in the Class.VolatileData structure. It can be seen from the
>> public API that caching a HashMap of declared annotations is never
>> needed, since the only public method (getDeclaredAnnotations())
>> returns an array. It is therefore much more efficient to cache the
>> array and simply clone it on every public method invocation.
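>> A minimal sketch of that array-plus-clone idea (illustrative names,
>> not the patch code itself): cache the parsed array once and return a
>> defensive clone from the public method.

```java
import java.lang.annotation.Annotation;

// Hypothetical sketch: cache the bare array and hand out a clone on each
// call, instead of caching a Map the public API never exposes.
class DeclaredAnnotationsSketch {
    private volatile Annotation[] declaredAnnotations;   // cached once

    Annotation[] getDeclaredAnnotations() {
        Annotation[] a = declaredAnnotations;
        if (a == null) {
            a = parseDeclaredAnnotations();   // parse once
            declaredAnnotations = a;
        }
        return a.clone();   // defensive copy on every public call
    }

    // stands in for the actual parsing of annotation bytes
    Annotation[] parseDeclaredAnnotations() { return new Annotation[0]; }

    public static void main(String[] args) {
        DeclaredAnnotationsSketch s = new DeclaredAnnotationsSketch();
        // distinct clones back the same cached array
        System.out.println(s.getDeclaredAnnotations() != s.getDeclaredAnnotations());
    }
}
```
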
>>
>> Unfortunately, the same does not hold for the "annotations" field:
>> the public method getAnnotation(Class<? extends Annotation>) achieves
>> O(1) time complexity precisely because a HashMap is kept in the cache.
>>
>> I did an experiment with keeping the annotations (declared +
>> inherited) in an array sorted by the hashCode of the annotationType;
>> the lookup was then a binary search by hashCode of the annotationType,
>> followed by a linear search of the equal-hashCode neighborhood. This
>> has O(log N) time complexity, which is not so bad, but the constant
>> factor was pretty high, because invoking the Annotation.annotationType()
>> method goes through a dynamic Proxy to an InvocationHandler...
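>> For the record, here is a runnable sketch of that experiment (the
>> helper names are mine, not the patch's): annotations sorted by
>> annotationType().hashCode(), then a binary search plus a linear scan
>> of the equal-hashCode neighborhood.

```java
import java.lang.annotation.Annotation;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.util.Arrays;
import java.util.Comparator;

class SortedAnnotationLookup {
    @Retention(RetentionPolicy.RUNTIME) @interface A {}
    @Retention(RetentionPolicy.RUNTIME) @interface B {}

    @A @B static class Target {}

    // binary search by hashCode, then scan the equal-hash neighborhood
    static Annotation find(Annotation[] sorted, Class<? extends Annotation> type) {
        int h = type.hashCode();
        int lo = 0, hi = sorted.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int mh = sorted[mid].annotationType().hashCode();
            if (mh < h) lo = mid + 1;
            else if (mh > h) hi = mid - 1;
            else {
                for (int i = mid; i >= 0 && sorted[i].annotationType().hashCode() == h; i--)
                    if (sorted[i].annotationType() == type) return sorted[i];
                for (int i = mid + 1; i < sorted.length && sorted[i].annotationType().hashCode() == h; i++)
                    if (sorted[i].annotationType() == type) return sorted[i];
                return null;   // hash present, type absent
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Annotation[] anns = Target.class.getDeclaredAnnotations();
        Arrays.sort(anns, Comparator.comparingInt((Annotation a) -> a.annotationType().hashCode()));
        System.out.println(find(anns, A.class).annotationType().getSimpleName());
    }
}
```
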
>>
>> Speaking of that, I think there's plenty of opportunity for
>> JEP-149-related optimization in the field of annotations. For example,
>> generating a special class for every annotationType that would hold
>> its own annotation data and answer method requests directly would
>> give a large time and space gain.
>>
>> It can also be noticed that there's a discrepancy in how lookup is
>> implemented, for example:
>> - lookup of an Annotation by annotationType on a class is a
>> HashMap.get(), which has O(1) time complexity
>> - lookup of a Field by fieldName on a class is a linear search in
>> an array, which has O(N) time complexity
>> - similar for Method by methodName and argumentTypes: O(N)
>>
>> The time complexity of Field and Method lookup could be improved by
>> keeping the cached arrays pre-sorted by the name of the Field or
>> Method and employing a binary search to locate (or almost locate) the
>> object. Since the public API explicitly states that the methods
>> returning arrays of Fields and Methods do not guarantee any order,
>> pre-sorting the arrays would not break the contract.
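>> As a sketch of the pre-sorted-array idea (a hypothetical helper, not
>> JDK code), lookup by name then becomes a plain binary search:

```java
import java.lang.reflect.Field;
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch: cache the Field[] pre-sorted by name, then locate
// a field with a binary search instead of a linear scan.
class SortedFieldLookup {
    static class Sample { public int alpha; public int beta; public int gamma; }

    static Field findField(Field[] sortedByName, String name) {
        int lo = 0, hi = sortedByName.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int c = sortedByName[mid].getName().compareTo(name);
            if (c < 0) lo = mid + 1;
            else if (c > 0) hi = mid - 1;
            else return sortedByName[mid];
        }
        return null;   // not found
    }

    public static void main(String[] args) {
        Field[] fields = Sample.class.getFields();
        Arrays.sort(fields, Comparator.comparing(Field::getName));
        System.out.println(findField(fields, "beta").getName());   // beta
    }
}
```
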
>>
>> Regards, Peter
>>
>> On 11/07/2012 11:39 AM, Peter Levart wrote:
>>> On 11/07/2012 03:10 AM, David Holmes wrote:
>>>> Hi Peter,
>>>>
>>>> The movement of the reflection caches to a helper object is exactly
>>>> what I had previously proposed here (some differences in the
>>>> details of course):
>>>>
>>>> http://cr.openjdk.java.net/~dholmes/JEP-149/webrev/
>>>>
>>>> and discussed here:
>>>>
>>>> http://mail.openjdk.java.net/pipermail/core-libs-dev/2012-April/009749.html
>>>>
>>>>
>>>> but this did not touch the annotations fields.
>>>>
>>>> David
>>>
>>> Hi David,
>>>
>>> Thanks for the pointer. There is a discussion between Brian and you
>>> (to quote some of it):
>>>
>>> On 5/04/2012 1:28 PM, Brian Goetz wrote:
>>> > Reducing the number of SoftReferences in ReflectionHelper also seems an
>>> > attractive target for memory reduction. Rather than eight soft
>>> > references (eight extra objects), maintaining a SoftRef to the entire
>>> > RH, or at least to the part of the RH that is currently SR'ed if the two
>>> > non-SR'ed fields can't be recomputed, would save you a whole pile of
>>> > objects per class (and might also reduce pressure on GC, having 8x fewer
>>> > SRs to process.)
>>> I'd have to consider the intended semantics of these soft references
>>> before considering any change here. It would be hard to predict how this
>>> might impact runtime performance if we have coarser-grained soft
>>> references. The current changes should be semantically null.
>>>
>>> > Finally, you may be able to save an extra field per Class by storing the
>>> > ReflectionHelper in a ClassValue on Java SE 8, rather than a field.
>>> ClassValue is something I'm keeping an eye on, but an entry in
>>> ClassValue is going to consume more dynamic memory than a single direct
>>> field.
>>>
>>> Thanks,
>>> David
>>>
>>>
>>> ...the 8 SoftReferences refer to arrays which are never hard
>>> referenced by the outside world (arrays are always defensively
>>> copied), so it's reasonable to expect that all SoftReferences would
>>> be cleared at the same time anyway. And if 8 SoftReferences are
>>> replaced with 1, the worst case scenario (to quote Hinkmond Wong):
>>>
>>> Hi Brian,
>>>
>>> One of the issues we have in the Java Embedded group (as David points
>>> out in his summary), is that while on paper the theoretical max savings
>>> seems great (as you point out also), in practice as David points out in
>>> his note, this might be a wash if there are a lot more reflection using
>>> classes vs. non-reflection using classes in "typical" real-world
>>> applications, not the low or zero reflection using class ratio that
>>> happens in the theoretical "best case".
>>>
>>> So, a question comes up if we should judge the merit of this change on
>>> the theoretical "best case" scenario, or should we judge it on
>>> real-world applicability to "typical" apps (such as a finite set of
>>> customer surveyed embedded apps that we feel represent a real-world
>>> scenario).
>>>
>>>
>>> Thanks,
>>> Hinkmond
>>>
>>>
>>> ...actually becomes even more favourable. We remove a huge overhead
>>> (each SoftReference is 4 OOPs and 1 long), and if the single
>>> SoftReference is ever cleared, more memory is released: the whole
>>> structure (ReflectionHelper / VolatileData).
>>>
>>> Other differences in details between your proposal and mine:
>>>
>>> In your proposal, the method ReflectionHelper rh() is equivalent to
>>> my VolatileData volatileData(): it lazily constructs the structure
>>> and returns it. My implementation also incorporates the logic of
>>> clearCachesOnClassRedefinition() by returning and installing a new
>>> instance of the structure when a redefinition is detected. This has
>>> a profound impact on the correctness of the cached data in the face
>>> of the races that can occur.
>>>
>>> In your proposal, even if the VM could atomically publish changes to
>>> the raw reflection data and the classRedefinitionCount at the same
>>> time (we hope that at least the order of publishing is such that
>>> classRedefinitionCount is updated last), it can theoretically happen
>>> that 2 or more redefinitions of the same class occur in close
>>> proximity:
>>>
>>> VM thread: redefines the class to version=1
>>> thread 1: clears the cache and takes version=1 raw data and
>>> computes derived data but gets pre-empted before installing it
>>> VM thread: redefines the class to version=2
>>> thread 2: clears the cache and takes version=2 raw data and
>>> computes derived data and installs it
>>> thread 1: ...gets back and installs version=1 derived data over
>>> version=2 data
>>>
>>> ...if there are no more class redefinitions, the stale version of
>>> derived data can persist indefinitely.
>>>
>>> In my proposal, each thread will get its own copy of the structure
>>> in the above scenario and install the derived data into it. It can
>>> happen that a particular instance of the structure does not
>>> represent a "snapshot" view of the world, but that is not important,
>>> since that particular inconsistent instance is only used for the
>>> in-flight call and only in the part that is consistent. Other
>>> callers will get a fresh instance.
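>>> The core of that scheme can be sketched roughly like this
>>> (simplified: the SoftReference indirection is omitted, and the names
>>> are illustrative): the redefinition count travels inside the cached
>>> structure itself, so a stale snapshot can never be mistaken for a
>>> current one.

```java
import java.util.concurrent.atomic.AtomicReference;

// Simplified sketch of the version-checked reflection cache.
class ClassDataSketch {
    static class VolatileData {
        final int redefinedCount;         // version this snapshot belongs to
        volatile String[] derivedData;    // stands in for cached reflection data
        VolatileData(int redefinedCount) { this.redefinedCount = redefinedCount; }
    }

    volatile int classRedefinedCount;     // bumped by the (simulated) VM
    private final AtomicReference<VolatileData> cache = new AtomicReference<>();

    VolatileData volatileData() {
        int version = classRedefinedCount;
        VolatileData d = cache.get();
        if (d != null && d.redefinedCount == version) return d;
        VolatileData fresh = new VolatileData(version);
        if (cache.compareAndSet(d, fresh)) return fresh;
        d = cache.get();
        // another thread won the race: use its snapshot if it matches our
        // version; otherwise keep our private, never-installed copy, which
        // is still internally consistent for the in-flight call
        return (d != null && d.redefinedCount == version) ? d : fresh;
    }

    public static void main(String[] args) {
        ClassDataSketch c = new ClassDataSketch();
        VolatileData d1 = c.volatileData();
        System.out.println(d1 == c.volatileData());   // cached instance is reused
        c.classRedefinedCount++;                      // simulate RedefineClasses()
        System.out.println(d1 == c.volatileData());   // old snapshot is discarded
    }
}
```
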
>>>
>>> There is also one thing I overlooked and you haven't: the
>>> cachedConstructor and newInstanceCallerCache fields.
>>>
>>> I'll have to look at how to incorporate them into my scheme. They
>>> are currently neither SoftReferenced nor cleared at class
>>> redefinition. As the cachedConstructor is only used to implement the
>>> .newInstance() method, I wonder if it is safe not to clear it when
>>> the class is redefined. Are old versions of Constructors still valid
>>> for invocation in a redefined class? I guess they must be, since user
>>> code is free to cache its own copies, and class redefinition
>>> should not prevent invoking them...
>>>
>>> The cachedConstructor/newInstanceCallerCache pair exists solely to
>>> optimize the .newInstance() method, which suggests that calling this
>>> method is a more common use-case than the others. Perhaps leaving
>>> this pair of fields out of the game is the better approach,
>>> space-saving wise.
>>>
>>> Regards, Peter
>>>
>>>>
>>>> On 6/11/2012 11:12 PM, Peter Levart wrote:
>>>>> On 11/05/2012 06:23 AM, David Holmes wrote:
>>>>>> Hi Peter,
>>>>>>
>>>>>> Moving the annotations fields into a helper object would tie in with
>>>>>> the Class-instance size reduction effort that was investigated as
>>>>>> part
>>>>>> of "JEP 149: Reduce Core-Library Memory Usage":
>>>>>>
>>>>>> http://openjdk.java.net/jeps/149
>>>>>>
>>>>>> The investigations there to date only looked at relocating the
>>>>>> reflection related fields, though the JEP mentions the
>>>>>> annotations as
>>>>>> well.
>>>>>>
>>>>>> Any such effort requires extensive benchmarking and performance
>>>>>> analysis before being accepted though.
>>>>>>
>>>>>> David
>>>>>> -----
>>>>>>
>>>>>
>>>>> On 11/05/2012 10:25 AM, Alan Bateman wrote:
>>>>>> Here's a good starting place on the interaction of runtime visible
>>>>>> attributes and RedefineClasses/RetransformClasses:
>>>>>>
>>>>>> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=5002251
>>>>>>
>>>>>> -Alan.
>>>>>
>>>>> Hi all,
>>>>>
>>>>> Presented here is a patch mainly against java.lang.Class and also
>>>>> against java.lang.reflect.[Field,Method,Constructor,Executable]
>>>>> classes.
>>>>>
>>>>> Currently java.lang.Class uses the following fields to maintain
>>>>> caches
>>>>> of reflection data that are invalidated as a result of class or
>>>>> superclass redefinition/re-transformation:
>>>>>
>>>>> private volatile transient SoftReference<Field[]> declaredFields;
>>>>> private volatile transient SoftReference<Field[]> publicFields;
>>>>> private volatile transient SoftReference<Method[]> declaredMethods;
>>>>> private volatile transient SoftReference<Method[]> publicMethods;
>>>>> private volatile transient SoftReference<Constructor<T>[]> declaredConstructors;
>>>>> private volatile transient SoftReference<Constructor<T>[]> publicConstructors;
>>>>> private volatile transient SoftReference<Field[]> declaredPublicFields;
>>>>> private volatile transient SoftReference<Method[]> declaredPublicMethods;
>>>>>
>>>>> // Value of classRedefinedCount when we last cleared the cached values
>>>>> // that are sensitive to class redefinition.
>>>>> private volatile transient int lastRedefinedCount = 0;
>>>>>
>>>>> // Annotations cache
>>>>> private transient Map<Class<? extends Annotation>, Annotation> annotations;
>>>>> private transient Map<Class<? extends Annotation>, Annotation> declaredAnnotations;
>>>>>
>>>>> If I understand Alan's references correctly, the current VM can
>>>>> redefine a class in a way that changes method bodies. New methods
>>>>> can also be added, and the set of annotations can be altered.
>>>>> Future improvements could allow even more.
>>>>>
>>>>> Because annotations are cached on Field/Method/Constructor instances,
>>>>> all the above fields must be invalidated when the class or
>>>>> superclass is
>>>>> redefined.
>>>>>
>>>>> It can also be observed that the Field/Method/Constructor caches
>>>>> are maintained using SoftReferences, but the annotations are hard
>>>>> references. I don't know if this is intentional. I believe
>>>>> annotations could also be SoftReferenced, so that they get cleared
>>>>> under memory pressure. Many applications retrieve annotations only
>>>>> in the early stages of their life-cycle and then either cache them
>>>>> themselves or forget about them.
>>>>>
>>>>> So I designed the patch to equalize this. If this is undesirable, the
>>>>> patch could be modified to make a distinction again.
>>>>>
>>>>> The patch replaces the above-mentioned java.lang.Class fields with a
>>>>> single field:
>>>>>
>>>>> private volatile transient SoftReference<VolatileData<T>> volatileData;
>>>>>
>>>>> ...which is a SoftReference to the following structure:
>>>>>
>>>>> // volatile data that might get invalid when JVM TI RedefineClasses() is called
>>>>> static class VolatileData<T> {
>>>>>     volatile Field[] declaredFields;
>>>>>     volatile Field[] publicFields;
>>>>>     volatile Method[] declaredMethods;
>>>>>     volatile Method[] publicMethods;
>>>>>     volatile Constructor<T>[] declaredConstructors;
>>>>>     volatile Constructor<T>[] publicConstructors;
>>>>>     // Intermediate results for getFields and getMethods
>>>>>     volatile Field[] declaredPublicFields;
>>>>>     volatile Method[] declaredPublicMethods;
>>>>>     // Annotations
>>>>>     volatile Map<Class<? extends Annotation>, Annotation> annotations;
>>>>>     volatile Map<Class<? extends Annotation>, Annotation> declaredAnnotations;
>>>>>     // Value of classRedefinedCount when we created this VolatileData instance
>>>>>     final int redefinedCount;
>>>>> }
>>>>> So let's look at the static overhead differences, using 64-bit
>>>>> addressing (useful load only; arrays and Maps are not counted,
>>>>> since the patched code uses the same number of the same types of
>>>>> each).
>>>>>
>>>>> * Fresh java.lang.Class instance:
>>>>>
>>>>> current JDK8 code:
>>>>>
>>>>> 10 OOPs + 1 int = 10*8+4 = 84 bytes in 1 instance
>>>>>
>>>>> vs. patched code :
>>>>>
>>>>> 1 OOP = 8 bytes in 1 instance
>>>>>
>>>>> * Fully loaded java.lang.Class (Fields, Methods, Constructors,
>>>>> annotations):
>>>>>
>>>>> current JDK8 code:
>>>>>
>>>>> 10 OOPs + 1 int = 84 bytes
>>>>> 8 SoftReference instances = 8*(header + 4 OOPs + 1 long) = 8*(24+32+8) = 8*64 = 512 bytes
>>>>> total: 84+512 = 596 bytes in 9 instances
>>>>>
>>>>> vs. patched code :
>>>>>
>>>>> 1 OOP = 8 bytes
>>>>> 1 SoftReference = 64 bytes
>>>>> 1 VolatileData = header + 10 OOPs + 1 int = 24+84 = 108 bytes
>>>>> total: 8+64+108 = 180 bytes in 3 instances
>>>>>
>>>>> So we have 84 vs. 8 (empty) or 596 vs. 180 (fully loaded) byte
>>>>> overheads and
>>>>> 1 vs. 1 (empty) or 9 vs. 3 (fully loaded) instance overheads
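>>>>> The byte arithmetic above can be double-checked mechanically
>>>>> (assuming 8-byte OOPs and 24-byte object headers, as in the
>>>>> figures quoted):

```java
// Recomputes the overhead figures quoted above (64-bit, no compressed OOPs).
class OverheadMath {
    public static void main(String[] args) {
        int oop = 8, header = 24;
        int freshCurrent = 10 * oop + 4;                    // 10 OOPs + 1 int
        int softRef = header + 4 * oop + 8;                 // per SoftReference
        int loadedCurrent = freshCurrent + 8 * softRef;     // Class fields + 8 SoftRefs
        int volatileData = header + 10 * oop + 4;           // the new structure
        int loadedPatched = oop + softRef + volatileData;   // 1 OOP + 1 SoftRef + VolatileData
        System.out.println(freshCurrent + " " + loadedCurrent + " " + loadedPatched);
    }
}
```
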
>>>>>
>>>>> Other than that, the patch also removes the synchronized blocks
>>>>> used for lazy initialization of annotations in Class, Field,
>>>>> Method and Constructor and replaces them with volatile fields. In
>>>>> the case of Class.volatileData, the field is initialized using a
>>>>> CAS, so there is no race that could install an already stale
>>>>> instance over a more recent one. Even if such a race occurred, it
>>>>> would quickly be corrected at the next call to any retrieval
>>>>> method, because redefinedCount is now an integral part of the
>>>>> cached structure rather than an individual volatile field.
>>>>>
>>>>> There is also a change in how annotations are cached in Field,
>>>>> Method and Constructor. Originally they are cached in each copy of
>>>>> the Field/Method/Constructor that is returned to the outside world
>>>>> on each invocation of Class.getFields() etc. Such caching is not
>>>>> very effective if the annotations are only retrieved once per
>>>>> instance. The patch changes this and delegates caching to the
>>>>> "root" instance held inside Class, so caching becomes more
>>>>> effective under certain usage patterns. There is now a possible
>>>>> hot-spot on the "root" instance, but that does not seem to be a
>>>>> bottleneck, since the fast path involves no blocking
>>>>> synchronization (just a volatile read). The effects of this change
>>>>> are clearly visible in one of the benchmarks.
>>>>>
>>>>> I have tried to create 3 micro benchmarks which exercise
>>>>> concurrent load
>>>>> on 3 Class instances.
>>>>>
>>>>> Here's the benchmark code:
>>>>>
>>>>> https://raw.github.com/plevart/jdk8-hacks/master/src/test/ReflectionTest.java
>>>>>
>>>>>
>>>>> And here are the results when run on an Intel i7 CPU (4 cores, 2
>>>>> threads/core) Linux machine using -Xmx4G VM option:
>>>>>
>>>>> https://raw.github.com/plevart/jdk8-hacks/master/benchmark_results.txt
>>>>>
>>>>>
>>>>>
>>>>> The huge difference in Test1 results is a direct consequence of
>>>>> the patched code delegating the caching of annotations in
>>>>> Field/Method/Constructor to the "root" instance.
>>>>>
>>>>> Test2 results show no noticeable difference between the original
>>>>> and patched code. This, I believe, is the most common usage of the
>>>>> API, so the extra level of indirection does not appear to
>>>>> introduce any noticeable performance overhead.
>>>>>
>>>>> Test3, on the other hand, shows the synchronization overhead of
>>>>> the current jdk8 code in comparison with the non-blocking
>>>>> synchronization in the patched code.
>>>>>
>>>>> JEP 149 also mentions testing with SPECjbb2005 and SPECjvm98, but
>>>>> that is beyond my means.
>>>>>
>>>>> The patch against jdk8/jdk8/jdk hg repository is here:
>>>>>
>>>>> https://raw.github.com/plevart/jdk8-hacks/master/volatile_class_data_caching.patch
>>>>>
>>>>>
>>>>> You can also browse the changed sources:
>>>>>
>>>>> https://github.com/plevart/jdk8-hacks
>>>>>
>>>>>
>>>>> Regards, Peter
>>>>>
>>
>