Performance of instanceof with interfaces is multiple times slower than with classes

Thu Aug 6 13:56:22 UTC 2020

Hi John,

I haven't heard back from you after my last mail. Do you have any further thoughts on the topic?

I'd appreciate any input.

Cheers,
Christoph

==== ORIGINAL MESSAGE ====

Hi John,

thanks for your answer. I think I need to elaborate a bit more on the background of this question.

I was looking at Set.copyOf() and noticed that it copies the collection into a HashSet if it isn’t an immutable collection.
Yet, this doesn’t seem to be necessary in case the passed collection is already a Set, which I think is not that uncommon.
So I changed it slightly to the following:

    static <E> Set<E> copyOf(Collection<? extends E> coll) {
        if (coll instanceof ImmutableCollections.AbstractImmutableSet) {
            return (Set<E>)coll;
        } else if (coll instanceof Set) { // that is the new bit
            return (Set<E>)Set.of(coll.toArray());
        } else {
            return (Set<E>)Set.of(new HashSet<>(coll).toArray());
        }
    }

Doing that showed the wanted performance boost for Sets. When passed a Collections.emptySet() this seems to make it allocation free even.
But for other inputs, like a List, it degraded drastically.

I tried to benchmark this change with the following benchmark:

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class MyBenchmark {

	@State(Scope.Thread)
	public static class BenchmarkState {
		private Collection<String> emptySet = Collections.emptySet();
		private Collection<String> emptyList = Collections.emptyList();
	}
	@Benchmark
	public Set<String> testCopyEmptySet(BenchmarkState state) {
		return Set.copyOf(state.emptySet);
	}

	@Benchmark
	public Set<String> testCopyEmptyList(BenchmarkState state) {
		return Set.copyOf(state.emptyList);
	}
}

Which showed the following results

Before
Benchmark                                                        Mode  Cnt     Score     Error   Units
MyBenchmark.testCopyEmptyList                                   avgt   10     5,436 ±   0,286   ns/op
MyBenchmark.testCopyEmptyList:·gc.alloc.rate.norm               avgt   10    48,002 ±   0,001    B/op
MyBenchmark.testCopyEmptySet                                    avgt   10     5,319 ±   0,472   ns/op
MyBenchmark.testCopyEmptySet:·gc.alloc.rate.norm                avgt   10    48,002 ±   0,001    B/op

After
Benchmark                                                       Mode  Cnt     Score     Error   Units
MyBenchmark.testCopyEmptyList                                   avgt   10    24,835 ±   0,621   ns/op
MyBenchmark.testCopyEmptyList:·gc.alloc.rate.norm               avgt   10    48,004 ±   0,001    B/op
MyBenchmark.testCopyEmptySet                                    avgt   10     2,494 ±   0,179   ns/op
MyBenchmark.testCopyEmptySet:·gc.alloc.rate.norm                avgt   10    ≈ 10⁻⁶              B/op

I was surprised to see such a big difference, so I played around with it.

My first suspect was inlining problems with the new solution, but that didn't turn out to be true.
I ended up pinning it down to the instanceof check, because when I changed the additional instanceof check to AbstractSet for a test I saw no regression for the list case and still saw my improvement for the empty set.

All I was trying to do with the provided benchmark in my first mail was to isolate the "problem" and to showcase that this seems to be independent from hierarchy, which exists in the collection API test above. In retrospect, this didn't help here, which I'm sorry about.

> You ask “point me to the code” which suggests you haven’t looked at the code yet.  When you do (and I hope you do!) you will find that there are many, many algorithms and decisions in the JVM that affect the treatment of runtime > type tests.  You could start by grepping around for “subtype” (case insensitive) in the *.[ch]pp and *.ad files under src/hotspot.

I have in fact looked at the code already before the first mail and found emit_typecheck_helper in src/hotspot/cpu/x86/c1_LIRAssembler_x86.cpp, which I think is involved - please correct me if I'm wrong.
And MacroAssembler::check_klass_subtype_fast_path in src/hotspot/cpu/x86/macroAssembler_x86.cpp etc. Depending on the architecture (x86, aarch etc.) of course.

Like you said, there are many things involved and I was hoping for a good starting point really from an experienced developer or a rough explanation what might be involved in the example. I understand now that there doesn't seem to be a "simple" explanation. Unfortunately, the C++ side of things is not as good documented as the Java side of things and that makes it relatively complicated for people like me who aren't familiar with the code to follow the flows sometimes. I can assure you: I do want to look at code, hence the - admittedly clumsy - question.

I didn't do a great job of explaining that I looked upfront before asking, so that's on me.

Nonetheless, I hope this sheds some light on the background of this and I hope the collection case earlier justifies as a more real-life example.

Cheers,
Christoph

Von: John Rose <john.r.rose at oracle.com>
Datum: Samstag, 18. Juli 2020 um 01:13
An: Christoph Dreis <christoph.dreis at freenet.de>
Cc: hotspot-runtime-dev <hotspot-runtime-dev at openjdk.java.net>
Betreff: Re: Performance of instanceof with interfaces is multiple times slower than with classes

On Jul 15, 2020, at 2:08 AM, Christoph Dreis <mailto:christoph.dreis at freenet.de> wrote:

Could you enlighten me what the cause for this is and maybe point me to the code where this is done?

A microbenchmark like that doesn’t demonstrate much.

The JVM optimizations which are designed for real applications may or may not apply to this code.  They will certainly apply differently in real-life application execution.  In real life, class hierarchies are more complex, and so the JVM expects to do more work telling types apart; at that point differences between single inheritance and multiple inheritance requires differing algorithms with different complexities and costs.  Also, in real life, inputs to such hot paths are not just all one type (as in your micro), which is another reason you get into a different set of costs and complexities.

If this were related to a real-world application, or a realistic benchmark, I might get assembly code for the hot paths in the two methods, and see what the CPU is doing that makes a difference in speed.  Then I might meditate on what decisions the JVM made to choose that code, and see if there is an improvement possible.

You ask “point me to the code” which suggests you haven’t looked at the code yet.  When you do (and I hope you do!) you will find that there are many, many algorithms and decisions in the JVM that affect the treatment of runtime type tests.  You could start by grepping around for “subtype” (case insensitive) in the *.[ch]pp and *.ad files under src/hotspot.

Is this maybe even a bug/regression? Can we maybe do something to improve the interface case?

I will stick my neck out to say that microbenchmarks alone by definition never demonstrate performance regressions or bugs.

Yes, we can always do something to improve it.  Improving this micro could involve attempting to treat trivial interface hierarchies like trivial class hierarchies:  But who would this benefit?  Our investment of effort is surely driven by a hope to spend limited effort to get the most benefit on real applications.

— John