Dismal performance of String.intern()
Steven Schlansker
stevenschlansker at gmail.com
Mon Aug 5 05:21:10 UTC 2013
On Tue, 11 Jun 2013 10:28:14 +0100
Alan Bateman <Alan.Bateman at oracle.com> wrote:
> On 10/06/2013 19:06, Steven Schlansker wrote:
> > Hi core-libs-dev,
> >
> > While doing performance profiling of my application, I discovered
> > that nearly 50% of the time deserializing JSON was spent within
> > String.intern(). I understand that interning Strings is generally
> > not the best approach, but I think I have a decent use
> > case -- the value of a certain field is one of a very limited
> > number of valid values (that are not known at compile time, so I
> > cannot use an Enum), and is repeated many millions of times in the
> > JSON stream.
> >
> Have you run with -XX:+PrintStringTableStatistics? Might be
> interesting if you can share the output (it is printed just before
> the VM terminates).
>
> There are also tuning knobs such as StringTableSize; it would be
> interesting to know if you've experimented with them.
>
> -Alan.
Hi everyone,
Thanks again for your useful advice. I definitely misjudged the
difficulty and complexity of working with methods that directly bridge
the Java <-> C++ "gap"! As such, it took me way longer to get to this
than I expected...
That said, I think I have very good results to report. OpenJDK 8 is
already *significantly* better here than OpenJDK 7 was, but it can
still be improved.
Here is the patch I have at the moment:
https://gist.github.com/stevenschlansker/6153643
I used oprofile with oprofile-jit to identify the hot spots in the
benchmark code as java_lang_String::equals() and
java_lang_String::as_unicode_string().
Both of these methods have hand-written loops that copy or compare
jchar arrays.
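Schematically, the comparison loop has this shape (a paraphrased
stand-in to illustrate the pattern, not the actual HotSpot code; in the
real thing the characters come back through an accessor on the backing
char array):

// Paraphrased stand-in for the per-character loop in
// java_lang_String::equals(): each character of the String's backing
// array is read through an accessor call and compared to the probe.
typedef unsigned short jchar;

struct CharArraySketch {                 // stand-in for the backing char array
  const jchar* data;
  jchar char_at(int i) const { return data[i]; }
};

static bool equals_chars(const CharArraySketch& value, int offset,
                         const jchar* chars, int len) {
  for (int i = 0; i < len; i++) {
    if (value.char_at(i + offset) != chars[i]) {  // one accessor call per char
      return false;
    }
  }
  return true;
}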
The problem is that in fastdebug or slowdebug builds, this amounts to
one function call per character (the accessor is not inlined). Even in
release builds, these loops appear to be significantly slower than the
libc memcpy() and memcmp() functions, which can use SSE4 (or other
vector extensions).
My patch adds new methods, char_cmp and char_cpy, which delegate to the
mem* functions instead of using hand-written loops.
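Simplified down (the gist above has the actual code and call sites),
the new helpers are essentially thin wrappers around the libc routines:

// Sketch of the two helpers, simplified from the actual patch: hand the
// whole jchar range to memcmp/memcpy so libc's optimized (SSE-capable)
// implementations do the work instead of a per-character loop.
#include <string.h>   // memcmp, memcpy

typedef unsigned short jchar;

// Returns 0 when the two jchar ranges are equal.
static inline int char_cmp(const jchar* a, const jchar* b, int length) {
  return memcmp(a, b, (size_t)length * sizeof(jchar));
}

// Copies 'length' jchars from src to dst (ranges must not overlap).
static inline void char_cpy(jchar* dst, const jchar* src, int length) {
  memcpy(dst, src, (size_t)length * sizeof(jchar));
}

Since equals() only needs an equality test, comparing the jchar data
byte-wise with memcmp() is safe regardless of byte order.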
The micro-benchmark results are very good for such a small change.
On fastdebug, before:
Benchmark                                         Mode  Thr     Cnt  Sec     Mean  Mean error    Units
o.s.b.InternBenchmark.testLongStringChmIntern   sample    1    2819    5    1.780       0.184  msec/op
o.s.b.InternBenchmark.testLongStringJdkIntern   sample    1     343    5   14.571       0.310  msec/op
o.s.b.InternBenchmark.testLongStringNoIntern    sample    1    8712    5    0.526       0.138  msec/op
o.s.b.InternBenchmark.testShortStringChmIntern  sample    1    4427    5    1.133       0.121  msec/op
o.s.b.InternBenchmark.testShortStringJdkIntern  sample    1     603    5    8.319       0.161  msec/op
o.s.b.InternBenchmark.testShortStringNoIntern   sample    1   17185    5    0.274       0.048  msec/op
After:
Benchmark                                         Mode  Thr     Cnt  Sec     Mean  Mean error    Units
o.s.b.InternBenchmark.testLongStringChmIntern   sample    1    2898    5    1.812       0.208  msec/op
o.s.b.InternBenchmark.testLongStringJdkIntern   sample    1    1138    5    4.397       0.136  msec/op
o.s.b.InternBenchmark.testLongStringNoIntern    sample    1    9035    5    0.519       0.146  msec/op
o.s.b.InternBenchmark.testShortStringChmIntern  sample    1    4538    5    1.094       0.107  msec/op
o.s.b.InternBenchmark.testShortStringJdkIntern  sample    1    1363    5    3.686       0.100  msec/op
o.s.b.InternBenchmark.testShortStringNoIntern   sample    1   16686    5    0.316       0.081  msec/op
On release, before:
Benchmark                                         Mode  Thr     Cnt  Sec     Mean  Mean error    Units
o.s.b.InternBenchmark.testLongStringChmIntern   sample    1    4030    5    1.240       0.002  msec/op
o.s.b.InternBenchmark.testLongStringJdkIntern   sample    1    1024    5    4.894       0.042  msec/op
o.s.b.InternBenchmark.testLongStringNoIntern    sample    1   20000    5    0.185       0.002  msec/op
o.s.b.InternBenchmark.testShortStringChmIntern  sample    1    6143    5    0.814       0.005  msec/op
o.s.b.InternBenchmark.testShortStringJdkIntern  sample    1    1852    5    2.702       0.016  msec/op
o.s.b.InternBenchmark.testShortStringNoIntern   sample    1   20000    5    0.102       0.001  msec/op
After:
Benchmark                                         Mode  Thr     Cnt  Sec     Mean  Mean error    Units
o.s.b.InternBenchmark.testLongStringChmIntern   sample    1    4040    5    1.236       0.002  msec/op
o.s.b.InternBenchmark.testLongStringJdkIntern   sample    1    2733    5    1.832       0.010  msec/op
o.s.b.InternBenchmark.testLongStringNoIntern    sample    1   20000    5    0.181       0.002  msec/op
o.s.b.InternBenchmark.testShortStringChmIntern  sample    1    6170    5    0.809       0.001  msec/op
o.s.b.InternBenchmark.testShortStringJdkIntern  sample    1    3577    5    1.396       0.007  msec/op
o.s.b.InternBenchmark.testShortStringNoIntern   sample    1   20000    5    0.102       0.000  msec/op
For JDK interning of long strings, this is better than a 3x improvement
on fastdebug builds and roughly a 2.7x improvement on release builds
(the short-string case improves a bit less). Interning is now only ~50%
slower than the ConcurrentHashMap approach, at least in the
low-contention case! (I did not benchmark higher thread counts
thoroughly, but I do not think my changes could make that any worse...)
Finally, the benchmark code:
https://github.com/stevenschlansker/jvm-intern-benchmark/blob/master/src/main/java/org/sugis/benchmark/InternBenchmark.java
It is not the most rigorous benchmark ever written, but I'm hopeful
that it's "good enough" to demonstrate the wins I'm seeing. Please let
me know if you believe the benchmark invalidates my conclusions. It is
run with JMH, as that seems to be the standard way of doing things
around here.
Thank you again for your time and input; I am hopeful that I have not
erred terribly :-)
Best,
Steven