Dismal performance of String.intern()
Steven Schlansker
stevenschlansker at gmail.com
Mon Aug 5 05:21:10 UTC 2013
On Tue, 11 Jun 2013 10:28:14 +0100
Alan Bateman <Alan.Bateman at oracle.com> wrote:
> On 10/06/2013 19:06, Steven Schlansker wrote:
> > Hi core-libs-dev,
> >
> > While doing performance profiling of my application, I discovered
> > that nearly 50% of the time deserializing JSON was spent within
> > String.intern(). I understand that interning Strings is generally
> > not the best approach, but I think I have a decent use
> > case -- the value of a certain field is one of a very limited
> > number of valid values (that are not known at compile time, so I
> > cannot use an Enum), and is repeated many millions of times in the
> > JSON stream.
> >
> Have you run with -XX:+PrintStringTableStatistics? Might be
> interesting if you can share the output (it is printed just before
> the VM terminates).
>
> There are also tuning knobs such as StringTableSize; it would be
> interesting to know if you've experimented with them.
>
> -Alan.
Hi everyone,
Thanks again for your useful advice. I definitely misjudged the
difficulty and complexity of working with methods that directly bridge
the Java <-> C++ "gap"! As such, it took me way longer to get to this
than I expected...
That said, I think I have very good results to report. OpenJDK 8 is
already *significantly* better here than OpenJDK 7 was, but it can
still be improved.
Here is the patch I have at the moment:
https://gist.github.com/stevenschlansker/6153643
I used oprofile with oprofile-jit to identify the hot spots in the
benchmark code as java_lang_String::equals() and
java_lang_String::as_unicode_string().
Both of these methods have hand-written loops that copy or compare
jchar arrays.
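Schematically, the comparison loop has this shape (a paraphrased
stand-in to illustrate the pattern, not the actual HotSpot code; in the
real thing the characters come back through an accessor on the backing
char array):

// Paraphrased stand-in for the per-character loop in
// java_lang_String::equals(): each character of the String's backing
// array is read through an accessor call and compared to the probe.
typedef unsigned short jchar;

struct CharArraySketch {                 // stand-in for the backing char array
  const jchar* data;
  jchar char_at(int i) const { return data[i]; }
};

static bool equals_chars(const CharArraySketch& value, int offset,
                         const jchar* chars, int len) {
  for (int i = 0; i < len; i++) {
    if (value.char_at(i + offset) != chars[i]) {  // one accessor call per char
      return false;
    }
  }
  return true;
}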
The problem is that in fastdebug or slowdebug builds, this amounts to
one function call per character (the accessor is not inlined). Even in
release builds, these loops appear to be significantly slower than the
libc memcpy() and memcmp() functions, which can use SSE4 (or other
vector extensions).
My patch adds new methods, char_cmp and char_cpy, which delegate to the
mem* functions instead of using hand-written loops.
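Simplified down (the gist above has the actual code and call sites),
the new helpers are essentially thin wrappers around the libc routines:

// Sketch of the two helpers, simplified from the actual patch: hand the
// whole jchar range to memcmp/memcpy so libc's optimized (SSE-capable)
// implementations do the work instead of a per-character loop.
#include <string.h>   // memcmp, memcpy

typedef unsigned short jchar;

// Returns 0 when the two jchar ranges are equal.
static inline int char_cmp(const jchar* a, const jchar* b, int length) {
  return memcmp(a, b, (size_t)length * sizeof(jchar));
}

// Copies 'length' jchars from src to dst (ranges must not overlap).
static inline void char_cpy(jchar* dst, const jchar* src, int length) {
  memcpy(dst, src, (size_t)length * sizeof(jchar));
}

Since equals() only needs an equality test, comparing the jchar data
byte-wise with memcmp() is safe regardless of byte order.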
The micro-benchmark results are very good for such a small change.
On fastdebug, before:
Benchmark                                         Mode  Thr     Cnt  Sec     Mean  Mean error    Units
o.s.b.InternBenchmark.testLongStringChmIntern   sample    1    2819    5    1.780       0.184  msec/op
o.s.b.InternBenchmark.testLongStringJdkIntern   sample    1     343    5   14.571       0.310  msec/op
o.s.b.InternBenchmark.testLongStringNoIntern    sample    1    8712    5    0.526       0.138  msec/op
o.s.b.InternBenchmark.testShortStringChmIntern  sample    1    4427    5    1.133       0.121  msec/op
o.s.b.InternBenchmark.testShortStringJdkIntern  sample    1     603    5    8.319       0.161  msec/op
o.s.b.InternBenchmark.testShortStringNoIntern   sample    1   17185    5    0.274       0.048  msec/op
After:
Benchmark                                         Mode  Thr     Cnt  Sec     Mean  Mean error    Units
o.s.b.InternBenchmark.testLongStringChmIntern   sample    1    2898    5    1.812       0.208  msec/op
o.s.b.InternBenchmark.testLongStringJdkIntern   sample    1    1138    5    4.397       0.136  msec/op
o.s.b.InternBenchmark.testLongStringNoIntern    sample    1    9035    5    0.519       0.146  msec/op
o.s.b.InternBenchmark.testShortStringChmIntern  sample    1    4538    5    1.094       0.107  msec/op
o.s.b.InternBenchmark.testShortStringJdkIntern  sample    1    1363    5    3.686       0.100  msec/op
o.s.b.InternBenchmark.testShortStringNoIntern   sample    1   16686    5    0.316       0.081  msec/op
On release, before:
Benchmark                                         Mode  Thr     Cnt  Sec     Mean  Mean error    Units
o.s.b.InternBenchmark.testLongStringChmIntern   sample    1    4030    5    1.240       0.002  msec/op
o.s.b.InternBenchmark.testLongStringJdkIntern   sample    1    1024    5    4.894       0.042  msec/op
o.s.b.InternBenchmark.testLongStringNoIntern    sample    1   20000    5    0.185       0.002  msec/op
o.s.b.InternBenchmark.testShortStringChmIntern  sample    1    6143    5    0.814       0.005  msec/op
o.s.b.InternBenchmark.testShortStringJdkIntern  sample    1    1852    5    2.702       0.016  msec/op
o.s.b.InternBenchmark.testShortStringNoIntern   sample    1   20000    5    0.102       0.001  msec/op
After:
Benchmark                                         Mode  Thr     Cnt  Sec     Mean  Mean error    Units
o.s.b.InternBenchmark.testLongStringChmIntern   sample    1    4040    5    1.236       0.002  msec/op
o.s.b.InternBenchmark.testLongStringJdkIntern   sample    1    2733    5    1.832       0.010  msec/op
o.s.b.InternBenchmark.testLongStringNoIntern    sample    1   20000    5    0.181       0.002  msec/op
o.s.b.InternBenchmark.testShortStringChmIntern  sample    1    6170    5    0.809       0.001  msec/op
o.s.b.InternBenchmark.testShortStringJdkIntern  sample    1    3577    5    1.396       0.007  msec/op
o.s.b.InternBenchmark.testShortStringNoIntern   sample    1   20000    5    0.102       0.000  msec/op
For JDK interning of long strings, this is better than a 3x improvement
on fastdebug builds and roughly a 2.7x improvement on release builds
(the short-string case improves a bit less). Interning is now only ~50%
slower than the ConcurrentHashMap approach, at least in the
low-contention case! (I did not benchmark higher thread counts
thoroughly, but I do not think my changes could make that any worse...)
Finally, the benchmark code:
https://github.com/stevenschlansker/jvm-intern-benchmark/blob/master/src/main/java/org/sugis/benchmark/InternBenchmark.java
It is not the most rigorous benchmark ever written, but I'm hopeful
that it's "good enough" to demonstrate the wins I'm seeing. Please let
me know if you believe the benchmark invalidates my conclusions. It is
run with JMH, as that seems to be the standard way of doing things
around here.
Thank you again for your time and input; I am hopeful that I have not
erred terribly :-)
Best,
Steven