Dismal performance of String.intern()

Mon Aug 5 18:15:43 UTC 2013

Hi Steven,

This looks like a promising patch. If all goes well I can sponsor this 
patch into hotspot-rt.

I have problems building your test. Could you send a stand-alone test 
that can be build with plain JDK (i.e., exclude things like 
com.google.common. and  org.openjdk.jmh)?

Also, I saw that you tested with strings that are the same length as 
UUID, which are 36 chars long:

    2544f85f-1845-42c4-aba1-7a431e6edb99
    eff61458-32f8-45cd-a77d-fddb14951cab
    d685f1a2-0b95-4021-b9b3-b0fb3c02a17f

We do need to handle strings with shorter length, and making calls to 
memcmp() may cause degradation with lots of short strings. Can you add a 
sub-test that tests for strings length <= 4?

I would also recommend running a few non-trivial apps (such as eclipse) 
and dump the interned string table to see the distribution of lengths. I 
know this is a hassle, but it would be a good last step to wrap up your 
work.

Meanwhile, I will try to run our benchmarks (aka "refworkload") to check 
for any performance impact.

Thanks
- Ioi

On 08/04/2013 10:21 PM, Steven Schlansker wrote:
> On Tue, 11 Jun 2013 10:28:14 +0100
> Alan Bateman <Alan.Bateman at oracle.com> wrote:
>
>> On 10/06/2013 19:06, Steven Schlansker wrote:
>>> Hi core-libs-dev,
>>>
>>> While doing performance profiling of my application, I discovered
>>> that nearly 50% of the time deserializing JSON was spent within
>>> String.intern().  I understand that in general interning Strings is
>>> not the best approach for things, but I think I have a decent use
>>> case -- the value of a certain field is one of a very limited
>>> number of valid values (that are not known at compile time, so I
>>> cannot use an Enum), and is repeated many millions of times in the
>>> JSON stream.
>>>
>> Have you run with -XX:+PrintStringTableStatistics? Might be
>> interesting if you can share the output (it is printed just before
>> the VM terminates).
>>
>> There are also tuning knobs such as StringTableSize and would be
>> interesting to know if you've experimented with.
>>
>> -Alan.
> Hi everyone,
>
> Thanks again for your useful advice.  I definitely misjudged the
> difficulty and complexity of working with methods that directly bridge
> the Java <-> C++ "gap"!  As such, it took me way longer to get to this
> than I expected...
>
> That said, I think I have very good results to report.  OpenJDK8 is
> already *significantly* better than OpenJDK 7 was, but still can be
> better.
>
> Here is the patch I have at the moment:
> https://gist.github.com/stevenschlansker/6153643
>
> I used oprofile with oprofile-jit to identify the hot spots in the
> benchmark code as being java_lang_String::equals() and
> java_lang_String::as_unicode_string.
>
> Both of these methods have hand-written loops that copy or compare
> jchar arrays.
>
> The problem is that in -fastdebug or -slowdebug releases, this is
> one method call per character (the function is not inlined).  Even in
> -release builds, it seems that this is significantly slower than the
> libc memcpy() or memcmp() functions which can use SSE4 (or other
> related technologies).
>
> My patch adds new methods, char_cmp and char_cpy, which delegate to the
> mem* functions instead of using hand-written loops.
>
> The micro-benchmark results are very good for such a small change.
>
> On fastdebug, before:
>
> Benchmark                                          Mode Thr    Cnt
> Sec         Mean   Mean error    Units
> o.s.b.InternBenchmark.testLongStringChmIntern    sample   1   2819
> 5        1.780        0.184  msec/op
> o.s.b.InternBenchmark.testLongStringJdkIntern    sample   1    343
> 5       14.571        0.310  msec/op
> o.s.b.InternBenchmark.testLongStringNoIntern     sample   1   8712
> 5        0.526        0.138  msec/op
> o.s.b.InternBenchmark.testShortStringChmIntern   sample   1   4427
> 5        1.133        0.121  msec/op
> o.s.b.InternBenchmark.testShortStringJdkIntern   sample   1    603
> 5        8.319        0.161  msec/op
> o.s.b.InternBenchmark.testShortStringNoIntern    sample   1  17185
> 5        0.274        0.048  msec/op
>
> After:
>
> Benchmark                                          Mode Thr    Cnt
> Sec         Mean   Mean error    Units
> o.s.b.InternBenchmark.testLongStringChmIntern    sample   1   2898
> 5        1.812        0.208  msec/op
> o.s.b.InternBenchmark.testLongStringJdkIntern    sample   1   1138
> 5        4.397        0.136  msec/op
> o.s.b.InternBenchmark.testLongStringNoIntern     sample   1   9035
> 5        0.519        0.146  msec/op
> o.s.b.InternBenchmark.testShortStringChmIntern   sample   1   4538
> 5        1.094        0.107  msec/op
> o.s.b.InternBenchmark.testShortStringJdkIntern   sample   1   1363
> 5        3.686        0.100  msec/op
> o.s.b.InternBenchmark.testShortStringNoIntern    sample   1  16686
> 5        0.316        0.081  msec/op
>
> On release, before:
>
> Benchmark                                          Mode Thr    Cnt
> Sec         Mean   Mean error    Units
> o.s.b.InternBenchmark.testLongStringChmIntern    sample   1   4030
> 5        1.240        0.002  msec/op
> o.s.b.InternBenchmark.testLongStringJdkIntern    sample   1   1024
> 5        4.894        0.042  msec/op
> o.s.b.InternBenchmark.testLongStringNoIntern     sample   1  20000
> 5        0.185        0.002  msec/op
> o.s.b.InternBenchmark.testShortStringChmIntern   sample   1   6143
> 5        0.814        0.005  msec/op
> o.s.b.InternBenchmark.testShortStringJdkIntern   sample   1   1852
> 5        2.702        0.016  msec/op
> o.s.b.InternBenchmark.testShortStringNoIntern    sample   1  20000
> 5        0.102        0.001  msec/op
>
> After:
>
> Benchmark                                          Mode Thr    Cnt
> Sec         Mean   Mean error    Units
> o.s.b.InternBenchmark.testLongStringChmIntern    sample   1   4040
> 5        1.236        0.002  msec/op
> o.s.b.InternBenchmark.testLongStringJdkIntern    sample   1   2733
> 5        1.832        0.010  msec/op
> o.s.b.InternBenchmark.testLongStringNoIntern     sample   1  20000
> 5        0.181        0.002  msec/op
> o.s.b.InternBenchmark.testShortStringChmIntern   sample   1   6170
> 5        0.809        0.001  msec/op
> o.s.b.InternBenchmark.testShortStringJdkIntern   sample   1   3577
> 5        1.396        0.007  msec/op
> o.s.b.InternBenchmark.testShortStringNoIntern    sample   1  20000
> 5        0.102        0.000  msec/op
>
>
> This is almost a 3.5x improvement on fastdebug builds, and more than a
> 2.5x improvement on release builds.  It is now only ~50% slower than
> ConcurrentHashMap, at least for the low-contention case!  (I did not
> benchmark higher numbers of threads thoroughly, but I do not think that
> my changes could make that worse...)
>
> Finally, the benchmark code:
> https://github.com/stevenschlansker/jvm-intern-benchmark/blob/master/src/main/java/org/sugis/benchmark/InternBenchmark.java
>
> It is not the best ever benchmark, but I'm hopeful that it's "good
> enough" to demonstrate the wins I have.  Please let me know if you
> believe the benchmark invalidates my conclusions.  It is run with JMH,
> as that seems to be the standard way of doing things around here.
>
> Thank you again for your time and input; I am hopeful that I have not
> erred terribly :-)
>
> Best,
> Steven