RFR JDK-8059510 Compact symbol table layout inside shared archive

Jiangli Zhou jiangli.zhou at oracle.com
Wed Dec 3 21:26:14 UTC 2014


On 12/03/2014 01:20 PM, Jiangli Zhou wrote:
> Hi Ioi,
>
> Thanks for the comments.
>
> On 12/03/2014 12:06 PM, Ioi Lam wrote:
>> Hi Jiangli,
>>
>> The changes look good. Just a few nits:
>>
>> (1) src/share/vm/services/diagnosticCommand.cpp:
>>
>> This does not seem to be needed:
>> 27 #include "classfile/compactHashtable.hpp"
>
> The #include of the header file is needed. The VM.symboltable and
> VM.stringtable definitions were moved to compactHashtable.hpp to be
> closer to their usage.
>
>>
>> (2) test/runtime/SharedArchiveFile/DumpSymbolAndStringTable.java:
>> 43         output.shouldContain("DumpSymbolAndStringTable");
>> 48         output.shouldContain("java.lang.String");
>>
>> Maybe these two should include the full line, and the terminating \n?
>
> Sounds good. Added.
>
>> Also, why is this necessary for VM.stringtable but not done for 
>> VM.symboltable?
>>
>>  49         } catch (RuntimeException e) {
>>  50             output.shouldContain("Unknown diagnostic command");
>
> Fixed for VM.symboltable. Thanks for catching that.
>
>>
>> (3) src/share/vm/classfile/compactHashtable.cpp:
>>
>>   50   stats->hashentry_bytes = _num_entries * 8;  // estimates only
>>
>> Is this always a conservative estimate (more than actually needed)?
>
> Yes. Some entries will have only the 4-byte offset, but we don't know
> how many such entries there are at this point, so we use a conservative
> estimate.
>
>> This should be explained in the comments.
>
> Will do.
>
>> Also, if the actual number is different, stats->hashentry_bytes 
>> should be adjusted so that the statistics can be printed correctly.
>
> Ok.
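>
> Something along these lines, perhaps -- a sketch only; the stats struct
> here is a stand-in and num_compact_entries is an illustrative name, not
> code from the webrev:
>
>   // Minimal stand-in for the dump-time statistics holder.
>   struct TableStats { int hashentry_bytes; };
>
>   static void estimate_then_adjust(TableStats* stats, int num_entries,
>                                    int num_compact_entries) {
>     // Dump-time estimate: worst case of a 4-byte hash plus a 4-byte
>     // offset per entry.
>     stats->hashentry_bytes = num_entries * 8;
>     // Once the buckets are laid out, each single-entry ('compact')
>     // bucket keeps only the 4-byte offset, so subtract the over-count.
>     stats->hashentry_bytes -= num_compact_entries * 4;
>   }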
>
>>
>> Indentation: 4->2
>>  134         assert(bucket_type == COMPACT_BUCKET_TYPE, "Bad bucket 
>> type");
>
> Fixed.
>
>>
>> Is there a reason to use guarantee here and not assert? (I realize
>> that I wrote these two lines, and I don't remember why :-)
>>  142       guarantee(deltax < max_delta, "range check");
>>  147     guarantee(count == _bucket_sizes[index], "sanity");
>
> I'm guessing you wanted to have the check in non-debug builds as well
> (guarantee is checked in product builds, while assert is compiled out).
>
>>
>> I think these two blocks can be rewritten to avoid the use of the #ifdef
>>  162 #ifdef _LP64
>>  163   *p++ = juint(base_address >> 32);
>>  164 #else
>>  165   *p++ = 0;
>>  166 #endif
>>  167   *p++ = juint(base_address & 0xffffffff); // base address
>>
>>  205   juint upper = *p++;
>>  206   juint lower = *p++;
>>  207 #ifdef _LP64
>>  208   _base_address = (uintx(upper) << 32 ) + uintx(lower);
>>  209 #else
>>  210   _base_address = uintx(lower);
>>  211 #endif
>>
>> ->
>>
>> 163   *p++ = juint(base_address >> 32);
>>  167   *p++ = juint(base_address & 0xffffffff);
>>
>>  205   juint upper = *p++;
>>  206   juint lower = *p++;
>>  208   _base_address = (uintx(upper) << 32 ) + uintx(lower);
>
> Sounds good.
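>
> One way to keep a single code path and still have the >> 32 shift well
> defined when uintx is only 32 bits wide is to widen to julong first -- a
> sketch of the idea, not code from the webrev:
>
>   // Write and read the base address without #ifdef _LP64. Widening to
>   // julong (64-bit) keeps the shift by 32 well defined on 32-bit too.
>   static void write_base_address(juint*& p, uintx base_address) {
>     julong base = (julong)base_address;
>     *p++ = juint(base >> 32);          // 0 on 32-bit platforms
>     *p++ = juint(base & 0xffffffff);   // low 32 bits of the base address
>   }
>
>   static uintx read_base_address(juint*& p) {
>     juint upper = *p++;
>     juint lower = *p++;
>     return (uintx)(((julong)upper << 32) + (julong)lower);
>   }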
>
>>
>> Also, do you have statistics of the percentage of COMPACT_BUCKET_TYPE 
>> vs REGULAR_BUCKET_TYPE?
>
> Yes. :) With the default classlist, compact buckets are about 6% of
> the buckets.

Here is a better formatted table:

entries per bucket:   0    1    2    3    4    5    6    7    8   >8
number of buckets:   25   71  172  247  229  158  108   87   29   26



Thanks,
Jiangli

>
>                 empty    1    2    3    4    5    6    7    8   >8
> Bucket size 4      25   71  172  247  229  158  108   87   29   26
> Compressed
>
> Thanks,
> Jiangli
>
>>
>> Thanks
>> - Ioi
>>
>>
>> On 12/2/14, 1:44 PM, Jiangli Zhou wrote:
>>> Hi,
>>>
>>> I finally got back to this, sorry for the delay. Please see the
>>> following new webrev:
>>>
>>> http://cr.openjdk.java.net/~jiangli/8059510/webrev.06/
>>>
>>> New in the webrev:
>>>
>>> 1. Further compression of the compact table
>>>
>>>   * Removed the bucket_size table. With the sequential layout of the
>>>     buckets, the lookup process can seek to the start of the next
>>>     bucket without needing the current bucket's size. For the last
>>>     bucket, it seeks to the end of the table; the table end offset is
>>>     added to the archived data.
>>>   * A bucket with exactly one entry is marked as a 'compact' bucket,
>>>     whose entry contains only the symbol offset. The symbol hash is
>>>     eliminated for 'compact' buckets; lookup compares the symbol
>>>     directly in that case (see the sketch after this list).
>>>
>>> 2. The shared symbol table is no longer always looked up first. The
>>> last table that fetched a symbol successfully is the one tried first.
>>>
>>> 3. Added a lot more comments in compactHashtable.hpp with details of 
>>> the compact table layout and dump/lookup process.
>>>
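>>> As an illustration of the layout described above -- the names and the
>>> exact bit encoding in this sketch are assumptions, not the code in the
>>> webrev:
>>>
>>>   // _buckets[i] is assumed to hold bucket i's start offset into the
>>>   // entries area plus a 2-bit type tag; _buckets[bucket_count] holds
>>>   // the table end offset, so no bucket needs an explicit size.
>>>   Symbol* sketch_lookup(const juint* _buckets, const juint* _entries,
>>>                         uintx _base_address, int bucket_count,
>>>                         const char* name, int len, unsigned int hash) {
>>>     int index = hash % bucket_count;
>>>     juint type  = _buckets[index] >> 30;
>>>     juint start = _buckets[index] & 0x3fffffff;
>>>     juint end   = _buckets[index + 1] & 0x3fffffff; // next bucket's
>>>                                                     // start, or table end
>>>     const juint* entry     = _entries + start;
>>>     const juint* entry_end = _entries + end;
>>>     if (type == COMPACT_BUCKET_TYPE) {
>>>       // Exactly one entry: only the symbol offset is stored, no hash;
>>>       // compare the symbol directly.
>>>       Symbol* sym = (Symbol*)(_base_address + entry[0]);
>>>       return sym->equals(name, len) ? sym : NULL;
>>>     }
>>>     // Regular bucket: (hash, offset) pairs up to the next bucket.
>>>     for (; entry < entry_end; entry += 2) {
>>>       if (entry[0] == (juint)hash) {
>>>         Symbol* sym = (Symbol*)(_base_address + entry[1]);
>>>         if (sym->equals(name, len)) {
>>>           return sym;
>>>         }
>>>       }
>>>     }
>>>     return NULL;
>>>   }
>>>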
>>> I measured with the classloading benchmark that Aleksey pointed me
>>> to. This benchmark loads classes using a user-defined classloader.
>>> The benchmark shows a very small degradation comparing 'before' and
>>> 'after' with an archive dumped using the default configuration. When
>>> the symbols from the test are added to the shared table, there is an
>>> observable speedup in the benchmark; the speedup is also very small.
>>>
>>> Thanks,
>>> Jiangli
>>>
>>>
>>> On 10/29/2014 02:55 PM, Jiangli Zhou wrote:
>>>> Hi John,
>>>>
>>>> Thank you for the thoughts on this! Yes, it's a good time to have 
>>>> these conversations. Please see some quick responses from me below, 
>>>> with more details to follow.
>>>>
>>>> On 10/29/2014 12:46 PM, John Rose wrote:
>>>>> I have a few points about the impact of this change on startup 
>>>>> performance, and on trends in our code base:
>>>>>
>>>>> 1. We can live with small performance regressions on some 
>>>>> benchmarks.  Otherwise we'd never get anywhere.  So I am not 
>>>>> saying that the current (very interesting and multi-faceted)
>>>>> conversation must continue a long time before we can push any code.
>>>>>
>>>>> 2. Aleksey's challenge is valid, and I don't see a strong reply to 
>>>>> it yet.  Work like this, that emphasizes compactness and 
>>>>> sharability can usually be given the benefit of the doubt for 
>>>>> startup.  But if we observe a problem we should make a more 
>>>>> careful measurement.  If this change goes in with a measurable 
>>>>> regression, we need a follow-up conversation (centered around a 
>>>>> tracking bug) about quantifying the performance regression and 
>>>>> fixing it.  (It's possible the conversation ends by saying "we 
>>>>> decided we don't care and here's why", but I doubt it will end 
>>>>> that way.)
>>>>
>>>> Besides my classloading benchmark results posted in an earlier
>>>> message, we have asked the performance team to help us measure
>>>> startup, classloading, and memory savings for this change, and
>>>> verified there was no regression in startup/classloading. The
>>>> results were not posted in the thread, however; they were added to
>>>> the bug report for JDK-8059510
>>>> <https://bugs.openjdk.java.net/browse/JDK-8059510>.
>>>>
>>>>> 3. The benchmark Aleksey chose, Nashorn startup, may well be an 
>>>>> outlier.  Dynamic language runtimes create lots of tiny methods, 
>>>>> and are likely to put heavy loads on symbol resolution, including 
>>>>> resolution of symbols that are not known statically.  For these, 
>>>>> the first-look (front-end) table being proposed might be a loss, 
>>>>> especially if the Nashorn runtime is not in the first-look table.
>>>>>
>>>>> 4. Possible workarounds:  #1 Track hit rates and start looking at 
>>>>> the static table *second* if it fails to produce enough hits, 
>>>>> compared to the dynamic table.  #2 When that happens, 
>>>>> incrementally migrate (rehash) frequently found symbols in the 
>>>>> static table into the main table.  #3 Re-evaluate the hashing 
>>>>> algorithm.  (More on this below.)
>>>>
>>>> That's interesting thinking. I'll follow up on this.
>>>>
>>>>> 5. I strongly support moving towards position-independent, 
>>>>> shareable, pointer-free data structures, and I want us to learn to 
>>>>> do it well over time.  (Ioi, you are one of our main explorers 
>>>>> here!)  To me, a performance regression is a suggestion that we 
>>>>> have something more to learn.  And, it's not a surprise that we 
>>>>> don't have it all figured out yet.
>>>>
>>>> Point taken! Position independence is one of our goals for the
>>>> archived data. We'll start looking into removing direct pointers
>>>> in the shared class data.
>>>>
>>>>> 6. For work like this, we need to agree upon at least one or two 
>>>>> startup performance tests to shake out bottlenecks early and give 
>>>>> confidence. People who work on startup performance should know how 
>>>>> to run them and be able to quote them.
>>>>
>>>> Thank you for bringing this up. Totally agree. Personally, I've
>>>> played with different benchmarks and applications with different
>>>> focuses over time. It would be good to have some commonly agreed
>>>> upon startup performance tests for this. A standalone classloading
>>>> benchmark, HelloWorld, and Nashorn startup are probably good
>>>> choices. We also have an internal application for startup and
>>>> memory-saving measurements.
>>>>
>>>>> 7. There's a vibrant literature about offline (statically created) 
>>>>> hash tables, and lots of tricks floating around, such as perfect 
>>>>> or semi-perfect hash functions, and multiple-choice hashing, and 
>>>>> locality-aware structures.  I can think of several algorithmic 
>>>>> tweaks I'd like to try on this code.  (If they haven't already 
>>>>> been tried or discarded: I assume Ioi has already tried a number 
>>>>> of things.)  Moreover, this is not just doorknob-polishing, 
>>>>> because (as said in point 5) we want to develop our competency 
>>>>> with these sorts of data structures.
>>>>>
>>>>> 8. I found the code hard to read, so it was hard to reason about 
>>>>> the algorithm.  The core routine, "lookup", has no comments and 
>>>>> only one assert.  It uses only primitive C types and (perhaps in 
>>>>> the name of performance) is not factored into sub-functions.  The 
>>>>> code which generates the static table is also hard to reason about 
>>>>> in similar ways.  The bucket size parameter choice (a crucial 
>>>>> compromise between performance and compactness) is 4 but the 
>>>>> motivation and implications are left for the reader to puzzle 
>>>>> out.  Likewise, the use of hardware division instead of a
>>>>> power-of-two table size is presented without comment, despite the
>>>>> fact that we favor power-of-two in other parts of our stack, and a
>>>>> power-of-two solution would be reasonable here (given the bucket
>>>>> size); see the sketch after this list.
>>>>>
>>>>> 9. Perhaps I wouldn't care as much about the code style if the 
>>>>> code were not performance sensitive or if it were a unique and 
>>>>> isolated part of the JVM.  In this case, I expect this code (and 
>>>>> other similar code) to improve, over time, in readability and 
>>>>> robustness, as we learn to work with this new kind of data 
>>>>> structure.  So even if we decide there is no significant 
>>>>> regression here, and decide to push it as-is, we still need to use 
>>>>> it as an example to help us get better at writing easy-to-read 
>>>>> code which works with pointer-free data.
>>>>>
>>>>> 10. I would like to see (posted somewhere or attached to the bug) 
>>>>> a sample list of the symbols in a typical symbol table.  Perhaps 
>>>>> this already exists and I missed it.  I think it would be friendly 
>>>>> to put some comments in the code that help the reader estimate 
>>>>> numbers like table size, bucket length, number of queries, number 
>>>>> of probes per query, symbol length statistics (mean, histogram).  
>>>>> Of course such comments go stale over time, but so does the 
>>>>> algorithm, and being coy about the conditions of the moment 
>>>>> doesn't help us in the long run.  Even a short comment is better 
>>>>> than none, for example:
>>>>> http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/87ee5ee27509/src/share/vm/classfile/vmSymbols.cpp#l206
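>>>>>
>>>>> For concreteness, the two bucket-indexing choices contrasted in
>>>>> point 8 above -- a sketch only, not the code under review:
>>>>>
>>>>>   // Division-based: works for any bucket count, but costs a
>>>>>   // hardware divide per lookup.
>>>>>   unsigned index_div(unsigned hash, unsigned bucket_count) {
>>>>>     return hash % bucket_count;
>>>>>   }
>>>>>
>>>>>   // Power-of-two: bucket_count must be a power of two, and the
>>>>>   // index costs a single AND per lookup.
>>>>>   unsigned index_pow2(unsigned hash, unsigned bucket_count) {
>>>>>     return hash & (bucket_count - 1);
>>>>>   }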
>>>>
>>>> Those are good suggestions. I'll add a sample list of the symbols to
>>>> the bug report and try to put more comments in the code.
>>>>
>>>> Thanks!
>>>>
>>>> Jiangli
>>>>
>>>>> It is a good time to have these conversations.
>>>>>
>>>>> Best wishes,
>>>>> — John
>>>>>
>>>>> On Oct 13, 2014, at 11:46 AM, Aleksey
>>>>> Shipilev <aleksey.shipilev at oracle.com> wrote:
>>>>>
>>>>>> Hi Jiangli,
>>>>>>
>>>>>> On 13.10.2014 18:26, Jiangli Zhou wrote:
>>>>>>> On 10/13/2014 03:18 AM, Aleksey Shipilev wrote:
>>>>>>>> On 13.10.2014 03:32, David Holmes wrote:
>>>>>>>>> On 11/10/2014 1:47 PM, Jiangli Zhou wrote:
>>>>>>>>> Also is the benchmarking being done on dedicated systems?
>>>>>>>> Also, specjvm98 is meaningless for estimating classloading costs.
>>>>>>>> Could you please try the specjvm2008:startup.* tests?
>>>>>>> The specjvm run was for Gerard's question about standard 
>>>>>>> benchmarks.
>>>>>> SPECjvm2008 is a standard benchmark. In fact, it is a benchmark that
>>>>>> deprecates SPECjvm98.
>>>>>>
>>>>>>> These are not benchmarks specifically for classloading.
>>>>>> There are benchmarks that try to estimate the startup costs.
>>>>>> SPECjvm2008:startup.* tests are one of them.
>>>>>>
>>>>>>> However, I agree it's a good idea to run standard benchmarks to
>>>>>>> confirm there is no overall performance degradation. From all the
>>>>>>> benchmarks including classloading measurements, we have confirmed
>>>>>>> that this specific change does not have a negative impact on
>>>>>>> classloading itself or on overall performance.
>>>>>> Excellent. What are those benchmarks? May we see those? Because I 
>>>>>> have a
>>>>>> counter-example in this thread that this change *does* negatively 
>>>>>> impact
>>>>>> classloading.
>>>>>>
>>>>>> -Aleksey.
>>>>>>
>>>>
>>>
>>
>


