RFR JDK-8059510 Compact symbol table layout inside shared archive

Wed Dec 3 21:28:56 UTC 2014

On 12/3/14, 1:20 PM, Jiangli Zhou wrote:
>
>>
>> Is the a reason to use guarantee here and not assert (I realize that 
>> I wrote these two lines, and I don't remember why :-)?
>>  142       guarantee(deltax < max_delta, "range check");
>>  147     guarantee(count == _bucket_sizes[index], "sanity");
>
> I'm guessing you wanted to have the check in non-debug build also.

I think it's better to switch to assert for uniformity. The checks 
aren't necessary for non-debug build.

Thanks
- Ioi
>
>>
>> I think these two blocks can be rewritten to avoid the use of the #ifdef
>>  162 #ifdef _LP64
>>  163   *p++ = juint(base_address >> 32);
>>  164 #else
>>  165   *p++ = 0;
>>  166 #endif
>>  167   *p++ = juint(base_address & 0xffffffff); // base address
>>
>>  205   juint upper = *p++;
>>  206   juint lower = *p++;
>>  207 #ifdef _LP64
>>  208   _base_address = (uintx(upper) << 32 ) + uintx(lower);
>>  209 #else
>>  210   _base_address = uintx(lower);
>>  211 #endif
>>
>> ->
>>
>> 163   *p++ = juint(base_address >> 32);
>>  167   *p++ = juint(base_address & 0xffffffff);
>>
>>  205   juint upper = *p++;
>>  206   juint lower = *p++;
>>  208   _base_address = (uintx(upper) << 32 ) + uintx(lower);
>
> Sounds good.
>
>>
>> Also, do you have statistics of the percentage of COMPACT_BUCKET_TYPE 
>> vs REGULAR_BUCKET_TYPE?
>
> Yes. :) With the default classlist, the compact bucket is about 6% for.
>
>
> 	empty buckets 	1 entry buckets
> 	2 entries buckets
> 	3 entries buckets
> 	4 entries buckets
> 	5 entries buckets
> 	6 entries buckets
> 	7 entries buckets
> 	8 entries buckets
> 	>8 entries
> Bucket size 4
> 	25
> 	71
> 	172
> 	247
> 	229
> 	158
> 	108
> 	87
> 	29
> 	26
>
> 	
> 	Compressed
> 	
> 	
> 	
> 	
> 	
> 	
> 	
>
>
>
> Thanks,
> Jiangli
>
>>
>> Thanks
>> - Ioi
>>
>>
>> On 12/2/14, 1:44 PM, Jiangli Zhou wrote:
>>> Hi,
>>>
>>> I finally got back to this, sorry for the delay. Please see the 
>>> following new webre.
>>>
>>> http://cr.openjdk.java.net/~jiangli/8059510/webrev.06/
>>>
>>> New in the webrev:
>>>
>>> 1. Further compression of the compact table
>>>
>>>   * Remove the bucket_size table. With the sequential layout of the
>>>     buckets, lookup process can seek to the start of the next bucket
>>>     without the need of the current bucket size. For the last
>>>     bucket, it can seek to the end of the table. The table end
>>>     offset is added to the archived data.
>>>   * Bucket with exactly one entry is marked as 'compact' bucket,
>>>     whose entry only contains the symbol offset. The symbol hash is
>>>     eliminated for 'compact' buckets. Lookup compares the symbol
>>>     directly in that case.
>>>
>>> 2. The shared symbol table is not always looked up first. The last 
>>> table that fetches symbol successfully is used for lookup.
>>>
>>> 3. Added a lot more comments in compactHashtable.hpp with details of 
>>> the compact table layout and dump/lookup process.
>>>
>>> I measured using the classloading benchmark that Aleksey pointed to 
>>> me. This benchmark loads classes using user defined classloader. 
>>> There is a very small degradation shown in the benchmark comparing 
>>> 'before' and 'after' with archive dumped with the default 
>>> configuration. When symbols from the test is added to the shared 
>>> table, there is an observable speedup in the benchmark. The speedup 
>>> is also very small.
>>>
>>> Thanks,
>>> Jiangli
>>>
>>>
>>> On 10/29/2014 02:55 PM, Jiangli Zhou wrote:
>>>> Hi John,
>>>>
>>>> Thank you for the thoughts on this! Yes, it's a good time to have 
>>>> these conversations. Please see some quick responses from me below, 
>>>> with more details to follow.
>>>>
>>>> On 10/29/2014 12:46 PM, John Rose wrote:
>>>>> I have a few points about the impact of this change on startup performance, and on trends in our code base:
>>>>>
>>>>> 1. We can live with small performance regressions on some benchmarks.  Otherwise we'd never get anywhere.  So I am not saying that the current (very interesting and multi-facted) conversation must continue a long time before we can push any code.
>>>>>
>>>>> 2. Aleksey's challenge is valid, and I don't see a strong reply to it yet.  Work like this, that emphasizes compactness and sharability can usually be given the benefit of the doubt for startup.  But if we observe a problem we should make a more careful measurement.  If this change goes in with a measurable regression, we need a follow-up conversation (centered around a tracking bug) about quantifying the performance regression and fixing it.  (It's possible the conversation ends by saying "we decided we don't care and here's why", but I doubt it will end that way.)
>>>>
>>>> Besides my classloading benchmark results posted in earlier 
>>>> message, we have asked performance team to help us measure startup, 
>>>> classloding, and memory saving regarding this change, and verified 
>>>> there was no regression in startup/classloading. The results were 
>>>> not posted in the thread however. They were added to the bug report 
>>>> for JDK-8059510 <https://bugs.openjdk.java.net/browse/JDK-8059510>.
>>>>
>>>>> 3. The benchmark Aleksey chose, Nashorn startup, may well be an outlier.  Dynamic language runtimes create lots of tiny methods, and are likely to put heavy loads on symbol resolution, including resolution of symbols that are not known statically.  For these, the first-look (front-end) table being proposed might be a loss, especially if the Nashorn runtime is not in the first-look table.
>>>>>
>>>>> 4. Possible workarounds:  #1 Track hit rates and start looking at the static table *second* if it fails to produce enough hits, compared to the dynamic table.  #2 When that happens, incrementally migrate (rehash) frequently found symbols in the static table into the main table.  #3 Re-evaluate the hashing algorithm.  (More on this below.)
>>>>
>>>> That's interesting thinking. I'll follow up on this.
>>>>
>>>>> 5. I strongly support moving towards position-independent, shareable, pointer-free data structures, and I want us to learn to do it well over time.  (Ioi, you are one of our main explorers here!)  To me, a performance regression is a suggestion that we have something more to learn.  And, it's not a surprise that we don't have it all figured out yet.
>>>>
>>>> Point taken! Position-independent is one of our goal for the 
>>>> archived data. We'll be start looking into removing direct pointers 
>>>> in the shared class data.
>>>>
>>>>> 6. For work like this, we need to agree upon at least one or two startup performance tests to shake out bottlenecks early and give confidence.  People who work on startup performance should know how to run them and be able to quote them.
>>>>
>>>> Thank you for bring this up. Totally agree. For me, personally I've 
>>>> played with different benchmarks and applications with different 
>>>> focus over time. It would be good to have some commonly agreed  
>>>> startup performance tests for this. A standalone classloading 
>>>> benchmark, HelloWorld and Nashorn startup probably are good 
>>>> choices. We also have an internal application for startup and 
>>>> memory saving measurement.
>>>>
>>>>> 7. There's a vibrant literature about offline (statically created) hash tables, and lots of tricks floating around, such as perfect or semi-perfect hash functions, and multiple-choice hashing, and locality-aware structures.  I can think of several algorithmic tweaks I'd like to try on this code.  (If they haven't already been tried or discarded: I assume Ioi has already tried a number of things.)  Moreover, this is not just doorknob-polishing, because (as said in point 5) we want to develop our competency with these sorts of data structures.
>>>>>
>>>>> 8. I found the code hard to read, so it was hard to reason about the algorithm.  The core routine, "lookup", has no comments and only one assert.  It uses only primitive C types and (perhaps in the name of performance) is not factored into sub-functions.  The code which generates the static table is also hard to reason about in similar ways.  The bucket size parameter choice (a crucial compromise between performance and compactness) is 4 but the motivation and implications are left for the reader to puzzle out.  Likewise, the use of hardware division instead of power-of-two table size is presented without comment, despite the fact that we favor power-of-two in other parts of our stack, and a power-of-two solution would be reasonable here (given the bucket size).
>>>>>
>>>>> 9. Perhaps I wouldn't care as much about the code style if the code were not performance sensitive or if it were a unique and isolated part of the JVM.  In this case, I expect this code (and other similar code) to improve,  over time, in readability and robustness, as we learn to work with this new kind of data structure.  So even if we decide there is no significant regression here, and decide to push it as-is, we still need to use it as an example to help us get better at writing easy-to-read code which works with pointer-free data.
>>>>>
>>>>> 10. I would like to see (posted somewhere or attached to the bug) a sample list of the symbols in a typical symbol table.  Perhaps this already exists and I missed it.  I think it would be friendly to put some comments in the code that help the reader estimate numbers like table size, bucket length, number of queries, number of probes per query, symbol length statistics (mean, histogram).  Of course such comments go stale over time, but so does the algorithm, and being coy about the conditions of the moment doesn't help us in the long run.  Even a short comment is better than none, for example:
>>>>>    http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/87ee5ee27509/src/share/vm/classfile/vmSymbols.cpp#l206
>>>>
>>>> Those are good suggestions. I'll add a sample list o the symbols to 
>>>> the bug report and try to put more comments in the code.
>>>>
>>>> Thanks!
>>>>
>>>> Jiangli
>>>>
>>>>> It is a good time to have these conversations.
>>>>>
>>>>> Best wishes,
>>>>> — John
>>>>>
>>>>> On Oct 13, 2014, at 11:46 AM, Aleksey Shipilev<aleksey.shipilev at oracle.com>  wrote:
>>>>>
>>>>>> Hi Jiangli,
>>>>>>
>>>>>> On 13.10.2014 18:26, Jiangli Zhou wrote:
>>>>>>> On 10/13/2014 03:18 AM, Aleksey Shipilev wrote:
>>>>>>>> On 13.10.2014 03:32, David Holmes wrote:
>>>>>>>>> On 11/10/2014 1:47 PM, Jiangli Zhou wrote:
>>>>>>>>> Also is the benchmarking being done on dedicated systems?
>>>>>>>> Also, specjvm98 is meaningless to estimate the classloading costs.
>>>>>>>> Please try specjvm2008:startup.* tests?
>>>>>>> The specjvm run was for Gerard's question about standard benchmarks.
>>>>>> SPECjvm2008 is a standard benchmark. In fact, it is a benchmark that
>>>>>> deprecates SPECjvm98.
>>>>>>
>>>>>>> These are not benchmarks specifically for classloading.
>>>>>> There are benchmarks that try to estimate the startup costs.
>>>>>> SPECjvm2008:startup.* tests are one of them.
>>>>>>
>>>>>>> However, I agree it's a good idea to run standard benchmarks to
>>>>>>> confirm there is no overall performance degradation. From all the
>>>>>>> benchmarks including classloading measurements, we have confirmed
>>>>>>> that this specific change does not have negative impact on
>>>>>>> classloading itself and the overall performance.
>>>>>> Excellent. What are those benchmarks? May we see those? Because I have a
>>>>>> counter-example in this thread that this change *does* negatively impact
>>>>>> classloading.
>>>>>>
>>>>>> -Aleksey.
>>>>>>
>>>>
>>>
>>
>