RFR(L): 8029075 - String deduplication in G1

Thu Mar 13 15:02:59 UTC 2014

Hi Coleen,

Thanks for looking at the patch! Comment below.

On 2014-03-12 23:10, Coleen Phillimore wrote:
>
> Hi Per,
>
> Why not add the call for StringDedup::unlink() inside the function 
> unlink_string_and_symbol_table rather than just above in two places, 
> since they go together.
>
> +  // Delete dead entries in string deduplication queue/table
> +  StringDedup::unlink(&GenMarkSweep::is_alive);
> +
>    // Delete entries for dead interned string and clean up 
> unreferenced symbols in symbol table.
> G1CollectedHeap::heap()->unlink_string_and_symbol_table(&GenMarkSweep::is_alive);

Yep, I'll move that call into unlink_string_and_symbol_table(). The 
unlink_string_and_symbol_table() is a fairly new addition, there used to 
be StringTable/SymbolTable calls there when I added the StrDedup call, 
that's why it looks like that.

>
>
> I didn't see anything else specific to change.  Some general comments 
> which are similar to Thomas's.  There seems to be a lack of comments 
> in the deduplication file.   I hate huge block comments too but some 
> comments about what it does in general would be nice. Like 3 sentences 
> per class.  The file is overwhelming (ie, I completely lose interest)  
> with the statistics printing at the front.   Maybe you could put the 
> most interesting classes at the front of the file?   Maybe statistics 
> printing could be a separate file.
>
> One of the conventions that we have is that accessors for classes are 
> on one line, so you could make the file smaller by doing that 
> reformatting.
>
> It is also a general convention to have comments above the declaration 
> of a class to say generally what it's for, when it's used, etc.  It's 
> not easy to tell by the name or we could guess wrong.   What is 
> StringDedupQueue for, eg?  That would make reviewing the code easier.

Agree, I'll be adding comments to each class and a general description 
of what dedup does.

>
> There seems to be a lot of duplication of code with the hashtable in 
> hashtable.hpp, only different.  We have lots of hashtables in 
> hotspot.  Did you try to reinstantiate this hashtable and why didn't 
> that not work?  There are interesting memory allocation functions for 
> the entries in this normal hashtable that might help with 
> fragmentation, for instance.

Very early on I actually did use an instance of Hashtable and even 
played with the idea of having a combined StringTable/DedupTable. I 
decided not to use that though, mostly because I felt I would need to 
make too many modifications to non-dedup related code and because 
Hashtable<> carries concepts, like shared entries, which didn't quite 
fit with dedup. Using the existing StringTable would mean it would have 
to have logic to manage entries which point to either Strings or char[], 
which again seemed like an too intrusive change. The Hashtable 
allocation strategy (with allocating blocks of entries) seems nice from 
a performance perspective, but freeing those becomes a bit of a 
challenge, which I guess is why it never frees any entries. Since the 
dedup table is there to in the end save memory, I felt I needed better 
control over the table's memory usage. At one point I actually started 
writing a WeakOopsHashTable<>, which I hoped could serve as the base for 
both StringTable and DedupTable, but again I felt I was losing focus a 
bit as it would mean modifying large parts of non-GC and non-dedeup 
related stuff. The potential connection to CDS here also made me a bit 
scared of going in that direction, as I don't have a complete 
understanding of the implications of that.

>
> Lastly, why is the hash code in the string written to and used? The 
> StringTable uses the same hash algorithm until it has to change the 
> hash code, but the new hash code isn't stored in the string because it 
> would be incompatible.

I think StringTable and the dedup table is doing the same thing here. 
The hash code in the String object is only used/updated in the case 
we're using the standard/compatible hash function. As soon as the table 
is rehashed and switches to another hash function it no longer updates 
the hash code in the String. use_java_hash() tells use if we're using 
the standard/compatible hash and it's used here:

   if (use_java_hash()) {
     // Get hash code from cache
     hash = java_lang_String::hash(java_string);
   }

and later here:

   if (use_java_hash() && hash != 0) {
     // Store hash code in cache
     java_lang_String::set_hash(java_string, hash);
   }

Concerns?

>
> I would like to see this be a few different files.   It seems like 
> this file does three things conceptually, interacts with G1 somehow to 
> determine which strings are candidates for deduplication, stores these 
> things in a rehashable, resizable hash table and gathers statistics.
>
> I'm sorry to have so many comments so late and I haven't really tried 
> to read the code for correctness.   I think some reformatting and 
> reorganization would make the reviewers job easier.  It's a big change 
> and worthwhile.

Great, I've no problems splitting this up. I'm more of a "one class per 
file" person myself, but I had the (apparently incorrect) impression 
that wasn't the hotspot way. I'm thinking I'll split this up into the 
following files (while I'm at it I'll also add the G1 prefix as Thomas 
suggested):

g1StringDedup - Main interface for the GC, would contain entry point and 
things like the task and closure classes
g1StringDedupTable - The table, cache and entry classes
g1StringDedupQueue - The queue
g1StringDedupStat - The statistics sutff
g1StringDedupThread - The dedup thread

Sounds good?

Will send out and updated webrev.

Thanks,
/Per

>
> Thanks,
> Coleen
>
>
> On 3/12/14 9:34 AM, Per Liden wrote:
>> Hi Thomas,
>>
>> Thanks for reviewing. Comments inline.
>>
>> On 03/11/2014 01:39 PM, Thomas Schatzl wrote:
>>> Hi Per,
>>>
>>>
>>> On Fri, 2014-03-07 at 15:13 +0100, Per Liden wrote:
>>>> Hi,
>>>>
>>>> Here's an updated webrev based on the feedback from Bengt and Thomas.
>>>>
>>>> http://cr.openjdk.java.net/~pliden/8029075/webrev.1/
>>>>
>>>
>>> See further below.
>>>
>>>>
>>>> And as suggested, two changes in the initial webrev were broken out 
>>>> into
>>>> separate CRs, available here:
>>>>
>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8036672
>>>> Webrev: http://cr.openjdk.java.net/~pliden/8036672/webrev.0/
>>>
>>> Looks good.
>>>
>>> Did you check the impact on something like specjbb2005 which does
>>> nothing but object copying? That code is somewhat performance 
>>> sensitive.
>>
>> Yes, I did Aurora runs with jbb2005 and jbb2013, didn't see any 
>> difference.
>>
>>>
>>>
>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8036673
>>>> Webrev: http://cr.openjdk.java.net/~pliden/8036673/webrev.0/
>>>
>>> Looks good.
>>>
>>> Here are my comments: there are a few minor issues, and a single major:
>>> basically the documentation for this feature is lacking imo. I listed a
>>> few particular issues of mine with it.
>>>
>>> First the minor issues:
>>>
>>> - could you check on copyright dates on the modified files? They have
>>> not been updated anywhere afaics.
>>>
>>> - not sure if good, but maybe make PrintStringDeduplicationStatistics
>>> diagnostic? It seems to really be some diagnosis functions.
>>>
>>> - maybe make the new *ALot flags notproduct. These seem to be debugging
>>> helpers similar to other *ALot flags.
>>
>> I had a conversion with Bengt about this some time ago and I think 
>> the options are categorized correctly. The ALot flags needs be 
>> available in product builds because they are used by tests. Most 
>> Print*Statistics are product therefore it makes sense to follow that 
>> pattern for PrintStringDeduplicationStatistics.
>>
>>>
>>> - StringDeduplicationAgeThreshold: maybe also needs to mention that if
>>> the object is promoted, that it is always considered as candidate.
>>
>> Check.
>>
>>>
>>> marksweep.inline.hpp
>>>
>>> - MarkSweep::mark_object(): did you measure impact on other collectors
>>> using this code? It is enabled unconditionally. Also I would prefer
>>> that !UseG1GC automatically disables UseStringDeduplication (or use a
>>> global variable for that) to avoid the two checks. (The latter 
>>> option is
>>> probably better).
>>
>> Yes, ran jbb2005/2013.
>>
>>>
>>> symbolTable.cpp:
>>>
>>> - again both UseG1GC and UseStringDedupliation are paired. Could 
>>> this be
>>> put in a method/global variable somewhere that is set up correctly at
>>> startup?
>>
>> I think using the options is much more readable. Adding another state 
>> which tells us if G1 & Dedup is enabled means we need a pre-init 
>> stage to set up that state, which I don't think is very nice.
>>
>>>
>>> g1CollectedHeap.cpp:
>>>
>>> - StringDedup::initialize() is run unconditionally (the check for
>>> UseStringDeduplication is inside the method) while it's outside for
>>> everything else. Just a note, keep it if wanted.
>>
>> They are hidden inside dedup everywhere, except where the performance 
>> overhead of a call might be an issue (like mark and copy).
>>
>>>
>>> TestStringDeduplicationTools.java:
>>>
>>> - maybe put retrieval of the unsafe object into the test library.
>>> Browsing through the code shows quite a lot copy+paste here already.
>>>
>>> - "qeueue" -> "queue"
>>
>> Check.
>>
>>>
>>> all tests:
>>>
>>> - missing bug ids
>>
>> Check.
>>
>>>
>>>  From here it is mostly about comments/documentation:
>>
>> I agree that adding a feature overview and some additional comments 
>> is probably a good idea. However, I also know that Bengt and StefanK 
>> aren't too crazy about large comment blobs so I tried to kept the 
>> noise down. I also disagree with some of the other comments below 
>> which tend to be about personal taste rather than code quality or 
>> breaking some coding convention we have, e.g. the static helpers, 
>> readability, naming, etc.
>>
>> I'll add some comments and make some other changes and send out an 
>> updated webrev shortly.
>>
>> cheers,
>> /Per
>>
>>>
>>> stringdedup.?pp:
>>>
>>> - is it possible to put an overview about how the string deduplication
>>> works here? There is absolutely no place in the code where e.g. the
>>> string deduplication cycle is explained. There is no comprehensive
>>> information about this feature anywhere, just bits about corner 
>>> cases or
>>> particular here and there, but nothing that can be used as entry point
>>> to understand what this is actually doing.
>>>
>>> To some extent, the comments seem to be repetitions of what the code
>>> already shows. E.g. "some_constant = (1 << 10); // 1024 (must be power
>>> of 2)" or "// Store hash code in cache" followed by
>>> "java_lang_String::set_hash(java_string, hash);".
>>>
>>> There is also no reference to the JEP (I am not sure if it is customary
>>> to put references to outside documents - I would prefer some); however
>>> the JEP does not replace documentation in the source as it is not
>>> detailed enough, and the JEP cannot be changed any more. The JEP is
>>> interesting for the alternatives considered though.
>>>
>>> Ideally, to reasonably understand the feature I would like to not need
>>> to spend hours looking through the implementation.
>>>
>>> - how many threads does this feature add, and are supported by this 
>>> code
>>> (for which parts)? Synchronization between them?
>>>
>>> - the comments "For XY" in the StringDedup header file do not give any
>>> hint about what they are doing (not sure if it is interesting as it is
>>> easy to find out the only callers). It seems that this information has
>>> been put at the place where these methods are invoked (or just 
>>> somewhere
>>> in the cpp file). Imo the methods and members (except trivial) 
>>> should be
>>> documented in the hpp file in the class body, i.e. in the interface.
>>>
>>> - there are a few methods (also in the StringDedupTable class) in
>>> StringDedup where the results of the methods do not seem obvious and
>>> should be commented.
>>>
>>> - I really dislike the location of the "Candidate selection" 
>>> explanation
>>> within the cpp file. I would not have found it if I were not looking 
>>> for
>>> some documentation ;) It would probably fit very well in the above
>>> mentioned overview paragraphs.
>>>
>>> - please add the static helper functions that implement the actual
>>> policies (is_candidate_from_mark, is_candidate_from_evacuation) to the
>>> StringDedup class (as private static methods) so that they can be found
>>> easily. Static non-class helper functions in cpp files should only be
>>> used for stuff that is really unimportant and not relevant to the code
>>> imo. However these seem to be _the_ central policy methods for string
>>> dedup.
>>> So it would be nice to have them mentioned in the class/hpp file and
>>> documented there.
>>>
>>> - facts about the code are repeated, e.g. that the table size must be a
>>> power of two. This is nice, but has a few drawbacks: the compiler does
>>> not care about comments at all, so it might be useful to put asserts in
>>> these places instead, and second, it's just duplication of text with 
>>> the
>>> usual problem of getting out of date.
>>>
>>> - I am not completely sure why the class has not been called
>>> G1StringDedup. It won't work without G1 anyway and calls G1 internals
>>> all the time. There does not seem to be a way to make it more generic
>>> without considerable refactoring. It is fine to keep the name though,
>>> just asking. I also know that this feature does not want to be a 
>>> generic
>>> solution.
>>>
>>> - is it possible to separate out the statistics variables of
>>> StringDedupTable into either a separate class instead of using globals
>>> or separate them a little from the other unrelated members? (And
>>> describe them) This would improve readability.
>>>
>>> I *think* this is everything from StringDedupTable::_entries_added to
>>> _rehash_count? I have no idea without searching for every use and
>>> deducing from that.
>>>
>>> - when implementing the hash table for the StringDedupTable, did you
>>> consider trying to reuse one of the existing hashmap implementations?
>>> And potentially extend them.
>>>
>>> Imo we already have way too many ad-hoc implementations of often used
>>> data structures (particularly) in G1 and GC code.
>>>
>>> - please document the properties of the StringDedupTable (type, 
>>> hashing,
>>> automatic resizing behavior, thread safety, ...).
>>>
>>> - please explain what StringDedupTable/Cache are and what they are used
>>> for in context of the feature.
>>>
>>> - stringdedup.cpp, line 1309: candicate -> candidate
>>>
>>> Thanks,
>>> Thomas
>>>
>>>
>>
>