RFR(L): 8029075 - String deduplication in G1

Fri Mar 14 12:19:31 UTC 2014

Per,  3 quick comments below.  Thank you for the reorganization and 
additional comments.

On 3/13/14 11:02 AM, Per Liden wrote:
> Hi Coleen,
>
> Thanks for looking at the patch! Comment below.
>
> On 2014-03-12 23:10, Coleen Phillimore wrote:
>>
>> Hi Per,
>>
>> Why not add the call for StringDedup::unlink() inside the function 
>> unlink_string_and_symbol_table rather than just above in two places, 
>> since they go together.
>>
>> +  // Delete dead entries in string deduplication queue/table
>> +  StringDedup::unlink(&GenMarkSweep::is_alive);
>> +
>>    // Delete entries for dead interned string and clean up 
>> unreferenced symbols in symbol table.
>> G1CollectedHeap::heap()->unlink_string_and_symbol_table(&GenMarkSweep::is_alive); 
>>
>
> Yep, I'll move that call into unlink_string_and_symbol_table(). The 
> unlink_string_and_symbol_table() is a fairly new addition, there used 
> to be StringTable/SymbolTable calls there when I added the StrDedup 
> call, that's why it looks like that.
>
>>
>>
>> I didn't see anything else specific to change.  Some general comments 
>> which are similar to Thomas's.  There seems to be a lack of comments 
>> in the deduplication file.   I hate huge block comments too but some 
>> comments about what it does in general would be nice. Like 3 
>> sentences per class.  The file is overwhelming (ie, I completely lose 
>> interest)  with the statistics printing at the front.   Maybe you 
>> could put the most interesting classes at the front of the file?   
>> Maybe statistics printing could be a separate file.
>>
>> One of the conventions that we have is that accessors for classes are 
>> on one line, so you could make the file smaller by doing that 
>> reformatting.
>>
>> It is also a general convention to have comments above the 
>> declaration of a class to say generally what it's for, when it's 
>> used, etc.  It's not easy to tell by the name or we could guess 
>> wrong.   What is StringDedupQueue for, eg?  That would make reviewing 
>> the code easier.
>
> Agree, I'll be adding comments to each class and a general description 
> of what dedup does.
>
>>
>> There seems to be a lot of duplication of code with the hashtable in 
>> hashtable.hpp, only different.  We have lots of hashtables in 
>> hotspot.  Did you try to reinstantiate this hashtable and why didn't 
>> that not work?  There are interesting memory allocation functions for 
>> the entries in this normal hashtable that might help with 
>> fragmentation, for instance.
>
> Very early on I actually did use an instance of Hashtable and even 
> played with the idea of having a combined StringTable/DedupTable. I 
> decided not to use that though, mostly because I felt I would need to 
> make too many modifications to non-dedup related code and because 
> Hashtable<> carries concepts, like shared entries, which didn't quite 
> fit with dedup. Using the existing StringTable would mean it would 
> have to have logic to manage entries which point to either Strings or 
> char[], which again seemed like an too intrusive change. The Hashtable 
> allocation strategy (with allocating blocks of entries) seems nice 
> from a performance perspective, but freeing those becomes a bit of a 
> challenge, which I guess is why it never frees any entries. Since the 
> dedup table is there to in the end save memory, I felt I needed better 
> control over the table's memory usage. At one point I actually started 
> writing a WeakOopsHashTable<>, which I hoped could serve as the base 
> for both StringTable and DedupTable, but again I felt I was losing 
> focus a bit as it would mean modifying large parts of non-GC and 
> non-dedeup related stuff. The potential connection to CDS here also 
> made me a bit scared of going in that direction, as I don't have a 
> complete understanding of the implications of that.

Ok, I'd figured they didn't map that well.   There's a lot of 
duplication already in symbol and string tables also.
>
>>
>> Lastly, why is the hash code in the string written to and used? The 
>> StringTable uses the same hash algorithm until it has to change the 
>> hash code, but the new hash code isn't stored in the string because 
>> it would be incompatible.
>
> I think StringTable and the dedup table is doing the same thing here. 
> The hash code in the String object is only used/updated in the case 
> we're using the standard/compatible hash function. As soon as the 
> table is rehashed and switches to another hash function it no longer 
> updates the hash code in the String. use_java_hash() tells use if 
> we're using the standard/compatible hash and it's used here:
>
>   if (use_java_hash()) {
>     // Get hash code from cache
>     hash = java_lang_String::hash(java_string);
>   }
>
> and later here:
>
>   if (use_java_hash() && hash != 0) {
>     // Store hash code in cache
>     java_lang_String::set_hash(java_string, hash);
>   }
>
> Concerns?

I have to look again but wanted to answer quickly (forgot to last 
night).  This is good, but why do we have to update the hash code in the 
java/lang/String instance at all?

>
>>
>> I would like to see this be a few different files.   It seems like 
>> this file does three things conceptually, interacts with G1 somehow 
>> to determine which strings are candidates for deduplication, stores 
>> these things in a rehashable, resizable hash table and gathers 
>> statistics.
>>
>> I'm sorry to have so many comments so late and I haven't really tried 
>> to read the code for correctness.   I think some reformatting and 
>> reorganization would make the reviewers job easier.  It's a big 
>> change and worthwhile.
>
> Great, I've no problems splitting this up. I'm more of a "one class 
> per file" person myself, but I had the (apparently incorrect) 
> impression that wasn't the hotspot way. I'm thinking I'll split this 
> up into the following files (while I'm at it I'll also add the G1 
> prefix as Thomas suggested):
>
> g1StringDedup - Main interface for the GC, would contain entry point 
> and things like the task and closure classes
> g1StringDedupTable - The table, cache and entry classes
> g1StringDedupQueue - The queue
> g1StringDedupStat - The statistics sutff
> g1StringDedupThread - The dedup thread

Yes, this would be nice.   symbolTable.cpp should be split up too if 
that is what gave you the impression.   I will file an RFE to do that, 
it's been at the back of my mind to do so.

Thanks,
Coleen
>
> Sounds good?
>
> Will send out and updated webrev.
>
> Thanks,
> /Per
>
>>
>> Thanks,
>> Coleen
>>
>>
>> On 3/12/14 9:34 AM, Per Liden wrote:
>>> Hi Thomas,
>>>
>>> Thanks for reviewing. Comments inline.
>>>
>>> On 03/11/2014 01:39 PM, Thomas Schatzl wrote:
>>>> Hi Per,
>>>>
>>>>
>>>> On Fri, 2014-03-07 at 15:13 +0100, Per Liden wrote:
>>>>> Hi,
>>>>>
>>>>> Here's an updated webrev based on the feedback from Bengt and Thomas.
>>>>>
>>>>> http://cr.openjdk.java.net/~pliden/8029075/webrev.1/
>>>>>
>>>>
>>>> See further below.
>>>>
>>>>>
>>>>> And as suggested, two changes in the initial webrev were broken 
>>>>> out into
>>>>> separate CRs, available here:
>>>>>
>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8036672
>>>>> Webrev: http://cr.openjdk.java.net/~pliden/8036672/webrev.0/
>>>>
>>>> Looks good.
>>>>
>>>> Did you check the impact on something like specjbb2005 which does
>>>> nothing but object copying? That code is somewhat performance 
>>>> sensitive.
>>>
>>> Yes, I did Aurora runs with jbb2005 and jbb2013, didn't see any 
>>> difference.
>>>
>>>>
>>>>
>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8036673
>>>>> Webrev: http://cr.openjdk.java.net/~pliden/8036673/webrev.0/
>>>>
>>>> Looks good.
>>>>
>>>> Here are my comments: there are a few minor issues, and a single 
>>>> major:
>>>> basically the documentation for this feature is lacking imo. I 
>>>> listed a
>>>> few particular issues of mine with it.
>>>>
>>>> First the minor issues:
>>>>
>>>> - could you check on copyright dates on the modified files? They have
>>>> not been updated anywhere afaics.
>>>>
>>>> - not sure if good, but maybe make PrintStringDeduplicationStatistics
>>>> diagnostic? It seems to really be some diagnosis functions.
>>>>
>>>> - maybe make the new *ALot flags notproduct. These seem to be 
>>>> debugging
>>>> helpers similar to other *ALot flags.
>>>
>>> I had a conversion with Bengt about this some time ago and I think 
>>> the options are categorized correctly. The ALot flags needs be 
>>> available in product builds because they are used by tests. Most 
>>> Print*Statistics are product therefore it makes sense to follow that 
>>> pattern for PrintStringDeduplicationStatistics.
>>>
>>>>
>>>> - StringDeduplicationAgeThreshold: maybe also needs to mention that if
>>>> the object is promoted, that it is always considered as candidate.
>>>
>>> Check.
>>>
>>>>
>>>> marksweep.inline.hpp
>>>>
>>>> - MarkSweep::mark_object(): did you measure impact on other collectors
>>>> using this code? It is enabled unconditionally. Also I would prefer
>>>> that !UseG1GC automatically disables UseStringDeduplication (or use a
>>>> global variable for that) to avoid the two checks. (The latter 
>>>> option is
>>>> probably better).
>>>
>>> Yes, ran jbb2005/2013.
>>>
>>>>
>>>> symbolTable.cpp:
>>>>
>>>> - again both UseG1GC and UseStringDedupliation are paired. Could 
>>>> this be
>>>> put in a method/global variable somewhere that is set up correctly at
>>>> startup?
>>>
>>> I think using the options is much more readable. Adding another 
>>> state which tells us if G1 & Dedup is enabled means we need a 
>>> pre-init stage to set up that state, which I don't think is very nice.
>>>
>>>>
>>>> g1CollectedHeap.cpp:
>>>>
>>>> - StringDedup::initialize() is run unconditionally (the check for
>>>> UseStringDeduplication is inside the method) while it's outside for
>>>> everything else. Just a note, keep it if wanted.
>>>
>>> They are hidden inside dedup everywhere, except where the 
>>> performance overhead of a call might be an issue (like mark and copy).
>>>
>>>>
>>>> TestStringDeduplicationTools.java:
>>>>
>>>> - maybe put retrieval of the unsafe object into the test library.
>>>> Browsing through the code shows quite a lot copy+paste here already.
>>>>
>>>> - "qeueue" -> "queue"
>>>
>>> Check.
>>>
>>>>
>>>> all tests:
>>>>
>>>> - missing bug ids
>>>
>>> Check.
>>>
>>>>
>>>>  From here it is mostly about comments/documentation:
>>>
>>> I agree that adding a feature overview and some additional comments 
>>> is probably a good idea. However, I also know that Bengt and StefanK 
>>> aren't too crazy about large comment blobs so I tried to kept the 
>>> noise down. I also disagree with some of the other comments below 
>>> which tend to be about personal taste rather than code quality or 
>>> breaking some coding convention we have, e.g. the static helpers, 
>>> readability, naming, etc.
>>>
>>> I'll add some comments and make some other changes and send out an 
>>> updated webrev shortly.
>>>
>>> cheers,
>>> /Per
>>>
>>>>
>>>> stringdedup.?pp:
>>>>
>>>> - is it possible to put an overview about how the string deduplication
>>>> works here? There is absolutely no place in the code where e.g. the
>>>> string deduplication cycle is explained. There is no comprehensive
>>>> information about this feature anywhere, just bits about corner 
>>>> cases or
>>>> particular here and there, but nothing that can be used as entry point
>>>> to understand what this is actually doing.
>>>>
>>>> To some extent, the comments seem to be repetitions of what the code
>>>> already shows. E.g. "some_constant = (1 << 10); // 1024 (must be power
>>>> of 2)" or "// Store hash code in cache" followed by
>>>> "java_lang_String::set_hash(java_string, hash);".
>>>>
>>>> There is also no reference to the JEP (I am not sure if it is 
>>>> customary
>>>> to put references to outside documents - I would prefer some); however
>>>> the JEP does not replace documentation in the source as it is not
>>>> detailed enough, and the JEP cannot be changed any more. The JEP is
>>>> interesting for the alternatives considered though.
>>>>
>>>> Ideally, to reasonably understand the feature I would like to not need
>>>> to spend hours looking through the implementation.
>>>>
>>>> - how many threads does this feature add, and are supported by this 
>>>> code
>>>> (for which parts)? Synchronization between them?
>>>>
>>>> - the comments "For XY" in the StringDedup header file do not give any
>>>> hint about what they are doing (not sure if it is interesting as it is
>>>> easy to find out the only callers). It seems that this information has
>>>> been put at the place where these methods are invoked (or just 
>>>> somewhere
>>>> in the cpp file). Imo the methods and members (except trivial) 
>>>> should be
>>>> documented in the hpp file in the class body, i.e. in the interface.
>>>>
>>>> - there are a few methods (also in the StringDedupTable class) in
>>>> StringDedup where the results of the methods do not seem obvious and
>>>> should be commented.
>>>>
>>>> - I really dislike the location of the "Candidate selection" 
>>>> explanation
>>>> within the cpp file. I would not have found it if I were not 
>>>> looking for
>>>> some documentation ;) It would probably fit very well in the above
>>>> mentioned overview paragraphs.
>>>>
>>>> - please add the static helper functions that implement the actual
>>>> policies (is_candidate_from_mark, is_candidate_from_evacuation) to the
>>>> StringDedup class (as private static methods) so that they can be 
>>>> found
>>>> easily. Static non-class helper functions in cpp files should only be
>>>> used for stuff that is really unimportant and not relevant to the code
>>>> imo. However these seem to be _the_ central policy methods for string
>>>> dedup.
>>>> So it would be nice to have them mentioned in the class/hpp file and
>>>> documented there.
>>>>
>>>> - facts about the code are repeated, e.g. that the table size must 
>>>> be a
>>>> power of two. This is nice, but has a few drawbacks: the compiler does
>>>> not care about comments at all, so it might be useful to put 
>>>> asserts in
>>>> these places instead, and second, it's just duplication of text 
>>>> with the
>>>> usual problem of getting out of date.
>>>>
>>>> - I am not completely sure why the class has not been called
>>>> G1StringDedup. It won't work without G1 anyway and calls G1 internals
>>>> all the time. There does not seem to be a way to make it more generic
>>>> without considerable refactoring. It is fine to keep the name though,
>>>> just asking. I also know that this feature does not want to be a 
>>>> generic
>>>> solution.
>>>>
>>>> - is it possible to separate out the statistics variables of
>>>> StringDedupTable into either a separate class instead of using globals
>>>> or separate them a little from the other unrelated members? (And
>>>> describe them) This would improve readability.
>>>>
>>>> I *think* this is everything from StringDedupTable::_entries_added to
>>>> _rehash_count? I have no idea without searching for every use and
>>>> deducing from that.
>>>>
>>>> - when implementing the hash table for the StringDedupTable, did you
>>>> consider trying to reuse one of the existing hashmap implementations?
>>>> And potentially extend them.
>>>>
>>>> Imo we already have way too many ad-hoc implementations of often used
>>>> data structures (particularly) in G1 and GC code.
>>>>
>>>> - please document the properties of the StringDedupTable (type, 
>>>> hashing,
>>>> automatic resizing behavior, thread safety, ...).
>>>>
>>>> - please explain what StringDedupTable/Cache are and what they are 
>>>> used
>>>> for in context of the feature.
>>>>
>>>> - stringdedup.cpp, line 1309: candicate -> candidate
>>>>
>>>> Thanks,
>>>> Thomas
>>>>
>>>>
>>>
>>
>