RFR(L): 8029075 - String deduplication in G1

Fri Mar 14 19:07:31 UTC 2014

Hi,

Here's an updated webrev, based on feedback from primarily Thomas and 
Coleen (Thanks!).

http://cr.openjdk.java.net/~pliden/8029075/webrev.2/

I hope to have addressed all issues that were raised. The main theme of 
course is the addition of documentation/comments and the splitting of 
some classes into separate files. I think I've addressed most, if not 
all, of the other minor issues Thomas mentioned. The only thing I can 
think of that wasn't changed was the categorization of the VM flags, 
which I motivated in a different mail.

I didn't think it made sense to upload a diff against the previous 
webrev as that diff turned out to be a lot bigger than real webrev 
itself. Sorry if this makes reviewing harder.

cheers,
/Per

On 2014-03-14 14:19, Per Liden wrote:
> Hi Coleen,
>
> Inline below...
>
> On 2014-03-14 13:19, Coleen Phillimore wrote:
>>
>> Per,  3 quick comments below.  Thank you for the reorganization and 
>> additional comments.
>>
>> On 3/13/14 11:02 AM, Per Liden wrote:
>>> Hi Coleen,
>>>
>>> Thanks for looking at the patch! Comment below.
>>>
>>> On 2014-03-12 23:10, Coleen Phillimore wrote:
>>>>
>>>> Hi Per,
>>>>
>>>> Why not add the call for StringDedup::unlink() inside the function 
>>>> unlink_string_and_symbol_table rather than just above in two 
>>>> places, since they go together.
>>>>
>>>> +  // Delete dead entries in string deduplication queue/table
>>>> +  StringDedup::unlink(&GenMarkSweep::is_alive);
>>>> +
>>>>    // Delete entries for dead interned string and clean up 
>>>> unreferenced symbols in symbol table.
>>>> G1CollectedHeap::heap()->unlink_string_and_symbol_table(&GenMarkSweep::is_alive); 
>>>>
>>>
>>> Yep, I'll move that call into unlink_string_and_symbol_table(). The 
>>> unlink_string_and_symbol_table() is a fairly new addition, there 
>>> used to be StringTable/SymbolTable calls there when I added the 
>>> StrDedup call, that's why it looks like that.
>>>
>>>>
>>>>
>>>> I didn't see anything else specific to change.  Some general 
>>>> comments which are similar to Thomas's.  There seems to be a lack 
>>>> of comments in the deduplication file.   I hate huge block comments 
>>>> too but some comments about what it does in general would be nice. 
>>>> Like 3 sentences per class.  The file is overwhelming (ie, I 
>>>> completely lose interest)  with the statistics printing at the 
>>>> front.   Maybe you could put the most interesting classes at the 
>>>> front of the file?   Maybe statistics printing could be a separate 
>>>> file.
>>>>
>>>> One of the conventions that we have is that accessors for classes 
>>>> are on one line, so you could make the file smaller by doing that 
>>>> reformatting.
>>>>
>>>> It is also a general convention to have comments above the 
>>>> declaration of a class to say generally what it's for, when it's 
>>>> used, etc.  It's not easy to tell by the name or we could guess 
>>>> wrong.   What is StringDedupQueue for, eg?  That would make 
>>>> reviewing the code easier.
>>>
>>> Agree, I'll be adding comments to each class and a general 
>>> description of what dedup does.
>>>
>>>>
>>>> There seems to be a lot of duplication of code with the hashtable 
>>>> in hashtable.hpp, only different.  We have lots of hashtables in 
>>>> hotspot.  Did you try to reinstantiate this hashtable and why 
>>>> didn't that not work?  There are interesting memory allocation 
>>>> functions for the entries in this normal hashtable that might help 
>>>> with fragmentation, for instance.
>>>
>>> Very early on I actually did use an instance of Hashtable and even 
>>> played with the idea of having a combined StringTable/DedupTable. I 
>>> decided not to use that though, mostly because I felt I would need 
>>> to make too many modifications to non-dedup related code and because 
>>> Hashtable<> carries concepts, like shared entries, which didn't 
>>> quite fit with dedup. Using the existing StringTable would mean it 
>>> would have to have logic to manage entries which point to either 
>>> Strings or char[], which again seemed like an too intrusive change. 
>>> The Hashtable allocation strategy (with allocating blocks of 
>>> entries) seems nice from a performance perspective, but freeing 
>>> those becomes a bit of a challenge, which I guess is why it never 
>>> frees any entries. Since the dedup table is there to in the end save 
>>> memory, I felt I needed better control over the table's memory 
>>> usage. At one point I actually started writing a 
>>> WeakOopsHashTable<>, which I hoped could serve as the base for both 
>>> StringTable and DedupTable, but again I felt I was losing focus a 
>>> bit as it would mean modifying large parts of non-GC and non-dedeup 
>>> related stuff. The potential connection to CDS here also made me a 
>>> bit scared of going in that direction, as I don't have a complete 
>>> understanding of the implications of that.
>>
>> Ok, I'd figured they didn't map that well.   There's a lot of 
>> duplication already in symbol and string tables also.
>>>
>>>>
>>>> Lastly, why is the hash code in the string written to and used? The 
>>>> StringTable uses the same hash algorithm until it has to change the 
>>>> hash code, but the new hash code isn't stored in the string because 
>>>> it would be incompatible.
>>>
>>> I think StringTable and the dedup table is doing the same thing 
>>> here. The hash code in the String object is only used/updated in the 
>>> case we're using the standard/compatible hash function. As soon as 
>>> the table is rehashed and switches to another hash function it no 
>>> longer updates the hash code in the String. use_java_hash() tells 
>>> use if we're using the standard/compatible hash and it's used here:
>>>
>>>   if (use_java_hash()) {
>>>     // Get hash code from cache
>>>     hash = java_lang_String::hash(java_string);
>>>   }
>>>
>>> and later here:
>>>
>>>   if (use_java_hash() && hash != 0) {
>>>     // Store hash code in cache
>>>     java_lang_String::set_hash(java_string, hash);
>>>   }
>>>
>>> Concerns?
>>
>> I have to look again but wanted to answer quickly (forgot to last 
>> night).  This is good, but why do we have to update the hash code in 
>> the java/lang/String instance at all?
>
> We don't strictly have to, but it mimics the behavior of 
> String.hashCode() and it is seems like the natural thing to do to 
> given that the hash field in String is a cache for this purpose so 
> that future users of String.hashCode() don't have to do the same 
> calculation. Do you have a reason in mind why we wouldn't want to 
> write it back?
>
> cheers,
> /Per
>
>>
>>>
>>>>
>>>> I would like to see this be a few different files.   It seems like 
>>>> this file does three things conceptually, interacts with G1 somehow 
>>>> to determine which strings are candidates for deduplication, stores 
>>>> these things in a rehashable, resizable hash table and gathers 
>>>> statistics.
>>>>
>>>> I'm sorry to have so many comments so late and I haven't really 
>>>> tried to read the code for correctness.   I think some reformatting 
>>>> and reorganization would make the reviewers job easier.  It's a big 
>>>> change and worthwhile.
>>>
>>> Great, I've no problems splitting this up. I'm more of a "one class 
>>> per file" person myself, but I had the (apparently incorrect) 
>>> impression that wasn't the hotspot way. I'm thinking I'll split this 
>>> up into the following files (while I'm at it I'll also add the G1 
>>> prefix as Thomas suggested):
>>>
>>> g1StringDedup - Main interface for the GC, would contain entry point 
>>> and things like the task and closure classes
>>> g1StringDedupTable - The table, cache and entry classes
>>> g1StringDedupQueue - The queue
>>> g1StringDedupStat - The statistics sutff
>>> g1StringDedupThread - The dedup thread
>>
>> Yes, this would be nice.   symbolTable.cpp should be split up too if 
>> that is what gave you the impression.   I will file an RFE to do 
>> that, it's been at the back of my mind to do so.
>>
>> Thanks,
>> Coleen
>>>
>>> Sounds good?
>>>
>>> Will send out and updated webrev.
>>>
>>> Thanks,
>>> /Per
>>>
>>>>
>>>> Thanks,
>>>> Coleen
>>>>
>>>>
>>>> On 3/12/14 9:34 AM, Per Liden wrote:
>>>>> Hi Thomas,
>>>>>
>>>>> Thanks for reviewing. Comments inline.
>>>>>
>>>>> On 03/11/2014 01:39 PM, Thomas Schatzl wrote:
>>>>>> Hi Per,
>>>>>>
>>>>>>
>>>>>> On Fri, 2014-03-07 at 15:13 +0100, Per Liden wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Here's an updated webrev based on the feedback from Bengt and 
>>>>>>> Thomas.
>>>>>>>
>>>>>>> http://cr.openjdk.java.net/~pliden/8029075/webrev.1/
>>>>>>>
>>>>>>
>>>>>> See further below.
>>>>>>
>>>>>>>
>>>>>>> And as suggested, two changes in the initial webrev were broken 
>>>>>>> out into
>>>>>>> separate CRs, available here:
>>>>>>>
>>>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8036672
>>>>>>> Webrev: http://cr.openjdk.java.net/~pliden/8036672/webrev.0/
>>>>>>
>>>>>> Looks good.
>>>>>>
>>>>>> Did you check the impact on something like specjbb2005 which does
>>>>>> nothing but object copying? That code is somewhat performance 
>>>>>> sensitive.
>>>>>
>>>>> Yes, I did Aurora runs with jbb2005 and jbb2013, didn't see any 
>>>>> difference.
>>>>>
>>>>>>
>>>>>>
>>>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8036673
>>>>>>> Webrev: http://cr.openjdk.java.net/~pliden/8036673/webrev.0/
>>>>>>
>>>>>> Looks good.
>>>>>>
>>>>>> Here are my comments: there are a few minor issues, and a single 
>>>>>> major:
>>>>>> basically the documentation for this feature is lacking imo. I 
>>>>>> listed a
>>>>>> few particular issues of mine with it.
>>>>>>
>>>>>> First the minor issues:
>>>>>>
>>>>>> - could you check on copyright dates on the modified files? They 
>>>>>> have
>>>>>> not been updated anywhere afaics.
>>>>>>
>>>>>> - not sure if good, but maybe make 
>>>>>> PrintStringDeduplicationStatistics
>>>>>> diagnostic? It seems to really be some diagnosis functions.
>>>>>>
>>>>>> - maybe make the new *ALot flags notproduct. These seem to be 
>>>>>> debugging
>>>>>> helpers similar to other *ALot flags.
>>>>>
>>>>> I had a conversion with Bengt about this some time ago and I think 
>>>>> the options are categorized correctly. The ALot flags needs be 
>>>>> available in product builds because they are used by tests. Most 
>>>>> Print*Statistics are product therefore it makes sense to follow 
>>>>> that pattern for PrintStringDeduplicationStatistics.
>>>>>
>>>>>>
>>>>>> - StringDeduplicationAgeThreshold: maybe also needs to mention 
>>>>>> that if
>>>>>> the object is promoted, that it is always considered as candidate.
>>>>>
>>>>> Check.
>>>>>
>>>>>>
>>>>>> marksweep.inline.hpp
>>>>>>
>>>>>> - MarkSweep::mark_object(): did you measure impact on other 
>>>>>> collectors
>>>>>> using this code? It is enabled unconditionally. Also I would prefer
>>>>>> that !UseG1GC automatically disables UseStringDeduplication (or 
>>>>>> use a
>>>>>> global variable for that) to avoid the two checks. (The latter 
>>>>>> option is
>>>>>> probably better).
>>>>>
>>>>> Yes, ran jbb2005/2013.
>>>>>
>>>>>>
>>>>>> symbolTable.cpp:
>>>>>>
>>>>>> - again both UseG1GC and UseStringDedupliation are paired. Could 
>>>>>> this be
>>>>>> put in a method/global variable somewhere that is set up 
>>>>>> correctly at
>>>>>> startup?
>>>>>
>>>>> I think using the options is much more readable. Adding another 
>>>>> state which tells us if G1 & Dedup is enabled means we need a 
>>>>> pre-init stage to set up that state, which I don't think is very 
>>>>> nice.
>>>>>
>>>>>>
>>>>>> g1CollectedHeap.cpp:
>>>>>>
>>>>>> - StringDedup::initialize() is run unconditionally (the check for
>>>>>> UseStringDeduplication is inside the method) while it's outside for
>>>>>> everything else. Just a note, keep it if wanted.
>>>>>
>>>>> They are hidden inside dedup everywhere, except where the 
>>>>> performance overhead of a call might be an issue (like mark and 
>>>>> copy).
>>>>>
>>>>>>
>>>>>> TestStringDeduplicationTools.java:
>>>>>>
>>>>>> - maybe put retrieval of the unsafe object into the test library.
>>>>>> Browsing through the code shows quite a lot copy+paste here already.
>>>>>>
>>>>>> - "qeueue" -> "queue"
>>>>>
>>>>> Check.
>>>>>
>>>>>>
>>>>>> all tests:
>>>>>>
>>>>>> - missing bug ids
>>>>>
>>>>> Check.
>>>>>
>>>>>>
>>>>>>  From here it is mostly about comments/documentation:
>>>>>
>>>>> I agree that adding a feature overview and some additional 
>>>>> comments is probably a good idea. However, I also know that Bengt 
>>>>> and StefanK aren't too crazy about large comment blobs so I tried 
>>>>> to kept the noise down. I also disagree with some of the other 
>>>>> comments below which tend to be about personal taste rather than 
>>>>> code quality or breaking some coding convention we have, e.g. the 
>>>>> static helpers, readability, naming, etc.
>>>>>
>>>>> I'll add some comments and make some other changes and send out an 
>>>>> updated webrev shortly.
>>>>>
>>>>> cheers,
>>>>> /Per
>>>>>
>>>>>>
>>>>>> stringdedup.?pp:
>>>>>>
>>>>>> - is it possible to put an overview about how the string 
>>>>>> deduplication
>>>>>> works here? There is absolutely no place in the code where e.g. the
>>>>>> string deduplication cycle is explained. There is no comprehensive
>>>>>> information about this feature anywhere, just bits about corner 
>>>>>> cases or
>>>>>> particular here and there, but nothing that can be used as entry 
>>>>>> point
>>>>>> to understand what this is actually doing.
>>>>>>
>>>>>> To some extent, the comments seem to be repetitions of what the code
>>>>>> already shows. E.g. "some_constant = (1 << 10); // 1024 (must be 
>>>>>> power
>>>>>> of 2)" or "// Store hash code in cache" followed by
>>>>>> "java_lang_String::set_hash(java_string, hash);".
>>>>>>
>>>>>> There is also no reference to the JEP (I am not sure if it is 
>>>>>> customary
>>>>>> to put references to outside documents - I would prefer some); 
>>>>>> however
>>>>>> the JEP does not replace documentation in the source as it is not
>>>>>> detailed enough, and the JEP cannot be changed any more. The JEP is
>>>>>> interesting for the alternatives considered though.
>>>>>>
>>>>>> Ideally, to reasonably understand the feature I would like to not 
>>>>>> need
>>>>>> to spend hours looking through the implementation.
>>>>>>
>>>>>> - how many threads does this feature add, and are supported by 
>>>>>> this code
>>>>>> (for which parts)? Synchronization between them?
>>>>>>
>>>>>> - the comments "For XY" in the StringDedup header file do not 
>>>>>> give any
>>>>>> hint about what they are doing (not sure if it is interesting as 
>>>>>> it is
>>>>>> easy to find out the only callers). It seems that this 
>>>>>> information has
>>>>>> been put at the place where these methods are invoked (or just 
>>>>>> somewhere
>>>>>> in the cpp file). Imo the methods and members (except trivial) 
>>>>>> should be
>>>>>> documented in the hpp file in the class body, i.e. in the interface.
>>>>>>
>>>>>> - there are a few methods (also in the StringDedupTable class) in
>>>>>> StringDedup where the results of the methods do not seem obvious and
>>>>>> should be commented.
>>>>>>
>>>>>> - I really dislike the location of the "Candidate selection" 
>>>>>> explanation
>>>>>> within the cpp file. I would not have found it if I were not 
>>>>>> looking for
>>>>>> some documentation ;) It would probably fit very well in the above
>>>>>> mentioned overview paragraphs.
>>>>>>
>>>>>> - please add the static helper functions that implement the actual
>>>>>> policies (is_candidate_from_mark, is_candidate_from_evacuation) 
>>>>>> to the
>>>>>> StringDedup class (as private static methods) so that they can be 
>>>>>> found
>>>>>> easily. Static non-class helper functions in cpp files should 
>>>>>> only be
>>>>>> used for stuff that is really unimportant and not relevant to the 
>>>>>> code
>>>>>> imo. However these seem to be _the_ central policy methods for 
>>>>>> string
>>>>>> dedup.
>>>>>> So it would be nice to have them mentioned in the class/hpp file and
>>>>>> documented there.
>>>>>>
>>>>>> - facts about the code are repeated, e.g. that the table size 
>>>>>> must be a
>>>>>> power of two. This is nice, but has a few drawbacks: the compiler 
>>>>>> does
>>>>>> not care about comments at all, so it might be useful to put 
>>>>>> asserts in
>>>>>> these places instead, and second, it's just duplication of text 
>>>>>> with the
>>>>>> usual problem of getting out of date.
>>>>>>
>>>>>> - I am not completely sure why the class has not been called
>>>>>> G1StringDedup. It won't work without G1 anyway and calls G1 
>>>>>> internals
>>>>>> all the time. There does not seem to be a way to make it more 
>>>>>> generic
>>>>>> without considerable refactoring. It is fine to keep the name 
>>>>>> though,
>>>>>> just asking. I also know that this feature does not want to be a 
>>>>>> generic
>>>>>> solution.
>>>>>>
>>>>>> - is it possible to separate out the statistics variables of
>>>>>> StringDedupTable into either a separate class instead of using 
>>>>>> globals
>>>>>> or separate them a little from the other unrelated members? (And
>>>>>> describe them) This would improve readability.
>>>>>>
>>>>>> I *think* this is everything from 
>>>>>> StringDedupTable::_entries_added to
>>>>>> _rehash_count? I have no idea without searching for every use and
>>>>>> deducing from that.
>>>>>>
>>>>>> - when implementing the hash table for the StringDedupTable, did you
>>>>>> consider trying to reuse one of the existing hashmap 
>>>>>> implementations?
>>>>>> And potentially extend them.
>>>>>>
>>>>>> Imo we already have way too many ad-hoc implementations of often 
>>>>>> used
>>>>>> data structures (particularly) in G1 and GC code.
>>>>>>
>>>>>> - please document the properties of the StringDedupTable (type, 
>>>>>> hashing,
>>>>>> automatic resizing behavior, thread safety, ...).
>>>>>>
>>>>>> - please explain what StringDedupTable/Cache are and what they 
>>>>>> are used
>>>>>> for in context of the feature.
>>>>>>
>>>>>> - stringdedup.cpp, line 1309: candicate -> candidate
>>>>>>
>>>>>> Thanks,
>>>>>> Thomas
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>