RFR(L): 8029075 - String deduplication in G1

Wed Mar 12 22:10:19 UTC 2014

Hi Per,

Why not add the call for StringDedup::unlink() inside the function 
unlink_string_and_symbol_table rather than just above in two places, 
since they go together.

+  // Delete dead entries in string deduplication queue/table
+  StringDedup::unlink(&GenMarkSweep::is_alive);
+
    // Delete entries for dead interned string and clean up unreferenced symbols in symbol table.
    G1CollectedHeap::heap()->unlink_string_and_symbol_table(&GenMarkSweep::is_alive);

I didn't see anything else specific to change.  Some general comments 
which are similar to Thomas's.  There seems to be a lack of comments in 
the deduplication file.   I hate huge block comments too but some 
comments about what it does in general would be nice. Like 3 sentences 
per class.  The file is overwhelming (ie, I completely lose interest)  
with the statistics printing at the front.   Maybe you could put the 
most interesting classes at the front of the file?   Maybe statistics 
printing could be a separate file.

One of the conventions that we have is that accessors for classes are on 
one line, so you could make the file smaller by doing that reformatting.

It is also a general convention to have comments above the declaration 
of a class to say generally what it's for, when it's used, etc.  It's 
not easy to tell by the name or we could guess wrong.   What is 
StringDedupQueue for, eg?  That would make reviewing the code easier.

There seems to be a lot of duplication of code with the hashtable in 
hashtable.hpp, only different.  We have lots of hashtables in hotspot.  
Did you try to reinstantiate this hashtable and why didn't that not 
work?  There are interesting memory allocation functions for the entries 
in this normal hashtable that might help with fragmentation, for instance.

Lastly, why is the hash code in the string written to and used?  The 
StringTable uses the same hash algorithm until it has to change the hash 
code, but the new hash code isn't stored in the string because it would 
be incompatible.

I would like to see this be a few different files.   It seems like this 
file does three things conceptually, interacts with G1 somehow to 
determine which strings are candidates for deduplication, stores these 
things in a rehashable, resizable hash table and gathers statistics.

I'm sorry to have so many comments so late and I haven't really tried to 
read the code for correctness.   I think some reformatting and 
reorganization would make the reviewers job easier.  It's a big change 
and worthwhile.

Thanks,
Coleen

On 3/12/14 9:34 AM, Per Liden wrote:
> Hi Thomas,
>
> Thanks for reviewing. Comments inline.
>
> On 03/11/2014 01:39 PM, Thomas Schatzl wrote:
>> Hi Per,
>>
>>
>> On Fri, 2014-03-07 at 15:13 +0100, Per Liden wrote:
>>> Hi,
>>>
>>> Here's an updated webrev based on the feedback from Bengt and Thomas.
>>>
>>> http://cr.openjdk.java.net/~pliden/8029075/webrev.1/
>>>
>>
>> See further below.
>>
>>>
>>> And as suggested, two changes in the initial webrev were broken out 
>>> into
>>> separate CRs, available here:
>>>
>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8036672
>>> Webrev: http://cr.openjdk.java.net/~pliden/8036672/webrev.0/
>>
>> Looks good.
>>
>> Did you check the impact on something like specjbb2005 which does
>> nothing but object copying? That code is somewhat performance sensitive.
>
> Yes, I did Aurora runs with jbb2005 and jbb2013, didn't see any 
> difference.
>
>>
>>
>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8036673
>>> Webrev: http://cr.openjdk.java.net/~pliden/8036673/webrev.0/
>>
>> Looks good.
>>
>> Here are my comments: there are a few minor issues, and a single major:
>> basically the documentation for this feature is lacking imo. I listed a
>> few particular issues of mine with it.
>>
>> First the minor issues:
>>
>> - could you check on copyright dates on the modified files? They have
>> not been updated anywhere afaics.
>>
>> - not sure if good, but maybe make PrintStringDeduplicationStatistics
>> diagnostic? It seems to really be some diagnosis functions.
>>
>> - maybe make the new *ALot flags notproduct. These seem to be debugging
>> helpers similar to other *ALot flags.
>
> I had a conversion with Bengt about this some time ago and I think the 
> options are categorized correctly. The ALot flags needs be available 
> in product builds because they are used by tests. Most 
> Print*Statistics are product therefore it makes sense to follow that 
> pattern for PrintStringDeduplicationStatistics.
>
>>
>> - StringDeduplicationAgeThreshold: maybe also needs to mention that if
>> the object is promoted, that it is always considered as candidate.
>
> Check.
>
>>
>> marksweep.inline.hpp
>>
>> - MarkSweep::mark_object(): did you measure impact on other collectors
>> using this code? It is enabled unconditionally. Also I would prefer
>> that !UseG1GC automatically disables UseStringDeduplication (or use a
>> global variable for that) to avoid the two checks. (The latter option is
>> probably better).
>
> Yes, ran jbb2005/2013.
>
>>
>> symbolTable.cpp:
>>
>> - again both UseG1GC and UseStringDedupliation are paired. Could this be
>> put in a method/global variable somewhere that is set up correctly at
>> startup?
>
> I think using the options is much more readable. Adding another state 
> which tells us if G1 & Dedup is enabled means we need a pre-init stage 
> to set up that state, which I don't think is very nice.
>
>>
>> g1CollectedHeap.cpp:
>>
>> - StringDedup::initialize() is run unconditionally (the check for
>> UseStringDeduplication is inside the method) while it's outside for
>> everything else. Just a note, keep it if wanted.
>
> They are hidden inside dedup everywhere, except where the performance 
> overhead of a call might be an issue (like mark and copy).
>
>>
>> TestStringDeduplicationTools.java:
>>
>> - maybe put retrieval of the unsafe object into the test library.
>> Browsing through the code shows quite a lot copy+paste here already.
>>
>> - "qeueue" -> "queue"
>
> Check.
>
>>
>> all tests:
>>
>> - missing bug ids
>
> Check.
>
>>
>>  From here it is mostly about comments/documentation:
>
> I agree that adding a feature overview and some additional comments is 
> probably a good idea. However, I also know that Bengt and StefanK 
> aren't too crazy about large comment blobs so I tried to kept the 
> noise down. I also disagree with some of the other comments below 
> which tend to be about personal taste rather than code quality or 
> breaking some coding convention we have, e.g. the static helpers, 
> readability, naming, etc.
>
> I'll add some comments and make some other changes and send out an 
> updated webrev shortly.
>
> cheers,
> /Per
>
>>
>> stringdedup.?pp:
>>
>> - is it possible to put an overview about how the string deduplication
>> works here? There is absolutely no place in the code where e.g. the
>> string deduplication cycle is explained. There is no comprehensive
>> information about this feature anywhere, just bits about corner cases or
>> particular here and there, but nothing that can be used as entry point
>> to understand what this is actually doing.
>>
>> To some extent, the comments seem to be repetitions of what the code
>> already shows. E.g. "some_constant = (1 << 10); // 1024 (must be power
>> of 2)" or "// Store hash code in cache" followed by
>> "java_lang_String::set_hash(java_string, hash);".
>>
>> There is also no reference to the JEP (I am not sure if it is customary
>> to put references to outside documents - I would prefer some); however
>> the JEP does not replace documentation in the source as it is not
>> detailed enough, and the JEP cannot be changed any more. The JEP is
>> interesting for the alternatives considered though.
>>
>> Ideally, to reasonably understand the feature I would like to not need
>> to spend hours looking through the implementation.
>>
>> - how many threads does this feature add, and are supported by this code
>> (for which parts)? Synchronization between them?
>>
>> - the comments "For XY" in the StringDedup header file do not give any
>> hint about what they are doing (not sure if it is interesting as it is
>> easy to find out the only callers). It seems that this information has
>> been put at the place where these methods are invoked (or just somewhere
>> in the cpp file). Imo the methods and members (except trivial) should be
>> documented in the hpp file in the class body, i.e. in the interface.
>>
>> - there are a few methods (also in the StringDedupTable class) in
>> StringDedup where the results of the methods do not seem obvious and
>> should be commented.
>>
>> - I really dislike the location of the "Candidate selection" explanation
>> within the cpp file. I would not have found it if I were not looking for
>> some documentation ;) It would probably fit very well in the above
>> mentioned overview paragraphs.
>>
>> - please add the static helper functions that implement the actual
>> policies (is_candidate_from_mark, is_candidate_from_evacuation) to the
>> StringDedup class (as private static methods) so that they can be found
>> easily. Static non-class helper functions in cpp files should only be
>> used for stuff that is really unimportant and not relevant to the code
>> imo. However these seem to be _the_ central policy methods for string
>> dedup.
>> So it would be nice to have them mentioned in the class/hpp file and
>> documented there.
>>
>> - facts about the code are repeated, e.g. that the table size must be a
>> power of two. This is nice, but has a few drawbacks: the compiler does
>> not care about comments at all, so it might be useful to put asserts in
>> these places instead, and second, it's just duplication of text with the
>> usual problem of getting out of date.
>>
>> - I am not completely sure why the class has not been called
>> G1StringDedup. It won't work without G1 anyway and calls G1 internals
>> all the time. There does not seem to be a way to make it more generic
>> without considerable refactoring. It is fine to keep the name though,
>> just asking. I also know that this feature does not want to be a generic
>> solution.
>>
>> - is it possible to separate out the statistics variables of
>> StringDedupTable into either a separate class instead of using globals
>> or separate them a little from the other unrelated members? (And
>> describe them) This would improve readability.
>>
>> I *think* this is everything from StringDedupTable::_entries_added to
>> _rehash_count? I have no idea without searching for every use and
>> deducing from that.
>>
>> - when implementing the hash table for the StringDedupTable, did you
>> consider trying to reuse one of the existing hashmap implementations?
>> And potentially extend them.
>>
>> Imo we already have way too many ad-hoc implementations of often used
>> data structures (particularly) in G1 and GC code.
>>
>> - please document the properties of the StringDedupTable (type, hashing,
>> automatic resizing behavior, thread safety, ...).
>>
>> - please explain what StringDedupTable/Cache are and what they are used
>> for in context of the feature.
>>
>> - stringdedup.cpp, line 1309: candicate -> candidate
>>
>> Thanks,
>> Thomas
>>
>>
>