RFR(L): 8029075 - String deduplication in G1
Coleen Phillimore
coleen.phillimore at oracle.com
Fri Mar 14 12:19:31 UTC 2014
Per, 3 quick comments below. Thank you for the reorganization and
additional comments.
On 3/13/14 11:02 AM, Per Liden wrote:
> Hi Coleen,
>
> Thanks for looking at the patch! Comment below.
>
> On 2014-03-12 23:10, Coleen Phillimore wrote:
>>
>> Hi Per,
>>
>> Why not add the call for StringDedup::unlink() inside the function
>> unlink_string_and_symbol_table rather than just above in two places,
>> since they go together.
>>
>> + // Delete dead entries in string deduplication queue/table
>> + StringDedup::unlink(&GenMarkSweep::is_alive);
>> +
>> // Delete entries for dead interned string and clean up
>> unreferenced symbols in symbol table.
>> G1CollectedHeap::heap()->unlink_string_and_symbol_table(&GenMarkSweep::is_alive);
>>
>
> Yep, I'll move that call into unlink_string_and_symbol_table(). The
> unlink_string_and_symbol_table() is a fairly new addition, there used
> to be StringTable/SymbolTable calls there when I added the StrDedup
> call, that's why it looks like that.
>
>>
>>
>> I didn't see anything else specific to change. Some general comments
>> which are similar to Thomas's. There seems to be a lack of comments
>> in the deduplication file. I hate huge block comments too but some
>> comments about what it does in general would be nice. Like 3
>> sentences per class. The file is overwhelming (ie, I completely lose
>> interest) with the statistics printing at the front. Maybe you
>> could put the most interesting classes at the front of the file?
>> Maybe statistics printing could be a separate file.
>>
>> One of the conventions that we have is that accessors for classes are
>> on one line, so you could make the file smaller by doing that
>> reformatting.
>>
>> It is also a general convention to have comments above the
>> declaration of a class to say generally what it's for, when it's
>> used, etc. It's not easy to tell by the name or we could guess
>> wrong. What is StringDedupQueue for, eg? That would make reviewing
>> the code easier.
>
> Agree, I'll be adding comments to each class and a general description
> of what dedup does.
>
>>
>> There seems to be a lot of duplication of code with the hashtable in
>> hashtable.hpp, only different. We have lots of hashtables in
>> hotspot. Did you try to reinstantiate this hashtable and why didn't
>> that not work? There are interesting memory allocation functions for
>> the entries in this normal hashtable that might help with
>> fragmentation, for instance.
>
> Very early on I actually did use an instance of Hashtable and even
> played with the idea of having a combined StringTable/DedupTable. I
> decided not to use that though, mostly because I felt I would need to
> make too many modifications to non-dedup related code and because
> Hashtable<> carries concepts, like shared entries, which didn't quite
> fit with dedup. Using the existing StringTable would mean it would
> have to have logic to manage entries which point to either Strings or
> char[], which again seemed like an too intrusive change. The Hashtable
> allocation strategy (with allocating blocks of entries) seems nice
> from a performance perspective, but freeing those becomes a bit of a
> challenge, which I guess is why it never frees any entries. Since the
> dedup table is there to in the end save memory, I felt I needed better
> control over the table's memory usage. At one point I actually started
> writing a WeakOopsHashTable<>, which I hoped could serve as the base
> for both StringTable and DedupTable, but again I felt I was losing
> focus a bit as it would mean modifying large parts of non-GC and
> non-dedeup related stuff. The potential connection to CDS here also
> made me a bit scared of going in that direction, as I don't have a
> complete understanding of the implications of that.
Ok, I'd figured they didn't map that well. There's a lot of
duplication already in symbol and string tables also.
>
>>
>> Lastly, why is the hash code in the string written to and used? The
>> StringTable uses the same hash algorithm until it has to change the
>> hash code, but the new hash code isn't stored in the string because
>> it would be incompatible.
>
> I think StringTable and the dedup table is doing the same thing here.
> The hash code in the String object is only used/updated in the case
> we're using the standard/compatible hash function. As soon as the
> table is rehashed and switches to another hash function it no longer
> updates the hash code in the String. use_java_hash() tells use if
> we're using the standard/compatible hash and it's used here:
>
> if (use_java_hash()) {
> // Get hash code from cache
> hash = java_lang_String::hash(java_string);
> }
>
> and later here:
>
> if (use_java_hash() && hash != 0) {
> // Store hash code in cache
> java_lang_String::set_hash(java_string, hash);
> }
>
> Concerns?
I have to look again but wanted to answer quickly (forgot to last
night). This is good, but why do we have to update the hash code in the
java/lang/String instance at all?
>
>>
>> I would like to see this be a few different files. It seems like
>> this file does three things conceptually, interacts with G1 somehow
>> to determine which strings are candidates for deduplication, stores
>> these things in a rehashable, resizable hash table and gathers
>> statistics.
>>
>> I'm sorry to have so many comments so late and I haven't really tried
>> to read the code for correctness. I think some reformatting and
>> reorganization would make the reviewers job easier. It's a big
>> change and worthwhile.
>
> Great, I've no problems splitting this up. I'm more of a "one class
> per file" person myself, but I had the (apparently incorrect)
> impression that wasn't the hotspot way. I'm thinking I'll split this
> up into the following files (while I'm at it I'll also add the G1
> prefix as Thomas suggested):
>
> g1StringDedup - Main interface for the GC, would contain entry point
> and things like the task and closure classes
> g1StringDedupTable - The table, cache and entry classes
> g1StringDedupQueue - The queue
> g1StringDedupStat - The statistics sutff
> g1StringDedupThread - The dedup thread
Yes, this would be nice. symbolTable.cpp should be split up too if
that is what gave you the impression. I will file an RFE to do that,
it's been at the back of my mind to do so.
Thanks,
Coleen
>
> Sounds good?
>
> Will send out and updated webrev.
>
> Thanks,
> /Per
>
>>
>> Thanks,
>> Coleen
>>
>>
>> On 3/12/14 9:34 AM, Per Liden wrote:
>>> Hi Thomas,
>>>
>>> Thanks for reviewing. Comments inline.
>>>
>>> On 03/11/2014 01:39 PM, Thomas Schatzl wrote:
>>>> Hi Per,
>>>>
>>>>
>>>> On Fri, 2014-03-07 at 15:13 +0100, Per Liden wrote:
>>>>> Hi,
>>>>>
>>>>> Here's an updated webrev based on the feedback from Bengt and Thomas.
>>>>>
>>>>> http://cr.openjdk.java.net/~pliden/8029075/webrev.1/
>>>>>
>>>>
>>>> See further below.
>>>>
>>>>>
>>>>> And as suggested, two changes in the initial webrev were broken
>>>>> out into
>>>>> separate CRs, available here:
>>>>>
>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8036672
>>>>> Webrev: http://cr.openjdk.java.net/~pliden/8036672/webrev.0/
>>>>
>>>> Looks good.
>>>>
>>>> Did you check the impact on something like specjbb2005 which does
>>>> nothing but object copying? That code is somewhat performance
>>>> sensitive.
>>>
>>> Yes, I did Aurora runs with jbb2005 and jbb2013, didn't see any
>>> difference.
>>>
>>>>
>>>>
>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8036673
>>>>> Webrev: http://cr.openjdk.java.net/~pliden/8036673/webrev.0/
>>>>
>>>> Looks good.
>>>>
>>>> Here are my comments: there are a few minor issues, and a single
>>>> major:
>>>> basically the documentation for this feature is lacking imo. I
>>>> listed a
>>>> few particular issues of mine with it.
>>>>
>>>> First the minor issues:
>>>>
>>>> - could you check on copyright dates on the modified files? They have
>>>> not been updated anywhere afaics.
>>>>
>>>> - not sure if good, but maybe make PrintStringDeduplicationStatistics
>>>> diagnostic? It seems to really be some diagnosis functions.
>>>>
>>>> - maybe make the new *ALot flags notproduct. These seem to be
>>>> debugging
>>>> helpers similar to other *ALot flags.
>>>
>>> I had a conversion with Bengt about this some time ago and I think
>>> the options are categorized correctly. The ALot flags needs be
>>> available in product builds because they are used by tests. Most
>>> Print*Statistics are product therefore it makes sense to follow that
>>> pattern for PrintStringDeduplicationStatistics.
>>>
>>>>
>>>> - StringDeduplicationAgeThreshold: maybe also needs to mention that if
>>>> the object is promoted, that it is always considered as candidate.
>>>
>>> Check.
>>>
>>>>
>>>> marksweep.inline.hpp
>>>>
>>>> - MarkSweep::mark_object(): did you measure impact on other collectors
>>>> using this code? It is enabled unconditionally. Also I would prefer
>>>> that !UseG1GC automatically disables UseStringDeduplication (or use a
>>>> global variable for that) to avoid the two checks. (The latter
>>>> option is
>>>> probably better).
>>>
>>> Yes, ran jbb2005/2013.
>>>
>>>>
>>>> symbolTable.cpp:
>>>>
>>>> - again both UseG1GC and UseStringDedupliation are paired. Could
>>>> this be
>>>> put in a method/global variable somewhere that is set up correctly at
>>>> startup?
>>>
>>> I think using the options is much more readable. Adding another
>>> state which tells us if G1 & Dedup is enabled means we need a
>>> pre-init stage to set up that state, which I don't think is very nice.
>>>
>>>>
>>>> g1CollectedHeap.cpp:
>>>>
>>>> - StringDedup::initialize() is run unconditionally (the check for
>>>> UseStringDeduplication is inside the method) while it's outside for
>>>> everything else. Just a note, keep it if wanted.
>>>
>>> They are hidden inside dedup everywhere, except where the
>>> performance overhead of a call might be an issue (like mark and copy).
>>>
>>>>
>>>> TestStringDeduplicationTools.java:
>>>>
>>>> - maybe put retrieval of the unsafe object into the test library.
>>>> Browsing through the code shows quite a lot copy+paste here already.
>>>>
>>>> - "qeueue" -> "queue"
>>>
>>> Check.
>>>
>>>>
>>>> all tests:
>>>>
>>>> - missing bug ids
>>>
>>> Check.
>>>
>>>>
>>>> From here it is mostly about comments/documentation:
>>>
>>> I agree that adding a feature overview and some additional comments
>>> is probably a good idea. However, I also know that Bengt and StefanK
>>> aren't too crazy about large comment blobs so I tried to kept the
>>> noise down. I also disagree with some of the other comments below
>>> which tend to be about personal taste rather than code quality or
>>> breaking some coding convention we have, e.g. the static helpers,
>>> readability, naming, etc.
>>>
>>> I'll add some comments and make some other changes and send out an
>>> updated webrev shortly.
>>>
>>> cheers,
>>> /Per
>>>
>>>>
>>>> stringdedup.?pp:
>>>>
>>>> - is it possible to put an overview about how the string deduplication
>>>> works here? There is absolutely no place in the code where e.g. the
>>>> string deduplication cycle is explained. There is no comprehensive
>>>> information about this feature anywhere, just bits about corner
>>>> cases or
>>>> particular here and there, but nothing that can be used as entry point
>>>> to understand what this is actually doing.
>>>>
>>>> To some extent, the comments seem to be repetitions of what the code
>>>> already shows. E.g. "some_constant = (1 << 10); // 1024 (must be power
>>>> of 2)" or "// Store hash code in cache" followed by
>>>> "java_lang_String::set_hash(java_string, hash);".
>>>>
>>>> There is also no reference to the JEP (I am not sure if it is
>>>> customary
>>>> to put references to outside documents - I would prefer some); however
>>>> the JEP does not replace documentation in the source as it is not
>>>> detailed enough, and the JEP cannot be changed any more. The JEP is
>>>> interesting for the alternatives considered though.
>>>>
>>>> Ideally, to reasonably understand the feature I would like to not need
>>>> to spend hours looking through the implementation.
>>>>
>>>> - how many threads does this feature add, and are supported by this
>>>> code
>>>> (for which parts)? Synchronization between them?
>>>>
>>>> - the comments "For XY" in the StringDedup header file do not give any
>>>> hint about what they are doing (not sure if it is interesting as it is
>>>> easy to find out the only callers). It seems that this information has
>>>> been put at the place where these methods are invoked (or just
>>>> somewhere
>>>> in the cpp file). Imo the methods and members (except trivial)
>>>> should be
>>>> documented in the hpp file in the class body, i.e. in the interface.
>>>>
>>>> - there are a few methods (also in the StringDedupTable class) in
>>>> StringDedup where the results of the methods do not seem obvious and
>>>> should be commented.
>>>>
>>>> - I really dislike the location of the "Candidate selection"
>>>> explanation
>>>> within the cpp file. I would not have found it if I were not
>>>> looking for
>>>> some documentation ;) It would probably fit very well in the above
>>>> mentioned overview paragraphs.
>>>>
>>>> - please add the static helper functions that implement the actual
>>>> policies (is_candidate_from_mark, is_candidate_from_evacuation) to the
>>>> StringDedup class (as private static methods) so that they can be
>>>> found
>>>> easily. Static non-class helper functions in cpp files should only be
>>>> used for stuff that is really unimportant and not relevant to the code
>>>> imo. However these seem to be _the_ central policy methods for string
>>>> dedup.
>>>> So it would be nice to have them mentioned in the class/hpp file and
>>>> documented there.
>>>>
>>>> - facts about the code are repeated, e.g. that the table size must
>>>> be a
>>>> power of two. This is nice, but has a few drawbacks: the compiler does
>>>> not care about comments at all, so it might be useful to put
>>>> asserts in
>>>> these places instead, and second, it's just duplication of text
>>>> with the
>>>> usual problem of getting out of date.
>>>>
>>>> - I am not completely sure why the class has not been called
>>>> G1StringDedup. It won't work without G1 anyway and calls G1 internals
>>>> all the time. There does not seem to be a way to make it more generic
>>>> without considerable refactoring. It is fine to keep the name though,
>>>> just asking. I also know that this feature does not want to be a
>>>> generic
>>>> solution.
>>>>
>>>> - is it possible to separate out the statistics variables of
>>>> StringDedupTable into either a separate class instead of using globals
>>>> or separate them a little from the other unrelated members? (And
>>>> describe them) This would improve readability.
>>>>
>>>> I *think* this is everything from StringDedupTable::_entries_added to
>>>> _rehash_count? I have no idea without searching for every use and
>>>> deducing from that.
>>>>
>>>> - when implementing the hash table for the StringDedupTable, did you
>>>> consider trying to reuse one of the existing hashmap implementations?
>>>> And potentially extend them.
>>>>
>>>> Imo we already have way too many ad-hoc implementations of often used
>>>> data structures (particularly) in G1 and GC code.
>>>>
>>>> - please document the properties of the StringDedupTable (type,
>>>> hashing,
>>>> automatic resizing behavior, thread safety, ...).
>>>>
>>>> - please explain what StringDedupTable/Cache are and what they are
>>>> used
>>>> for in context of the feature.
>>>>
>>>> - stringdedup.cpp, line 1309: candicate -> candidate
>>>>
>>>> Thanks,
>>>> Thomas
>>>>
>>>>
>>>
>>
>
More information about the hotspot-dev
mailing list