RFR(L): 8029075 - String deduplication in G1
Per Liden
per.liden at oracle.com
Fri Mar 14 19:07:31 UTC 2014
Hi,
Here's an updated webrev, based on feedback from primarily Thomas and
Coleen (Thanks!).
http://cr.openjdk.java.net/~pliden/8029075/webrev.2/
I hope to have addressed all issues that were raised. The main theme of
course is the addition of documentation/comments and the splitting of
some classes into separate files. I think I've addressed most, if not
all, of the other minor issues Thomas mentioned. The only thing I can
think of that wasn't changed was the categorization of the VM flags,
which I motivated in a different mail.
I didn't think it made sense to upload a diff against the previous
webrev as that diff turned out to be a lot bigger than real webrev
itself. Sorry if this makes reviewing harder.
cheers,
/Per
On 2014-03-14 14:19, Per Liden wrote:
> Hi Coleen,
>
> Inline below...
>
> On 2014-03-14 13:19, Coleen Phillimore wrote:
>>
>> Per, 3 quick comments below. Thank you for the reorganization and
>> additional comments.
>>
>> On 3/13/14 11:02 AM, Per Liden wrote:
>>> Hi Coleen,
>>>
>>> Thanks for looking at the patch! Comment below.
>>>
>>> On 2014-03-12 23:10, Coleen Phillimore wrote:
>>>>
>>>> Hi Per,
>>>>
>>>> Why not add the call for StringDedup::unlink() inside the function
>>>> unlink_string_and_symbol_table rather than just above in two
>>>> places, since they go together.
>>>>
>>>> + // Delete dead entries in string deduplication queue/table
>>>> + StringDedup::unlink(&GenMarkSweep::is_alive);
>>>> +
>>>> // Delete entries for dead interned string and clean up
>>>> unreferenced symbols in symbol table.
>>>> G1CollectedHeap::heap()->unlink_string_and_symbol_table(&GenMarkSweep::is_alive);
>>>>
>>>
>>> Yep, I'll move that call into unlink_string_and_symbol_table(). The
>>> unlink_string_and_symbol_table() is a fairly new addition, there
>>> used to be StringTable/SymbolTable calls there when I added the
>>> StrDedup call, that's why it looks like that.
>>>
>>>>
>>>>
>>>> I didn't see anything else specific to change. Some general
>>>> comments which are similar to Thomas's. There seems to be a lack
>>>> of comments in the deduplication file. I hate huge block comments
>>>> too but some comments about what it does in general would be nice.
>>>> Like 3 sentences per class. The file is overwhelming (ie, I
>>>> completely lose interest) with the statistics printing at the
>>>> front. Maybe you could put the most interesting classes at the
>>>> front of the file? Maybe statistics printing could be a separate
>>>> file.
>>>>
>>>> One of the conventions that we have is that accessors for classes
>>>> are on one line, so you could make the file smaller by doing that
>>>> reformatting.
>>>>
>>>> It is also a general convention to have comments above the
>>>> declaration of a class to say generally what it's for, when it's
>>>> used, etc. It's not easy to tell by the name or we could guess
>>>> wrong. What is StringDedupQueue for, eg? That would make
>>>> reviewing the code easier.
>>>
>>> Agree, I'll be adding comments to each class and a general
>>> description of what dedup does.
>>>
>>>>
>>>> There seems to be a lot of duplication of code with the hashtable
>>>> in hashtable.hpp, only different. We have lots of hashtables in
>>>> hotspot. Did you try to reinstantiate this hashtable and why
>>>> didn't that not work? There are interesting memory allocation
>>>> functions for the entries in this normal hashtable that might help
>>>> with fragmentation, for instance.
>>>
>>> Very early on I actually did use an instance of Hashtable and even
>>> played with the idea of having a combined StringTable/DedupTable. I
>>> decided not to use that though, mostly because I felt I would need
>>> to make too many modifications to non-dedup related code and because
>>> Hashtable<> carries concepts, like shared entries, which didn't
>>> quite fit with dedup. Using the existing StringTable would mean it
>>> would have to have logic to manage entries which point to either
>>> Strings or char[], which again seemed like an too intrusive change.
>>> The Hashtable allocation strategy (with allocating blocks of
>>> entries) seems nice from a performance perspective, but freeing
>>> those becomes a bit of a challenge, which I guess is why it never
>>> frees any entries. Since the dedup table is there to in the end save
>>> memory, I felt I needed better control over the table's memory
>>> usage. At one point I actually started writing a
>>> WeakOopsHashTable<>, which I hoped could serve as the base for both
>>> StringTable and DedupTable, but again I felt I was losing focus a
>>> bit as it would mean modifying large parts of non-GC and non-dedeup
>>> related stuff. The potential connection to CDS here also made me a
>>> bit scared of going in that direction, as I don't have a complete
>>> understanding of the implications of that.
>>
>> Ok, I'd figured they didn't map that well. There's a lot of
>> duplication already in symbol and string tables also.
>>>
>>>>
>>>> Lastly, why is the hash code in the string written to and used? The
>>>> StringTable uses the same hash algorithm until it has to change the
>>>> hash code, but the new hash code isn't stored in the string because
>>>> it would be incompatible.
>>>
>>> I think StringTable and the dedup table is doing the same thing
>>> here. The hash code in the String object is only used/updated in the
>>> case we're using the standard/compatible hash function. As soon as
>>> the table is rehashed and switches to another hash function it no
>>> longer updates the hash code in the String. use_java_hash() tells
>>> use if we're using the standard/compatible hash and it's used here:
>>>
>>> if (use_java_hash()) {
>>> // Get hash code from cache
>>> hash = java_lang_String::hash(java_string);
>>> }
>>>
>>> and later here:
>>>
>>> if (use_java_hash() && hash != 0) {
>>> // Store hash code in cache
>>> java_lang_String::set_hash(java_string, hash);
>>> }
>>>
>>> Concerns?
>>
>> I have to look again but wanted to answer quickly (forgot to last
>> night). This is good, but why do we have to update the hash code in
>> the java/lang/String instance at all?
>
> We don't strictly have to, but it mimics the behavior of
> String.hashCode() and it is seems like the natural thing to do to
> given that the hash field in String is a cache for this purpose so
> that future users of String.hashCode() don't have to do the same
> calculation. Do you have a reason in mind why we wouldn't want to
> write it back?
>
> cheers,
> /Per
>
>>
>>>
>>>>
>>>> I would like to see this be a few different files. It seems like
>>>> this file does three things conceptually, interacts with G1 somehow
>>>> to determine which strings are candidates for deduplication, stores
>>>> these things in a rehashable, resizable hash table and gathers
>>>> statistics.
>>>>
>>>> I'm sorry to have so many comments so late and I haven't really
>>>> tried to read the code for correctness. I think some reformatting
>>>> and reorganization would make the reviewers job easier. It's a big
>>>> change and worthwhile.
>>>
>>> Great, I've no problems splitting this up. I'm more of a "one class
>>> per file" person myself, but I had the (apparently incorrect)
>>> impression that wasn't the hotspot way. I'm thinking I'll split this
>>> up into the following files (while I'm at it I'll also add the G1
>>> prefix as Thomas suggested):
>>>
>>> g1StringDedup - Main interface for the GC, would contain entry point
>>> and things like the task and closure classes
>>> g1StringDedupTable - The table, cache and entry classes
>>> g1StringDedupQueue - The queue
>>> g1StringDedupStat - The statistics sutff
>>> g1StringDedupThread - The dedup thread
>>
>> Yes, this would be nice. symbolTable.cpp should be split up too if
>> that is what gave you the impression. I will file an RFE to do
>> that, it's been at the back of my mind to do so.
>>
>> Thanks,
>> Coleen
>>>
>>> Sounds good?
>>>
>>> Will send out and updated webrev.
>>>
>>> Thanks,
>>> /Per
>>>
>>>>
>>>> Thanks,
>>>> Coleen
>>>>
>>>>
>>>> On 3/12/14 9:34 AM, Per Liden wrote:
>>>>> Hi Thomas,
>>>>>
>>>>> Thanks for reviewing. Comments inline.
>>>>>
>>>>> On 03/11/2014 01:39 PM, Thomas Schatzl wrote:
>>>>>> Hi Per,
>>>>>>
>>>>>>
>>>>>> On Fri, 2014-03-07 at 15:13 +0100, Per Liden wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Here's an updated webrev based on the feedback from Bengt and
>>>>>>> Thomas.
>>>>>>>
>>>>>>> http://cr.openjdk.java.net/~pliden/8029075/webrev.1/
>>>>>>>
>>>>>>
>>>>>> See further below.
>>>>>>
>>>>>>>
>>>>>>> And as suggested, two changes in the initial webrev were broken
>>>>>>> out into
>>>>>>> separate CRs, available here:
>>>>>>>
>>>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8036672
>>>>>>> Webrev: http://cr.openjdk.java.net/~pliden/8036672/webrev.0/
>>>>>>
>>>>>> Looks good.
>>>>>>
>>>>>> Did you check the impact on something like specjbb2005 which does
>>>>>> nothing but object copying? That code is somewhat performance
>>>>>> sensitive.
>>>>>
>>>>> Yes, I did Aurora runs with jbb2005 and jbb2013, didn't see any
>>>>> difference.
>>>>>
>>>>>>
>>>>>>
>>>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8036673
>>>>>>> Webrev: http://cr.openjdk.java.net/~pliden/8036673/webrev.0/
>>>>>>
>>>>>> Looks good.
>>>>>>
>>>>>> Here are my comments: there are a few minor issues, and a single
>>>>>> major:
>>>>>> basically the documentation for this feature is lacking imo. I
>>>>>> listed a
>>>>>> few particular issues of mine with it.
>>>>>>
>>>>>> First the minor issues:
>>>>>>
>>>>>> - could you check on copyright dates on the modified files? They
>>>>>> have
>>>>>> not been updated anywhere afaics.
>>>>>>
>>>>>> - not sure if good, but maybe make
>>>>>> PrintStringDeduplicationStatistics
>>>>>> diagnostic? It seems to really be some diagnosis functions.
>>>>>>
>>>>>> - maybe make the new *ALot flags notproduct. These seem to be
>>>>>> debugging
>>>>>> helpers similar to other *ALot flags.
>>>>>
>>>>> I had a conversion with Bengt about this some time ago and I think
>>>>> the options are categorized correctly. The ALot flags needs be
>>>>> available in product builds because they are used by tests. Most
>>>>> Print*Statistics are product therefore it makes sense to follow
>>>>> that pattern for PrintStringDeduplicationStatistics.
>>>>>
>>>>>>
>>>>>> - StringDeduplicationAgeThreshold: maybe also needs to mention
>>>>>> that if
>>>>>> the object is promoted, that it is always considered as candidate.
>>>>>
>>>>> Check.
>>>>>
>>>>>>
>>>>>> marksweep.inline.hpp
>>>>>>
>>>>>> - MarkSweep::mark_object(): did you measure impact on other
>>>>>> collectors
>>>>>> using this code? It is enabled unconditionally. Also I would prefer
>>>>>> that !UseG1GC automatically disables UseStringDeduplication (or
>>>>>> use a
>>>>>> global variable for that) to avoid the two checks. (The latter
>>>>>> option is
>>>>>> probably better).
>>>>>
>>>>> Yes, ran jbb2005/2013.
>>>>>
>>>>>>
>>>>>> symbolTable.cpp:
>>>>>>
>>>>>> - again both UseG1GC and UseStringDedupliation are paired. Could
>>>>>> this be
>>>>>> put in a method/global variable somewhere that is set up
>>>>>> correctly at
>>>>>> startup?
>>>>>
>>>>> I think using the options is much more readable. Adding another
>>>>> state which tells us if G1 & Dedup is enabled means we need a
>>>>> pre-init stage to set up that state, which I don't think is very
>>>>> nice.
>>>>>
>>>>>>
>>>>>> g1CollectedHeap.cpp:
>>>>>>
>>>>>> - StringDedup::initialize() is run unconditionally (the check for
>>>>>> UseStringDeduplication is inside the method) while it's outside for
>>>>>> everything else. Just a note, keep it if wanted.
>>>>>
>>>>> They are hidden inside dedup everywhere, except where the
>>>>> performance overhead of a call might be an issue (like mark and
>>>>> copy).
>>>>>
>>>>>>
>>>>>> TestStringDeduplicationTools.java:
>>>>>>
>>>>>> - maybe put retrieval of the unsafe object into the test library.
>>>>>> Browsing through the code shows quite a lot copy+paste here already.
>>>>>>
>>>>>> - "qeueue" -> "queue"
>>>>>
>>>>> Check.
>>>>>
>>>>>>
>>>>>> all tests:
>>>>>>
>>>>>> - missing bug ids
>>>>>
>>>>> Check.
>>>>>
>>>>>>
>>>>>> From here it is mostly about comments/documentation:
>>>>>
>>>>> I agree that adding a feature overview and some additional
>>>>> comments is probably a good idea. However, I also know that Bengt
>>>>> and StefanK aren't too crazy about large comment blobs so I tried
>>>>> to kept the noise down. I also disagree with some of the other
>>>>> comments below which tend to be about personal taste rather than
>>>>> code quality or breaking some coding convention we have, e.g. the
>>>>> static helpers, readability, naming, etc.
>>>>>
>>>>> I'll add some comments and make some other changes and send out an
>>>>> updated webrev shortly.
>>>>>
>>>>> cheers,
>>>>> /Per
>>>>>
>>>>>>
>>>>>> stringdedup.?pp:
>>>>>>
>>>>>> - is it possible to put an overview about how the string
>>>>>> deduplication
>>>>>> works here? There is absolutely no place in the code where e.g. the
>>>>>> string deduplication cycle is explained. There is no comprehensive
>>>>>> information about this feature anywhere, just bits about corner
>>>>>> cases or
>>>>>> particular here and there, but nothing that can be used as entry
>>>>>> point
>>>>>> to understand what this is actually doing.
>>>>>>
>>>>>> To some extent, the comments seem to be repetitions of what the code
>>>>>> already shows. E.g. "some_constant = (1 << 10); // 1024 (must be
>>>>>> power
>>>>>> of 2)" or "// Store hash code in cache" followed by
>>>>>> "java_lang_String::set_hash(java_string, hash);".
>>>>>>
>>>>>> There is also no reference to the JEP (I am not sure if it is
>>>>>> customary
>>>>>> to put references to outside documents - I would prefer some);
>>>>>> however
>>>>>> the JEP does not replace documentation in the source as it is not
>>>>>> detailed enough, and the JEP cannot be changed any more. The JEP is
>>>>>> interesting for the alternatives considered though.
>>>>>>
>>>>>> Ideally, to reasonably understand the feature I would like to not
>>>>>> need
>>>>>> to spend hours looking through the implementation.
>>>>>>
>>>>>> - how many threads does this feature add, and are supported by
>>>>>> this code
>>>>>> (for which parts)? Synchronization between them?
>>>>>>
>>>>>> - the comments "For XY" in the StringDedup header file do not
>>>>>> give any
>>>>>> hint about what they are doing (not sure if it is interesting as
>>>>>> it is
>>>>>> easy to find out the only callers). It seems that this
>>>>>> information has
>>>>>> been put at the place where these methods are invoked (or just
>>>>>> somewhere
>>>>>> in the cpp file). Imo the methods and members (except trivial)
>>>>>> should be
>>>>>> documented in the hpp file in the class body, i.e. in the interface.
>>>>>>
>>>>>> - there are a few methods (also in the StringDedupTable class) in
>>>>>> StringDedup where the results of the methods do not seem obvious and
>>>>>> should be commented.
>>>>>>
>>>>>> - I really dislike the location of the "Candidate selection"
>>>>>> explanation
>>>>>> within the cpp file. I would not have found it if I were not
>>>>>> looking for
>>>>>> some documentation ;) It would probably fit very well in the above
>>>>>> mentioned overview paragraphs.
>>>>>>
>>>>>> - please add the static helper functions that implement the actual
>>>>>> policies (is_candidate_from_mark, is_candidate_from_evacuation)
>>>>>> to the
>>>>>> StringDedup class (as private static methods) so that they can be
>>>>>> found
>>>>>> easily. Static non-class helper functions in cpp files should
>>>>>> only be
>>>>>> used for stuff that is really unimportant and not relevant to the
>>>>>> code
>>>>>> imo. However these seem to be _the_ central policy methods for
>>>>>> string
>>>>>> dedup.
>>>>>> So it would be nice to have them mentioned in the class/hpp file and
>>>>>> documented there.
>>>>>>
>>>>>> - facts about the code are repeated, e.g. that the table size
>>>>>> must be a
>>>>>> power of two. This is nice, but has a few drawbacks: the compiler
>>>>>> does
>>>>>> not care about comments at all, so it might be useful to put
>>>>>> asserts in
>>>>>> these places instead, and second, it's just duplication of text
>>>>>> with the
>>>>>> usual problem of getting out of date.
>>>>>>
>>>>>> - I am not completely sure why the class has not been called
>>>>>> G1StringDedup. It won't work without G1 anyway and calls G1
>>>>>> internals
>>>>>> all the time. There does not seem to be a way to make it more
>>>>>> generic
>>>>>> without considerable refactoring. It is fine to keep the name
>>>>>> though,
>>>>>> just asking. I also know that this feature does not want to be a
>>>>>> generic
>>>>>> solution.
>>>>>>
>>>>>> - is it possible to separate out the statistics variables of
>>>>>> StringDedupTable into either a separate class instead of using
>>>>>> globals
>>>>>> or separate them a little from the other unrelated members? (And
>>>>>> describe them) This would improve readability.
>>>>>>
>>>>>> I *think* this is everything from
>>>>>> StringDedupTable::_entries_added to
>>>>>> _rehash_count? I have no idea without searching for every use and
>>>>>> deducing from that.
>>>>>>
>>>>>> - when implementing the hash table for the StringDedupTable, did you
>>>>>> consider trying to reuse one of the existing hashmap
>>>>>> implementations?
>>>>>> And potentially extend them.
>>>>>>
>>>>>> Imo we already have way too many ad-hoc implementations of often
>>>>>> used
>>>>>> data structures (particularly) in G1 and GC code.
>>>>>>
>>>>>> - please document the properties of the StringDedupTable (type,
>>>>>> hashing,
>>>>>> automatic resizing behavior, thread safety, ...).
>>>>>>
>>>>>> - please explain what StringDedupTable/Cache are and what they
>>>>>> are used
>>>>>> for in context of the feature.
>>>>>>
>>>>>> - stringdedup.cpp, line 1309: candicate -> candidate
>>>>>>
>>>>>> Thanks,
>>>>>> Thomas
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
More information about the hotspot-dev
mailing list