RFR(L): 8029075 - String deduplication in G1
Bengt Rutisson
bengt.rutisson at oracle.com
Mon Mar 17 09:51:32 UTC 2014
Hi Per,
On 2014-03-14 20:07, Per Liden wrote:
> Hi,
>
> Here's an updated webrev, based on feedback from primarily Thomas and
> Coleen (Thanks!).
>
> http://cr.openjdk.java.net/~pliden/8029075/webrev.2/
Looks good to me.
Bengt
>
> I hope to have addressed all issues that were raised. The main theme
> of course is the addition of documentation/comments and the splitting
> of some classes into separate files. I think I've addressed most, if
> not all, of the other minor issues Thomas mentioned. The only thing I
> can think of that wasn't changed was the categorization of the VM
> flags, which I motivated in a different mail.
>
> I didn't think it made sense to upload a diff against the previous
> webrev as that diff turned out to be a lot bigger than real webrev
> itself. Sorry if this makes reviewing harder.
>
> cheers,
> /Per
>
> On 2014-03-14 14:19, Per Liden wrote:
>> Hi Coleen,
>>
>> Inline below...
>>
>> On 2014-03-14 13:19, Coleen Phillimore wrote:
>>>
>>> Per, 3 quick comments below. Thank you for the reorganization and
>>> additional comments.
>>>
>>> On 3/13/14 11:02 AM, Per Liden wrote:
>>>> Hi Coleen,
>>>>
>>>> Thanks for looking at the patch! Comment below.
>>>>
>>>> On 2014-03-12 23:10, Coleen Phillimore wrote:
>>>>>
>>>>> Hi Per,
>>>>>
>>>>> Why not add the call for StringDedup::unlink() inside the function
>>>>> unlink_string_and_symbol_table rather than just above in two
>>>>> places, since they go together.
>>>>>
>>>>> + // Delete dead entries in string deduplication queue/table
>>>>> + StringDedup::unlink(&GenMarkSweep::is_alive);
>>>>> +
>>>>> // Delete entries for dead interned string and clean up
>>>>> unreferenced symbols in symbol table.
>>>>> G1CollectedHeap::heap()->unlink_string_and_symbol_table(&GenMarkSweep::is_alive);
>>>>>
>>>>
>>>> Yep, I'll move that call into unlink_string_and_symbol_table(). The
>>>> unlink_string_and_symbol_table() is a fairly new addition, there
>>>> used to be StringTable/SymbolTable calls there when I added the
>>>> StrDedup call, that's why it looks like that.
>>>>
>>>>>
>>>>>
>>>>> I didn't see anything else specific to change. Some general
>>>>> comments which are similar to Thomas's. There seems to be a lack
>>>>> of comments in the deduplication file. I hate huge block
>>>>> comments too but some comments about what it does in general would
>>>>> be nice. Like 3 sentences per class. The file is overwhelming
>>>>> (ie, I completely lose interest) with the statistics printing at
>>>>> the front. Maybe you could put the most interesting classes at
>>>>> the front of the file? Maybe statistics printing could be a
>>>>> separate file.
>>>>>
>>>>> One of the conventions that we have is that accessors for classes
>>>>> are on one line, so you could make the file smaller by doing that
>>>>> reformatting.
>>>>>
>>>>> It is also a general convention to have comments above the
>>>>> declaration of a class to say generally what it's for, when it's
>>>>> used, etc. It's not easy to tell by the name or we could guess
>>>>> wrong. What is StringDedupQueue for, eg? That would make
>>>>> reviewing the code easier.
>>>>
>>>> Agree, I'll be adding comments to each class and a general
>>>> description of what dedup does.
>>>>
>>>>>
>>>>> There seems to be a lot of duplication of code with the hashtable
>>>>> in hashtable.hpp, only different. We have lots of hashtables in
>>>>> hotspot. Did you try to reinstantiate this hashtable and why
>>>>> didn't that not work? There are interesting memory allocation
>>>>> functions for the entries in this normal hashtable that might help
>>>>> with fragmentation, for instance.
>>>>
>>>> Very early on I actually did use an instance of Hashtable and even
>>>> played with the idea of having a combined StringTable/DedupTable. I
>>>> decided not to use that though, mostly because I felt I would need
>>>> to make too many modifications to non-dedup related code and
>>>> because Hashtable<> carries concepts, like shared entries, which
>>>> didn't quite fit with dedup. Using the existing StringTable would
>>>> mean it would have to have logic to manage entries which point to
>>>> either Strings or char[], which again seemed like an too intrusive
>>>> change. The Hashtable allocation strategy (with allocating blocks
>>>> of entries) seems nice from a performance perspective, but freeing
>>>> those becomes a bit of a challenge, which I guess is why it never
>>>> frees any entries. Since the dedup table is there to in the end
>>>> save memory, I felt I needed better control over the table's memory
>>>> usage. At one point I actually started writing a
>>>> WeakOopsHashTable<>, which I hoped could serve as the base for both
>>>> StringTable and DedupTable, but again I felt I was losing focus a
>>>> bit as it would mean modifying large parts of non-GC and non-dedeup
>>>> related stuff. The potential connection to CDS here also made me a
>>>> bit scared of going in that direction, as I don't have a complete
>>>> understanding of the implications of that.
>>>
>>> Ok, I'd figured they didn't map that well. There's a lot of
>>> duplication already in symbol and string tables also.
>>>>
>>>>>
>>>>> Lastly, why is the hash code in the string written to and used?
>>>>> The StringTable uses the same hash algorithm until it has to
>>>>> change the hash code, but the new hash code isn't stored in the
>>>>> string because it would be incompatible.
>>>>
>>>> I think StringTable and the dedup table is doing the same thing
>>>> here. The hash code in the String object is only used/updated in
>>>> the case we're using the standard/compatible hash function. As soon
>>>> as the table is rehashed and switches to another hash function it
>>>> no longer updates the hash code in the String. use_java_hash()
>>>> tells use if we're using the standard/compatible hash and it's used
>>>> here:
>>>>
>>>> if (use_java_hash()) {
>>>> // Get hash code from cache
>>>> hash = java_lang_String::hash(java_string);
>>>> }
>>>>
>>>> and later here:
>>>>
>>>> if (use_java_hash() && hash != 0) {
>>>> // Store hash code in cache
>>>> java_lang_String::set_hash(java_string, hash);
>>>> }
>>>>
>>>> Concerns?
>>>
>>> I have to look again but wanted to answer quickly (forgot to last
>>> night). This is good, but why do we have to update the hash code in
>>> the java/lang/String instance at all?
>>
>> We don't strictly have to, but it mimics the behavior of
>> String.hashCode() and it is seems like the natural thing to do to
>> given that the hash field in String is a cache for this purpose so
>> that future users of String.hashCode() don't have to do the same
>> calculation. Do you have a reason in mind why we wouldn't want to
>> write it back?
>>
>> cheers,
>> /Per
>>
>>>
>>>>
>>>>>
>>>>> I would like to see this be a few different files. It seems like
>>>>> this file does three things conceptually, interacts with G1
>>>>> somehow to determine which strings are candidates for
>>>>> deduplication, stores these things in a rehashable, resizable hash
>>>>> table and gathers statistics.
>>>>>
>>>>> I'm sorry to have so many comments so late and I haven't really
>>>>> tried to read the code for correctness. I think some
>>>>> reformatting and reorganization would make the reviewers job
>>>>> easier. It's a big change and worthwhile.
>>>>
>>>> Great, I've no problems splitting this up. I'm more of a "one class
>>>> per file" person myself, but I had the (apparently incorrect)
>>>> impression that wasn't the hotspot way. I'm thinking I'll split
>>>> this up into the following files (while I'm at it I'll also add the
>>>> G1 prefix as Thomas suggested):
>>>>
>>>> g1StringDedup - Main interface for the GC, would contain entry
>>>> point and things like the task and closure classes
>>>> g1StringDedupTable - The table, cache and entry classes
>>>> g1StringDedupQueue - The queue
>>>> g1StringDedupStat - The statistics sutff
>>>> g1StringDedupThread - The dedup thread
>>>
>>> Yes, this would be nice. symbolTable.cpp should be split up too if
>>> that is what gave you the impression. I will file an RFE to do
>>> that, it's been at the back of my mind to do so.
>>>
>>> Thanks,
>>> Coleen
>>>>
>>>> Sounds good?
>>>>
>>>> Will send out and updated webrev.
>>>>
>>>> Thanks,
>>>> /Per
>>>>
>>>>>
>>>>> Thanks,
>>>>> Coleen
>>>>>
>>>>>
>>>>> On 3/12/14 9:34 AM, Per Liden wrote:
>>>>>> Hi Thomas,
>>>>>>
>>>>>> Thanks for reviewing. Comments inline.
>>>>>>
>>>>>> On 03/11/2014 01:39 PM, Thomas Schatzl wrote:
>>>>>>> Hi Per,
>>>>>>>
>>>>>>>
>>>>>>> On Fri, 2014-03-07 at 15:13 +0100, Per Liden wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Here's an updated webrev based on the feedback from Bengt and
>>>>>>>> Thomas.
>>>>>>>>
>>>>>>>> http://cr.openjdk.java.net/~pliden/8029075/webrev.1/
>>>>>>>>
>>>>>>>
>>>>>>> See further below.
>>>>>>>
>>>>>>>>
>>>>>>>> And as suggested, two changes in the initial webrev were broken
>>>>>>>> out into
>>>>>>>> separate CRs, available here:
>>>>>>>>
>>>>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8036672
>>>>>>>> Webrev: http://cr.openjdk.java.net/~pliden/8036672/webrev.0/
>>>>>>>
>>>>>>> Looks good.
>>>>>>>
>>>>>>> Did you check the impact on something like specjbb2005 which does
>>>>>>> nothing but object copying? That code is somewhat performance
>>>>>>> sensitive.
>>>>>>
>>>>>> Yes, I did Aurora runs with jbb2005 and jbb2013, didn't see any
>>>>>> difference.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8036673
>>>>>>>> Webrev: http://cr.openjdk.java.net/~pliden/8036673/webrev.0/
>>>>>>>
>>>>>>> Looks good.
>>>>>>>
>>>>>>> Here are my comments: there are a few minor issues, and a single
>>>>>>> major:
>>>>>>> basically the documentation for this feature is lacking imo. I
>>>>>>> listed a
>>>>>>> few particular issues of mine with it.
>>>>>>>
>>>>>>> First the minor issues:
>>>>>>>
>>>>>>> - could you check on copyright dates on the modified files? They
>>>>>>> have
>>>>>>> not been updated anywhere afaics.
>>>>>>>
>>>>>>> - not sure if good, but maybe make
>>>>>>> PrintStringDeduplicationStatistics
>>>>>>> diagnostic? It seems to really be some diagnosis functions.
>>>>>>>
>>>>>>> - maybe make the new *ALot flags notproduct. These seem to be
>>>>>>> debugging
>>>>>>> helpers similar to other *ALot flags.
>>>>>>
>>>>>> I had a conversion with Bengt about this some time ago and I
>>>>>> think the options are categorized correctly. The ALot flags needs
>>>>>> be available in product builds because they are used by tests.
>>>>>> Most Print*Statistics are product therefore it makes sense to
>>>>>> follow that pattern for PrintStringDeduplicationStatistics.
>>>>>>
>>>>>>>
>>>>>>> - StringDeduplicationAgeThreshold: maybe also needs to mention
>>>>>>> that if
>>>>>>> the object is promoted, that it is always considered as candidate.
>>>>>>
>>>>>> Check.
>>>>>>
>>>>>>>
>>>>>>> marksweep.inline.hpp
>>>>>>>
>>>>>>> - MarkSweep::mark_object(): did you measure impact on other
>>>>>>> collectors
>>>>>>> using this code? It is enabled unconditionally. Also I would prefer
>>>>>>> that !UseG1GC automatically disables UseStringDeduplication (or
>>>>>>> use a
>>>>>>> global variable for that) to avoid the two checks. (The latter
>>>>>>> option is
>>>>>>> probably better).
>>>>>>
>>>>>> Yes, ran jbb2005/2013.
>>>>>>
>>>>>>>
>>>>>>> symbolTable.cpp:
>>>>>>>
>>>>>>> - again both UseG1GC and UseStringDedupliation are paired. Could
>>>>>>> this be
>>>>>>> put in a method/global variable somewhere that is set up
>>>>>>> correctly at
>>>>>>> startup?
>>>>>>
>>>>>> I think using the options is much more readable. Adding another
>>>>>> state which tells us if G1 & Dedup is enabled means we need a
>>>>>> pre-init stage to set up that state, which I don't think is very
>>>>>> nice.
>>>>>>
>>>>>>>
>>>>>>> g1CollectedHeap.cpp:
>>>>>>>
>>>>>>> - StringDedup::initialize() is run unconditionally (the check for
>>>>>>> UseStringDeduplication is inside the method) while it's outside for
>>>>>>> everything else. Just a note, keep it if wanted.
>>>>>>
>>>>>> They are hidden inside dedup everywhere, except where the
>>>>>> performance overhead of a call might be an issue (like mark and
>>>>>> copy).
>>>>>>
>>>>>>>
>>>>>>> TestStringDeduplicationTools.java:
>>>>>>>
>>>>>>> - maybe put retrieval of the unsafe object into the test library.
>>>>>>> Browsing through the code shows quite a lot copy+paste here
>>>>>>> already.
>>>>>>>
>>>>>>> - "qeueue" -> "queue"
>>>>>>
>>>>>> Check.
>>>>>>
>>>>>>>
>>>>>>> all tests:
>>>>>>>
>>>>>>> - missing bug ids
>>>>>>
>>>>>> Check.
>>>>>>
>>>>>>>
>>>>>>> From here it is mostly about comments/documentation:
>>>>>>
>>>>>> I agree that adding a feature overview and some additional
>>>>>> comments is probably a good idea. However, I also know that Bengt
>>>>>> and StefanK aren't too crazy about large comment blobs so I tried
>>>>>> to kept the noise down. I also disagree with some of the other
>>>>>> comments below which tend to be about personal taste rather than
>>>>>> code quality or breaking some coding convention we have, e.g. the
>>>>>> static helpers, readability, naming, etc.
>>>>>>
>>>>>> I'll add some comments and make some other changes and send out
>>>>>> an updated webrev shortly.
>>>>>>
>>>>>> cheers,
>>>>>> /Per
>>>>>>
>>>>>>>
>>>>>>> stringdedup.?pp:
>>>>>>>
>>>>>>> - is it possible to put an overview about how the string
>>>>>>> deduplication
>>>>>>> works here? There is absolutely no place in the code where e.g. the
>>>>>>> string deduplication cycle is explained. There is no comprehensive
>>>>>>> information about this feature anywhere, just bits about corner
>>>>>>> cases or
>>>>>>> particular here and there, but nothing that can be used as entry
>>>>>>> point
>>>>>>> to understand what this is actually doing.
>>>>>>>
>>>>>>> To some extent, the comments seem to be repetitions of what the
>>>>>>> code
>>>>>>> already shows. E.g. "some_constant = (1 << 10); // 1024 (must be
>>>>>>> power
>>>>>>> of 2)" or "// Store hash code in cache" followed by
>>>>>>> "java_lang_String::set_hash(java_string, hash);".
>>>>>>>
>>>>>>> There is also no reference to the JEP (I am not sure if it is
>>>>>>> customary
>>>>>>> to put references to outside documents - I would prefer some);
>>>>>>> however
>>>>>>> the JEP does not replace documentation in the source as it is not
>>>>>>> detailed enough, and the JEP cannot be changed any more. The JEP is
>>>>>>> interesting for the alternatives considered though.
>>>>>>>
>>>>>>> Ideally, to reasonably understand the feature I would like to
>>>>>>> not need
>>>>>>> to spend hours looking through the implementation.
>>>>>>>
>>>>>>> - how many threads does this feature add, and are supported by
>>>>>>> this code
>>>>>>> (for which parts)? Synchronization between them?
>>>>>>>
>>>>>>> - the comments "For XY" in the StringDedup header file do not
>>>>>>> give any
>>>>>>> hint about what they are doing (not sure if it is interesting as
>>>>>>> it is
>>>>>>> easy to find out the only callers). It seems that this
>>>>>>> information has
>>>>>>> been put at the place where these methods are invoked (or just
>>>>>>> somewhere
>>>>>>> in the cpp file). Imo the methods and members (except trivial)
>>>>>>> should be
>>>>>>> documented in the hpp file in the class body, i.e. in the
>>>>>>> interface.
>>>>>>>
>>>>>>> - there are a few methods (also in the StringDedupTable class) in
>>>>>>> StringDedup where the results of the methods do not seem obvious
>>>>>>> and
>>>>>>> should be commented.
>>>>>>>
>>>>>>> - I really dislike the location of the "Candidate selection"
>>>>>>> explanation
>>>>>>> within the cpp file. I would not have found it if I were not
>>>>>>> looking for
>>>>>>> some documentation ;) It would probably fit very well in the above
>>>>>>> mentioned overview paragraphs.
>>>>>>>
>>>>>>> - please add the static helper functions that implement the actual
>>>>>>> policies (is_candidate_from_mark, is_candidate_from_evacuation)
>>>>>>> to the
>>>>>>> StringDedup class (as private static methods) so that they can
>>>>>>> be found
>>>>>>> easily. Static non-class helper functions in cpp files should
>>>>>>> only be
>>>>>>> used for stuff that is really unimportant and not relevant to
>>>>>>> the code
>>>>>>> imo. However these seem to be _the_ central policy methods for
>>>>>>> string
>>>>>>> dedup.
>>>>>>> So it would be nice to have them mentioned in the class/hpp file
>>>>>>> and
>>>>>>> documented there.
>>>>>>>
>>>>>>> - facts about the code are repeated, e.g. that the table size
>>>>>>> must be a
>>>>>>> power of two. This is nice, but has a few drawbacks: the
>>>>>>> compiler does
>>>>>>> not care about comments at all, so it might be useful to put
>>>>>>> asserts in
>>>>>>> these places instead, and second, it's just duplication of text
>>>>>>> with the
>>>>>>> usual problem of getting out of date.
>>>>>>>
>>>>>>> - I am not completely sure why the class has not been called
>>>>>>> G1StringDedup. It won't work without G1 anyway and calls G1
>>>>>>> internals
>>>>>>> all the time. There does not seem to be a way to make it more
>>>>>>> generic
>>>>>>> without considerable refactoring. It is fine to keep the name
>>>>>>> though,
>>>>>>> just asking. I also know that this feature does not want to be a
>>>>>>> generic
>>>>>>> solution.
>>>>>>>
>>>>>>> - is it possible to separate out the statistics variables of
>>>>>>> StringDedupTable into either a separate class instead of using
>>>>>>> globals
>>>>>>> or separate them a little from the other unrelated members? (And
>>>>>>> describe them) This would improve readability.
>>>>>>>
>>>>>>> I *think* this is everything from
>>>>>>> StringDedupTable::_entries_added to
>>>>>>> _rehash_count? I have no idea without searching for every use and
>>>>>>> deducing from that.
>>>>>>>
>>>>>>> - when implementing the hash table for the StringDedupTable, did
>>>>>>> you
>>>>>>> consider trying to reuse one of the existing hashmap
>>>>>>> implementations?
>>>>>>> And potentially extend them.
>>>>>>>
>>>>>>> Imo we already have way too many ad-hoc implementations of often
>>>>>>> used
>>>>>>> data structures (particularly) in G1 and GC code.
>>>>>>>
>>>>>>> - please document the properties of the StringDedupTable (type,
>>>>>>> hashing,
>>>>>>> automatic resizing behavior, thread safety, ...).
>>>>>>>
>>>>>>> - please explain what StringDedupTable/Cache are and what they
>>>>>>> are used
>>>>>>> for in context of the feature.
>>>>>>>
>>>>>>> - stringdedup.cpp, line 1309: candicate -> candidate
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Thomas
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
More information about the hotspot-dev
mailing list