Cost of single-threaded nmethod hotness updates at each safepoint (in JDK 8)
Vitaly Davidovich
vitalyd at gmail.com
Sat Aug 1 02:17:29 UTC 2015
Hi Ramki,
That experiment was performed on 7u60, not 8; I may revisit this with 8, or
perhaps wait for the segmented code cache to be available before trying
again. One thing that worried me was the tuning aspect of tiered, which is
a bit opaque compared to, say, GC logs - it's a bit too black-boxy for
me. Also, the servers I was running this on have tightly chosen CPU
affinity masks, and there aren't many spare cores to dedicate to C1 and C2
compiler threads. But I may look at this again in the near future.
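(For what it's worth, the number of compiler threads can be capped
explicitly, e.g. with something like -XX:CICompilerCount=2; assume the
exact count needs experimentation, since under tiered that pool is shared
between the C1 and C2 compilers.)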
sent from my phone
On Jul 31, 2015 8:39 PM, <ysr1729 at gmail.com> wrote:
> Hi Vitaly -- Which JDK 8 version were you testing? It's a bit of the
> proverbial curate's egg at the moment (albeit not in the original sense, I
> assure you!), but if I may be allowed to mix my metaphors, I would be
> inclined not to throw out the baby with the bath water yet. There are
> services that have seen benefits and some that haven't, and the picture
> overall is still a bit fuzzy. Maybe someone out there has done a more
> disciplined epidemiological study...
>
> PS: a couple of services were running tiered when it wasn't the default
> (in jdk 7)...
>
> -- ramki
>
> Sent from my iPhone
>
> On Jul 31, 2015, at 3:08 PM, Vitaly Davidovich <vitalyd at gmail.com> wrote:
>
> Ramki, are you actually seeing better peak perf with tiered than C2? I
> experimented with it on a real workload and it was a net loss for peak perf
> (anywhere from 8-20% worse than C2, but also quite unstable); this was with
> a very large code cache to play it safe, but no other tuning.
>
> sent from my phone
> On Jul 31, 2015 6:02 PM, "Srinivas Ramakrishna" <ysr1729 at gmail.com> wrote:
>
>> OK, will do, and I'll add you as a watcher; thanks Vladimir! (I don't yet
>> know, with tiered and a necessarily bounded, if large, code cache, whether
>> flushing will in fact eventually become necessary, w.r.t. your suggested
>> temporary workaround.)
>>
>> Have a good weekend!
>> -- ramki
>>
>> On Fri, Jul 31, 2015 at 2:28 PM, Vladimir Kozlov
>> <vladimir.kozlov at oracle.com> wrote:
>>
>>> Got it. Yes, it is an issue with thousands of Java threads.
>>> You are the first to point out this problem. Please file a bug against the
>>> compiler; we will look at what we can do. Most likely we need to
>>> parallelize this work.
>>>
>>> A method's hotness is used only for UseCodeCacheFlushing. You can try to
>>> guard Threads::nmethods_do(&set_hotness_closure); with this flag and switch
>>> it off.
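>>>
>>> A minimal sketch of that guard (hypothetical and untested; the snippet it
>>> would modify, NMethodSweeper::mark_active_nmethods(), is quoted further
>>> down in this thread):
>>>
>>>   } else if (UseCodeCacheFlushing) {
>>>     // Hotness counters only feed the code-cache flushing heuristic,
>>>     // so skip the single-threaded stack walk entirely when flushing
>>>     // is disabled (-XX:-UseCodeCacheFlushing).
>>>     Threads::nmethods_do(&set_hotness_closure);
>>>   }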
>>>
>>> We need mark_as_seen_on_stack, so leave that part in.
>>>
>>> Thanks,
>>> Vladimir
>>>
>>>
>>> On 7/31/15 11:48 AM, Srinivas Ramakrishna wrote:
>>>
>>>> Hi Vladimir --
>>>>
>>>> I noticed the increase even with Initial and Reserved set to the default
>>>> of 240 MB, but with actual usage much lower (less than a quarter).
>>>>
>>>> Look at this code path. Note that this is invoked at every safepoint
>>>> (although it says "periodically" in the comment).
>>>> In mark_active_nmethods(), there's a thread iteration in both branches
>>>> of the if. I haven't yet checked which of the two was the culprit here
>>>> (if either).
>>>>
>>>> // Various cleaning tasks that should be done periodically at safepoints
>>>> void SafepointSynchronize::do_cleanup_tasks() {
>>>>   ....
>>>>   {
>>>>     TraceTime t4("mark nmethods", TraceSafepointCleanupTime);
>>>>     NMethodSweeper::mark_active_nmethods();
>>>>   }
>>>>   ..
>>>> }
>>>>
>>>> void NMethodSweeper::mark_active_nmethods() {
>>>>   ...
>>>>   if (!sweep_in_progress()) {
>>>>     _seen = 0;
>>>>     _sweep_fractions_left = NmethodSweepFraction;
>>>>     _current = CodeCache::first_nmethod();
>>>>     _traversals += 1;
>>>>     _total_time_this_sweep = Tickspan();
>>>>
>>>>     if (PrintMethodFlushing) {
>>>>       tty->print_cr("### Sweep: stack traversal %d", _traversals);
>>>>     }
>>>>     Threads::nmethods_do(&mark_activation_closure);
>>>>   } else {
>>>>     // Only set hotness counter
>>>>     Threads::nmethods_do(&set_hotness_closure);
>>>>   }
>>>>
>>>>   OrderAccess::storestore();
>>>> }
>>>>
>>>>
>>>> On Fri, Jul 31, 2015 at 11:43 AM, Vladimir Kozlov
>>>> <vladimir.kozlov at oracle.com> wrote:
>>>>
>>>> Hi Ramki,
>>>>
>>>> Did you fill up the CodeCache? It starts scanning aggressively only when
>>>> the CodeCache is full:
>>>>
>>>>   // Force stack scanning if there is only 10% free space in the
>>>>   // code cache.
>>>>   // We force stack scanning only when the non-profiled code heap gets
>>>>   // full, since critical allocations go to the non-profiled heap and
>>>>   // we must make sure that there is enough space.
>>>>   double free_percent = 1 /
>>>>     CodeCache::reverse_free_ratio(CodeBlobType::MethodNonProfiled) * 100;
>>>>   if (free_percent <= StartAggressiveSweepingAt) {
>>>>     do_stack_scanning();
>>>>   }
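>>>>
>>>> (For intuition: reverse_free_ratio is roughly capacity/free for that
>>>> code heap, so free_percent is just free/capacity * 100. As a
>>>> hypothetical example, a 120 MB non-profiled heap with 9 MB free gives
>>>> free_percent ~= 7.5, below the default StartAggressiveSweepingAt of 10,
>>>> so do_stack_scanning() would kick in.)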
>>>>
>>>> Vladimir
>>>>
>>>> On 7/31/15 11:33 AM, Srinivas Ramakrishna wrote:
>>>>
>>>> Yes.
>>>>
>>>> On Fri, Jul 31, 2015 at 11:31 AM, Vitaly Davidovich
>>>> <vitalyd at gmail.com> wrote:
>>>>
>>>> Ramki, are you running tiered compilation?
>>>>
>>>> sent from my phone
>>>>
>>>> On Jul 31, 2015 2:19 PM, "Srinivas Ramakrishna"
>>>> <ysr1729 at gmail.com> wrote:
>>>>
>>>>
>>>> Hello GC and Compiler teams!
>>>>
>>>> One of our services that runs with several thousand threads recently
>>>> noticed an increase in safepoint stop times, but not GC times, upon
>>>> transitioning to JDK 8.
>>>>
>>>> Further investigation revealed that most of the delta was related to the
>>>> so-called pre-GC/vmop "cleanup" phase when various book-keeping
>>>> activities are performed, and more specifically to the portion that
>>>> walks Java thread stacks single-threaded (!) and updates the hotness
>>>> counters for the active nmethods. This code appears to be new in JDK 8
>>>> (in JDK 7 one would walk the stacks only during code cache sweeps).
>>>>
>>>> I have two questions:
>>>> (1) Has anyone else (typically, I'd expect applications with many
>>>> hundreds or thousands of threads) noticed this regression?
>>>> (2) Can we do better, for example, by:
>>>>     (a) doing these updates by walking thread stacks in multiple worker
>>>>         threads in parallel, or, best of all:
>>>>     (b) doing these updates when we walk the thread stacks during GC,
>>>>         and skipping this phase entirely for non-GC safepoints (with
>>>>         the attendant loss in frequency of this update in
>>>>         low-GC-frequency scenarios).
>>>>
>>>> It seems kind of silly to do GCs with many worker threads, but to do
>>>> these thread stack walks single-threaded when the work is embarrassingly
>>>> parallel (one could predicate the parallelization on the measured stack
>>>> sizes and thread population, if there were concern about the overhead of
>>>> activating and deactivating the thread gangs for the work); see the
>>>> sketch below for the shape option (a) might take.
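>>>>
>>>> A rough sketch of option (a), purely hypothetical and untested -- it
>>>> borrows HotSpot's GC worker-gang idiom (AbstractGangTask), assumes a
>>>> Threads::java_thread_at(i) accessor that doesn't exist under that name,
>>>> and is safe only because the thread list is stable inside a safepoint:
>>>>
>>>>   class ParallelNMethodHotnessTask : public AbstractGangTask {
>>>>     CodeBlobClosure* _cf;     // e.g. &set_hotness_closure
>>>>     volatile jint    _next;   // next thread index to claim
>>>>     int              _count;  // snapshot of the Java thread count
>>>>    public:
>>>>     ParallelNMethodHotnessTask(CodeBlobClosure* cf, int count)
>>>>       : AbstractGangTask("parallel mark nmethods"),
>>>>         _cf(cf), _next(0), _count(count) {}
>>>>     void work(uint worker_id) {
>>>>       // Workers claim thread indices via an atomic counter, so the
>>>>       // stack walks proceed in parallel with no per-thread locking.
>>>>       for (jint i = Atomic::add(1, &_next) - 1; i < _count;
>>>>            i = Atomic::add(1, &_next) - 1) {
>>>>         Threads::java_thread_at(i)->nmethods_do(_cf);
>>>>       }
>>>>     }
>>>>   };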
>>>>
>>>> A followup question: any guesses as to how code cache sweep/eviction
>>>> quality might be compromised if one were to dispense with these hotness
>>>> updates entirely (or do them at a much reduced frequency), as a
>>>> temporary workaround to the performance problem?
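>>>>
>>>> (The reduced-frequency variant could be as simple as the following
>>>> hypothetical guard inside mark_active_nmethods(), where
>>>> HotnessUpdateInterval is a made-up tunable, not an existing flag:
>>>>
>>>>   static int _cleanup_count = 0;
>>>>   if (++_cleanup_count % HotnessUpdateInterval == 0) {
>>>>     // Refresh hotness counters only on every Nth cleanup safepoint,
>>>>     // trading counter staleness for shorter safepoint pauses.
>>>>     Threads::nmethods_do(&set_hotness_closure);
>>>>   }
>>>>
>>>> )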
>>>>
>>>> Thoughts/comments? In particular, has this issue perhaps been addressed
>>>> in newer JVMs?
>>>>
>>>> Thanks for any comments, feedback, pointers!
>>>> -- ramki
>>>>
>>>> PS: For comparison, here's data with +TraceSafepointCleanup from JDK 7
>>>> (first, where this isn't done) vs. JDK 8 (where this is done), with a
>>>> program that has a few thousand threads:
>>>>
>>>> JDK 7:
>>>> ..
>>>> 2827.308: [sweeping nmethods, 0.0000020 secs]
>>>> 2828.679: [sweeping nmethods, 0.0000030 secs]
>>>> 2829.984: [sweeping nmethods, 0.0000030 secs]
>>>> 2830.956: [sweeping nmethods, 0.0000030 secs]
>>>> ..
>>>>
>>>> JDK 8:
>>>> ..
>>>> 7368.634: [mark nmethods, 0.0177030 secs]
>>>> 7369.587: [mark nmethods, 0.0178305 secs]
>>>> 7370.479: [mark nmethods, 0.0180260 secs]
>>>> 7371.503: [mark nmethods, 0.0186494 secs]
>>>> ..
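>>>>
>>>> (Back-of-the-envelope: the JDK 8 "mark nmethods" phase above costs
>>>> ~18 ms, and the timestamps show a safepoint roughly every second, so
>>>> this single-threaded walk alone adds on the order of 2% of wall-clock
>>>> time spent stopped, on top of whatever the safepoints were already
>>>> costing.)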
>>>>
>>