RFR: 8231666: ThreadIdTable::grow() invokes invalid thread transition

Sat Oct 5 03:58:06 UTC 2019

Hi Daniil,

On 5/10/2019 1:23 pm, Daniil Titov wrote:
> Hi David and Robbin,
> 
> Please review a new version of the fix that adds the max size check check_concurrent_work code [1].

That change seems fine.

>>     I don't think you need to repeat the load factor check here:
>>     
>>     void ThreadIdTable::do_concurrent_work(JavaThread* jt) {
>>          assert(_is_initialized, "Thread table is not initialized");
>>          _has_work = false;
>>          double load_factor = get_load_factor();
>>          log_debug(thread, table)("Concurrent work, load factor: %g",
>>     load_factor);
>>          if (load_factor > PREF_AVG_LIST_LEN &&
>>     !_local_table->is_max_size_reached()) {
>>            grow(jt);
>>          }
>>        }
>>     
>>     as we will only execute this code if the load factor was seen to be too
>>     high.
> 
> I decided to leave it unchanged since in my understanding it could be the case when some threads exited and
> were removed from the table after the work was triggered but before the service thread called do_concurrent_work()
> method. In this case we might have the load factor back to the normal and therefore have no need to increase the size
>   of the thread table.

True, but if new threads get added again you could just repeat the 
process. This is a more stable process if you use an "edge trigger" 
rather than a "level trigger". But either way we are making assumptions 
about the pattern of adding and removing threads. So okay to leave as-is.

So, good to go.

Thanks,
David

> Testing: Mach5 tier1, tier2, and tier3 tests passed.
> 
> [1] Webrev:  http://cr.openjdk.java.net/~dtitov/8231666/webrev.03/
> [2] Bug: https://bugs.openjdk.java.net/browse/JDK-8231666
> 
> Thank you,
> Daniil
> 
> On 10/3/19, 11:30 PM, "Robbin Ehn" <robbin.ehn at oracle.com> wrote:
> 
>      Hi Daniil,
>      
>      >
>      > You might also want to put the max size check in the check_concurrent_work code:
>      >
>      > +   // Resize if we have more items than preferred load factor
>      > +   if ( load_factor > PREF_AVG_LIST_LEN && !_local_table->is_max_size_reached()) {
>      >
>      > so that we don't keep waking up the service thread for nothing if the table gets
>      > full.
>      
>      Yes that would be a good, otherwise seems fine.
>      
>      >
>      > Thanks,
>      > David
>      > -----
>      >
>      >> Testing:  Mach5 tier1, tier2, and tier3 tests successfully passed.
>      
>      And if you have not done so, you should test this with the benchmark you have as
>      a stress test and see that this does what we think.
>      
>      Thanks, Robbin
>      
>      
>      >>
>      >> Thank you!
>      >>
>      >> Best regards,
>      >> Daniil
>      >>
>      >> On 10/2/19, 3:26 PM, "David Holmes" <david.holmes at oracle.com> wrote:
>      >>
>      >>      Hi Daniil,
>      >>      On 3/10/2019 2:21 am, Daniil Titov wrote:
>      >>      > Hi David and Robbin,
>      >>      >
>      >>      > Could we consider  making the ServiceThread responsible for the
>      >> ThreadIdTable resizing in the similar way how
>      >>      > it works for  StringTable  and ResolvedMethodTable, rather than having
>      >> ThreadIdTable::add() method calling ThreadIdTable::grow()?
>      >>      > As I understand It should solve  the current  issue and  address the
>      >> concern that  the doing the resizing could be a relatively long and
>      >>      > doing it without polling  for safepoints or while the holding
>      >> Threads_lock is not desirable.
>      >>      I originally rejected copying that part of the code from the other
>      >>      tables as it seems to introduce unnecessary complexity. Having a
>      >>      separate thread trying to grow the table when it could be concurrently
>      >>      having threads added and removed seems like it could introduce hard to
>      >>      diagnose performance pathologies. It also adds what we know to be a
>      >>      potentially long running action to the workload of the service thread,
>      >>      which means it may also impact the other tasks the service thread is
>      >>      doing, thus potentially introducing even more hard to diagnose
>      >>      performance pathologies.
>      >>      So this change does concern me. But go ahead and trial it.
>      >>      Thanks,
>      >>      David
>      >>      > Thank you,
>      >>      > Daniil
>      >>      >
>      >>      >
>      >>      > On 10/2/19, 6:25 AM, "David Holmes" <david.holmes at oracle.com> wrote:
>      >>      >
>      >>      >      Hi Robbin,
>      >>      >
>      >>      >      On 2/10/2019 7:58 pm, Robbin Ehn wrote:
>      >>      >      > Hi David,
>      >>      >      >
>      >>      >      >> What if the table is full and must be grown?
>      >>      >      >
>      >>      >      > The table uses chaining, it just means load factor tip over what is
>      >>      >      > considered a good backing array size.
>      >>      >
>      >>      >      Coleen raised a good question in a separate discussion, which made me
>      >>      >      realize that once the table has been initially populated all
>      >> subsequent
>      >>      >      additions, and hence all subsequent calls to grow() always happen
>      >> with
>      >>      >      the Threads_lock held. So we can't just defer the grow().
>      >>      >
>      >>      >      >> That aside, I'd like to know how expensive it is to grow this
>      >> table.
>      >>      >      >> What are we talking about here?
>      >>      >      >
>      >>      >      > We use global counter which on write_synchronize must scan all
>      >>      >      > threads to make sure they have seen the update (there some
>      >>      >      > optimization to avoid it if there is no readers at all). Since this
>      >>      >      > table contains the threads, we get double penalized, for each new
>      >>      >      > thread the synchronization cost increase AND the number of items.
>      >>      >      >
>      >>      >      > With concurrent reads you still need many thousands of threads, but
>      >>      >      > I think I saw someone mentioning 100k threads, assuming concurrent
>      >>      >      > queries the resize can take hundreds of ms to finish. Note that
>      >> reads
>      >>      >      > and inserts still in operate roughly at the same speed while
>      >>      >      > resizing. So a longer resize is only problematic if we do not
>      >>      >      > respect safepoints.
>      >>      >      I think if anything were capable of running 100K threads we would be
>      >>      >      hitting far worse scalability bottlenecks than this. But this does
>      >> seem
>      >>      >      problematic.
>      >>      >
>      >>      >      Thanks,
>      >>      >      David
>      >>      >      -----
>      >>      >
>      >>      >      > Thanks, Robbin
>      >>      >      >
>      >>      >      >>
>      >>      >      >> David
>      >>      >      >>
>      >>      >      >>> /Robbin
>      >>      >      >>>
>      >>      >      >>> On 2019-10-02 08:46, David Holmes wrote:
>      >>      >      >>>> Hi Daniil,
>      >>      >      >>>>
>      >>      >      >>>> On 2/10/2019 4:13 pm, Daniil Titov wrote:
>      >>      >      >>>>> Please review a change that fixes the issue. The problem
>      >> here is
>      >>      >      >>>>> that that the thread is added to the ThreadIdTable
>      >> (introduced in
>      >>      >      >>>>> [3]) while the Threads_lock is held by
>      >>      >      >>>>> JVM_StartThread. When new thread is added  to the thread
>      >> table the
>      >>      >      >>>>> table checks if its load factor is greater than required and
>      >> if so
>      >>      >      >>>>> it grows itself while polling for safepoints.
>      >>      >      >>>>> After changes [4]  an attempt to block the thread while
>      >> holding the
>      >>      >      >>>>> Threads_lock  results in assertion in
>      >>      >      >>>>> Thread::check_possible_safepoint().
>      >>      >      >>>>>
>      >>      >      >>>>> The fix  proposed by David Holmes ( thank you, David!)  is
>      >> to skip
>      >>      >      >>>>> the ThreadBlockInVM inside ThreadIdTable::grow() method if the
>      >>      >      >>>>> current thread owns the Threads_lock.
>      >>      >      >>>>
>      >>      >      >>>> Sorry but looking at the fix in context now I think it would be
>      >>      >      >>>> better to do this:
>      >>      >      >>>>
>      >>      >      >>>>      while (gt.do_task(jt)) {
>      >>      >      >>>>        if (Threads_lock->owner() == jt) {
>      >>      >      >>>>          gt.pause(jt);
>      >>      >      >>>>          ThreadBlockInVM tbivm(jt);
>      >>      >      >>>>          gt.cont(jt);
>      >>      >      >>>>        }
>      >>      >      >>>>      }
>      >>      >      >>>>
>      >>      >      >>>> This way we don't waste time with the pause/cont when there's no
>      >>      >      >>>> safepoint pause going to happen - and the owner() check is
>      >> quicker
>      >>      >      >>>> than owned_by_self(). That partially addresses a general
>      >> concern I
>      >>      >      >>>> have about how long it may take to grow the table, as we are
>      >>      >      >>>> deferring safepoints until it is complete in this
>      >> JVM_StartThread
>      >>      >      >>>> usecase.
>      >>      >      >>>>
>      >>      >      >>>> In the test you don't need all of:
>      >>      >      >>>>
>      >>      >      >>>>    32  * @run clean ThreadStartTest
>      >>      >      >>>>    33  * @run build ThreadStartTest
>      >>      >      >>>>    34  * @run main ThreadStartTest
>      >>      >      >>>>
>      >>      >      >>>> just the last @run suffices to build and run the test.
>      >>      >      >>>>
>      >>      >      >>>> Thanks,
>      >>      >      >>>> David
>      >>      >      >>>> -----
>      >>      >      >>>>
>      >>      >      >>>>> Testing : Mach 5 tier1 and tier2 completed successfully,
>      >> tier3 is
>      >>      >      >>>>> in progress.
>      >>      >      >>>>>
>      >>      >      >>>>> [1] Webrev:
>      >> http://cr.openjdk.java.net/~dtitov/8231666/webrev.01/
>      >>      >      >>>>> [2] Bug: https://bugs.openjdk.java.net/browse/JDK-8231666
>      >>      >      >>>>> [3] https://bugs.openjdk.java.net/browse/JDK-8185005
>      >>      >      >>>>> [4] https://bugs.openjdk.java.net/browse/JDK-8184732
>      >>      >      >>>>>
>      >>      >      >>>>> Best regards,
>      >>      >      >>>>> Danill
>      >>      >      >>>>>
>      >>      >      >>>>>
>      >>      >
>      >>      >
>      >>      >
>      >>
>      >>
>      
> 
>