RFR(M) 8041984: CompilerThread seems to occupy all CPU in a very rare situation

Vladimir Kozlov vladimir.kozlov at oracle.com
Wed Oct 22 23:06:11 UTC 2014


Thank you, Igor

On 10/21/14 11:05 PM, Igor Veresov wrote:
> Alright, since nobody is taking it on... The timeout feature looks good;
> I suppose a JIT needs to be adaptive to available CPU resources.
>
> A couple of small questions:
>
> 1. Why is it necessary to add phantom_obj to worklists?
>
>    130   ptnodes_worklist.append(phantom_obj);
>    131   java_objects_worklist.append(phantom_obj);

To avoid duplicated entries for phantom_obj on these lists.

Even though the processed ideal nodes are unique on the ideal_nodes list, 
several ideal nodes are mapped to phantom_obj:

       case Op_CreateEx: {
         // assume that all exception objects globally escape
         map_ideal_node(n, phantom_obj);
         break;

As a result, ptnode_adr(n->_idx) will return phantom_obj several times and 
it would be appended to the above lists more than once.

I added the following comment:

   // Processed ideal nodes are unique on ideal_nodes list
   // but several ideal nodes are mapped to the phantom_obj.
   // To avoid duplicated entries on the following worklists
   // add the phantom_obj only once to them.
   ptnodes_worklist.append(phantom_obj);
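
To make the point concrete, here is a small standalone sketch (toy types, 
not the HotSpot classes) of why the single up-front append is enough: two 
ideal nodes map to the same phantom object, so appending it inside the 
per-node loop would create a duplicate entry.

   #include <cstdio>
   #include <vector>

   struct PTNode { const char* name; };   // stand-in for PointsToNode

   int main() {
     PTNode phantom = { "phantom_obj" };
     PTNode obj1    = { "obj1" };
     PTNode obj2    = { "obj2" };
     // Ideal nodes are unique, but two CreateEx nodes map to phantom_obj.
     std::vector<PTNode*> mapped = { &obj1, &phantom, &obj2, &phantom };

     std::vector<PTNode*> ptnodes_worklist;
     ptnodes_worklist.push_back(&phantom);   // appended only once, up front
     for (PTNode* pt : mapped) {
       if (pt == &phantom) continue;         // skip the shared sentinel
       ptnodes_worklist.push_back(pt);
     }
     printf("%zu entries\n", ptnodes_worklist.size());  // prints 3
     return 0;
   }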

>
> 2. What does pidx_bias do here?
>
>   399   inline void add_to_worklist(PointsToNode* pt) {
>   400     PointsToNode* ptf = pt;
>   401     uint pidx_bias = 0;
>   402     if (PointsToNode::is_base_use(pt)) {
>   403       ptf = PointsToNode::get_use_node(pt)->as_Field();
>   404       pidx_bias = _next_pidx;
>   405     }
>   406     if (!_in_worklist.test_set(ptf->pidx() + pidx_bias)) {
>   407       _worklist.append(pt);
>   408     }
>   409   }

It is complicated :(

To avoid creating a separate worklist in PointsToNode, some edges are 
encoded specially (the low bit is set in the pointer):

   // Mark base edge use to distinguish from stored value edge.
   bool add_base_use(FieldNode* use) {
     return _uses.append_if_missing((PointsToNode*)((intptr_t)use + 1));
   }

It is used in add_java_object_edges():

     PointsToNode* use = _worklist.at(l);
     if (PointsToNode::is_base_use(use)) {
       // Add reference from jobj to field and from field to jobj (field's base).
       use = PointsToNode::get_use_node(use)->as_Field();
       if (add_base(use->as_Field(), jobj)) {
         new_edges++;

So the _worklist may contain two pointers to the same node: one with the bit 
set and one without. add_to_worklist() wants to make sure that only one entry 
of each kind is present on the _worklist, so the VectorSet _in_worklist needs 
two separate entries, one for each kind of pointer.

I added the new _pidx index to PointsToNode because I did not want the 
VectorSet _in_worklist to be big - the number of created EA nodes is usually 
only about 10% of the number of Ideal nodes. _next_pidx is the count of all 
created EA nodes, so adding it as a bias guarantees a unique entry in the 
VectorSet.
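
To illustrate how the two pieces fit together, here is a self-contained 
sketch (simplified names and a plain bit vector, not the actual HotSpot 
classes) of the low-bit tagging and the pidx bias:

   #include <cstdint>
   #include <cstdio>
   #include <vector>

   struct Node { unsigned pidx; };   // dense index, 0 .. next_pidx-1

   static bool  is_base_use(Node* n)   { return ((uintptr_t)n & 1) != 0; }
   static Node* mark_base_use(Node* n) { return (Node*)((uintptr_t)n + 1); }
   static Node* get_use_node(Node* n)  { return (Node*)((uintptr_t)n & ~(uintptr_t)1); }

   struct Worklist {
     unsigned           next_pidx;     // number of EA nodes created so far
     std::vector<bool>  in_worklist;   // plays the role of the VectorSet
     std::vector<Node*> worklist;

     explicit Worklist(unsigned n) : next_pidx(n), in_worklist(2 * n, false) {}

     void add(Node* pt) {
       Node*    ptf  = pt;
       unsigned bias = 0;
       if (is_base_use(pt)) {
         ptf  = get_use_node(pt);
         bias = next_pidx;             // tagged form gets its own bitmap slot
       }
       if (!in_worklist[ptf->pidx + bias]) {   // test_set() equivalent
         in_worklist[ptf->pidx + bias] = true;
         worklist.push_back(pt);       // keep the (possibly tagged) pointer
       }
     }
   };

   int main() {
     Node field = { 7 };
     Worklist wl(10);                 // 10 EA nodes exist in this example
     wl.add(&field);                  // normal edge to the field
     wl.add(mark_base_use(&field));   // base-use edge to the same field
     wl.add(&field);                  // duplicate, filtered out
     printf("worklist size = %zu\n", wl.worklist.size());  // prints 2
     return 0;
   }

The bitmap is sized 2 * next_pidx, so its size stays proportional to the 
number of EA nodes rather than the number of Ideal nodes.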

I will add the following comment to add_to_worklist():

     if (PointsToNode::is_base_use(pt)) {
       // Create a separate entry in _in_worklist for a marked base edge
       // because _worklist may have an entry for a normal edge pointing
       // to the same node. To separate them use _next_pidx as bias.
       ptf = PointsToNode::get_use_node(pt)->as_Field();
       pidx_bias = _next_pidx;


I hope I answered your questions :)

Thanks,
Vladimir

>
>
> Thanks,
> igor
>
>
> On Oct 21, 2014, at 11:32 AM, Vladimir Kozlov
> <vladimir.kozlov at oracle.com> wrote:
>
>> Please, I need reviews.
>>
>> Thanks,
>> Vladimir
>>
>> On 10/14/14 4:52 PM, Vladimir Kozlov wrote:
>>> I updated changes.
>>>
>>> http://cr.openjdk.java.net/~kvn/8041984/webrev.01/
>>>
>>> As David suggested, I added a SAMPLE_SIZE constant:
>>>
>>> #define SAMPLE_SIZE 4
>>>         if ((next % SAMPLE_SIZE) == 0) {
>>>           // Each 4 iterations calculate how much time it will take
>>>           // to complete graph construction.
>>>           time.stop();
>>>           double stop_time = time.seconds();
>>>           double time_per_iter = (stop_time - start_time) /
>>> (double)SAMPLE_SIZE;
>>>           double time_until_end = time_per_iter *
>>> (double)(java_objects_length - next);
>>>           if ((start_time + time_until_end) >= EscapeAnalysisTimeout) {
>>>
>>> I also replaced the TIME_LIMIT constant with a flag:
>>>
>>>   product(double, EscapeAnalysisTimeout, 20. DEBUG_ONLY(+40.),      \
>>>           "Abort EA when it reaches time limit (in sec)")      \
>>>
>>> Thanks,
>>> Vladimir
>>>
>>> On 10/14/14 1:40 PM, Vladimir Kozlov wrote:
>>>> On 10/14/14 10:30 AM, David Chase wrote:
>>>>> I’m curious about the granularity of the completion-prediction —
>>>>> you’re checking every 4 iterations and extrapolating from there.
>>>>> Do we know that the times are that uniform, or would we be better off
>>>>> with larger sample sizes?
>>>>
>>>> Yes, I observed uniformly, slowly growing times for PARTS of the iterations
>>>> and not for all iterations. That is why I want to take several samples.
>>>> For example, the first iterations were fast, but when array allocation
>>>> objects began to be processed the times grew rapidly.
>>>> If I take a very big sample, the fast iterations affect the result and
>>>> delay the abort.
>>>>
>>>> I did an implementation similar to your "alternate approach" suggestion
>>>> and saw that the abort happened much later than with my current approach.
>>>>
>>>>> (How long does a single iteration take, anyway?)
>>>>
>>>> As I said, one slow iteration may take up to 1 sec. So I want to abort as
>>>> soon as I see such a case (but allow for spikes). Taking larger sample
>>>> sizes may delay the abort for several seconds.
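>>>>
>>>> For example (rough numbers, just to illustrate): with a 20 sec budget,
>>>> if 28 fast iterations take 0.01 sec each and the next 4 take 1 sec each,
>>>> a sample of 4 projects ~1 sec per remaining iteration and aborts right
>>>> away, while an average over all 32 iterations (~0.13 sec per iteration)
>>>> can still predict that the build finishes in time and keeps going.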
>>>>
>>>>>
>>>>> And in the sample code, I see the choice of sample size (4) encoded as
>>>>>   (next & 3) == 0
>>>>>   // Each 4 iterations calculate how much time it will take
>>>>>   double time_per_iter = (stop_time - start_time) * 0.25;
>>>>>
>>>>> why not write this as (say)
>>>>>
>>>>>   static const int LOG_SAMPLE_SIZE = 2;
>>>>>   static const int SAMPLE_SIZE = (1 << LOG_SAMPLE_SIZE);
>>>>>   next % SAMPLE_SIZE == 0   // next could be a uint, couldn’t it?
>>>>>   double time_per_iter = (stop_time - start_time) * (1.0/SAMPLE_SIZE);
>>>>>
>>>>> perhaps rewriting 1.0/SAMPLE_SIZE as another static const as necessary
>>>>> to get it properly precalculated; I’m not sure what C++ compilers are
>>>>> doing nowadays.
>>>>
>>>> I am optimizing :) by avoiding % and / operations. But I agree and can
>>>> add a SAMPLE_SIZE constant so the code is clearer.
>>>>
>>>> Thanks,
>>>> Vladimir
>>>>
>>>>>
>>>>> An alternate approach would be to compute the overall rate thus far,
>>>>> checking at
>>>>> (say) iterations 4, 8, 16, 32, etc and computing the average from
>>>>> there, starting
>>>>> at whatever power of two is a decent fraction of the iterations needed
>>>>> to timeout.
>>>>>
>>>>> e.g., don’t reset start_time, just compute elapsed at each sample:
>>>>>
>>>>> if ((next & (ptwo - 1)) == 0) {
>>>>>   time.stop();
>>>>>   double elapsed = time.seconds() - start_time;
>>>>>   double projected = elapsed * (double) java_objects_length / next;
>>>>>
>>>>> Or you can compute a java_objects_length that would surely lead to
>>>>> failure, and
>>>>> always do that (fast, integer) comparison:
>>>>>   ...
>>>>>   unsigned int length_bound = (int) next * (CG_BUILD_TIME_LIMIT / elapsed);
>>>>>   if (java_objects_length > length_bound) { … timeout
>>>>>
>>>>> David
>>>>>
>>>>> On 2014-10-10, at 11:41 PM, Vladimir Kozlov
>>>>> <vladimir.kozlov at oracle.com> wrote:
>>>>>
>>>>>> https://bugs.openjdk.java.net/browse/JDK-8041984
>>>>>> http://cr.openjdk.java.net/~kvn/8041984/webrev/
>>>>>>
>>>>>> The following method brings C2's EA to its knees:
>>>>>>
>>>>>> com.graphhopper.coll.GHLongIntBTree$BTreeEntry::put()
>>>>>>
>>>>>> https://github.com/karussell/graphhopper/blob/master/core/src/main/java/com/graphhopper/coll/GHLongIntBTree.java
>>>>>>
>>>>>>
>>>>>>
>>>>>> It shows that the current EA connection graph build code is very
>>>>>> inefficient when the graph has a lot of connections (edges).
>>>>>>
>>>>>> With inlining, the BTreeEntry::put() method has about 100 allocations,
>>>>>> of which about 80 are array allocations. Most of them are object
>>>>>> arrays generated by Arrays.copyOf(). After the number of edges in the
>>>>>> EA connection graph reaches about 50000, it takes more than 1 sec
>>>>>> to find all nodes which reference an allocation.
>>>>>>
>>>>>> The current EA code has a timeout bailout in the graph building code.
>>>>>> But the timeout (30 sec for a product VM) is checked only after all
>>>>>> object nodes are processed, so it may take > 2 mins before EA is aborted:
>>>>>>
>>>>>> <phase name='connectionGraph' nodes='18787' live='10038'
>>>>>> stamp='4.417'>
>>>>>> <connectionGraph_bailout reason='reached time limit'/>
>>>>>> <phase_done name='connectionGraph' nodes='18787' live='10038'
>>>>>> stamp='150.967'/>
>>>>>>
>>>>>> Also, the ConnectionGraph::add_java_object_edges() method does not check
>>>>>> for the presence of a node when it adds a new one to the worklist (in the
>>>>>> add_to_worklist() method), which leads to hundreds of MB of memory used
>>>>>> by EA when BTreeEntry::put() is processed.
>>>>>>
>>>>>> To address these issues, new timeout checks were added.
>>>>>> A new _pidx index field is added to the graph's nodes, and a VectorSet
>>>>>> is used to reduce the size of the worklist in the
>>>>>> add_java_object_edges() method.
>>>>>>
>>>>>> The escaping node CreateEx is mapped to phantom_obj to reduce the
>>>>>> number of processed objects in the connection graph. We do the same
>>>>>> for other escaping objects.
>>>>>>
>>>>>> Tested with graphhopper server application:
>>>>>>
>>>>>> <phase name='connectionGraph' nodes='18787' live='10038'
>>>>>> stamp='6.829'>
>>>>>> <connectionGraph_bailout reason='reached time limit'/>
>>>>>> <phase_done name='connectionGraph' nodes='18787' live='10038'
>>>>>> stamp='22.355'/>
>>>>>>
>>>>>> Thanks,
>>>>>> Vladimir
>>>>>
>

