RFR: 8324649: Shenandoah: refactor implementation of free set [v2]
Kelvin Nilsen
kdnilsen at openjdk.org
Fri Jan 26 18:47:35 UTC 2024
On Thu, 25 Jan 2024 20:45:50 GMT, Kelvin Nilsen <kdnilsen at openjdk.org> wrote:
>> Several objectives:
>> 1. Reduce humongous allocation failures by segregating regular regions from humongous regions
>> 2. Do not retire a region just because an allocation failed within it when the memory remaining in the region is still large enough to hold a LAB (see the sketch below)
>> 3. Track range of empty regions in addition to range of available regions in order to expedite humongous allocations
>> 4. Treat collector reserves as available for Mutator allocations after evacuation completes
>> 5. Improve encapsulation so as to enable an OldCollector reserve for future integration of generational Shenandoah
>>
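>> As a rough illustration of points 2 and 3, here is a minimal, self-contained sketch (not code from this patch; the names, the data layout, and the LAB threshold are all illustrative):
>>
>> #include <cstddef>
>>
>> struct Region {
>>   size_t capacity;                                  // bytes in the region
>>   size_t used;                                      // bytes already allocated
>>   size_t free() const { return capacity - used; }
>>   bool is_empty() const { return used == 0; }
>> };
>>
>> class FreeSetSketch {
>>   static constexpr size_t MinLabBytes = 2048;       // illustrative LAB minimum
>>
>>   Region* _regions;
>>   size_t  _num_regions;
>>   size_t  _empty_leftmost;                          // tracked interval that may
>>   size_t  _empty_rightmost;                         //  hold completely empty regions
>>
>>  public:
>>   FreeSetSketch(Region* regions, size_t n)
>>     : _regions(regions), _num_regions(n),
>>       _empty_leftmost(0), _empty_rightmost(n == 0 ? 0 : n - 1) { }
>>
>>   // Point 2: retire a region (remove it from the free set) only when what
>>   // remains in it is too small to hold a LAB, not merely because one
>>   // allocation attempt failed.
>>   bool should_retire(const Region& r) const {
>>     return r.free() < MinLabBytes;
>>   }
>>
>>   // Point 3: a humongous allocation needs 'needed' contiguous empty
>>   // regions; scan only the tracked empty interval, not the whole heap.
>>   bool find_contiguous_empty(size_t needed, size_t& start_out) const {
>>     size_t run = 0;
>>     for (size_t i = _empty_leftmost; i <= _empty_rightmost && i < _num_regions; i++) {
>>       run = _regions[i].is_empty() ? run + 1 : 0;
>>       if (run == needed) {
>>         start_out = i + 1 - needed;
>>         return true;
>>       }
>>     }
>>     return false;                                   // no contiguous empty run found
>>   }
>> };
>>
>> In the real change the empty-region interval is maintained as regions are allocated and recycled; the sketch only shows why tracking it separately from the broader available-region interval shortens the search for humongous allocations.
>>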
>> On internal performance pipelines, this change shows:
>>
>> 1. Some increase in page faults and rss_max with certain workloads, presumably because of the segregation of humongous from regular regions.
>> 2. An increase in system CPU time on certain benchmarks: sunflow (+165%), scimark.sparse.large (+50%), lusearch (+43%). This increase appears to correlate with the increased page faults and/or rss.
>> 3. An increase in trigger_failure for the hyperalloc_a2048_o4096 experiment (not yet understood)
>> 4. 2-30x improvements on multiple metrics of the Extremem phased workload latencies (most likely resulting from fewer degenerated or full GCs)
>>
>> Shenandoah
>> -------------------------------------------------------------------------------------------------------
>> +166.55% scimark.sparse.large/minor_page_fault_count p=0.00000
>> Control: 819938.875 (+/-5724.56 ) 40
>> Test: 2185552.625 (+/-26378.64 ) 20
>>
>> +166.16% scimark.sparse.large/rss_max p=0.00000
>> Control: 3285226.375 (+/-22812.93 ) 40
>> Test: 8743881.500 (+/-104906.69 ) 20
>>
>> +164.78% sunflow/cpu_system p=0.00000
>> Control: 1.280s (+/- 0.10s ) 40
>> Test: 3.390s (+/- 0.13s ) 20
>>
>> +149.29% hyperalloc_a2048_o4096/trigger_failure p=0.00000
>> Control: 3.259 (+/- 1.46 ) 33
>> Test: 8.125 (+/- 2.05 ) 20
>>
>> +143.75% pmd/major_page_fault_count p=0.03622
>> Control: 1.000 (+/- 0.00 ) 40
>> Test: 2.438 (+/- 2.59 ) 20
>>
>> +80.22% lusearch/minor_page_fault_count p=0.00000
>> Control: 2043930.938 (+/-4777.14 ) 40
>> Test: 3683477.625 (+/-5650.29 ) 20
>>
>> +50.50% scimark.sparse.small/minor_page_fault_count p=0.00000
>> Control: 697899.156 (+/-3457.82 ) 40
>> Test: 1050363.812 (+/-175...
>
> Kelvin Nilsen has updated the pull request incrementally with one additional commit since the last revision:
>
> Remove unnecessary change related to debugging
I've been trying to understand the reported trigger-failure regression on hyperalloc, but I cannot reproduce that result. The host on which I'm experimenting apparently has more cores than the host that runs the pipeline, so I have to push hyperalloc to higher allocation rates and higher memory utilization in order to see any trigger failures at all. On my host, I see a small number of trigger failures beginning at a 6144 KB/s allocation rate with 4096 MB of live memory out of a 10 GB heap. On this workload, I see slightly more trigger failures with the original free set implementation than with the new one. Here are the results of all my recent experiments:

-------------
PR Comment: https://git.openjdk.org/jdk/pull/17561#issuecomment-1912522975