RFR: 8316142: Enable parallelism in vmTestbase/nsk/monitoring/stress/lowmem tests

Fri Sep 15 06:04:39 UTC 2023

On Thu, 14 Sep 2023 08:00:56 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:

>>>  and consume the usual amount of memory.
>> 
>> And how much is that? And at what concurrency level will we not be able to run these tests in parallel without potentially impacting the way they run i.e. running out of memory sooner than expected?
>> 
>> I'm concerned that this set of PRs to remove exclusive testing is going to cause a headache for those of us who have to monitor and triage CI testing. If I see one of these tests fail after this change goes in, there is nothing to give me any hint as to what has changed - no git log for the test file will show me something was modified!
>
>> > and consume the usual amount of memory.
>> 
>> And how much is that? And at what concurrency level will we not be able to run these tests in parallel without potentially impacting the way they run i.e. running out of memory sooner than expected?
> 
> They run at the standard heap sizes for the tests, driven by `MaxRAMPercentage` set up by the build system. On my 18-core test servers, most of them run with ~700 MB RSS, sometimes peaking at ~1.1 GB. AFAICS, this is a common RSS for VM/GC tests. These tests eat Java heap / class memory and exit as soon as they catch OOME or load all the classes. The extended parallelism might delay that a bit, but I don't see this manifesting in practice.
> 
>> I'm concerned that this set of PRs to remove exclusive testing is going to cause a headache for those of us who have to monitor and triage CI testing. If I see one of these tests fail after this change goes in, there is nothing to give me any hint as to what has changed - no git log for the test file will show me something was modified!
> 
> True. That's one of the reasons to avoid external test configs, whether it is `TEST.properties` near the tests, or the settings in global suite `TEST.ROOT`.
> 
> There are two bonus points from maintenance perspective:
> 
>  1. (technical) Note that the current `exclusiveDirs` limit only the _in-group_ parallelism. This means there is a random chance that something else is running concurrently with these tests, if that test is outside of this test group. So it is not like we are deciding whether these tests should run in complete resource isolation from everything else -- they already are not isolated. Which means that if tests experience resource starvation, it would manifest pretty randomly, depending on what had been running in parallel. Unblocking the _in-group_ parallelism allows us to make these conditions manifest more reliably. Which, I argue, benefits test maintainability: if tests can fail due to resource starvation, they would do so more often than once in a blue moon. We verify that this is unlikely to happen by stress-testing multiple iterations of these tests.
>  
>  2. (organizational) Due to these parallelism blockages, `tier4` is remarkably slow. It is >10x slower than `tier3`, for example, and it gets worse the more untapped parallelism there is on the machine. Which is why I see that both ad-hoc developer and vendor testing pipelines do not run `tier4` as frequently as they run `tier{1,2,3}`. Making `tier4` more parallel, and thus faster to run, me...

> @shipilev thanks for the broader context, but what platforms and configurations are you actually testing on?

Mostly 16..32-core x86_64 and AArch64 EC2 instances, similar to where the bulk of our testing runs. Testing with fastdebug binaries, sometimes juggling the GC and JIT selections.
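
For readers unfamiliar with this kind of test, here is a minimal sketch of the "eat Java heap and exit on OOME" pattern mentioned in the quoted reply. It is not the actual `nsk/monitoring/stress/lowmem` code: the class name `HeapEater`, the method `eat`, the chunk size, and the cap parameter are all hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not taken from the vmTestbase sources.
public class HeapEater {

    /**
     * Allocates chunkBytes-sized arrays and keeps them reachable until either
     * the next chunk would exceed capBytes or the JVM throws OutOfMemoryError.
     * Returns the number of bytes successfully allocated.
     */
    static long eat(int chunkBytes, long capBytes) {
        List<byte[]> hoard = new ArrayList<>();
        long allocated = 0;
        try {
            while (allocated + chunkBytes <= capBytes) {
                hoard.add(new byte[chunkBytes]);
                allocated += chunkBytes;
            }
        } catch (OutOfMemoryError e) {
            // Drop the hoard so the caller has heap left to report and exit.
            hoard.clear();
        }
        return allocated;
    }

    public static void main(String[] args) {
        // Max heap is whatever -Xmx or -XX:MaxRAMPercentage resolved to.
        long maxHeap = Runtime.getRuntime().maxMemory();
        // Cap in MiB; the 64 MiB default is an arbitrary choice to keep a casual run cheap.
        long capMiB = args.length > 0 ? Long.parseLong(args[0]) : 64;
        long got = eat(1 << 20, capMiB << 20);
        System.out.println("max heap bytes: " + maxHeap);
        System.out.println("allocated bytes: " + got);
    }
}
```

Running it with a cap larger than the heap, for example `java -XX:MaxRAMPercentage=25 HeapEater.java 1000000`, should drive it to OOME; how quickly that happens depends on how much heap the JVM was given, which is the quantity the concurrency concern above is about.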

-------------

PR Comment: https://git.openjdk.org/jdk/pull/15689#issuecomment-1720723033