RFR: 8272807: Permit use of memory concurrent with pretouch

Fri Sep 10 13:38:17 UTC 2021

On 2021-09-09 00:19, Kim Barrett wrote:
>> On Sep 8, 2021, at 10:07 AM, Stefan Karlsson <stefank at openjdk.java.net> wrote:
>>
>> On Thu, 2 Sep 2021 18:33:56 GMT, Kim Barrett <kbarrett at openjdk.org> wrote:
>>
>>> Note that this PR replaces the withdrawn https://github.com/openjdk/jdk/pull/5215.
>>>
>>> Please review this change which adds os::touch_memory, which is similar to
>>> os::pretouch_memory but allows concurrent access to the memory while it is
>>> being touched.  This is accomplished by using an atomic add of zero as the
>>> operation for touching the memory, ensuring the virtual location is backed
>>> by physical memory while not changing any values being read or written by
>>> other threads.
>>>
>> I think it would be prudent to separate this PR into two separate PRs.
> I don’t see any benefit to that, but I can do it.

There reason why I suggest that this gets split into two PRs is that 
there seems to be at least two different reasons for changing the code, 
and by mixing them into one PR it makes the review process unfocused. By 
mixing cleanups, fixes, and features in one PR, I as a reviewer must 
also spend more time trying to figure out what changes are necessary for 
what reason. I'm personally more likely to review smaller PRs that have 
a clear focus.

>
>> 1) For the changes to the pretouch loop.
>>
>> * I'm not sure we need all the added safeguarding, which, as you say, complicate the code. I think a few asserts would be good enough.
> It’s not; see below.
>
>> I don't think the function needs to:
>> - work if the caller pass in end < start. Sounds like an assert.
> There's already an assert for end < start.
>
> The test currently in the code is for start < end, e.g. is the range
> non-empty.  I originally tried asserting a non-empty range, but that seems
> such show up sometimes.  I didn't try to figure out whether those could be
> eliminated, though sometimes empty ranges show up pretty naturally and
> range-based operations typically permit the range to be empty.
>
> I found the empty range check up front simplified the analysis of the
> protected code, since certain kinds of errors simply can't happen because of
> it.  Unless empty ranges are forbidden, something needs to be done somewhere
> to prevent writing outside the range.

Before continuing the discussing around these checks, I'd like to 
understand the motivation for some of the changes. I'm specifically 
looking at the alignment of sizeof(int). I would have expected that all 
'start' addresses being passed to the "touch" code to be at least 
aligned to os::vm_page_size(). Straying away from that seems to add 
complexity, that I'm not sure is worth having. Do you have a specific 
use-case that needs this? Or could we simplify the code by adding an 
assert(is_aligned(start, os::vm_page_size), "")? If not, could you point 
me to the code that doesn't conform to that?

>
>> - Safeguard that end + page doesn't overflow. Do we ever hand out pages at the end of the virtual address range? There's a lot of HotSpot code that assumes that we never get that kind of memory, so is it worth protecting against this compared to the extra complexity it gives? Previously, I could take a glance at that function and understand it. Now I need to pause and stare at it a bit.
> I suspect overflow can't happen on at least some 64bit platforms, but I
> don't know of anything that would prevent it on a 32bit platform. And I had
> exactly the opposite reaction from you to the old code. I looked at it and
> immediately wondered what might happen on pointer overflow, and questioned
> whether I understood the code.
>
> And as noted in the PR description, the old code for PretouchTask checks for
> overflow, but does so incorrectly, such that the check could be optimized
> away (or lead to other problems) because it would involve UB to fail the check.

I don't know of any platform that hands out pages at the top of the 
address range. (Not saying that I know the details of all platforms).  
However, maybe this argument is more appealing: If the program got hold 
of a memory address at the end of the address range, then I think C++'s 
pointer comparison would break down. There's a section that states:
"If one pointer points to an element of an array, or to a subobject 
thereof, and another pointer points one past the last element of the 
array, the latter pointer compares greater."

So, if we have a 4K byte array at 0xfffff000, then the start address 
would obviously be 0xfffff000, but "one past the last element" would be 
0xfffff000 + 0x1000, which would overflow, likely resulting in the 
address 0x0 (barring UB issues). If that happens, then it seems like the 
statement above is at risk.

>
>> 2) For the new infrastructure.
>>
>> * I'm not sure we should add TouchTask until we have a concrete use-case (with prototype/PR) for it. Without that, it is hard to gauge the usability of this feature. How do other threads proceed concurrently with the worker threads? What is the impact of using TouchTask? One of the concerns that has been brought up by others is that using a WorkGang will block the rest of the JVM from safepointing while the workers are pre-touching. A use-case would clarify this, I think.
> Here is a description of a specific example from my experiments. When the G1
> allocator runs out of regions that were pre-allocated for eden and can't GC
> because of the GCLocker, it tries to allocate a new region, first from the
> free region list, and then an actually new region (assuming that's possible
> within the heap size constraints). In the latter case, when AlwaysUsePretouch
> is true, it currently does a *single-threaded* pretouch of the region and
> ancillary memory. (It's single-threaded because most of the code path involved
> is shared with during-GC code that is running under the work gang. That code
> is a good candidate for os::touch_memory, so adding a gang argument through
> that call tree that is null to indicate don't pretouch doesn't seem like an
> improvement.) It could instead do a parallel concurrent touch after making the
> new region available for other mutator threads to use for allocation. Yes, it
> blocks safepoints, but that's already true for the existing code, and the
> existing code may block them for *much* longer due to the unparallelized
> pretouch. The downside of using a workgang here is the gang threads could be
> competing with mutator threads for cores. I think parallelizing here is likely
> beneficial though. I think there's a similar situation in ParallelGC, though I
> haven't chased through all of that code carefully yet.

Thanks for the use-case.

Maybe the WorkGang is an easy and good enough approach to start with, 
even with its use of a safepoint-blocking mechanism. Is this feature 
going to be turned off by default?

I have the feeling that for small enough requests, it will be better to 
let the allocating thread "concurrent touch" the memory. Other Java 
threads that allocate small objects, will probably get the paged in 
memory faster than if we spawn up worker threads. It would also be 
interesting to see an approach that touches memory in an async thread 
(though it needs to cooperate with the JVM so that the memory isn't 
uncommitted).

>
>> * Is the usage of template specialization really needed? Wouldn't it be sufficient to pass down a bool parameter instead? I doubt we would see any performance difference by doing that.
> The use of a template here seemed pretty straight-forward when I wrote it,
> but I agree a boolean argument would work just as well and is more obvious
> to read. I'll change that.  I expect the same generated code either way.
>
>> * It's a little bit confusing that in the context of TouchTask "touch" means "concurrent touch", while in the touch_memory_* it means either concurrent or non-concurrent. Maybe rename TouchTask to something to make this distinction a bit clearer.
> I'm open to alternative naming. I thought ConcurrentTouchTask was rather
> long. My experiments repo currently uses PretouchTask::concurrent_touch, but
> I don't like that naming either.
>
> The rationale for "touch" vs "pretouch" is "touching" is the primary generic
> concept, and touch_memory can be used anywhere. Meanwhile "pretouching" is now
> a restricted variant that might have better performance under those limitions.
> Similarly for TouchTask vs PretouchTask. The static functions are just shared
> helpers for the API functions; "touch_memory" is the "primary" thing.

I think ConcurrentTouchTask is a good and descriptive name that will aid 
the readability.

StefanK

>