RFR: 8272807: Permit use of memory concurrent with pretouch

Kim Barrett kim.barrett at oracle.com
Sun Sep 12 19:44:40 UTC 2021


> On Sep 10, 2021, at 9:38 AM, Stefan Karlsson <stefan.karlsson at oracle.com> wrote:
> 
> On 2021-09-09 00:19, Kim Barrett wrote:
>>> On Sep 8, 2021, at 10:07 AM, Stefan Karlsson <stefank at openjdk.java.net> wrote:
>>> 
>>> On Thu, 2 Sep 2021 18:33:56 GMT, Kim Barrett <kbarrett at openjdk.org> wrote:
>>> 
>>>> Note that this PR replaces the withdrawn https://github.com/openjdk/jdk/pull/5215.
>>>> 
>>>> Please review this change which adds os::touch_memory, which is similar to
>>>> os::pretouch_memory but allows concurrent access to the memory while it is
>>>> being touched.  This is accomplished by using an atomic add of zero as the
>>>> operation for touching the memory, ensuring the virtual location is backed
>>>> by physical memory while not changing any values being read or written by
>>>> other threads.
>>>> 
>>> I think it would be prudent to separate this PR into two separate PRs.
>> I don’t see any benefit to that, but I can do it.
> 
> The reason why I suggest that this gets split into two PRs is that there seem to be at least two different reasons for changing the code, and mixing them into one PR makes the review process unfocused. By mixing cleanups, fixes, and features in one PR, I as a reviewer must also spend more time trying to figure out which changes are necessary for what reason. I'm personally more likely to review smaller PRs that have a clear focus.

I expected that if I only proposed adding the serial version that one of the
first things any reviewer would ask is why I wasn't also adding the parallel
version.  (But maybe you wouldn't have asked that.)

I could have separated the restructuring of the pretouch loop as a
preliminary change.  Similarly, I could have separated the restructuring of
the parallel pretouch chunk claim as a preliminary change.  I thought about
doing that (and I'm guessing you would have preferred it), but neither
seemed that useful as I'd be immediately refactoring the code anyway.

I could have added concurrent touching support and some uses of it, but
decided to first do a PR devoted to just the infrastructure, leaving usage to
later, where the usage changes would be the focus.

Out of the many possible ways to slice things up, I picked one that seemed to
me to be coherent and of a reasonable (not too large) size.

If you need a different slicing before reviewing this, please make a specific
request.

>>> 1) For the changes to the pretouch loop.
>>> 
>>> * I'm not sure we need all the added safeguarding, which, as you say, complicates the code. I think a few asserts would be good enough.
>> It’s not; see below.
>> 
>>> I don't think the function needs to:
>>> - work if the caller passes in end < start. Sounds like an assert.
>> There's already an assert for end < start.
>> 
>> The test currently in the code is for start < end, i.e. whether the range
>> is non-empty.  I originally tried asserting a non-empty range, but empty
>> ranges seem to show up sometimes.  I didn't try to figure out whether those
>> could be eliminated, though sometimes empty ranges show up pretty naturally,
>> and range-based operations typically permit the range to be empty.
>> 
>> I found the empty range check up front simplified the analysis of the
>> protected code, since certain kinds of errors simply can't happen because of
>> it.  Unless empty ranges are forbidden, something needs to be done somewhere
>> to prevent writing outside the range.
> 
> Before continuing the discussion around these checks, I'd like to understand the motivation for some of the changes. I'm specifically looking at the alignment of sizeof(int). I would have expected all 'start' addresses being passed to the "touch" code to be at least aligned to os::vm_page_size(). Straying away from that seems to add complexity that I'm not sure is worth having. Do you have a specific use-case that needs this? Or could we simplify the code by adding an assert(is_aligned(start, os::vm_page_size), "")? If not, could you point me to the code that doesn't conform to that?

I *think* all existing dynamic uses ensure the start (and I think end too) are
at least vm_page_size() aligned. There's at least one static use that
definitely doesn't: BitMap::pretouch, of which there appear to be no callers
anymore.

The existing pretouch implementation doesn't require (as in assert or
comments) any particular alignment, and will complete without error (subject
to any overflow issues) regardless of the alignments of the arguments. It is
documented as possibly not touching the last page if the arguments aren't
(presumably page) aligned.

So it appears that pretouching has always permitted arguments that aren't page
aligned. Many of the pretouch callers perform explicit page alignment of the
arguments from values that might not be so aligned. Most of those occur in the
context of some operation (like commit) that is already dealing with pages and
page-aligned values. I'm guessing they do so to ensure they avoid the
documented possibility of not touching the last page.

That seems like the wrong API tradeoff to me, given the ease with which
arbitrarily unaligned ranges can be handled by pretouch (as demonstrated by
the new code).
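
To make that concrete, here is a rough sketch (not the actual PR code) of a
touch loop that tolerates arbitrary alignment of both start and end.  It
assumes a non-empty range and a power-of-two page size; the names are made up
for illustration:

    #include <cstddef>
    #include <cstdint>

    // Touch every page overlapping [start, end), for any alignment of the
    // arguments.  A write is used because a plain read may be satisfied from
    // the shared zero page without backing the address with real memory.
    static void touch_range_sketch(char* start, char* end, size_t page_size) {
      volatile char* p = start;
      // Base of the page containing the last byte of the range.  Comparing
      // against this (rather than against something like end + page_size)
      // means we never form a pointer past the range, so no overflow worries.
      volatile char* last_page = reinterpret_cast<volatile char*>(
          reinterpret_cast<uintptr_t>(end - 1) & ~(uintptr_t)(page_size - 1));
      for (;;) {
        *p = 0;                      // pretouch-style write; the concurrent
                                     // variant needs the atomic add of zero
        if (p >= last_page) break;   // just touched the last page
        p += page_size;              // stays within the next page, whatever
                                     // the alignment of start
      }
    }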

And it's potentially much worse for callers of the proposed new touch
operation, which will be called in some entirely different context.

For example, when ParallelGC expands the old generation, the higher level code
where the new touch operation would be placed (PSOldGen::expand_for_allocate)
calls an expansion function with a minimum size and has available the start
and end of the added range. If I recall correctly, that start and end are
going to be page aligned, but that information comes from digging through the
call chain; there's no promise of such by the expansion function (and no
reason for it to make such a promise, as it's really just an artifact of the
lower level implementation).  The touch call would either need to assume (and
assert) that alignment or ensure it, neither of which is an improvement to the
calling code.

There are similar situations in G1 when touching the auxiliary memory
associated with a region (card table, BOT, etc.). The size of the card table
range associated with a 1M region is 2K, which is less than a Linux default
x86 page, and much less than a Linux default aarch64 page.

I thought about making the new concurrent touch operation require the range to
be int-aligned. I didn't because I think it's important to have the two touch
variants be API compatible, and there was no reason to add any alignment
requirement to pretouch. Adding such a requirement to touch would just be
allowing the underlying implementation to show through in an inconvenient way
for callers.
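
For reference, here's the shape of the concurrent touch of a single location,
assuming the atomic-add-of-zero approach from the PR description.  This is
just a sketch, with made-up names, using C++20 std::atomic_ref as a stand-in
for HotSpot's Atomic::add:

    #include <atomic>
    #include <cstdint>

    // Touch the page containing 'addr' without disturbing concurrent readers
    // or writers.  The int-sized atomic is where the sizeof(int) alignment in
    // the implementation comes from; aligning internally means callers don't
    // need to pass aligned arguments.
    static void concurrent_touch_sketch(char* addr) {
      uintptr_t bits = reinterpret_cast<uintptr_t>(addr);
      bits &= ~(uintptr_t)(sizeof(int) - 1);  // align down; stays in the same
                                              // page, which is committed since
                                              // commits are page-granular
      int* p = reinterpret_cast<int*>(bits);
      // Adding zero forces the location to be backed by physical memory while
      // leaving every value observed by other threads unchanged.
      std::atomic_ref<int>(*p).fetch_add(0, std::memory_order_relaxed);
    }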

>>> - Safeguard that end + page doesn't overflow. Do we ever hand out pages at the end of the virtual address range? There's a lot of HotSpot code that assumes that we never get that kind of memory, so is it worth protecting against this compared to the extra complexity it gives? Previously, I could take a glance at that function and understand it. Now I need to pause and stare at it a bit.
>> I suspect overflow can't happen on at least some 64bit platforms, but I
>> don't know of anything that would prevent it on a 32bit platform. And I had
>> exactly the opposite reaction from you to the old code. I looked at it and
>> immediately wondered what might happen on pointer overflow, and questioned
>> whether I understood the code.
>> 
>> And as noted in the PR description, the old code for PretouchTask checks for
>> overflow, but does so incorrectly, such that the check could be optimized
>> away (or lead to other problems), because the check can only fail after UB
>> has already occurred.
> 
> I don't know of any platform that hands out pages at the top of the address range. (Not saying that I know the details of all platforms).  However, maybe this argument is more appealing: If the program got hold of a memory address at the end of the address range, then I think C++'s pointer comparison would break down. There's a section that states:
> "If one pointer points to an element of an array, or to a subobject thereof, and another pointer points one past the last element of the array, the latter pointer compares greater."
> 
> So, if we have a 4K byte array at 0xfffff000, then the start address would obviously be 0xfffff000, but "one past the last element" would be 0xfffff000 + 0x1000, which would overflow, likely resulting in the address 0x0 (barring UB issues). If that happens, then it seems like the statement above is at risk.

It is indeed the case that C++ pointer comparison can break down. That quote
is about arrays created by facilities provided by the C/C++ language and its
libraries, like malloc. But we regularly step outside that boundary by using OS
facilities like mmap. As an example, it's for this reason that we have the
pointer_delta utility function. On a 32bit platform a "proper" C/C++ array
can't be 2Gbytes or larger in size, because the difference between the start
and end can't be represented by a 32bit ptrdiff_t.

That doesn't prevent the C/C++ language implementation from using memory near
the end of the address space; it just needs to take some care to avoid
allocating an array whose end is at the end of the address space.

Do we have potentially incorrect code because of this?  Quite possibly.  I
didn't want to go on a speculative hunt for such, but tidying up code that I
was touching (no pun intended) anyway seemed appropriate.
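
As a side note on the UB point above: the difference is between pointer
arithmetic, where overflow is undefined, and unsigned integer arithmetic,
where wraparound is defined.  Something like the following (hypothetical, not
the PR code) is how a bound check can be written so the compiler can't remove
it:

    #include <cstddef>
    #include <cstdint>

    // A check written as 'cur + step < cur' on a char* is UB on overflow, so
    // the compiler is allowed to assume it is always false and delete it.
    // The same test on uintptr_t is well-defined unsigned wraparound.
    static inline bool step_would_overflow(const void* cur, size_t step) {
      uintptr_t p = reinterpret_cast<uintptr_t>(cur);
      return p + step < p;
    }

This is also why pointer_delta works where raw pointer subtraction doesn't:
it does the subtraction on unsigned integers instead of relying on a
ptrdiff_t that can't represent differences of 2G or more on a 32bit platform.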

>>> 2) For the new infrastructure.
>>> 
>>> * I'm not sure we should add TouchTask until we have a concrete use-case (with prototype/PR) for it. Without that, it is hard to gauge the usability of this feature. How do other threads proceed concurrently with the worker threads? What is the impact of using TouchTask? One of the concerns that has been brought up by others is that using a WorkGang will block the rest of the JVM from safepointing while the workers are pre-touching. A use-case would clarify this, I think.
>> Here is a description of a specific example from my experiments. When the G1
>> allocator runs out of regions that were pre-allocated for eden and can't GC
>> because of the GCLocker, it tries to allocate a new region, first from the
>> free region list, and then an actually new region (assuming that's possible
>> within the heap size constraints). In the latter case, when AlwaysPreTouch
>> is true, it currently does a *single-threaded* pretouch of the region and
>> ancillary memory. (It's single-threaded because most of the code path involved
>> is shared with during-GC code that is running under the work gang. That code
>> is a good candidate for os::touch_memory, so adding a gang argument through
>> that call tree that is null to indicate don't pretouch doesn't seem like an
>> improvement.) It could instead do a parallel concurrent touch after making the
>> new region available for other mutator threads to use for allocation. Yes, it
>> blocks safepoints, but that's already true for the existing code, and the
>> existing code may block them for *much* longer due to the unparallelized
>> pretouch. The downside of using a workgang here is the gang threads could be
>> competing with mutator threads for cores. I think parallelizing here is likely
>> beneficial though. I think there's a similar situation in ParallelGC, though I
>> haven't chased through all of that code carefully yet.
> 
> Thanks for the use-case.
> 
> Maybe the WorkGang is an easy and good enough approach to start with, even with its use of a safepoint-blocking mechanism. Is this feature going to be turned off by default?

The current code is already safepoint-blocking. The idea in the above is to
allow other threads to make use of the newly allocated chunk sooner. I think
the feature might be under some user control, though perhaps something
different from AlwaysPreTouch.

> I have the feeling that for small enough requests, it will be better to let the allocating thread "concurrent touch" the memory. Other Java threads that allocate small objects will probably get the paged-in memory faster than if we spawn worker threads. It would also be interesting to see an approach that touches memory in an async thread (though it needs to cooperate with the JVM so that the memory isn't uncommitted).

Small requests, like at the Java small object level, wouldn't normally go
through this mechanism.  Pretouch is used for larger allocation chunks, and I
think the new mechanism would be operating on the same or similar granularity.
So MinHeapDeltaBytes, G1HeapRegionSize, and so on.

I've done some experimenting with cooperative touching at smaller
granularities, but haven't come up with an abstraction I was happy with. (Not
that I think the existing PretouchTask style is ideal; it doesn't support
parallel touching of a G1 region's auxiliary data very well.) The idea is that
as each thread carves (say) a new TLAB out of a region, it can (if needed)
advance the touched boundary for the region.
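
Just to illustrate the shape of that idea (this is not the code from my
experiments, and all the names here are invented): each allocating thread
would CAS the region's touched boundary forward to cover its new chunk, and
the winner touches the claimed span with the concurrent-safe touch while
everyone else keeps allocating.

    #include <atomic>
    #include <cstddef>
    #include <cstdint>

    struct RegionTouchState {
      std::atomic<char*> touched_top;  // everything below this has been touched
    };

    // Called after a thread carves a chunk ending at alloc_end (e.g. a new
    // TLAB) out of the region.  Clamping to the region end is omitted.
    static void cover_allocation(RegionTouchState* state, char* alloc_end,
                                 size_t page_size) {
      uintptr_t mask = page_size - 1;
      char* target = reinterpret_cast<char*>(
          (reinterpret_cast<uintptr_t>(alloc_end) + mask) & ~mask);
      char* cur = state->touched_top.load(std::memory_order_relaxed);
      while (cur < target) {
        if (state->touched_top.compare_exchange_weak(cur, target,
                                                     std::memory_order_relaxed)) {
          // We claimed [cur, target); touch it here with the new
          // os::touch_memory from this PR.  Allocations racing ahead of the
          // touch are still correct, since touching only affects paging.
          break;
        }
        // CAS failure reloaded 'cur'; another thread advanced the boundary.
      }
    }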

>>> * Is the usage of template specialization really needed? Wouldn't it be sufficient to pass down a bool parameter instead? I doubt we would see any performance difference by doing that.
>> The use of a template here seemed pretty straightforward when I wrote it,
>> but I agree a boolean argument would work just as well and is more obvious
>> to read. I'll change that.  I expect the same generated code either way.
>> 
>>> * It's a little bit confusing that in the context of TouchTask "touch" means "concurrent touch", while in the touch_memory_* it means either concurrent or non-concurrent. Maybe rename TouchTask to something to make this distinction a bit clearer.
>> I'm open to alternative naming. I thought ConcurrentTouchTask was rather
>> long. My experiments repo currently uses PretouchTask::concurrent_touch, but
>> I don't like that naming either.
>> 
>> The rationale for "touch" vs "pretouch" is that "touching" is the primary
>> generic concept, and touch_memory can be used anywhere. Meanwhile
>> "pretouching" is now a restricted variant that might have better
>> performance under those limitations.
>> Similarly for TouchTask vs PretouchTask. The static functions are just shared
>> helpers for the API functions; "touch_memory" is the "primary" thing.
> 
> I think ConcurrentTouchTask is a good and descriptive name that will aid the readability.

I think ConcurrentTouchTask has a defect in that it suggests it's for use
during the concurrent part of a GC cycle. But it can and should be used in
some during-GC-pause or other safepoint situations. That is, "concurrent" is
overloaded in our world.

That's why I suggest thinking about it differently, with "touch" being generic
and "pretouch" being restricted and perhaps gaining an efficiency benefit from
the restriction.


