RFR: 8337217: Port VirtualMemoryTracker to use VMATree

Fri Nov 8 10:51:43 UTC 2024

On Fri, 27 Sep 2024 09:52:07 GMT, Afshin Zafari <azafari at openjdk.org> wrote:

> > What is the consensus on having two implementations alive at the same time? I'd like to see us delete the old VirtualMemoryTracker and only have the tree implementation left as part of this PR. What is the status of our testing? Is the tree version equivalent now and passes all tests?
> 
> I have already told that the main reason to have 2 versions is to be able to switch back from new to old version at customer site at runtime. This is what Thomas has requested.

The issue with having two implementations is that it requires adding a new `java` option, and deprecating and removing those take 3 releases. That's 18 months of us being required to have the old version in the codebase, and being required to maintain it. I don't think that's a negligible promise to make.

Rather than having 2 implementations, I'd like to see us aiming for integration for JDK-25 after forking 24, so integration in December. That would give us 6 months of ensuring stability of the new implementation before it reaches any customers. During those 6 months it will go through all of our testing multiple times over, and be used by OpenJDK developers. What do you think of this idea?

> During the implementation, it is much beneficial to have access to the old version to compare the results and debug problems when they occur. Until the whole port to VMATree is not done, we need this 2 versions side-by-side feature.

Sure, I can understand that it's nice to have both versions present during development. Right now it seems like we have a broken build, do you have any plans on having a functioning and fully featured build soon?

>> src/hotspot/share/nmt/regionsTree.hpp line 31:
>> 
>>> 29: #include "nmt/vmatree.hpp"
>>> 30: 
>>> 31: class RegionsTree : public VMATree {
>> 
>> Is it necessary for this to be a superclass? Seems like this class contains utilities for working with a `VMATree`. Could it instead be `AllStatic` and take a tree as its first argument? I'd like to avoid inheritance unless necessary (typically for the usage of virtual functions).
>
> It is inherited because it is not a new type. I think that `_tree->visit_reserved_region(par1, par2)` is more readable than `VMATree::visit_reserved_region(_tree, par1, par2)`
> I could not find any *virtual functions* in these classes, what do you mean by that?

I'm saying that inheritance is mostly useful when we have virtual functions which we can specialize, and that `VMATree` has none.

>> src/hotspot/share/nmt/regionsTree.hpp line 33:
>> 
>>> 31: class RegionsTree : public VMATree {
>>> 32:  private:
>>> 33:   NativeCallStackStorage* _ncs_storage;
>> 
>> No need for this to be a pointer.
>
> How to use the ctor with a `bool` arg, then?

The bool argument is just passed along.

```c++
  RegionsTree(bool with_storage) : VMATree(), _ncs_storage(with_storage) {
  }

>> src/hotspot/share/nmt/regionsTree.hpp line 108:
>> 
>>> 106:             CommittedMemoryRegion cmr((address)base, comm_size, st);
>>> 107:             comm_size = 0;
>>> 108:             prev.clear_node();
>> 
>> This is unneeded.
>
> Which part? Why?
> `clear_node()` is the same as `prev = nullptr` if pointers were used.
> `is_valid()` is the same as `prev == nullptr` if pointers were used.

Because you immediately reassign `prev` to `curr` in L114 it's not necessary to first call `clear_node`.

>> src/hotspot/share/nmt/virtualMemoryTrackerWithTree.cpp line 101:
>> 
>>> 99: }
>>> 100: 
>>> 101: void VirtualMemoryTrackerWithTree::apply_summary_diff(VMATree::SummaryDiff diff) {
>> 
>> Applying a summary diff in the MemoryFileTracker is this:
>> 
>> ```c++
>>   for (int i = 0; i < mt_number_of_types; i++) {
>>     VirtualMemory* summary = file->_summary.by_type(NMTUtil::index_to_flag(i));
>>     summary->reserve_memory(diff.flag[i].commit);
>>     summary->commit_memory(diff.flag[i].commit);
>>   }
>> 
>> 
>> This is much simpler and doesn't require looking at signs and so on.
>
> In `MemoryFileTracker`, the changes in commit/reserve are applied to a local `VirtualMemorySnapshot`. Here we have to apply them to the global `VirtualMemorySummary`.

Yeah, that doesn't seem like a problem.

```c++
  for (int i = 0; i < mt_number_of_types; i++) {
    r = diff.flag[i].reserve;
    c = diff.flag[i].commit;
    flag = NMTUtil::index_to_flag(i);
    VirtualMemory* mem = VirtualMemorySummary::as_snapshot()->by_type(flag);
    reserved = mem->reserved();
    committed = mem->committed();
    mem->reserve_memory(r);
    mem->commit_memory(c);
    if ((size_t)-r > reserved) {
      print_err("release");
    }
    if ((size_t)-c > reserved || (size_t)c > committed) {
      print_err("uncommit");
    }
  }

>> src/hotspot/share/nmt/virtualMemoryTrackerWithTree.hpp line 34:
>> 
>>> 32: #include "utilities/ostream.hpp"
>>> 33: 
>>> 34: class VirtualMemoryTrackerWithTree : AllStatic {
>> 
>> Can this class be redone to use the same pattern as `MemoryFileTracker` with an outer class that's not static and an inner class that's stack, storing an instance of the outer class? That makes testing the class by creating an instance of it possible, unlike our old tests.
>
> Can we do this change as a separate RFE? It impacts the code a lot.

Then just invert it: Have the outer class be static and the inner class be an instance. We can change the `MemoryFileTracker` to be that, as it's not as large of a change.

>> src/hotspot/share/nmt/vmatree.hpp line 59:
>> 
>>> 57:   // Bit fields view: bit 0 for Reserved, bit 1 for Committed.
>>> 58:   // Setting a region as Committed preserves the Reserved state.
>>> 59:   enum class StateType : uint8_t { Reserved = 1, Committed = 3, Released = 0, COUNT = 4 };
>> 
>> Why do we now consider `StateType` to be a bit field?
>
> As the comments above it says, and we already had a discussion on it: The 'Committed' state now means 'Reserved and Committed'. So, when we visit nodes and check for Reserved ones, the committed nodes are also considered as Reserved. Otherwise, we would ignore the Committed nodes and continue looking for Reserved nodes.

Right, I don't remember the discussion on this being a bit field. I agree, though, Committed also means Reserved, but it's not necessary to express that as a bit field. Regardless, I leave it up to you how we express this.

>> test/hotspot/gtest/nmt/test_nmt_treap.cpp line 168:
>> 
>>> 166:     tty->print_cr("d1: %lf, d2: %lf, d2/d1: %lf", d1, d2, d2 / d1);
>>> 167:     EXPECT_LT((int)(d2 / d1), N2 / N1);
>>> 168:   }
>> 
>> 1000 and 10_000 elements are both quite small, isn't one measurement prone to being slower than expected due to unforeseen circumstances? For example, if `malloc` performs a heavy allocation (by mapping in a few new  pages) during the 10_000 elements insertion then that may stall enough that the `EXPECT_LT` fails. We are also bound to only performing the test once.
>> 
>> Is your thinking that these perf tests should be kept in, or should they be removed before integrating?
>
> Generally, I think it is necessary to have performance tests to verify if any future changes do not degrade the performance.
> The approach of stress testing can be agreed between us, but the tests have to exist.
> In this approach, the input size is scaled N times and we expect the execution time grows as O(log(N)) which is much less than N.
> This test fails and we need to have a justification for it. If approach is invalid or not correct implemented, we fix it. But the expectations remains the same.

Why would the execution time grow logarithmically when we do linearly more work? When we run this with `N2` we will perform `10_000 * log(10_000, 2)` units of work, and for `N1` we will perform `1_000 * log(1_000, 2)` units of work, approximately. A closer approximation is `O(log(1)) + O(log(2)) + ... + O(log(n))` for `n = N2` or `n = N1`.

Let's calculate that:

>>> import math
>>> def time_it(n):
...     t = 0
...     for i in range(1, n):
...             t  = t + math.log(i)
...     return [t, 3*t] # Approximate best and worst case
... 
>>> time_it(1000)
[5905.220423209189, 17715.661269627566]
>>> time_it(10_000)
[82099.71749644217, 246299.15248932652]
>>> def compare(a, b):
...     ab, aw = a
...     bb, bw = b
...     return [ bb / ab, bw / aw ]
... 
>>> compare(time_it(1000), time_it(10_000))
[13.902904821938064, 13.902904821938066]
>>> # Ouch, that's outside of our linear bound!

What I said previously also applies, especially the performance of `malloc` will impact this I suspect.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20425#issuecomment-2378929190
PR Review Comment: https://git.openjdk.org/jdk/pull/20425#discussion_r1711105506
PR Review Comment: https://git.openjdk.org/jdk/pull/20425#discussion_r1703735921
PR Review Comment: https://git.openjdk.org/jdk/pull/20425#discussion_r1799085027
PR Review Comment: https://git.openjdk.org/jdk/pull/20425#discussion_r1703751887
PR Review Comment: https://git.openjdk.org/jdk/pull/20425#discussion_r1703739805
PR Review Comment: https://git.openjdk.org/jdk/pull/20425#discussion_r1776820787
PR Review Comment: https://git.openjdk.org/jdk/pull/20425#discussion_r1703877183