RFR: 8345314: Add a red–black tree as a utility data structure [v12]
Thomas Stuefe
stuefe at openjdk.org
Thu Jan 16 17:32:56 UTC 2025
On Fri, 10 Jan 2025 10:03:22 GMT, Casper Norrbin <cnorrbin at openjdk.org> wrote:
>> Hi everyone,
>>
>> This effort began as an exploration of replacing the current NMT treap with a red-black tree. Along the way, I discovered that others were also interested in having a general-purpose tree structure available within HotSpot.
>>
>> The red-black tree is designed to serve as a drop-in replacement for the existing NMT treap, keeping a nearly identical interface. However, I’ve also added a few additional requested features, such as an iterator.
>>
>> Testing builds off the treap tests, adding a few extra that inserts/removes and checks that the tree is correct. Testing uses the function `verify_self`, which iterates over the tree and checks that all red-black tree properties hold. Additionally, the tree has been tested in vmatree instead of the treap without any errors.
>>
>> For those who may want to revisit the fundamentals of red-black trees, [Wikipedia](https://en.wikipedia.org/wiki/Red%E2%80%93black_tree) offers a great summary with tables covering the various balancing cases. Alternatively, your favorite data structure book could provide even more insight.
>
> Casper Norrbin has updated the pull request incrementally with one additional commit since the last revision:
>
> clarified comment
Looked through verification first; will continue tomorrow.
src/hotspot/share/utilities/rbTree.hpp line 55:
> 53: class RBNode {
> 54: friend RBTree;
> 55: friend class RBTreeTest;
Is there a reason to be friends of RBTree? Even RBTreeTest?
src/hotspot/share/utilities/rbTree.hpp line 66:
> 64:
> 65: enum Color : uint8_t { BLACK, RED };
> 66: Color _color;
Possibly for a future optimization. You could keep the color information as one bit overlayed in one of the pointers. That saves four to eight bytes per Node structure.
The Node pointers have to be aligned to `sizeof(void*)`, so we will always have an alignment shadow of at least two bits. If you encapsulate accessing the tagged pointer correctly, this should not be complicated.
(I believe the kernel RB tree does the same thing)
src/hotspot/share/utilities/rbTree.hpp line 70:
> 68: public:
> 69: const K& key() const { return _key; }
> 70: V& val() { return _value; }
Here, and in other places: please give us const variants too (as @xmas92 noted, I believe).
src/hotspot/share/utilities/rbTree.inline.hpp line 92:
> 90: #ifdef ASSERT
> 91: template <typename K, typename V, typename COMPARATOR, typename ALLOCATOR>
> 92: inline bool RBTree<K, V, COMPARATOR, ALLOCATOR>::RBNode::is_correct(
Some proposals for better/more useful/more expressive verification:
1) I would use asserts inside the `Node::is_correct` function. I would not bother returning a bool but assert right away. This is used during tree verification only, and ideally, we want to crash hard and fast right there to have aa point to quickly debug. Arguably, I would rename the function to something like `verify()` (I'll use that name from now on)
2) I would then also not use combined asserts but assert each condition separately.
3) Please use size_t (or uintx, which I prefer but I know others don't) for the number of nodes/number of blacks in the arguments of this function.
4) I would not pass in limits to be checked by the node. Instead would let Node::verify() increase stats.
5) I would also return longest/shortest path to any leaf, and black-nodes-til-leaf, like this (see below point(6)):
void verify(
size_t& num_nodes, // total nodes
size_t& blacks_until_leaf, // returns the number of black nodes from here to its leafs
size_t& longest_leaf_path, // length of longest to-leaf path from here
size_t& shortest_leaf_path, // length of shortest to-leaf path from here
size_t current_depth
) const;
you also can create a Stat struct and pass a reference to that if you dislike so many arguments.
6) In Node::verify(), I would explicitly check the following red-black tree rules:
- If Node is red, it shall have no red childs; (I see you do this already)
- If a Node has exactly one child, the child must be red (consequently, the node must be black)
- After calling `Node::verify() `for left and right child recursively, the blacks-until-leaf for right node and the one for the left node should be the same (which is also why we return just one value for `blacks_until_leaf`)
- the longest leaf path and the shortest leaf path must never be farther apart than factor 2 (so, `shortest_leaf_path <= longest_leaf_path <= shortest_leaf_path * 2 `). That asserts that we are balanced as far as RB trees go.
7) In debug builds, please give the Node class a marker bit: `DEBUG_ONLY(bool _visited)`. Keep inside of the Tree class a `DEBUG_ONLY("bool _expected_visited")` bit. On Node creation/addition, set node->_visited to tree->_expected_visited. Then do this: on each `Tree::verify()` call, flip the tree expected bit, and in `Node::verify()` assert that the current node "_visited" bit is !_expected. Then change node->_visited to _expected. This is to ensure that the tree does not contain duplicates - we should have visited each node exactly once. This will be a thing to look out for when adding caller-allocated nodes as a feature. (the flipping of tree->_expected_visited just lets us avoid resetting the bit in each node after verification is completed)
8) If you do (7), you could also do something like this: write a gtest where all nodes are allocated from an array-like allocator. Now you have an independent way to traverse all nodes (via the array). After tree verification, you can then do one additional sweep through the array and check that every `Node::_visited == Tree::_expected_visited`. This makes sure our iteration process reaches all nodes.
9) outside the tree, once root->verify() returned, just check the num_nodes counter for == Tree::_size. Remove Node::count_nodes(), it's not needed. Now, less code and we have avoided one unnecessary iteration of the tree just to get the node count.
src/hotspot/share/utilities/rbTree.inline.hpp line 105:
> 103: bool right_is_correct = num_blacks == 0;
> 104: if (_left != nullptr) {
> 105: if (COMPARATOR::cmp(_left->key(), _key) >= 0 || // left >= root, or
Under which circumstances would the comparator return 0? We don't have duplicates in the tree, right?
src/hotspot/share/utilities/rbTree.inline.hpp line 114:
> 112: if (_right != nullptr) {
> 113: if (COMPARATOR::cmp(_right->key(), _key) <= 0 || // right <= root, or
> 114: (is_red() && _left->is_red()) || // 2 red nodes, or
Did you mean `_right->is_red()` ?
-------------
PR Review: https://git.openjdk.org/jdk/pull/22360#pullrequestreview-2556582163
PR Review Comment: https://git.openjdk.org/jdk/pull/22360#discussion_r1918935276
PR Review Comment: https://git.openjdk.org/jdk/pull/22360#discussion_r1918938994
PR Review Comment: https://git.openjdk.org/jdk/pull/22360#discussion_r1918805128
PR Review Comment: https://git.openjdk.org/jdk/pull/22360#discussion_r1918920888
PR Review Comment: https://git.openjdk.org/jdk/pull/22360#discussion_r1918882840
PR Review Comment: https://git.openjdk.org/jdk/pull/22360#discussion_r1918883391
More information about the hotspot-dev
mailing list