sentinels & who owns the value space Re: valhalla-dev Digest, Vol 17, Issue 10
Rezaei, Mohammad A.
Mohammad.Rezaei at gs.com
Sat Dec 19 04:57:20 UTC 2015
Let’s take a simple example: IntHashSet in GS Collections. As the name suggests, it stores ints. All ints may be stored in the set.
The set is implemented as an open hash set with a single int array, 4 int fields and no reference objects:
private int table;
private int occupiedWithData;
private int occupiedWithSentinels;
// The 32 bits of this integer indicate whether the items 0 to 31 are present in the set.
private int zeroToThirtyOne;
private int zeroToThirtyOneOccupied;
No wrapper objects of any kind (Integer, or Entry) are created when set operations are performed.
Internally, there are two special values: EMPTY (0) and REMOVED (1). If a position in the array is EMPTY, there is no client value there. If a position in the array is REMOVED, there was once a client value there, but not anymore. We call these two special values sentinels. Any other value in the array means that value is in the set.
The following code is perfectly legal:
IntHashSet set = new IntHashSet();
set.contains(0); // returns false, even though the internal array is full of 0
set.contains(0); // returns true
set.contains(1); // returns true
this is accomplished by remembering the addition of 0 or 1 by flipping a bit in zeroToThirtyOne. All methods (add, contains, remove, etc) are guarded by if-statements that handle 0 and 1 specially. We don’t need 2^32 + 2 values, even though two values have special meaning in the structure, and they are not excluded from the client’s use, nor does the main storage structure need more than 32 bits per element. The cost of the extra storage (the int zeroToThiryOne) is constant and does not affect the main storage; it’s overhead goes to zero as the set size increases.
All of this is possible because the set can designate two client space values as special and use a uniform single array for storage.
The same approach is impossible with value types as proposed in Valhalla.
From: Thomas W [mailto:twhitmore.nz at gmail.com]
Sent: Friday, December 18, 2015 6:52 PM
To: Rezaei, Mohammad A. [Tech]
Cc: valhalla-dev at openjdk.java.net; timo.kinnunen at gmail.com; vitalyd at gmail.com
Subject: Re: sentinels & who owns the value space Re: valhalla-dev Digest, Vol 17, Issue 10
> A properly written data structure (see trove, gs collections and koloboke for examples) does
> not preclude the client from using the sentinel values in the collection or for other purposes. null is a sentinel in hashmap, yet
> null keys are fully supported. A properly written data structure hides the actual sentinel values from the client.
Thanks for your email. You may be referring to a sentinel for _entries_, which is a different issue from sentinels for _keys_ as _entries_ are not the client value space. I'm not sure it would even be correct terminology to describe null entries as "sentinels"; they're definitely not sentinel key values.
If the collection requires a sentinel for internal meaning, an unwrapped null sentinel value cannot be used in a collection allowing null values from the client where the same storage element is required to store both.
This is due to the simple fact that the meaning of that sentinel cannot not be distinguished from all valid client values. Don't forget, if you want to consider eg. a byte type -- the client has 256 valid values, plus you're saying the collection has 1 more (the sentinel) for a total of 257 values.
Bytes do not allow 257 values. When I said information theory, I meant it.
We are discussing use of sentinels in value space _directly_. I expect you are referring to alternative approaches using additional storage/ or indirection, which make it easy to solve the problem -- see below. However your (very vague) references presumably to those approaches don't seem to shed any light on the (different) topic which we were discussing.
Sentinels directly in value space, with no extra storage, indirection or workarounds, is really what we were discussing.
> No information theory is broken, because the client does not need to be aware of the internal sentinels used in the data
Information theory then tells us that a collection allowing 256 valid client values, plus distinguishing a sentinel, must using more than 1 byte for storage. Indirection, reference-type sentinels or additional structure are all possible. The collection will not be using just the declared datatype for storage, since those 257 distinct meanings cannot be coded in a single byte.
Once we accept a need for indirection/ extra storage, the 'empty cell' problem is already solved -- the collection has space for a flag/ or can easily mask null with a unique non-null reference, etc. Simplest solutions are the most direct; which would seem to leave no benefit (only negatives) for random-bit approaches.
More information about the valhalla-dev