RFR(S): 8087143: Reduce Symbol::_identity_hash to 2 bytes

Ioi Lam ioi.lam at oracle.com
Tue Jun 23 19:28:36 UTC 2015



On 6/23/15 4:08 AM, Aleksey Shipilev wrote:
> On 06/22/2015 11:48 PM, Ioi Lam wrote:
>
>> Since it's extremely unlikely that anyone would have a system
>> dictionary (or any hash table) larger than 16M entries (even if we
>> change SystemDictionary::calculate_systemdictionary_size), let's
>> forget the top 8 bits and just look at the bottom 24 bits of
>> Symbol::identity_hash().
> This is technical debt: conditionalizing the choice of hashcode
> function on a weird hash table implementation detail. If you don't want
> to revisit this over and over again after it bites you, please create a
> hash function that fills all the significant bits.
>
>
>>>>    I also have a problem with "|": could it increase the chance
>>>> that the results have more '1' bits and make the case worse?
>>>>
>>> Oops, I think it should be this:
>>>
>>> unsigned identity_hash() {
>>>      unsigned lower  = (unsigned)_identity_hash;
>>>      unsigned upper1 = (unsigned)_body[0];
>>>      unsigned upper2 = (unsigned)_body[1];
>>>      unsigned upper3 = (unsigned)((uintptr_t)(this) >>
>>>                                   (LogMinObjAlignmentInBytes + 3));
>>>      unsigned upper4 = (unsigned)_length;
>>>      return lower | ((upper1 ^ upper2 ^ upper3 ^ upper4) << 16);
>>> }
> The rationale for "|" is simple: with proper shifts, it does
> concatenation, not mixing. So you can spread the entropy of the sources
> over the entire bitscale. In other words, you have to first maximize
> the value domain for the result, and then try to maximize the amount
> of source data *each* result bit gets.
>
> The baseline os::random() should pass the bit randomness test (it is
> one of the basic statistical tests for randomness), and if we are
> augmenting it, we should strive to satisfy the same requirement.
>
> For example, the hash function below is supposed to generate a 32-bit
> value, and we are arguing about the upper 16 bits. Let's see:
>
>    // affects lower 8 bits
>    ((unsigned)_body[0]);
>
>    // affects lower 8 bits, probably correlated with _body[0]
>    ((unsigned)_body[1]);
>
>    // hopefully (!) affects 16 bits, but is it high entropy?
>    ((unsigned)((this) >> (LogMinObjAlignmentInBytes + 3)));
>
>    // realistically, affects lower 4-10 bits
>    ((unsigned)_length);
>
> So you would like to cram the address into the entire 16-bit upper
> word, and then overlay the other values over it to compensate for
> potentially low entropy:
>
>    identity_hash
>     | (address
>           ^ (_length << 8) // cover higher bytes of address
>           ^ (( _body[0] << 8) | _body[1])
>       ) << 16

Hi Aleksey,

I like this version :-)

I assume by "address" you mean

((unsigned)((this) >> (LogMinObjAlignmentInBytes + 3)));

> But if you want to go about choosing the hash function properly instead
> of flying blind, you'd need to dump all these individual values for
> symbols in a large workload and simulate the simplest hash function,
> instead of guessing. (There is a funny example in AoCP where Knuth
> "built" a random function in the same manner, by slapping pieces
> together to gain more randomness, only to fail miserably.)

I think for performance reasons we are storing a random number in the 
Symbol instead of computing the hashcode based on the string value. The 
intention of this RFE is not to change that, but just to use fewer 
random bits to accomplish the same goal. Yes, we are assuming a 
particular hashtable implementation, but I don't think we are going to 
change that any time soon. And if we ever change the hash function to 
depend heavily on the top 16 bits of the hash code, we can certainly 
revert back to a 32-bit random number.

To change the hashing algorithm completely would be a different RFE 
altogether (I am sure John Rose has a few of those in his pocket :-)

Thanks
- Ioi
> Thanks,
> -Aleksey
>


