RFR: 8330027: Identity hashes of archived objects must be based on a reproducible random seed [v3]

Thomas Stuefe stuefe at openjdk.org
Wed Apr 24 15:42:31 UTC 2024


On Thu, 18 Apr 2024 07:51:22 GMT, Thomas Stuefe <stuefe at openjdk.org> wrote:

>> CDS archive contains archived objects with identity hashes.
>> 
>> These hashes are deliberately preserved or even generated during dumping. They are generated based on a seed that is initialized randomly on a per-thread basis. These generations precede CDS dump initialization, so they are not affected by the init_random call there, nor would they be affected by [JDK-8323900](https://bugs.openjdk.org/browse/JDK-8323900).
>> 
>> A random seed will not work for dumping archives since it prevents reproducible archive generation. Therefore, when dumping, these seeds must be initialized in a reproducible way.
>
> Thomas Stuefe has updated the pull request incrementally with one additional commit since the last revision:
> 
>   final version

> > > > > I get that the chance for this happening is remote, but hunting sources of entropy is frustrating work, and the patch is really very simple. So, why not fix it? I don't share the opinion that this is added complexity.
> > > > 
> > > > 
> > > > Why not do it inside `Thread::Thread()`
> > > > ```c++
> > > > // thread-specific hashCode stream generator state - Marsaglia shift-xor form
> > > > if (CDSConfig::is_dumping_static_archive()) {
> > > >   _hashStateX = 0;
> > > > } else {
> > > >   _hashStateX = os::random();
> > > > }
> > > > ```
> > > 
> > > 
> > > Because then it would inject `os::random` into the startup of every thread, not just of every thread that generates iHashes. So it would also fire for GC threads and other threads started before "our" threads. That would make our random sequence depend on the order and number of threads started.
> > 
> > 
> > My last answer was rubbish, sorry - I did not read your comment carefully enough.
> > Yes, your approach would also work, but it would lead to the two threads involved in dumping the archive - VMthread and the one Java thread - using the same seed, hence generating the same sequence of ihashes. That, in turn, can lead to different archived objects carrying the same ihash, which may negatively impact performance later when the archive is used.
> 
> I think it's better to just not compute the identity hash inside the VM thread. Here's what I tried
> 
> [iklam at ad95e2e](https://github.com/iklam/jdk/commit/ad95e2e8b00cb151617463af41648cdece2dfc7b)
> 
> We thought that forcing the identity hash computation would increase sharing across processes, as it would mean fewer updates of the object headers at run time. However, most of the heap objects in the CDS archive are not accessible by the application (they are part of the archived module graph, etc.). Also, the archive contains a large number of Strings, which are unlikely to need the identity hash (String has its own hashCode() method).
> 
> Since the reason is rather dubious, I think it's better to remove it and simplify the system.

I like that; it is simpler. Okay, then we will only call ihash from a single thread, so a global constant seed should be fine. I should be able to assert that, right?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18735#issuecomment-2075248104


More information about the hotspot-runtime-dev mailing list