RFR: 8292914: Drop the counter from lambda class names [v6]

Fri Feb 17 13:54:50 UTC 2023

On Thu, 16 Feb 2023 19:35:54 GMT, David M. Lloyd <duke at openjdk.org> wrote:

>> The class generated for lambda proxies is now defined as a hidden class. This means that the counter, which was used to ensure a unique class name and avoid clashes, is now redundant. In addition to performing redundant work, this also impacts build reproducibility for native image generators which might already have a strategy to cope with hidden classes but cannot cope with indeterminate definition order for lambda proxy classes.
>> 
>> This solves JDK-8292914 by making lambda proxy names always be stable without any configuration needed. This would also replace #10024.
>
> David M. Lloyd has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Use a unique index for the dumped lambda class instead of a time stamp

> This proposal is appealing for its simplicity, however, there are points from [JDK-8292914](https://bugs.openjdk.org/browse/JDK-8292914) that it does not address:
> 
> 1. _To register lambda classes for serialization_. In this case, the agent run would register all lambdas from the capturing class for serialization: there would be no identifier to distinguish one lambda from another.

In the serialized lambda representation, the lambda proxy class name should not have any part to play at all, since it's all handled using `SerializedLambda` and the generated `$deserializeLambda$` method. Right?
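
To make that concrete, here is a minimal sketch (not part of the PR) that pulls the `SerializedLambda` out of a serializable lambda and prints what it records. The class name `SerializedLambdaDemo`, the `SerSupplier` interface, and the reflective call to the proxy's private `writeReplace` method are illustration-only assumptions; the reflective access assumes the code runs from the class path (unnamed module). None of the recorded fields is the proxy class name.

```java
import java.io.Serializable;
import java.lang.invoke.SerializedLambda;
import java.lang.reflect.Method;
import java.util.function.Supplier;

// Hypothetical demo class; names are for illustration only.
final class SerializedLambdaDemo {
    // A functional interface that makes the lambda serializable.
    interface SerSupplier<T> extends Supplier<T>, Serializable {}

    // A lambda whose capturing class is SerializedLambdaDemo.
    static SerSupplier<String> sample() {
        return () -> "hello";
    }

    // Serializable lambda proxies carry a private writeReplace() method that
    // returns the SerializedLambda actually written to the stream.
    static SerializedLambda describe(Serializable lambda) {
        try {
            Method writeReplace = lambda.getClass().getDeclaredMethod("writeReplace");
            writeReplace.setAccessible(true);
            return (SerializedLambda) writeReplace.invoke(lambda);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("not a serializable lambda?", e);
        }
    }

    public static void main(String[] args) {
        SerializedLambda sl = describe(sample());
        // The serialized form names the capturing class, the functional
        // interface, and the implementation method, not the proxy class.
        System.out.println("capturing class:      " + sl.getCapturingClass());
        System.out.println("functional interface: " + sl.getFunctionalInterfaceClass());
        System.out.println("impl method:          " + sl.getImplClass() + "." + sl.getImplMethodName());
    }
}
```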

> 2. _For reproducible builds_. This PR makes a small step towards reproducible builds, however, it complicates the implementation on the Native Image side significantly. Now, as each class needs a unique name, we must create the unique (and stable) lambda-proxy name for all lambdas within a capturing class. This is hard with parallel static analysis as methods are discovered concurrently, and hence non-deterministically. When we discover a class, we would need to find all of its captured lambdas, then perform a deterministic traversal of the class' bytecode in order to give lambda-proxies a stable name. For such an implementation we don't even need this PR as we can simply strip the sequence number from the name and invent a new one.

This isn't just a problem for lambda proxies. All hidden classes in fact share this common challenge; the defined class name is not sufficient to identify the class because any number of hidden classes may have the same base name. In qbicc, we now compute the hidden class's unique sub-identifier using a cryptographic hash function over the body of the class. This helps reproducibility for us since the same object file name (and symbol names) will always be generated for the same input class bytes. And it works not just for lambda proxies, but also lambda forms and method handle/direct method handle classes, as well as user-defined hidden classes.
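
A rough sketch of the idea, not qbicc's actual code: derive the sub-identifier from a digest of the class bytes, so the same bytes always yield the same name. SHA-256, the 8-byte truncation, the `/` separator, and the class and method names below are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper; illustrates content-derived naming only.
final class HiddenClassNames {
    // Returns baseName + "/" + the first 8 digest bytes as hex.
    // The result depends only on the inputs, never on definition order.
    static String stableName(String baseName, byte[] classBytes) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(classBytes);
            StringBuilder sb = new StringBuilder(baseName).append('/');
            for (int i = 0; i < 8; i++) {
                sb.append(String.format("%02x", digest[i] & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in bytes; a real tool would pass the hidden class's class file.
        byte[] body = "class-file-bytes".getBytes(StandardCharsets.UTF_8);
        System.out.println(stableName("Foo$$Lambda", body));
    }
}
```

Because the name is a pure function of the class bytes, two hidden classes that share a base name but differ in content get different identifiers, regardless of which one is discovered first.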

> 3. _To collect profiles of lambda methods for profile-guided optimizations_. With all captured lambdas having the same name, we would get polluted profiles and lose on performance.

These classes are hidden classes, and the problem applies broadly to that case; so, if you want to have accurate profiles, you *must* have a reproducible sub-identifier for *every* hidden class, otherwise you'll have the same problem all over again.

> 4. _To reflectively access lambdas_. Again we would not be able to distinguish between lambdas in the same capturing class and hence we would get imprecise reflection registration.

The same thing applies, I think. You need to solve this problem for *all* hidden classes, not just lambda proxies. And by solving the problem for hidden classes, you only need the solution in one place; having an additional scheme for lambda proxies will only introduce problems at this point.

> @dmlloyd Maybe I am missing something, but I don't understand how does this PR make the builds more reproducible?

Native image generators need a reproducibility solution that gives stable identifiers to all hidden classes (not just lambda proxies), and removing sequence numbers from hidden class generation is what actually allows such a solution to work. Having an unpredictable sequence number in the class name means that each time the class is built, it may have different content (and thus, in our case at least, a different cryptographic hash). So to prevent lambda proxy generation from undermining general reproducibility of hidden classes, the simplest solution is to always use the same base name. Thus, removing the sequence number (as done in this PR) enables reproducibility.

Incidentally, through this scheme I discovered that there are multiple cases where the JDK defines more than one class with identical content, at least for lambda proxies and method handles. There could be an optimization opportunity there (or more than one).
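
One way such duplicates could be spotted, sketched here under assumptions (the `DuplicateClassCounter` name, SHA-256, and the Base64 key are all hypothetical, and this is not how qbicc reports them): key each defined class body by its digest and count repeats.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that counts how often identical class bytes are defined.
final class DuplicateClassCounter {
    private final Map<String, Integer> counts = new LinkedHashMap<>();

    // Records one definition and returns how many times these exact bytes
    // have been seen so far (1 means first occurrence).
    int record(byte[] classBytes) {
        return counts.merge(key(classBytes), 1, Integer::sum);
    }

    // Number of distinct class bodies that were defined more than once.
    long duplicatedBodies() {
        return counts.values().stream().filter(c -> c > 1).count();
    }

    private static String key(byte[] classBytes) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            return Base64.getEncoder().encodeToString(md.digest(classBytes));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

This only works once the sequence number is gone from the class name: with a counter embedded in the bytes, otherwise identical classes would never hash to the same key.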

-------------

PR: https://git.openjdk.org/jdk/pull/12579