Toward Condensers

Tue Aug 1 21:00:33 UTC 2023

On 8/1/2023 4:34 PM, Dan Heidinga wrote:
> Thanks for sharing this document.  And for the work you, Brian and 
> Paul have been putting into this.
>
> A couple of questions / comments about the document based on my reading:
>
> * In the "The condenser pipeline" section, the model shows an 
> Extractor -> Condenser 1 -> Condenser 2 -> Distiller as an example of 
> the pipeline which is linear: A, then B, then C,....  Often when 
> applying optimizations (ie: in a compiler), there's a virtuous circle 
> where one optimization exposes new opportunities for another, which 
> triggers more opportunities for the first.  This leads to running 
> through optimization passes until a fixpoint or some limit occurs. 
> Dead code elimination is often a pass that benefits from being 
> repeatedly run.  This is still early days but I'll ask anyway: Have 
> you thought about how the condenser pipeline would benefit from 
> repeated application?  Or how condensers could opt into repeated 
> application?

Yes, some :)  Suppose we want to apply the "replace Foo with Bar" 
condenser and the "inject Baz" condenser, but Bar might use Foo.  So 
you'll want to run the first again after the second.  While the diagram 
shows linearization, the simple Condenser interface allows for much more 
complicated composition.  Additionally, we anticipate condensers will 
have factory methods that take configuration data, such as which 
classes/containers to transform.  So the "inject Baz" condenser might 
scan for suitable Baz injection points, do the needful, and then turn 
around and

     Condenser fooAgain = new FooToBarCondenser(listOfContainersITouched);
     return fooAgain.condense(condensedResult);

So the second run of FooToBar can happen inside the InjectBaz 
condenser.  From the outside it looks like there are just two 
condensers, but the reality is more complicated.  (Another way to say 
this: composing condensers is so simple that maybe the runner need not 
accept a list of condensers at all, just one condenser that does the 
composition itself, much as functions in Haskell take exactly one 
argument.)
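
A minimal sketch of that kind of composition, including the "run until 
a fixpoint or some limit" case from the question.  Everything here is 
an assumption made for illustration: the single-method Condenser 
interface, the toy Model (just a set of class names), and the andThen 
and untilStable helpers are hypothetical, not the API from the document.

```java
import java.util.Set;
import java.util.TreeSet;

public class CondenserComposition {
    // Hypothetical stand-in for the application model: a set of class names.
    public record Model(Set<String> classes) {
        public Model plus(String name) {
            Set<String> copy = new TreeSet<>(classes);
            copy.add(name);
            return new Model(copy);
        }
        public Model minus(String name) {
            Set<String> copy = new TreeSet<>(classes);
            copy.remove(name);
            return new Model(copy);
        }
    }

    // Hypothetical condenser shape: one model in, one model out.
    public interface Condenser {
        Model condense(Model model);

        // Sequential composition: run this condenser, then the next one.
        default Condenser andThen(Condenser next) {
            return m -> next.condense(condense(m));
        }

        // Re-run a condenser until the model stops changing or maxRounds is hit.
        static Condenser untilStable(Condenser c, int maxRounds) {
            return m -> {
                Model current = m;
                for (int i = 0; i < maxRounds; i++) {
                    Model next = c.condense(current);
                    if (next.equals(current)) {
                        return current;
                    }
                    current = next;
                }
                return current;
            };
        }
    }

    // Toy "replace Foo with Bar" condenser.
    public static final Condenser FOO_TO_BAR =
        m -> m.classes().contains("Foo") ? m.minus("Foo").plus("Bar") : m;

    // Toy "inject Baz" condenser; in this toy, injecting Baz reintroduces Foo.
    public static final Condenser INJECT_BAZ =
        m -> m.classes().contains("Baz") ? m : m.plus("Baz").plus("Foo");

    public static void main(String[] args) {
        // From the outside this is a single condenser.
        Condenser pipeline = Condenser.untilStable(FOO_TO_BAR.andThen(INJECT_BAZ), 10);
        Model result = pipeline.condense(new Model(Set.of("Foo")));
        System.out.println(result.classes()); // prints [Bar, Baz]
    }
}
```

The runner only ever sees one Condenser; whether it wraps a linear 
chain, a fixpoint loop, or something more elaborate is the composer's 
business.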

How it ultimately gets surfaced may be a matter of some bikeshedding, 
but I think all the degrees of freedom needed are there already.

>
> * In "The application model", the first goal states:
> > Abstracted away from the representation — Condensers should not
> > directly read and write files as they do their work; they should express
> > their behavior in terms of changes to the model, and let the tooling
> > handle the representation.
>
> I agree with the goal of having the model mediate access to the 
> classfiles / jars / modules / resource files, but I think this goal 
> may overstate the requirement of "not directly read and write files" as 
> condensers that have offline training runs will want to use the 
> filesystem to access their training data.  Would phrasing this as 
> "Condensers should only access classfiles and resources through the 
> data model; they should ....." express the intent more clearly?

I think you've got the spirit of it.  In a phased execution such as you 
describe, the config files or training data also represent shifting 
data across phases.  Maybe that is embedded directly in the application 
configuration as resources, or maybe we have to support the notion of a 
"filesystem mount point" where we are asserting that a part of the 
runtime file system has been shifted in time to a place in the 
training-time file system.  Details TBD.

> And the last goal states:
> > Scrutable — It should be possible to answer the question,
> > “what did that condenser do?”
>
> Which I don't see explicitly addressed in the rest of the document.  
> It's a good goal.  Is the intention here that the ModelUpdater can be 
> interrogated to analyse the changes?  Are you envisioning a logging 
> mechanism of some sort here or something else?

I was thinking that the act of applying a ModelUpdater would optionally 
produce a log in a standardized format, so you could see what actually 
got done by a condenser.  I also imagine that we will want to add Log 
data to the data model eventually, so condensers can dump out analysis 
data and have it show up in the log in the right place.  The data model 
as it stands is clearly super-simplified, and will evolve a lot, but 
even this super-simplified version is enough to write condensers like 
"turn lambda capture into inner classes."

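A rough sketch of what "applying an update produces a log" could look 
like.  This is not the ModelUpdater API from the document; the 
RecordingUpdater class, the LogEntry record, the Action enum, and the 
class names in the usage example are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class UpdateLogSketch {
    public enum Action { ADD, REPLACE, REMOVE }

    // One line of the (hypothetical) standardized log.
    public record LogEntry(String condenser, Action action, String className) {}

    // Hypothetical updater: applies changes to a copy of the class table
    // and records each change that actually took effect.
    public static final class RecordingUpdater {
        private final String condenser;
        private final Map<String, byte[]> classes;
        private final List<LogEntry> log = new ArrayList<>();

        public RecordingUpdater(String condenser, Map<String, byte[]> classes) {
            this.condenser = condenser;
            this.classes = new TreeMap<>(classes); // never mutate the input model
        }

        public RecordingUpdater put(String className, byte[] bytes) {
            Action action = classes.containsKey(className) ? Action.REPLACE : Action.ADD;
            classes.put(className, bytes);
            log.add(new LogEntry(condenser, action, className));
            return this;
        }

        public RecordingUpdater remove(String className) {
            if (classes.containsKey(className)) {
                classes.remove(className);
                log.add(new LogEntry(condenser, Action.REMOVE, className));
            }
            return this;
        }

        public Map<String, byte[]> result() { return new TreeMap<>(classes); }

        public List<LogEntry> log() { return List.copyOf(log); }
    }

    public static void main(String[] args) {
        RecordingUpdater updater = new RecordingUpdater(
            "LambdaToInnerClass", Map.of("app.Main", new byte[0]));
        updater.put("app.Main", new byte[] {1})          // rewritten capture site
               .put("app.Main$Lambda1", new byte[] {2}); // generated inner class
        updater.log().forEach(System.out::println);
    }
}
```

Answering "what did that condenser do?" then amounts to reading the 
entries tagged with that condenser's name.
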
> * In the "Data model" section, how are duplicate classes on the 
> classpath handled?  Are Containers representing JARs on the classpath 
> explicitly ordered so as to linearize them to ensure the earliest 
> definition of a class wins?  Does the model need to expose the 
> classpath ordering?

The application model gives us a list of modules and a list of classpath 
entries.  Since multi-valued attributes preserve their order, this 
should be enough to preserve the existing story: that modules represent 
a partition of packages, and we resolve conflicts on the classpath with 
"first wins".  (We can also have condensers that assert properties like 
"no duplicates".)

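As a sketch of how "first wins" and a "no duplicates" assertion fall 
out of an ordered list of classpath entries: the Container record below 
is a hypothetical stand-in (a name plus the class names it defines), 
not the Container type from the document's data model.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ClasspathFirstWins {
    // Hypothetical stand-in for a classpath entry.
    public record Container(String name, List<String> classNames) {}

    // Map each class name to the first container on the classpath defining it.
    public static Map<String, String> resolve(List<Container> classpath) {
        Map<String, String> winners = new LinkedHashMap<>();
        for (Container container : classpath) {
            for (String className : container.classNames()) {
                winners.putIfAbsent(className, container.name()); // earlier entry wins
            }
        }
        return winners;
    }

    // Class names defined more than once; empty means "no duplicates" holds.
    public static List<String> duplicates(List<Container> classpath) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Container container : classpath) {
            for (String className : container.classNames()) {
                counts.merge(className, 1, Integer::sum);
            }
        }
        return counts.entrySet().stream()
            .filter(e -> e.getValue() > 1)
            .map(Map.Entry::getKey)
            .toList();
    }

    public static void main(String[] args) {
        List<Container> classpath = List.of(
            new Container("a.jar", List.of("p.A", "p.B")),
            new Container("b.jar", List.of("p.B", "p.C")));
        System.out.println(resolve(classpath));    // prints {p.A=a.jar, p.B=a.jar, p.C=b.jar}
        System.out.println(duplicates(classpath)); // prints [p.B]
    }
}
```

Because the only input is the order of the list, nothing beyond an 
order-preserving multi-valued attribute is needed to reproduce the 
existing behavior.
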
> Is ContainerKind missing a "directory" type as well?  It might be 
> possible to pretend a filesystem directory on the CP is a JAR for 
> model purposes, but that doesn't feel quite right.  Did you consider 
> directories when making the model? If they were deliberately excluded 
> it might help to expand on why in the document.

I'll defer this one to Paul.

> Should the Data model include "classloader" as a member? With modules 
> we can map which module will be loaded by which classloader and can 
> guess for most classpath entries.  For more complicated classloading 
> schemes (including self-first), it might be beneficial to model the 
> classloader network in the model as well.  This may be something that 
> a non-standard condenser could augment the model / analysis with.

Good question!  Not sure yet.

> * "Example: Lambda forms" contains the following sentence:
> > After condensing we can update the java.base module on the file
> > system from the updated application model.
>
> which should probably delegate the updating of the file system to the 
> Distiller.  Given the prohibition against filesystem access in the 
> "The application model" section, talking about the filesystem here 
> seems odd.

This seems like an error in the doc.  Will investigate.

> I'm looking forward to seeing the prototype and trying to port my 
> pregenerate lambdas jlink plugin to be a condenser.  I think it should 
> be a fairly smooth process.

Indeed, I think that will be a good validation test, and I expect it 
will go smoothly.