Newbie dumb question

Thu Aug 1 22:53:12 UTC 2024

Hi Archie,

In our current approach, code to be reflected over needs to be explicitly identified. You have to opt in and grant permission for the code you wrote to be reflected over, accessed by something else, and potentially interpreted differently, perhaps with explicit knowledge of that. Such reflection is not transitive - the code of invoked methods is not reflected over, since you may not have permission to do so.
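
To make the opt-in and non-transitive aspects concrete, here is a minimal sketch in plain Java. It does not use Babylon's API: the `Reflectable` annotation and the `codeModelOf` helper are hypothetical stand-ins, and the returned string is only a placeholder for a real code model. It shows only that a lookup succeeds for a method that opted in and fails for a method it invokes that did not.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.Optional;

// Hypothetical opt-in marker for illustration only; NOT Babylon's annotation.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Reflectable {}

class OptInDemo {
    // Opted in: the author grants permission for this body to be reflected over.
    @Reflectable
    static int scaleAndClamp(int x) {
        return clamp(x * 2);
    }

    // Not opted in: invoked by scaleAndClamp, but its body stays opaque.
    static int clamp(int x) {
        return Math.max(0, Math.min(100, x));
    }

    // Placeholder for "give me the code model": present only for opted-in methods.
    static Optional<String> codeModelOf(String methodName) {
        try {
            Method m = OptInDemo.class.getDeclaredMethod(methodName, int.class);
            return m.isAnnotationPresent(Reflectable.class)
                    ? Optional.of("model of " + m.getName())
                    : Optional.empty();
        } catch (NoSuchMethodException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // The opted-in method has a model; the method it calls does not.
        System.out.println("scaleAndClamp: " + codeModelOf("scaleAndClamp").isPresent());
        System.out.println("clamp: " + codeModelOf("clamp").isPresent());
    }
}
```

Anyone analysing `scaleAndClamp` in this sketch sees a call to `clamp` but cannot look inside it, which is the "hole" the original question is asking about.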

We are not generally trying to support the transformation of whole Java programs to other programming platforms (similar I guess to transpilation). Nor are we trying to enhance the Java platform to provide a universal intermediate language. A main goal of Babylon is to broaden Java’s reach to foreign programming models. Arguably a more realistic and less ambitious goal :-)

In that respect, reflected code is likely to be scoped and constrained to what is needed; it may differ from the Java programming model and will have been written with that knowledge in mind. This is the case for the support of GPU kernels, where the use of certain language constructs and APIs (such as HashMap) does not translate very well to, say, CUDA or PTX - we are not trying to preserve the Java programming model and Java APIs when executing on the GPU. I think the same applies, albeit with different constraints, to many other use cases - not all code is suitable for parallel execution, not all code is differentiable, not all code is translatable to SQL statements, not all code represents a machine learning model.
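
As a rough illustration of what "written with that knowledge in mind" can look like for the GPU case, the sketch below shows a kernel-style method in plain Java: one work-item index, primitive arrays, no allocation, and no collections. The class and method names are hypothetical, and `launch` is only a CPU stand-in for a GPU dispatch, not any Babylon API.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

class KernelStyleDemo {
    // Kernel-style body: each invocation touches only element i of flat primitive arrays.
    static void saxpy(int i, float a, float[] x, float[] y) {
        y[i] = a * x[i] + y[i];
    }

    // CPU stand-in for launching n independent work items (assumption: a real
    // GPU backend would map each i to a thread index instead).
    static void launch(int n, float a, float[] x, float[] y) {
        IntStream.range(0, n).parallel().forEach(i -> saxpy(i, a, x, y));
    }

    public static void main(String[] args) {
        float[] x = {1f, 2f, 3f};
        float[] y = {10f, 20f, 30f};
        launch(x.length, 2f, x, y);
        System.out.println(Arrays.toString(y));
    }
}
```

The same computation written against a HashMap would involve hashing, boxing, and heap allocation, none of which has an obvious counterpart in a CUDA or PTX kernel.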

At the moment we also cannot model all Java language constructs; local class declarations are particularly problematic. We are striving to model as much as we can (nearly all expressions and statements are currently modeled to some degree of fidelity), but we may hit limits in terms of complexity or unnecessary duplication with existing information in class files, such as class declarations. For the analysis of reflected code I think we are in reasonable shape.

Note that we are not limiting ourselves to the source compiler producing code models from its internal AST. We are also actively exploring lifting code models from bytecode, assuming one has access to class files. That may be more in the spirit of what you are thinking if bytecode is not suitable for such purposes. However, there are many traps in doing so. Much of Java’s core libraries are naturally written with Java and the workings of the JVM in mind (what else would we have in mind :-) ). Trying to reinterpret that code differently, or to emulate it, will be problematic.

Paul.

> On Jul 24, 2024, at 7:35 AM, Archie Cobbs <archie.cobbs at gmail.com> wrote:
> 
> Hi,
> 
> Got a newbie dumb question...
> 
> Why not have the compiler mandatorily generate code models for all class files?
> 
> Motivation: I'm interested in how Babylon could facilitate the general use case of remote execution, e.g. for database access (i.e., moving the code to the data vs. the other way around like we do today), high-performance computing, homomorphic encryption, or some other purpose:
>     1. We want to describe a computation/algorithm that will execute on some remote, non-Java runtime
>     2. We want developers to be able to write the computation in completely normal Java and have it translated and serialized
>     3. "Normal" includes the use of usual classes like java.util.HashMap, Guava utility classes, etc.
> This is similar to the GPU use case except for #3 - AFAICT, with current Babylon your reflected code model doesn't include classes like HashMap because they aren't code-reflected - correct?
> 
> In an ideal world, it should be possible to access the code model for every non-native method, so there are no "holes" in your view of the code. Otherwise, when coding up your algorithm you'll have one hand tightly tied behind your back - no collections, no streams, no Guava, no Apache commons-foobar, etc.
> 
> Then you could really go to town optimizing your code - virtual dispatch becomes non-virtual, escape analysis eliminates heap allocations, lots of methods inlined, etc. You could take a huge Java processing pipeline and lower it down to target a relatively simplistic sandbox-style virtual machine environment (could be WASM, SQL stored procedure, or even machine code). For scenarios where the pipeline is going to be executed frequently (e.g., database query) it is well worth it to pay this one-time, upfront cost for these extensive optimizations.
> 
> Of course, the code couldn't do native code things like Thread.start() without some kind of shim... and there may be more native code "gotchas" lying around in the standard Java libraries than one might think. But I'm guessing most stuff you would want to send over the network for the above examples of remote execution would need few if any native methods.
> 
> Side note: This situation reminds me of when I first switched from svn to git and gasped at the notion that git stores the entire revision history on your local disk. It turns out to work great, and all the scary-seeming downsides are not an issue (e.g., space is not a problem given today's disks and the fact that git just ZIPs up the past history into giant "packs"). Moreover, there are important upsides like speed, private branches, repo portability, server/client symmetry, the ability to work on an airplane, etc.
> 
> Similarly, I'm sure there are a lot of "obvious" reasons to not make code models mandatory, like disk space, compiler speed, and fear of mass panic. But in the end these may also be non-issues, especially if having code models available for all classes opens up a lot of use cases that aren't available now. Maybe there are some simple things that could be done to quiet the naysayers, like storing the JRE code models in a separate JAR file, etc.
> 
> Put another way, I think we need to think bigger...  e.g., project Babylon could position Java as a universal "starting language" so to speak.
> 
> -Archie
> 
> -- 
> Archie L. Cobbs