Looking ahead: proposed Hg forest consolidation for JDK 10

Wed Oct 12 12:16:04 UTC 2016

Hi Goetz,

thanks for looking through the JEP and providing us with feedback! Please see
my replies inline.

On 2016-10-12, Lindenmaier, Goetz wrote:
> Hi Joe, 
> 
> thanks for your detailed answer.  Unfortunately it 
> doesn't dispel my concerns.
> 
> > Hi Goetz,
> > 
> > On 10/11/2016 2:30 AM, Lindenmaier, Goetz wrote:
> > > Hi,
> > >
> > > I see several problems with this approach.
> > >
> > > 1.) Mercurial already has problems scaling with the current repositories.
> > >      This will get worse with bigger repos. E.g. 'hg diff' takes
> > >      14 secs on jdk, but only 2 secs on jaxp:
> > >      jdk:  ~90000 files, 15000 changes, hg diff takes  14 secs
> > >      jaxp: ~12000 files,  1000 changes, hg diff takes  2 secs
> > 
> > By its nature, hg diff needs to walk the directory tree so a bigger tree
> > will generally be slower. 
> Yes, and that's bad!
> 
> > Doing a diff on a particular subdirectory, say
> > for hotspot,  should have comparable performance as today.
> 
> The use case of hg diff is to find what was changed. Obviously, if I only 
> do it on the subdir, I might miss something. 

Do you ever run `hg tdiff` today? If not, how do you know that you are
not leaving out part of your patch when you send it out for review?

If you are comfortable doing `hg diff` in only the hotspot repository
today, then I don't see how running `hg diff hotspot` would make you
less comfortable.
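
A minimal sketch of what I mean, assuming the hotspot sources end up in a
top-level `hotspot` directory of the consolidated clone (the function
names are only for illustration):

```shell
# Show uncommitted changes below one subdirectory only.
# Passing a path restricts the command to that part of the tree.
diff_subtree() {
    hg diff "$1"
}

# Same idea for the list of modified/added/removed files.
status_subtree() {
    hg status "$1"
}

# Example, run from the root of the consolidated clone:
#   diff_subtree hotspot
```

This is the same check you do today inside the hotspot repository, only
spelled with a path argument.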

> > The fsmonitor extension,
> > https://www.mercurial-scm.org/wiki/FsMonitorExtension, could help in
> > this case too.
> > 
> > > 2.) Cloning the repo does not scale.
> > >      Cloning the root repo and calling get_source.sh takes 20 min.
> > >      I usually only clone the root repo and hotspot. This only
> > >      takes 3 min.
> > >      I don't think merging the repos might improve the 20 mins.
> > >      In contrary, as cloning the jdk repo takes most of the time,
> > >      and the others run in parallel, cloning an even bigger repo
> > >      will be slower.
> > >      Alternatively, one could hold a 'master' repo and replicate that
> > >      by local copy. But this shows similar timings (1:40 vs. 9min).
> > 
> > We've discussed this kind of use-case internally as well. The
> > recommendation is to have a designated local master and then do local
> > clones of that. On a unix system if the local clones are on the same
> > disk, hard links are used with a copy-on-write policy so the clones are
> > space-efficient and time-efficient to create. The local clone times
> > we've seen are about 30 seconds in that case.
> 
> I would have to run the watchman on all the machines I happen to 
> work on. A possible solution imposing work on every user.

Again, working with local clones does scale. Do you use an SSD on the
machines you are working on? If so, then cloning the consolidated
repository locally shouldn't take more than approx 30 sec.
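
Roughly, the local-master workflow could look like this. It is a sketch
only: the paths are hypothetical, and the hard-link behaviour assumes
that master and clone live on the same filesystem.

```shell
# Create a working clone from a local master repository.
# For a local source on the same filesystem, Mercurial hard-links the
# repository data instead of copying it, which keeps this fast.
clone_from_master() {
    master="$1"   # e.g. $HOME/repos/jdk10-master (hypothetical path)
    target="$2"   # e.g. $HOME/work/my-fix (hypothetical path)
    hg clone "$master" "$target"
}

# Refresh the master from upstream now and then, and pull from the
# master into the working clone (its default path is the master):
#   hg -R "$HOME/repos/jdk10-master" pull
#   hg -R "$HOME/work/my-fix" pull -u
```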

As for watchman, that is only used to speed up `hg status`, so running
e.g. `hg diff` or `hg status` in a subdirectory as I explained will
yield similar benefits.

> > > 3.) Having to clone the full repos will require considerably more
> > >      disk space.
> > >      I'm working on various issues in hotspot and keep them separated
> > >      by doing this in individual repositories that only contain hotspot.
> > >      These repos will require considerably more space.
> 
> > If disk space is a concern, you can use mq or bookmarks against a single
> > repo.
> 
> I use mq a lot.  But often for separate tasks separate repos are required.
> Say, I'm working on
>   - testing a change of someone other against head revision to review it.
>   - developing the s390 port with a mq that contains 10 patches
>   - looking for a performance regression by syncing to older revisions, 
>     building and running benchmarks in a script. 
> You can't combine such tasks with a mq in one repo.

No, but you can with e.g. bookmarks (or branches) and the CONF feature
of the build system. I, for example, have one CONF per bookmark. That
means I can update my code to the feature/bug I'm currently working on,
make some changes, and get an incremental compilation (instead of
rebuilding).
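
A sketch of that setup, assuming one build configuration named after each
bookmark (the name `my-fix` and the `images` target are only examples):

```shell
# Switch the working copy to a bookmark and build it in the build
# configuration of the same name, so each task keeps its own output.
build_bookmark() {
    name="$1"   # bookmark name, reused as the CONF name
    hg update "$name" &&
    make CONF="$name" images
}

# One-time setup per task (hypothetical names):
#   hg bookmark my-fix
#   bash configure --with-conf-name=my-fix
# Afterwards:
#   build_bookmark my-fix
```

Since every bookmark builds into its own configuration directory,
switching tasks does not throw away the build output of the other ones.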

If you prefer separate repositories, then again, local clones will help
decrease the overhead. If you are open to discussing/sharing your
workflow and trying out some features of Mercurial and the build system,
then I'm confident we can find an effective way for you to work with a
consolidated forest.

> > > 4.) There will be additional merges because changes that are now done
> > >      in two repos will then be done in a single repo. If I then sync back
> > >      a few hotspot changes, a lot of files in the other subdirectories
> > >      will get touched. This slows down sync and causes rebuilds.
> > >      Sure this might just be what is intended, but currently I don't
> > >      need to rebuild jdk etc. very often.
> > 
> > While hotspot and the rest of the JDK can often be treated as
> > approximately independent, they are not truly independent.
> 
> Yes, but they _are_ approximately independent. That suffices to 
> avoid lots of boilerplate work.
> In other SCM systems you can sync back only a subdirectory.
> Mercurial does not support that.

If I understand correctly, you work mostly in the hotspot repository
(and sometimes in the top-level)? If that is the case, then you are not
feeling the pain of doing dependent changes between top-level, hotspot
and the jdk. Many other developers feel this pain clearly, particularly
developers working with:
- performance (often requires dependent changes in both JDK and hotspot)
- build (often requires dependent changes in all repos)
- runtime (the tools in the jdk, e.g. jstat, often require
  dependent changes in jdk and hotspot)
- testing (tests often need to be updated in jdk, hotspot and
  top-level)
Furthermore, many of us are interested in all changes going in, because
there might be performance or functional regressions introduced due to
changes in another repository. Having a way to perform a bisect
tremendously helps in those situations.
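
With a single repository, such a bisect could be driven roughly like
this. It is a sketch: the revisions and the check command are
placeholders you would have to supply.

```shell
# Bisect between a known-good and a known-bad revision, running a
# check command at each step. Exit status 0 marks the revision as good,
# 125 skips it, and most other non-zero statuses mark it as bad.
bisect_regression() {
    good="$1"; bad="$2"; shift 2
    hg bisect --reset &&
    hg bisect --good "$good" &&
    hg bisect --bad "$bad" &&
    hg bisect --command "$*"
}

# Example (placeholders, not real revisions):
#   bisect_regression <good-rev> <bad-rev> sh check.sh
```

The check command would typically build the tree and run the failing
test or benchmark. That only gives meaningful answers when every
revision describes a complete, consistent source tree.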

Again, if you mostly work in one repository, then you will not have been
exposed to these problems. However, many developers in the OpenJDK
project tend to work across many repositories.

> > > 5.) It will get harder to monitor submitted changes that are relevant
> > >      for a specific area. E.g., I might only want to see changes in hotspot.
> > >      In the web frontend, you can not browse changes on subdirectory basis.
> > >      Maybe this can be solved, as the commandline 'hg log' etc. already
> > support
> > >      this.
> > 
> > We don't have plans to change the Hg web UI so I think a command line
> > solution would be appropriate here.
> 
> You should consider fixing this, maybe as a follow up.  You can already 
> browse file history,  This should be also possible for directories.

That is a feature request that needs to be sent to the Mercurial
project. The web UI of the OpenJDK repositories comes from the `hg
serve` command; it is not something that Oracle has developed. I know
that Mozilla recently awarded a grant to the Mercurial project to
improve the web UI [0], so the Mercurial developers are probably
interested in this kind of feedback.

> > > 6.) A single repo will simplify making combined changes. So there will be
> > >      more of these. But combined changes complicate handling of our
> > >      licensed   code.
> > >      In our activities as licensee, we are consuming hotspot change-wise.
> > >      This is because we modified a lot in hotspot, and merging hotspot
> > >      changes step by step simplifies the merging.
> > >      On the other side, we consume the changes to jdk etc. as chunks.
> > >      This is because we changed much less in these directories so
> > >      that merging causes less problems. Also, there are much more
> > >      changes and we don't have the manpower to consume them change-
> > wise.
> > >      Having combined changes requires more synchronization between
> > >      the two merging tasks. It's already an increasing effort in
> > >      jdk9.
> > >      Also, to follow these two different merging approaches for hotspot
> > >      and the rest, we would have to first split the single repo into
> > >      two parts.
> > >
> > >
> > > Comments to the JEP:
> > >
> > > I appreciate that the change history is kept as it makes research
> > > in old changes more easy. On the other side, dropping the history
> > > might speed up handling of the new repo.
> > 
> > We are aware that Facebook has developed Hg plugins to allow shallow
> > clones, i.e. clones without all the history, but we haven't investigated
> > using them yet.
> > 
> > >
> > > I also appreciate the changes in directory layout. If the
> > > repos are merged, this should be done this way.
> > >
> > > We find it difficult to keep the jtreg runner in sync with our
> > > current version of jdk9, especially as we have two of them (We
> > > test openJdk and SAP JVM 9, and within SAP JVM 9 hotspot and
> > > jdk often differ in a few builds.)
> > > I would appreciate if the runner could be included in the
> > > root/test directory.
> > 
> > I'm not quite sure what you are referring to by the jtreg runner.
> 
> I mean the code in http://hg.openjdk.java.net/code-tools/jtreg
> 
> As Andrew stated, some subdirectories are pretty stable. It
> might completely make sense to merge these into one repository, but I'm
> really concerned about jdk and hotspot. 
> 
> In general, I think those people that are highly specialized on complex
> subcomponents of the VM will suffer from this.  They often are fine
> just working with hotspot / jdk etc..  In general, these people develop
> new components in the latest branch.

You mean working on tip? The OpenJDK repositories do not make use of
branches (besides having the sole default branch).

> Those people that have to maintain and test the VM really will profit
> from the new setup.  They anyways always operate with the full 
> repo tree.
> Having this said, I think it would make more sense to put the legacy code
> base into merged repos, and not the development branch?

When you say branch, do you mean "forest" (the wording is important here
for me to understand, since branch is also a concept in Mercurial)? That
is, do you think of jdk9/dev, jdk9/hs and jdk9/jdk9 as branches?

I personally always work with the "full tree" since it is crucial to
develop changes on top of a stable "tree configuration". Even though
hotspot and the jdk are usually compatible with each other give or take
a few days, there have been plenty of situations where having the
repositories in an untested configuration resulted in rather funky
behavior.

Again, I'm pretty confident that we can find a way for you (and the rest
of the SAP contributors) to work effectively with a consolidated OpenJDK
repository. I just need to learn more about your particular use case to
come up with a nice solution. And you guys of course have to be willing
to change your workflow slightly :)

Thanks,
Erik

[0]: https://blog.mozilla.org/blog/2015/12/10/mozilla-open-source-support-first-awards-made/

> Best regards,
>   Goetz.
> 
> 
>