Looking ahead: proposed Hg forest consolidation for JDK 10

Tue Oct 18 11:01:11 UTC 2016

Hi Erik - thanks for the comments. I indeed got the hardlink story 
backwards - which means the size of a local clone will be relatively 
stable in time - good news!
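
As a minimal sketch of why hard-linked history costs almost nothing extra, the plain shell snippet below creates two names for one file and checks that they share an inode, so the bytes are stored once. It involves no Mercurial at all, and the file names are made up for the illustration. It also shows why working files cannot be shared this way: an in-place edit through one name is visible through the other.

```shell
# Hypothetical file names; this only demonstrates generic hard-link behaviour.
tmp=$(mktemp -d)
printf 'history data\n' > "$tmp/original"
ln "$tmp/original" "$tmp/linked"        # second name for the same inode

# Both names report the same inode number, so the data exists once on disk.
inode_a=$(ls -i "$tmp/original" | awk '{print $1}')
inode_b=$(ls -i "$tmp/linked" | awk '{print $1}')
echo "inodes: $inode_a $inode_b"

# An in-place append through one name shows up under the other name too.
printf 'more data\n' >> "$tmp/linked"
lines=$(wc -l < "$tmp/original" | tr -d ' ')
echo "lines seen via the original name: $lines"
```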

Regarding share/bookmarks, that is a good suggestion. I tried bookmarks
extensively, and I found them a bit too finicky to use with a single
repo (as with branches, you need to be very aware of which bookmark you
are on, or mistakes can be very common). Additionally, using just a
single repo creates problems when you want to store 'project' specific
metadata (temporary tests, IDE config and the like).

So, using 'share' seems like a major step forward because it allows you 
to 'forget' about branches being there (each folder will be on a 
separate branch).

One thing that will remain convoluted (if I'm understanding how 
bookmarks work correctly) is for instance when you want to fetch new 
changes from the remote repo. In that case, I think you need to:

* go in the main repo forest (the one with the 'main' bookmark)

* do a pull/update

* go in the share you were working with

* do an hg pull 'main'

Is that correct? And, more importantly, could other bookmarks/shares be 
left as they are (i.e. not updated) ? I'd like the various shares to be 
as independent as possible.
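
Spelled out as commands, the four steps above would look roughly like the sketch below. It is only an illustration of the steps as listed: the function name is made up, the directory names are the ones from the example quoted further down, and whether the final pull is needed at all for a share is exactly the open question.

```shell
# Sketch only; fetch_into_share is a hypothetical helper, not an hg command.
HG=${HG:-hg}    # set HG="echo hg" to print the commands instead of running them

fetch_into_share() {
    main=$(cd "$1" && pwd) || return 1    # repo with the main bookmark, e.g. consol-proto
    share=$(cd "$2" && pwd) || return 1   # share being worked on, e.g. bugfix-1

    # Steps 1-2: pull new changesets from the remote and update the main checkout.
    (cd "$main" && $HG pull && $HG update) || return 1

    # Steps 3-4: go to the share and pull from the main repo.
    # Assumption: this may turn out to be a no-op if the share already sees
    # everything in the store it shares with the main repo.
    (cd "$share" && $HG pull "$main")
}
```

A dry run with `HG="echo hg"` prints the three hg invocations in order without touching any repository, e.g. `fetch_into_share consol-proto bugfix-1`.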

On a separate note, langtools is one of those cases (there are probably 
others, like Nashorn) where the repo is currently fairly isolated from 
the rest of the JDK - meaning that you can just fetch a langtools repo, 
build it and test it in isolation. A similar case could arise if a JDK 
developer would like to work, say, only on a reduced set of JDK modules. 
While cloning all files is perfectly acceptable disk size-wise, I find
the lack of granularity a tad annoying in the general case (and I've 
built some tools to overcome that problem).

Maurizio

On 17/10/16 12:47, Erik Helin wrote:
> Hi Maurizio,
>
> thanks for your feedback! Please see my replies inline.
>
> On 2016-10-13, Maurizio Cimadamore wrote:
>> Hi Joe,
>> some comments on this. As my workflow typically involves cloning one
>> langtools repo per each new fix, I'll start with discussing local clones
>> first. Starting with some concrete numbers, I am currently working on 2
>> forests (jdk 9 and valhalla); between these two forests I currently have ~35
>> langtools clones (for various prototypes and bug fixes). Also, as I'm
>> working on two machines, I keep them in sync using Unison, a very common
>> sync tool in Linux land based on rsync.
>>
>> I have been experimenting with local clones, to see to which degree a local
>> clone could save in terms of space. My findings are that a local clone takes
>> around 800M - which seems consistent with the fact that Mercurial hardlinks
>> the repo files but not the history, which is simply copied.
> You might have gotten this the wrong way around. Mercurial will use
> hard links for most of the metadata for local clones on a
> file system that supports hard links. The source code files themselves
> won't be hard links (otherwise, if you edited one file in one local
> clone then the file sharing the same inode in another local clone would
> get changed).
>
>> For people like me, working on langtools, that's quite a significant jump in
>> terms of space - a clean langtools repo is around 150M. So, in my specific
>> case, disk usage will jump from 150M * 35  =~ 5G  to 800M * 35 =~ 28G (this
>> is a very conservative estimate - since it's assuming that all files are
>> hardlinked, which will not be the case as soon as I start making some
>> changes in the local clones). While this is not a deal breaker in terms of
>> disk spaces (my SSD has 200G in total), it poses serious strain on my
>> ability to do regular syncing/backups.
> Thanks for sharing your workflow! For this use case, could you perhaps
> try out the `hg share` extension? You need to enable the extension in
> your .hgrc. A share is like a clone, but Mercurial will share the store
> folder between all shares. This is *not* done using hardlinks; if you
> look in the .hg folder for a share, you will not see the "store" folder
> (you will see a file named sharedpath instead).
>
> Using shares on their own can be a bit tricky, but if you combine them
> with bookmarks, then you get a very powerful solution. In your case, I
> would suggest the following:
>
> $ hg clone http://hg.openjdk.java.net/jdk9/consol-proto
> $ cd consol-proto
> $ hg bookmark '@' # traditional name for "master" bookmark
> $ cd ..
> $ hg share -B consol-proto bugfix-1
> $ cd bugfix-1
> $ hg bookmark 'bugfix-1'
>
> You will now end up with two directories, consol-proto and bugfix-1,
> both looking like a full forest, but they will share the same Mercurial
> store *and* list of bookmarks (but the active bookmark won't be shared).
> Since the shares use different bookmarks, the work you do in a share
> won't interfere with the work you do in another share (you will get
> multiple heads, but each head will have a bookmark associated with it).
>
> For backing up, you now only need to back up the consol-proto
> repository (it contains all the bookmarks and all commits). There is no
> need to back up the shares, they can always be created from the
> consol-proto repository.
>
> On my machine, using Linux 4.3.3 and ext4 as my filesystem, with hg
> version 3.8.1, a share uses 661 MB of disk. If you now want 35 shares,
> you would end up using 35 * 661 = 22.6 GB. But you only have to back up
> one repository!
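
For comparison, the totals discussed in this thread can be recomputed with shell arithmetic. This is just a sketch using the per-copy sizes quoted in the thread (150M, 800M and 661 MB) and the count of 35; the variable names are made up. Note that 23135 MB divided by 1024 gives the 22.6 GB figure above.

```shell
# Figures quoted in this thread, all in MB; variable names are illustrative.
copies=35
langtools_repo=150    # one standalone langtools repo
local_clone=800       # one local clone of the consolidated repo
share=661             # one share of the consolidated repo

total_langtools=$((copies * langtools_repo))
total_clones=$((copies * local_clone))
total_shares=$((copies * share))

echo "35 langtools repos: $total_langtools MB"
echo "35 local clones:    $total_clones MB"
echo "35 shares:          $total_shares MB"
```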
>
>> Add to this the fact that most backup/syncing tools explicitly call out the
>> hardlink case as being problematic. Unison doesn't support them, rsync
>> supports them to some degree, and even some professional backup tools I'm
>> using do not support them (or recommend to do without them anyway). So, local
>> cloning could be a fine solution when working on one machine, but as soon as
>> you start considering backups, you run into trouble. For these reasons I will
>> have to consider changing my day to day workflow, and to try and avoid
>> relying on clone as much as I did - which poses issues: for instance, if I
>> keep all my patches in the same repo (by using either MQ or bookmarks) - how
>> do I differentiate between the different IDE projects to work on them?
> If you are using shares as suggested above, you would have one folder
> with all the source code for each bookmark.
>
>> Last but not the least - if the local clone size I'm seeing now (800M) is
>> almost entirely history-driven, and that already accounts for 50% of the
>> total size - doesn't that mean that, e.g. in 2-3 years' time, the size of the
>> history will trump the size of the files, meaning that the advantages of
>> doing local clones will be smaller and smaller over time?
> No, it is the other way around. For a local clone, you share all of the
> history (using hard links). So the size of your local clones scales with
> the size of the source files (the same is true for shares). This can easily
> be verified by doing `du -ms .hg` for a share, I get 6 MB.
>
> Thanks,
> Erik
>
>> On a separate and more <meta> note, it seems to me that this effort is two
>> things at once:
>>
>> * a repo consolidation: use a single repo instead of a forest
>> * a source restructuring
>>
>> Each of the above moves has risks and costs for people in the OpenJDK land.
>> For instance, as discussed above, the repo consolidation might mean
>> significantly changing the workflow people use on a daily basis (see above).
>> At the same time, the source restructuring is posing issues for things like
>> builds, IDE support, and the like.
>>
>> I wonder if it wouldn't be sensible to do the repo consolidation now, where
>> the new repo is simply a consolidated version of the existing forest; no need to
>> update build scripts to take into account new paths. Then, maybe in the next
>> release (JDK 11), we could attack the source restructuring problem. This way
>> people will have more time to adjust to the big changes that are coming.
>>
>> What do you think?
>>
>> Maurizio
>>
>>
>> On 11/10/16 03:11, joe darcy wrote:
>>> Hello,
>>>
>>> Looking ahead to JDK 10, a group of JDK engineers have been exploring
>>> consolidating the large number of Hg repositories in an open JDK forest to
>>> a single one with the goal of using the consolidated arrangement for JDK
>>> 10.
>>>
>>> This message is being sent to jdk9-dev since a jdk10-dev alias to discuss
>>> JDK 10 doesn't exist yet.
>>>
>>> A JEP describing the project has been submitted:
>>>
>>>     JDK-8167368: Consolidate JDK 10 OpenJDK repositories to a single
>>> repository
>>>     https://bugs.openjdk.java.net/browse/JDK-8167368
>>>
>>> The text of the JEP describes the motivation and current state of the work
>>> in more detail, including proposed changes to the file layout. Publication
>>> of the prototype consolidated repository is planned, but not done yet. The
>>> email below has a list of additional anticipated questions and answers.
>>>
>>> We feel this consolidated arrangement offers some significant structural
>>> advantages for managing the JDK's source code and we are now asking for
>>> feedback on this potential change. In particular, if you feel there is a
>>> show-stopper problem with making this change, please let us know!
>>>
>>> I'd like to acknowledge the work of Stefan Sarne, Stuart Marks, and
>>> Ingemar Aberg participating in discussions leading up to the prototype and
>>> I'd like to especially recognize the contributions of Erik Helin for savvy
>>> Hg manipulations and Erik Joelsson for skillful build wrangling in this
>>> project.
>>>
>>> Please send initial comments by October 18, 2016.
>>>
>>> Cheers,
>>>
>>> -Joe
>>>
>>> Q: What about the set of forests for JDK 10? Are we going to have master,
>>> dev, client, hotspot, etc. the same set as in 9?
>>> A: That is a separate question from the repository consolidation, but
>>> there will likely be simplifications here too. Discussions on that point
>>> will come later.
>>>
>>> Q: I usually just build the code in repo X today. Will I have to
>>> build the *whole JDK* now?
>>> A: Not necessarily. The same top-level build targets should work in the
>>> consolidated forest.
>>>
>>> Q: Does disk usage change?
>>> A: The total disk usage of the current forest compared to the consolidated
>>> forest is nearly the same.
>>>
>>> Q: In more detail, how were the changesets imported?
>>> A: The scripts used for the consolidation conversion are attached to the
>>> JEP.
>>>
>>> Q: What happens to the Hg hashes?
>>> A: The conversion scheme used in the prototype does *not* preserve Hg
>>> hashes of changesets compared to the current forests. However, the bug ids
>>> are preserved and can be searched for. In addition, one or more
>>> pre-consolidation forests should be archived in perpetuity so that URLs in
>>> bug comments continue to work, etc.
>>>
>>> A mapping of the old hashes to the corresponding new hashes might be
>>> generated and placed in the final new repo.
>>>
>>> Q: I'm allergic to tabs; what about jcheck?
>>> A: If history is preserved, the checking done by jcheck needs to be
>>> modified for the consolidated forest. One way to do this is to augment the
>>> white lists used in jcheck with the conflicting changesets. This approach
>>> may not be elegant, but it is effective and doesn't appear to appreciably
>>> impact jcheck running times.
>>>
>>> Q: Will the future 9 update forest also have this consolidation
>>> restructuring?
>>> A: The script used to do the consolidation conversion is deterministic and
>>> could be run to create the 9 update forest in the future at the
>>> discretion of the 9 update team.
>>>
>>> Q: For backports or forwardports, will there be a script to translate
>>> patch files across the consolidation boundary?
>>> A: That work is planned, but not yet done; see JDK-8165623: Create patch
>>> translator to update paths pre/post consolidation.
>>>
>>> Q: It's the 21st century and I develop using an IDE. That is still going
>>> to work, right?
>>> A: The prototype to date does not include updating the various IDE support
>>> files, but bug JDK-8167142 has been filed to track that work.
>>>