Looking ahead: proposed Hg forest consolidation for JDK 10

Tue Oct 18 13:59:18 UTC 2016

On 10/18/2016 01:01 PM, Maurizio Cimadamore wrote:
> Hi Erik - thanks for the comments. I indeed got the hardlink story
> backwards - which means the size of a local clone will be relatively
> stable in time - good news!
>
> Regarding share/bookmarks, that is a good suggestion. I tried bookmarks
> extensively, and I found them a bit too finicky to use with a single
> repo (as with branches, you need to be very aware of which bookmark you
> are on, or mistakes can be very common). Additionally, using just a
> single repo creates problems when you want to store 'project' specific
> metadata (temporary tests, IDE config and the like).
>
> So, using 'share' seems like a major step forward because it allows you
> to 'forget' about branches being there (each folder will be on a
> separate branch).
>
> One thing that will remain convoluted (if I'm understanding how
> bookmarks work correctly) is for instance when you want to fetch new
> changes from the remote repo. In that case, I think you need to:
>
> * go in the main repo forest (the one with the 'main' bookmark)
>
> * do a pull/update
>
> * go in the share you were working with
>
> * do an hg pull 'main'
>
> Is that correct? And, more importantly, could other bookmarks/shares be
> left as they are (i.e. not updated) ? I'd like the various shares to be
> as independent as possible.

Well, this is one case where you have to remember that all 'hg shares'
share the same "store". When you do `hg pull -u` in your 'default'
folder, this will advance the "@" bookmark (assuming that is the name of
your bookmark in the default folder) to the latest commit. However, your
other bookmarks won't be updated (intentionally).
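
A quick way to see this on disk, as a rough sketch: a share's .hg folder
holds a small `sharedpath` file pointing at the store it borrows. The
helper name `share_store` below is made up for illustration, and the
snippet assumes a POSIX shell.

```shell
# share_store DIR: print the .hg directory whose store DIR actually uses.
# (Hypothetical helper for illustration, not part of Mercurial.)
share_store() {
    dir=${1:-.}
    if [ -f "$dir/.hg/sharedpath" ]; then
        # A share: the file names the .hg directory of the repository
        # the share was created from.
        printf '%s\n' "$(cat "$dir/.hg/sharedpath")"
    else
        # A regular clone keeps its own store.
        printf '%s\n' "$dir/.hg"
    fi
}
```

Running `share_store bugfix-1` and `share_store consol-proto` should then
point at the same .hg directory (the share reports it as the path stored
in `sharedpath`), which is why a pull in one folder makes the new
changesets visible in every share.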

Now, when you change to the folder bugfix-1, the most common way to sync
your work on the anonymous branch described by the bookmark bugfix-1 with
the work on the default branch described by the bookmark "@" is to
rebase the changesets leading up to the bookmark bugfix-1 on top of the
changeset labeled @. Using my limited ASCII-art skills, this would look
like the following.

After doing `hg pull -u` in the 'default' directory:

                 * (@)
                 |
                 *
                 |
(bugfix-1)  *    *
            |    |
            *    *
            |    |
            *----*
                 |
                 *
                 |

You then change directory to 'bugfix-1' and run `hg rebase -d @`, which
means "rebase the changesets leading up to the active bookmark on top of
the changeset described by the @ bookmark". After running the above
command, the changeset graph would look like:

     (bugfix-1)  *
                 |
                 *
                 |
                 *
                 |
                 * (@)
                 |
                 *
                 |
                 *
                 |
                 *
                 |
                 *
                 |
                 *
                 |
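
Put together, the two steps could be wrapped in a small shell function.
This is only a sketch: `sync_share` is a made-up name, the directory
names are the ones used in this thread, and it assumes the rebase
extension is enabled and that "@" is the bookmark in the main folder.

```shell
# sync_share MAIN_DIR SHARE_DIR
# (Hypothetical helper for illustration, not part of Mercurial.)
sync_share() {
    main_dir=$1    # folder holding the "@" bookmark, e.g. consol-proto
    share_dir=$2   # share holding the work bookmark, e.g. bugfix-1

    # Step 1: fetch new changesets into the shared store and advance "@".
    (cd "$main_dir" && hg pull -u) || return 1

    # Step 2: move the share's changesets on top of "@". No pull is
    # needed here because the share uses the same store.
    (cd "$share_dir" && hg rebase -d @)
}
```

Usage would be `sync_share consol-proto bugfix-1`. If the rebase stops on
a conflict, you resolve it in the bugfix-1 folder and then run
`hg rebase --continue` there.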

To use rebase, you must enable the rebase extension in your .hgrc.
`hg rebase` uses the exact same logic as `hg merge` if a conflict arises.
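
For reference, a minimal sketch of the relevant part of ~/.hgrc, assuming
the extensions bundled with a stock Mercurial install (share is listed as
well, since the shares discussed earlier in this thread need it):

```ini
[extensions]
# Bundled extensions are enabled by naming them with an empty value.
rebase =
share =
```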

> On a separate note, langtools is one of those cases (there are probably
> others, like Nashorn) where the repo is currently fairly isolated from
> the rest of the JDK - meaning that you can just fetch a langtools repo,
> build it and test it in isolation. A similar case could arise if a JDK
> developer would like to work, say, only on a reduced set of JDK modules.
> While cloning all files is perfectly acceptable disk size-wise I find
> the lack of granularity a tad annoying in the general case (and I've
> built some tools to overcome that problem).

You mean that you can work on the langtools repository only? Do you only
need a boot JDK? How do you build in those cases? I assumed that
langtools, like all the other repositories, was built from the top-level
repository?

In general, with a consolidated forest, you will have to carry all the 
files in the repository along with you. If you just copy the langtools 
directory somewhere, then you won't get the hg metadata (there is no 
longer any .hg directory in the langtools directory). However, as you 
noticed, it is perfectly fine to only work with a small part of the 
files in the repository. I myself often build only hotspot by running 
`make hotspot` from the top-level directory. Or did I miss your point?

Thanks,
Erik

> Maurizio
>
>
> On 17/10/16 12:47, Erik Helin wrote:
>> Hi Maurizio,
>>
>> thanks for your feedback! Please see my replies inline.
>>
>> On 2016-10-13, Maurizio Cimadamore wrote:
>>> Hi Joe,
>>> some comments on this. As my workflow typically involves cloning one
>>> langtools repo for each new fix, I'll start with discussing local clones
>>> first. Starting with some concrete numbers, I am currently working on 2
>>> forests (jdk 9 and valhalla); between these two forests I currently
>>> have ~35 langtools clones (for various prototypes and bug fixes). Also,
>>> as I'm working on two machines, I keep them in sync using Unison, a very
>>> common sync tool in Linux land based on rsync.
>>>
>>> I have been experimenting with local clones, to see to what degree a
>>> local clone could save in terms of space. My findings are that a local
>>> clone takes around 800M - which seems consistent with the fact that
>>> Mercurial hardlinks the repo files but not the history, which is simply
>>> copied.
>> You might have gotten this the wrong way around. Mercurial will use
>> hard links for most of the metadata for local clones on a
>> file system that supports hard links. The source code files themselves
>> won't be hard links (otherwise, if you edited one file in one local
>> clone then the file sharing the same inode in another local clone would
>> get changed).
>>
>>> For people like me, working on langtools, that's quite a significant
>>> jump in terms of space - a clean langtools repo is around 150M. So, in
>>> my specific case, disk usage will jump from 150M * 35 =~ 5G to
>>> 800M * 35 =~ 28G (this is a very conservative estimate - since it's
>>> assuming that all files are hardlinked, which will not be the case as
>>> soon as I start making some changes in the local clones). While this is
>>> not a deal breaker in terms of disk space (my SSD has 200G in total), it
>>> puts serious strain on my ability to do regular syncing/backups.
>> Thanks for sharing your workflow! For this use case, could you perhaps
>> try out the `hg share` extension? You need to enable the extension in
>> your .hgrc. A share is like a clone, but Mercurial will share the store
>> folder between all shares. This is *not* done using hardlinks: if you
>> look in the .hg folder for a share, you will not see the "store" folder
>> (you will see a file named sharedpath instead).
>>
>> Using shares on their own can be a bit tricky, but if you combine them
>> with bookmarks, then you get a very powerful solution. In your case, I
>> would suggest the following:
>>
>> $ hg clone http://hg.openjdk.java.net/jdk9/consol-proto
>> $ cd consol-proto
>> $ hg bookmark '@' # traditional name for "master" bookmark
>> $ cd ..
>> $ hg share -B consol-proto bugfix-1
>> $ cd bugfix-1
>> $ hg bookmark 'bugfix-1'
>>
>> You will now end up with two directories, consol-proto and bugfix-1,
>> both looking like a full forest, but they will share the same Mercurial
>> store *and* list of bookmarks (but the active bookmark won't be shared).
>> Since the shares use different bookmarks, the work you do in a share
>> won't interfere with the work you do in another share (you will get
>> multiple heads, but each head will have a bookmark associated with it).
>>
>> For backing up, you now only need to back up the consol-proto
>> repository (it contains all the bookmarks and all commits). There is no
>> need to back up the shares, they can always be created from the
>> consol-proto repository.
>>
>> On my machine, using Linux 4.3.3 and ext4 as my filesystem, with hg
>> version 3.8.1, a share uses 661 MB of disk. If you now want 35 shares,
>> you would end up using 35 * 661 MB = 23135 MB, or about 22.6 GB. But you
>> only have to back up one repository!
>>
>>> Add to this the fact that most backup/syncing tools explicitly call out
>>> the hardlink case as being problematic. Unison doesn't support them,
>>> rsync supports them to some degree, and even some professional backup
>>> tools I'm using do not support them (or recommend doing without them
>>> anyway). So, local cloning could be a fine solution when working on one
>>> machine, but as soon as you start considering backups, you have
>>> trouble. For these reasons I will have to consider changing my
>>> day-to-day workflow, and to try and avoid relying on clones as much as I
>>> did - which poses issues: for instance, if I keep all my patches in the
>>> same repo (by using either MQ or bookmarks) - how do I differentiate
>>> between the different IDE projects to work on them?
>> If you are using shares as suggested above, you would have one folder
>> with all the source code for each bookmark.
>>
>>> Last but not least - if the local clone size I'm seeing now (800M) is
>>> almost entirely history-driven, and that already accounts for 50% of the
>>> total size - doesn't that mean that, e.g., in 2-3 years' time, the size
>>> of the history will trump the size of the files, meaning that the
>>> advantages of doing local clones will be smaller and smaller over time?
>> No, it is the other way around. For a local clone, you share all of the
>> history (using hard links). So the size of your local clones scales with
>> the size of the source files (the same is true for shares). This can
>> easily be verified by doing `du -ms .hg` for a share; I get 6 MB.
>>
>> Thanks,
>> Erik
>>
>>> On a separate and more <meta> note, it seems to me that this effort is
>>> two things at once:
>>>
>>> * a repo consolidation: use a single repo instead of a forest
>>> * a source restructuring
>>>
>>> Each of the above moves has risks and costs for people in OpenJDK land.
>>> For instance, as discussed above, the repo consolidation might mean
>>> significantly changing the workflow people use on a daily basis. At the
>>> same time, the source restructuring is posing issues for things like
>>> builds, IDE support, and the like.
>>>
>>> I wonder if it wouldn't be sensible to do the repo consolidation now,
>>> where the new repo is simply a consolidated version of the old one; no
>>> need to update build scripts to take into account new paths. Then, maybe
>>> in the next release (JDK 11), we could attack the source restructuring
>>> problem. This way people will have more time to adjust to the big
>>> changes that are coming.
>>>
>>> What do you think?
>>>
>>> Maurizio
>>>
>>>
>>> On 11/10/16 03:11, joe darcy wrote:
>>>> Hello,
>>>>
>>>> Looking ahead to JDK 10, a group of JDK engineers have been exploring
>>>> consolidating the large number of Hg repositories in an open JDK
>>>> forest to a single one with the goal of using the consolidated
>>>> arrangement for JDK 10.
>>>>
>>>> This message is being sent to jdk9-dev since a jdk10-dev alias to
>>>> discuss JDK 10 doesn't exist yet.
>>>>
>>>> A JEP describing the project has been submitted:
>>>>
>>>>     JDK-8167368: Consolidate JDK 10 OpenJDK repositories to a single
>>>>     repository
>>>>     https://bugs.openjdk.java.net/browse/JDK-8167368
>>>>
>>>> The text of the JEP describes the motivation and current state of the
>>>> work in more detail, including proposed changes to the file layout.
>>>> Publication of the prototype consolidated repository is planned, but
>>>> not done yet. The email below has a list of additional anticipated
>>>> questions and answers.
>>>>
>>>> We feel this consolidated arrangement offers some significant
>>>> structural advantages for managing the JDK's source code and we are now
>>>> asking for feedback on this potential change. In particular, if you
>>>> feel there is a show-stopper problem with making this change, please
>>>> let us know!
>>>>
>>>> I'd like to acknowledge the work of Stefan Sarne, Stuart Marks, and
>>>> Ingemar Aberg participating in discussions leading up to the prototype
>>>> and I'd like to especially recognize the contributions of Erik Helin
>>>> for savvy Hg manipulations and Erik Joelsson for skillful build
>>>> wrangling in this project.
>>>>
>>>> Please send initial comments by October 18, 2016.
>>>>
>>>> Cheers,
>>>>
>>>> -Joe
>>>>
>>>> Q: What about the set of forests for JDK 10? Are we going to have
>>>> master, dev, client, hotspot, etc. the same set as in 9?
>>>> A: That is a separate question from the repository consolidation, but
>>>> there will likely be simplifications here too. Discussions on that
>>>> point will come later.
>>>>
>>>> Q: I usually just build the code in repo X today. Will I have to
>>>> build the *whole JDK* now?
>>>> A: Not necessarily. The same top-level build targets should work in the
>>>> consolidated forest.
>>>>
>>>> Q: Does disk usage change?
>>>> A: The total disk usage of the current forest compared to the
>>>> consolidated forest is nearly the same.
>>>>
>>>> Q: In more detail, how were the changesets imported?
>>>> A: The scripts used for the consolidation conversion are attached to
>>>> the JEP.
>>>>
>>>> Q: What happens to the Hg hashes?
>>>> A: The conversion scheme used in the prototype does *not* preserve Hg
>>>> hashes of changesets compared to the current forests. However, the bug
>>>> ids are preserved and can be searched for. In addition, one or more
>>>> pre-consolidation forests should be archived in perpetuity so that URLs
>>>> in bug comments continue to work, etc.
>>>>
>>>> A mapping of the old hashes to the corresponding new hashes might be
>>>> generated and placed in the final new repo.
>>>>
>>>> Q: I'm allergic to tabs; what about jcheck?
>>>> A: If history is preserved, the checking done by jcheck needs to be
>>>> modified for the consolidated forest. One way to do this is to augment
>>>> the white lists used in jcheck with the conflicting changesets. This
>>>> approach may not be elegant, but it is effective and doesn't appear to
>>>> appreciably impact jcheck running times.
>>>>
>>>> Q: Will the future 9 update forest also have this consolidation
>>>> restructuring?
>>>> A: The script used to do the consolidation conversion is deterministic
>>>> and could be run to create the 9 update forest in the future at the
>>>> discretion of the 9 update team.
>>>>
>>>> Q: For backports or forwardports, will there be a script to translate
>>>> patch files across the consolidation boundary?
>>>> A: That work is planned, but not yet done; see JDK-8165623: Create
>>>> patch translator to update paths pre/post consolidation.
>>>>
>>>> Q: It's the 21st century and I develop using an IDE. That is still
>>>> going to work, right?
>>>> A: The prototype to date does not include updating the various IDE support
>>>> files, but bug JDK-8167142 has been filed to track that work.
>>>>
>