Consolidated repo vs. old forest size differences
Erik Joelsson
erik.joelsson at oracle.com
Fri Sep 29 11:20:10 UTC 2017
Hello Volker,
As the implementor of the conversion strategy, I have investigated and
played around with this a lot. There are two main contributors to the
increased size.
* First, as Joe said, the file moves, which account for around 350MB. A
similar (though not quite as big) size increase occurred when we
reorganized the source into modules.
* The second is caused by inefficient storage in the "manifest" file in
the metadata. This file can grow large when adjacent changesets have
large diffs between them, typically from non-linear history. By merging
together previously unrelated histories, we created quite a few of
those. If you check your .hg dir, you will see that this file is around
400MB.
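
A rough sketch of how to check this locally, assuming the default store layout in which the manifest revlog lives in .hg/store/00manifest.i and, once it grows large, .hg/store/00manifest.d (the helper name manifest_size_kb is made up for illustration):

```shell
# Sketch: report the on-disk size (in KB) of a Mercurial repo's manifest revlog.
# Assumes the default store layout; manifest_size_kb is a hypothetical helper.
manifest_size_kb() {
    repo="${1:-.}"
    total=0
    for f in "$repo"/.hg/store/00manifest.i "$repo"/.hg/store/00manifest.d; do
        # Small manifests keep everything inline in the .i file, so the .d
        # file may not exist; only count files that are actually present.
        if [ -f "$f" ]; then
            kb=$(du -k "$f" | cut -f1)
            total=$((total + kb))
        fi
    done
    echo "$total"
}

# Example (repo path is a placeholder): manifest_size_kb jdk10-hs
```

Comparing that number against the output of du -sk on the whole .hg directory shows how much of the metadata is taken up by the manifest alone.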
There is a way to reduce the size of the manifest file, but it requires
a newer version of Mercurial than we have on our servers and the feature
is not well documented (see aggressiveMergeDeltas). Because of this, we
have been hesitant to deploy that change on our main servers. If you are
curious, you can convert your repository locally using a command like this:
    hg --config=format.generaldelta=1 \
       --config=format.aggressivemergedeltas=1 \
       clone --pull <jdk10-master> <new-jdk10-master>
That should result in a manifest file size of around 40MB and bring down
the metadata total size to around 1100MB. All of the conversion was done
with this option enabled, but by pushing to an older server, the
"compression" didn't stick, which it otherwise would have.
/Erik
On 2017-09-29 01:09, Volker Simonis wrote:
> Hi Joe,
>
> thanks for the quick answer. Yes, that explains the double size. So at
> least it is a one time effect only.
>
> Regards,
> Volker
>
>
> On Fri, Sep 29, 2017 at 12:48 AM, joe darcy <joe.darcy at oracle.com> wrote:
>> Hi Volker,
>>
>> When there is a file move in Hg, it starts a fresh snapshot of the file.
>> With the reorganized source structure, having a single top-level src
>> subdirectory instead of jdk/src, langtools/src, etc. all the source files
>> were moved.
>>
>> This accounts for much of the greater size post consolidation. We're looking
>> into some ways to mitigate the impact of the larger size.
>>
>> Thanks,
>>
>> -Joe
>>
>>
>>
>> On 9/28/2017 3:15 PM, Volker Simonis wrote:
>>> Hi,
>>>
>>> not sure if this has been discussed before but at least I couldn't
>>> find any references in the previous mail threads on the repo
>>> consolidation.
>>>
>>> I've just realized that the size of the repository history (i.e.
>>> everything under .hg) has doubled in the new consolidated repo (800mb
>>> vs. 1600mb) and I don't exactly understand why:
>>>
>>> $ du -shc jdk10-hs-old/*/.hg jdk10-hs-old/.hg
>>> 16M jdk10-hs-old/corba/.hg
>>> 141M jdk10-hs-old/hotspot/.hg
>>> 49M jdk10-hs-old/jaxp/.hg
>>> 57M jdk10-hs-old/jaxws/.hg
>>> 453M jdk10-hs-old/jdk/.hg
>>> 76M jdk10-hs-old/langtools/.hg
>>> 33M jdk10-hs-old/nashorn/.hg
>>> 8,1M jdk10-hs-old/.hg
>>> 829M total
>>>
>>> $ du -sh jdk10-hs/.hg
>>> 1,6G jdk10-hs/.hg
>>>
>>> I wonder why this is the case?
>>>
>>> Is this because the consolidated repo has more and bigger merge changes?
>>>
>>> The consolidated repo has a total of 47297 changes with about 13878
>>> merge changes:
>>>
>>> $ hg -R jdk10-hs log --template "{rev}\n" -r tip
>>> 47297
>>> $ hg -R jdk10-hs log --template "{rev}\n" -k Merge | wc
>>> 13878 13878 79600
>>>
>>> The old forest had a total of 43102 changes with about 10408 merge
>>> changes:
>>>
>>> $ bash common/bin/hgforest.sh log --template "{rev}\n" | wc
>>> 43102 86285 1295798
>>> $ bash common/bin/hgforest.sh log -k "Merge" --template "{rev}\n" | wc
>>> 10408 20897 312491
>>>
>>> So the new consolidated repo has about 3000-4000 more changes of which
>>> all are merge changesets. Does anybody know a nice command to sum up
>>> the size of all merge changesets?
>>>
>>> Any other insights or comments? It would be especially interesting to
>>> know how this will evolve in the future.
>>>
>>> Regards,
>>> Volker
>>>
>>> PS: this also partially explains why downloading the new repo takes
>>> considerably longer compared to the old forest (the fact that the
>>> get_sources.sh script downloaded the forest in parallel being the
>>> second reason).
>>