Stop using precompiled headers for Linux?

Thomas Stüfe thomas.stuefe at gmail.com
Fri Nov 2 15:34:07 UTC 2018


Hi Magnus,

your winning variant gives me a nice boost on my thinkpad:

pch, standard:
real    17m52.367s
user    52m20.730s
sys     4m53.711s

pch, your variant:
real    15m0.514s
user    46m6.466s
sys     2m38.371s

(non-pch is ~19-20 minutes wall-clock time)

With those numbers, I might start using pch again on low-powered machines.

.. Thomas



On Fri, Nov 2, 2018 at 12:14 PM Magnus Ihse Bursie
<magnus.ihse.bursie at oracle.com> wrote:
>
>
> On 2018-11-02 11:39, Magnus Ihse Bursie wrote:
> > On 2018-11-02 00:53, Ioi Lam wrote:
> >> Maybe precompiled.hpp can be periodically (weekly?) updated by a
> >> robot, which parses the dependency files generated by gcc and picks
> >> the most popular N files?
> > I think that's tricky to implement automatically. However, I've done
> > more or less that, and I've got some wonderful results! :-)
>
> Ok, I'm done running my tests.
>
> TL;DR: I've managed to reduce wall-clock time from 2m 45s (with pch) or
> 2m 23s (without pch), to 1m 55s. The cpu time spent went from 52m 27s
> (with pch) or 55m 30s (without pch) to 41m 10s. This is a huge gain for
> our automated builds! And a clear improvement even for the ordinary
> developer.
>
> The list of included header files is reduced to just 37. The winning
> combination was to include all header files that were included by more
> than 130 different files, but to exclude all files named
> "*.inline.hpp". Hopefully, a further gain of not pulling in the
> *.inline.hpp files is that the risk of pch/non-pch failures diminishes.
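>
> If you want to reproduce the counting on your own tree, something along
> these lines should get you close. This is just a sketch of one way to do
> it, not necessarily exactly what I did, and the objs path depends on your
> configuration name:
>
>   # Each .d file lists the headers one object file depends on; count how
>   # many object files pull in each .hpp, drop the *.inline.hpp ones, and
>   # keep everything above the cutoff.
>   find build/*/hotspot/variant-server/libjvm/objs -name '*.d' -exec cat {} + \
>     | grep -o '[^ ]*\.hpp' | grep -v '\.inline\.hpp$' \
>     | sort | uniq -c | sort -rn | awk '$1 > 130 { print $2 }'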
>
> However, these 37 files in turn pull in an additional 201 header files.
> Of these, three are *.inline.hpp:
> share/jfr/recorder/checkpoint/types/traceid/jfrTraceIdBits.inline.hpp,
> os_cpu/linux_x86/bytes_linux_x86.inline.hpp and
> os_cpu/linux_x86/copy_linux_x86.inline.hpp. This looks like a problem
> with the header files to me.
>
> With some exceptions (mostly related to JFR), these additional 200 files
> have "generic"-looking names (like share/gc/g1/g1_globals.hpp), which
> indicates to me that it is reasonable to have them in this list, just as
> the original 37 tended to be quite general and high-level
> includes. However, some files (like
> share/jfr/instrumentation/jfrEventClassTransformer.hpp) may have leaked
> in where they should not really be. It might be worth letting a hotspot
> engineer spend some cycles looking over these files to see if anything
> can be improved.
>
> Caveats: I have only run this on my local Linux build with the default
> server JVM configuration. Other machines, and other JVM variant/feature
> combinations, will have different sweet spots. And, most importantly, I
> have not tested this at all on Windows. Nevertheless, I'm almost
> prepared to suggest a patch that uses this selection of files, just as
> is, when compiling with gcc, because of the speed improvements I
> measured.
>
> And some data:
>
> Here is the log from my runs. The "on or above" number is the cutoff I
> used: a header was selected only if at least that many files include it.
> As you can see, there is not much difference between cutoffs from 130 to
> 150, or (without the inline files) between 110 and 150. (There were a
> lot of additional inline files in the positions below 130.) All else
> being equal, I'd prefer a solution with fewer files, since that is less
> likely to go bad.
>
> real    2m45.623s
> user    52m27.813s
> sys    5m27.176s
> hotspot with original pch
>
> real    2m23.837s
> user    55m30.448s
> sys    3m39.739s
> hotspot without pch
>
> real    1m59.533s
> user    42m50.019s
> sys    3m0.893s
> hotspot new pch on or above 250
>
> real    1m58.937s
> user    42m18.994s
> sys    3m0.245s
> hotspot new pch on or above 200
>
> real    2m0.729s
> user    42m16.636s
> sys    2m57.125s
> hotspot new pch on or above 170
>
> real    1m58.064s
> user    42m9.618s
> sys    2m57.635s
> hotspot new pch on or above 150
>
> real    1m58.053s
> user    42m9.796s
> sys    2m58.732s
> hotspot new pch on or above 130
>
> real    2m3.364s
> user    42m54.818s
> sys    3m2.737s
> hotspot new pch on or above 100
>
> real    2m6.698s
> user    44m30.434s
> sys    3m12.015s
> hotspot new pch on or above 70
>
> real    2m0.598s
> user    41m17.810s
> sys    2m56.258s
> hotspot new pch on or above 150 without inline
>
> real    1m55.981s
> user    41m10.076s
> sys    2m51.983s
> hotspot new pch on or above 130 without inline
>
> real    1m56.449s
> user    41m10.667s
> sys    2m53.808s
> hotspot new pch on or above 110 without inline
>
> And here is the "winning" list (which I declared as "on or above 130,
> without inline"). I encourage everyone to try this on their own system,
> and report back the results!
>
> #ifndef DONT_USE_PRECOMPILED_HEADER
> # include "classfile/classLoaderData.hpp"
> # include "classfile/javaClasses.hpp"
> # include "classfile/systemDictionary.hpp"
> # include "gc/shared/collectedHeap.hpp"
> # include "gc/shared/gcCause.hpp"
> # include "logging/log.hpp"
> # include "memory/allocation.hpp"
> # include "memory/iterator.hpp"
> # include "memory/memRegion.hpp"
> # include "memory/resourceArea.hpp"
> # include "memory/universe.hpp"
> # include "oops/instanceKlass.hpp"
> # include "oops/klass.hpp"
> # include "oops/method.hpp"
> # include "oops/objArrayKlass.hpp"
> # include "oops/objArrayOop.hpp"
> # include "oops/oop.hpp"
> # include "oops/oopsHierarchy.hpp"
> # include "runtime/atomic.hpp"
> # include "runtime/globals.hpp"
> # include "runtime/handles.hpp"
> # include "runtime/mutex.hpp"
> # include "runtime/orderAccess.hpp"
> # include "runtime/os.hpp"
> # include "runtime/thread.hpp"
> # include "runtime/timer.hpp"
> # include "services/memTracker.hpp"
> # include "utilities/align.hpp"
> # include "utilities/bitMap.hpp"
> # include "utilities/copy.hpp"
> # include "utilities/debug.hpp"
> # include "utilities/exceptions.hpp"
> # include "utilities/globalDefinitions.hpp"
> # include "utilities/growableArray.hpp"
> # include "utilities/macros.hpp"
> # include "utilities/ostream.hpp"
> # include "utilities/ticks.hpp"
> #endif // !DONT_USE_PRECOMPILED_HEADER
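>
> To try it: the list above goes into
> src/hotspot/share/precompiled/precompiled.hpp, replacing the current
> include block there, and then you time clean hotspot builds with and
> without pch. Roughly, assuming otherwise default configure options (the
> configuration names here are just examples):
>
>   bash configure --with-conf-name=pch-test
>   time make CONF=pch-test hotspot
>   bash configure --with-conf-name=no-pch --disable-precompiled-headers
>   time make CONF=no-pch hotspot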
>
> /Magnus
>
> >
> > I'd still like to run some more tests, but preliminary data indicates
> > that there is much to be gained by having a more sensible list of
> > files in the precompiled header.
> >
> > The fewer files we have on this list, the less likely it is to become
> > (drastically) outdated. So I don't think we need to do this
> > automatically, but perhaps manually every now and then when we feel
> > build times are increasing.
> >
> > /Magnus
> >
> >>
> >> - Ioi
> >>
> >>
> >> On 11/1/18 4:38 PM, David Holmes wrote:
> >>> It's not at all obvious to me that the way we use PCH is the
> >>> right/best way to use it. We dump every header we think it would be
> >>> good to precompile into precompiled.hpp and then just ask gcc to
> >>> precompile that one file. That results in a ~250MB file that has to
> >>> be read in and processed for every source file! That doesn't seem
> >>> very efficient to me.
> >>>
> >>> Cheers,
> >>> David
> >>>
> >>> On 2/11/2018 3:18 AM, Erik Joelsson wrote:
> >>>> Hello,
> >>>>
> >>>> My point here, which wasn't very clear, is that Mac and Linux seem
> >>>> to lose just as much real compile time. The big difference in these
> >>>> tests was rather the number of cpus in the machine (32 threads in
> >>>> the Linux box vs 8 on the Mac). The total amount of work done
> >>>> increased when PCH was disabled; that's the user time. Here is my
> >>>> theory on why the real (wall clock) time was not consistent with
> >>>> the user time between these experiments:
> >>>>
> >>>> With pch the timeline (simplified) looks like this:
> >>>>
> >>>> 1. Single thread creating PCH
> >>>> 2. All cores compiling C++ files
> >>>>
> >>>> When disabling pch it's just:
> >>>>
> >>>> 1. All cores compiling C++ files
> >>>>
> >>>> To gain speed with PCH, the time spent in 1 must be less than the
> >>>> time saved in 2. The potential time saved in 2 goes down as the
> >>>> number of cpus goes up. I'm pretty sure that if I repeated the
> >>>> experiment on Linux on a smaller box (typically one we use in CI),
> >>>> the results would look similar to the macOS ones, and similarly, if
> >>>> I had access to a much bigger Mac, it would behave like the big
> >>>> Linux box. This is why I'm saying this should be done for both of
> >>>> these platforms or neither.
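> >>>>
> >>>> To put some made-up but plausible numbers on it: say building the
> >>>> PCH takes 40s on one core, there are ~1000 C++ files, and the PCH
> >>>> saves ~2s of compile time per file. On an 8-core box the saving is
> >>>> roughly 1000*2/8 = 250s of wall clock time, a clear win over the
> >>>> 40s spent up front; on a 32-core box it is only 1000*2/32 = ~62s,
> >>>> so the win mostly evaporates.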
> >>>>
> >>>> In addition to this, the experiment only built hotspot. If we were
> >>>> instead to build the whole JDK, then the time wasted in 1 in the
> >>>> PCH case would be negated to a large extent by other build targets
> >>>> running concurrently, so for a full build, PCH is still providing
> >>>> value.
> >>>>
> >>>> The question here is: if the value of PCH isn't very big, perhaps
> >>>> it's not worth it if it's also creating as much grief as described
> >>>> here. There is no doubt that there is value, however. And given the
> >>>> examination done by Magnus, it seems this value could be increased.
> >>>>
> >>>> The main reason we haven't disabled PCH in CI before is this: we
> >>>> really, really want CI builds to be fast, and we don't have a ton
> >>>> of spare capacity to just throw at it. PCH made builds faster, so
> >>>> we used it. My other reason is consistency between builds.
> >>>> Supporting multiple different modes of building creates the
> >>>> potential for inconsistencies. For that reason I would definitely
> >>>> not support having PCH on by default but turned off in our
> >>>> CI/dev-submit. We pick one or the other as the official build
> >>>> configuration, and we stick with that configuration for all builds
> >>>> of any official capacity (which includes CI).
> >>>>
> >>>> In the current CI setup, we have a bunch of tiers that execute one
> >>>> after the other. The jdk-submit currently only runs tier1. In tier2
> >>>> I've put slowdebug builds with PCH disabled, just to help verify a
> >>>> common developer configuration. These builds are not meant to be
> >>>> used for testing or anything like that; they are just run for
> >>>> verification, which is why this is ok. We could argue that it would
> >>>> make sense to move the linux-x64-slowdebug build without PCH to
> >>>> tier1 so that it's included in dev-submit.
> >>>>
> >>>> /Erik
> >>>>
> >>>> On 2018-11-01 03:38, Magnus Ihse Bursie wrote:
> >>>>>
> >>>>>
> >>>>> On 2018-10-31 00:54, Erik Joelsson wrote:
> >>>>>> Below are the corresponding numbers from a Mac (Mac Pro (Late
> >>>>>> 2013), 3.7 GHz, Quad-Core Intel Xeon E5, 16 GB). To be clear,
> >>>>>> -npch means without precompiled headers. Here we see a slight
> >>>>>> degradation in both user time and wall clock time when disabling
> >>>>>> them. My guess is that the user time increase is about the same,
> >>>>>> but because of a lower cpu count, the extra load is not as easily
> >>>>>> covered.
> >>>>>>
> >>>>>> These tests were run building just hotspot. This means that
> >>>>>> the precompiled header is generated alone on one core while
> >>>>>> nothing else is happening, which would explain this degradation
> >>>>>> in build speed. If we were instead building the whole product, we
> >>>>>> would see a better correlation between user and real time.
> >>>>>>
> >>>>>> Given the very small benefit here, it could make sense to disable
> >>>>>> precompiled headers by default for Linux and Mac, just as we did
> >>>>>> with ccache.
> >>>>>>
> >>>>>> I do know that the benefit is huge on Windows though, so we
> >>>>>> cannot remove the feature completely. Any other comments?
> >>>>>
> >>>>> Well, if you show that disabling precompiled headers loses time
> >>>>> on macOS, and no-one (as far as I've seen) has complained about
> >>>>> PCH on Mac, then why not keep them on by default there? That the
> >>>>> gain is small is no argument for losing it. (I remember a time
> >>>>> when you were hunting seconds in the build time ;-))
> >>>>>
> >>>>> On Linux, the story seems different, though. People experience
> >>>>> PCH as a problem, and there is a net loss of time, at least on
> >>>>> selected testing machines. It makes sense to turn it off by
> >>>>> default there, then.
> >>>>>
> >>>>> /Magnus
> >>>>>
> >>>>>>
> >>>>>> /Erik
> >>>>>>
> >>>>>> macosx-x64
> >>>>>> real     4m13.658s
> >>>>>> user     27m17.595s
> >>>>>> sys     2m11.306s
> >>>>>>
> >>>>>> macosx-x64-npch
> >>>>>> real     4m27.823s
> >>>>>> user     30m0.434s
> >>>>>> sys     2m18.669s
> >>>>>>
> >>>>>> macosx-x64-debug
> >>>>>> real     5m21.032s
> >>>>>> user     35m57.347s
> >>>>>> sys     2m20.588s
> >>>>>>
> >>>>>> macosx-x64-debug-npch
> >>>>>> real     5m33.728s
> >>>>>> user     38m10.311s
> >>>>>> sys     2m27.587s
> >>>>>>
> >>>>>> macosx-x64-slowdebug
> >>>>>> real     3m54.439s
> >>>>>> user     25m32.197s
> >>>>>> sys     2m8.750s
> >>>>>>
> >>>>>> macosx-x64-slowdebug-npch
> >>>>>> real     4m11.987s
> >>>>>> user     27m59.857s
> >>>>>> sys     2m18.093s
> >>>>>>
> >>>>>>
> >>>>>> On 2018-10-30 14:00, Erik Joelsson wrote:
> >>>>>>> Hello,
> >>>>>>>
> >>>>>>> On 2018-10-30 13:17, Aleksey Shipilev wrote:
> >>>>>>>> On 10/30/2018 06:26 PM, Ioi Lam wrote:
> >>>>>>>>> Is there any advantage of using precompiled headers on Linux?
> >>>>>>>> I have measured it recently on the Shenandoah repositories, and
> >>>>>>>> fastdebug/release build times are no better with PCH than
> >>>>>>>> without. Actually, it gets worse when you touch a single header
> >>>>>>>> that is in the PCH list, and you end up recompiling all of
> >>>>>>>> HotSpot. I would be in favor of disabling it by default.
> >>>>>>> I just did a measurement on my local workstation (2x8 cores, x2
> >>>>>>> HT, Ubuntu 18.04, using the Oracle devkit GCC 7.3.0). I ran "time
> >>>>>>> make hotspot" with clean build directories.
> >>>>>>>
> >>>>>>> linux-x64:
> >>>>>>> real    4m6.657s
> >>>>>>> user    61m23.090s
> >>>>>>> sys    6m24.477s
> >>>>>>>
> >>>>>>> linux-x64-npch
> >>>>>>> real    3m41.130s
> >>>>>>> user    66m11.824s
> >>>>>>> sys    4m19.224s
> >>>>>>>
> >>>>>>> linux-x64-debug
> >>>>>>> real    4m47.117s
> >>>>>>> user    75m53.740s
> >>>>>>> sys    8m21.408s
> >>>>>>>
> >>>>>>> linux-x64-debug-npch
> >>>>>>> real    4m42.877s
> >>>>>>> user    84m30.764s
> >>>>>>> sys    4m54.666s
> >>>>>>>
> >>>>>>> linux-x64-slowdebug
> >>>>>>> real    3m54.564s
> >>>>>>> user    44m2.828s
> >>>>>>> sys    6m22.785s
> >>>>>>>
> >>>>>>> linux-x64-slowdebug-npch
> >>>>>>> real    3m23.092s
> >>>>>>> user    55m3.142s
> >>>>>>> sys    4m10.172s
> >>>>>>>
> >>>>>>> These numbers support your claim. Wall clock time is actually
> >>>>>>> increased with PCH enabled, but total user time is decreased.
> >>>>>>> Does not seem worth it to me.
> >>>>>>>>> It's on by default and we keep having
> >>>>>>>>> breakage where someone forgets to add an #include. The
> >>>>>>>>> latest instance is JDK-8213148.
> >>>>>>>> Yes, we catch most of these breakages in CIs, which tells me
> >>>>>>>> that adding it to jdk-submit would cover most of the breakage
> >>>>>>>> during pre-integration testing.
> >>>>>>> jdk-submit currently runs what we call "tier1". We do have
> >>>>>>> builds of Linux slowdebug with precompiled headers disabled in
> >>>>>>> tier2. We also build solaris-sparcv9 in tier1, which does not
> >>>>>>> support precompiled headers at all, so for a breakage to slip
> >>>>>>> past jdk-submit it would have to be in Linux-specific code. The
> >>>>>>> example bug does not seem to be that case. Mach5/jdk-submit was
> >>>>>>> down over the weekend and yesterday, so my suspicion is that the
> >>>>>>> offending code in this case was never tested.
> >>>>>>>
> >>>>>>> That said, given that we get practically no benefit from PCH on
> >>>>>>> Linux/GCC, we should probably just turn it off by default for
> >>>>>>> Linux and/or GCC. I think we need to investigate macOS as well
> >>>>>>> here.
> >>>>>>>
> >>>>>>> /Erik
> >>>>>>>> -Aleksey
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>
> >
>


