[webkit-dev] Growing tired of long build times? Check out this awesome new way to speed up your build... soon (HINT: It's not buying a new computer)

Alicia Boya GarcĂ­a aboya at igalia.com
Tue Aug 29 04:43:40 PDT 2017


On 08/29/2017 06:20 AM, Daniel Bates wrote:

> Do we know what is the cause(es) for the slow clean builds? I am assuming that much of the speed up from the "unified source" comes from clang being able to use an in-memory #include cache to avoid disk I/O and re-parsing of a seen header. Have we exhausted all efforts (or have we given up?) removing extraneous #includes? Do we think this pursuing this effort would be more time consuming or have results that pale in comparison to the "unified source" approach?

Whilst having an in-process-memory #include cache is not a bad thing,
it's not where the greatest gain is, as the operating system should
already cache file reads just fine.

The greatest gain comes from reducing the number of times C++ headers
are parsed. If you are building a certain .cpp file and include a .h
file, the compiler has to parse it, which can take quite a while because
C++ is a really complex monster, especially when templates are used.
Doing this more often than necessary drives build times up really
quickly.

Header files are almost always include-guarded (either with #pragma once
or traditional #ifndef guards), so including the same header twice
within the same .cpp file (or any of its included files) has no cost. On
the other hand, if you then start building a different .cpp file that
also includes the same header, you have to parse it again because, as
far as C++ is concerned, every inclusion could add different symbols to
the AST the compiler is building, so the output can't be reused*. As a
result we end up parsing most headers many more times than actually
needed (i.e. for every .cpp file that includes Node.h, the compiler has
to parse Node.h and all its dependencies from source; that's a lot of
wasted effort!).

*Note that including the same .h twice within the same .cpp file is
fast not because the output is cached in any way, but because the entire
.h file is skipped the second time, adding no additional nodes to the
AST.
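
For reference, this is the usual shape of such a guard (the class body
here is made up for illustration; it is not WebKit's actual Node.h):

    // Node.h (illustrative). The first inclusion defines Node_h, so any
    // later inclusion within the same .cpp skips the whole body and adds
    // nothing new to the AST.
    #ifndef Node_h
    #define Node_h

    class Node {
    public:
        bool isConnected() const { return m_connected; }
    private:
        bool m_connected { false };
    };

    #endif // Node_h

    // The modern spelling is a single line at the top of the header:
    // #pragma once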

The goal of C++ modules is to fix this problem at its root cause:
instead of literally including text header files, .cpp files declare
dependencies on module files that can be compiled, stored, loaded and
referenced from .cpp files in any order, so you would only parse the
Node module source code once for the entire project, whilst every other
time the compiler could load the AST directly from a cached module
object file.
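
To give a rough idea, a module-based world would look something like
this (the syntax loosely follows the Modules TS draft, the names are
made up, and this is not something WebKit builds today):

    // Node.cppm -- hypothetical module interface unit. The compiler
    // compiles this once into a binary module artifact that importers
    // can load instead of re-parsing the header text.
    export module webcore.node;

    export class Node {
    public:
        bool isConnected() const;
    };

    // Document.cpp -- every importer reuses the precompiled module.
    import webcore.node;

    bool isUsable(const Node& node)
    {
        return node.isConnected();
    }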

Note that the great advantage of modules comes from the fact that they
can be imported in different contexts and their content is still
semantically equivalent, whereas with plain C/C++ includes every header
file may act differently depending on the preprocessor variables defined
by the includer and by previous inclusions. In the worst case, when
headers are not include-guarded (luckily this is not too common, but it
still happens), even including the same file twice in the same .cpp
could add different symbols each time!
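
To make that worst case concrete, here is a contrived sketch (the file
name and macro are hypothetical, but the X-macro pattern is the classic
real-world instance of this):

    // EventNames.def -- hypothetical, intentionally not include-guarded.
    // It expands to whatever EVENT_NAME means at the point of inclusion,
    // so each inclusion can add completely different symbols.
    EVENT_NAME(MouseDown)
    EVENT_NAME(MouseUp)
    EVENT_NAME(KeyPress)

    // Events.cpp -- includes the same file twice with different results.
    #define EVENT_NAME(name) void handle##name();
    #include "EventNames.def"      // 1st inclusion: function declarations
    #undef EVENT_NAME

    #define EVENT_NAME(name) #name,
    static const char* const eventNames[] = {
    #include "EventNames.def"      // 2nd inclusion: a string table
    };
    #undef EVENT_NAME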

Unfortunately C++ modules are a work in progress... There are two
different competing proposals with implementations, one from Clang and
another one from Microsoft, and the C++ technical specification is in a
very early stage too:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/n4681.pdf

We know for sure modules are very important for the future of C++, but
maybe it's still a bit too early to bet a big project like WebKit on them.

So how else can we avoid parsing the same header files so many times and
speed up our builds? Enter unified builds.

A requirement for unified builds to work correctly is that header files
are coded in such a way that they work as independent units, much like
C++ modules, i.e. including headers should work no matter in what order
you place them, and in each case they must define the same symbols. On
July 31 I wrote about some issues we currently have because of not doing
exactly this in WebKit (particularly, our #include "config.h" lines are
ambiguous). They can be worked around so they will not become blockers
for unified builds, but I still think we should fix them at some point.

Once you have a set of .cpp files whose includes all (1) are guarded
(e.g. by #pragma once) and (2) are independent units according to the
above rule, you can take advantage of unified builds:

Instead of invoking the compiler once for each .cpp file, you create a
new artificial "unified" or "bundle" .cpp file that concatenates (or
#include's) a number of different .cpp files. This way, headers included
within the bundle are parsed only once, even if they are used by
different individual .cpp files, as long as they are within the same
bundle. This can often result in a massive build speed gain.
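
Conceptually, a generated bundle file is just this (the exact file names
and grouping WebKit ends up using may differ; this is only the general
shape):

    // UnifiedSource1.cpp -- a generated bundle. All four .cpp files
    // become a single translation unit, so headers they share (config.h,
    // Node.h, ...) are parsed only once instead of four times.
    #include "Attr.cpp"
    #include "CharacterData.cpp"
    #include "Comment.cpp"
    #include "Document.cpp"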

Unfortunately, there are some pitfalls, as there is a dissonance between
what the programmer thinks the compilation unit is (the individual .cpp
file) and the actual compilation unit used by the compiler (the bundle
.cpp file).

* `static` variables and functions are intended to be scoped to the
individual .cpp file, but the compiler has no way to know this, so they
end up scoped to the bundle instead. This can lead to non-intuitive
name clashes, which we try to avoid (e.g. with `namespace FILENAME`);
see the sketch after this list.

* Header files that don't work as independent units, as they should, may
still appear to work somehow, or may fail in hard-to-diagnose ways.

* Editing a .cpp file that is part of a bundle will trigger
recompilation of the entire bundle. This makes changes to small,
independent files slower the more files are grouped per bundle.

* Similarly to the issue above, editing a .h file that a .cpp file in a
bundle depends on will trigger recompilation of the entire bundle, not
just the individual file.
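
Here is a sketch of the first pitfall, with hypothetical file names and
variables:

    // HTMLImageElement.cpp -- `static` was meant to keep this name
    // private to the file, and it does, as long as the file is its own
    // translation unit.
    static const int maxRetryCount = 3;

    // HTMLScriptElement.cpp -- once both files end up in the same
    // bundle, the bundle is the translation unit and the names collide:
    // error: redefinition of 'maxRetryCount'
    static const int maxRetryCount = 5;

    // The `namespace FILENAME` trick avoids the clash by qualifying each
    // file's internals, e.g.:
    // namespace HTMLImageElementInternal {
    //     static const int maxRetryCount = 3;
    // }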

It's desirable to bundle .cpp files in a way that minimizes the impact
of the last two issues (e.g. by bundling per feature, so that changing a
header used by all the files implementing that feature triggers the
recompilation of that single feature bundle rather than of many
scattered bundles, each containing a few .cpp files that use it).

Even with these issues, editing files that many others depend on will
usually become much faster than before, because although more individual
.cpp files will be rebuilt, the number of actual compilation units
(bundles) will be much lower, and so will be the number of times header
files are re-parsed.

Compared to modules, unified builds are really a dirty hack. Modules
don't have any of these issues: they are potentially faster and more
reliable. If only they existed now as a standard rather than as
experimental implementations with uncertain tooling, we would
definitely use them.

In the absence of modules, unified builds still give us really good
speedups for our dime.

-- Alicia

