Design Document: Invalidation of remote repositories
Design documents are not descriptions of the current functionality of Bazel. Always go to the documentation for current information.
Status: Implemented
Author: Damien Martin-Guillerez
Design document published: 18 October 2016
State at commit 808a651
Remote repositories are fetched the first time a build that depends on a repository is launched. The next time the same build happens, the already fetched repositories are not refetched, saving on download times or other expensive operations.
This behavior is preserved even when the Bazel server is restarted, by serializing the repository rule in the workspace file. A file named @<repositoryName>.marker is created for each repository with a fingerprint of the serialized rule. On the next fetch, if that fingerprint has not changed, the rule is not refetched. This is not applied if the repository rule is marked as local, because fetching a local repository is assumed to be fast.
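The marker check described above can be sketched as follows. This is only an illustration in Python, not Bazel's Java implementation: the function names, the use of SHA-256, and the marker file layout (a single hex digest) are assumptions made for the example.

```python
import hashlib
import os


def rule_fingerprint(serialized_rule: str) -> str:
    """Return a hex digest standing in for the fingerprint of a serialized rule."""
    return hashlib.sha256(serialized_rule.encode("utf-8")).hexdigest()


def needs_fetch(marker_path: str, serialized_rule: str, is_local: bool) -> bool:
    """Decide whether a repository has to be fetched (again)."""
    if is_local:
        # Local repositories are assumed to be fast to fetch: no marker check.
        return True
    if not os.path.exists(marker_path):
        # Never fetched before.
        return True
    with open(marker_path, encoding="utf-8") as marker:
        return marker.read().strip() != rule_fingerprint(serialized_rule)


def write_marker(marker_path: str, serialized_rule: str) -> None:
    """Record the fingerprint once a fetch has succeeded."""
    with open(marker_path, "w", encoding="utf-8") as marker:
        marker.write(rule_fingerprint(serialized_rule))
```

Because the marker lives on disk, the decision does not depend on any state held by the server process.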
Shortcomings
These considerations were well suited to a time when the implementation of repository rules did not depend on Skylark files. With the introduction of Skylark repositories, several issues appeared:
- A change in the Skylark implementation of the rule does not trigger a refetch of the repository, nor does a change in one of the template files that the rule relies on: the rule marker does not contain this information.
- There is no way to re-configure a repository used for auto-configuration, leading to excessive uses of bazel clean --expunge.
- The invalidation behavior of repository rules is unclear and difficult to explain.
Proposed solution
Invalidation on the environment
Right now rules are not invalidated on the environment:
- Invalidation on accessing repository_ctx.os.environ would generate invalidation on environment variables that might be volatile (e.g. CC when you want to use one C++ compiler and you reset your environment) and might miss other environment variables due to computed variable names.
- There is no way to represent environment variables that influence repository_ctx.execute.
This document proposes to add a way to declare a dependency on an environment variable value that would trigger a refetch of a repository. An optional attribute environ would be added to the repository_rule method. It takes a list of strings and triggers invalidation of the repository on any change to those environment variables. E.g.:
my_repo = repository_rule(impl = _impl, environ = ["FOO", "BAR"])
my_repo would be refetched on any change to the environment variables FOO or BAR, but not if the environment variable BAZ changes.
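The intended semantics can be illustrated with a small Python sketch. The helper names are hypothetical, and recording an unset variable as None is an assumption of this example rather than something the proposal specifies.

```python
def declared_env_snapshot(environ_names, env):
    """Capture only the declared variables; unset ones are recorded as None."""
    return {name: env.get(name) for name in environ_names}


def env_requires_refetch(environ_names, recorded, env):
    """Return True if any declared variable differs from the recorded snapshot."""
    return declared_env_snapshot(environ_names, env) != recorded
```

With environ = ["FOO", "BAR"], a snapshot taken at fetch time ignores BAZ entirely, so later changes to BAZ never show up in the comparison.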
To be consistent with the new environment specification mechanism, the environment available through repository_ctx.os.environ or transmitted to repository_ctx.execute will take values from the --action_env flag, when specified. I.e. if --action_env FOO=BAR --action_env BAR are specified, and the environment sets FOO=BAZ, BAR=FOO, BAZ=BAR, then the actual repository_ctx.os.environ map would contain {"FOO": "BAR", "BAR": "FOO", "BAZ": "BAR"}. This would ensure that the environment seen by repository rules is consistent with the one seen by actions (a repository rule sees more than an action, leaving the rule writer the ability to filter the environment more finely).
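The example in the preceding paragraph can be reproduced with a short Python sketch. The function name and the representation of flag values as a list of strings are assumptions. The merge rule is inferred from the example: NAME=VALUE overrides the client environment, while a bare NAME keeps the client value.

```python
def resolve_repository_env(client_env, action_env_flags):
    """Overlay --action_env values on top of the client environment."""
    resolved = dict(client_env)
    for flag in action_env_flags:
        name, has_value, value = flag.partition("=")
        if has_value:
            # --action_env NAME=VALUE: the flag value wins.
            resolved[name] = value
        # --action_env NAME: keep whatever the client environment provides.
    return resolved


client_env = {"FOO": "BAZ", "BAR": "FOO", "BAZ": "BAR"}
print(resolve_repository_env(client_env, ["FOO=BAR", "BAR"]))
```

Variables that are not named by any flag (BAZ here) remain visible to the repository rule, which is what lets a rule see more of the environment than an action does.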
Both these changes should allow Bazel to do auto-configuration based on environment variables:
- Setting some environment variables would actually retrigger auto-configuration, corresponding to how the rule writer designed it (and not based on some assumption from Bazel).
- The user can set specific environment variables through the --action_env flag, and fix this environment using bazel info client-env.
Serialization of Skyframe dependencies
A local rule will be invalidated when any of its Skyframe dependencies change. For a non-local rule, a marker file will be stored in the external directory with a summary of the dependencies of the rule. At each fetch operation, we check the existence of the marker file and verify each dependency. If one of them has changed, we refetch that repository.
To avoid unnecessary re-download of artifacts, a content-addressable cache has been developed for downloads (and is thus not discussed here).
The marker file will be a manifest containing the following items:
- A fingerprint of the serialized rule and the rule-specific data (e.g., maven server information for maven_jar).
- The declared environment (list of name, value pairs) through the environ attribute of the repository rule.
- The list of FileValue-s requested by getPathFromLabel and the corresponding file content digest.
- The transitive hash of the Extension defining the repository rule. This transitive hash is computed from the hash of the current extension and the extensions loaded from it. This means that a repository function will get invalidated as soon as the extension file content changes, which is an over-invalidation. However, getting an optimal result would require correct serialization of Skylark extensions.
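As a rough illustration of the last item and of how the four items could sit together in one file, here is a Python sketch. The labels, the SHA-256 digest, and the line-oriented manifest format are all assumptions. The document does not specify the on-disk format, and the real computation happens in Bazel's Java code.

```python
import hashlib


def transitive_hash(label, contents, loads):
    """Hash an extension's content together with every extension it loads.

    `contents` maps an extension label to its file content and `loads` maps a
    label to the labels it loads directly (the load graph is acyclic).
    """
    digest = hashlib.sha256(contents[label].encode("utf-8"))
    for dep in sorted(loads.get(label, ())):
        digest.update(transitive_hash(dep, contents, loads).encode("ascii"))
    return digest.hexdigest()


def build_marker_manifest(rule_fingerprint, environ, file_digests, extension_hash):
    """Assemble the four manifest items as text lines (hypothetical format)."""
    lines = ["RULE " + rule_fingerprint, "EXTENSION " + extension_hash]
    lines += ["ENV:%s %s" % (name, value) for name, value in sorted(environ.items())]
    lines += ["FILE:%s %s" % (path, digest) for path, digest in sorted(file_digests.items())]
    return "\n".join(lines)
```

Any edit to the extension file or to a file it loads changes the hash, even if the repository rule itself is untouched. That is the over-invalidation mentioned above.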
Implementation plan
- Modify the SkylarkRepositoryFunction#getClientEnvironment method to get the values from the --action_env flag.
- Add a markerData map argument to RepositoryFunction#fetch so SkylarkRepositoryFunction can include those changes. This argument should be mutable so a repository can add more data to be stored in the marker file. Add a corresponding function for verification, verifyMarkerManifest, that would take a marker data map and return a tri-state: true if the repository is up to date, false if it needs a refetch, and null if additional Skyframe dependencies need to be resolved before answering.
- Add the environ attribute to the repository_rule function and the dependency on the Skyframe values for the environment. Also create a SkyFunction for the processed environment after the --action_env flag.
- Add the environ values to the marker file through the getMarkerManifest function.
- Add the FileValue-s to the marker file, adding all the files requested through the getPath method to a specific builder that will be passed to the SkylarkRepositoryContext.
- Add the extension to the marker file by passing the transitiveHashCode of the SkylarkEnvironment to the marker manifest.
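The tri-state contract of the verification function from the second step could look roughly like this. The sketch is in Python rather than Java, and the NOT_READY sentinel and the current_value callback are hypothetical stand-ins for a Skyframe lookup whose value has not been computed yet.

```python
# Sentinel returned by the lookup when a dependency is not yet available.
NOT_READY = object()


def verify_marker_manifest(marker_data, current_value):
    """Return True (up to date), False (refetch), or None (dependencies missing).

    `marker_data` maps a dependency key to the value recorded at fetch time;
    `current_value(key)` returns the present value or NOT_READY.
    """
    missing = False
    for key, recorded in marker_data.items():
        value = current_value(key)
        if value is NOT_READY:
            missing = True
        elif value != recorded:
            # A known mismatch is decisive, whatever the other lookups return.
            return False
    return None if missing else True
```

Returning None lets the caller ask Skyframe to resolve the missing dependencies and call the function again later, instead of forcing a refetch on incomplete information.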