Home
mypy is a tool that performs static type checking of python code, based on the type hints in the code (see PEP 484).
bazel-mypy-integration is a project that enables mypy type checking for python targets in bazel.
bazel-mypy-integration understands the python dependency graph represented in bazel. Essentially,
it constructs a `mypy ...` invocation for a given set of `.py` files, as well as the dependencies of
those files.
However, the mypy invocation does not handle caching of dependencies at all. For example, given:
```
py_library(
    name = "lib1",
    srcs = [
        "lib1_a.py",
        "lib1_b.py",
        "lib1_c.py",
    ],
)

py_library(
    name = "lib2",
    srcs = [
        "lib2.py",
    ],
    deps = [
        ":lib1",
    ],
)

mypy_test(  # This is a concrete mypy test for lib1
    name = "lib1_mypy_test",
    deps = [":lib1"],
)

mypy_test(
    name = "lib2_mypy_test",
    deps = [":lib2"],
)
```

Running e.g. `bazel test :lib1_mypy_test :lib2_mypy_test` will result in two mypy invocations,
both of which will parse (and potentially check, depending on your mypy.ini) all the source files
involved, transitively: `mypy ... -- lib1_a.py lib1_b.py lib1_c.py` and `mypy ... -- lib1_a.py lib1_b.py lib1_c.py lib2.py`. No state is shared between these invocations, so the effort of
parsing the files they have in common is duplicated.
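To make the duplication concrete, here is a minimal sketch (plain python, with the dependency graph from the example above hard-coded; names are illustrative, not the rule's actual implementation) of how each mypy_test ends up parsing its target's full transitive source list:

```python
# Dependency graph and sources from the example BUILD file above.
deps = {"lib1": [], "lib2": ["lib1"]}
srcs = {"lib1": ["lib1_a.py", "lib1_b.py", "lib1_c.py"], "lib2": ["lib2.py"]}

def transitive_srcs(target):
    """All sources a mypy invocation for `target` must parse, deps first."""
    out = []
    for dep in deps[target]:
        out += transitive_srcs(dep)
    return out + srcs[target]

# lib1's sources are parsed once per test that (transitively) depends on them.
print(transitive_srcs("lib1"))  # parsed by lib1_mypy_test
print(transitive_srcs("lib2"))  # parsed again, in full, by lib2_mypy_test
```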
Worse still, mypy ships with type stubs for the entire python standard library (via typeshed).
If `lib*.py` imports anything from the python standard library, mypy also parses those stub files,
again without caching anything.
Currently, bazel-mypy-integration passes complete paths to source files, relative to the workspace
root. This introduces a problem with how bazel py_library targets can operate. Here is an example
that illustrates the problem:
```
py_library(
    name = "lib1",
    srcs = [
        "srcs/lib1.py",
        "srcs/internal/lib1_internal.py",
    ],
    imports = [
        "srcs",
    ],
)
```

The above example adds the workspace-relative path `srcs` to the PYTHONPATH of all dependents of
lib1. Dependents can depend on this target and do `import lib1`, and everything works. Furthermore,
lib1.py can do `from internal import lib1_internal`, and that works fine as well. Now let's
examine the mypy command line produced by bazel-mypy-integration for this library:
```
MYPYPATH=$PWD:srcs/
mypy ...args... -- srcs/lib1.py srcs/internal/lib1_internal.py
```
This leads to an exception in mypy (versions >= 0.780) which mentions "Source file found twice"
(see issue #20). The issue stems from how mypy treats the source file arguments: it expects that a
source file listed as srcs/internal/lib1_internal.py is a module whose name matches that path,
srcs.internal.lib1_internal. Since in the above example srcs/lib1.py does `from internal import lib1_internal`, and since mypy checks the uniqueness of the modules it imports, it fails when it
encounters the same module under two names: srcs.internal.lib1_internal (from the command line) and
internal.lib1_internal (from the import statement).
This means that bazel-mypy-integration's mypy version is pinned at 0.750 to avoid this problem.
This is not a bug in mypy (see discussion in
https://github.com/python/mypy/issues/8944).
mypy natively supports caching of type data, through the --cache-dir option and a hidden option
called --cache-map. bazel-mypy-integration takes advantage of fixed cache locations for each
module, using argument triples of the form `--cache-map path/to/lib1_a.py <path to mypy's lib1_a.meta.json> <path to mypy's lib1_a.data.json>`. These paths represent both where mypy should
generate the metadata for a parsed source file and where to find existing generated metadata (for
dependencies, for example). In the examples above, this translates into --cache-map triples for
each of the 3 source files in lib1 and 4 source files for lib2 (its own, and its dependencies'
source files).
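As a rough sketch of how such a triple list could be assembled (the cache directory layout and the .meta.json/.data.json naming here are assumptions for illustration, not the rule's actual output paths):

```python
import os

def cache_map_args(py_srcs, cache_dir):
    """Return a flat --cache-map argument list: one (src, meta, data)
    triple per source file, all sharing a fixed cache directory."""
    args = ["--cache-map"]
    for src in py_srcs:
        stem = os.path.splitext(src)[0]
        args += [
            src,
            os.path.join(cache_dir, stem + ".meta.json"),
            os.path.join(cache_dir, stem + ".data.json"),
        ]
    return args

# Three triples for lib1's sources, as described above.
print(cache_map_args(["lib1_a.py", "lib1_b.py", "lib1_c.py"], "cache"))
```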
By capturing the generated .meta.json/.data.json pairs as part of the rule invocation, we can
propagate mypy's generated metadata to dependents. This means mypy does not have to regenerate the
metadata by re-parsing the dependencies' source files, which results in markedly faster
performance, even in shallow dependency trees. The mypy docs likewise note the improved
performance when caching is enabled.
Similar to the above, typeshed stub parsing can be sped up by propagating the same cache triples,
i.e. mypy .meta.json/.data.json pairs for all of the stdlib stubs (which can be a large number of
files). To accomplish this, we need to treat the mypy stubs a little differently: typeshed is an
implicit, internal dependency of mypy, meaning that the typeshed package is not represented in
bazel at all; it is only available by way of the mypy pip package. To tease out the typeshed stubs
from requirement("mypy"), we need to trawl through the files in that package and filter out the
typeshed ones based on their paths.
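A minimal sketch of this path-based filtering, assuming a `typeshed/` directory inside the mypy package and `.pyi` stub files (the actual package layout may differ between mypy versions):

```python
def typeshed_stubs(package_files):
    """Keep only typeshed stub files from the mypy pip package's file list."""
    return [
        f for f in package_files
        if "/typeshed/" in f and f.endswith(".pyi")
    ]

# Hypothetical file list from requirement("mypy").
files = [
    "mypy/main.py",
    "mypy/typeshed/stdlib/os/__init__.pyi",
    "mypy/typeshed/stdlib/sys.pyi",
]
print(typeshed_stubs(files))
```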
For this purpose, a new rule is introduced called mypy_stdlib_cache_library, which is similar to
mypy_aspect and mypy_test, but whose implementation deals with typeshed stubs only. There needs to
be a singleton instance of this target, defined as part of bazel-mypy-integration. This singleton
is a dependency of all mypy targets.
To solve the problem described in 2., and to make it possible in general for
bazel-mypy-integration to work with py_* targets that specify an imports attribute, we can
change the above invocation to operate on modules rather than directly on source files:
```
MYPYPATH=$PWD:srcs/
mypy ...args... -- -m lib1 -m internal.lib1_internal
```
Now the module names mypy encounters in import statements exactly match the -m arguments it
received on the command line, which means there are no duplicate source file errors. In addition,
the metadata generated by mypy will refer to the correct module paths (i.e., the generated
metadata will know that the files refer to internal.lib1_internal and not
srcs.internal.lib1_internal). This is correct for dependents as well.
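A sketch of the path-to-module translation this requires, given a target's `imports` roots (the function name and first-root-wins behavior are illustrative assumptions, not the rule's actual code):

```python
def module_name(src, import_roots):
    """Convert a workspace-relative .py path into the module name mypy
    should be given via -m, stripping the first matching imports root."""
    for root in import_roots:
        prefix = root.rstrip("/") + "/"
        if src.startswith(prefix):
            src = src[len(prefix):]
            break
    return src[:-len(".py")].replace("/", ".")

print(module_name("srcs/lib1.py", ["srcs"]))                    # lib1
print(module_name("srcs/internal/lib1_internal.py", ["srcs"]))  # internal.lib1_internal
```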
In bazel, there is nothing stopping the same source file(s) being a part of multiple py_library
targets. Consider for example:
```
py_library(
    name = "a",
    srcs = ["a.py"],
)

py_library(
    name = "a_prime",
    srcs = ["a.py"],
)

py_binary(
    name = "bin",
    srcs = ["bin.py"],
    deps = [
        ":a",
        ":a_prime",
    ],
)
```

Nothing prevents this situation, and it is perfectly valid in terms of bazel dependencies and
python rules. The runfiles will contain a.py, and all is fine.
However, mypy expects a --cache-map argument for each python source file, and it rejects
duplicate --cache-map arguments pointing at the same a.py (it exits with an error in this case;
all --cache-map source files must be unique). Since each python source file must have a unique
--cache-map argument, we have a dilemma: both a and a_prime specify the same source. If we
proceeded as usual, both sets of --cache-map arguments would end up in the transitive set of
--cache-map triples, which fails.
The solution is to just pick one cache map argument (the first one encountered). This is not a
perfect solution, since it is possible for the same python source file, at the exact same
location, to produce a different set of mypy metadata (for example, if the python target's
imports path was different). But this seems pathological enough not to worry about. Just picking
one cache map argument is fine because:
- if `bin.py` imports `a`, and there is no difference in cached metadata, then the cached metadata will be used correctly.
- if `bin.py` imports `a`, and there is a difference in cached metadata (for example, a difference in module name), then mypy will just regenerate the metadata for `a.py`.
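The first-encountered-wins selection can be sketched as follows (hypothetical names; the real rule would operate on bazel's transitive depsets rather than a plain list):

```python
def dedupe_cache_maps(triples):
    """Keep only the first (src, meta, data) triple seen for each source
    file, so mypy never receives duplicate --cache-map entries."""
    seen = {}
    for src, meta, data in triples:
        seen.setdefault(src, (src, meta, data))  # first occurrence wins
    return list(seen.values())

# a.py appears twice: once via target :a, once via :a_prime.
triples = [
    ("a.py", "a_from_a.meta.json", "a_from_a.data.json"),
    ("a.py", "a_from_a_prime.meta.json", "a_from_a_prime.data.json"),
]
print(dedupe_cache_maps(triples))
```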
The above scenario breaks down when there are multiple imports specified in a single py_library:
```
py_library(
    name = "lib1",
    srcs = [
        "srcs/lib1.py",
        "srcs/internal/lib1_internal.py",
    ],
    imports = [
        "srcs",
        "srcs/internal",
    ],
)
```

If the above looks odd, it kind of is: it specifies that all modules under srcs and under
srcs/internal can be imported directly. This is not a convention in python, and it seems like
kind of an edge case.
In any case, it is not clear how bazel-mypy-integration could handle this, for the following
reason: recall that the --cache-map arguments to mypy take a source file path, a .meta.json path,
and a .data.json path. The module path is encoded in the generated metadata. If the same source
file can lead to multiple module paths (internal.lib1_internal and lib1_internal), that is more
information than can be represented in the --cache-map argument, because the argument requires a
single source file (this seems like an oversight in the design of mypy). In other words, each
--cache-map argument implies a specific file and module path combination.
So in the above pathological case of multiple imports, the best we can do is pick the first
import path and refer to the source file through a single module path (e.g. `-m internal.lib1_internal`).