Reproducible environments for local dev, CI, and nightly builds #2979
Replies: 2 comments 4 replies
Yeah, love these goals! And I think the specific bullets listed out are a good way to get there. I think we should consider a generic task runner like Invoke over
It seems like most of this work is already done in #2968 - I think if we want to try using
One note - if we make it harder/unsupported to use PUDL as a library, we are going to have to mess with how we use
Recently we've had a slew of build issues resulting from changes in our upstream dependencies. We've also known for a while that our previously released `catalystcoop.pudl` packages don't always keep working because of creeping dependency incompatibilities.

We've talked in the past year about treating PUDL like an application, not a library. With the move to writing all of our derived outputs into the database that we publish, the main PUDL repository has become a centralized means of producing data, which is then distributed, rather than a tool that we expect others to use to produce data on their own. That said, ideally the data we produce should be reproducible: given a commit or tag from the main repo, we or someone else down the line should be able to generate the same data outputs. The biggest weak point in this aspiration has been our Python environment, which hasn't used individually pinned dependencies. We've archived Docker images in the past, but that's quite a heavyweight system for most people (including us).
In PR #2968 I've created a setup that uses `conda-lock` to create a reproducible conda environment, which refers to exact versions of released packages, including hashes, for all of our direct and indirect dependencies, based entirely on the dependencies specified in `pyproject.toml`. Using these conda environments, we should be able to use exactly the same software, operating on exactly the same input data (archived on Zenodo), to produce exactly the same outputs. The software environment shouldn't change unless we update the lockfile, and the lockfile is checked into the GitHub repo, so a given git commit or tag will always contain the full conda environment specification.

The trick with the conda lockfiles is that we need to be able to use them to manage our environment in several different places, which have different environment expectations:

- **Local development:** We'd update the `pudl-dev` conda environment to use the appropriate platform-specific rendered environment file (e.g. `environments/conda-osx-arm64.lock.yml`). If you run `pytest` directly, this works fine, since it runs in your local development environment. However, we're currently using Tox to do several distinct things: isolate the installation of the PUDL package from the repository, manage virtual environments that are separate from the conda environment and use `pip` to install dependencies, and store a bunch of script-like logic about what set of commands are run, which environment variables are set, and which sets of optional dependencies are installed for each test environment. And unfortunately, Tox doesn't really integrate with the conda lockfiles (the `tox-conda` extension is way out of date). So to use the locked conda environment for local testing, these Tox functionalities would need to either be abandoned or migrated to some other system.
- **CI:** We use the `mambaorg/setup-micromamba` action, so `micromamba` is available in CI, and we can use it to install the locked environment quickly from the master lockfile (no solver is invoked). This means that on the `conda-lockfile` branch the locked environment is already in place. However, Tox is then being used to run the tests, which means that `pip` is being invoked to build another virtual environment, which means we aren't actually using the locked environment for the tests.
- **Nightly builds:** On the `conda-lockfile` branch the Docker image for the nightly builds has been switched to `mambaorg/micromamba`, which can build the locked conda environment very quickly using `micromamba` and the explicit master lockfile (no solver is invoked).
- … on the `conda-lockfile` branch already.
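For concreteness, the lockfile round trip described above might look roughly like the following. This is a sketch, not the exact setup in #2968: the subcommands and flags depend on the `conda-lock` and `micromamba` versions in use, the platform list is illustrative, and the rendered filenames may differ.

```shell
# Solve once from the dependencies declared in pyproject.toml, producing
# the master lockfile (conda-lock.yml) for several platforms.
conda-lock lock --file pyproject.toml \
  --platform linux-64 --platform osx-64 --platform osx-arm64

# Render a platform-specific environment file (e.g. conda-osx-arm64.lock.yml)
# that plain conda/mamba can consume.
conda-lock render --kind env --platform osx-arm64

# Local dev: recreate the pudl-dev environment from the rendered file.
mamba env create --name pudl-dev --file conda-osx-arm64.lock.yml

# CI / Docker: micromamba installs straight from the master lockfile,
# with no solver invoked.
micromamba create --name pudl-dev --file conda-lock.yml
```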
## One Scenario

- We keep specifying our dependencies in `pyproject.toml`, and they all remain available on `conda-forge`.
- We use the rendered platform-specific lockfiles in `environments/`, and run `pytest` directly to run the tests locally for debugging purposes.
- We move commands out of `tox.ini` and into either the GitHub Actions workflow file (where they will definitely be run in CI) or into something like a Makefile so we can run them locally (and also potentially on GitHub Actions). E.g. `make docs` could do all the things that `tox -e docs` does now -- lint the docs, remove old docs builds, and recreate the docs using Sphinx; see the catalog of all the commands in `tox.ini` below.
- We use `micromamba` and the explicit master lockfile to create the python environment that the tests or ETL run in.
- We migrate `tox.ini` to a simpler, commonly used tool like a `Makefile`, and then use `make` to run various tests and builds locally and in CI on GitHub in a uniform way.
- We add `Makefile` targets for tasks like re-locking the conda environment, and use `make conda_lockfile` both locally and in scheduled GitHub Actions.

The above is just one possibility, but I think it would:
- Keep the set of tools involved small (`conda-lock` and `make`).
- Keep all of our dependencies specified in `pyproject.toml`, which can be consumed by multiple other tools if need be (`pip`, `tox`, `conda-lock`, dependabot, etc.)

## Stuff in `tox.ini`

- **linters:** This is stuff that's already done by `pre-commit` / pre-commit.ci and should also be happening in your IDE. It can just be removed.
- **docs:** Can easily be replaced with a Makefile target `make docs`.
- **unit / integration / minmax_rows / validate / jupyter / full_integration / full:** I think these `pytest` commands and compositions of `pytest` commands can be turned into `Makefile` targets pretty easily.
- **ci:** Could be a high-level `Makefile` target which runs everything that would be run as part of the CI, but locally. Could imagine it setting the `$PUDL_OUTPUT` and `$DAGSTER_HOME` environment variables to temporary directories, and running the `ferc_to_sqlite` and `pudl_etl` scripts in such a way as to gather coverage information, using the `etl_fast` inputs.
- **nuke:** Could be another high-level `Makefile` target that doesn't reset `$PUDL_OUTPUT` or `$DAGSTER_HOME` to a temporary directory, clobbers everything from scratch, and then runs all the tests, validations, etc.
- **get_unmapped_ids:** Could be a Makefile target `make unmapped_ids` that invokes the current `pytest` command (or a script which replaces it) and outputs the new IDs for mapping.
- **build / testrelease / release:** Can be replaced with the package-on-tag action that has been deployed in some of our other repositories, or it can be set aside if we don't want to create a pip / conda installable version of `catalystcoop.pudl` going forward. (If we are going to distribute that package, it needs to be tested somewhere!)
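To make that mapping concrete, here's a rough sketch of what a few of those targets might look like. The target names and recipes are hypothetical placeholders (note that GNU Make recipes must be indented with tabs), not PUDL's actual commands:

```make
# Hypothetical Makefile sketch -- recipes are placeholders, not the real
# PUDL invocations.
.PHONY: docs unit integration ci conda_lockfile

docs:  # what `tox -e docs` does now: clean and rebuild the Sphinx docs
	rm -rf docs/_build
	sphinx-build -b html docs docs/_build/html

unit:  # thin wrappers around the pytest commands tox currently composes
	pytest test/unit

integration:
	pytest test/integration

# Everything CI runs, but locally, with outputs in throwaway directories.
ci: export PUDL_OUTPUT := $(shell mktemp -d)
ci: export DAGSTER_HOME := $(shell mktemp -d)
ci: docs unit integration

conda_lockfile:  # re-lock the environment, locally or on a schedule in CI
	conda-lock lock --file pyproject.toml
```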
## Parallel ETL in CI; Don't make pytest build PUDL DB

We could also update the `ferc_to_sqlite` and `pudl_etl` scripts to invoke Dagster with multiprocessing support, and take the database construction work away from the `pytest` fixtures (which was always kind of a hack) before running the integration tests with `--live-dbs`.
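In a CI workflow, that separation might look something like the following. This is only a sketch: the coverage incantation, the settings-file argument, and the `ETL_SETTINGS` placeholder are assumptions, not the actual `ferc_to_sqlite` / `pudl_etl` interfaces.

```shell
# Build the databases once, outside of pytest, gathering coverage as we go.
export PUDL_OUTPUT=$(mktemp -d)
export DAGSTER_HOME=$(mktemp -d)
ETL_SETTINGS=etl_fast.yml  # placeholder for the fast ETL settings file

# Hypothetical invocations -- real option names may differ.
coverage run --append "$(which ferc_to_sqlite)" "$ETL_SETTINGS"
coverage run --append "$(which pudl_etl)" "$ETL_SETTINGS"

# Then run the integration tests against the pre-built databases instead
# of having pytest fixtures rebuild them.
pytest --live-dbs test/integration
```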
## Other options?

I'm sure there are other options we could explore (like moving from `tox` to the Python-based Nox). Any other thoughts or suggestions?