Colton Loftus edited this page Sep 3, 2024
# Scheduler IoW Notes
## Issues with the current setup

- build files (`implnet_{jobs,ops}_*.py`) are tracked in the repo, making a verbose git history and PRs more work to review
- multiple organizations have stored configurations in the repo, causing a higher burden on maintainers
- the build is done with environment variables and multiple coupled components instead of one build script, making it more challenging to debug, test, and refactor
## Build steps

- Build the `gleanerconfig.yml`
  - This config builds upon a `gleanerconfigPREFIX.yaml` file that is the base template
- Source the `nabuconfig.yaml`, which specifies configuration and context for how to retrieve triple data and how to store it in minio
- Generate the `jobs/`, `ops/`, `sch/`, and `repositories/` directories, which contain the Python files that describe when to run each job
- Generate the `workspace.yaml` file
  - Some configurations of the `workspace.yaml` file include a `grpc_server` key; others just describe the relative path for the Python file which contains references to all the jobs
  - This might be able to be eliminated or condensed into the other config when refactoring
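A minimal sketch of the two `workspace.yaml` shapes described above, following Dagster's standard workspace schema. The hostname, port, and file path here are illustrative assumptions, not the repo's actual values:

```yaml
load_from:
  # Shape 1: point Dagster at a gRPC code server (e.g. a code container)
  - grpc_server:
      host: code-tasks        # hypothetical service name
      port: 4000
      location_name: "tasks"
  # Shape 2: just reference the generated Python file that registers all the jobs
  - python_file:
      relative_path: repositories/repository.py   # illustrative path
```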
- Set up the docker swarm configuration using `dagster_setup_docker.sh`
  - Create the docker network
  - Create the volume and read in `gleanerconfig.yml`, `workspace.yaml`, and `nabuconfig.yaml`
NOTE: After this point the configuration and docker compose setup involve a significant number of env vars, configuration options, and merged configurations that make the subsequent steps a bit unclear
- Run the docker compose project
  - Source the `.env` file to hold env variables and pass these into the compose project
  - Ensure all the config files are contained inside the container
  - Check if there is a compose override `.yml` file and, if so, pass it in
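The compose invocation described above could be assembled like this. This is a sketch only: the base and override filenames are assumptions, and the actual script may use a different naming convention.

```python
from pathlib import Path


def build_compose_command(project_dir: str) -> list[str]:
    """Assemble a `docker compose` invocation mirroring the steps above.

    Filenames here (`docker-compose.yaml`, `docker-compose.override.yaml`)
    are illustrative assumptions.
    """
    root = Path(project_dir)
    cmd = ["docker", "compose"]

    # Pass the .env file into the compose project if one exists
    env_file = root / ".env"
    if env_file.exists():
        cmd += ["--env-file", str(env_file)]

    # Always load the base compose file
    cmd += ["-f", str(root / "docker-compose.yaml")]

    # Check for a compose override file and, if so, pass it in
    override = root / "docker-compose.override.yaml"
    if override.exists():
        cmd += ["-f", str(override)]

    return cmd + ["up", "-d"]
```

Building the argument list in one place like this also makes the override-detection logic unit-testable without a docker daemon.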
- This docker compose project will manage:
  - traefik as a proxy to access container resources
  - dagster for scheduling crawls. This in turn manages the following:
    - `postgres` appears to be just for storing internal data
    - `dagit` appears to be the config for the actual crawl itself (i.e. uses the `GLEANERIO_*` env vars)
    - `daemon` appears to source the base config for dagster
    - `code-tasks` and `code-project` seem to be gRPC endpoints for interacting with dagster (NOTE: I am a bit unclear on their usage)
  - the s3 provider (minio in this case), gleaner, and nabu for crawling / storing data
- Once crawling is scheduled and completed, I am assuming that the resulting triples will be output in the specified s3 bucket
## Suggested refactors

- Condense code into one central Python build program
  - Use https://github.com/docker/docker-py to control the containers instead of shell scripts (having the whole data pipeline in one language makes it easier to test and debug)
  - By using a cli library like https://typer.tiangolo.com/ we can validate argument correctness and fail early, making it easier to debug than reading in the arguments and failing after containers are spun up
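A minimal sketch of what the docker-py approach could look like. The image, network, volume, and mount-point names are assumptions, not the repo's actual values, and a real build program would derive them from CLI arguments (e.g. parsed with typer):

```python
def container_spec(network: str, config_volume: str) -> dict:
    """Build the keyword arguments for a hypothetical crawler container.

    Kept as a pure function so the spec can be unit tested without a
    docker daemon; all names here are illustrative assumptions.
    """
    return {
        "image": "example/gleaner:latest",  # hypothetical image name
        "network": network,
        "volumes": {config_volume: {"bind": "/configs", "mode": "ro"}},
        "detach": True,
    }


if __name__ == "__main__":
    # Requires a running docker daemon, so this is kept out of the
    # testable path above.
    import docker

    client = docker.from_env()
    client.networks.create("dagster_network", driver="bridge")
    client.volumes.create(name="dagster_configs")
    client.containers.run(**container_spec("dagster_network", "dagster_configs"))
```

Separating the spec construction from the docker client calls keeps the fail-early validation testable in plain Python.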
- Move all build files to the root of the repo to make it more clear for end users (i.e. makefiles, the `build/` directory, etc.)
- Refactor such that individual organizations store their configuration outside the repo
  - The Python build program should be able to read the configuration files at an arbitrary path that the user specifies
- Add types and docstrings for easier long-term maintenance
- Use jinja templating instead of writing raw text to the output files
  - Currently jobs are generated by templating literal function names inside a Python file
  - It is unclear if this is scalable to huge datasets; it is probably best to use a generator so we do not need to load everything into the AST
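The jinja-plus-generator idea could look roughly like this. The template body and job/op names are illustrative assumptions; the real generated `implnet_jobs_*.py` files will differ:

```python
from jinja2 import Template

# Hypothetical template for one generated job definition
JOB_TEMPLATE = Template(
    """\
@job
def implnet_job_{{ name }}():
    harvest_op_{{ name }}()
"""
)


def render_jobs(names):
    """Yield one rendered job definition at a time.

    Because this is a generator, a huge list of sources never has to be
    concatenated into one in-memory string before writing.
    """
    for name in names:
        yield JOB_TEMPLATE.render(name=name)


# The caller can stream results straight to disk:
# with open("implnet_jobs.py", "w") as f:
#     for chunk in render_jobs(source_names):
#         f.write(chunk)
```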
- Create a clearer documentation website