In this chapter, you will get to know the anatomy of a standard BigFlow project, deployment artifacts, and CLI commands.
The BigFlow build command packages your processing code into a Docker image. Thanks to this approach, you can create any environment you want for your workflows. What is more, you don't need to worry about dependency clashes on Cloud Composer (Airflow).
There are two types of artifacts which BigFlow produces:
- Deployment artifact — that's a final build product which you deploy.
- Intermediate artifact — an element of a deployment artifact, or a by-product, typically useful for debugging.
The concept of BigFlow deployment artifacts looks like this:
BigFlow turns your project into a standard Python package (which can be uploaded to pypi or
installed locally using pip).
Next, the package is installed on a Docker image with a fixed Python version.
Finally, BigFlow generates Airflow DAG files which use this image.
From each of your workflows, BigFlow generates a DAG file.
Every produced DAG consists only of KubernetesPodOperator objects, which
execute operations on a Docker image.
To build a project, you need to use the bigflow build command. The documentation you are reading is also a valid BigFlow
project. Go to the docs directory and run the bigflow build command to see how the build process works.
The bigflow build command should produce:
- The dist directory with a Python package (intermediate artifact)
- The build directory with JUnit test results (intermediate artifact)
- The .image directory with a deployment configuration and a Docker image as .tar (deployment artifact)
- The .dags directory with Airflow DAGs, generated from workflows (deployment artifact)
The bigflow build command uses three subcommands to generate all the
artifacts: bigflow build-package, bigflow build-image, bigflow build-dags.
There is also an optional bigflow build-requirements command that allows you
to resolve and freeze the project dependencies.
Now, let us go through each building element in detail, starting from the Python package.
An exemplary BigFlow project, with the standard structure, looks like this:
project_dir/
project_package/
__init__.py
workflow.py
test/
__init__.py
resources/
requirements.in
requirements.txt
Dockerfile
deployment_config.py
setup.py
pyproject.toml
Let us start with the project_package. It's the Python package which contains the processing logic of your workflows.
It also contains Workflow objects, which arrange parts of your processing logic into
a DAG (read the Workflow & Job chapter to learn more about workflows and jobs).
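To give a feel for what lives inside workflow.py: a job is, roughly, an object with an id and an execute method, and a Workflow arranges such jobs into a DAG. The sketch below is only an illustration of that shape (the PrintJob class is hypothetical; see the Workflow & Job chapter for the exact job interface):

```python
# Hypothetical minimal job: an object exposing an `id` attribute and an
# `execute(context)` method, which holds a piece of processing logic.
class PrintJob:
    def __init__(self, id, message):
        self.id = id
        self.message = message

    def execute(self, context):
        # `context` carries runtime information, e.g. the logical execution time.
        print(f'{self.id}: {self.message}')


job = PrintJob('hello', 'Hello!')
job.execute(context=None)  # prints: hello: Hello!

# A Workflow object would then arrange jobs like this one into a DAG, roughly:
# workflow = bigflow.Workflow(workflow_id='example_workflow', definition=[job])
```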
The project_package is used to create a standard Python package, which can be installed using pip.
setup.py is the build script for the project. It turns the project_package into a .whl package.
It's based on setuptools, the standard Python packaging tool.
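For reference, a minimal setuptools-based setup.py can look like this (a generic sketch, not the exact file BigFlow scaffolds for you):

```python
from setuptools import setup, find_packages

setup(
    name='project_package',
    version='0.1.0',
    packages=find_packages(exclude=['test']),
)
```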
You can put your tests into the test package. The bigflow build-package command runs tests automatically, before trying to build the package.
The resources directory contains non-Python files. It is the only directory that is packaged
along with the project_package (so you can access these files after installation, from the project_package). Any other files
inside the project_dir won't be available in the installed package. resources can't be nested, so you can't
have a directory inside the resources directory.
The get_resource_absolute_path function allows you to access files from the resources directory.
from pathlib import Path
from bigflow.resources import get_resource_absolute_path

with open(get_resource_absolute_path('example_resource.txt', Path(__file__))) as f:
    print(f.read())

Run the above example, using the following command:

bigflow run --workflow resources_workflow

Result:

Welcome inside the example resource!
Because every BigFlow project is a standard Python package, we suggest going through the official Python packaging tutorial.
The bigflow build-package command takes three steps to build a Python package from your project:
- Cleans leftovers from a previous build.
- Runs tests from the test package and generates a JUnit XML report, using the unittest-xml-reporting package. You can find the generated report inside the project_dir/build/junit-reports directory.
- Runs the bdist_wheel setuptools command. It generates a .whl package which you can upload to pypi or install locally: pip install your_generated_package.whl.
Go to the docs project and run the bigflow build-package command to observe the result. Now you can install the
generated package using pip install dist/examples-0.1.0-py3-none-any.whl. After you install the .whl file, you can
run jobs and workflows from the docs/examples package. They are now installed in your virtual environment, so you can run them
anywhere in your directory tree, for example:
cd /tmp
bigflow run --workflow hello_world_workflow --project-package examples

You probably won't use this command very often, but it's useful for debugging. Sometimes you want to check that your project works as expected in the form of an installed package (and not just as code in your project directory).
Deployment artifacts, like Docker images, need to be versioned. BigFlow provides automatic versioning based on the git tags system. There are two commands you need to know.
The bigflow project-version command prints the current version of your project:
bigflow project-version
>>> 0.34.0
BigFlow follows the standard semver schema:
<major>.<minor>.<patch>
If BigFlow finds a tag on the current commit, it uses it as the current project version. If there are commits after the last tag, or the working directory is dirty, it creates a snapshot version with the following schema:
<major>.<minor>.<patch><snapshot_id>
For example:
bigflow project-version
>>> 0.34.0SHAdee9af83SNAPSHOT8650450a
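To make the schema concrete, the snapshot id is appended right after the semver part. A toy illustration of composing such a version string (the function below is hypothetical, not BigFlow's actual implementation):

```python
def snapshot_version(last_tag: str, commit_sha: str, snapshot_id: str) -> str:
    # <major>.<minor>.<patch>, then the SHA of the current commit, then a
    # snapshot id identifying this particular build.
    return f'{last_tag}SHA{commit_sha}SNAPSHOT{snapshot_id}'

print(snapshot_version('0.34.0', 'dee9af83', '8650450a'))
# prints: 0.34.0SHAdee9af83SNAPSHOT8650450a
```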
If you are ready to release a new version, you don't have to set a new tag manually. You can use the bigflow release command:
bigflow project-version
>>> 0.34.0SHAdee9af83SNAPSHOT8650450a
bigflow release
bigflow project-version
>>> 0.35.0
If needed, you can specify an SSH identity file to be used when pushing the tag to a remote repository:
bigflow release --ssh-identity-file /path/to/id_rsa
bigflow release -i keys.pem
To run a job in a desired environment, BigFlow makes use of Docker. Each job is executed from a Docker container,
which runs a Docker image built from your project. The default Dockerfile
generated from the scaffolding tool looks like this:
FROM python:3.7
COPY ./dist /dist
RUN apt-get -y update && apt-get install -y libzbar-dev libc-dev musl-dev
RUN pip install pip==20.2.4
RUN for i in /dist/*.whl; do pip install $i --use-feature=2020-resolver; done

The basic image installs the generated Python package. With the installed package, you can run a workflow or a job from a Docker environment.
Run the bigflow build-image command inside the docs project. Next, you can run the example workflow using Docker:
docker load -i .image/image-{Generated image version}.tar
docker run {Your loaded image ID} bigflow run --job hello_world_workflow.hello_world --project-package examples

DAGs generated by BigFlow use KubernetesPodOperator to call this Docker command.
BigFlow generates Airflow DAGs from workflows found in your project.
Every generated DAG utilizes only KubernetesPodOperator.
To see how it works, go to the docs project and run the bigflow build-dags command.
One of the generated DAGs, for the resources.py workflow, looks like this:
import datetime
from airflow import DAG
from airflow.contrib.operators import kubernetes_pod_operator
default_args = {
'owner': 'airflow',
'depends_on_past': True,
'start_date': datetime.datetime(2020, 8, 30),
'email_on_failure': False,
'email_on_retry': False,
'execution_timeout': datetime.timedelta(minutes=180),
}
dag = DAG(
'resources_workflow__v0_1_0__2020_08_31_15_00_00',
default_args=default_args,
max_active_runs=1,
schedule_interval='@daily'
)
print_resource_job = kubernetes_pod_operator.KubernetesPodOperator(
task_id='print-resource-job',
name='print-resource-job',
cmds=['bf'],
arguments=['run', '--job', 'resources_workflow.print_resource_job', '--runtime', '{{ execution_date.strftime("%Y-%m-%d %H:%M:%S") }}', '--project-package', 'examples', '--config', '{{var.value.env}}'],
namespace='default',
image='eu.gcr.io/docker_repository_project/my-project:0.1.0',
is_delete_operator_pod=True,
retries=3,
retry_delay=datetime.timedelta(seconds=60),
dag=dag)

Every job in a workflow maps to a KubernetesPodOperator.
BigFlow sets reasonable default values for the required operator arguments. You can modify
some of them by setting properties on a job.
Similarly, you can modify DAG properties by setting properties on a workflow.
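As a sketch, such an override could look like this (the property names retry_count and retry_pause_sec are assumptions here, used only for illustration; check the Workflow & Job chapter for the exact names BigFlow supports):

```python
# Hypothetical job that overrides operator-related defaults via class attributes.
class StableJob:
    id = 'stable_job'
    retry_count = 5        # would map to the operator's `retries` argument
    retry_pause_sec = 120  # would map to the operator's `retry_delay`

    def execute(self, context):
        print('running stable_job')

print(StableJob.retry_count)  # prints: 5
```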
In the resources directory you can find requirements.txt and requirements.in files. You should keep your project
dependencies in the requirements.in file. Then, you can resolve and freeze your project requirements into the requirements.txt,
using the bigflow build-requirements command.
Under the hood, the build-requirements command uses pip-tools.
The build-requirements command is part of the build command, but it's not mandatory to use requirements.in. That
mechanism is optional, so you can just use requirements.txt alone. BigFlow automatically detects if you have requirements.in
in the resources directory and generates or updates the requirements.txt. If requirements.in is not there, then BigFlow
just skips the build-requirements phase.
Using requirements.in is the recommended way of managing dependencies, because it makes your artifacts deterministic (at least
when it comes to resolving dependencies). If you don't use frozen requirements, then each time you install requirements you
can get a different result. In theory it shouldn't be a problem, because you should get compatible dependencies, but in practice
you might get an incompatible dependency which breaks your processing.
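For illustration, a requirements.in usually pins only your direct dependencies, loosely, while the generated requirements.txt freezes the full resolved dependency tree (the package names and versions below are made up):

```
# requirements.in: direct dependencies, loosely pinned
some-library>=1.2
another-library
```

```
# requirements.txt: generated by bigflow build-requirements (pip-tools),
# with every transitive dependency frozen to an exact version
some-library==1.2.3
another-library==4.5.6
transitive-dependency==0.9.1
```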
The pyproject.toml file is part of the standard Python packaging toolset.
BigFlow uses that file to describe the requirements needed to build the project. That description is especially useful in two situations:
- Building the project using systems like Bamboo, Jenkins, Travis, etc.
- Running Apache Beam jobs (go to the chapter about Beam inside BigFlow to understand why).
