Description
Under certain conditions, gitsync calls can result in Airflow DAG-parsing errors that are not easily cleared because they persist in the metadata database. This should at least be documented (for cases where configuration suffices), but ideally it should be fixed internally in the framework. This ticket will require some research before a solution is implemented.
Example
A combination of
- A DAG that uses submodules
- Gitsync calls not run regularly enough in relation to the DAG-processor calls
can result in submodule cache files (*.pyc) being in an inconsistent state when Airflow, having detected changes to DAGs, starts to (re-)parse them. The parsing interval is defined by AIRFLOW__DAG_PROCESSOR__MIN_FILE_PROCESS_INTERVAL and defaults to 30s. The period between gitsync calls is defined by the gitsync resource field wait, which defaults to 20s. In this case the cache may be inconsistent when DAG processing starts. Documenting this should be sufficient for many situations and users.
Note
This problem does not seem to happen when the dag-processor process is part of the scheduler pod rather than being a standalone role.
Improving symlinks
The /stackable/app/git-x folder looks like this:
drwxr-sr-x 9 stackable stackable 4096 Jan 8 17:01 .git
drwxr-sr-x 3 stackable stackable 4096 Jan 8 16:58 .worktrees
lrwxrwxrwx 1 stackable stackable 51 Jan 8 16:58 current -> .worktrees/933f524d2aac463b2e5904fe566af1a74b3ff378
with e.g. AIRFLOW__CORE__DAGS_FOLDER=/stackable/app/git-x/current/mount-dags-gitsync/dags_airflow3
current is flipped to a new worktree once the gitsync is complete. However, if Airflow is watching current (or something under it), it is not insulated from filesystem churn: although the symlink update itself is atomic, a sequence of file operations through the symlink is not.
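A minimal sketch of the effect, assuming a throwaway directory that only mimics the layout above (the worktree names aaa and bbb and the files dag.py and helper.py are hypothetical). It contrasts a reader that resolves every path through current with one that resolves current once and then reads from the resolved directory:

```shell
#!/bin/sh
# Hypothetical layout that mimics /stackable/app/git-x in a temp directory.
demo=$(mktemp -d)
mkdir -p "$demo/.worktrees/aaa" "$demo/.worktrees/bbb"
for rev in aaa bbb; do
    echo "dag from $rev"    > "$demo/.worktrees/$rev/dag.py"
    echo "helper from $rev" > "$demo/.worktrees/$rev/helper.py"
done
ln -s .worktrees/aaa "$demo/current"

# Reader A resolves each path through `current` separately.
a_dag=$(cat "$demo/current/dag.py")

# Reader B resolves `current` once and keeps using the resolved directory.
pinned=$(readlink -f "$demo/current")
b_dag=$(cat "$pinned/dag.py")

# A sync completes between the two reads and flips the symlink.
ln -sfn .worktrees/bbb "$demo/current"

a_helper=$(cat "$demo/current/helper.py")
b_helper=$(cat "$pinned/helper.py")

echo "through current: $a_dag / $a_helper"
echo "pinned:          $b_dag / $b_helper"
```

Reader A ends up with dag.py from one worktree and helper.py from the other, while reader B sees a single revision. The pinned read only works here because the old worktree is still on disk; whether that holds in a real deployment depends on when gitsync cleans up old worktrees.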
An alternative could be to use the exechook parameter to flip a second symlink to the target DAG folder:
ln -sfn /stackable/app/git-x/current /stackable/app/airflow-dags
export AIRFLOW__CORE__DAGS_FOLDER=/stackable/app/airflow-dags/mount-dags-gitsync/dags_airflow3
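If the research task confirms this approach, the hook could look roughly like the sketch below. It is not a tested implementation: the function name flip_link and the temp-directory demo are hypothetical, and it assumes a mv that supports -T (GNU coreutils or BusyBox). Depending on the ln implementation, ln -sfn may remove the old link before creating the new one, so the sketch creates a temporary symlink and renames it over the existing link instead:

```shell
#!/bin/sh
# flip_link TARGET LINK (hypothetical helper)
# Points LINK at TARGET by renaming a temporary symlink over it, so a reader
# resolving LINK sees either the old or the new target, never a missing link.
flip_link() {
    target=$1
    link=$2
    tmp="$link.tmp.$$"
    ln -sfn "$target" "$tmp"
    # -T treats LINK as a plain file even if it currently points at a directory;
    # without it, mv would move the temp link *into* that directory.
    mv -T "$tmp" "$link"
}

# Demo in a throwaway directory that mimics the paths from the ticket.
demo=$(mktemp -d)
mkdir -p "$demo/git-x/.worktrees/aaa"
ln -s .worktrees/aaa "$demo/git-x/current"

flip_link "$demo/git-x/current" "$demo/airflow-dags"
flip_link "$demo/git-x/current" "$demo/airflow-dags"   # repeated calls are safe

readlink "$demo/airflow-dags"
```

As in the ln -sfn command above, the second symlink here still points at current rather than at a concrete worktree. Whether the hook should instead resolve current first (for example with readlink -f) is one of the questions for the research task.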
Tasks
- extend documentation to highlight possible problematic config combinations and advise around them
- research if the second symlink approach will work
- implement in the gitsync component in operator-rs
- roll out to airflow- and nifi-operators