Implement AiiDA provenance tracking for Airflow via XCom backend and listeners #19
Draft
agoscinski wants to merge 3 commits into phase1
Add comprehensive AiiDA provenance graph creation for Airflow DAG executions:

XCom Backend (src/airflow_provider_aiida/xcom/backend.py):
- Custom XCom backend that creates AiiDA nodes during serialization/deserialization
- WorkChainNode represents the entire DAG run (dag_id + run_id)
- CalcJobNode represents an individual task (task_id + run_id + map_index)
- Data nodes store XCom values as AiiDA-typed data
- Establishes provenance links:
  * CALL_CALC: WorkChain → CalcJob (workflow calls calculation)
  * CREATE: CalcJob → Data (task produces output)
  * INPUT_CALC: Data → CalcJob (data flows to consuming task)
- Smart link labeling:
  * For PythonOperator: extracts parameter names via inspect.signature()
  * For other operators: deterministic hash based on producer task info
- Handles duplicate link prevention and stored node constraints
- Monkey-patches link validation for stored nodes when necessary

Provenance Listener (src/airflow_provider_aiida/plugins/provenance_listener.py):
- Airflow listener plugin that updates AiiDA node states in real time
- Hooks into the DAG run lifecycle (success, failure, running)
- Hooks into the task instance lifecycle (success, failure, running)
- Maps Airflow states to AiiDA ProcessStates:
  * QUEUED/SCHEDULED → CREATED
  * RUNNING → RUNNING
  * SUCCESS → FINISHED
  * FAILED → EXCEPTED
  * SKIPPED → KILLED
- Creates nodes proactively if they don't exist (handles tasks starting before XCom)
- Registered as ProvenanceListenerPlugin for automatic discovery

Common utilities for handling AiiDA nodes (src/airflow_provider_aiida/common/utils.py):
- _get_or_create_workchain_node: Query by unique_id or create a new WorkChainNode
- _get_or_create_calcjob_node: Query by unique_id or create a new CalcJobNode
- _sanitize_link_label: Ensure AiiDA-compatible link labels (alphanumeric + underscore)
- All new nodes are initialized with ProcessState.CREATED

Caveats:
- When deserializing, no information about the input key is given, so an educated guess has to be made, which for the moment fails when maps are used
- on_dag_run_running is not called in the test run environment, so the workchain node is created in the on_task_run_running function
- Airflow gives no guarantee about the order of the callbacks (executed by the task instance) and the XCom backend (executed by the scheduler), so the logic has to be duplicated in the XCom backend and the listeners

Result: Complete AiiDA provenance graph mirroring the Airflow DAG structure, with real-time state synchronization and proper data lineage tracking.
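The link-labeling strategy described above can be sketched roughly as follows. This is a minimal illustration, not the code in backend.py or utils.py: the function names `sanitize_link_label`, `parameter_names`, and `fallback_label`, the choice of SHA-256, the 12-character digest, and the `input_` prefix are all assumptions made for the example.

```python
import hashlib
import inspect
import re


def sanitize_link_label(label: str) -> str:
    """Hypothetical helper: keep letters, digits and underscores only."""
    # Every other character is replaced by an underscore.
    return re.sub(r"[^0-9a-zA-Z_]", "_", label)


def parameter_names(func) -> list[str]:
    """Parameter names of a callable, in declaration order."""
    return list(inspect.signature(func).parameters)


def fallback_label(dag_id: str, task_id: str, run_id: str, map_index: int = -1) -> str:
    """Deterministic label derived from producer task info (assumed fields)."""
    key = f"{dag_id}:{task_id}:{run_id}:{map_index}"
    # A cryptographic hash is stable across processes, unlike built-in hash().
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:12]
    return f"input_{digest}"


def add(x, y=2):
    """Example callable, standing in for a PythonOperator's python_callable."""
    return x + y


if __name__ == "__main__":
    print(parameter_names(add))
    print(sanitize_link_label("my-task.output"))
    print(fallback_label("example_dag", "producer", "manual__2024"))
```

The hash-based fallback is repeatable for the same producer, so the XCom backend and the listeners could derive the same label independently, which matters given the callback-ordering caveat.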
force-pushed from 9023c78 to 1fbadd6
Closed
TODO: these should have tests.
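One possible starting point for such tests is a table-driven check of the state mapping listed in the PR description. The sketch below uses a stand-in `ProcessState` enum and a hypothetical `map_state` function so that it runs without Airflow or AiiDA installed; the real listener's function names are not shown in this PR page, and raising on unknown states is an assumption.

```python
from enum import Enum


class ProcessState(Enum):
    """Stand-in for the AiiDA process states named in the description."""
    CREATED = "created"
    RUNNING = "running"
    FINISHED = "finished"
    EXCEPTED = "excepted"
    KILLED = "killed"


# Mapping as listed in the PR description, keyed by lowercase Airflow state name.
AIRFLOW_TO_AIIDA = {
    "queued": ProcessState.CREATED,
    "scheduled": ProcessState.CREATED,
    "running": ProcessState.RUNNING,
    "success": ProcessState.FINISHED,
    "failed": ProcessState.EXCEPTED,
    "skipped": ProcessState.KILLED,
}


def map_state(airflow_state: str) -> ProcessState:
    """Hypothetical mapper; rejecting unknown states is an assumption."""
    try:
        return AIRFLOW_TO_AIIDA[airflow_state.lower()]
    except KeyError:
        raise ValueError(f"no AiiDA state mapped for {airflow_state!r}") from None


def test_map_state_matches_description():
    expected = [
        ("queued", ProcessState.CREATED),
        ("scheduled", ProcessState.CREATED),
        ("running", ProcessState.RUNNING),
        ("success", ProcessState.FINISHED),
        ("failed", ProcessState.EXCEPTED),
        ("skipped", ProcessState.KILLED),
    ]
    for airflow_state, aiida_state in expected:
        assert map_state(airflow_state) is aiida_state


def test_unknown_state_is_rejected():
    try:
        map_state("deferred")
    except ValueError:
        return
    raise AssertionError("expected ValueError for an unmapped state")
```

Both functions follow the pytest naming convention, so they would be collected automatically if placed in a test module.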