Automatic DagBundle Configuration Loading #59799
Unanswered
raiffeisenbankinternational-bot
asked this question in
Ideas
Summary
This proposal requests a new feature in Apache Airflow 3.x to automatically load DagBundle configurations from local JSON files using the file:// protocol, with automatic reloading support.
Current Limitation: Airflow 3.x requires DagBundle configurations to be manually embedded as JSON strings in airflow.cfg or set as environment variables, making configurations difficult to manage and update.
Proposed Solution: Extend the existing dag_bundle_config_list parameter to support file:// URLs pointing to local JSON files, and add a dag_bundle_config_list_watch parameter to enable automatic reloading when the file changes.
Problem Statement
Current Airflow 3.x DagBundle Configuration
Airflow 3.x introduced DagBundles as a powerful way to load DAGs from multiple Git repositories, enabling decentralized DAG management. However, the configuration mechanism has limitations:
Current Approach:
Limitations:
airflow.cfg or environment variables
Current Workaround
Organizations currently implement manual workarounds:
dag_bundle_config.json externally (CI/CD, automation tools)
airflow.cfg or set as environment variable (manual step)
This manual conversion and restart process breaks automation and introduces operational complexity.
Use Case: Multi-Team Airflow Environment with GitOps Onboarding
Environment Overview
Our organization runs Apache Airflow 3.x serving multiple data engineering teams in a shared platform model. We support over 30 teams with independent DAG repositories, provide automated onboarding via GitHub workflows, maintain decentralized DAG ownership where each team controls their own repository, and follow GitOps-driven configuration management principles.
Automated Onboarding Workflow
When a new team requests access to our Airflow platform, our automated workflow generates their configuration and writes it to a local JSON file. However, to apply these changes, we must either restart Airflow services or manually convert the JSON to a string and update the configuration file.
flowchart TD
    A[Team Creates Onboarding Issue] --> B[GitHub Workflow Triggered]
    B --> C[Create Team Repository]
    C --> D[Generate teams/team-name.yaml]
    D --> E[Create Pull Request]
    E --> F[Platform Team Reviews PR]
    F --> G[PR Merged to Main]
    G --> H[GitHub Actions: Generate Config]
    H --> I[Write dag_bundle_config.json]
    I --> J[Manual: Restart Airflow]
    J --> K[Team Onboarded]
    style J fill:#ffcccc
Red box indicates manual intervention that breaks automation.
Current Pain Point
Each time the configuration changes, operators must either:
airflow.cfg
This introduces operational overhead, service interruptions, inability to add teams without downtime, and configuration management complexity.
What We Want
Ideal Configuration:
Expected Behavior:
dag_bundle_config_list_watch = True
Proposed Solution
Enhanced dag_bundle_config_list Parameter
Extend the existing dag_bundle_config_list configuration parameter to support both JSON strings and file:// URLs:
New Configuration Parameter: dag_bundle_config_list_watch
Add a new parameter to enable automatic file watching and reloading:
When dag_bundle_config_list_watch = True, Airflow monitors the file for changes and automatically reloads the DagBundle configuration without requiring a service restart.
Future Possibilities
While this proposal focuses on local file support, the same dag_bundle_config_list parameter could be extended in the future to support:
dag_bundle_config_list = https://config-server.example.com/config.json
dag_bundle_config_list = s3://bucket/path/config.json
The file:// protocol provides immediate value with minimal complexity, while establishing a pattern for future enhancements.
Implementation Requirements
1. File Protocol Support
file:///etc/airflow/config/dag_bundle_config.json
file://config/dag_bundle_config.json
$AIRFLOW_HOME
[{"name": "...", ...}]
2. Loading Behavior
On Scheduler Start:
dag_bundle_config_list starts with file://
File Watching (when dag_bundle_config_list_watch = True):
inotify on Linux, FSEvents on macOS)
3. Backward Compatibility
file://, treat as file path; otherwise treat as JSON string
dag_bundle_config_list_watch defaults to False
4. Configuration Validation
5. Error Handling
6. Observability
Logging:
Benefits
Simplified Configuration Management
Keeping configuration in separate JSON files, rather than embedded in INI files, gives a clean separation of concerns. Large configurations become easier to read and maintain, version control shows clear diffs of exactly what changed, and standard JSON tools can be used for validation and formatting.
Zero-Downtime Updates
With file watching enabled, teams can be added or removed without restarting Airflow services. Configuration changes take effect automatically within seconds, eliminating service interruptions and enabling continuous operations.
GitOps-Friendly
This approach supports configuration as code stored in Git repositories with all changes tracked through pull requests. It enables automated deployment pipelines and provides self-loading configuration without manual intervention.
Developer Experience
Developers benefit from standard JSON format that's easier to edit than embedded strings in INI files. IDE support with syntax highlighting and validation is readily available, and configurations can be tested and validated before deployment.
Scalability
This solution supports any number of teams without configuration file bloat, makes it easy to add or remove teams dynamically, and is suitable for both small and large deployments.
Cloud-Native Ready
The approach is container-friendly with file mounts, ready for Kubernetes ConfigMaps or volume mounts, and compatible with immutable infrastructure patterns.
Simple Implementation
The solution focuses on file:// protocol first, minimizing complexity with no network dependencies or authentication requirements. It uses standard filesystem operations and can be extended in the future to support remote protocols.
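A minimal sketch of the file:// handling, assuming a hypothetical helper named load_bundle_config (not an existing Airflow function). It treats a value starting with file:// as an absolute path to a JSON file and anything else as an inline JSON string. Relative paths and file watching are left out of this sketch.

```python
import json
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname


def load_bundle_config(value: str) -> list:
    """Return the bundle list from a file:// URL or an inline JSON string."""
    value = value.strip()
    if value.startswith("file://"):
        # Proposed behaviour (hypothetical): read the JSON document from disk.
        path = Path(url2pathname(urlparse(value).path))
        text = path.read_text(encoding="utf-8")
    else:
        # Existing behaviour: the value itself is the JSON document.
        text = value
    bundles = json.loads(text)
    if not isinstance(bundles, list):
        raise ValueError("DagBundle configuration must be a JSON array")
    return bundles
```

Because the inline branch is unchanged, a value that works today would load the same way under this sketch.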
Example Configurations
Example 1: Basic File Reference
Configuration File (/etc/airflow/config/dag_bundle_config.json):
[
  {
    "name": "analytics",
    "classpath": "airflow.providers.git.bundles.git.GitDagBundle",
    "kwargs": {
      "repo_url": "https://github.com/org/airflow-teams-analytics-prod",
      "tracking_ref": "main",
      "refresh_interval": 60,
      "subdir": "dags"
    }
  },
  {
    "name": "finance",
    "classpath": "airflow.providers.git.bundles.git.GitDagBundle",
    "kwargs": {
      "repo_url": "https://github.com/org/airflow-teams-finance-prod",
      "tracking_ref": "main",
      "refresh_interval": 60,
      "subdir": "dags"
    }
  }
]
Example 2: With File Watching Enabled
Now when you update /etc/airflow/config/dag_bundle_config.json, changes are automatically detected and applied without restarting Airflow.
Example 3: Relative Path
Example 4: Traditional JSON String (Unchanged)
Example 5: Environment Variable
Example 6: Kubernetes ConfigMap
airflow.cfg:
Alternative Workarounds (Current State)
While waiting for this feature, organizations implement various workarounds:
Workaround 1: Startup Script with JSON Conversion
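A wrapper of this kind could look roughly like the sketch below. The function name export_bundle_config is hypothetical; the environment variable name assumes Airflow's AIRFLOW__{SECTION}__{KEY} convention applied to the dag_bundle_config_list option in the [dag_processor] section.

```python
import json
from pathlib import Path
from typing import MutableMapping

# Assumed environment-variable form of [dag_processor] dag_bundle_config_list.
ENV_VAR = "AIRFLOW__DAG_PROCESSOR__DAG_BUNDLE_CONFIG_LIST"


def export_bundle_config(json_path: str, env: MutableMapping[str, str]) -> str:
    """Read the JSON file and store it in env as a compact one-line string."""
    bundles = json.loads(Path(json_path).read_text(encoding="utf-8"))
    # Compact separators avoid the newlines and padding that make embedding awkward.
    compact = json.dumps(bundles, separators=(",", ":"))
    env[ENV_VAR] = compact
    return compact
```

A startup script would call export_bundle_config(path, os.environ) before launching the Airflow process. Any later change to the file still needs a restart to take effect.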
Issues: Requires wrapper scripts, not portable, configuration not reloadable without restart
Workaround 2: Config File Template with JSON Embedding
Deployment script:
Issues: Templating complexity, escaping problems, manual deployment steps
Workaround 3: Manual airflow.cfg Editing
Manually copy-paste JSON content into airflow.cfg:
Issues: Error-prone, hard to maintain, difficult to track changes, requires restart
All workarounds share common problems: They are fragile and error-prone, require custom automation or manual intervention, lack official support, are difficult to maintain, and always require service restarts for changes.
Implementation Proposal
Phase 1: File Protocol Support (Initial Implementation)
The initial implementation focuses on the file:// protocol for simplicity and immediate value:
file:// prefix in dag_bundle_config_list parameter
This requires extending the configuration parser in the dag_processor module to detect and handle file:// URLs differently from JSON strings.
Phase 2: File Watching (Automatic Reload)
Add support for the dag_bundle_config_list_watch parameter:
watchdog library for Python)
This enables zero-downtime updates and eliminates the need for service restarts when adding or removing teams.
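As a rough illustration of the reload behaviour, the sketch below polls the file's modification time instead of using inotify, FSEvents, or watchdog. Polling is swapped in only to keep the example dependency-free. BundleConfigReloader is a hypothetical name, and keeping the last good configuration when the new file fails to parse is an assumption about the desired error handling.

```python
import json
import os


class BundleConfigReloader:
    """Reload a JSON config file when its modification time changes."""

    def __init__(self, path: str):
        self.path = path
        self._mtime_ns = None
        self.bundles = []
        self.check()

    def check(self) -> bool:
        """Return True if the file changed and was reloaded successfully."""
        mtime_ns = os.stat(self.path).st_mtime_ns
        if mtime_ns == self._mtime_ns:
            return False
        self._mtime_ns = mtime_ns
        try:
            with open(self.path, encoding="utf-8") as f:
                new_bundles = json.load(f)
        except json.JSONDecodeError:
            # Assumption: keep the last good configuration if the file is invalid.
            return False
        self.bundles = new_bundles
        return True
```

A long-running process would call check() on a timer; an event-based watcher could call the same method from its change callback.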
Configuration Parameters
Add to the [dag_processor] section:
Future Enhancements (Out of Scope for Initial Implementation)
Once the file:// protocol and watching mechanism are proven, the same pattern can be extended to support remote protocols:
These would use the same dag_bundle_config_list parameter with different URL schemes and could reuse the file watching pattern with periodic polling.
Technical Considerations
1. File Path Resolution
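One possible resolution rule, sketched under an assumption this proposal leaves open: a value of the form file:///abs/path is treated as absolute, and file://rel/path is resolved against $AIRFLOW_HOME. The helper name resolve_config_path is hypothetical.

```python
from pathlib import Path

PREFIX = "file://"


def resolve_config_path(url: str, airflow_home: str) -> Path:
    """Map a file:// value to a filesystem path.

    Assumption: three slashes mean an absolute path; two slashes mean a
    path relative to airflow_home.
    """
    if not url.startswith(PREFIX):
        raise ValueError(f"not a file:// URL: {url!r}")
    raw = url[len(PREFIX):]
    if raw.startswith("/"):
        return Path(raw)
    return Path(airflow_home) / raw
```

Resolving against the current working directory instead would only change the last line, which is why the base directory is passed in explicitly.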
2. File Watching Implementation
Use the watchdog Python library or platform-specific mechanisms:
Debounce file change events to avoid multiple reloads during atomic writes.
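Debouncing can be sketched independently of the watching mechanism. The Debouncer class below is hypothetical: it records when the last change event arrived and reports that a reload is due only after a quiet period, so a burst of events from one atomic write triggers a single reload.

```python
import time


class Debouncer:
    """Collapse bursts of change events into one reload."""

    def __init__(self, quiet_seconds: float, clock=time.monotonic):
        self.quiet_seconds = quiet_seconds
        self._clock = clock  # injectable so the behaviour can be tested
        self._last_event = None

    def notify(self) -> None:
        """Record that a file change event arrived."""
        self._last_event = self._clock()

    def should_reload(self) -> bool:
        """Return True once, after events have been quiet for quiet_seconds."""
        if self._last_event is None:
            return False
        if self._clock() - self._last_event < self.quiet_seconds:
            return False
        self._last_event = None
        return True
```

The watcher callback would call notify(), and a periodic loop would reload the configuration whenever should_reload() returns True.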
3. Thread Safety
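One way to keep readers safe while a reload happens, shown as a sketch only: build the new configuration completely, then swap it in under a lock. BundleConfigHolder is a hypothetical name, and the whole-list swap is an assumed design rather than how Airflow manages bundles today.

```python
import threading


class BundleConfigHolder:
    """Hold the active bundle list and replace it atomically."""

    def __init__(self, bundles):
        self._lock = threading.Lock()
        self._bundles = tuple(bundles)

    def get(self) -> tuple:
        """Return an immutable snapshot of the active configuration."""
        with self._lock:
            return self._bundles

    def replace(self, bundles) -> None:
        """Swap in a fully built configuration."""
        new_value = tuple(bundles)  # build outside the lock
        with self._lock:
            self._bundles = new_value
```

A reader that obtained a snapshot before the swap keeps working with the old list and never sees a half-updated one.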
4. Performance
5. Testing
Unit Tests:
Integration Tests:
End-to-End Tests:
Security Implications
1. File Permissions
Best Practices:
0644 or 0640 (read-only for Airflow user)
root or airflow)
Example:
2. Configuration Validation
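A minimal validation sketch, assuming the entry shape used in this proposal's example file (name, classpath, kwargs). validate_bundle_config is a hypothetical helper, and Airflow's own validation rules may differ.

```python
def validate_bundle_config(bundles) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    if not isinstance(bundles, list):
        return ["top-level value must be a JSON array"]
    errors = []
    seen = set()
    for i, entry in enumerate(bundles):
        if not isinstance(entry, dict):
            errors.append(f"entry {i}: must be a JSON object")
            continue
        name = entry.get("name")
        if not isinstance(name, str) or not name:
            errors.append(f"entry {i}: 'name' must be a non-empty string")
        elif name in seen:
            errors.append(f"entry {i}: duplicate name {name!r}")
        else:
            seen.add(name)
        if not isinstance(entry.get("classpath"), str):
            errors.append(f"entry {i}: 'classpath' must be a string")
        if not isinstance(entry.get("kwargs", {}), dict):
            errors.append(f"entry {i}: 'kwargs' must be an object")
    return errors
```

Running a check like this before applying a reloaded file would let a malformed edit be rejected while the previous configuration stays active.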
3. Audit Logging
4. Access Control
Migration Path for Existing Users
Step 1: Create JSON Configuration File
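This step could be scripted along the following lines. The sketch assumes the existing inline JSON value has already been read from airflow.cfg or the environment; write_bundle_config_file is a hypothetical helper name.

```python
import json
from pathlib import Path


def write_bundle_config_file(inline_json: str, target: str) -> int:
    """Write an inline JSON string to target, pretty-printed; return the entry count."""
    bundles = json.loads(inline_json)
    if not isinstance(bundles, list):
        raise ValueError("DagBundle configuration must be a JSON array")
    # Indented output gives readable diffs once the file is under version control.
    Path(target).write_text(json.dumps(bundles, indent=2) + "\n", encoding="utf-8")
    return len(bundles)
```

Parsing before writing means a malformed inline value is caught at migration time rather than at the next Airflow start.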
Step 2: Update Airflow Configuration
Step 3: Test and Validate
Step 4: Enable File Watching (Optional)
Now you can add teams by simply editing the JSON file - no restart required!
Real-World Impact
Our Organization
Before this feature:
Team onboarding takes 2-4 hours due to manual configuration conversion and restart. Platform team involvement is high, requiring manual intervention. The error rate is approximately 10% due to JSON escaping and formatting issues. Service interruptions occur with every configuration change.
With this feature:
Team onboarding would be reduced to 10-15 minutes with simple file updates. Platform team involvement would be minimal, limited to PR review only. Error rates would drop to less than 1% with standard JSON editing. With file watching enabled, configuration changes would cause no downtime.
Scale:
Our deployment currently serves 30 or more onboarded teams managing 150 or more DAG repositories containing over 500 active DAGs across 5 Airflow environments including development, QA, staging, production, and disaster recovery.
Community Benefits
This feature would benefit:
Related Airflow Features
This proposal complements existing Airflow 3.x features:
file:// URLs in some parameters
The missing piece this proposal addresses is:
file:// protocol
This is a natural extension of Airflow's existing configuration system and DagBundle architecture.
Questions for Discussion
dag_bundle_config_list_watch?
file:// prefix?
$AIRFLOW_HOME or current working directory?
file:// is stable, which remote protocol should be prioritized next (HTTP, S3, Git)?
Conclusion
Apache Airflow 3.x's DagBundle feature is powerful but limited by requiring JSON strings embedded in configuration files. Adding support for file-based configuration loading via the file:// protocol would:
This feature represents a simple, practical improvement that addresses real operational pain points while establishing patterns for future enhancements (HTTP, S3, etc.).
We believe this feature would be valuable to the entire Airflow community and are willing to contribute to its implementation.
Date: 2025-12-25
Target: Apache Airflow 3.x