The proposed architecture presents a modern solution for data pipeline orchestration using leading market technologies:
- Orchestration: Apache Airflow for workflow management
- Processing: Azure Databricks for layered transformations (Bronze, Silver, Gold)
- Storage: Azure Storage (Data Lake) + Delta Lake
- Governance: Unity Catalog for Silver and Gold layers
This project's architecture uses Apache Airflow, installed via Astronomer and running in Docker, to orchestrate notebook execution in Azure Databricks. These notebooks process data in three layers:
- Bronze:
  - Ingestion of raw data from the API: https://api.openbrewerydb.org
  - Writing `.json` files to Azure Data Lake
- Silver:
  - Structuring the raw data with geographic partitioning
  - Writing to Delta tables in Unity Catalog
- Gold:
  - Data aggregation into analytics-ready datasets
  - Delta tables managed via Unity Catalog
The primary storage is implemented through Azure Storage integrated with Azure Databricks.
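To illustrate the Silver layer's geographic partitioning, here is a minimal plain-Python sketch. It assumes records carry `country` and `state` fields, as in Open Brewery DB responses; the `silver/breweries` base path is an example, not a fixed convention of this project:

```python
def partition_path(record: dict, base: str = "silver/breweries") -> str:
    """Build a Hive-style partition path from a record's geography.

    Missing values fall back to "unknown" so every record lands in a partition.
    """
    country = (record.get("country") or "unknown").strip().lower().replace(" ", "_")
    state = (record.get("state") or "unknown").strip().lower().replace(" ", "_")
    return f"{base}/country={country}/state={state}"

sample = {"name": "Example Brewing", "country": "United States", "state": "Texas"}
print(partition_path(sample))  # silver/breweries/country=united_states/state=texas
```

In the actual Silver notebook the same idea is expressed with Spark's `partitionBy("country", "state")` when writing the Delta table.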
- OS: Windows 10/11 Pro/Enterprise (64-bit)
- WSL 2: Must be enabled
- Docker Desktop: Latest version installed
- Git: For version control
- Hardware: Minimum 4GB RAM (8GB recommended)
```powershell
# Run as Administrator
wsl --install
wsl --set-default-version 2
```

⚠️ Restart your computer after running these commands.
- Download installer from docker.com
- Run installation with default settings
- After installation:
- Open Docker Desktop
- Go to Settings > General
- Check "Use WSL 2 based engine"
- Go to Settings > Resources > WSL Integration
- Enable your WSL distribution
```shell
# Install Astro CLI without dependencies
winget install -e --id Astronomer.Astro --skip-dependencies
```

Verify the installation:

```shell
astro version
```

Create and initialize the project:

```shell
mkdir airflow-project
cd airflow-project
astro dev init
```

Edit the `requirements.txt` file and add:

```
apache-airflow-providers-databricks
```

Start the environment:

```shell
astro dev start
```

This will:
- Build Docker images
- Start all services
- Make the web UI available on port 8080

Open http://localhost:8080 in your browser.

Default credentials:
- Username: `admin`
- Password: `admin`
| Command | Action |
|---|---|
| `astro dev start` | Starts all services |
| `astro dev stop` | Stops containers |
| `astro dev restart` | Fully restarts the environment |
| `astro dev logs` | Shows service logs |
| `astro dev kill` | Removes all containers |
Issue: Docker won't start
✅ Solution: verify virtualization is enabled in the BIOS

Issue: WSL permission errors
✅ Solution:

```shell
wsl --update
wsl --shutdown
```

Issue: DAGs not appearing
✅ Solution:
- Verify the DAG files are in your Airflow project's `dags/` folder
- Add the DAG from this repository to the `dags/` folder and change the parameters to point to your notebook path in Databricks
- Run:

```shell
astro dev restart
```
1. Access the Airflow UI (http://localhost:8080)
2. Navigate to Admin > Connections
3. Click + to add a new connection
4. Fill in the fields:

| Field | Value |
|---|---|
| Connection ID | `databricks_default` |
| Connection Type | `Databricks` |
| Host | `<your-workspace-url>.cloud.databricks.com` |
| Password | `<your-token>` (token generated in Databricks) |

To generate the access token in Databricks: go to User Settings > Developer > Manage Tokens > Generate New Token.

✨ Done! Your Airflow environment is ready to use! ✨
⚙️ Full Setup: Secure Databricks Access to Azure Storage (Using Access Connector + Unity Catalog)
- Go to the Azure Portal
- Search for "Storage accounts"
- Click + Create
- Subscription: select your subscription
- Resource Group: create or use an existing one
- Storage account name: e.g., `datalakecompany001`
- Region: same region as your Databricks workspace
- Performance: Standard
- Redundancy: choose as needed (e.g., Locally-redundant storage (LRS))
Click on the Advanced tab and configure:
- ✅ Enable hierarchical namespace (required for Delta Lake)
- ✅ Disable public access
- ✅ Disable blob anonymous access (should already be off by default)
- ✅ Enable storage account-level access tiers (optional)

Then click Review + Create → Create
Once the storage account is created:
- Go to the Storage Account > Containers
- Create the container `lakehouse`, which will hold:
  - `bronze`: raw JSON files (external)
  - `silver` & `gold`: managed by Unity Catalog
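The layout above can be expressed as a small path-building helper. This is a plain-Python sketch; the storage account name reuses the earlier `datalakecompany001` example and should be replaced with your own:

```python
STORAGE_ACCOUNT = "datalakecompany001"  # example name from the storage setup
CONTAINER = "lakehouse"

def abfss_path(layer: str, *parts: str) -> str:
    """Build an ABFSS URI inside the lakehouse container for a given layer."""
    if layer not in {"bronze", "silver", "gold"}:
        raise ValueError(f"unknown layer: {layer}")
    suffix = "/".join((layer,) + parts)
    return f"abfss://{CONTAINER}@{STORAGE_ACCOUNT}.dfs.core.windows.net/{suffix}"

print(abfss_path("bronze", "breweries", "2024-01-01.json"))
# abfss://lakehouse@datalakecompany001.dfs.core.windows.net/bronze/breweries/2024-01-01.json
```

Centralizing path construction like this keeps the Bronze notebook's write targets consistent with the external location registered later in Unity Catalog.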
- Search for "Databricks Access Connector" in the Azure Portal
- Click + Create
- Fill in:
  - Name: `databricks-access-connector`
  - Region: same as your Databricks workspace
  - Workspace: select your workspace
- Click Review + Create → Create
- Open the Access Connector you just created
- Go to the Managed Identity section
- Copy the Object ID (it will be used to grant permissions)
- Go to your Storage Account > Access Control (IAM)
- Click + Add > Add Role Assignment
- Role: `Storage Blob Data Contributor`
- Assign access to: Managed identity
- Select: `databricks-access-connector`
- Click Save
- Go to Databricks Workspace > Catalog > External Data > Credentials
- Click + Create Credential
- Fill in:
  - Name: `storage_cred`
  - Authentication type: Managed Identity
  - Managed Identity: select your Databricks Access Connector
- Click Create
- Go to Catalog > External Data > External Locations
- Click + Create External Location
- Fill in:
  - Name: `lakehouse`
  - URL: `abfss://lakehouse@<storage-account>.dfs.core.windows.net/`
  - Credential: `storage_cred`
- In the Databricks Workspace, click your avatar > User Settings
- Go to the Git Integration tab
- Choose Git provider: GitHub
- Paste your GitHub personal access token (PAT)
- Click Save

🔒 Make sure the PAT has the `repo`, `workflow`, and `read:user` scopes enabled.

- In Workspace > Repos, click Add Repo
- Enter the URL of a GitHub repository with sample notebooks (e.g., https://github.com/someone/notebook)
- After cloning, fork or clone this repo to your own GitHub

You can now work directly from Git in your Databricks notebooks!
✅ Enable:
- Hierarchical namespace in storage
- Access via the Access Connector
- Unity Catalog as the metastore
- RBAC and ACLs on Storage

❌ Disable:
- Public/anonymous access to Storage
- Legacy Hive metastore in Databricks

✅ You're all set! Your Databricks workspace now has secure and governed access to Azure Data Lake using Unity Catalog and the Access Connector 🎯
To ensure the reliability and observability of the data pipeline, we suggest implementing monitoring and alerting with Datadog, centralizing the collection of logs and metrics from Apache Airflow and Databricks.
With custom dashboards, it would be possible to monitor:
- Execution status of DAGs and jobs
- Data quality (e.g., nulls, duplicates, anomalies)
- Task performance and execution times
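The data-quality signals above (nulls, duplicates) can be computed as simple custom metrics before shipping them to Datadog. A minimal plain-Python sketch over a batch of records follows; the field names are illustrative:

```python
from collections import Counter

def quality_metrics(records: list[dict], key: str, required: list[str]) -> dict:
    """Count null required fields and duplicate keys in a batch of records."""
    nulls = sum(1 for r in records for f in required if r.get(f) in (None, ""))
    key_counts = Counter(r.get(key) for r in records)
    duplicates = sum(n - 1 for n in key_counts.values() if n > 1)
    return {"rows": len(records), "null_fields": nulls, "duplicate_keys": duplicates}

batch = [
    {"id": "a1", "state": "Texas"},
    {"id": "a1", "state": None},   # duplicate id and null state
    {"id": "b2", "state": "Ohio"},
]
print(quality_metrics(batch, key="id", required=["state"]))
# {'rows': 3, 'null_fields': 1, 'duplicate_keys': 1}
```

Emitting these counts as gauges per DAG run gives the dashboards a concrete series to graph and the alerts a threshold to fire on.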
Alerts can be configured based on:
- Critical or recurring failures
- Slowdowns beyond historical thresholds
- Inconsistencies in data quality
These alerts could trigger webhooks to notify teams via Slack or other communication tools.
As a next step, it is suggested to integrate with ServiceNow to automatically create incidents from critical alerts, further improving operational response capability.
This approach provides greater control, traceability, and agility in managing data pipelines.
📈 Expected Benefits:
- Proactive failure detection
- Fast and structured incident response
- Unified monitoring across platforms
- Scalable and automated support for data operations
