This project aims to demonstrate the benefits of a Lakehouse Architecture for data pipelines, using Databricks to build a pipeline that ingests data from Salesforce.
The ingestion pipeline extracts data from Salesforce into the bronze layer; an ETL pipeline then processes and transforms that data in the silver layer; finally, business logic is applied in the gold layer.
Orchestration is managed by Lakeflow Declarative Pipelines, which defines the ingestion and transformation steps as a unified DAG (Directed Acyclic Graph). The job uses a dependency-based scheduling model, currently running daily, ensuring the ETL starts immediately upon successful Bronze loading.
- Traditional architectures require separate systems for Data Lake (e.g., S3/ADLS) and Data Warehouse (e.g., Redshift/BigQuery), leading to data redundancy and complex synchronization. A unified Lakehouse Architecture using Unity Catalog eliminates this duplication.
- Legacy pipelines require complex setup and management of separate ETL tools and orchestrators (Airflow / Fivetran / dbt).
- We leverage Lakeflow Declarative Pipelines to define the entire flow (Bronze → Silver → Gold) in a single, governed environment, simplifying scheduling and maintenance.
A unified Lakehouse solves this: connections, storage, orchestration, and governance are all managed in one place.
- Install the Databricks CLI.
- Install the Databricks VSCode extension (not mandatory, but very useful).
- Clone this repo:

```bash
git clone https://github.com/fran-cornachione/salesforce_ingestion_pipeline_databricks
```

- Create a `.env` file and add the following variables:
```
EMAIL_RECIPIENTS = [Users on the workspace that will receive an alert whenever the pipeline fails]
CATALOG = [The catalog name for the project]
SCHEMA = [The name of the ingestion / bronze schema]
SILVER_SCHEMA = [The name of the silver schema]
WAREHOUSE_ID = [Your warehouse ID]
```
These variables are referenced in the project as `${env.VARIABLE_NAME}`.
First, a Salesforce connection must be created:
Catalog → External Data → Connections → Create Connection
- Connection Name: The name of the connection (e.g., `Salesforce`, `salesforce_conn`)
- Connection Type: Salesforce
Then, authenticate with a Salesforce account.
Now, we have to create the Ingestion Pipeline:
Jobs & Pipelines → Ingestion Pipeline → Databricks Connectors → Salesforce
- Pipeline Name: The name for the ingestion pipeline
- Event Log Location: The event log contains audit logs, data quality checks, pipeline progress, and errors
Here, we specify what data will be ingested from Salesforce:
Salesforce exposes many tables to choose from; in my case, I used Account, Opportunities, Contacts, and Leads.
Specify where to store the ingested data in Databricks.
In this case, I created a catalog and three schemas following the Medallion Architecture:
```sql
CREATE CATALOG salesforce;
CREATE SCHEMA salesforce.bronze;
CREATE SCHEMA salesforce.silver;
CREATE SCHEMA salesforce.gold;
```

In this step, we configure the pipeline schedules and notifications.
Schedules
In my case, I scheduled the pipeline to run every day at 00:00 (daily). But more advanced logic can be applied:
- CRON Syntax: We can use a CRON expression if more complex logic is needed, for example, running the pipeline daily at midnight, but only from Monday to Friday:

```
0 0 * * 1-5
```
- Timezone: A timezone can be selected, for example, `(UTC-06:00) Central Time (US and Canada)`
Notifications
We can add multiple users to receive an email on pipeline failure or success.
The ingestion pipeline can also be configured with YAML; the pipelines and the job are defined as YAML files in the `salesforce/resources` folder.
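As a rough sketch of what such a YAML resource definition can look like (the resource name, connection name, and exact fields below are illustrative assumptions, not copied from this repo; the accepted fields depend on the connector):

```yaml
# Illustrative sketch of a pipeline resource definition.
resources:
  pipelines:
    salesforce_ingestion:            # hypothetical resource name
      name: salesforce-ingestion
      catalog: ${env.CATALOG}
      schema: ${env.SCHEMA}
      ingestion_definition:
        connection_name: salesforce_conn   # the Unity Catalog connection created earlier
        objects:
          - table:
              source_table: Lead           # one entry per ingested Salesforce object
```

Keeping these definitions in version control makes the pipeline reproducible across workspaces, since the same YAML can be deployed with different `.env` values.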
When the pipeline is executed, we will see this:
4 streaming tables were created (one for each table selected in the ingestion stage).
Streaming tables are updated incrementally: they only process newly inserted or changed records (UPSERT). If 100 leads were added yesterday, only those 100 records are processed, not the entire table.
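As an illustrative sketch of the idea (the table name and source path are hypothetical, not taken from this project), a streaming table can be declared in SQL so each refresh reads only records that arrived since the last run:

```sql
-- Illustrative: an incrementally updated streaming table.
-- Each refresh processes only new source records, not the full table.
CREATE OR REFRESH STREAMING TABLE salesforce.bronze.leads_stream
AS SELECT *
FROM STREAM read_files(
  '/Volumes/salesforce/raw/leads',  -- hypothetical landing location
  format => 'json'
);
```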
The pipeline interface (Lakeflow) provides a real-time health check for every run:
- Upserted (Green): Only new or updated records from Salesforce are processed, ensuring efficient incremental loading.
- Deleted (Orange): Reflects records removed from the source to keep the Lakehouse synchronized.
- Dropped (Grey): A record is dropped when it doesn't pass a quality check (Expectation). For example, if we require an Email to create a Lead and it's missing, the pipeline drops that row automatically. This way, we guarantee that only high-quality data reaches the business.
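A minimal sketch of how such an expectation can be declared in SQL (the table and constraint names are assumptions for illustration):

```sql
-- Illustrative: drop any Lead row whose Email is missing.
CREATE OR REFRESH STREAMING TABLE salesforce.bronze.leads_clean (
  CONSTRAINT email_present EXPECT (Email IS NOT NULL) ON VIOLATION DROP ROW
)
AS SELECT * FROM STREAM(salesforce.bronze.leads_stream);
```

Rows that violate the expectation are counted in the pipeline's event log, which is where the "Dropped" metric above comes from.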
To transform the raw data, I built an ETL pipeline. The transformations are performed in the Silver layer by using CTAS (CREATE TABLE AS SELECT).
The DAG (Directed Acyclic Graph) shows the lineage of each view and its dependencies. This is another strength of the Medallion Architecture: Gold depends on Silver, and Silver depends on Bronze.
In this stage, raw data from Bronze is refined and cleaned.
- Selecting only the necessary columns: some columns were excluded because they are not needed at this stage.
- Normalizing names: (e.g., john doe → John Doe)
- Rounding numeric values: (e.g., 34.987654 → 34.99)
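The cleaning steps above can be sketched as a single CTAS statement (the column names here are assumptions for illustration, not the actual schema of this repo):

```sql
-- Illustrative Silver-layer CTAS: select, normalize, and round.
CREATE TABLE salesforce.silver.contacts AS
SELECT
  id,
  INITCAP(full_name) AS full_name,          -- john doe -> John Doe
  ROUND(annual_revenue, 2) AS annual_revenue -- round to two decimals
FROM salesforce.bronze.contacts;
```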
In the Gold layer, I transformed the cleaned Silver data into a Star Schema designed for high-performance analytics / dashboards. The primary goal was to convert raw CRM records into meaningful business metrics and structured entities.
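As an illustrative sketch of the Star Schema idea (all table and column names below are hypothetical), a Gold fact table can join the cleaned Silver entities into analytics-ready metrics:

```sql
-- Illustrative Gold-layer fact table built from Silver entities.
CREATE TABLE salesforce.gold.fact_opportunities AS
SELECT
  o.id         AS opportunity_id,
  a.id         AS account_id,   -- foreign key to an accounts dimension
  o.stage_name,
  o.amount,
  o.close_date
FROM salesforce.silver.opportunities o
JOIN salesforce.silver.accounts a
  ON o.account_id = a.id;
```

Dashboards then query this fact table (joined to its dimensions) instead of the raw CRM records, which is what makes the Gold layer fast for analytics.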