Salesforce Ingestion Pipeline (Lakehouse Architecture)

This project aims to show the benefits of a Lakehouse Architecture for data pipelines, using Databricks to build a pipeline that ingests data from Salesforce.

Flow

The ingestion pipeline extracts data from Salesforce into the bronze layer. An ETL pipeline then processes and transforms that data in the silver layer, and finally business logic is applied in the gold layer.

img

Orchestration

Orchestration is managed by Lakeflow Declarative Pipelines, which defines the ingestion and transformation steps as a unified DAG (Directed Acyclic Graph). The job uses a dependency-based scheduling model, currently running daily, ensuring the ETL starts immediately upon successful Bronze loading.

img

The main problems this project solves:

Storage and Governance Duplication

  • Traditional architectures require separate systems for Data Lake (e.g., S3/ADLS) and Data Warehouse (e.g., Redshift/BigQuery), leading to data redundancy and complex synchronization. A unified Lakehouse Architecture using Unity Catalog eliminates this duplication.

ETL and Orchestration Complexity

  • Legacy pipelines require complex setup and management of separate ETL tools and orchestrators (Airflow / Fivetran / dbt).
  • We leverage Lakeflow Declarative Pipelines to define the entire flow (Bronze → Silver → Gold) in a single, governed environment, simplifying scheduling and maintenance.

A unified Lakehouse addresses both problems: connections, storage, orchestration, and governance are all managed in one place.
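
To make the idea concrete, here is a minimal sketch of a declarative Bronze → Silver → Gold flow in Lakeflow (Delta Live Tables) SQL. The table and column names are illustrative, not this project's actual definitions:

-- Silver: incremental read from a Bronze streaming table (illustrative names)
CREATE OR REFRESH STREAMING TABLE silver_accounts AS
SELECT Id, Name, Industry
FROM STREAM(salesforce.bronze.accounts);

-- Gold: business-level aggregate, recomputed through the pipeline's dependency graph
CREATE OR REFRESH MATERIALIZED VIEW gold_accounts_by_industry AS
SELECT Industry, COUNT(*) AS account_count
FROM silver_accounts
GROUP BY Industry;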


Setting up the Ingestion Pipeline | UI and Asset Bundles

Setup with Databricks Asset Bundles (DAB)

  1. Install the Databricks CLI.
  2. Install the Databricks VSCode extension (not mandatory, but very useful).
  3. Clone this repo: git clone https://github.com/fran-cornachione/salesforce_ingestion_pipeline_databricks
  4. Create a .env file and add the following variables:
  EMAIL_RECIPIENTS = [Users on the workspace that will receive an alert whenever the pipeline fails]
  CATALOG = [The catalog name for the project]
  SCHEMA = [The name of the ingestion / bronze schema]
  SILVER_SCHEMA = [The name of the silver schema]
  WAREHOUSE_ID = [Your warehouse ID]

These variables are referenced in the project as ${env.VARIABLE_NAME}
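
For reference, a .env file following the list above could look like this (placeholder values only; replace them with your own workspace settings):

# Placeholder values for illustration only
EMAIL_RECIPIENTS=data-team@example.com
CATALOG=salesforce
SCHEMA=bronze
SILVER_SCHEMA=silver
WAREHOUSE_ID=1234567890abcdef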

Databricks Web UI Setup

1. Connection

First, a Salesforce connection must be created:

Catalog → External Data → Connections → Create Connection

  • Connection Name: The name of the connection (e.g., Salesforce, salesforce_conn)
  • Connection Type: Salesforce

Then, authenticate with a Salesforce account.

2. Ingestion Setup

Now, we have to create the Ingestion Pipeline:

Jobs & Pipelines → Ingestion Pipeline → Databricks Connectors → Salesforce

img

  • Pipeline Name: The name for the ingestion pipeline
  • Event Log Location: The event log contains audit logs, data quality checks, pipeline progress, and errors

img

3. Source

Here, we specify what data will be ingested from Salesforce:

There are many Salesforce tables to choose from; in my case I used Account, Opportunities, Contacts, and Leads.

4. Destination

Specify where to store the ingested data in Databricks.

In this case, I created a catalog and three schemas following the Medallion Architecture:

CREATE CATALOG salesforce;
CREATE SCHEMA salesforce.bronze;
CREATE SCHEMA salesforce.silver;
CREATE SCHEMA salesforce.gold;

5. Schedules and Notifications

In this step, we configure the pipeline schedules and notifications.

Schedules

In my case, I scheduled the pipeline to run every day at 00:00 (daily). But more advanced logic can be applied:

  • CRON Syntax: We can use CRON if complex logic is needed, for example, running the pipeline daily, but only from Monday to Friday:
0 0 * * 1-5
  • Timezone: A timezone can be selected, for example, (UTC-06:00) Central Time (US and Canada)

Notifications

We can add multiple users to receive an email on pipeline failure or success.

img

The ingestion pipeline can also be configured with YAML; the pipelines and the job are defined as YAML files in the salesforce/resources folder.
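
As a rough sketch of what such resource files can look like, the snippet below follows the general shape of Databricks pipeline and job resources in a bundle. The resource names, the Salesforce object, and the exact keys shown here are illustrative and may differ from the actual files in salesforce/resources; variable interpolation follows this project's ${env.*} convention described above.

# Illustrative sketch only - not the repo's actual resource file
resources:
  pipelines:
    salesforce_ingestion:
      name: salesforce_ingestion
      catalog: ${env.CATALOG}
      schema: ${env.SCHEMA}
      ingestion_definition:
        connection_name: salesforce_conn
        objects:
          - table:
              source_table: Account
              destination_catalog: ${env.CATALOG}
              destination_schema: ${env.SCHEMA}

  jobs:
    salesforce_daily_job:
      name: salesforce_daily_job
      schedule:
        quartz_cron_expression: "0 0 0 * * ?"   # every day at 00:00
        timezone_id: "America/Chicago"          # Central Time (US and Canada)
      email_notifications:
        on_failure:
          - ${env.EMAIL_RECIPIENTS}
      tasks:
        - task_key: salesforce_ingestion_task
          pipeline_task:
            pipeline_id: ${resources.pipelines.salesforce_ingestion.id}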


When the pipeline is executed, we will see this:

img

Four streaming tables were created (one for each table in the ingestion stage).

Streaming tables are updated incrementally: they only process newly inserted or changed records (UPSERT). If 100 leads were added yesterday, only those 100 records are processed, not the entire table.

The pipeline interface (Lakeflow) provides a real-time health check for every run:

  • Upserted (Green): Only new or updated records from Salesforce are processed, ensuring efficient incremental loading.
  • Deleted (Orange): Reflects records removed from the source to keep the Lakehouse synchronized.
  • Dropped (Grey): A record is dropped when it doesn't pass a quality check (Expectation). For example, if we require an Email to create a Lead and it's missing, the pipeline drops that row automatically. This way, we guarantee that only high-quality data reaches the business (see the example below).
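
A minimal example of the kind of expectation that produces those dropped rows, written in Lakeflow (Delta Live Tables) SQL with illustrative table and constraint names:

-- Drop any Lead row that arrives without an Email (illustrative names)
CREATE OR REFRESH STREAMING TABLE silver_leads (
  CONSTRAINT email_present EXPECT (Email IS NOT NULL) ON VIOLATION DROP ROW
)
AS SELECT * FROM STREAM(salesforce.bronze.leads);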

ETL Pipeline

To transform the raw data, I built an ETL pipeline. The transformations are performed in the Silver layer by using CTAS (CREATE TABLE AS SELECT).

img

The DAG (Directed Acyclic Graph) shows the lineage of each view and its dependencies. This is another strength of the Medallion Architecture: Gold depends on Silver, and Silver depends on Bronze.

Silver Layer: Data Cleansing & Standardization

In this stage, raw data from Bronze is refined and cleaned; a simplified example follows the list below.

  • Selecting only necessary columns: columns that are not needed at this stage are excluded.
  • Normalizing names: (e.g., john doe → John Doe)
  • Rounding numeric values to two decimal places: (e.g., 34.987 → 34.99)
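
A simplified CTAS of the kind used here, with hypothetical column names, might look like this:

-- Silver contacts: keep only the needed columns, normalize names, round numeric values (illustrative columns)
CREATE OR REPLACE TABLE salesforce.silver.contacts AS
SELECT
  Id,
  initcap(Name)           AS Name,          -- john doe -> John Doe
  Email,
  round(AnnualRevenue, 2) AS AnnualRevenue  -- two decimal places
FROM salesforce.bronze.contacts;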

Gold Layer: Business Aggregations

In the Gold layer, I transformed the cleaned Silver data into a Star Schema designed for high-performance analytics / dashboards. The primary goal was to convert raw CRM records into meaningful business metrics and structured entities.
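
To illustrate the Star Schema idea (hypothetical table and column names, not the project's actual model), a Gold fact table could aggregate Silver opportunities alongside a small account dimension:

-- Gold dimension: one row per account (illustrative)
CREATE OR REPLACE TABLE salesforce.gold.dim_account AS
SELECT Id AS account_id, Name AS account_name
FROM salesforce.silver.accounts;

-- Gold fact: opportunity metrics per account and close date (illustrative)
CREATE OR REPLACE TABLE salesforce.gold.fact_opportunities AS
SELECT
  AccountId   AS account_id,
  CloseDate   AS close_date,
  COUNT(*)    AS opportunity_count,
  SUM(Amount) AS total_amount
FROM salesforce.silver.opportunities
GROUP BY AccountId, CloseDate;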
