Salesforce Ingestion Pipeline (Lakehouse Architecture)

This project aims to show the benefits of a Lakehouse Architecture for data pipelines, using Databricks to build a pipeline that ingests data from Salesforce.

Flow

The ingestion pipeline extracts data from Salesforce into the bronze layer. An ETL pipeline then processes and transforms that data in the silver layer, and finally business logic is applied in the gold layer.

img

Orchestration

Orchestration is managed by Lakeflow Declarative Pipelines, which defines the ingestion and transformation steps as a unified DAG (Directed Acyclic Graph). The job uses a dependency-based scheduling model, currently running daily, ensuring the ETL starts immediately upon successful Bronze loading.

img

The main problems this project solves:

Storage and Governance Duplication

  • Traditional architectures require separate systems for Data Lake (e.g., S3/ADLS) and Data Warehouse (e.g., Redshift/BigQuery), leading to data redundancy and complex synchronization. A unified Lakehouse Architecture using Unity Catalog eliminates this duplication.

ETL and Orchestration Complexity

  • Legacy pipelines require complex setup and management of separate ETL tools and orchestrators (Airflow / Fivetran / dbt).
  • We leverage Lakeflow Declarative Pipelines to define the entire flow (Bronze → Silver → Gold) in a single, governed environment, simplifying scheduling and maintenance.

A unified Lakehouse addresses both problems: connections, storage, orchestration, and governance are all managed in one place.
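
To make the idea concrete, here is a minimal sketch of a declarative Bronze → Silver → Gold flow in Lakeflow (Delta Live Tables) SQL. The table and column names are illustrative, not this project's actual definitions:

-- Silver: incremental read from a Bronze streaming table (illustrative names)
CREATE OR REFRESH STREAMING TABLE silver_accounts AS
SELECT Id, Name, Industry
FROM STREAM(salesforce.bronze.accounts);

-- Gold: business-level aggregate, recomputed through the pipeline's dependency graph
CREATE OR REFRESH MATERIALIZED VIEW gold_accounts_by_industry AS
SELECT Industry, COUNT(*) AS account_count
FROM silver_accounts
GROUP BY Industry;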


Setting up the Ingestion Pipeline | UI and Asset Bundles

Setup with Databricks Asset Bundles (DAB)

  1. Install the Databricks CLI.
  2. Install the Databricks VSCode extension (not mandatory, but very useful).
  3. Clone this repo: git clone https://github.com/fran-cornachione/salesforce_ingestion_pipeline_databricks
  4. Create a .env file and add the following variables:
  EMAIL_RECIPIENTS = [Users on the workspace that will receive an alert whenever the pipeline fails]
  CATALOG = [The catalog name for the project]
  SCHEMA = [The name of the ingestion / bronze schema]
  SILVER_SCHEMA = [The name of the silver schema]
  WAREHOUSE_ID = [Your warehouse ID]

These variables are referenced in the project as ${env.VARIABLE_NAME}
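
For reference, a .env file following the list above could look like this (placeholder values only; replace them with your own workspace settings):

# Placeholder values for illustration only
EMAIL_RECIPIENTS=data-team@example.com
CATALOG=salesforce
SCHEMA=bronze
SILVER_SCHEMA=silver
WAREHOUSE_ID=1234567890abcdef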

Databricks Web UI Setup

1. Connection

First, a Salesforce connection must be created:

Catalog → External Data → Connections → Create Connection

  • Connection Name: The name of the connection (e.g., Salesforce, salesforce_conn)
  • Connection Type: Salesforce

Then, authenticate with a Salesforce account.

2. Ingestion Setup

Now, we have to create the Ingestion Pipeline:

Jobs & Pipelines → Ingestion Pipeline → Databricks Connectors → Salesforce

img

  • Pipeline Name: The name for the ingestion pipeline
  • Event Log Location: The event log contains audit logs, data quality checks, pipeline progress, and errors

img

3. Source

Here, we specify what data will be ingested from Salesforce:

There are many Salesforce tables to choose from; in my case I used Account, Opportunities, Contacts, and Leads.

4. Destination

Specify where to store the ingested data in Databricks.

In this case, I created a catalog and three schemas following the Medallion Architecture:

CREATE CATALOG salesforce;
CREATE SCHEMA salesforce.bronze;
CREATE SCHEMA salesforce.silver;
CREATE SCHEMA salesforce.gold;

5. Schedules and Notifications

In this step, we configure the pipeline schedules and notifications.

Schedules

In my case, I scheduled the pipeline to run every day at 00:00 (daily). But more advanced logic can be applied:

  • CRON Syntax: We can use CRON if complex logic is needed, for example, running the pipeline daily, but only from Monday to Friday:
0 0 * * 1-5
  • Timezone: A timezone can be selected, for example, (UTC-06:00) Central Time (US and Canada)

Notifications

We can add multiple users to receive an email on pipeline failure or success.

img

The ingestion pipeline can also be configured with YAML; the pipelines and the job are defined as YAML files in the salesforce/resources folder.
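
As a rough sketch of what such resource files can look like, the snippet below follows the general shape of Databricks pipeline and job resources in a bundle. The resource names, the Salesforce object, and the exact keys shown here are illustrative and may differ from the actual files in salesforce/resources; variable interpolation follows this project's ${env.*} convention described above.

# Illustrative sketch only - not the repo's actual resource file
resources:
  pipelines:
    salesforce_ingestion:
      name: salesforce_ingestion
      catalog: ${env.CATALOG}
      schema: ${env.SCHEMA}
      ingestion_definition:
        connection_name: salesforce_conn
        objects:
          - table:
              source_table: Account
              destination_catalog: ${env.CATALOG}
              destination_schema: ${env.SCHEMA}

  jobs:
    salesforce_daily_job:
      name: salesforce_daily_job
      schedule:
        quartz_cron_expression: "0 0 0 * * ?"   # every day at 00:00
        timezone_id: "America/Chicago"          # Central Time (US and Canada)
      email_notifications:
        on_failure:
          - ${env.EMAIL_RECIPIENTS}
      tasks:
        - task_key: salesforce_ingestion_task
          pipeline_task:
            pipeline_id: ${resources.pipelines.salesforce_ingestion.id}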


When the pipeline is executed, we will see this:

img

Four streaming tables were created (one for each table in the ingestion stage).

Streaming tables are updated incrementally: they only process newly inserted or changed records (UPSERT). If 100 leads were added yesterday, only those 100 records are processed, not the entire table.

The pipeline interface (Lakeflow) provides a real-time health check for every run:

  • Upserted (Green): Only new or updated records from Salesforce are processed, ensuring efficient incremental loading.
  • Deleted (Orange): Reflects records removed from the source to keep the Lakehouse synchronized.
  • Dropped (Grey): A record is dropped when it doesn't pass a quality check (Expectation). For example, if we require an Email to create a Lead and it's missing, the pipeline drops that row automatically. This way, we guarantee that only high-quality data reaches the business (see the example below).
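
A minimal example of the kind of expectation that produces those dropped rows, written in Lakeflow (Delta Live Tables) SQL with illustrative table and constraint names:

-- Drop any Lead row that arrives without an Email (illustrative names)
CREATE OR REFRESH STREAMING TABLE silver_leads (
  CONSTRAINT email_present EXPECT (Email IS NOT NULL) ON VIOLATION DROP ROW
)
AS SELECT * FROM STREAM(salesforce.bronze.leads);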

ETL Pipeline

To transform the raw data, I built an ETL pipeline. The transformations are performed in the Silver layer by using CTAS (CREATE TABLE AS SELECT).

img

The DAG (Directed Acyclic Graph) shows the lineage of each view and its dependencies. This is another strength of the Medallion Architecture: Gold depends on Silver, and Silver depends on Bronze.

Silver Layer: Data Cleansing & Standardization

In this stage, raw data from Bronze is refined and cleaned; a simplified example follows the list below.

  • Selecting only necessary columns: columns that are not needed at this stage are excluded.
  • Normalizing names: (e.g., john doe → John Doe)
  • Rounding numeric values to two decimal places: (e.g., 34.987 → 34.99)
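
A simplified CTAS of the kind used here, with hypothetical column names, might look like this:

-- Silver contacts: keep only the needed columns, normalize names, round numeric values (illustrative columns)
CREATE OR REPLACE TABLE salesforce.silver.contacts AS
SELECT
  Id,
  initcap(Name)           AS Name,          -- john doe -> John Doe
  Email,
  round(AnnualRevenue, 2) AS AnnualRevenue  -- two decimal places
FROM salesforce.bronze.contacts;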

Gold Layer: Business Aggregations

In the Gold layer, I transformed the cleaned Silver data into a Star Schema designed for high-performance analytics / dashboards. The primary goal was to convert raw CRM records into meaningful business metrics and structured entities.
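
To illustrate the Star Schema idea (hypothetical table and column names, not the project's actual model), a Gold fact table could aggregate Silver opportunities alongside a small account dimension:

-- Gold dimension: one row per account (illustrative)
CREATE OR REPLACE TABLE salesforce.gold.dim_account AS
SELECT Id AS account_id, Name AS account_name
FROM salesforce.silver.accounts;

-- Gold fact: opportunity metrics per account and close date (illustrative)
CREATE OR REPLACE TABLE salesforce.gold.fact_opportunities AS
SELECT
  AccountId   AS account_id,
  CloseDate   AS close_date,
  COUNT(*)    AS opportunity_count,
  SUM(Amount) AS total_amount
FROM salesforce.silver.opportunities
GROUP BY AccountId, CloseDate;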
