
Commit 7e03252

feat: add code samples for dbt bigframes integration (#1898)
* feat: add code samples for dbt bigframes integration
* fix
* improve comments
* resolve the comments
* add section in readme
* fix
1 parent 07bce8e commit 7e03252

File tree

5 files changed: +239, -0 lines changed


samples/dbt/.dbt.yml

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
dbt_sample_project:
  outputs:
    dev: # The target environment name (e.g., dev, prod)
      compute_region: us-central1 # Region used for compute operations
      dataset: dbt_sample_dataset # BigQuery dataset where dbt will create models
      gcs_bucket: dbt_sample_bucket # GCS bucket to store output files
      location: US # BigQuery dataset location
      method: oauth # Authentication method
      priority: interactive # Job priority: "interactive" or "batch"
      project: bigframes-dev # GCP project ID
      threads: 1 # Number of threads dbt can use for running models in parallel
      type: bigquery # Specifies the dbt adapter
  target: dev # The default target environment
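
Once a profile like the one above is saved to `~/.dbt/profiles.yml`, the connection can typically be verified with dbt's standard `debug` command (a general dbt Core command, not part of this commit):

```bash
# Assumes the profile above has been copied to ~/.dbt/profiles.yml.
dbt debug
```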

samples/dbt/README.md

Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
# dbt BigFrames Integration

This repository provides simple examples of using **dbt Python models** with **BigQuery** in **BigFrames** mode.

It includes basic configurations and sample models to help you get started quickly in a typical dbt project.

## Highlights

- `profiles.yml`: configures your connection to BigQuery.
- `dbt_project.yml`: configures your dbt project, **dbt_sample_project**.
- `dbt_bigframes_code_sample_1.py`: an example that reads BigQuery data and performs a basic transformation.
- `dbt_bigframes_code_sample_2.py`: an example that builds an incremental model leveraging BigFrames UDF capabilities.

## Requirements

Before using this project, ensure you have:

- A [Google Cloud account](https://cloud.google.com/free?hl=en)
- A [dbt Cloud account](https://www.getdbt.com/signup) (if using dbt Cloud)
- Basic knowledge of Python and SQL
- Familiarity with dbt concepts and project structure

For more, see:

- https://docs.getdbt.com/guides/dbt-python-bigframes
- https://cloud.google.com/bigquery/docs/dataframes-dbt

## Run Locally

Follow these steps to run the Python models using dbt Core.

1. **Install the dbt BigQuery adapter:**

   ```bash
   pip install dbt-bigquery
   ```

2. **Initialize a dbt project (if not already done):**

   ```bash
   dbt init
   ```

   Follow the prompts to complete setup.

3. **Finish the configuration and add sample code:**

   - Edit `~/.dbt/profiles.yml` to finish the configuration.
   - Replace or add code samples in `.../models/example`.

4. **Run your dbt models:**

   To run all models:

   ```bash
   dbt run
   ```

   Or run a specific model:

   ```bash
   dbt run --select your_model_name
   ```
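
For example, with the sample models added here, running just the first model could look like this (a hypothetical invocation; the exact model names depend on what you place under `models/example`):

```bash
# Hypothetical invocation using the sample model names from this commit.
dbt run --select dbt_bigframes_code_sample_1

# Running both samples together also works; dbt orders them by dependency,
# since dbt_bigframes_code_sample_2 reads sample 1 via dbt.ref().
dbt run --select dbt_bigframes_code_sample_1 dbt_bigframes_code_sample_2
```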

dbt_project.yml

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
# Name your project! Project names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
name: 'dbt_sample_project'
version: '1.0.0'

# This setting configures which "profile" dbt uses for this project.
profile: 'dbt_sample_project'

# These configurations specify where dbt should look for different types of files.
# The `model-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

clean-targets: # directories to be removed by `dbt clean`
  - "target"
  - "dbt_packages"


# Configuring models
# Full documentation: https://docs.getdbt.com/docs/configuring-models

# In this example config, we tell dbt to build all models in the example/
# directory as views. These settings can be overridden in the individual model
# files using the `{{ config(...) }}` macro.
models:
  dbt_sample_project:
    # Optional: These settings (e.g., submission_method, notebook_template_id,
    # etc.) can also be defined directly in the Python model using dbt.config.
    submission_method: bigframes
    # Config indicated by + and applies to all files under models/example/
    example:
      +materialized: view

dbt_bigframes_code_sample_1.py

Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
# This example demonstrates one of the most general usages of transforming raw
# BigQuery data into a processed table using a dbt Python model with BigFrames.
# See more from: https://cloud.google.com/bigquery/docs/dataframes-dbt.
#
# Key defaults when using BigFrames in a dbt Python model for BigQuery:
# - The default materialization is 'table' unless specified otherwise. This
#   means dbt will create a new BigQuery table from the result of this model.
# - The default timeout for the job is 3600 seconds (60 minutes). This can be
#   adjusted if your processing requires more time.
# - If no runtime template is provided, dbt will automatically create and reuse
#   a default one for executing the Python code in BigQuery.
#
# BigFrames provides a pandas-like API for BigQuery data, enabling familiar
# data manipulation directly within your dbt project. This code sample
# illustrates a basic pattern for:
# 1. Reading data from an existing BigQuery dataset.
# 2. Processing it using pandas-like DataFrame operations powered by BigFrames.
# 3. Outputting a cleaned and transformed table, managed by dbt.


def model(dbt, session):
    # Optional: Override settings from your dbt_project.yml file.
    # When both are set, dbt.config takes precedence over dbt_project.yml.
    #
    # Use `dbt.config(submission_method="bigframes")` to tell dbt to execute
    # this Python model using BigQuery DataFrames (BigFrames). This allows you
    # to write pandas-like code that operates directly on BigQuery data
    # without needing to pull all data into memory.
    dbt.config(submission_method="bigframes")

    # Define the BigQuery table path from which to read data.
    table = "bigquery-public-data.epa_historical_air_quality.temperature_hourly_summary"

    # Define the specific columns to select from the BigQuery table.
    columns = ["state_name", "county_name", "date_local", "time_local", "sample_measurement"]

    # Read data from the specified BigQuery table into a BigFrames DataFrame.
    df = session.read_gbq(table, columns=columns)

    # Sort the DataFrame by the specified columns. This prepares the data for
    # `drop_duplicates` to ensure consistent duplicate removal.
    df = df.sort_values(columns).drop_duplicates(columns)

    # Group the DataFrame by 'state_name', 'county_name', and 'date_local'. For
    # each group, calculate the minimum and maximum of the 'sample_measurement'
    # column. The result will be a BigFrames DataFrame with a MultiIndex.
    result = df.groupby(["state_name", "county_name", "date_local"])["sample_measurement"]\
        .agg(["min", "max"])

    # Rename some columns and convert the MultiIndex of the 'result' DataFrame
    # into regular columns. This flattens the DataFrame so 'state_name',
    # 'county_name', and 'date_local' become regular columns again.
    result = result.rename(columns={'min': 'min_temperature', 'max': 'max_temperature'})\
        .reset_index()

    # Return the processed BigFrames DataFrame.
    # In a dbt Python model, this DataFrame will be materialized as a table.
    return result
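
To spot-check the table this model produces, one option (not part of this commit) is dbt's `show` command, which previews a model's result set; this assumes dbt Core 1.5 or later and that the model has already been run:

```bash
# Assumes dbt Core >= 1.5; previews a few rows of the materialized model.
dbt show --select dbt_bigframes_code_sample_1 --limit 5
```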

dbt_bigframes_code_sample_2.py

Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
# This example demonstrates how to build an **incremental dbt Python model**
# using BigFrames.
#
# Incremental models are essential for efficiently processing large datasets by
# only transforming new or changed data, rather than reprocessing the entire
# dataset every time. If the target table already exists, dbt will perform a
# merge based on the specified unique keys; otherwise, it will create a new
# table automatically.
#
# This model also showcases the definition and application of a **BigFrames
# User-Defined Function (UDF)** to add a descriptive summary column based on
# temperature data. BigFrames UDFs allow you to execute custom Python logic
# directly within BigQuery, leveraging BigQuery's scalability.

import bigframes.pandas as bpd

def model(dbt, session):
    # Optional: override settings from dbt_project.yml.
    # When both are set, dbt.config takes precedence over dbt_project.yml.
    dbt.config(
        # Use BigFrames mode to execute this Python model. This enables
        # pandas-like operations directly on BigQuery data.
        submission_method="bigframes",
        # Materialize this model as an 'incremental' table. This tells dbt to
        # only process new or updated data on subsequent runs.
        materialized='incremental',
        # Use the MERGE strategy to update rows during incremental runs.
        incremental_strategy='merge',
        # Define the composite key that uniquely identifies a row in the
        # target table. This key is used by the 'merge' strategy to match
        # existing rows for updates during incremental runs.
        unique_key=["state_name", "county_name", "date_local"],
    )

    # Reference an upstream dbt model or an existing BigQuery table as a
    # BigFrames DataFrame. This allows you to seamlessly use the output of
    # another dbt model as input to this one.
    df = dbt.ref("dbt_bigframes_code_sample_1")

    # Define a BigFrames UDF to generate a temperature description.
    # BigFrames UDFs allow you to define custom Python logic that executes
    # directly within BigQuery. This is powerful for complex transformations.
    @bpd.udf(dataset='dbt_sample_dataset', name='describe_udf')
    def describe(
        max_temperature: float,
        min_temperature: float,
    ) -> str:
        is_hot = max_temperature > 85.0
        is_cold = min_temperature < 50.0

        if is_hot and is_cold:
            return "Expect both hot and cold conditions today."
        if is_hot:
            return "Overall, it's a hot day."
        if is_cold:
            return "Overall, it's a cold day."
        return "Comfortable throughout the day."

    # Apply the UDF using `combine` and store the result in a new column
    # "describe".
    df["describe"] = df["max_temperature"].combine(df["min_temperature"], describe)

    # Return the transformed BigFrames DataFrame.
    # This DataFrame will be the final output of your incremental dbt model.
    # On subsequent runs, only new or changed rows will be processed and merged
    # into the target BigQuery table based on the `unique_key`.
    return df
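
Since this model is incremental, repeated runs merge new or changed rows on the configured `unique_key`. A full rebuild is still possible with dbt's standard `--full-refresh` flag; the commands below are typical usage, not part of this commit:

```bash
# Incremental run: merges changed rows into the existing table.
dbt run --select dbt_bigframes_code_sample_2

# Rebuild the table from scratch, bypassing the incremental merge logic.
dbt run --select dbt_bigframes_code_sample_2 --full-refresh
```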
