This guide outlines the steps to create the necessary BigQuery datasets and tables for storing CO2 emissions data.
First, create two datasets in BigQuery: one for the staging tables and one for the fact tables.
(i) staging dataset:
CREATE SCHEMA IF NOT EXISTS staging
OPTIONS (
location = "US"
);

(ii) fact dataset:
CREATE SCHEMA IF NOT EXISTS fact
OPTIONS (
location = "US"
);

You could also create the datasets manually via the BigQuery UI.
- The staging table will temporarily store the processed data before moving it to the fact table:
CREATE TABLE IF NOT EXISTS staging.co2_emissions (
make STRING NOT NULL,
model STRING NOT NULL,
vehicle_class STRING NOT NULL,
engine_size FLOAT64 NOT NULL,
cylinders INT64 NOT NULL,
transmission STRING NOT NULL,
fuel_type STRING NOT NULL,
fuel_consumption_city FLOAT64 NOT NULL,
fuel_consumption_hwy FLOAT64 NOT NULL,
fuel_consumption_comb_lkm FLOAT64 NOT NULL,
fuel_consumption_comb_mpg INT64 NOT NULL,
co2_emissions INT64 NOT NULL
);
- The fact table will store the final, clean data:
CREATE TABLE IF NOT EXISTS fact.co2_emissions (
make STRING NOT NULL,
model STRING NOT NULL,
vehicle_class STRING NOT NULL,
engine_size FLOAT64 NOT NULL,
cylinders INT64 NOT NULL,
transmission STRING NOT NULL,
fuel_type STRING NOT NULL,
fuel_consumption_city FLOAT64 NOT NULL,
fuel_consumption_hwy FLOAT64 NOT NULL,
fuel_consumption_comb_lkm FLOAT64 NOT NULL,
fuel_consumption_comb_mpg INT64 NOT NULL,
co2_emissions INT64 NOT NULL
);
Both tables share an identical schema with the following columns:
make, model, vehicle_class, engine_size, cylinders, transmission, fuel_type, fuel_consumption_city, fuel_consumption_hwy, fuel_consumption_comb_lkm, fuel_consumption_comb_mpg, co2_emissions
Refer to the Data Dictionary for full schema details.
- Processed data from Dataproc is first loaded into the staging table
- Data is then merged into the fact table with an upsert (MERGE) operation, orchestrated by Google Cloud Composer
- The staging table serves as a temporary landing zone where data quality can be checked before the final load; because BigQuery does not enforce unique key constraints, merging from staging prevents duplicate rows from being inserted into the fact table
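The upsert described above can be sketched as a BigQuery MERGE statement. This is a minimal example, not the exact production query: the matching key (make, model, vehicle_class, transmission, fuel_type) is an assumption based on the schema, since the source does not specify which columns uniquely identify a row.

MERGE fact.co2_emissions AS f
USING staging.co2_emissions AS s
-- Assumed matching key; adjust to the actual unique key for the dataset
ON f.make = s.make
  AND f.model = s.model
  AND f.vehicle_class = s.vehicle_class
  AND f.transmission = s.transmission
  AND f.fuel_type = s.fuel_type
WHEN MATCHED THEN
  -- Row already exists in the fact table: refresh the measure columns
  UPDATE SET
    engine_size = s.engine_size,
    cylinders = s.cylinders,
    fuel_consumption_city = s.fuel_consumption_city,
    fuel_consumption_hwy = s.fuel_consumption_hwy,
    fuel_consumption_comb_lkm = s.fuel_consumption_comb_lkm,
    fuel_consumption_comb_mpg = s.fuel_consumption_comb_mpg,
    co2_emissions = s.co2_emissions
WHEN NOT MATCHED THEN
  -- New row: insert it into the fact table
  INSERT (make, model, vehicle_class, engine_size, cylinders, transmission,
          fuel_type, fuel_consumption_city, fuel_consumption_hwy,
          fuel_consumption_comb_lkm, fuel_consumption_comb_mpg, co2_emissions)
  VALUES (s.make, s.model, s.vehicle_class, s.engine_size, s.cylinders,
          s.transmission, s.fuel_type, s.fuel_consumption_city,
          s.fuel_consumption_hwy, s.fuel_consumption_comb_lkm,
          s.fuel_consumption_comb_mpg, s.co2_emissions);

After the merge succeeds, the staging table can be truncated so the next pipeline run starts from an empty landing zone.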
