Chinook Sales Simulator


A synthetic sales data generator for the Chinook database, designed to simulate realistic daily transactions for data engineering practice and portfolio projects.

This project is fully cross-platform and will run on Windows, macOS, and Linux.


What is This Project?

This project simulates a real-world online sales system that generates synthetic transaction data for practice and learning.

Use Case

Perfect for building data engineering portfolios without needing access to real production databases. Great for:

  • Learning data pipeline development (ETL/ELT)
  • Testing data warehouse architectures
  • Practicing with realistic transactional data
  • Building end-to-end data engineering projects

Key Concept: D-1 Batch Processing

D-1 means "Day minus 1" — processing yesterday's data today.

This is a common pattern in production data engineering:

  • Today is November 1st → Script generates sales for October 31st
  • Today is November 2nd → Script generates sales for November 1st

Why D-1? In real companies, sales data is typically processed in overnight batches. This simulator replicates that realistic workflow, making your portfolio projects more authentic.
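The D-1 date is simply yesterday relative to the run date. As a quick illustration in Python (the simulator's own date handling may differ):

```python
from datetime import date, timedelta

def d1_date(run_date: date) -> date:
    """Return the D-1 processing date: the day before the run date."""
    return run_date - timedelta(days=1)

# Running on November 1st generates sales for October 31st
print(d1_date(date(2025, 11, 1)))  # 2025-10-31
```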

How It Fits Into a Data Pipeline

┌─────────────────┐         ┌──────────────┐         ┌─────────────────┐
│  This Simulator │  D-1    │  PostgreSQL  │  Batch  │   Data Pipeline │
│    (Python)     │─Sales──>│    (Neon)    │─Extract>│   (Databricks,  │
│                 │         │              │         │   Airflow, etc) │
└─────────────────┘         └──────────────┘         └─────────────────┘
                                                               │
                                                               ▼
                                                      ┌─────────────────┐
                                                      │   Data Lakehouse│
                                                      │  (Bronze/Silver/│
                                                      │       Gold)     │
                                                      └─────────────────┘

This project is the data source — the starting point for building complete data engineering solutions.


Technical Features

Data Generation

  • Configurable batch operations: Generates user-specified volumes of inserts, updates, and deletes
  • Realistic transaction simulation:
    • Inserts: Multi-item sales (1-5 tracks) with random customer and product selection, timestamped in D-1
    • Updates: Adds items to random invoices from the last 90 days and recalculates totals
    • Deletes: Complete invoice cancellations from the last 90 days with cascading line-item removal
  • 90-day modification window: Updates and deletes operate only on recent invoices, simulating real-world return and cancellation policies
  • Temporal distribution: New sales receive random timestamps across the full 24-hour D-1 period, maintaining realistic data patterns
  • Late-arriving changes: Historical modifications within the 90-day window create realistic data pipeline challenges
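The temporal distribution described above can be sketched as follows. This is an illustrative snippet only; in the project itself the timestamping happens inside the PL/pgSQL functions:

```python
import random
from datetime import date, datetime, time, timedelta

def random_d1_timestamp(run_date: date) -> datetime:
    """Pick a uniformly random second within the D-1 24-hour window."""
    d1 = run_date - timedelta(days=1)
    offset = timedelta(seconds=random.randrange(24 * 60 * 60))  # 0..86399 seconds
    return datetime.combine(d1, time.min) + offset

ts = random_d1_timestamp(date(2025, 11, 1))  # some moment on 2025-10-31
```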

Database Architecture

  • ACID compliance: All operations execute within a single atomic transaction (commit or rollback)
  • Concurrent-safe ID generation: PostgreSQL SEQUENCE objects prevent ID collisions in multi-process scenarios
  • Referential integrity: Foreign key constraints enforced across Invoice, InvoiceLine, Customer, and Track tables
  • Idempotent setup: Database initialization scripts safely re-runnable without data corruption

Engineering Practices

  • Secure credential management: Database connections via neonctl CLI, zero hardcoded passwords
  • Structured logging: Configurable log levels (DEBUG/INFO/WARNING/ERROR) with ISO timestamps
  • Database state validation: Pre-execution checks verify existence of required functions and sequences
  • Performance optimization: Batch processing with progress tracking and summary statistics
  • Cross-platform compatibility: Runs on Windows, macOS, and Linux without modifications
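The structured log format visible in the example output later in this README (timestamp - logger name - level - message) can be reproduced with Python's standard logging module. The project's actual logger configuration may differ; this is a sketch:

```python
import logging

# Formatter matching the "timestamp - name - level - message" pattern
fmt = logging.Formatter(
    "%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)

handler = logging.StreamHandler()
handler.setFormatter(fmt)
logger = logging.getLogger("d1_sales_simulator")
logger.addHandler(handler)
logger.setLevel(logging.INFO)  # configurable, e.g. via a LOG_LEVEL env var

logger.info("Database connection established")
```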

Prerequisites - First Time Setup

If this is your first time, you'll need to install some tools and set up your Neon database. Follow the sections below before running the simulator.

Required Tools

1. Python 3.11+

What it is: The programming language this project is written in.

Check if installed:

python --version  # or python3 --version

Install if needed: python.org/downloads

2. uv (Python Package Manager)

What it is: A modern, fast Python package manager (faster and more reliable than pip).

Why uv instead of pip?: Better dependency resolution, faster installs, and built-in virtual environment management.

Install:

# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows (PowerShell)
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

Verify:

uv --version

Learn more: docs.astral.sh/uv

3. Node.js 18+ (Required for neonctl)

What it is: JavaScript runtime needed to run the Neon CLI tool.

Check if installed:

node --version

Install if needed: nodejs.org (Download LTS version)

4. Neon CLI (neonctl)

What it is: Command-line tool to interact with your Neon database from the terminal.

Install via npm:

npm i -g neonctl

Verify:

neonctl --version

Learn more: Neon CLI Docs

5. PostgreSQL Client (psql)

What it is: Command-line tool to connect to PostgreSQL databases (needed to import Chinook data).

Check if installed:

psql --version

Install if needed:

  • macOS: brew install postgresql (or download from postgresql.org)
  • Linux: sudo apt install postgresql-client (Debian/Ubuntu) or equivalent
  • Windows: Download from postgresql.org

Note: You only need the client (psql), not the full PostgreSQL server.


Setting Up Your Neon Database

If you already have a Neon account with the Chinook database, skip to the Setup section below.

Step 1: Create a Neon Account and Project

  1. Go to neon.tech and click Sign Up (free tier available)
  2. Sign up with GitHub, Google, or email
  3. Once logged in, click New Project
  4. Configure your project:
    • Name: "chinook-simulator" (or any name you prefer)
    • Region: Choose the one closest to you
    • PostgreSQL version: 16 (recommended, or latest available)
  5. Click Create Project

Your first project is created with a default database called neondb.

Step 2: Choose Your Database Name

Important decision: You need to choose a name for the database where Chinook data will be stored.

Recommended name: chinook_db (we'll use this throughout the guide)

Why it matters: You'll need to use this EXACT name in:

  • The database creation command below
  • The data import command
  • Your .env configuration file (later in Setup)

Note: You can choose any name you prefer (e.g., chinook, music_store, sales_db), just be consistent everywhere.

Step 3: Import the Chinook Sample Database

The Chinook database contains realistic data for a digital media store (artists, albums, tracks, customers, invoices).

Get Your Connection String

  1. In your Neon project dashboard, click the Connect button (top right)
  2. Copy the connection string (it looks like: postgresql://user:password@host/neondb)
  3. Keep this handy — you'll need it for the next commands

Understanding Connection Strings (Important!)

Your Neon connection string has this format:

postgresql://username:password@ep-cool-name-123456.us-east-2.aws.neon.tech/neondb
                                                                           ^^^^^^^
                                                                      database name

To connect to a different database, replace the database name at the end:

postgresql://username:password@ep-cool-name-123456.us-east-2.aws.neon.tech/chinook_db
                                                                           ^^^^^^^^^^
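Swapping the database name is just a matter of replacing everything after the last slash. A small helper (illustrative only; note it assumes the string has no trailing query parameters such as ?sslmode=require):

```python
def with_database(conn_str: str, db_name: str) -> str:
    """Replace the trailing database name in a postgresql:// connection string."""
    base, _, _ = conn_str.rpartition("/")  # everything before the last slash
    return f"{base}/{db_name}"

url = "postgresql://user:pw@ep-cool-name-123456.us-east-2.aws.neon.tech/neondb"
print(with_database(url, "chinook_db"))
# postgresql://user:pw@ep-cool-name-123456.us-east-2.aws.neon.tech/chinook_db
```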

Import via Command Line

# 1. Create the chinook_db database (using your original connection string with /neondb)
psql "postgresql://username:password@host.neon.tech/neondb" -c "CREATE DATABASE chinook_db;"

# 2. Download the official Chinook SQL file
wget https://raw.githubusercontent.com/neondatabase/postgres-sample-dbs/main/chinook.sql

# 3. Import the data (replace /neondb with /chinook_db in your connection string)
psql "postgresql://username:password@host.neon.tech/chinook_db" -f chinook.sql

# 4. Verify the import (should return 412)
psql "postgresql://username:password@host.neon.tech/chinook_db" -c 'SELECT COUNT(*) FROM "Invoice";'

Remember: Replace username:password@host.neon.tech with your actual Neon credentials.

Expected output: 412 (number of invoices in the sample data)

Troubleshooting:

  • If wget is not installed, download the file manually from the URL in step 2
  • On Windows without wget, use: curl -O https://raw.githubusercontent.com/neondatabase/postgres-sample-dbs/main/chinook.sql

Step 4: Get Your Neon Credentials

You'll need three values to configure this project:

  • NEON_ORG_ID: Your organization ID
  • NEON_PROJECT_ID: Your project ID
  • NEON_DATABASE: The database name you chose in Step 2 (e.g., chinook_db)

Finding Your Organization ID (NEON_ORG_ID)

Via Neon Console:

  1. In the Neon Console, click on your organization name (top-left corner)
  2. Go to Settings (in the organization menu)
  3. Look under General information
  4. Copy the Organization ID (format: org-word-word-12345678)

Via CLI (alternative method):

neonctl orgs list
# Your org ID will be shown in the output

Finding Your Project ID (NEON_PROJECT_ID)

Via Neon Console:

  1. Open your project in the Neon Console
  2. Click Settings (in the left sidebar)
  3. Under General, find and copy the Project ID

Via CLI (alternative method):

neonctl projects list
# Your project ID will be shown in the output

Save these IDs — you'll need them in the next step!


Setup

Prerequisites: Make sure you've completed the Prerequisites and Setting Up Your Neon Database sections above.

  1. Clone the repository:

    git clone https://github.com/victor-antoniassi/day-1_sales_data_generator.git
    cd day-1_sales_data_generator
  2. Create a .env file: Copy the example file and fill in your Neon credentials (from Step 4 above):

    cp .env.example .env
    # Windows (CMD) users: use 'copy' instead of 'cp'

    Edit .env and add the credentials you found earlier:

    NEON_ORG_ID=org-your-actual-org-id
    NEON_PROJECT_ID=your-actual-project-id
    NEON_DATABASE=chinook_db
    

    Critical: The NEON_DATABASE value MUST match the database name you created in Step 2. If you used a different name than chinook_db, use that name here.

    Reminder: See Finding Your Organization ID and Finding Your Project ID if you skipped those steps.

    Optional parameters you can configure:

    # Optional: specify database role (default: project default)
    # NEON_ROLE=your_role_name
    
    # Optional: specify branch (default: main)
    # NEON_BRANCH=main
    
    # Optional: logging level (default: INFO)
    # LOG_LEVEL=DEBUG
    
  3. Install dependencies:

    uv sync
  4. Set up the database: Run the automated setup command. This will:

    • Validate your configuration and test connectivity
    • Update historical invoice data to align with D-1 workflow
    • Create all required simulation functions (INSERT, UPDATE, DELETE) and sequences
    uv run src/main.py setup

    The setup is idempotent and safe to run multiple times.

Usage

All commands are run via the src/main.py script. The simulator supports generating a mix of inserts, updates, and deletes in a single execution.

The script prompts for three space-separated integers:

  1. Number of new sales (Inserts)
  2. Number of sale updates (Updates)
  3. Number of sale cancellations (Deletes)

Interactive Mode (Recommended)

Run the simulator and wait for it to prompt you for input.

uv run src/main.py simulate

You will see a prompt like this. Enter three numbers and press Enter:

Enter the number of INSERTS, UPDATES, and DELETES for D-1 (e.g., '100 5 2'): 100 5 2

Non-Interactive Mode

To pass the numbers directly (useful for automation), you can echo a string with the three values and pipe it to the command.

# Format: "INSERTS UPDATES DELETES"
echo "100 5 2" | uv run src/main.py simulate

Example Output

2025-11-12 10:30:00 - __main__ - INFO - === D-1 Sales Simulator Starting ===
Enter the number of INSERTS, UPDATES, and DELETES for D-1 (e.g., '100 5 2'): 100 5 2
2025-11-12 10:30:05 - d1_sales_simulator - INFO - Database connection established
2025-11-12 10:30:05 - d1_sales_simulator - INFO - Validating database state...
2025-11-12 10:30:05 - d1_sales_simulator - INFO - Database state validation successful
2025-11-12 10:30:05 - d1_sales_simulator - INFO - Starting batch operation in a SINGLE TRANSACTION...
2025-11-12 10:30:05 - d1_sales_simulator - INFO - Starting batch operation for D-1 (2025-11-11)
2025-11-12 10:30:05 - d1_sales_simulator - INFO - Requested: 100 Inserts, 5 Updates, 2 Deletes
2025-11-12 10:30:05 - d1_sales_simulator - INFO - Processing 2 DELETES...
2025-11-12 10:30:06 - d1_sales_simulator - INFO - Processing 5 UPDATES...
2025-11-12 10:30:07 - d1_sales_simulator - INFO - Processing 100 INSERTS...
2025-11-12 10:30:09 - d1_sales_simulator - INFO - Batch successfully prepared. Committing transaction...
2025-11-12 10:30:09 - d1_sales_simulator - INFO - SUCCESS: All operations committed to the database.
2025-11-12 10:30:09 - d1_sales_simulator - INFO - Summary: 100 new sales inserted, 5 updated, 2 deleted.
2025-11-12 10:30:09 - d1_sales_simulator - INFO - Total new revenue: $101.97
2025-11-12 10:30:09 - d1_sales_simulator - INFO - Average new sale: $1.02

Verifying Simulation Results

After running a simulation, a detailed log file in TOML format is created in the simulation_logs/ directory. This log contains a full record of every operation performed (inserts, updates, and deletes).

You can verify the integrity of the simulation by running the verify_simulation.py script. This script automatically:

  1. Finds the most recent simulation log.
  2. Reads the summary and every individual operation from the log file.
  3. Connects to the database and queries it to ensure that the state of the data perfectly matches what was recorded in the log.
  4. Reports on its findings, confirming that every insert, update, and delete was successfully and correctly applied.
uv run src/verify_simulation.py

Example Verification Output

2025-11-13 16:57:17 - __main__ - INFO - === Starting Log-Based Simulation Verification ===
2025-11-13 16:57:17 - __main__ - INFO - Found latest log file: simulation_2025-11-13_19-55-37.toml
2025-11-13 16:57:17 - __main__ - INFO - Verifying 208 operations for D-1: 2025-11-12
2025-11-13 16:57:17 - __main__ - INFO - --- Verifying 100 Inserts ---
2025-11-13 16:57:17 - __main__ - INFO - --- Verifying 42 Deletes ---
2025-11-13 16:57:17 - __main__ - INFO - --- Verifying 66 Updates ---
2025-11-13 16:57:17 - __main__ - INFO - --- Verification Summary ---
2025-11-13 16:57:17 - __main__ - INFO - ✅ SUCCESS: All 208 verified operations are consistent with the log.

Architecture & Implementation

Project Structure

day-1_sales_data_generator/
├── src/                              # Python application layer
│   ├── main.py                       # CLI interface and orchestration
│   ├── d1_sales_simulator.py         # Core simulation engine
│   └── verify_simulation.py          # Log-based data validation utility
├── sql/                              # Database schema layer
│   ├── simulation_functions.sql      # PL/pgSQL stored procedures (all DML)
│   └── update_historical_data.sql    # Historical data alignment
├── simulation_logs/                  # Output directory for TOML log files
├── .env.example                      # Configuration template
├── .env                              # Environment variables (gitignored)
├── pyproject.toml                    # Python dependencies & metadata
└── README.md                         # Documentation

Design Decisions

Separation of Concerns:

  • Python layer: Orchestration, CLI interface, connection management, and transaction control
  • PostgreSQL layer: Business logic encapsulated in stored procedures for performance and atomicity

Why PL/pgSQL Functions?

  • Performance: Minimizes network round-trips; all data manipulation occurs server-side
  • Atomicity: Database enforces transactional guarantees at the storage layer
  • Concurrent safety: SEQUENCE objects provide lock-free ID generation
  • Maintainability: Business logic versioned alongside schema in SQL files

Technology Choices:

  • PostgreSQL 16: Advanced ACID compliance, mature MVCC, robust constraint enforcement
  • Neon: Serverless PostgreSQL with connection pooling and auto-scaling capabilities
  • Python 3.11+: Type hints, improved error messages, performance enhancements
  • uv package manager: 10-100x faster than pip with deterministic dependency resolution

Glossary

Quick reference for technical terms used in this project:

Term Definition
D-1 "Day minus 1" — yesterday. In data engineering, processing yesterday's data today (common batch pattern).
OLTP Online Transaction Processing — database optimized for day-to-day operations (inserts, updates). Opposite of OLAP (analytics).
ETL/ELT Extract, Transform, Load / Extract, Load, Transform — processes for moving data between systems.
Synthetic Data Artificially generated data that mimics real data patterns, used for testing and development.
Idempotent Safe to run multiple times without unintended side effects. Running the same operation twice produces the same result.
Batch Processing Processing data in groups (batches) rather than one record at a time. Common in overnight data pipelines.
Sequence PostgreSQL feature for generating unique, sequential IDs safely in concurrent environments.
Lakehouse Modern data architecture combining data lake (storage) and data warehouse (analytics) features.
Medallion Architecture Data organization pattern with Bronze (raw), Silver (cleaned), and Gold (aggregated) layers.

Troubleshooting

"Function 'simulate_new_sale' not found" (or update/delete function)

  • Your database setup is likely incomplete or outdated. Run the automated setup again: uv run src/main.py setup

"neonctl not found"

  • Install Neon CLI: npm i -g neonctl
  • See Prerequisites section for details

"NEON_ORG_ID and NEON_PROJECT_ID must be set"

  • Your .env file is missing or incomplete. Copy .env.example to .env and fill in both values (see Step 4 of Setting Up Your Neon Database)

"psql: command not found"

  • Install PostgreSQL client (see Prerequisites)
  • You only need the client, not the full PostgreSQL server

Need to reset historical data?

  • The setup command is idempotent. If you need to re-align historical dates, you can run it again: uv run src/main.py setup

Want more detailed logs?

  • Add LOG_LEVEL=DEBUG to your .env file

Import failed with "database chinook_db already exists"

  • That's okay! The database was already created. Continue with the import, or skip to the verification step if you've already imported the data.

Verification shows wrong number of invoices

  • Expected: 412 invoices
  • If different, try dropping and recreating: psql "<connection-string>" -c "DROP DATABASE chinook_db;" then repeat Step 3 of Setting Up Your Neon Database

About

Turns the static Chinook DB into a living OLTP simulator. Generates daily INSERTs, UPDATEs (late-arriving data), and DELETEs (cancellations). The perfect chaotic data source for testing real-world data engineering pipelines (CDC, SCD Type 2).
