A synthetic sales data generator for the Chinook database, designed to simulate realistic daily transactions for data engineering practice and portfolio projects.
This project simulates a real-world online sales system, producing synthetic transaction data for practice and learning. It is fully cross-platform and runs on Windows, macOS, and Linux.
Perfect for building data engineering portfolios without needing access to real production databases. Great for:
- Learning data pipeline development (ETL/ELT)
- Testing data warehouse architectures
- Practicing with realistic transactional data
- Building end-to-end data engineering projects
D-1 means "Day minus 1" — processing yesterday's data today.
This is a common pattern in production data engineering:
- Today is November 1st → Script generates sales for October 31st
- Today is November 2nd → Script generates sales for November 1st
Why D-1? In real companies, sales data is typically processed in overnight batches. This simulator replicates that realistic workflow, making your portfolio projects more authentic.
```
┌─────────────────┐          ┌──────────────┐           ┌─────────────────┐
│  This Simulator │   D-1    │  PostgreSQL  │   Batch   │  Data Pipeline  │
│    (Python)     │──Sales──>│    (Neon)    │──Extract─>│  (Databricks,   │
│                 │          │              │           │   Airflow, etc) │
└─────────────────┘          └──────────────┘           └─────────────────┘
                                                                 │
                                                                 ▼
                                                        ┌─────────────────┐
                                                        │ Data Lakehouse  │
                                                        │ (Bronze/Silver/ │
                                                        │      Gold)      │
                                                        └─────────────────┘
```
This project is the data source — the starting point for building complete data engineering solutions.
- Configurable batch operations: Generates user-specified volumes of inserts, updates, and deletes
- Realistic transaction simulation:
- Inserts: Multi-item sales (1-5 tracks) with random customer and product selection, timestamped within D-1
- Updates: Adds items to random invoices from the last 90 days and recalculates invoice totals
- Deletes: Complete invoice cancellations from the last 90 days with cascading line-item removal
- 90-day modification window: Updates and deletes operate on recent invoices only, simulating real-world return and cancellation policies
- Temporal distribution: New sales receive random timestamps across the full 24-hour D-1 period, maintaining realistic data patterns
- Late-arriving changes: Historical modifications within the 90-day window create realistic data pipeline challenges
- ACID compliance: All operations execute within a single atomic transaction (commit or rollback)
- Concurrent-safe ID generation: PostgreSQL SEQUENCE objects prevent ID collisions in multi-process scenarios
- Referential integrity: Foreign key constraints enforced across Invoice, InvoiceLine, Customer, and Track tables
- Idempotent setup: Database initialization scripts safely re-runnable without data corruption
- Secure credential management: Database connections via `neonctl` CLI, zero hardcoded passwords
- Structured logging: Configurable log levels (DEBUG/INFO/WARNING/ERROR) with ISO timestamps
- Database state validation: Pre-execution checks verify existence of required functions and sequences
- Performance optimization: Batch processing with progress tracking and summary statistics
- Cross-platform compatibility: Runs on Windows, macOS, and Linux without modifications
If this is your first time, you'll need to install some tools and set up your Neon database. Follow the sections below before running the simulator.
What it is: The programming language this project is written in.
Check if installed:
```shell
python --version   # or python3 --version
```
Install if needed: python.org/downloads
What it is: A modern, fast Python package manager (faster and more reliable than pip).
Why uv instead of pip?: Better dependency resolution, faster installs, and built-in virtual environment management.
Install:
```shell
# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows (PowerShell)
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
```
Verify:
```shell
uv --version
```
Learn more: docs.astral.sh/uv
What it is: JavaScript runtime needed to run the Neon CLI tool.
Check if installed:
```shell
node --version
```
Install if needed: nodejs.org (Download LTS version)
What it is: Command-line tool to interact with your Neon database from the terminal.
Install via npm:
```shell
npm i -g neonctl
```
Verify:
```shell
neonctl --version
```
Learn more: Neon CLI Docs
What it is: Command-line tool to connect to PostgreSQL databases (needed to import Chinook data).
Check if installed:
```shell
psql --version
```
Install if needed:
- macOS: `brew install postgresql` (or download from postgresql.org)
- Linux: `sudo apt install postgresql-client` (Debian/Ubuntu) or equivalent
- Windows: Download from postgresql.org
Note: You only need the client (psql), not the full PostgreSQL server.
If you already have a Neon account with the Chinook database, skip to the Setup section below.
- Go to neon.tech and click Sign Up (free tier available)
- Sign up with GitHub, Google, or email
- Once logged in, click New Project
- Configure your project:
- Name: "chinook-simulator" (or any name you prefer)
- Region: Choose the one closest to you
- PostgreSQL version: 16 (recommended, or latest available)
- Click Create Project
Your first project is created with a default database called neondb.
Important decision: You need to choose a name for the database where Chinook data will be stored.
Recommended name: chinook_db (we'll use this throughout the guide)
Why it matters: You'll need to use this EXACT name in:
- The database creation command below
- The data import command
- Your `.env` configuration file (later in Setup)
Note: You can choose any name you prefer (e.g., chinook, music_store, sales_db), just be consistent everywhere.
The Chinook database contains realistic data for a digital media store (artists, albums, tracks, customers, invoices).
- In your Neon project dashboard, click the Connect button (top right)
- Copy the connection string (it looks like `postgresql://user:password@host/neondb`)
- Keep this handy — you'll need it for the next commands
Your Neon connection string has this format:
```
postgresql://username:password@ep-cool-name-123456.us-east-2.aws.neon.tech/neondb
                                                                           ^^^^^^
                                                                           database name
```
To connect to a different database, replace the database name at the end:
```
postgresql://username:password@ep-cool-name-123456.us-east-2.aws.neon.tech/chinook_db
                                                                           ^^^^^^^^^^
```
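If you want to script this swap rather than edit by hand, the database name is just the final path segment. A tiny illustrative helper (not part of this project; it assumes no query parameters such as `?sslmode=require` after the name):

```python
def with_database(conn_str: str, db_name: str) -> str:
    """Replace the trailing database name in a postgresql:// connection string."""
    base, _, _old = conn_str.rpartition("/")
    return f"{base}/{db_name}"

url = "postgresql://user:pw@ep-cool-name-123456.us-east-2.aws.neon.tech/neondb"
print(with_database(url, "chinook_db"))
# postgresql://user:pw@ep-cool-name-123456.us-east-2.aws.neon.tech/chinook_db
```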
```shell
# 1. Create the chinook_db database (using your original connection string with /neondb)
psql "postgresql://username:password@host.neon.tech/neondb" -c "CREATE DATABASE chinook_db;"

# 2. Download the official Chinook SQL file
wget https://raw.githubusercontent.com/neondatabase/postgres-sample-dbs/main/chinook.sql

# 3. Import the data (replace /neondb with /chinook_db in your connection string)
psql "postgresql://username:password@host.neon.tech/chinook_db" -f chinook.sql

# 4. Verify the import (should return 412)
psql "postgresql://username:password@host.neon.tech/chinook_db" -c 'SELECT COUNT(*) FROM "Invoice";'
```
Remember: Replace `username:password@host.neon.tech` with your actual Neon credentials.
Expected output: 412 (number of invoices in the sample data)
Troubleshooting:
- If `wget` is not installed, download the file manually from the URL in step 2
- On Windows without `wget`, use: `curl -O https://raw.githubusercontent.com/neondatabase/postgres-sample-dbs/main/chinook.sql`
You'll need three values to configure this project:
- NEON_ORG_ID: Your organization ID
- NEON_PROJECT_ID: Your project ID
- NEON_DATABASE: The database name you chose in Step 2 (e.g., `chinook_db`)
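At runtime these three values come from the environment. A minimal loading sketch (the fail-fast check mirrors the "must be set" error in Troubleshooting; the helper name is illustrative, not this project's actual code):

```python
import os

def load_neon_config() -> dict[str, str]:
    """Read the required Neon settings, failing fast if any are missing."""
    required = ("NEON_ORG_ID", "NEON_PROJECT_ID", "NEON_DATABASE")
    missing = [name for name in required if not os.getenv(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in required}
```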
Via Neon Console:
- In the Neon Console, click on your organization name (top-left corner)
- Go to Settings (in the organization menu)
- Look under General information
- Copy the Organization ID (format: `org-word-word-12345678`)
Via CLI (alternative method):
```shell
neonctl orgs list
# Your org ID will be shown in the output
```
Via Neon Console:
- Open your project in the Neon Console
- Click Settings (in the left sidebar)
- Under General, find and copy the Project ID
Via CLI (alternative method):
```shell
neonctl projects list
# Your project ID will be shown in the output
```
Save these IDs — you'll need them in the next step!
Prerequisites: Make sure you've completed the Prerequisites and Setting Up Your Neon Database sections above.
1. Clone the repository:

   ```shell
   git clone https://github.com/victor-antoniassi/day-1_sales_data_generator.git
   cd day-1_sales_data_generator
   ```

2. Create a `.env` file: Copy the example file and fill in your Neon credentials (from Step 4 above):

   ```shell
   cp .env.example .env
   # Windows (CMD) users: use 'copy' instead of 'cp'
   ```

   Edit `.env` and add the credentials you found earlier:

   ```
   NEON_ORG_ID=org-your-actual-org-id
   NEON_PROJECT_ID=your-actual-project-id
   NEON_DATABASE=chinook_db
   ```

   Critical: The `NEON_DATABASE` value MUST match the database name you created in Step 2. If you used a different name than `chinook_db`, use that name here.

   Reminder: See Finding Your Organization ID and Finding Your Project ID if you skipped those steps.

   Optional parameters you can configure:

   ```
   # Optional: specify database role (default: project default)
   # NEON_ROLE=your_role_name
   # Optional: specify branch (default: main)
   # NEON_BRANCH=main
   # Optional: logging level (default: INFO)
   # LOG_LEVEL=DEBUG
   ```

3. Install dependencies:

   ```shell
   uv sync
   ```

4. Set up the database: Run the automated setup command. This will:

   - Validate your configuration and test connectivity
   - Update historical invoice data to align with the D-1 workflow
   - Create all required simulation functions (`INSERT`, `UPDATE`, `DELETE`) and sequences

   ```shell
   uv run src/main.py setup
   ```
The setup is idempotent and safe to run multiple times.
All commands are run via the src/main.py script. The simulator supports generating a mix of inserts, updates, and deletes in a single execution.
The script prompts for three space-separated integers:
- Number of new sales (Inserts)
- Number of sale updates (Updates)
- Number of sale cancellations (Deletes)
Run the simulator and wait for it to prompt you for input.
```shell
uv run src/main.py simulate
```
You will see a prompt like this. Enter three numbers and press Enter:
```
Enter the number of INSERTS, UPDATES, and DELETES for D-1 (e.g., '100 5 2'): 100 5 2
```
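Parsing input of this shape takes only a few lines. A simplified sketch of the kind of validation involved (not necessarily the script's exact code):

```python
def parse_counts(raw: str) -> tuple[int, int, int]:
    """Parse 'INSERTS UPDATES DELETES' into three non-negative integers."""
    parts = raw.split()
    if len(parts) != 3:
        raise ValueError("Expected exactly three space-separated integers")
    inserts, updates, deletes = (int(p) for p in parts)  # ValueError on non-integers
    if min(inserts, updates, deletes) < 0:
        raise ValueError("Counts must be non-negative")
    return inserts, updates, deletes

print(parse_counts("100 5 2"))  # (100, 5, 2)
```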
To pass the numbers directly (useful for automation), you can echo a string with the three values and pipe it to the command.
```shell
# Format: "INSERTS UPDATES DELETES"
echo "100 5 2" | uv run src/main.py simulate
```

Example output:

```
2025-11-12 10:30:00 - __main__ - INFO - === D-1 Sales Simulator Starting ===
Enter the number of INSERTS, UPDATES, and DELETES for D-1 (e.g., '100 5 2'): 100 5 2
2025-11-12 10:30:05 - d1_sales_simulator - INFO - Database connection established
2025-11-12 10:30:05 - d1_sales_simulator - INFO - Validating database state...
2025-11-12 10:30:05 - d1_sales_simulator - INFO - Database state validation successful
2025-11-12 10:30:05 - d1_sales_simulator - INFO - Starting batch operation in a SINGLE TRANSACTION...
2025-11-12 10:30:05 - d1_sales_simulator - INFO - Starting batch operation for D-1 (2025-11-11)
2025-11-12 10:30:05 - d1_sales_simulator - INFO - Requested: 100 Inserts, 5 Updates, 2 Deletes
2025-11-12 10:30:05 - d1_sales_simulator - INFO - Processing 2 DELETES...
2025-11-12 10:30:06 - d1_sales_simulator - INFO - Processing 5 UPDATES...
2025-11-12 10:30:07 - d1_sales_simulator - INFO - Processing 100 INSERTS...
2025-11-12 10:30:09 - d1_sales_simulator - INFO - Batch successfully prepared. Committing transaction...
2025-11-12 10:30:09 - d1_sales_simulator - INFO - SUCCESS: All operations committed to the database.
2025-11-12 10:30:09 - d1_sales_simulator - INFO - Summary: 100 new sales inserted, 5 updated, 2 deleted.
2025-11-12 10:30:09 - d1_sales_simulator - INFO - Total new revenue: $101.97
2025-11-12 10:30:09 - d1_sales_simulator - INFO - Average new sale: $1.02
```
After running a simulation, a detailed log file in TOML format is created in the simulation_logs/ directory. This log contains a full record of every operation performed (inserts, updates, and deletes).
You can verify the integrity of the simulation by running the verify_simulation.py script. This script automatically:
- Finds the most recent simulation log.
- Reads the summary and every individual operation from the log file.
- Connects to the database and queries it to ensure that the state of the data perfectly matches what was recorded in the log.
- Reports on its findings, confirming that every insert, update, and delete was successfully and correctly applied.
```shell
uv run src/verify_simulation.py
```

Example output:

```
2025-11-13 16:57:17 - __main__ - INFO - === Starting Log-Based Simulation Verification ===
2025-11-13 16:57:17 - __main__ - INFO - Found latest log file: simulation_2025-11-13_19-55-37.toml
2025-11-13 16:57:17 - __main__ - INFO - Verifying 208 operations for D-1: 2025-11-12
2025-11-13 16:57:17 - __main__ - INFO - --- Verifying 100 Inserts ---
2025-11-13 16:57:17 - __main__ - INFO - --- Verifying 42 Deletes ---
2025-11-13 16:57:17 - __main__ - INFO - --- Verifying 66 Updates ---
2025-11-13 16:57:17 - __main__ - INFO - --- Verification Summary ---
2025-11-13 16:57:17 - __main__ - INFO - ✅ SUCCESS: All 208 verified operations are consistent with the log.
```
```
chinook_db/
├── src/                           # Python application layer
│   ├── main.py                    # CLI interface and orchestration
│   ├── d1_sales_simulator.py      # Core simulation engine
│   └── verify_simulation.py       # Log-based data validation utility
├── sql/                           # Database schema layer
│   ├── simulation_functions.sql   # PL/pgSQL stored procedures (all DML)
│   └── update_historical_data.sql # Historical data alignment
├── simulation_logs/               # Output directory for TOML log files
├── .env.example                   # Configuration template
├── .env                           # Environment variables (gitignored)
├── pyproject.toml                 # Python dependencies & metadata
└── README.md                      # Documentation
```
Separation of Concerns:
- Python layer: Orchestration, CLI interface, connection management, and transaction control
- PostgreSQL layer: Business logic encapsulated in stored procedures for performance and atomicity
Why PL/pgSQL Functions?
- Performance: Minimizes network round-trips; all data manipulation occurs server-side
- Atomicity: Database enforces transactional guarantees at the storage layer
- Concurrent safety: SEQUENCE objects provide lock-free ID generation
- Maintainability: Business logic versioned alongside schema in SQL files
Technology Choices:
- PostgreSQL 16: Advanced ACID compliance, mature MVCC, robust constraint enforcement
- Neon: Serverless PostgreSQL with connection pooling and auto-scaling capabilities
- Python 3.11+: Type hints, improved error messages, performance enhancements
- uv package manager: 10-100x faster than pip with deterministic dependency resolution
Quick reference for technical terms used in this project:
| Term | Definition |
|---|---|
| D-1 | "Day minus 1" — yesterday. In data engineering, processing yesterday's data today (common batch pattern). |
| OLTP | Online Transaction Processing — database optimized for day-to-day operations (inserts, updates). Opposite of OLAP (analytics). |
| ETL/ELT | Extract, Transform, Load / Extract, Load, Transform — processes for moving data between systems. |
| Synthetic Data | Artificially generated data that mimics real data patterns, used for testing and development. |
| Idempotent | Safe to run multiple times without unintended side effects. Running the same operation twice produces the same result. |
| Batch Processing | Processing data in groups (batches) rather than one record at a time. Common in overnight data pipelines. |
| Sequence | PostgreSQL feature for generating unique, sequential IDs safely in concurrent environments. |
| Lakehouse | Modern data architecture combining data lake (storage) and data warehouse (analytics) features. |
| Medallion Architecture | Data organization pattern with Bronze (raw), Silver (cleaned), and Gold (aggregated) layers. |
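The Idempotent entry is easiest to see in code: adding an element to a set is idempotent, while appending to a list is not:

```python
# Idempotent: running the operation twice leaves the same state
s: set[int] = set()
s.add(7)
s.add(7)
assert s == {7}

# Not idempotent: each re-run changes the state
lst: list[int] = []
lst.append(7)
lst.append(7)
assert lst == [7, 7]
```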
"Function 'simulate_new_sale' not found" (or update/delete function)
- Your database setup is likely incomplete or outdated. Run the automated setup again: `uv run src/main.py setup`
"neonctl not found"
- Install Neon CLI: `npm i -g neonctl`
- See Prerequisites section for details
"NEON_ORG_ID and NEON_PROJECT_ID must be set"
- Create a `.env` file from `.env.example` and add your credentials
- See the Setting Up Your Neon Database section for how to find these IDs
"psql: command not found"
- Install PostgreSQL client (see Prerequisites)
- You only need the client, not the full PostgreSQL server
Need to reset historical data?
- The setup command is idempotent. If you need to re-align historical dates, run it again: `uv run src/main.py setup`
Want more detailed logs?
- Add `LOG_LEVEL=DEBUG` to your `.env` file
Import failed with "database chinook_db already exists"
- That's okay! It means you already imported it. Skip to the verification step.
Verification shows wrong number of invoices
- Expected: 412 invoices
- If different, try dropping and recreating: `psql "<connection-string>" -c "DROP DATABASE chinook_db;"` then repeat Step 2 of Setting Up Your Neon Database