A complete business intelligence case study analyzing 100k+ orders from a Brazilian marketplace
This project simulates a real-world analytics engagement for Olist, a Brazilian e-commerce marketplace platform. Using ~100,000 real anonymized orders from 2016β2018, I conducted a comprehensive business analysis covering revenue, customer behavior, product performance, seller quality, and delivery logistics.
Key Achievement: Identified that 97% one-time buyer rate was caused by delivery delays β orders delivered 15+ days late received review scores 2.5 points lower (1.71 vs 4.22), directly driving customer churn. Proposed retention programs and seller quality SLAs projected to generate R$2.6M incremental annual GMV.
| Category | Key Findings |
|---|---|
| Revenue | 706% growth over 2 years, but MoM growth plateauing in 2018 at <5% |
| Customer Retention | 97% one-time buyers β entire growth funded by new acquisition |
| Delivery Performance | 92.1% on-time rate, but 26.4-day avg in Amazonas (2x platform avg) |
| Geographic Inequality | 66.5% of customers in 3 southeastern states; North/Northeast underserved |
| Product Categories | Top 5 categories generate 40% of GMV; 7 categories have >30% freight ratios |
| Seller Quality | Top 20% of sellers generate 80%+ of revenue (Pareto confirmed) |
| Tool | Purpose |
|---|---|
| PostgreSQL | Data warehouse, ETL, analytical views |
| SQL | 40+ business queries across 7 analysis domains |
| Power BI | 5-page interactive dashboard with drill-down capability |
| Excel | Data validation and exploration |
| Git/GitHub | Version control and documentation |
Ecommerce-Sales-Analysis/
β
βββ data/ # Raw source datasets (9 CSV files)
β βββ olist_orders_dataset.csv # Core orders fact table (99,441 rows)
β βββ olist_customers_dataset.csv # Customer dimension (99,441 rows)
β βββ olist_order_items_dataset.csv # Order line items (112,650 rows)
β βββ olist_order_payments_dataset.csv # Payment transactions (103,886 rows)
β βββ olist_order_reviews_dataset.csv # Customer reviews (104,164 rows)
β βββ olist_products_dataset.csv # Product catalogue (32,951 rows)
β βββ olist_sellers_dataset.csv # Seller dimension (3,095 rows)
β βββ olist_geolocation_dataset.csv # ZIP code coordinates (1,000,163 rows)
β βββ product_category_name_translation.csv # PT β EN category lookup (71 rows)
β βββ schema.sql # PostgreSQL table definitions
β
βββ sql/ # All analytical SQL queries
β βββ 01_data_validation.sql # Data quality audit (nulls, dupes, referential integrity)
β βββ 02_correction.sql # Data cleaning corrections
β βββ 02_master_view.sql # master_orders_view β 8-table analytical join
β βββ 03_revenue_analysis.sql # GMV trends, MoM growth, seasonality, freight ratios
β βββ 04_customer_analysis.sql # RFM segmentation, retention, LTV, geographic analysis
β βββ 05_product_analysis.sql # Category performance, quadrant matrix, freight risk
β βββ 06_seller_analysis.sql # Seller tiers, Pareto distribution, quality metrics
β βββ 07_delivery_logistics_analysis.sql # On-time %, delay impact on reviews, state-level SLA
β βββ exploratory_queries.sql # Ad-hoc exploration queries
β
βββ powerbi/ # Power BI dashboard assets
β βββ ecommerce_dashboard.pbit # Power BI template (5-page interactive dashboard)
β βββ Executive Dashboard (Olist Dataset).png
β βββ Revenue and Growth Dashboard (Olist Dataset).png
β βββ Customer Insights Dashboard (Olist Dataset).png
β βββ Product Performance Dashboard (Olist Dataset).png
β βββ Delivery and Logistics Dashboard (Olist Dataset).png
β
βββ excel/ # Exported datasets for Power BI data model
β βββ delivered_orders.csv # All delivered orders with enriched fields
β βββ rfm_customer_segments.csv # RFM scoring and segment labels
β βββ seller_tiers.csv # Seller tier classifications
β βββ category_quadrant.csv # Product category quadrant matrix
β βββ monthly_summary.csv # Monthly GMV and order volume summary
β βββ delivery_delay_impact.csv # Delay buckets vs. average review scores
β
βββ docs/ # Project documentation
β βββ executive_summary.md # C-suite 1-page findings summary
β βββ insights_log.md # Full domain-by-domain analytical findings
β βββ project_walkthrough.md # End-to-end methodology walkthrough
β βββ data_dictionary.md # Legacy data notes (see 02_Data_Understanding/)
β βββ business_questions.md # Original analysis scope definition
β
βββ Insights/ # Output exports and visual results
β βββ insights.md # Consolidated insight summaries
β βββ 05_product_analysis.result.csv # Product analysis query results
β βββ 06_seller_analysis.csv # Seller analysis query results
β βββ 07_delivery_logistics_analysis.csv # Logistics analysis query results
β βββ image.png # Supporting chart exports
β βββ image-1.png
β
βββ .gitignore # Git ignore rules
βββ README.md # You are here
- Validation: 7-query audit covering row counts, nulls, duplicates, referential integrity
- Cleaning: Handled 610 null product categories, standardized naming conventions
- Modeling: Built
master_orders_viewβ single analytical layer joining 8 tables
1. Revenue Analysis
Queries: 7 analytical queries
Key Metrics: GMV, MoM growth %, AOV, freight ratio
Findings:
- R$13.2M total GMV across 96,470 delivered orders
- November 2017 peak (Black Friday): R$987K GMV in a single month
- Freight costs rising from 15.3% (Q1 2017) to 18.0% (Q3 2018) β margin compression risk
- Growth rate flattening in 2018 (May: +0.4% MoM)
Business Impact: Identified unsustainable growth model reliant on acquisition, not retention
2. Customer Behavior Analysis
Queries: 7 analytical queries
Key Metrics: Repeat purchase rate, RFM segments, LTV distribution, geographic concentration
Findings:
- 97% one-time buyers β critical retention crisis
- Orders-to-customers ratio: 1.02:1 (effectively zero repeat purchasing)
- RFM analysis: Champions (25%) generate 51.3% of revenue
- 66.5% customer concentration in SP, RJ, MG (3 states)
Business Impact: Calculated that 3% β 6% repeat rate improvement = R$600K incremental GMV with zero acquisition cost
3. Product Performance Analysis
Queries: 7 analytical queries
Key Metrics: Category GMV, quadrant classification, satisfaction scores, freight ratios
Findings:
- Health & Beauty (9.3% of GMV) is platform anchor category
- 7 categories exceed 30% freight-to-price ratio (casa_conforto_2: 54%)
- Delivery time correlates directly with review scores (proven causal chain)
- Regional preferences: Northeast prefers Health & Beauty, Southeast prefers Bed Bath & Table
Business Impact: Identified R$150K annual losses in structurally unprofitable high-freight categories
4. Seller Performance Analysis
Queries: 7 analytical queries
Key Metrics: Seller tiers (Platinum/Gold/Silver/Bronze), Pareto distribution, geographic concentration
Findings:
- 4.8% of GMV comes from Bronze-tier sellers (review <3.5, late% >40%)
- Top 20% of sellers generate 80%+ of revenue (Pareto confirmed)
- Seller concentration mirrors customer concentration (both in SP/RJ/MG)
Business Impact: Bronze sellers represent reputation liability β implementing SLAs projected to improve platform review score by 0.2 points
5. Delivery & Logistics Analysis
Queries: 7 analytical queries
Key Metrics: On-time %, delivery days, delay impact, state-level performance
Findings:
- 92.1% on-time delivery (above 90% industry standard)
- Amazonas: 26.4 days avg delivery (2x platform average of 12.3 days)
- 2.5-point review score penalty for orders 15+ days late (1.71 vs 4.22)
- Same-state delivery: 9.2 days avg | Cross-state: 14.8 days avg
Business Impact: Proved delivery delay is root cause of retention crisis β 10% on-time improvement prevents ~7,600 negative reviews annually
High-level KPIs and trends β designed for C-suite consumption
Monthly GMV trends, MoM growth rates, and seasonality patterns
RFM segmentation, retention analysis, and geographic distribution
Category quadrant matrix, satisfaction rankings, and freight risk analysis
On-time performance, state-level analysis, and delay impact on reviews
| Priority | Recommendation | Estimated Impact | Complexity |
|---|---|---|---|
| P1 | Post-purchase retention program (email sequence) | +R$600K GMV, +2% repeat rate | Low |
| P2 | Recruit 200 sellers in North/Northeast states | +R$2M GMV, -3 days avg delivery | Medium |
| P3 | Implement seller performance SLAs | +0.2 review score | Low |
| P4 | Renegotiate carrier contracts | -3% freight ratio | High |
| P5 | Regional marketing campaigns by state | +15% campaign ROI | Low |
Total Projected Impact: +R$2.6M annual GMV, +2β3% repeat purchase rate, -3 days delivery time
Technical Skills:
- Advanced SQL: window functions (NTILE, RANK, LAG), CTEs, complex joins, aggregate filters
- Data modeling: star schema design, analytical view layers
- Power BI: DAX measures, relationship modeling, conditional formatting, interactive slicers
- Business intelligence: translating data into executive-ready insights
Business Skills:
- RFM customer segmentation framework
- Pareto analysis and concentration risk assessment
- Causal analysis: proving delivery β satisfaction β retention chain
- Executive communication: prioritizing recommendations by ROI
Author: Shivam Kothiyal
LinkedIn: https://www.linkedin.com/in/shivam-kothiyal-161b531b8/
Portfolio: https://github.com/shivamkothiyal17
Email: shivamkothiyal170@gmail.com
This project uses the publicly available Olist Brazilian E-Commerce Dataset from Kaggle (CC BY-NC-SA 4.0).
Dataset provided by Olist and made available on Kaggle. Analysis methodology inspired by real-world marketplace analytics workflows at companies like Amazon, Shopify, and Mercado Livre.
β If this project helped you, please star the repository!