diff --git a/docs/sql/SQL-Advance/CTE.md b/docs/sql/SQL-Advance/CTE.md new file mode 100644 index 00000000..1fc91ad9 --- /dev/null +++ b/docs/sql/SQL-Advance/CTE.md @@ -0,0 +1,593 @@ +--- +id: common-table-expressions +title: SQL Common Table Expressions (CTEs) +sidebar_label: CTEs (WITH Clause) +sidebar_position: 2 +tags: + [ + sql, + cte, + common table expressions, + with clause, + recursive cte, + sql tutorial, + subqueries, + database queries, + query optimization, + ] +description: Learn about SQL Common Table Expressions (CTEs), how to use the WITH clause, recursive CTEs, syntax, examples, and best practices for writing cleaner, more maintainable queries. +--- + +## What are Common Table Expressions (CTEs)? + +SQL **Common Table Expressions (CTEs)** are temporary named result sets that exist only during the execution of a single query. Defined using the `WITH` clause, CTEs make complex queries more readable and maintainable by breaking them into logical, named components that can be referenced multiple times within the main query. + +:::note +**Key Characteristics of CTEs:** + +- **Temporary & Named**: Creates a named result set that exists only for the query duration. + +- **Improved Readability**: Makes complex queries easier to understand and maintain. + +- **Reusable**: Can be referenced multiple times in the same query; whether the result is computed once or re-evaluated for each reference depends on the database. + +- **Recursive Capable**: Supports recursive queries for hierarchical data structures. + +- **No Storage Overhead**: Doesn't create physical tables, only logical references. 
+::: + + +:::success +**When to Use CTEs:** + +- **Complex Subqueries**: Replace nested subqueries with readable named expressions +- **Multiple References**: When you need to reference the same result set multiple times +- **Hierarchical Data**: Traverse organizational charts, category trees, bill of materials +- **Step-by-Step Logic**: Break down complex calculations into logical steps +- **Recursive Operations**: Process parent-child relationships of unknown depth + +**Real-World Example:** +Instead of writing deeply nested subqueries to calculate monthly sales rankings, use CTEs to first calculate monthly totals, then calculate rankings, then filter top performers - each step clearly named and easy to understand. +::: + +:::warning +**⚠️ Important Considerations:** + +- **Scope**: CTEs only exist within the statement where they're defined +- **Not Materialized**: Results aren't stored; may be recalculated if referenced multiple times +- **Database Support**: Supported in PostgreSQL, SQL Server, MySQL 8.0+, Oracle, DB2 +- **Performance**: May not always be faster than alternatives; test with actual data +- **Recursion Limits**: Recursive CTEs have depth limits (varies by database) +::: + +:::info + +## Basic CTE Syntax + +```sql +-- Single CTE +WITH cte_name AS ( + SELECT column1, column2, ... + FROM table_name + WHERE condition +) +SELECT * +FROM cte_name; +``` +```sql +-- Multiple CTEs +WITH +cte1 AS ( + SELECT ... FROM table1 +), +cte2 AS ( + SELECT ... FROM cte1 -- Can reference previous CTEs +), +cte3 AS ( + SELECT ... 
FROM table2 +) +SELECT * +FROM cte1 +JOIN cte2 ON cte1.id = cte2.id +JOIN cte3 ON cte2.id = cte3.id; +``` + +| **Component** | **Purpose** | **Example** | +|---------------|-------------|-------------| +| WITH | Starts CTE definition | `WITH sales_summary AS` | +| CTE Name | Names the temporary result set | `monthly_totals` | +| AS | Separates name from query | `AS (SELECT ...)` | +| SELECT | Defines the CTE query | `SELECT customer_id, SUM(amount)` | +| Main Query | Uses the CTE | `SELECT * FROM monthly_totals` | + +## CTE vs Subquery vs Temp Table + +| **Feature** | **CTE** | **Subquery** | **Temp Table** | +|-------------|---------|--------------|----------------| +| Readability | Excellent | Poor (nested) | Good | +| Reusability | Yes (in same query) | No | Yes (in session) | +| Performance | Good | Good | Varies | +| Recursion | Yes | No | No | +| Scope | Single statement | Single reference | Session | +| Storage | None | None | Physical | + +::: + +## Practical Examples + + + + ```sql + -- Get total spending per customer + -- Think of CTE as creating a summary table first, then using it + + WITH customer_totals AS ( + SELECT + customer_id, + SUM(total_amount) AS total_spent, + COUNT(*) AS order_count + FROM orders + GROUP BY customer_id + ) + SELECT + c.customer_name, + ct.total_spent, + ct.order_count + FROM customers c + JOIN customer_totals ct ON c.customer_id = ct.customer_id + WHERE ct.total_spent > 1000 + ORDER BY ct.total_spent DESC; + + -- Why use CTE here? + -- 1. Makes the query easier to read + -- 2. 
Separates the calculation from the final selection + ``` + + + ```sql + -- Calculate customer categories step by step + + WITH + -- Step 1: Get order totals for each customer + order_summary AS ( + SELECT + customer_id, + COUNT(*) AS total_orders, + SUM(total_amount) AS total_spent + FROM orders + GROUP BY customer_id + ), + -- Step 2: Categorize customers based on spending + customer_categories AS ( + SELECT + customer_id, + total_orders, + total_spent, + CASE + WHEN total_spent > 5000 THEN 'VIP' + WHEN total_spent > 1000 THEN 'Regular' + ELSE 'Occasional' + END AS category + FROM order_summary + ) + -- Step 3: Get the final result with customer names + SELECT + c.customer_name, + cc.category, + cc.total_orders, + cc.total_spent + FROM customers c + JOIN customer_categories cc ON c.customer_id = cc.customer_id + ORDER BY cc.total_spent DESC; + + -- This breaks down a complex query into simple, logical steps + ``` + + + ```sql + -- WITHOUT CTE (harder to read) + SELECT + c.customer_name, + (SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.customer_id) AS order_count, + (SELECT SUM(total_amount) FROM orders o WHERE o.customer_id = c.customer_id) AS total_spent + FROM customers c; + + -- WITH CTE (much clearer) + WITH customer_stats AS ( + SELECT + customer_id, + COUNT(*) AS order_count, + SUM(total_amount) AS total_spent + FROM orders + GROUP BY customer_id + ) + SELECT + c.customer_name, + cs.order_count, + cs.total_spent + FROM customers c + LEFT JOIN customer_stats cs ON c.customer_id = cs.customer_id; + + -- The CTE version is easier to understand and maintain + ``` + + + ```sql + -- Find employee and their manager chain + -- Recursive CTE keeps going until it reaches the top (CEO) + + WITH RECURSIVE employee_chain AS ( + -- Start with one employee + SELECT + employee_id, + employee_name, + manager_id, + 1 AS level + FROM employees + WHERE employee_id = 101 -- Start with employee 101 + + UNION ALL + + -- Keep finding their managers + SELECT + e.employee_id, 
+ e.employee_name, + e.manager_id, + ec.level + 1 + FROM employees e + JOIN employee_chain ec ON e.employee_id = ec.manager_id + ) + SELECT + employee_name, + level, + CASE WHEN level = 1 THEN 'You' + WHEN level = 2 THEN 'Your Manager' + WHEN level = 3 THEN 'Your Manager''s Manager' + ELSE 'Upper Management' + END AS relationship + FROM employee_chain + ORDER BY level; + + -- Shows the reporting chain: You -> Your Boss -> Their Boss -> etc. + ``` + + + ```sql + -- Generate a list of numbers from 1 to 12 + -- Useful for creating reports with all months, even if no data + + WITH RECURSIVE numbers AS ( + -- Start with 1 + SELECT 1 AS num + + UNION ALL + + -- Add 1 each time until we reach 12 + SELECT num + 1 + FROM numbers + WHERE num < 12 + ) + SELECT num AS month_number + FROM numbers; + + -- Result: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 + -- You can then join this with your sales data to show all months + ``` + + + ```sql + -- Find customers who haven't ordered in the last 30 days + -- (DATE_SUB is MySQL syntax; other databases use different date arithmetic) + + WITH recent_orders AS ( + SELECT DISTINCT customer_id + FROM orders + WHERE order_date >= DATE_SUB(CURRENT_DATE, INTERVAL 30 DAY) + ) + SELECT + c.customer_id, + c.customer_name, + c.email + FROM customers c + LEFT JOIN recent_orders ro ON c.customer_id = ro.customer_id + WHERE ro.customer_id IS NULL -- No recent orders + ORDER BY c.customer_name; + + -- Perfect for finding inactive customers for marketing campaigns + ``` + + + ```plaintext + -- Sample result for basic CTE example: + + customer_name | total_spent | order_count + -----------------|-------------|------------- + John Smith | 5,250.00 | 12 + Sarah Johnson | 3,890.50 | 8 + Mike Williams | 2,100.75 | 5 + Emily Davis | 1,450.00 | 3 + + -- Only customers who spent more than $1000 are shown + -- Data is sorted by total spending (highest first) + + + -- Sample result for recursive employee chain: + + employee_name | level | relationship + -----------------|-------|--------------------------- + Bob Smith | 1 | You + Alice Johnson | 2 | 
Your Manager + Carol White | 3 | Your Manager's Manager + David CEO | 4 | Upper Management + + -- Shows the complete reporting chain from employee to CEO + ``` + + + +## Advanced CTE Patterns + +:::tip +**Complex Scenarios:** + +1. **Running Totals with CTEs**: + ```sql + WITH daily_revenue AS ( + SELECT + DATE(order_date) AS order_day, + SUM(total_amount) AS daily_total + FROM orders + WHERE YEAR(order_date) = 2024 + GROUP BY DATE(order_date) + ) + SELECT + order_day, + daily_total, + SUM(daily_total) OVER (ORDER BY order_day) AS running_total, + AVG(daily_total) OVER ( + ORDER BY order_day + ROWS BETWEEN 6 PRECEDING AND CURRENT ROW + ) AS seven_day_avg + FROM daily_revenue + ORDER BY order_day; + ``` + +2. **CTEs with Window Functions**: + ```sql + WITH product_sales AS ( + SELECT + product_id, + category, + SUM(quantity) AS units_sold, + SUM(quantity * unit_price) AS revenue + FROM order_items + GROUP BY product_id, category + ) + SELECT + product_id, + category, + revenue, + RANK() OVER (PARTITION BY category ORDER BY revenue DESC) AS category_rank, + ROUND(revenue / SUM(revenue) OVER (PARTITION BY category) * 100, 2) AS category_percentage, + ROUND(revenue / SUM(revenue) OVER () * 100, 2) AS total_percentage + FROM product_sales + ORDER BY category, category_rank; + ``` + +3. **Chained CTEs for Complex Calculations**: + ```sql + WITH + base_metrics AS ( + SELECT product_id, SUM(revenue) AS total_revenue + FROM sales GROUP BY product_id + ), + growth_metrics AS ( + SELECT product_id, total_revenue, + LAG(total_revenue) OVER (ORDER BY product_id) AS prev_revenue + FROM base_metrics + ), + final_metrics AS ( + SELECT product_id, total_revenue, + (total_revenue - prev_revenue) / NULLIF(prev_revenue, 0) * 100 AS growth_rate + FROM growth_metrics + ) + SELECT * FROM final_metrics WHERE growth_rate > 10; + ``` +::: + +## Recursive CTE Deep Dive + +:::info +**Understanding Recursive CTEs:** + +Recursive CTEs have two parts: +1. 
**Anchor Member**: Initial query that doesn't reference the CTE +2. **Recursive Member**: Query that references the CTE itself + +```sql +WITH RECURSIVE cte_name AS ( + -- Anchor member (executed once) + SELECT initial_data + FROM base_table + WHERE starting_condition + + UNION ALL + + -- Recursive member (executed repeatedly) + SELECT next_data + FROM base_table + INNER JOIN cte_name ON join_condition + WHERE termination_condition +) +SELECT * FROM cte_name; +``` + +**Key Points:** +- Always include a termination condition to prevent infinite loops +- Use `UNION ALL` (not `UNION`) for better performance +- Most databases have maximum recursion depth limits +- Great for hierarchies, graphs, and tree structures +::: + +**Common Recursive Patterns:** + +1. **Bill of Materials (BOM)**: + ```sql + WITH RECURSIVE parts_explosion AS ( + -- Anchor: Top-level product + SELECT + product_id, + component_id, + quantity, + 1 AS level, + CAST(product_id AS VARCHAR(1000)) AS path + FROM product_components + WHERE product_id = 'BIKE-001' + + UNION ALL + + -- Recursive: Sub-components + SELECT + pc.product_id, + pc.component_id, + pe.quantity * pc.quantity, + pe.level + 1, + CONCAT(pe.path, ' > ', pc.product_id) + FROM product_components pc + INNER JOIN parts_explosion pe ON pc.product_id = pe.component_id + WHERE pe.level < 5 + ) + SELECT * FROM parts_explosion; + ``` + +2. 
**Category Tree Navigation**: + ```sql + WITH RECURSIVE category_tree AS ( + -- Root categories + SELECT + category_id, + category_name, + parent_category_id, + 1 AS depth, + category_name AS full_path + FROM categories + WHERE parent_category_id IS NULL + + UNION ALL + + -- Child categories + SELECT + c.category_id, + c.category_name, + c.parent_category_id, + ct.depth + 1, + CONCAT(ct.full_path, ' / ', c.category_name) + FROM categories c + INNER JOIN category_tree ct ON c.parent_category_id = ct.category_id + ) + SELECT * FROM category_tree ORDER BY full_path; + ``` + +## Performance & Optimization + +:::tip +**Performance Considerations:** + +1. **CTE Materialization**: Some databases materialize CTEs, others don't +2. **Multiple References**: Referencing a CTE multiple times may cause recalculation +3. **Indexing**: Ensure base tables have proper indexes +4. **Recursion Depth**: Deep recursion can be expensive +5. **Row Count**: Large CTEs can impact memory usage + +**Optimization Strategies:** +```sql +-- Add early filtering in CTEs +WITH filtered_orders AS ( + SELECT * + FROM orders + WHERE order_date >= '2024-01-01' -- Filter early + AND status = 'Completed' + -- This runs once, reducing data for subsequent operations +) +SELECT * FROM filtered_orders; +``` +```sql +-- Use indexes on CTE join columns +CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date); + +``` +::: + +## CTE vs Alternatives: When to Use What + +| **Scenario** | **Best Choice** | **Reason** | +|--------------|-----------------|------------| +| Complex multi-step logic | CTE | Readability and maintainability | +| Single use subquery | Subquery | Simpler, less overhead | +| Used across multiple queries | View | Reusable definition | +| Large intermediate results | Temp Table | Better performance with indexes | +| Hierarchical data | Recursive CTE | Only option for recursion | +| Simple filtering | WHERE clause | Direct and efficient | + +## Best Practices & Guidelines + 
+:::info +**DO's:** +- Use descriptive CTE names that explain the data they contain +- Break complex logic into multiple CTEs for clarity +- Add comments explaining business logic +- Filter data early in the CTE chain +- Use CTEs to replace deeply nested subqueries +- Test recursive CTEs with depth limits + +**DON'Ts:** +- Don't create overly complex CTE chains (keep it under 5-6 levels) +- Don't use CTEs when a simple subquery suffices +- Don't forget termination conditions in recursive CTEs +- Don't assume CTEs are always faster than alternatives +- Don't reference the same CTE dozens of times (consider temp tables) + +**Good Practice Example:** +```sql +-- Well-structured CTE with clear purpose and comments +WITH +-- Select the active customers registered since 2023 +active_customers AS ( + SELECT customer_id, customer_name, email + FROM customers + WHERE status = 'Active' + AND registration_date >= '2023-01-01' +), +-- Aggregate order data for these customers +customer_spending AS ( + SELECT + ac.customer_id, + ac.customer_name, + COUNT(o.order_id) AS order_count, + SUM(o.total_amount) AS total_spent + FROM active_customers ac + LEFT JOIN orders o + ON ac.customer_id = o.customer_id + AND o.order_date >= '2024-01-01' -- Filter in ON: a WHERE on o would silently turn the LEFT JOIN into an inner join + GROUP BY ac.customer_id, ac.customer_name +) +-- Final output with segmentation +SELECT + customer_name, + order_count, + total_spent, + CASE + WHEN total_spent > 5000 THEN 'Premium' + WHEN total_spent > 1000 THEN 'Standard' + ELSE 'Basic' + END AS segment +FROM customer_spending +WHERE order_count > 0 -- Drop customers with no orders since 2024 +ORDER BY total_spent DESC; +``` +::: + + +## Conclusion + +Common Table Expressions are one of the most powerful features in modern SQL for writing clean, maintainable queries. They shine when you need to break down complex logic into understandable steps, work with hierarchical data, or eliminate repetitive subqueries. 
While not always the fastest option, their benefits in code clarity and maintainability often outweigh minor performance differences. Master CTEs, and you'll find yourself writing better SQL that your future self (and colleagues) will thank you for. + + \ No newline at end of file diff --git a/docs/sql/SQL-Advance/assets/subqueries.gif b/docs/sql/SQL-Advance/assets/subqueries.gif new file mode 100644 index 00000000..16f8383a Binary files /dev/null and b/docs/sql/SQL-Advance/assets/subqueries.gif differ diff --git a/docs/sql/SQL-Advance/sql-indexes.md b/docs/sql/SQL-Advance/sql-indexes.md new file mode 100644 index 00000000..795705d2 --- /dev/null +++ b/docs/sql/SQL-Advance/sql-indexes.md @@ -0,0 +1,370 @@ +--- +id: sql-indexes +title: SQL Indexes - The Complete Guide +sidebar_label: Indexes +sidebar_position: 1 +tags: + [ + sql, + indexes, + database indexes, + performance, + query optimization, + b-tree, + clustered index, + non-clustered index, + composite index, + sql tutorial, + ] +description: Master SQL Indexes with practical examples. Learn when to create indexes, types of indexes, optimization strategies, and common pitfalls to avoid. +--- +Ever wondered why some SQL queries feel like they run in milliseconds while others take minutes on the same table? +The secret often lies in how well your database uses indexes. +Let’s explore how SQL indexes work and how you can use them to make your queries fly. +## What are SQL Indexes? + +SQL **Indexes** are special database structures that dramatically speed up data retrieval operations. Think of them like an index in a book - instead of reading every page to find information, you can jump directly to the right page. + +:::note +**Key Characteristics of Indexes:** + +- **Speed Up Queries**: Can make queries 10x, 100x, or even 1000x faster. + +- **Cost of Storage**: Require additional disk space to store the index structure. + +- **Write Overhead**: Slow down INSERT, UPDATE, and DELETE operations slightly. 
+ +- **Automatic Maintenance**: Database automatically updates indexes when data changes. +::: + + +:::success +**The Phone Book Analogy:** + +Imagine searching for "John Smith" in a phone book: + +**Without Index (Full Table Scan):** You'd have to read every single entry from page 1 to the end until you find John Smith. On a million-entry phone book, this is painfully slow. + +**With Index:** The phone book is already sorted alphabetically (that's an index!). You can jump directly to the "S" section and find John Smith in seconds. + +That's exactly what database indexes do - they organize data in a way that makes searches lightning-fast. + +**Real-World Impact:** +A query that takes 30 seconds on a million-row table without an index might complete in 0.01 seconds with the right index. That's a 3000x performance improvement! +::: + +:::info + +## How Indexes Work Under the Hood + +```sql +-- Without index: Database scans every row +SELECT * FROM employees WHERE employee_id = 12345; +-- Scans: 1, 2, 3, 4, ... 12345 (sequential search) + +-- With index: Database jumps directly to the row +SELECT * FROM employees WHERE employee_id = 12345; +-- Jump directly to: 12345 (index lookup) +``` + +**Index Structure (Simplified B-Tree):** +``` + [50] + / \ + [25] [75] + / \ / \ + [10] [40] [60] [90] +``` + +The database traverses this tree structure to find values quickly. For a million rows, a binary tree like the one sketched here needs about 20 comparisons (log2 of 1,000,000), and a real B-tree, whose nodes each hold many keys, usually needs to read only a few nodes instead of scanning a million rows! 
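+
+As a rough sanity check on that estimate, the number of levels in a balanced tree grows with the logarithm of the row count. The query below is only a sketch: the fan-out of 100 keys per node is an assumed figure chosen for illustration (real values depend on page size and key width), and `LN` is used because it is available in both MySQL and PostgreSQL.
+
+```sql
+-- Approximate number of levels needed to index 1,000,000 rows.
+-- The fan-out of 100 is an illustrative assumption, not a measured value.
+SELECT
+    LN(1000000) / LN(2)   AS binary_tree_levels,  -- roughly 20
+    LN(1000000) / LN(100) AS fanout_100_levels;   -- roughly 3
+```
+
+The wider each node is, the fewer levels a lookup has to descend, which is why B-tree indexes stay shallow even for very large tables.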
+ +| **Operation** | **Without Index** | **With Index** | **Improvement** | +|---------------|-------------------|----------------|-----------------| +| Find by ID | O(n) - Linear | O(log n) - Logarithmic | Exponential | +| Range search | O(n) | O(log n + k) | Significant | +| Sort | O(n log n) | O(n) or O(1) | Major | + +::: + +## Types of Indexes +| Index Type | Description | Best Use Case | +| ----------------------- | --------------------------------------------------------------------------------- | ------------------------------------------------------------- | +| **Clustered Index** | Determines the physical order of data in the table. Each table can have only one. | Primary key columns. | +| **Non-Clustered Index** | A separate structure that points to table data. | Columns used in WHERE, JOIN, ORDER BY. | +| **Composite Index** | Index on multiple columns. | Queries filtering on multiple conditions. | +| **Unique Index** | Prevents duplicate values in a column. | Email IDs, Usernames, etc. | +| **Covering Index** | Includes all columns a query needs. | Read-heavy analytical queries. | +| **Partial Index** | Indexes a subset of rows. | Filtering on frequently used conditions (e.g., active users). | +| **Full-Text Index** | Optimized for text search. | Searching within text or document fields. 
| + + +## Creating and Managing Indexes + +:::tip +**Creating Indexes - Syntax Variations** + +```sql +-- Basic syntax +CREATE INDEX index_name ON table_name(column_name); + +-- Multiple columns +CREATE INDEX idx_name ON table_name(col1, col2, col3); + +-- Unique index +CREATE UNIQUE INDEX idx_name ON table_name(column_name); + +-- With specific algorithm (MySQL) +CREATE INDEX idx_name ON table_name(column_name) USING BTREE; +CREATE INDEX idx_name ON table_name(column_name) USING HASH; + +-- Descending order (useful for ORDER BY DESC queries) +CREATE INDEX idx_name ON table_name(column_name DESC); + +-- Conditional index (PostgreSQL) +CREATE INDEX idx_name ON table_name(column_name) WHERE condition; + +-- Concurrent creation (PostgreSQL - doesn't block writes to the table) +CREATE INDEX CONCURRENTLY idx_name ON table_name(column_name); + +-- With included columns (SQL Server, PostgreSQL 11+) +CREATE INDEX idx_name ON table_name(key_column) +INCLUDE (non_key_column1, non_key_column2); +``` + +**Dropping Indexes** + +```sql +-- Syntax varies by database (DROP INDEX is not part of the SQL standard) +DROP INDEX index_name ON table_name; -- MySQL +DROP INDEX index_name; -- PostgreSQL + +-- SQL Server +DROP INDEX table_name.index_name; + +-- Check if exists first (not available in MySQL) +DROP INDEX IF EXISTS index_name ON table_name; -- SQL Server 2016+ +DROP INDEX IF EXISTS index_name; -- PostgreSQL +``` + +**Viewing Indexes** + +```sql +-- MySQL +SHOW INDEXES FROM table_name; +SHOW INDEX FROM table_name WHERE Key_name = 'idx_name'; + +-- PostgreSQL +SELECT * FROM pg_indexes WHERE tablename = 'table_name'; + +-- SQL Server +SELECT * FROM sys.indexes WHERE object_id = OBJECT_ID('table_name'); + +-- MySQL via information_schema (this view is MySQL-specific, not standard SQL) +SELECT * FROM information_schema.statistics +WHERE table_name = 'table_name'; +``` +::: + +## When to Create Indexes + +:::success +**You SHOULD Create an Index When:** + +✅ **Frequently Used in WHERE Clauses** +```sql +-- If you run this query 1000 times per day: +SELECT * FROM users WHERE email = 'user@example.com'; +-- You NEED this index: +CREATE INDEX idx_users_email ON users(email); +``` + 
+✅ **Used in JOIN Conditions** +```sql +-- Frequently joining these tables: +SELECT o.*, c.customer_name +FROM orders o +JOIN customers c ON o.customer_id = c.customer_id; + +-- Create indexes on join columns: +CREATE INDEX idx_orders_customer_id ON orders(customer_id); +CREATE INDEX idx_customers_id ON customers(customer_id); -- Often already exists as PK +``` + +✅ **Used in ORDER BY** +```sql +-- Common sorting pattern: +SELECT * FROM products ORDER BY category, price DESC; +-- Index helps: +CREATE INDEX idx_products_category_price ON products(category, price DESC); +``` + +✅ **Used in GROUP BY** +```sql +-- Aggregation queries: +SELECT department, COUNT(*) +FROM employees +GROUP BY department; +-- Index helps: +CREATE INDEX idx_employees_department ON employees(department); +``` + +✅ **Foreign Key Columns** +```sql +-- Always index foreign keys: +ALTER TABLE orders ADD FOREIGN KEY (customer_id) REFERENCES customers(id); +CREATE INDEX idx_orders_customer_id ON orders(customer_id); +``` + +✅ **Columns with High Selectivity** +```sql +-- High selectivity: email, SSN, username (unique or near-unique) +CREATE INDEX idx_users_email ON users(email); + +-- NOT low selectivity: gender, boolean flags, status with few values +-- Don't index: gender (only 'M', 'F', 'Other') +``` +::: + +:::danger +**You Should NOT Create an Index When:** + +❌ **Table is Small (< 1000 rows)** +```sql +-- Overhead of index > benefit for tiny tables +-- Database can scan 1000 rows faster than using index +``` + +❌ **Column Has Low Selectivity** +```sql +-- Bad: is_active (only TRUE/FALSE values) +-- Bad: gender (only 2-3 values) +-- Bad: status (only 'active', 'inactive', 'pending') + +-- Exception: If you're filtering 99% of data +-- Partial index can help: +CREATE INDEX idx_users_inactive ON users(last_login) +WHERE is_active = FALSE; -- If only 1% are inactive +``` + +❌ **Column Frequently Updated** +```sql +-- Think twice before indexing columns that change often +-- Example: 
'last_updated', 'view_count', 'login_count' +-- Every UPDATE must update the index too +``` + +❌ **Already Covered by Composite Index** +```sql +-- Existing index: +CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date); + +-- Redundant: customer_id is already indexed (leftmost column) +CREATE INDEX idx_orders_customer ON orders(customer_id); -- ❌ Not needed + +-- But this would be useful (different leftmost column): +CREATE INDEX idx_orders_date_customer ON orders(order_date, customer_id); -- ✓ OK +``` + +❌ **Table Has Heavy Write Operations** +```sql +-- Logging tables, temporary staging tables +-- If 90% operations are INSERT, minimize indexes +-- Keep only essential ones +``` +::: + +## Practical Example +```sql +-- Before indexing +SELECT * FROM employees WHERE department_id = 5; +-- Took 2.8s (full table scan) + +-- After indexing +CREATE INDEX idx_department_id ON employees(department_id); +SELECT * FROM employees WHERE department_id = 5; +-- Took 0.03s (index scan) +``` +> After indexing `department_id`, the query optimizer uses an index scan instead of a full table scan — drastically improving performance. 
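+
+To confirm that the optimizer really picks the new index, the query plan can be inspected with `EXPLAIN`. This is a minimal sketch that assumes the `employees` table and the `idx_department_id` index from the example above; the exact plan output differs between databases and versions.
+
+```sql
+-- Show the execution plan instead of running the query
+EXPLAIN
+SELECT * FROM employees WHERE department_id = 5;
+
+-- MySQL: the `key` column should list idx_department_id
+-- PostgreSQL: look for an Index Scan or Bitmap Index Scan rather than a Seq Scan
+```
+
+If a large share of rows match `department_id = 5`, the planner may still prefer a full scan, so the timings above are illustrative rather than guaranteed.
+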
+## Common Indexing Mistakes + +:::danger +**Mistake #1: Over-Indexing** +```sql +-- Bad: Index every column "just in case" +CREATE INDEX idx_users_email ON users(email); +CREATE INDEX idx_users_first_name ON users(first_name); +CREATE INDEX idx_users_last_name ON users(last_name); +CREATE INDEX idx_users_phone ON users(phone); +CREATE INDEX idx_users_address ON users(address); +CREATE INDEX idx_users_city ON users(city); +CREATE INDEX idx_users_state ON users(state); +CREATE INDEX idx_users_zip ON users(zip); + +-- Problems: +-- ✗ Slows down INSERT/UPDATE/DELETE +-- ✗ Wastes disk space +-- ✗ Gives the optimizer more candidate indexes to evaluate for every query + +-- Good: Index only frequently queried columns +CREATE UNIQUE INDEX idx_users_email ON users(email); +CREATE INDEX idx_users_location ON users(state, city); -- Composite for location queries +``` + +**Mistake #2: Wrong Column Order in Composite Index** +```sql +-- Query pattern: +SELECT * FROM orders +WHERE customer_id = 123 AND order_date >= '2024-01-01'; + +-- Bad: Range column first, so the index must scan every date since 2024-01-01 before checking customer_id +CREATE INDEX idx_orders_wrong ON orders(order_date, customer_id); + +-- Good: Equality column first, then the range column (one customer's rows, already ordered by date) +CREATE INDEX idx_orders_right ON orders(customer_id, order_date); +``` + +**Mistake #3: Indexing Low-Cardinality Columns** +```sql +-- Bad: Only 2 values (M/F) +CREATE INDEX idx_users_gender ON users(gender); + +-- Bad: Only 3-4 values +CREATE INDEX idx_orders_status ON orders(status); + +-- Exception: OK if filtering out 99% of data +CREATE INDEX idx_users_suspended +ON users(last_login, email) +WHERE is_suspended = TRUE; -- If only 0.1% are suspended +``` +::: + +## Conclusion +SQL Indexes are one of the most powerful tools for **database performance optimization**, when used wisely. They can transform sluggish queries into lightning-fast ones, enabling your applications to scale efficiently and handle millions (or even billions) of rows seamlessly. 
+ +However, indexes are a **double-edged sword** — while they boost read performance, they come with tradeoffs in **storage cost, maintenance overhead, and slower write operations**. The key is balance: index only what’s necessary based on query patterns, selectivity, and workload characteristics. + +## Key Takeaways: + +- **Understand your queries first** — analyze WHERE, JOIN, ORDER BY, and GROUP BY clauses before creating indexes. + +- **Use the right type of index** — primary, unique, composite, covering, partial, or full-text — depending on your use case. + +- **Monitor and tune continuously** — use query planners and performance metrics (EXPLAIN, ANALYZE, SHOW INDEXES) to verify if indexes are being used effectively. + +- **Avoid over-indexing** — every index adds write overhead. Drop unused or redundant ones regularly. + +- **Think strategically** — use composite indexes following the left-to-right rule and leverage partial indexes for highly specific queries. + +In essence, a well-designed indexing strategy is the foundation of a performant database system. By mastering when and how to use indexes, you’ll unlock the full potential of SQL — delivering faster queries, efficient storage, and a smoother user experience. + +>🏁 Optimize smartly — not by adding more indexes, but by adding the right ones. 
+ +## Further Reading + +- [PostgreSQL Indexing Documentation](https://www.postgresql.org/docs/current/indexes.html) + +- [SQL Server Index Architecture and Design Guide](https://learn.microsoft.com/en-us/sql/relational-databases/sql-server-index-design-guide) + +- [MySQL Index Optimization Tips](https://dev.mysql.com/doc/refman/8.0/en/mysql-indexes.html) \ No newline at end of file diff --git a/docs/sql/SQL-Advance/subqueries.md b/docs/sql/SQL-Advance/subqueries.md new file mode 100644 index 00000000..8a26ae5d --- /dev/null +++ b/docs/sql/SQL-Advance/subqueries.md @@ -0,0 +1,541 @@ +--- +id: sql-subqueries +title: SQL Subqueries #Remember to keep this unique, as it maps with giscus discussions in the recodehive/support/general discussions +sidebar_label: Subqueries #displays in sidebar +sidebar_position: 1 +tags: + [ + sql, + subqueries, + nested queries, + scalar subquery, + correlated subquery, + sql tutorial, + database queries, + ] +description: Learn about SQL subqueries, how they work, different types, syntax, practical examples, and when to use nested queries for complex data retrieval. +--- + +## What are Subqueries? + +SQL **Subqueries** (also called nested queries or inner queries) are queries placed inside another query. Think of them as a question within a question - you ask one thing first, then use that answer to get your final result. + +:::note +**Key Characteristics of Subqueries:** + +- **Nested Structure**: A query inside another query, wrapped in parentheses. + +- **Execution Order**: Inner query runs first, then outer query uses its results. + +- **Versatile Placement**: Can appear in SELECT, FROM, WHERE, or HAVING clauses. + +- **Return Types**: Can return a single value, a row, a column, or a full table. 
+::: + + + [![GitHub](./assets/subqueries.gif)](https://www.learnsqlonline.org/) + + +:::success +**When to Use Subqueries:** + +- **Complex Filtering**: Find customers who spent more than the average order value +- **Step-by-Step Logic**: Break down complicated queries into manageable pieces +- **Dynamic Comparisons**: Compare rows against calculated values +- **Data Validation**: Check if records exist in other tables +- **Aggregated Filtering**: Filter based on grouped calculations + +**Real-World Example:** +You want to find all employees earning more than their department's average salary. Instead of calculating each department's average separately, you nest a query that figures out the average, then compare against it. +::: + +:::info + +## Basic Subquery Syntax + +```sql +-- Subquery in WHERE clause (most common) +SELECT column1, column2 +FROM table1 +WHERE column1 = (SELECT column FROM table2 WHERE condition); +``` + +| **Component** | **Purpose** | **Example** | +|---------------|-------------|-------------| +| Outer Query | Main query that uses subquery results | `SELECT name FROM employees` | +| Inner Query | Nested query that executes first | `(SELECT AVG(salary) FROM employees)` | +| Parentheses | Required to wrap subqueries | `WHERE salary > (subquery)` | +| Comparison | How outer query uses subquery result | `=, >, <, IN, EXISTS` | + +## Subquery Placement Options + +```sql +-- In WHERE clause +SELECT * FROM products +WHERE price > (SELECT AVG(price) FROM products); + +``` +```sql + +-- In SELECT clause (scalar subquery) +SELECT + name, + salary, + (SELECT AVG(salary) FROM employees) AS avg_salary +FROM employees; +``` +```sql +-- In FROM clause (derived table) +SELECT dept, avg_sal +FROM ( + SELECT department AS dept, AVG(salary) AS avg_sal + FROM employees + GROUP BY department +) AS dept_averages; +``` +```sql +-- In HAVING clause +SELECT department, AVG(salary) +FROM employees +GROUP BY department +HAVING AVG(salary) > (SELECT AVG(salary) FROM 
employees); +``` + +::: + +## Types of Subqueries + + + + ```sql + -- Returns a single value + -- Find products more expensive than average + SELECT + product_name, + price, + (SELECT AVG(price) FROM products) AS avg_price, + price - (SELECT AVG(price) FROM products) AS price_difference + FROM products + WHERE price > (SELECT AVG(price) FROM products) + ORDER BY price DESC; + + -- Scalar subquery must return exactly one value + ``` + + + ```sql + -- Returns a single column (multiple rows) + -- Find customers who placed orders in 2024 + SELECT + customer_id, + customer_name, + email + FROM customers + WHERE customer_id IN ( + SELECT DISTINCT customer_id + FROM orders + WHERE YEAR(order_date) = 2024 + ) + ORDER BY customer_name; + + -- Use IN, ANY, or ALL with column subqueries + ``` + + + ```sql + -- Returns a single row (multiple columns) + -- Find employee with exact match of max salary and earliest hire date + SELECT + employee_id, + employee_name, + salary, + hire_date, + department + FROM employees + WHERE (salary, hire_date) = ( + SELECT MAX(salary), MIN(hire_date) + FROM employees + ); + + -- Compares multiple columns at once + ``` + + + ```sql + -- Returns a full table (multiple rows and columns) + -- Calculate revenue by customer segment + SELECT + segment, + total_customers, + total_revenue, + ROUND(total_revenue / total_customers, 2) AS avg_revenue_per_customer + FROM ( + SELECT + CASE + WHEN total_spent > 5000 THEN 'VIP' + WHEN total_spent > 1000 THEN 'Regular' + ELSE 'Occasional' + END AS segment, + COUNT(*) AS total_customers, + SUM(total_spent) AS total_revenue + FROM ( + SELECT + customer_id, + SUM(total_amount) AS total_spent + FROM orders + GROUP BY customer_id + ) customer_totals + GROUP BY segment + ) segment_stats + ORDER BY total_revenue DESC; + + -- Nested subqueries working together + ``` + + + ```sql + -- References outer query (runs for each row) + -- Find employees earning more than their department average + SELECT + e1.employee_name, + 
e1.department, + e1.salary, + ( + SELECT AVG(e2.salary) + FROM employees e2 + WHERE e2.department = e1.department + ) AS dept_avg_salary, + e1.salary - ( + SELECT AVG(e2.salary) + FROM employees e2 + WHERE e2.department = e1.department + ) AS difference + FROM employees e1 + WHERE e1.salary > ( + SELECT AVG(e2.salary) + FROM employees e2 + WHERE e2.department = e1.department + ) + ORDER BY e1.department, e1.salary DESC; + + -- Inner query references outer query's current row + ``` + + + +## Practical Examples + + + + ```sql + -- Find products selling above their category average + SELECT + p.product_id, + p.product_name, + p.category, + p.price, + ( + SELECT AVG(price) + FROM products p2 + WHERE p2.category = p.category + ) AS category_avg, + ROUND( + (p.price - ( + SELECT AVG(price) + FROM products p2 + WHERE p2.category = p.category + )) / ( + SELECT AVG(price) + FROM products p2 + WHERE p2.category = p.category + ) * 100, + 2 + ) AS percent_above_avg + FROM products p + WHERE p.price > ( + SELECT AVG(price) + FROM products p2 + WHERE p2.category = p.category + ) + ORDER BY p.category, percent_above_avg DESC; + ``` + + + ```sql + -- Find top 3 products in each category by sales + -- Rank = 1 + number of products in the same category with higher total sales + SELECT + category, + product_name, + total_sales, + sales_rank + FROM ( + SELECT + t.category, + t.product_name, + t.total_sales, + ( + SELECT COUNT(*) + 1 + FROM ( + -- Total sales per product (uncorrelated derived table) + SELECT p2.category, SUM(oi2.quantity * oi2.unit_price) AS total_sales + FROM order_items oi2 + JOIN products p2 ON oi2.product_id = p2.product_id + GROUP BY p2.product_id, p2.category + ) t2 + WHERE t2.category = t.category + AND t2.total_sales > t.total_sales + ) AS sales_rank + FROM ( + SELECT p.product_id, p.category, p.product_name, + SUM(oi.quantity * oi.unit_price) AS total_sales + FROM order_items oi + JOIN products p ON oi.product_id = p.product_id + GROUP BY p.product_id, p.category, p.product_name + ) t + ) ranked_products + WHERE sales_rank <= 3 + ORDER BY category, sales_rank; + ``` + + + ```sql + -- Segment customers based on spending compared to overall average + SELECT + c.customer_id, + c.customer_name, + c.email, + 
COALESCE(customer_stats.total_orders, 0) AS total_orders, + COALESCE(customer_stats.total_spent, 0) AS total_spent, + (SELECT AVG(total_amount) FROM orders) AS overall_avg_order, + CASE + WHEN customer_stats.total_spent > ( + SELECT AVG(total_spent) + FROM ( + SELECT customer_id, SUM(total_amount) AS total_spent + FROM orders + GROUP BY customer_id + ) AS all_customers + ) * 2 THEN 'Premium' + WHEN customer_stats.total_spent > ( + SELECT AVG(total_spent) + FROM ( + SELECT customer_id, SUM(total_amount) AS total_spent + FROM orders + GROUP BY customer_id + ) AS all_customers + ) THEN 'Regular' + ELSE 'Basic' + END AS customer_tier + FROM customers c + LEFT JOIN ( + SELECT + customer_id, + COUNT(*) AS total_orders, + SUM(total_amount) AS total_spent + FROM orders + GROUP BY customer_id + ) customer_stats ON c.customer_id = customer_stats.customer_id + ORDER BY total_spent DESC; + ``` + + + ```sql + -- Find customers who have never placed an order + SELECT + c.customer_id, + c.customer_name, + c.email, + c.registration_date, + DATEDIFF(CURRENT_DATE, c.registration_date) AS days_since_registration + FROM customers c + WHERE NOT EXISTS ( + SELECT 1 + FROM orders o + WHERE o.customer_id = c.customer_id + ) + AND c.registration_date < DATE_SUB(CURRENT_DATE, INTERVAL 30 DAY) + ORDER BY c.registration_date; + + -- EXISTS is efficient - stops checking once it finds a match + ``` + + + ```sql + -- Find products more expensive than ANY product in Electronics + SELECT product_name, price, category + FROM products + WHERE price > ANY ( + SELECT price + FROM products + WHERE category = 'Electronics' + ) + AND category != 'Electronics'; + + -- Find products more expensive than ALL products in Electronics + SELECT product_name, price, category + FROM products + WHERE price > ALL ( + SELECT price + FROM products + WHERE category = 'Electronics' + ) + ORDER BY price DESC; + + -- ANY means "at least one", ALL means "every single one" + ``` + + + ```plaintext + -- Sample result for 
above-average products: + + product_id | product_name | category | price | category_avg | percent_above_avg + -----------|-----------------|-------------|---------|--------------|------------------ + 15 | Laptop Pro | Electronics | 1299.99 | 756.45 | 71.85 + 23 | Gaming Monitor | Electronics | 899.99 | 756.45 | 18.98 + 8 | Office Desk | Furniture | 449.99 | 312.50 | 44.00 + + -- Shows products priced above their category's average + -- Includes the average and percentage difference + ``` + + + +## Subqueries vs JOINs + +:::tip +**When to Choose What:** + +**Use Subqueries When:** + +- Logic is clearer with step-by-step thinking +- You need aggregated values for comparison +- Checking for existence/non-existence (EXISTS/NOT EXISTS) +- One-time calculations that don't need repeated access +- Building derived tables for complex analysis + +**Use JOINs When:** +- Combining columns from multiple tables +- Need better performance with large datasets +- Retrieving data from multiple tables simultaneously +- Working with well-indexed foreign keys + +**Example Comparison:** +```sql +-- Using Subquery +SELECT customer_name +FROM customers +WHERE customer_id IN ( + SELECT customer_id + FROM orders + WHERE order_date >= '2024-01-01' +); + +-- Using JOIN (can be faster on some engines; DISTINCT collapses customers with several orders) +SELECT DISTINCT c.customer_name +FROM customers c +INNER JOIN orders o ON c.customer_id = o.customer_id +WHERE o.order_date >= '2024-01-01'; +``` +::: + +## Common Subquery Patterns + +:::info +**Useful Patterns You'll Use Often:** + +1. **Find Records NOT IN Another Table:** + ```sql + SELECT product_name + FROM products + WHERE product_id NOT IN ( + SELECT DISTINCT product_id + FROM order_items + WHERE product_id IS NOT NULL + ); + -- Finds products never ordered (the NULL filter keeps NOT IN from returning no rows) + ``` + +2. **Compare Against Aggregates:** + ```sql + SELECT employee_name, salary + FROM employees + WHERE salary > (SELECT AVG(salary) FROM employees); + -- Above-average salaries + ``` + +3. 
**Match the Maximum Value:** + ```sql + SELECT * + FROM products + WHERE price = (SELECT MAX(price) FROM products); + -- Most expensive product (returns several rows if prices tie) + ``` + +4. **Conditional Aggregation:** + ```sql + SELECT + category, + COUNT(*) as product_count + FROM products + GROUP BY category + HAVING COUNT(*) > ( + SELECT AVG(cat_count) + FROM ( + SELECT COUNT(*) as cat_count + FROM products + GROUP BY category + ) AS category_counts + ); + -- Categories with above-average product counts + ``` +::: + +## Common Mistakes to Avoid + +:::warning +**Watch Out For These:** + +1. **Forgetting Parentheses**: Subqueries must be wrapped + ```sql + -- Wrong + WHERE customer_id IN SELECT customer_id FROM orders; + + -- Correct + WHERE customer_id IN (SELECT customer_id FROM orders); + ``` + +2. **Multiple Values When Expecting One**: + ```sql + -- Will error if subquery returns multiple rows + WHERE price = (SELECT price FROM products); + + -- Use IN for multiple values + WHERE price IN (SELECT price FROM products WHERE category = 'Electronics'); + ``` + +3. **NULL Handling with NOT IN**: + ```sql + -- NOT IN returns no rows if the subquery result contains a NULL + -- Use NOT EXISTS instead + WHERE NOT EXISTS ( + SELECT 1 FROM orders WHERE orders.customer_id = customers.customer_id + ); + ``` + +4. **Performance Blindness**: + ```sql + -- Don't nest too deep - hard to read and often slow + -- Keep it to 2-3 levels maximum + ``` +::: + +## Best Practices + +1. **Keep It Simple**: If it's hard to understand, consider breaking it down +2. **Name Derived Tables**: Always alias subqueries in FROM clause +3. **Comment Complex Logic**: Future you will thank present you +4. **Test Step by Step**: Run inner queries separately first +5. **Consider Alternatives**: Sometimes a `JOIN` or `CTE` is clearer +6. **Use Appropriate Operators**: EXISTS for existence, IN for lists, = for single values + + +## Conclusion + +Subqueries are your tool for asking layered questions - calculate something first, then use that answer to get what you really want. 
They make complex logic readable by breaking problems into steps. Start with simple subqueries in `WHERE` clauses, then gradually work up to more complex patterns. + +**Remember:** If your subquery gets too complicated, there's probably a simpler way to write it. Keep it clear, keep it tested, and your queries will be both powerful and maintainable. + + \ No newline at end of file diff --git a/docs/sql/SQL-Advance/window-functions.md b/docs/sql/SQL-Advance/window-functions.md new file mode 100644 index 00000000..fec70285 --- /dev/null +++ b/docs/sql/SQL-Advance/window-functions.md @@ -0,0 +1,460 @@ +--- +id: window-functions +title: SQL Window Functions - Complete Guide +sidebar_label: Window Functions +sidebar_position: 1 +tags: + [ + sql, + window functions, + over clause, + partition by, + row_number, + rank, + dense_rank, + lag, + lead, + analytics, + sql tutorial, + ] +description: Master SQL Window Functions with practical examples. Learn ROW_NUMBER, RANK, LAG, LEAD, and more for powerful data analysis without grouping. +--- + +## What are Window Functions? + +SQL **Window Functions** perform calculations across a set of rows that are related to the current row, but unlike GROUP BY, they don't collapse rows into a single output. Think of them as "looking through a window" at related rows while keeping all individual rows intact. + +:::note +**Key Characteristics of Window Functions:** + +- **No Row Reduction**: Unlike GROUP BY, every row remains in the result set. + +- **Contextual Calculations**: Perform calculations based on a "window" of related rows. + +- **OVER Clause**: The signature syntax that defines the window of rows. + +- **Powerful Analytics**: Perfect for rankings, running totals, comparisons, and trends. 
+::: + +:::success +**Why Window Functions are Game-Changers:** + +Imagine you have a sales table and want to: +- Show each sale alongside the total sales for that month +- Rank salespeople by performance within each region +- Compare each day's sales to the previous day +- Calculate a running total of revenue + +Without window functions, you'd need complex subqueries or multiple joins. Window functions make these tasks simple and elegant! + +**Real-World Example:** +A sales manager wants to see each salesperson's individual sales while also showing their rank within their region and the regional average - all in one query. Window functions make this trivial. +::: + +:::info + +## Basic Window Function Syntax + +```sql +SELECT + column1, + column2, + WINDOW_FUNCTION() OVER ( + [PARTITION BY partition_column] + [ORDER BY sort_column] + [ROWS/RANGE frame_specification] + ) AS result_column +FROM table_name; +``` + +| **Component** | **Purpose** | **Required?** | +|---------------|-------------|---------------| +| WINDOW_FUNCTION | The calculation to perform | Yes | +| OVER | Defines the window | Yes | +| PARTITION BY | Groups rows into partitions | Optional | +| ORDER BY | Orders rows within partitions | Optional* | +| ROWS/RANGE | Defines frame boundaries | Optional | + +*Needed for meaningful results with order-dependent functions such as ROW_NUMBER, RANK, LAG, and LEAD; some databases (for example SQL Server) reject these functions without it + +## The OVER Clause - Your Window Control Panel + +```sql +-- Simple window: entire table +SUM(amount) OVER () + +-- Partitioned window: separate calculations per group +SUM(amount) OVER (PARTITION BY department) + +-- Ordered window: enables ranking and sequential functions +ROW_NUMBER() OVER (ORDER BY sales DESC) + +-- Complete window: partition + order + frame +SUM(amount) OVER ( + PARTITION BY department + ORDER BY sale_date + ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW +) +``` + +::: + +## Essential Window Functions + + + + ```sql + -- Assign unique sequential numbers to rows + SELECT + employee_name, + department, + salary, + 
ROW_NUMBER() OVER (ORDER BY salary DESC) AS overall_rank, + ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank + FROM employees; + + -- Result: Every row gets a unique number + -- Perfect for: Pagination, removing duplicates, creating unique IDs + + -- Practical use: Top 3 earners per department + WITH ranked AS ( + SELECT *, + ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rn + FROM employees + ) + SELECT employee_name, department, salary + FROM ranked + WHERE rn <= 3; + ``` + + + ```sql + -- RANK: Gives same rank for ties, skips numbers + -- DENSE_RANK: Gives same rank for ties, no gaps + -- Aliases avoid the bare words rank/dense_rank, which are reserved in MySQL 8.0 + + SELECT + student_name, + test_score, + RANK() OVER (ORDER BY test_score DESC) AS score_rank, + DENSE_RANK() OVER (ORDER BY test_score DESC) AS score_dense_rank, + ROW_NUMBER() OVER (ORDER BY test_score DESC) AS row_num + FROM test_results; + + -- Example output: + -- Name Score RANK DENSE_RANK ROW_NUMBER + -- Alice 95 1 1 1 + -- Bob 95 1 1 2 <- tie with Alice: same RANK and DENSE_RANK + -- Carol 90 3 2 3 <- RANK skips 2, DENSE_RANK doesn't + -- Dave 90 3 2 4 + -- Eve 85 5 3 5 + + -- Use RANK for: Competition rankings with ties + -- Use DENSE_RANK for: Category rankings without gaps + -- Use ROW_NUMBER for: Unique sequential numbering + ``` + + + ```sql + -- LAG: Look at previous row + -- LEAD: Look at next row + + SELECT + sale_date, + daily_sales, + LAG(daily_sales, 1) OVER (ORDER BY sale_date) AS yesterday_sales, + LEAD(daily_sales, 1) OVER (ORDER BY sale_date) AS tomorrow_sales, + daily_sales - LAG(daily_sales, 1) OVER (ORDER BY sale_date) AS change_from_yesterday, + ROUND( + (daily_sales - LAG(daily_sales, 1) OVER (ORDER BY sale_date)) * 100.0 / + NULLIF(LAG(daily_sales, 1) OVER (ORDER BY sale_date), 0), -- NULLIF avoids division by zero + 2 + ) AS percent_change + FROM daily_sales + WHERE sale_date >= '2024-01-01' + ORDER BY sale_date; + + -- Perfect for: Comparing consecutive records, trend analysis + -- LAG(column, n, default) - n rows back, default if there is no such row + -- LEAD(column, n, default) - n rows forward, default if there is no such row + 
``` + + + ```sql + -- Running totals and moving averages + + SELECT + order_date, + order_amount, + customer_id, + -- Running total for each customer + SUM(order_amount) OVER ( + PARTITION BY customer_id + ORDER BY order_date + ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW + ) AS running_total, + + -- Overall average across all orders + AVG(order_amount) OVER () AS overall_avg, + + -- Customer's average up to this point + AVG(order_amount) OVER ( + PARTITION BY customer_id + ORDER BY order_date + ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW + ) AS customer_running_avg, + + -- Moving average over the current order and the 2 preceding orders (rows, not calendar days) + AVG(order_amount) OVER ( + ORDER BY order_date + ROWS BETWEEN 2 PRECEDING AND CURRENT ROW + ) AS moving_avg_3rows, + + -- Count orders per customer up to this point + COUNT(*) OVER ( + PARTITION BY customer_id + ORDER BY order_date + ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW + ) AS order_number + FROM orders + ORDER BY customer_id, order_date; + ``` + + + ```sql + -- Access first or last value in window + + SELECT + employee_name, + department, + hire_date, + salary, + -- First person hired in department + FIRST_VALUE(employee_name) OVER ( + PARTITION BY department + ORDER BY hire_date + ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING + ) AS first_hire, + + -- Highest salary in department + FIRST_VALUE(salary) OVER ( + PARTITION BY department + ORDER BY salary DESC + ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING + ) AS highest_dept_salary, + + -- Most recent hire in department + LAST_VALUE(employee_name) OVER ( + PARTITION BY department + ORDER BY hire_date + ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING + ) AS most_recent_hire, + + -- Compare salary to highest in department + salary - FIRST_VALUE(salary) OVER ( + PARTITION BY department + ORDER BY salary DESC + ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING + ) AS salary_gap_from_top + FROM employees; + + -- Note: UNBOUNDED FOLLOWING is crucial for LAST_VALUE + -- 
Without it, "last" means "current row"! + ``` + + + ```sql + -- Divide rows into N roughly equal groups + + SELECT + product_name, + price, + NTILE(4) OVER (ORDER BY price) AS price_quartile, + NTILE(10) OVER (ORDER BY price) AS price_decile, + CASE NTILE(3) OVER (ORDER BY price) + WHEN 1 THEN 'Budget' + WHEN 2 THEN 'Mid-Range' + WHEN 3 THEN 'Premium' + END AS price_category + FROM products; + + -- Perfect for: Creating equal-sized groups, customer segmentation + -- Each group gets (total_rows / N) or (total_rows / N) + 1 rows + + -- Practical: Customer segmentation by purchase history + WITH customer_metrics AS ( + SELECT + customer_id, + COUNT(*) AS total_orders, + SUM(order_amount) AS total_spent + FROM orders + GROUP BY customer_id + ) + SELECT + customer_id, + total_spent, + NTILE(4) OVER (ORDER BY total_spent DESC) AS value_quartile, + CASE NTILE(4) OVER (ORDER BY total_spent DESC) + WHEN 1 THEN 'VIP - Top 25%' + WHEN 2 THEN 'High Value - 26-50%' + WHEN 3 THEN 'Regular - 51-75%' + WHEN 4 THEN 'Occasional - Bottom 25%' + END AS customer_segment + FROM customer_metrics; + ``` + + + +## Understanding PARTITION BY + +Think of PARTITION BY as creating separate "mini-tables" within your result set. Calculations reset for each partition. 
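To make "calculations reset for each partition" concrete, here is a minimal sketch that puts a partitioned running total next to an unpartitioned one. The `sales` rows and column names are hypothetical inline data, not part of this tutorial's schema, and the `WITH ... AS (VALUES ...)` form is assumed to be available (it is in PostgreSQL and SQLite; other engines may need a different way to supply inline rows).

```sql
-- Hypothetical inline data so the query runs without any existing tables
WITH sales(region, sale_day, amount) AS (
    VALUES
        ('East', 1, 100),
        ('East', 2, 50),
        ('West', 1, 70),
        ('West', 2, 30)
)
SELECT
    region,
    sale_day,
    amount,
    -- Running total restarts at the first row of each region
    SUM(amount) OVER (
        PARTITION BY region
        ORDER BY sale_day
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS region_running_total,
    -- No PARTITION BY: one running total across every row
    SUM(amount) OVER (
        ORDER BY region, sale_day
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS overall_running_total
FROM sales
ORDER BY region, sale_day;

-- region | sale_day | amount | region_running_total | overall_running_total
-- East   | 1        | 100    | 100                  | 100
-- East   | 2        | 50     | 150                  | 150
-- West   | 1        | 70     | 70                   | 220
-- West   | 2        | 30     | 100                  | 250
```

The partitioned total drops back to 70 at the first West row, while the unpartitioned total keeps accumulating to 250.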
+ +```sql +-- Without PARTITION BY: One ranking across entire table +SELECT + employee_name, + department, + salary, + RANK() OVER (ORDER BY salary DESC) AS company_rank +FROM employees; + +-- With PARTITION BY: Separate rankings per department +SELECT + employee_name, + department, + salary, + RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank +FROM employees; +``` + +:::tip +**PARTITION BY Best Practices:** + +- Use when you want separate calculations per group +- Can partition by multiple columns: `PARTITION BY region, department` +- Think of it as "invisible GROUP BY" - groups data without collapsing rows +- Each partition is processed independently +::: + +## Frame Specifications - Defining Your Window + +Frame specifications define which rows are included in the calculation. + +```sql +-- Frame clause syntax +ROWS BETWEEN start_boundary AND end_boundary + +-- Common frame specifications: +UNBOUNDED PRECEDING -- From the first row of partition +UNBOUNDED FOLLOWING -- To the last row of partition +CURRENT ROW -- The current row +N PRECEDING -- N rows before current +N FOLLOWING -- N rows after current +``` + + + + ```sql + -- Running total: Sum from start to current row + SELECT + order_date, + amount, + SUM(amount) OVER ( + ORDER BY order_date + ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW + ) AS running_total + FROM orders; + + -- Shorthand (default RANGE frame; same result only when order_date values are unique): + SUM(amount) OVER (ORDER BY order_date) + ``` + + + ```sql + -- 7-day moving average + SELECT + sale_date, + daily_sales, + AVG(daily_sales) OVER ( + ORDER BY sale_date + ROWS BETWEEN 6 PRECEDING AND CURRENT ROW + ) AS moving_avg_7day + FROM daily_sales; + + -- Centered moving average (3 days: before, current, after) + SELECT + sale_date, + daily_sales, + AVG(daily_sales) OVER ( + ORDER BY sale_date + ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING + ) AS centered_avg_3day + FROM daily_sales; + ``` + + + ```sql + -- Calculate percentage of total within each group + SELECT + department, + 
employee_name, + salary, + SUM(salary) OVER ( + PARTITION BY department + ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING + ) AS dept_total_salary, + ROUND( + salary * 100.0 / SUM(salary) OVER ( + PARTITION BY department + ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING + ), + 2 + ) AS percent_of_dept_payroll + FROM employees; + ``` + + + + +## Window Functions Quick Reference + +| Function | Purpose | Common Use Case | +|----------|---------|----------------| +| `ROW_NUMBER()` | Unique sequential number | Pagination, removing duplicates | +| `RANK()` | Ranking with gaps | Competition standings | +| `DENSE_RANK()` | Ranking without gaps | Category rankings | +| `NTILE(n)` | Divide into n groups | Customer segmentation | +| `LAG()` | Previous row value | Period-over-period comparison | +| `LEAD()` | Next row value | Forecasting, trend analysis | +| `FIRST_VALUE()` | First value in window | Baseline comparison | +| `LAST_VALUE()` | Last value in window | Final value comparison | +| `SUM()` | Running/windowed total | Cumulative sales | +| `AVG()` | Moving/windowed average | Smoothing trends | +| `COUNT()` | Windowed count | Running count | +| `MIN()`/`MAX()` | Windowed extremes | Range analysis | + +## Practice Problems + +Try these on your own to master window functions: + +1. **Find the top 3 products by revenue in each category** +2. **Calculate each employee's salary as a percentage of their department total** +3. **Show month-over-month growth rate for sales** +4. **Identify customers whose last 3 purchases were all above $100** +5. **Find products that consistently rank in top 10 for 90 consecutive days** + +## Conclusion + +Window functions are incredibly powerful tools that transform how you analyze data in SQL. They allow you to perform complex analytics that would otherwise require multiple queries, subqueries, or even application-level processing. 
+ +Start with simple examples like `ROW_NUMBER()` and `SUM() OVER()`, then gradually incorporate partitioning, frames, and more advanced functions. With practice, window functions will become your go-to solution for sophisticated data analysis. + +**Remember:** +- Window functions keep all rows (unlike GROUP BY) +- The OVER clause defines your window +- PARTITION BY creates separate calculations per group +- ORDER BY is crucial for sequential functions +- Frame specifications control which rows are included + +Happy querying! 🚀 + + \ No newline at end of file diff --git a/sidebars.ts b/sidebars.ts index 2a796354..d7c7e107 100644 --- a/sidebars.ts +++ b/sidebars.ts @@ -143,6 +143,17 @@ const sidebars: SidebarsConfig = { "sql/SQL-joins/self-join", ], }, + { + type: 'category', + label: 'SQL Advance', + className: 'custom-sidebar-sql-advance', + items: [ + 'sql/SQL-Advance/sql-subqueries', + 'sql/SQL-Advance/common-table-expressions', + 'sql/SQL-Advance/window-functions', + 'sql/SQL-Advance/sql-indexes' + ], + }, ], }, {