1 change: 1 addition & 0 deletions README.md
@@ -23,6 +23,7 @@
- 📞 [Traefik](https://traefik.io) as a reverse proxy / load balancer.
- 🚢 Deployment instructions using Docker Compose, including how to set up a frontend Traefik proxy to handle automatic HTTPS certificates.
- 🏭 CI (continuous integration) and CD (continuous deployment) based on GitHub Actions.
- 📊 **New Analytics Module**: Integrated analytics capabilities using DuckDB and Polars for efficient querying of exported data. Features an ETL process for data extraction and OpenTelemetry for performance tracing. For more details, see the [Backend README](./backend/README.md#analytics-module).

### Dashboard Login

97 changes: 97 additions & 0 deletions backend/README.md
@@ -1,5 +1,17 @@
# FastAPI Project - Backend

## Contents

- [Requirements](#requirements)
- [Docker Compose](#docker-compose)
- [General Workflow](#general-workflow)
- [VS Code](#vs-code)
- [Docker Compose Override](#docker-compose-override)
- [Backend tests](#backend-tests)
- [Migrations](#migrations)
- [Email Templates](#email-templates)
- [Analytics Module](#analytics-module)

## Requirements

* [Docker](https://www.docker.com/).
@@ -170,3 +182,88 @@ The email templates are in `./backend/app/email-templates/`. Here, there are two
Before continuing, ensure you have the [MJML extension](https://marketplace.visualstudio.com/items?itemName=attilabuti.vscode-mjml) installed in your VS Code.

Once you have the MJML extension installed, you can create a new email template in the `src` directory. After creating the new email template and with the `.mjml` file open in your editor, open the command palette with `Ctrl+Shift+P` and search for `MJML: Export to HTML`. This will convert the `.mjml` file to a `.html` file, which you can then save in the `build` directory.

## Analytics Module

This section details the integrated analytics capabilities, designed to provide insights into application data without impacting the performance of the primary transactional database.

### Architecture Overview

The analytics architecture employs a dual-database approach:

- **PostgreSQL**: Serves as the primary transactional database, handling real-time data for Users, Items, and other core application entities.
- **DuckDB with Polars**: Used as the analytical processing engine. Data is periodically moved from PostgreSQL to Parquet files, which are then queried efficiently by DuckDB. Polars is utilized for high-performance DataFrame manipulations where needed.

This separation ensures that complex analytical queries do not overload the operational database.
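
As a rough illustration of the approach (a minimal sketch, not the project's actual module; the file path follows the ETL output described below), DuckDB can scan the exported Parquet files directly and hand the result to Polars:

```python
import duckdb
import polars as pl

# In-process DuckDB connection; no separate database server is needed.
con = duckdb.connect()

# Aggregate straight off the Parquet export and fetch the result as a Polars DataFrame.
df: pl.DataFrame = con.execute(
    """
    SELECT owner_id, COUNT(*) AS item_count
    FROM read_parquet('backend/data/parquet/items_analytics.parquet')
    GROUP BY owner_id
    ORDER BY item_count DESC
    """
).pl()

print(df.head())
```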

### ETL Process

An ETL (Extract, Transform, Load) process is responsible for populating the analytical data store.

- **Script**: `backend/app/scripts/export_to_parquet.py`
- **Purpose**: This script extracts data from the main PostgreSQL database (specifically the `User` and `Item` tables) and saves it into Parquet files (`users_analytics.parquet`, `items_analytics.parquet`). Parquet format is chosen for its efficiency in analytical workloads.
- **Usage**: The script is designed to be run periodically (e.g., as a nightly batch job or via a scheduler like cron) to update the data available for analytics. To run the script manually (ensure your Python environment with backend dependencies is active, or run within the Docker container):
```bash
python backend/app/scripts/export_to_parquet.py
```
- **Output Location**: The Parquet files are stored in a directory specified by the `PARQUET_DATA_PATH` environment variable. The default location is `backend/data/parquet/`.
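
A hedged sketch of the export step follows (the actual script is `backend/app/scripts/export_to_parquet.py` and may differ in table names, settings, and error handling):

```python
import os

import polars as pl

# Both names are assumptions for illustration; the real script reads its
# configuration from the application settings.
PARQUET_DATA_PATH = os.getenv("PARQUET_DATA_PATH", "backend/data/parquet/")
DATABASE_URI = os.environ["DATABASE_URI"]  # e.g. postgresql://user:pass@db:5432/app

os.makedirs(PARQUET_DATA_PATH, exist_ok=True)

# Extract: pull the table from PostgreSQL into a Polars DataFrame
# (read_database_uri requires the connectorx or adbc engine).
users = pl.read_database_uri('SELECT * FROM "user"', DATABASE_URI)

# Load: write a columnar Parquet file for DuckDB to scan.
users.write_parquet(os.path.join(PARQUET_DATA_PATH, "users_analytics.parquet"))
```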

### Analytics API Endpoints

New API endpoints provide access to analytical insights. These are available under the `/api/v1/analytics` prefix:

- **`GET /api/v1/analytics/items_by_user`**:
- **Provides**: A list of users and the total count of items they own.
- **Details**: Only includes users who own at least one item. Results are ordered by the number of items in descending order.
- **Response Model**: `List[UserItemCount]` where `UserItemCount` includes `email: str` and `item_count: int`.

- **`GET /api/v1/analytics/active_users`**:
- **Provides**: The top 10 most active users, based on the number of items they own.
  - **Details**: Users are ordered by item count in descending order. The underlying query uses a `LEFT JOIN`, so users with no items are counted with an `item_count` of 0; in practice they only surface when fewer than 10 users own items.
- **Response Model**: `List[ActiveUser]` where `ActiveUser` includes `user_id: int`, `email: str`, `full_name: str | None`, and `item_count: int`.

These endpoints query the DuckDB instance, which reads from the Parquet files generated by the ETL script.
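
An illustrative client call (host and port are assumptions for a local run; the response shape follows the models above):

```python
import httpx

# The analytics routes as written do not require authentication.
resp = httpx.get("http://localhost:8000/api/v1/analytics/items_by_user")
resp.raise_for_status()

for row in resp.json():
    # Each row matches UserItemCount, e.g. {"email": "alice@example.com", "item_count": 12}
    print(row["email"], row["item_count"])
```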

### OpenTelemetry Tracing

OpenTelemetry has been integrated into the backend for enhanced observability:

- **Purpose**: To trace application performance and behavior, helping to identify bottlenecks and understand request flows.
- **Export**: Currently, traces are configured to be exported to the console. This is useful for development and debugging. For production, an appropriate OpenTelemetry collector and backend (e.g., Jaeger, Zipkin, Datadog) should be configured.
- **Coverage**:
- **Auto-instrumentation**: FastAPI and SQLAlchemy interactions are automatically instrumented, providing traces for API requests and database calls to the PostgreSQL database.
- **Custom Tracing**:
- The analytics module (`backend/app/core/analytics.py`) includes custom spans for DuckDB connection setup and query execution.
- The analytics API routes (`backend/app/api/routes/analytics.py`) have custom spans for their request handlers.
- The ETL script (`backend/app/scripts/export_to_parquet.py`) is instrumented with custom spans for its key operations (database extraction, Parquet file writing).
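
The custom-tracing pattern used in these modules boils down to the following (a minimal sketch mirroring the analytics routes; the function name is illustrative):

```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

def run_report() -> list[dict]:
    # Wrap the unit of work in a span and record success or failure on it.
    with tracer.start_as_current_span("analytics_report") as span:
        span.set_attribute("analytics.query", "SELECT 1")
        try:
            result = [{"ok": True}]  # placeholder for the real work
            span.set_status(Status(StatusCode.OK))
            return result
        except Exception as exc:
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR, str(exc)))
            raise
```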

### Key New Dependencies

The following main dependencies were added to support the analytics features:

- `duckdb`: An in-process analytical data management system.
- `polars`: A fast DataFrame library.
- `opentelemetry-api`: Core OpenTelemetry API.
- `opentelemetry-sdk`: OpenTelemetry SDK for configuring telemetry.
- `opentelemetry-exporter-otlp-proto-http`: OTLP exporter over HTTP (the console exporter is used by default in the current setup).
- `opentelemetry-instrumentation-fastapi`: Auto-instrumentation for FastAPI.
- `opentelemetry-instrumentation-sqlalchemy`: Auto-instrumentation for SQLAlchemy.
- `opentelemetry-instrumentation-psycopg2`: Auto-instrumentation for Psycopg2 (PostgreSQL driver).

Refer to `backend/pyproject.toml` for specific versions.

### New Configuration Options

The following environment variables can be set (e.g., in your `.env` file) to configure the analytics and OpenTelemetry features:

- **`PARQUET_DATA_PATH`**:
- **Description**: Specifies the directory where the ETL script saves Parquet files and where DuckDB reads them from.
- **Default**: `backend/data/parquet/`
- **`SERVICE_NAME`**:
- **Description**: Sets the service name attribute for OpenTelemetry traces. This helps in identifying and filtering traces in a distributed tracing system.
- **Default**: `fastapi-analytics-app` (Note: The ETL script appends "-etl-script" to this name for its traces).
- **`OTEL_EXPORTER_OTLP_ENDPOINT`** (Optional, for future use):
  - **Description**: If you configure an OTLP exporter (e.g., to send traces to Jaeger, Zipkin, or an OpenTelemetry Collector), this variable specifies its endpoint URL.
- **Default**: Not set (console exporter is used by default).

These settings are defined in `backend/app/core/config.py`.
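
For reference, the options map to settings fields roughly like this (an assumed shape; see `backend/app/core/config.py` for the actual definitions):

```python
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    # Defaults mirror the documentation above.
    PARQUET_DATA_PATH: str = "backend/data/parquet/"
    SERVICE_NAME: str = "fastapi-analytics-app"
    OTEL_EXPORTER_OTLP_ENDPOINT: str | None = None  # unset => console exporter is used
```
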
3 changes: 2 additions & 1 deletion backend/app/api/main.py
@@ -1,13 +1,14 @@
from fastapi import APIRouter

from app.api.routes import items, login, private, users, utils
from app.api.routes import items, login, private, users, utils, analytics
from app.core.config import settings

api_router = APIRouter()
api_router.include_router(login.router)
api_router.include_router(users.router)
api_router.include_router(utils.router)
api_router.include_router(items.router)
api_router.include_router(analytics.router)


if settings.ENVIRONMENT == "local":
115 changes: 115 additions & 0 deletions backend/app/api/routes/analytics.py
@@ -0,0 +1,115 @@
from fastapi import APIRouter, HTTPException
import polars as pl
from typing import List
from pydantic import BaseModel

# OpenTelemetry Imports
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode # For setting span status

# Import duckdb for its specific error type
import duckdb

from app.core.analytics import query_duckdb, PARQUET_DATA_PATH # For logging the path, if needed
import logging # For logging

logger = logging.getLogger(__name__)
tracer = trace.get_tracer(__name__) # Initialize OpenTelemetry Tracer

router = APIRouter(prefix="/analytics", tags=["analytics"])

# Pydantic models for response structures
class UserItemCount(BaseModel):
    email: str
    item_count: int

class ActiveUser(BaseModel):
    user_id: int  # Added user_id based on the query
    email: str
    full_name: str | None = None  # Made full_name optional as it can be NULL
    item_count: int

@router.get("/items_by_user", response_model=List[UserItemCount])
def get_items_by_user():
"""
Retrieves a list of users and the count of items they own,
ordered by the number of items in descending order.
Only users who own at least one item are included.
"""
with tracer.start_as_current_span("analytics_items_by_user_handler") as span:
query = """
SELECT u.email, COUNT(i.item_id) AS item_count
FROM users u
JOIN items i ON u.user_id = i.owner_id
GROUP BY u.email
ORDER BY item_count DESC;
"""
span.set_attribute("analytics.query", query)
try:
logger.info("Executing query for items_by_user")
df: pl.DataFrame = query_duckdb(query) # query_duckdb is already traced
result = df.to_dicts()
span.set_attribute("response.items_count", len(result))
logger.info(f"Successfully retrieved {len(result)} records for items_by_user")
span.set_status(Status(StatusCode.OK))
return result
except ConnectionError as e: # Specific error from get_duckdb_connection if it fails
logger.error(f"ConnectionError in /items_by_user: {e}. Ensure Parquet files exist at {PARQUET_DATA_PATH} and are readable.", exc_info=True)
span.record_exception(e)
span.set_status(Status(StatusCode.ERROR, f"Analytics service connection error: {e}"))
raise HTTPException(status_code=503, detail=f"Analytics service unavailable: Database connection failed. {e}")
except duckdb.Error as e: # Catch DuckDB specific errors
logger.error(f"DuckDB query error in /items_by_user: {e}", exc_info=True)
span.record_exception(e)
span.set_status(Status(StatusCode.ERROR, f"Analytics query error: {e}"))
raise HTTPException(status_code=500, detail=f"Analytics query failed: {e}")
except Exception as e:
logger.error(f"Unexpected error in /items_by_user: {e}", exc_info=True)
# Log the PARQUET_DATA_PATH to help diagnose if it's a file not found issue from underlying module
logger.info(f"Current PARQUET_DATA_PATH for analytics module: {PARQUET_DATA_PATH}")
span.record_exception(e)
span.set_status(Status(StatusCode.ERROR, f"Unexpected error: {e}"))
raise HTTPException(status_code=500, detail=f"An unexpected error occurred while fetching items by user. {e}")

@router.get("/active_users", response_model=List[ActiveUser])
def get_active_users():
"""
Retrieves the top 10 most active users based on the number of items they own.
Users are ordered by item count in descending order.
Includes users who may not own any items (LEFT JOIN).
"""
with tracer.start_as_current_span("analytics_active_users_handler") as span:
# Query updated to match ActiveUser model: user_id, email, full_name, item_count
query = """
SELECT u.user_id, u.email, u.full_name, COUNT(i.item_id) AS item_count
FROM users u
LEFT JOIN items i ON u.user_id = i.owner_id
GROUP BY u.user_id, u.email, u.full_name -- Group by all selected non-aggregated columns
ORDER BY item_count DESC
LIMIT 10;
"""
span.set_attribute("analytics.query", query)
try:
logger.info("Executing query for active_users")
df: pl.DataFrame = query_duckdb(query) # query_duckdb is already traced
result = df.to_dicts()
span.set_attribute("response.users_count", len(result))
logger.info(f"Successfully retrieved {len(result)} records for active_users")
span.set_status(Status(StatusCode.OK))
return result
except ConnectionError as e:
logger.error(f"ConnectionError in /active_users: {e}. Ensure Parquet files exist at {PARQUET_DATA_PATH} and are readable.", exc_info=True)
span.record_exception(e)
span.set_status(Status(StatusCode.ERROR, f"Analytics service connection error: {e}"))
raise HTTPException(status_code=503, detail=f"Analytics service unavailable: Database connection failed. {e}")
except duckdb.Error as e: # Catch DuckDB specific errors
logger.error(f"DuckDB query error in /active_users: {e}", exc_info=True)
span.record_exception(e)
span.set_status(Status(StatusCode.ERROR, f"Analytics query error: {e}"))
raise HTTPException(status_code=500, detail=f"Analytics query failed: {e}")
except Exception as e:
logger.error(f"Unexpected error in /active_users: {e}", exc_info=True)
logger.info(f"Current PARQUET_DATA_PATH for analytics module: {PARQUET_DATA_PATH}")
span.record_exception(e)
span.set_status(Status(StatusCode.ERROR, f"Unexpected error: {e}"))
raise HTTPException(status_code=500, detail=f"An unexpected error occurred while fetching active users. {e}")