1 change: 1 addition & 0 deletions README.md
@@ -23,6 +23,7 @@
- 📞 [Traefik](https://traefik.io) as a reverse proxy / load balancer.
- 🚢 Deployment instructions using Docker Compose, including how to set up a frontend Traefik proxy to handle automatic HTTPS certificates.
- 🏭 CI (continuous integration) and CD (continuous deployment) based on GitHub Actions.
- 📊 **New Analytics Module**: Integrated analytics capabilities using DuckDB and Polars for efficient querying of exported data. Features an ETL process for data extraction and OpenTelemetry for performance tracing. For more details, see the [Backend README](./backend/README.md#analytics-module).

### Dashboard Login

97 changes: 97 additions & 0 deletions backend/README.md
@@ -1,5 +1,17 @@
# FastAPI Project - Backend

## Contents

- [Requirements](#requirements)
- [Docker Compose](#docker-compose)
- [General Workflow](#general-workflow)
- [VS Code](#vs-code)
- [Docker Compose Override](#docker-compose-override)
- [Backend tests](#backend-tests)
- [Migrations](#migrations)
- [Email Templates](#email-templates)
- [Analytics Module](#analytics-module)

## Requirements

* [Docker](https://www.docker.com/).
@@ -170,3 +182,88 @@ The email templates are in `./backend/app/email-templates/`. Here, there are two
Before continuing, ensure you have the [MJML extension](https://marketplace.visualstudio.com/items?itemName=attilabuti.vscode-mjml) installed in your VS Code.

Once you have the MJML extension installed, you can create a new email template in the `src` directory. After creating the new email template and with the `.mjml` file open in your editor, open the command palette with `Ctrl+Shift+P` and search for `MJML: Export to HTML`. This will convert the `.mjml` file to a `.html` file, which you can then save in the `build` directory.

## Analytics Module

This section details the integrated analytics capabilities, designed to provide insights into application data without impacting the performance of the primary transactional database.

### Architecture Overview

The analytics architecture employs a dual-database approach:

- **PostgreSQL**: Serves as the primary transactional database, handling real-time data for Users, Items, and other core application entities.
- **DuckDB with Polars**: Used as the analytical processing engine. Data is periodically moved from PostgreSQL to Parquet files, which are then queried efficiently by DuckDB. Polars is utilized for high-performance DataFrame manipulations where needed.

This separation ensures that complex analytical queries do not overload the operational database.
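
As a rough illustration of the approach (a minimal sketch, not the project's actual module; the file path follows the ETL output described below), DuckDB can scan the exported Parquet files directly and hand the result to Polars:

```python
import duckdb
import polars as pl

# In-process DuckDB connection; no separate database server is needed.
con = duckdb.connect()

# Aggregate straight off the Parquet export and fetch the result as a Polars DataFrame.
df: pl.DataFrame = con.execute(
    """
    SELECT owner_id, COUNT(*) AS item_count
    FROM read_parquet('backend/data/parquet/items_analytics.parquet')
    GROUP BY owner_id
    ORDER BY item_count DESC
    """
).pl()

print(df.head())
```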

### ETL Process

An ETL (Extract, Transform, Load) process is responsible for populating the analytical data store.

- **Script**: `backend/app/scripts/export_to_parquet.py`
- **Purpose**: This script extracts data from the main PostgreSQL database (specifically the `User` and `Item` tables) and saves it into Parquet files (`users_analytics.parquet`, `items_analytics.parquet`). Parquet format is chosen for its efficiency in analytical workloads.
- **Usage**: The script is designed to be run periodically (e.g., as a nightly batch job or via a scheduler like cron) to update the data available for analytics. To run the script manually (ensure your Python environment with backend dependencies is active, or run within the Docker container):
```bash
python backend/app/scripts/export_to_parquet.py
```
- **Output Location**: The Parquet files are stored in a directory specified by the `PARQUET_DATA_PATH` environment variable. The default location is `backend/data/parquet/`.
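
A hedged sketch of the export step follows (the actual script is `backend/app/scripts/export_to_parquet.py` and may differ in table names, settings, and error handling):

```python
import os

import polars as pl

# Both names are assumptions for illustration; the real script reads its
# configuration from the application settings.
PARQUET_DATA_PATH = os.getenv("PARQUET_DATA_PATH", "backend/data/parquet/")
DATABASE_URI = os.environ["DATABASE_URI"]  # e.g. postgresql://user:pass@db:5432/app

os.makedirs(PARQUET_DATA_PATH, exist_ok=True)

# Extract: pull the table from PostgreSQL into a Polars DataFrame
# (read_database_uri requires the connectorx or adbc engine).
users = pl.read_database_uri('SELECT * FROM "user"', DATABASE_URI)

# Load: write a columnar Parquet file for DuckDB to scan.
users.write_parquet(os.path.join(PARQUET_DATA_PATH, "users_analytics.parquet"))
```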

### Analytics API Endpoints

New API endpoints provide access to analytical insights. These are available under the `/api/v1/analytics` prefix:

- **`GET /api/v1/analytics/items_by_user`**:
- **Provides**: A list of users and the total count of items they own.
- **Details**: Only includes users who own at least one item. Results are ordered by the number of items in descending order.
- **Response Model**: `List[UserItemCount]` where `UserItemCount` includes `email: str` and `item_count: int`.

- **`GET /api/v1/analytics/active_users`**:
- **Provides**: The top 10 most active users, based on the number of items they own.
  - **Details**: Users are ordered by item count in descending order. The underlying query uses a `LEFT JOIN`, so users with no items are counted with an `item_count` of 0; in practice they only surface when fewer than 10 users own items.
- **Response Model**: `List[ActiveUser]` where `ActiveUser` includes `user_id: int`, `email: str`, `full_name: str | None`, and `item_count: int`.

These endpoints query the DuckDB instance, which reads from the Parquet files generated by the ETL script.
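
An illustrative client call (host and port are assumptions for a local run; the response shape follows the models above):

```python
import httpx

# The analytics routes as written do not require authentication.
resp = httpx.get("http://localhost:8000/api/v1/analytics/items_by_user")
resp.raise_for_status()

for row in resp.json():
    # Each row matches UserItemCount, e.g. {"email": "alice@example.com", "item_count": 12}
    print(row["email"], row["item_count"])
```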

### OpenTelemetry Tracing

OpenTelemetry has been integrated into the backend for enhanced observability:

- **Purpose**: To trace application performance and behavior, helping to identify bottlenecks and understand request flows.
- **Export**: Currently, traces are configured to be exported to the console. This is useful for development and debugging. For production, an appropriate OpenTelemetry collector and backend (e.g., Jaeger, Zipkin, Datadog) should be configured.
- **Coverage**:
- **Auto-instrumentation**: FastAPI and SQLAlchemy interactions are automatically instrumented, providing traces for API requests and database calls to the PostgreSQL database.
- **Custom Tracing**:
- The analytics module (`backend/app/core/analytics.py`) includes custom spans for DuckDB connection setup and query execution.
- The analytics API routes (`backend/app/api/routes/analytics.py`) have custom spans for their request handlers.
- The ETL script (`backend/app/scripts/export_to_parquet.py`) is instrumented with custom spans for its key operations (database extraction, Parquet file writing).
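
The custom-tracing pattern used in these modules boils down to the following (a minimal sketch mirroring the analytics routes; the function name is illustrative):

```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

def run_report() -> list[dict]:
    # Wrap the unit of work in a span and record success or failure on it.
    with tracer.start_as_current_span("analytics_report") as span:
        span.set_attribute("analytics.query", "SELECT 1")
        try:
            result = [{"ok": True}]  # placeholder for the real work
            span.set_status(Status(StatusCode.OK))
            return result
        except Exception as exc:
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR, str(exc)))
            raise
```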

### Key New Dependencies

The following main dependencies were added to support the analytics features:

- `duckdb`: An in-process analytical data management system.
- `polars`: A fast DataFrame library.
- `opentelemetry-api`: Core OpenTelemetry API.
- `opentelemetry-sdk`: OpenTelemetry SDK for configuring telemetry.
- `opentelemetry-exporter-otlp-proto-http`: OTLP exporter over HTTP (the console exporter is used by default in the current setup).
- `opentelemetry-instrumentation-fastapi`: Auto-instrumentation for FastAPI.
- `opentelemetry-instrumentation-sqlalchemy`: Auto-instrumentation for SQLAlchemy.
- `opentelemetry-instrumentation-psycopg2`: Auto-instrumentation for Psycopg2 (PostgreSQL driver).

Refer to `backend/pyproject.toml` for specific versions.

### New Configuration Options

The following environment variables can be set (e.g., in your `.env` file) to configure the analytics and OpenTelemetry features:

- **`PARQUET_DATA_PATH`**:
- **Description**: Specifies the directory where the ETL script saves Parquet files and where DuckDB reads them from.
- **Default**: `backend/data/parquet/`
- **`SERVICE_NAME`**:
- **Description**: Sets the service name attribute for OpenTelemetry traces. This helps in identifying and filtering traces in a distributed tracing system.
- **Default**: `fastapi-analytics-app` (Note: The ETL script appends "-etl-script" to this name for its traces).
- **`OTEL_EXPORTER_OTLP_ENDPOINT`** (Optional, for future use):
  - **Description**: If you configure an OTLP exporter (e.g., to send traces to Jaeger, Zipkin, or an OpenTelemetry Collector), this variable specifies its endpoint URL.
- **Default**: Not set (console exporter is used by default).

These settings are defined in `backend/app/core/config.py`.
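
For reference, the options map to settings fields roughly like this (an assumed shape; see `backend/app/core/config.py` for the actual definitions):

```python
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    # Defaults mirror the documentation above.
    PARQUET_DATA_PATH: str = "backend/data/parquet/"
    SERVICE_NAME: str = "fastapi-analytics-app"
    OTEL_EXPORTER_OTLP_ENDPOINT: str | None = None  # unset => console exporter is used
```
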
3 changes: 2 additions & 1 deletion backend/app/api/main.py
@@ -1,13 +1,14 @@
from fastapi import APIRouter

from app.api.routes import items, login, private, users, utils
from app.api.routes import items, login, private, users, utils, analytics
from app.core.config import settings

api_router = APIRouter()
api_router.include_router(login.router)
api_router.include_router(users.router)
api_router.include_router(utils.router)
api_router.include_router(items.router)
api_router.include_router(analytics.router)


if settings.ENVIRONMENT == "local":
115 changes: 115 additions & 0 deletions backend/app/api/routes/analytics.py
@@ -0,0 +1,115 @@
from fastapi import APIRouter, HTTPException
import polars as pl
from typing import List
from pydantic import BaseModel

# OpenTelemetry Imports
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode # For setting span status

# Import duckdb for its specific error type
import duckdb

from app.core.analytics import query_duckdb, PARQUET_DATA_PATH # For logging the path, if needed
import logging # For logging

logger = logging.getLogger(__name__)
tracer = trace.get_tracer(__name__) # Initialize OpenTelemetry Tracer

router = APIRouter(prefix="/analytics", tags=["analytics"])

# Pydantic models for response structures
class UserItemCount(BaseModel):
    email: str
    item_count: int

class ActiveUser(BaseModel):
    user_id: int  # Added user_id based on the query
    email: str
    full_name: str | None = None  # Made full_name optional as it can be NULL
    item_count: int

@router.get("/items_by_user", response_model=List[UserItemCount])
def get_items_by_user():
"""
Retrieves a list of users and the count of items they own,
ordered by the number of items in descending order.
Only users who own at least one item are included.
"""
with tracer.start_as_current_span("analytics_items_by_user_handler") as span:
query = """
SELECT u.email, COUNT(i.item_id) AS item_count
FROM users u
JOIN items i ON u.user_id = i.owner_id
GROUP BY u.email
ORDER BY item_count DESC;
"""
span.set_attribute("analytics.query", query)
try:
logger.info("Executing query for items_by_user")
df: pl.DataFrame = query_duckdb(query) # query_duckdb is already traced
result = df.to_dicts()
span.set_attribute("response.items_count", len(result))
logger.info(f"Successfully retrieved {len(result)} records for items_by_user")
span.set_status(Status(StatusCode.OK))
return result
except ConnectionError as e: # Specific error from get_duckdb_connection if it fails
logger.error(f"ConnectionError in /items_by_user: {e}. Ensure Parquet files exist at {PARQUET_DATA_PATH} and are readable.", exc_info=True)
span.record_exception(e)
span.set_status(Status(StatusCode.ERROR, f"Analytics service connection error: {e}"))
raise HTTPException(status_code=503, detail=f"Analytics service unavailable: Database connection failed. {e}")
except duckdb.Error as e: # Catch DuckDB specific errors
logger.error(f"DuckDB query error in /items_by_user: {e}", exc_info=True)
span.record_exception(e)
span.set_status(Status(StatusCode.ERROR, f"Analytics query error: {e}"))
raise HTTPException(status_code=500, detail=f"Analytics query failed: {e}")
except Exception as e:
logger.error(f"Unexpected error in /items_by_user: {e}", exc_info=True)
# Log the PARQUET_DATA_PATH to help diagnose if it's a file not found issue from underlying module
logger.info(f"Current PARQUET_DATA_PATH for analytics module: {PARQUET_DATA_PATH}")
span.record_exception(e)
span.set_status(Status(StatusCode.ERROR, f"Unexpected error: {e}"))
raise HTTPException(status_code=500, detail=f"An unexpected error occurred while fetching items by user. {e}")

@router.get("/active_users", response_model=List[ActiveUser])
def get_active_users():
"""
Retrieves the top 10 most active users based on the number of items they own.
Users are ordered by item count in descending order.
Includes users who may not own any items (LEFT JOIN).
"""
with tracer.start_as_current_span("analytics_active_users_handler") as span:
# Query updated to match ActiveUser model: user_id, email, full_name, item_count
query = """
SELECT u.user_id, u.email, u.full_name, COUNT(i.item_id) AS item_count
FROM users u
LEFT JOIN items i ON u.user_id = i.owner_id
GROUP BY u.user_id, u.email, u.full_name -- Group by all selected non-aggregated columns
ORDER BY item_count DESC
LIMIT 10;
"""
span.set_attribute("analytics.query", query)
try:
logger.info("Executing query for active_users")
df: pl.DataFrame = query_duckdb(query) # query_duckdb is already traced
result = df.to_dicts()
span.set_attribute("response.users_count", len(result))
logger.info(f"Successfully retrieved {len(result)} records for active_users")
span.set_status(Status(StatusCode.OK))
return result
except ConnectionError as e:
logger.error(f"ConnectionError in /active_users: {e}. Ensure Parquet files exist at {PARQUET_DATA_PATH} and are readable.", exc_info=True)
span.record_exception(e)
span.set_status(Status(StatusCode.ERROR, f"Analytics service connection error: {e}"))
raise HTTPException(status_code=503, detail=f"Analytics service unavailable: Database connection failed. {e}")
except duckdb.Error as e: # Catch DuckDB specific errors
logger.error(f"DuckDB query error in /active_users: {e}", exc_info=True)
span.record_exception(e)
span.set_status(Status(StatusCode.ERROR, f"Analytics query error: {e}"))
raise HTTPException(status_code=500, detail=f"Analytics query failed: {e}")
except Exception as e:
logger.error(f"Unexpected error in /active_users: {e}", exc_info=True)
logger.info(f"Current PARQUET_DATA_PATH for analytics module: {PARQUET_DATA_PATH}")
span.record_exception(e)
span.set_status(Status(StatusCode.ERROR, f"Unexpected error: {e}"))
raise HTTPException(status_code=500, detail=f"An unexpected error occurred while fetching active users. {e}")