feat: Integrate DuckDB/Polars analytics stack and OpenTelemetry #1635
Integrates a new analytics pipeline using DuckDB and Polars, with Parquet files as an intermediate data store. This lets complex analytical queries run against a specialized columnar stack instead of hitting the primary transactional database (Postgres) directly.
Key changes include:

**ETL Process:** A new script (`backend/app/scripts/export_to_parquet.py`) extracts data from the transactional `User` and `Item` tables in Postgres and exports it to Parquet files (`users_analytics.parquet`, `items_analytics.parquet`). The script is intended for periodic execution (e.g., nightly).

**Analytics Core:** A new module (`backend/app/core/analytics.py`) manages DuckDB connections and uses Polars to query data from the Parquet files. DuckDB is configured to read these files as external tables.

**New Analytics API Endpoints:**
- `GET /api/v1/analytics/items_by_user`: Returns item counts grouped by user.
- `GET /api/v1/analytics/active_users`: Returns the most active users by item creation.

These endpoints use the DuckDB/Polars stack for data retrieval.

**OpenTelemetry Integration:**
**Configuration:** New configuration options (`PARQUET_DATA_PATH`, `SERVICE_NAME`) added to `backend/app/core/config.py`.

**Testing:** Comprehensive unit tests added for the ETL script, analytics module, and API endpoints.
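A minimal, stdlib-only sketch of what the two new settings could look like (the defaults here are invented for illustration; the actual implementation in `backend/app/core/config.py` may use pydantic settings instead):

```python
import os
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AnalyticsSettings:
    # Directory holding the exported Parquet files (default is an assumption).
    PARQUET_DATA_PATH: str = field(
        default_factory=lambda: os.environ.get("PARQUET_DATA_PATH", "data/parquet")
    )
    # Logical service name reported to OpenTelemetry (default is an assumption).
    SERVICE_NAME: str = field(
        default_factory=lambda: os.environ.get("SERVICE_NAME", "backend")
    )
```

Reading from the environment with explicit defaults keeps the analytics path configurable per deployment without touching code.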
**Documentation:** Updated `README.md` and `backend/README.md` with details on the new architecture, setup, and usage.

This change keeps Postgres and SQLModel for existing transactional operations (users, auth, items) while offloading analytics to a more specialized stack.