Integrates a new analytics pipeline using DuckDB and Polars, with Parquet
files as an intermediate data store. This makes complex analytical queries
faster than running them directly against the primary transactional
database (Postgres).

Key changes include:

- **ETL Process**: A new script (`backend/app/scripts/export_to_parquet.py`)
  extracts data from the transactional `User` and `Item` tables in Postgres and
  exports it to Parquet files (`users_analytics.parquet`, `items_analytics.parquet`).
  The script is intended for periodic execution (e.g., nightly); a sketch of the
  export step follows this list.

- **Analytics Core**: A new module (`backend/app/core/analytics.py`)
  manages DuckDB connections and uses Polars to query data from the
  Parquet files. DuckDB is configured to read these files as external
  tables (see the query sketch after this list).

- **New Analytics API Endpoints**:
  - `GET /api/v1/analytics/items_by_user`: Returns item counts grouped by user.
  - `GET /api/v1/analytics/active_users`: Returns the most active users by item creation.
  These endpoints use the DuckDB/Polars stack for data retrieval; a route sketch
  follows this list.

- **OpenTelemetry Integration**:
  - Added OpenTelemetry for distributed tracing.
  - Includes auto-instrumentation for FastAPI and SQLAlchemy.
  - Custom tracing implemented for the ETL script, the core analytics module,
    and the new analytics API routes.
  - Traces are currently exported to the console (see the setup sketch after this list).

- **Configuration**: New configuration options (`PARQUET_DATA_PATH`, `SERVICE_NAME`)
  added to `backend/app/core/config.py`; a sketch follows this list.

- **Testing**: Comprehensive unit tests added for the ETL script,
  analytics module, and API endpoints.

- **Documentation**: Updated `README.md` and `backend/README.md` with details
  on the new architecture, setup, and usage.
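
A minimal sketch of the export step, assuming Polars pulls the tables straight
from Postgres via ConnectorX; the queries, environment variables, and defaults
are illustrative, not the script's actual contents:

```python
import os

import polars as pl

# Assumed environment variables; the real script would read its settings
# from backend/app/core/config.py.
POSTGRES_URI = os.environ["DATABASE_URI"]
PARQUET_DATA_PATH = os.environ.get("PARQUET_DATA_PATH", "./data")


def export_table(query: str, filename: str) -> None:
    """Snapshot one transactional table into a columnar Parquet file."""
    df = pl.read_database_uri(query=query, uri=POSTGRES_URI)
    df.write_parquet(os.path.join(PARQUET_DATA_PATH, filename))


if __name__ == "__main__":
    # Nightly snapshot of the two tables the analytics endpoints need.
    export_table('SELECT * FROM "user"', "users_analytics.parquet")
    export_table("SELECT * FROM item", "items_analytics.parquet")
```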
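
The analytics core presumably works along these lines: DuckDB scans the Parquet
files in place via `read_parquet()` and hands the result back as a Polars
DataFrame. The join key (`owner_id`) and the function name are assumptions:

```python
import duckdb
import polars as pl


def items_by_user(parquet_dir: str) -> pl.DataFrame:
    """Count items per user by joining the two Parquet snapshots."""
    # In-memory connection; the Parquet files act as external tables.
    con = duckdb.connect()
    query = f"""
        SELECT u.id AS user_id, COUNT(i.id) AS item_count
        FROM read_parquet('{parquet_dir}/users_analytics.parquet') AS u
        LEFT JOIN read_parquet('{parquet_dir}/items_analytics.parquet') AS i
            ON i.owner_id = u.id
        GROUP BY u.id
        ORDER BY item_count DESC
    """
    return con.execute(query).pl()  # DuckDB result -> Polars DataFrame
```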
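
Exposing that through FastAPI could be as simple as the route below; only the
endpoint path comes from this PR, while the helper name and settings access are
assumptions:

```python
from fastapi import APIRouter

from app.core.analytics import items_by_user  # hypothetical helper name
from app.core.config import settings

router = APIRouter(prefix="/api/v1/analytics")


@router.get("/items_by_user")
def read_items_by_user() -> list[dict]:
    # Served from the Parquet snapshots via DuckDB/Polars, not Postgres.
    df = items_by_user(settings.PARQUET_DATA_PATH)
    return df.to_dicts()  # Polars rows -> JSON-serializable dicts
```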
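
The tracing setup, sketched with the standard `opentelemetry-sdk` APIs; the
service name, engine URI, and placement in the app lifecycle are assumptions:

```python
from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor
from sqlalchemy import create_engine

# Spans go to the console for now, as described above.
provider = TracerProvider(resource=Resource.create({"service.name": "backend"}))
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

app = FastAPI()
engine = create_engine("postgresql://localhost/app")  # illustrative URI

# Auto-instrumentation for HTTP routes and DB calls.
FastAPIInstrumentor.instrument_app(app)
SQLAlchemyInstrumentor().instrument(engine=engine)

# Custom spans, e.g. around the ETL export:
tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("export_to_parquet"):
    pass  # run the export here
```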
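
And the two new settings, assuming the project's usual pydantic-settings
pattern (the defaults shown are illustrative):

```python
from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    PARQUET_DATA_PATH: str = "./data"  # where the ETL script writes Parquet files
    SERVICE_NAME: str = "backend"      # reported as service.name in traces


settings = Settings()
```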

This change keeps Postgres and SQLModel for existing transactional operations
(users, auth, items) while offloading analytics to a more specialized stack.