Retrieval-Augmented Generation backend for compliance teams. Upload PDF evidence, embed it into a Supabase-backed vector store, audit responses with OpenAI (standard and streaming), and track gaps, sessions, and reports in one FastAPI service. Security features include hardened uploads, rate limiting, idempotency, and detailed audit logging.
- Compliance-ready RAG pipeline: ingest policies, procedures, and evidence; question them in seconds.
- Security-first ingestion: ClamAV scanning, strict validation, idempotent uploads, and per-IP/user throttling.
- Managed Supabase vector store with metadata filters, compliance domains, and audit session scoping.
- Conversation-aware Q&A with streaming support, persisted history, and audit session linkage.
- Role-based access (JWT) plus comprehensive audit logs, compliance gaps, and executive summaries out of the box.
- FastAPI + Uvicorn for the ASGI web layer with rich OpenAPI docs.
- LangChain + OpenAI for embeddings, retrieval orchestration, and answer generation.
- Supabase (Postgres + pgvector) accessed via supabase-py/PostgREST for documents and metadata.
- SQLModel + Pydantic for strongly typed entities, schemas, and validation.
- SlowAPI, custom middleware, and structured logging for rate limiting, observability, and security hardening.
- Optional ClamAV (`clamd`) and PikePDF for file scanning and PDF introspection.
- Python 3.13+ (virtual environment recommended).
- Supabase project (or Postgres with pgvector enabled) containing the tables below.
- OpenAI API key with access to the chosen chat and embedding models.
- Optional: ClamAV daemon reachable via `clamd` for malware scanning during ingestion.
Create a `.env` file in the repository root. Defaults in `config/config.py` apply when a variable is omitted.
```env
# Supabase API
SUPABASE_URL=your_supabase_url
SUPABASE_KEY=your_supabase_service_role_key
SUPABASE_TABLE_DOCUMENTS=documents
SUPABASE_TABLE_CHAT_HISTORY=chat_history
SUPABASE_TABLE_PDF_INGESTION=pdf_ingestion
SUPABASE_TABLE_COMPLIANCE_DOMAINS=compliance_domains
SUPABASE_TABLE_COMPLIANCE_GAPS=compliance_gaps
SUPABASE_TABLE_AUDIT_SESSIONS=audit_sessions
SUPABASE_TABLE_USERS=users
SUPABASE_TABLE_AUDIT_REPORTS=audit_reports
SUPABASE_TABLE_AUDIT_REPORT_VERSIONS=audit_report_versions
SUPABASE_TABLE_AUDIT_REPORT_DISTRIBUTIONS=audit_report_distributions
SUPABASE_TABLE_AUDIT_SESSION_PDF_INGESTIONS=audit_session_pdf_ingestions
SUPABASE_TABLE_AUDIT_LOG=audit_log
SUPABASE_TABLE_ISO_CONTROLS=iso_controls

# OpenAI
OPENAI_API_KEY=sk-...
OPENAI_MODEL=gpt-3.5-turbo
EMBEDDING_MODEL=text-embedding-ada-002

# Authentication / Roles
JWT_SECRET_KEY=your_jwt_secret
JWT_ALGORITHM=HS256
ACCESS_TOKEN_EXPIRE_MINUTES=30
REFRESH_TOKEN_EXPIRE_DAYS=7
VALID_USER_ROLES=admin,compliance_officer,reader  # optional override
DEFAULT_USER_ROLE=reader

# RAG / Ingestion
TOP_K=5
PDF_DIR=pdfs/
REPORTS_DIR=reports/
PDF_QUARANTINE_DIR=/tmp/pdf_quarantine
RATE_LIMIT_ENABLED=true
RATE_LIMIT_STORAGE_URI=redis://localhost:6379/0  # optional if using shared limiter backend

# Optional hardening
# RATE_LIMIT_ENABLED=false  # disable SlowAPI limiter (not recommended)
# PDF_QUARANTINE_DIR=/secure/quarantine
# Add other feature flags as needed.
```

Enable pgvector (and pgcrypto for UUID generation), then create the core tables. Adjust names to match your environment and keep vector dimensions aligned with the embedding model (e.g., 1536 for `text-embedding-ada-002`).
```sql
-- Extensions
create extension if not exists vector;
create extension if not exists pgcrypto;

-- Vectorized document chunks
create table if not exists public.documents (
  id uuid primary key default gen_random_uuid(),
  content text not null,
  embedding vector(1536) not null,
  compliance_domain text,
  document_version text,
  document_tags text[] default array[]::text[],
  approval_status text,
  source_filename text,
  source_page_number integer,
  chunk_index integer,
  uploaded_by uuid,
  approved_by uuid,
  metadata jsonb not null default '{}'::jsonb,
  created_at timestamptz not null default now(),
  updated_at timestamptz not null default now()
);
create index if not exists documents_embedding_idx on public.documents using ivfflat (embedding vector_cosine_ops);
create index if not exists documents_domain_idx on public.documents (compliance_domain);

-- Chat history with conversation linkage
create table if not exists public.chat_history (
  id bigserial primary key,
  conversation_id uuid not null,
  question text not null,
  answer text not null,
  audit_session_id uuid,
  compliance_domain text,
  source_document_ids uuid[] default array[]::uuid[],
  match_threshold numeric(5,4),
  match_count integer,
  user_id uuid,
  total_tokens_used integer,
  response_time_ms integer,
  metadata jsonb not null default '{}'::jsonb,
  created_at timestamptz not null default now()
);
create index if not exists chat_history_conversation_idx on public.chat_history (conversation_id, created_at desc);

-- PDF ingestion metadata
create table if not exists public.pdf_ingestion (
  id uuid primary key default gen_random_uuid(),
  filename text not null,
  metadata jsonb not null default '{}'::jsonb,
  ingested_at timestamptz not null default now()
);

-- Compliance catalog tables (minimal columns shown; extend as needed)
create table if not exists public.compliance_domains (
  id uuid primary key default gen_random_uuid(),
  name text not null unique,
  description text,
  created_at timestamptz not null default now()
);

create table if not exists public.audit_sessions (
  id uuid primary key default gen_random_uuid(),
  user_id uuid not null,
  session_name text not null,
  compliance_domain text not null,
  is_active boolean not null default true,
  total_queries integer not null default 0,
  session_summary text,
  audit_report text,
  started_at timestamptz not null default now(),
  ended_at timestamptz,
  ip_address text,
  user_agent text,
  metadata jsonb not null default '{}'::jsonb,
  created_at timestamptz not null default now(),
  updated_at timestamptz not null default now()
);

create table if not exists public.compliance_gaps (
  id uuid primary key default gen_random_uuid(),
  audit_session_id uuid not null references public.audit_sessions(id) on delete cascade,
  user_id uuid not null,
  gap_type text not null,
  gap_category text,
  gap_title text not null,
  gap_description text,
  original_question text,
  chat_history_id uuid,
  pdf_ingestion_id uuid,
  detection_method text,
  confidence_score numeric(5,4),
  risk_level text,
  business_impact text,
  status text not null default 'identified',
  assigned_to uuid,
  due_date timestamptz,
  recommendation_text text,
  recommended_actions text[] default array[]::text[],
  related_documents text[] default array[]::text[],
  detected_at timestamptz not null default now(),
  resolution_notes text,
  auto_generated boolean not null default true,
  ip_address text,
  user_agent text,
  session_context jsonb not null default '{}'::jsonb,
  metadata jsonb not null default '{}'::jsonb
);

create table if not exists public.audit_log (
  id uuid primary key default gen_random_uuid(),
  object_type text not null,
  object_id text not null,
  action text not null,
  user_id uuid not null,
  audit_session_id uuid,
  compliance_domain text,
  performed_at timestamptz not null default now(),
  ip_address text,
  user_agent text,
  details jsonb not null default '{}'::jsonb,
  risk_level text,
  tags text[] default array[]::text[]
);

create table if not exists public.iso_controls (
  id uuid primary key default gen_random_uuid(),
  control_id text not null,
  title text not null,
  description text,
  metadata jsonb not null default '{}'::jsonb,
  created_at timestamptz not null default now()
);
```

Dimension note: if you swap embeddings (e.g., move to `text-embedding-3-large` with 3072 dims), update `vector(<dims>)`, rebuild `documents_embedding_idx`, and re-embed stored vectors.
- Create a virtual environment: `python3 -m venv .venv && source .venv/bin/activate`.
- Install the project (and dependencies): `pip install -e .` for runtime, `pip install -e ".[dev]"` for tooling.
- Populate `.env` and ensure the Supabase/Postgres schema exists.
- Start the API with `make run` or `uvicorn app:app --reload`.
- Visit `http://localhost:8000/docs` for interactive OpenAPI documentation.
- `POST /v1/auth/signup` / `POST /v1/auth/login` — user registration, login, and token refresh.
- `POST /v1/ingestions/upload` — secure PDF upload with malware scanning, metadata, and embedding ingestion.
- `GET /v1/documents` — paginated document listing with rich filtering and tag helpers.
- `POST /v1/rag/query` — non-streaming compliance Q&A with audit session and domain filters.
- `POST /v1/rag/query-stream` — streaming Q&A; response headers include `x-conversation-id` for chat continuity.
- `GET /v1/history/{conversation_id}` — retrieve prior Q&A turns for a conversation.
- `GET /v1/compliance-gaps`, `POST /v1/compliance-gaps` — manage detected compliance gaps and recommendations.
- `GET /v1/audit-sessions` / `POST /v1/audit-sessions` — create and monitor audit sessions and document links.
- `GET /v1/executive-summary`, `/v1/threat-intelligence`, `/v1/risk-prioritization`, `/v1/target-audience` — AI-generated compliance summaries and insights for stakeholders.
Upload a PDF (requires a JWT bearer token and an `Idempotency-Key` header):

```bash
curl -X POST http://localhost:8000/v1/ingestions/upload \
  -H "Authorization: Bearer $TOKEN" \
  -H "Idempotency-Key: $(uuidgen)" \
  -F "file=@/path/to/policy.pdf" \
  -F "compliance_domain=ISO27001" \
  -F "document_tags=reference_document,current"
```

Ask a compliance question (non-streaming):
```bash
curl -X POST http://localhost:8000/v1/rag/query \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{
    "question": "What controls cover incident response?",
    "compliance_domain": "ISO27001",
    "match_count": 5
  }'
```

Ask a question with streaming output (capture the `x-conversation-id` response header):
```bash
curl -N -i -X POST http://localhost:8000/v1/rag/query-stream \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -H "Idempotency-Key: $(uuidgen)" \
  -d '{
    "question": "Summarize open compliance gaps",
    "conversation_id": null,
    "match_threshold": 0.7
  }'
```
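Because the `Idempotency-Key` is generated client-side, a retry with the same key must not duplicate work. A minimal sketch of the server-side semantics, using a hypothetical in-memory cache (the service's actual implementation and storage backend are not shown in this README):

```python
import uuid

# Hypothetical cache; a real service would use durable storage with a TTL.
_idempotency_cache: dict[str, dict] = {}

def handle_upload(idempotency_key: str, filename: str) -> dict:
    """Return the cached result for a repeated key instead of re-ingesting."""
    if idempotency_key in _idempotency_cache:
        return _idempotency_cache[idempotency_key]
    result = {"ingestion_id": str(uuid.uuid4()), "filename": filename}
    _idempotency_cache[idempotency_key] = result
    return result

key = str(uuid.uuid4())
first = handle_upload(key, "policy.pdf")
retry = handle_upload(key, "policy.pdf")  # same key, so same ingestion_id
```

This is why the curl examples call `uuidgen` per request: reuse a key only when you intend a retry of the same operation.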
```text
├── app.py            # FastAPI application, middleware, router wiring
├── api/              # Route definitions (auth, rag, ingestion, audit, compliance, etc.)
├── auth/             # JWT handling, decorators, and token schemas
├── common/           # Logging, exceptions, validation, responses, middleware
├── config/           # Pydantic settings, CORS configuration, tagging metadata
├── db/               # Supabase client factory
├── entities/         # Domain models used across services and repositories
├── repositories/     # Supabase-backed repositories for each resource
├── services/         # Business logic (RAG, ingestion, audit, summaries, risk, reports)
├── adapters/         # Integration helpers and external service adapters
├── tools/            # CLI utilities and maintenance scripts
├── Makefile          # Convenience commands (e.g., `make run`)
├── pyproject.toml    # Project metadata and dependency definitions
└── README.md
```
- CORS origins: update the allowed origins list in `config/cors.py` for your frontend domains.
- Chunking & embeddings: adjust chunk size/overlap and embedding settings in `services/ingestion_service.py` and `services/vector_store.py`.
- Model selection: override `OPENAI_MODEL` / `EMBEDDING_MODEL` via `.env`; ensure the embedding dimension matches the database schema.
- Retrieval hyperparameters: tweak `TOP_K`, `match_threshold`, `match_count`, and tag filters per use case.
- Table overrides: change Supabase table names via the corresponding `SUPABASE_TABLE_*` environment variables.
- Security policies: align Supabase RLS policies with the API's role model; a service-role key is recommended for server-side usage.
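Chunk size and overlap trade retrieval precision against context continuity across chunk boundaries. A simplified sketch of overlap-based chunking; the actual splitter in `services/ingestion_service.py` may differ (e.g., a LangChain recursive character splitter that respects sentence boundaries):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks, each overlapping the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far each chunk's start advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # last chunk reached the end of the text
    return chunks
```

Larger overlap means a fact near a boundary appears in two chunks (better recall), at the cost of more stored vectors and embedding tokens.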
- Add new endpoints: create routers under `api/` and include them in `app.py`; define request/response models in `services/schemas.py`.
- Business logic: extend services in `services/` and inject them via dependency providers in `dependencies.py`.
- Repositories: implement Supabase calls in `repositories/`; leverage the base repository for filtering and pagination.
- Migrations: adopt Alembic or Supabase migration tooling to evolve tables; track schema changes alongside application code.
- Testing: factor logic into services; write async-friendly tests with pytest, using mocked Supabase/OpenAI clients.
- Observability: enrich structured logs in `common/logging.py` or hook in tracing/metrics (e.g., OpenTelemetry) as needed.
- Supabase connectivity: run `services.db_check.check_database_connection()` (exposed via the health endpoint) to confirm API credentials and table names.
- Vector search mismatches: ensure stored embeddings and query embeddings share the same model/dimension; re-index after schema changes.
- Upload failures: verify ClamAV (`clamd`) is accessible, keep PDFs under 50 MB, and provide a valid `Idempotency-Key` header.
- Rate limit hits: SlowAPI enforces defaults (`200/minute` overall, `10/minute` on RAG); tune `RATE_LIMIT_*` env vars or provide Redis storage for clusters.
- RLS denials: service-role keys bypass RLS; otherwise add Supabase policies allowing the API role to read/write the required tables.
- Streaming stalls: check reverse proxies and ensure clients use `curl -N` or equivalent to keep the connection open.
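Rate limits like `10/minute` are easiest to reason about as a token bucket: each request spends a token, and tokens refill continuously. This sketch shows the mechanism only; SlowAPI's internals differ, and the numbers here are illustrative:

```python
import time

class TokenBucket:
    """Allow `capacity` requests per `period` seconds, refilled continuously."""

    def __init__(self, capacity: int, period: float):
        self.capacity = capacity
        self.refill_rate = capacity / period  # tokens added per second
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller should return HTTP 429

# Roughly the "10/minute" default on the RAG endpoints
bucket = TokenBucket(capacity=10, period=60.0)
```

In a multi-worker deployment each process would hold its own bucket, which is why the troubleshooting note above suggests Redis-backed limiter storage for clusters.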