Professional full-stack search and chat application for the DVCon proceedings archive.
- Downloads DVCon paper PDFs into `data/paper/`
- Extracts markdown, images, and metadata into `data/`
- Enriches title, abstract, authors, affiliations, and bibliography with a local GROBID sidecar by default
- Indexes the corpus for keyword and semantic retrieval
- Supports paper-scoped chat with the OpenAI Responses API
- Provides PDF, markdown, metadata graph, and chat workflows in a React web UI
```bash
./scripts/start_backend.sh
```

`start_backend.sh` now brings up the local GROBID sidecar automatically before starting FastAPI. `start_grobid.sh` waits for the GROBID liveness endpoint on port 8070 before returning.
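The wait-for-liveness behavior of `start_grobid.sh` can be sketched as a simple poll loop. This is an illustrative Python sketch, not the script itself; the `/api/isalive` route is an assumption about the GROBID health endpoint, so adjust the URL to whatever your sidecar actually exposes:

```python
import time
import urllib.request
from typing import Callable

def http_probe(url: str) -> bool:
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

def wait_for_grobid(url: str = "http://127.0.0.1:8070/api/isalive",
                    timeout_s: float = 60.0,
                    interval_s: float = 1.0,
                    probe: Callable[[str], bool] = http_probe) -> bool:
    """Poll the liveness endpoint until it responds or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe(url):
            return True
        time.sleep(interval_s)
    return False
```

The probe is injectable so the retry logic can be tested without a running sidecar.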
Start the frontend:

```bash
./scripts/start_frontend.sh
```

Or start everything together:

```bash
./scripts/start_all.sh
```

On Windows:

```powershell
.\scripts\start_all.ps1
```

Both `start_backend.ps1` and `start_all.ps1` start the local GROBID sidecar automatically.
Start only the GROBID sidecar:

```bash
docker compose up -d grobid
```

The sidecar exposes:

- http://127.0.0.1:8070 for the main GROBID API
- http://127.0.0.1:8071 for the admin/health port
Run both the app container and the GROBID sidecar together:
```bash
docker compose up --build
```

This is now the default container runtime path. The app service reads `.env`, mounts `${DATA_DIR:-data}` into `/app/${DATA_DIR:-data}`, waits for GROBID readiness before starting, and points `GROBID_URL` at the internal `grobid` service automatically.
By default, Docker Compose publishes the app on http://127.0.0.1:8011 so it does not collide with the existing local backend on 8010. You can override this with APP_HOST_PORT.
If your Docker installation uses the legacy CLI, `docker-compose up --build` is equivalent.
Build the image:
```bash
docker build -t dvcon-paper-rag .
```

Run the container:

```bash
docker run --rm -p 8000:8000 --env-file .env dvcon-paper-rag
```

Then open http://localhost:8000.
If you want the app container to use a host-managed GROBID sidecar instead of Compose, add:
```bash
docker run --rm -p 8000:8000 --env-file .env -e GROBID_URL=http://host.docker.internal:8070 dvcon-paper-rag
```

Run a small test ingest:
```bash
uv run --project backend ingest --limit 5
```

The ingestion pipeline always produces markdown and extracted images through PyMuPDF / pymupdf4llm. When GROBID is enabled and reachable, it additionally enriches:
- title
- abstract
- structured authors
- affiliations
- bibliography / references
Raw TEI XML is stored at `data/tei/{year}/{location}/{slug}.tei.xml`.
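The stored TEI files can be inspected with the standard library alone. A minimal sketch that pulls the title and abstract out of a GROBID-style TEI document — the element paths and the TEI namespace below reflect typical GROBID output, but verify them against your own `.tei.xml` files:

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def tei_title_and_abstract(tei_xml: str) -> tuple[str, str]:
    """Extract the paper title and abstract text from a TEI document string."""
    root = ET.fromstring(tei_xml)
    title_el = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    abstract_el = root.find(".//tei:abstract", TEI_NS)
    title = "".join(title_el.itertext()).strip() if title_el is not None else ""
    # Collapse internal whitespace left over from pretty-printed XML.
    abstract = " ".join("".join(abstract_el.itertext()).split()) if abstract_el is not None else ""
    return title, abstract

sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc><titleStmt><title>UVM Tricks</title></titleStmt></fileDesc>
    <profileDesc><abstract><p>We present verification tricks.</p></abstract></profileDesc>
  </teiHeader>
</TEI>"""
print(tei_title_and_abstract(sample))
```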
Copy `.env.example` to `.env` and provide:

- OPENAI_BASE_URL
- OPENAI_API_KEY
- OPENAI_CHAT_MODEL
- APP_HOST_PORT
- DATA_DIR
- GROBID_ENABLED
- GROBID_URL
- GROBID_TIMEOUT_SECONDS
- LOCAL_EMBEDDING_MODEL
- LOCAL_EMBEDDING_DEVICE
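A sketch of what a filled-in `.env` might look like — the values here are illustrative placeholders, not the repo's defaults (check `.env.example` for those); `GROBID_TIMEOUT_SECONDS=30` in particular is an assumed value:

```ini
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_API_KEY=sk-...
OPENAI_CHAT_MODEL=gpt-5-mini
APP_HOST_PORT=8011
DATA_DIR=data
GROBID_ENABLED=true
GROBID_URL=http://127.0.0.1:8070
GROBID_TIMEOUT_SECONDS=30
LOCAL_EMBEDDING_MODEL=BAAI/bge-m3
LOCAL_EMBEDDING_DEVICE=cuda
```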
Semantic search uses a local sentence-transformer model, not the OpenAI API, and will prefer CUDA when available.
The default local embedding model in the repo config is BAAI/bge-m3.
The default chat model is now gpt-5-mini.
GROBID is enabled by default. If it is disabled or unavailable, the extractor falls back to the existing heuristic metadata path and still writes markdown and images normally.
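The enable/fallback behavior can be sketched as follows. The function names are illustrative, not the repo's actual API; the point is only the control flow — try GROBID when enabled, and fall through to the heuristic path on any failure:

```python
from typing import Callable

def extract_metadata(pdf_path: str,
                     grobid_enabled: bool,
                     grobid_extract: Callable[[str], dict],
                     heuristic_extract: Callable[[str], dict]) -> dict:
    """Prefer GROBID enrichment; fall back to heuristics when disabled or failing."""
    if grobid_enabled:
        try:
            return grobid_extract(pdf_path)
        except Exception:
            pass  # sidecar down or timed out: fall through to heuristics
    return heuristic_extract(pdf_path)
```

Markdown and image extraction are unaffected by which branch runs; only the metadata enrichment differs.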
The current local test corpus was reset and rebuilt from scratch with 10 indexed papers from event year 2025, then later extended with 8 Horace Chan papers through 2022.
The repo also now includes a checked-in sample corpus under data.example/ containing the 8 Horace Chan papers plus their extracted PDFs, markdown, TEI, and image assets.
If you switch embedding models on an existing corpus, run a forced ingest so Chroma is rebuilt for the new vector dimension.
If the app starts acting like a hardware bug disguised as a software bug, open your favorite AI detective and make it explain itself. Use Cursor, GitHub Copilot, Claude Code, or Codex, but only use the latest and greatest model; don't waste your time on an inferior, cheaper model.
Please see CONTRIBUTION.md for contribution expectations, issue filing, and the fork + pull request workflow.
