Professional full-stack search and chat application for the DVCon proceedings archive.
- Downloads DVCon paper PDFs into `data/paper/`
- Extracts markdown, images, and metadata into `data/`
- Enriches title, abstract, authors, affiliations, and bibliography with a local GROBID sidecar by default
- Indexes the corpus for keyword and semantic retrieval
- Supports paper-scoped chat with the OpenAI Responses API
- Provides PDF, markdown, metadata graph, and chat workflows in a React web UI
```bash
./scripts/start_backend.sh
```

`start_backend.sh` now brings up the local GROBID sidecar automatically before starting FastAPI. `start_grobid.sh` waits for the GROBID liveness endpoint on port 8070 before returning.
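The wait-for-liveness behavior of `start_grobid.sh` can be sketched as a simple poll loop. This is an illustrative Python sketch, not the script itself; the `/api/isalive` route is an assumption about the GROBID health endpoint, so adjust the URL to whatever your sidecar actually exposes:

```python
import time
import urllib.request
from typing import Callable

def http_probe(url: str) -> bool:
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

def wait_for_grobid(url: str = "http://127.0.0.1:8070/api/isalive",
                    timeout_s: float = 60.0,
                    interval_s: float = 1.0,
                    probe: Callable[[str], bool] = http_probe) -> bool:
    """Poll the liveness endpoint until it responds or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe(url):
            return True
        time.sleep(interval_s)
    return False
```

The probe is injectable so the retry logic can be tested without a running sidecar.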
Start the frontend:

```bash
./scripts/start_frontend.sh
```

Or start everything together:

```bash
./scripts/start_all.sh
```

On Windows:

```powershell
.\scripts\start_all.ps1
```

Both `start_backend.ps1` and `start_all.ps1` start the local GROBID sidecar automatically.
Start only the GROBID sidecar:

```bash
docker compose up -d grobid
```

The sidecar exposes:

- http://127.0.0.1:8070 for the main GROBID API
- http://127.0.0.1:8071 for the admin/health port
Run both the app container and the GROBID sidecar together:
```bash
docker compose up --build
```

This is now the default container runtime path. The app service reads `.env`, mounts `${DATA_DIR:-data}` into `/app/${DATA_DIR:-data}`, waits for GROBID readiness before starting, and points `GROBID_URL` at the internal `grobid` service automatically.
By default, Docker Compose publishes the app on http://127.0.0.1:8011 so it does not collide with the existing local backend on 8010. You can override this with APP_HOST_PORT.
If your Docker installation uses the legacy CLI, `docker-compose up --build` is equivalent.
Build the image:
```bash
docker build -t dvcon-paper-rag .
```

Run the container:

```bash
docker run --rm -p 8000:8000 --env-file .env dvcon-paper-rag
```

Then open http://localhost:8000.
If you want the app container to use a host-managed GROBID sidecar instead of Compose, add:
```bash
docker run --rm -p 8000:8000 --env-file .env -e GROBID_URL=http://host.docker.internal:8070 dvcon-paper-rag
```

Run a small test ingest:
```bash
uv run --project backend ingest --limit 5
```

The ingestion pipeline always produces markdown and extracted images through PyMuPDF / pymupdf4llm. When GROBID is enabled and reachable, it additionally enriches:
- title
- abstract
- structured authors
- affiliations
- bibliography / references
Raw TEI XML is stored at `data/tei/{year}/{location}/{slug}.tei.xml`.
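The stored TEI files can be inspected with the standard library alone. A minimal sketch that pulls the title and abstract out of a GROBID-style TEI document — the element paths and the TEI namespace below reflect typical GROBID output, but verify them against your own `.tei.xml` files:

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def tei_title_and_abstract(tei_xml: str) -> tuple[str, str]:
    """Extract the paper title and abstract text from a TEI document string."""
    root = ET.fromstring(tei_xml)
    title_el = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    abstract_el = root.find(".//tei:abstract", TEI_NS)
    title = "".join(title_el.itertext()).strip() if title_el is not None else ""
    # Collapse internal whitespace left over from pretty-printed XML.
    abstract = " ".join("".join(abstract_el.itertext()).split()) if abstract_el is not None else ""
    return title, abstract

sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc><titleStmt><title>UVM Tricks</title></titleStmt></fileDesc>
    <profileDesc><abstract><p>We present verification tricks.</p></abstract></profileDesc>
  </teiHeader>
</TEI>"""
print(tei_title_and_abstract(sample))
```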
Copy `.env.example` to `.env` and provide:

- OPENAI_BASE_URL
- OPENAI_API_KEY
- OPENAI_CHAT_MODEL
- APP_HOST_PORT
- DATA_DIR
- GROBID_ENABLED
- GROBID_URL
- GROBID_TIMEOUT_SECONDS
- LOCAL_EMBEDDING_MODEL
- LOCAL_EMBEDDING_DEVICE
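A sketch of what a filled-in `.env` might look like — the values here are illustrative placeholders, not the repo's defaults (check `.env.example` for those); `GROBID_TIMEOUT_SECONDS=30` in particular is an assumed value:

```ini
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_API_KEY=sk-...
OPENAI_CHAT_MODEL=gpt-5-mini
APP_HOST_PORT=8011
DATA_DIR=data
GROBID_ENABLED=true
GROBID_URL=http://127.0.0.1:8070
GROBID_TIMEOUT_SECONDS=30
LOCAL_EMBEDDING_MODEL=BAAI/bge-m3
LOCAL_EMBEDDING_DEVICE=cuda
```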
Semantic search uses a local sentence-transformer model, not the OpenAI API, and will prefer CUDA when available.
The default local embedding model in the repo config is BAAI/bge-m3.
The default chat model is now gpt-5-mini.
GROBID is enabled by default. If it is disabled or unavailable, the extractor falls back to the existing heuristic metadata path and still writes markdown and images normally.
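The enable/fallback behavior can be sketched as follows. The function names are illustrative, not the repo's actual API; the point is only the control flow — try GROBID when enabled, and fall through to the heuristic path on any failure:

```python
from typing import Callable

def extract_metadata(pdf_path: str,
                     grobid_enabled: bool,
                     grobid_extract: Callable[[str], dict],
                     heuristic_extract: Callable[[str], dict]) -> dict:
    """Prefer GROBID enrichment; fall back to heuristics when disabled or failing."""
    if grobid_enabled:
        try:
            return grobid_extract(pdf_path)
        except Exception:
            pass  # sidecar down or timed out: fall through to heuristics
    return heuristic_extract(pdf_path)
```

Markdown and image extraction are unaffected by which branch runs; only the metadata enrichment differs.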
The current local test corpus was reset and rebuilt from scratch with 10 indexed papers from event year 2025, then later extended with 8 Horace Chan papers through 2022.
The repo also now includes a checked-in sample corpus under data.example/ containing the 8 Horace Chan papers plus their extracted PDFs, markdown, TEI, and image assets.
If you switch embedding models on an existing corpus, run a forced ingest so Chroma is rebuilt for the new vector dimension.
If the app starts acting like a hardware bug disguised as a software bug, open your favorite AI detective and make it explain itself. Use Cursor, GitHub Copilot, Claude Code, or Codex, but only use the latest and greatest model; don't waste your time on an inferior, cheaper model.
Please see CONTRIBUTION.md for contribution expectations, issue filing, and the fork + pull request workflow.
