A Federated Knowledge Core for astronomical research, decoupling semantic meaning from structural relationships to enable expert-level RAG and autonomous Deep Research agents.
This repository builds a specialized knowledge corpus from astronomical literature, designed to support Retrieval-Augmented Generation (RAG) for the DESI research portfolio. The system grounds LLM responses in verifiable scientific data, preserves citation topology, and enables multi-step research workflows through Claude Code and MCP integration.
Building a scientific RAG system extends far beyond document aggregation. Astronomical literature contains complex mathematical notation, specialized terminology, and a rapidly evolving research landscape. Most critically, an AI agent for scientific discovery cannot operate on ambiguous information; its knowledge must be traceable, accurate, and contextually rich.
Retrieval-Augmented Generation (RAG) addresses LLM hallucination by grounding responses in retrieved documents. Standard RAG retrieves semantically similar text chunks, but scientific questions often require understanding how papers relate: which work refutes another, who the key authors are, what foundational papers underpin a claim.
This system implements a Federated Knowledge Core that separates:
- What papers say (semantic content via embeddings)
- How papers connect (citation topology via graph)
- Where artifacts live (physical storage for reproducibility)
The architecture enables "Graph-Boosted Retrieval," where semantic search results are refined by citation topology. A query about "DESI void galaxy quenching" retrieves relevant chunks, then expands context to include highly-cited foundational papers that may not semantically match but are topologically indispensable.
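A minimal sketch of the graph-boosted step, assuming semantic hits arrive as bibcode and score pairs and that the citation graph is available as plain dictionaries. The names `Hit`, `graph_boost`, `max_expand`, and `boost` are illustrative and not part of this repository. The scoring rule (a discounted copy of the citing hit's score) is one plausible choice, not the project's actual ranking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    bibcode: str
    score: float  # semantic similarity, higher is better


def graph_boost(hits, references, citation_counts, max_expand=2, boost=0.5):
    """Add highly cited references of the semantic hits to the result set.

    hits: list of Hit objects from the vector search
    references: dict mapping a bibcode to the bibcodes it cites
    citation_counts: dict mapping a bibcode to its total citation count
    """
    already_retrieved = {h.bibcode for h in hits}
    candidates = {}
    for hit in hits:
        for ref in references.get(hit.bibcode, ()):
            if ref in already_retrieved:
                continue
            # A cited paper inherits a discounted score from its best citing hit.
            candidates[ref] = max(candidates.get(ref, 0.0), hit.score * boost)

    # Keep only the most cited candidates: the topologically important papers.
    most_cited = sorted(
        candidates, key=lambda b: citation_counts.get(b, 0), reverse=True
    )
    expanded = [Hit(b, candidates[b]) for b in most_cited[:max_expand]]
    return sorted(list(hits) + expanded, key=lambda h: h.score, reverse=True)
```

In this sketch a foundational paper can enter the results without matching the query text at all, since only its citation count and the score of the hit citing it are consulted.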
This corpus supports the radioastronomy.io DESI research portfolio:
| Project | Focus | Corpus Role |
|---|---|---|
| desi-cosmic-void-galaxies | Environmental quenching, ARD factory | Primary consumer, void science literature |
| desi-qso-anomaly-detection | ML anomaly detection on QSO spectra | QSO/AGN methodology papers |
| desi-quasar-outflows | AGN feedback and outflow energetics | Outflow physics literature |
Seed corpus focus: DESIVAST (void catalog methodology), central to all three projects.
The system decouples content from context, bridged by NASA ADS Bibcode as the universal key.
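To illustrate what using the bibcode as a universal key could look like, the sketch below joins the three layers on that key. `FederatedRecord`, `assemble`, and the dictionary-backed layers are hypothetical stand-ins for the PostgreSQL, Neo4j, and file storage backends; the real schemas are not shown here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FederatedRecord:
    bibcode: str
    chunks: List[str] = field(default_factory=list)  # semantic layer
    cites: List[str] = field(default_factory=list)   # topological layer
    artifact_path: Optional[str] = None              # physical layer


def assemble(bibcode, semantic, topology, storage):
    """Join the three layers on the bibcode; a missing layer yields an empty field."""
    return FederatedRecord(
        bibcode=bibcode,
        chunks=list(semantic.get(bibcode, [])),
        cites=list(topology.get(bibcode, [])),
        artifact_path=storage.get(bibcode),
    )
```

Because each layer is looked up independently, a paper that has been downloaded but not yet embedded still produces a valid, partially filled record.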
Data sources are prioritized by structure, fidelity, and reliability:
| Level | Source | Content | Fidelity |
|---|---|---|---|
| 1 | DESI, SIMBAD, VizieR | Structured catalog data | Ground truth |
| 2 | FITS Headers | Observational metadata | Instrument provenance |
| 3 | arXiv LaTeX | Clean text from source | High fidelity |
| 4 | PDF Extraction | Text from rendered documents | Best effort |
LaTeX-first extraction is critical. PDF-to-text conversion corrupts mathematical notation, mangles equations, and introduces OCR artifacts that poison the embedding space.
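The LaTeX-first rule can be sketched as a selection over the hierarchy above. The enum mirrors the four levels in the table; `SourceLevel`, `TEXT_SOURCES`, and `pick_text_source` are illustrative names, and the repository's extraction module may organize this differently.

```python
from enum import IntEnum


class SourceLevel(IntEnum):
    CATALOG = 1         # DESI, SIMBAD, VizieR
    FITS_HEADER = 2     # observational metadata
    ARXIV_LATEX = 3     # clean text from source
    PDF_EXTRACTION = 4  # best-effort fallback


# Only levels 3 and 4 yield paper text; levels 1 and 2 are structured data.
TEXT_SOURCES = (SourceLevel.ARXIV_LATEX, SourceLevel.PDF_EXTRACTION)


def pick_text_source(available):
    """Return the highest-fidelity text source present, preferring LaTeX over PDF."""
    usable = [level for level in TEXT_SOURCES if level in available]
    if not usable:
        raise LookupError("no text source available for this paper")
    return min(usable)  # a lower level number means higher fidelity
```

For example, `pick_text_source({SourceLevel.ARXIV_LATEX, SourceLevel.PDF_EXTRACTION})` selects the LaTeX source, and PDF extraction is chosen only when no LaTeX source exists.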
| Phase | Name | Status | Description |
|---|---|---|---|
| 01 | Ideation and Setup | ✅ Complete | GDR review, repo initialization |
| 02 | GitHub Frameout | ✅ Complete | Milestones, tasks, GitHub labels |
| 03 | Acquisition | ✅ Complete | arXiv client, PDF download, source extraction |
| 04 | Extraction | ✅ Complete | LaTeX/PDF text extraction |
| 05 | Storage | ✅ Complete | Database, embeddings, retrieval |
| 06 | Harvester | ✅ Complete | Bulk acquisition, seed corpus population |
| 07 | Hybrid Engine | ⬜ In Progress | Neo4j graph construction |
| 08 | Agent | ⬜ Planned | LangGraph state machine |
| 09 | Interface | ⬜ Planned | MCP servers, Claude Code integration |
The minimal end-to-end loop proving the architecture, split across three milestones:
arXiv ID → download source → LaTeX extraction → clean text + bibcode → PostgreSQL → semantic query → return with attribution
No catalog integration, no Neo4j, no MCP, just the text pipeline.
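The store-and-query end of that loop can be sketched in memory. This is a toy under loud assumptions: a bag-of-words vector stands in for a real embedding model, a Python list stands in for PostgreSQL with pgvector, and `embed`, `cosine`, and `InMemoryCorpus` are hypothetical names. It shows only the shape of the contract: every returned chunk carries its bibcode for attribution.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class InMemoryCorpus:
    """Stand-in for the database table of (bibcode, text, embedding) rows."""

    def __init__(self):
        self.rows = []

    def add(self, bibcode, text):
        self.rows.append((bibcode, text, embed(text)))

    def query(self, question, k=1):
        q = embed(question)
        ranked = sorted(self.rows, key=lambda r: cosine(q, r[2]), reverse=True)
        # Each result keeps its bibcode so the answer can cite its source.
        return [{"bibcode": b, "text": t} for b, t, _ in ranked[:k]]
```

Swapping the toy pieces for a real embedding model and a vector index changes the ranking quality but not the interface: text goes in keyed by bibcode, and results come back with attribution.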
This project runs on the radioastronomy.io research cluster.
| Component | Resource | Purpose |
|---|---|---|
| PostgreSQL + pgvector | radio-pgsql01 (10.25.20.8) | Semantic layer, embeddings, vector search |
| Neo4j | radio-neo4j01 (10.25.20.21) | Topological layer, citation graphs |
| SMB Storage | radio-fs02 (10.25.20.15) | Physical layer, PDF/LaTeX artifacts |
| GPU Compute | ML01 (A4000, 16GB) | Embedding generation |
| Database | astronomy_rag_corpus | Dedicated corpus database |
Connection patterns follow the standard /opt/global-env/research.env configuration.
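A minimal sketch of reading such an env file, assuming it holds plain `KEY=VALUE` lines. The variable names inside research.env are not documented here, so any key used with this helper is an assumption; `load_env` itself is a hypothetical function, not part of the repository.

```python
from pathlib import Path


def load_env(path):
    """Parse KEY=VALUE lines into a dict, skipping blank lines and # comments."""
    settings = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings
```

Calling `load_env("/opt/global-env/research.env")` would return the cluster settings as a dictionary. The sketch does not handle `export` prefixes or multi-line values.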
astronomy-rag-corpus/
├── 📂 assets/ # Figures, diagrams, banners
├── 📂 docs/
│ ├── 📂 documentation-standards/ # Templates, tagging strategy
│ └── 📄 data-science-infrastructure.md
├── 📂 internal-files/ # GDR documents, working papers
├── 📂 shared/ # Shared resources
├── 📂 spec/ # Project specifications
├── 📂 src/ # Source code
│ ├── 📂 acquisition/ # arXiv/ADS paper retrieval
│ ├── 📂 extraction/ # LaTeX/PDF text extraction
│ └── 📂 storage/ # Database, embeddings, retrieval
├── 📂 staging/ # Staged work
├── 📂 tests/ # Test suite
├── 📂 work-logs/ # Milestone-based development history
├── 📄 conftest.py # Pytest configuration
├── 📄 requirements.txt # Python dependencies
├── 📄 LICENSE
├── 📄 LICENSE-DATA
└── 📄 README.md # This file
| Category | Technology | Purpose |
|---|---|---|
| Databases | PostgreSQL 16 + pgvector | Vector storage, semantic search |
| | Neo4j 5 | Citation graphs, authorship networks |
| Ingestion | arxiv.py | arXiv paper retrieval |
| | ads | NASA ADS bibliographic data |
| | pylatexenc | LaTeX to clean text |
| | PyMuPDF | PDF extraction (fallback) |
| | astropy | FITS header extraction |
| Orchestration | LangGraph | Stateful agentic workflows |
| Interface | MCP | Claude Code integration |
This repository benefits from open source programs that provide free or discounted tooling to qualifying public repositories.
| Program | Provides | Use Case |
|---|---|---|
| CodeRabbit | AI code review (Pro tier) | PR review with codebase context |
| Atlassian | Jira, Confluence (Standard tier) | Project tracking, documentation |
Planned integrations:
| Program | Provides | Planned Use |
|---|---|---|
| Snyk | Security scanning | Dependency vulnerability detection |
| SonarCloud | Code quality analysis | Static analysis |
MIT © 2025 VintageDon
- DESI Collaboration for data releases and VAC documentation
- NASA ADS for bibliographic data and API access
- arXiv for open access preprints
- CDS for SIMBAD and VizieR services
Last Updated: 2026-03-29 | Current Phase: 07 Hybrid Engine


