This document provides a step-by-step guide to generating and indexing the full Neo4j dataset. It assumes Jupyter Notebook, Airflow, and Neo4j are already set up and running. If they are not, follow the setup instructions in the main README to launch the necessary containers.
Use the Airflow DAG retrieve_data_from_s3_dag to download Acts and Regulations data from S3. This DAG automatically pulls the relevant raw data files to your local environment.
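If you prefer to start the DAG from a script rather than the Airflow UI, the sketch below builds a trigger request for the Airflow 2.x stable REST API. It assumes the webserver is reachable at http://localhost:8080, that the basic-auth API backend is enabled, and that the credentials shown are placeholders; none of these settings are specified by this guide, so adjust them to your deployment.

```python
import base64
import json
import urllib.request

# Assumptions (not from this guide): Airflow 2.x webserver on localhost:8080
# with the basic-auth API backend enabled.
AIRFLOW_URL = "http://localhost:8080"
DAG_ID = "retrieve_data_from_s3_dag"


def build_trigger_request(base_url, dag_id, username, password, conf=None):
    """Build (but do not send) a POST request that starts one DAG run."""
    url = f"{base_url.rstrip('/')}/api/v1/dags/{dag_id}/dagRuns"
    body = json.dumps({"conf": conf or {}}).encode("utf-8")
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


def trigger_dag(request):
    """Send the request and return Airflow's JSON description of the new run."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Calling `trigger_dag(build_trigger_request(AIRFLOW_URL, DAG_ID, "<user>", "<password>"))` should queue one run; the DAG must be unpaused for the run to start.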
Manually run the following notebook to index Acts into Neo4j:
- Notebook: examples/Neo4j/datacleanup_neo4j.ipynb
- Status: No Airflow DAG yet
- Note: Execution time varies based on your compute resources.
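The notebook can also be executed without opening the Jupyter UI. The following is a minimal sketch that shells out to `jupyter nbconvert`; it assumes nbconvert is installed in the same environment as the notebook's dependencies and that the command is run from the repository root. The helper names are illustrative only.

```python
import subprocess


def nbconvert_command(notebook_path, disable_timeout=True):
    """Return the argv list that executes a notebook in place with nbconvert."""
    cmd = ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace"]
    if disable_timeout:
        # Indexing can be slow, so turn off the per-cell execution timeout.
        cmd.append("--ExecutePreprocessor.timeout=-1")
    cmd.append(notebook_path)
    return cmd


def run_notebook(notebook_path):
    """Execute the notebook; raises CalledProcessError if any cell fails."""
    subprocess.run(nbconvert_command(notebook_path), check=True)
```

For example, `run_notebook("examples/Neo4j/datacleanup_neo4j.ipynb")` runs every cell top to bottom and writes the outputs back into the same file.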
⚠️ Missing: The Regulations indexing notebook is not yet implemented. A placeholder should be added and flagged for development:
- Notebook: index_regulations.ipynb (to be created)
Run the glossary indexing notebook to create nodes and establish related_Terms edges:
- Notebook: examples/Embeddings/glossary.ipynb
- Planned DAG: glossary_dag.py
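To illustrate the kind of write the glossary step performs, here is a hedged sketch of creating term nodes and related_Terms edges with the official Neo4j Python driver. Only the related_Terms relationship type comes from this guide; the `GlossaryTerm` label, the `name` property, and the input shape are hypothetical, and the notebook's actual schema may differ.

```python
# Hypothetical label and property names; only related_Terms is from this guide.
MERGE_RELATED_TERMS = """
UNWIND $rows AS row
MERGE (a:GlossaryTerm {name: row.term})
MERGE (b:GlossaryTerm {name: row.related})
MERGE (a)-[:related_Terms]->(b)
"""


def build_rows(glossary):
    """Flatten {term: [related terms]} into one row per (term, related) pair."""
    return [
        {"term": term, "related": related}
        for term, related_terms in glossary.items()
        for related in related_terms
        if related != term  # skip self-references
    ]


def index_glossary(uri, user, password, glossary):
    """Write the pairs to Neo4j; MERGE avoids duplicates when rerun."""
    # Imported lazily: requires the neo4j driver package.
    from neo4j import GraphDatabase

    rows = build_rows(glossary)
    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            session.run(MERGE_RELATED_TERMS, rows=rows).consume()
    return len(rows)
```

Note that in this sketch a term with no related terms produces no rows and therefore no node; a separate MERGE over all terms would be needed to cover that case.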
Run the ticket graphics notebook to index visual nodes and relationships:
- Notebook: examples/Image OCR/get_ticket_Dispute.ipynb
- Planned DAG: ticket_images_dag.py
To ensure reproducibility and automation, the following notebooks will be converted into DAGs:
| Notebook | Planned Airflow DAG |
|---|---|
| index_acts.ipynb | acts_dag.py (optional) |
| index_regulations.ipynb | regulations_dag.py (once implemented) |
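One possible shape for these planned DAGs is a single task that executes the notebook headlessly. The sketch below assumes Airflow 2.4 or a later 2.x release (for the `schedule` argument) and that `jupyter nbconvert` is available on the worker; the function names, task id, and start date are placeholders, not an agreed design.

```python
import shlex
from datetime import datetime


def notebook_bash_command(notebook_path):
    """Shell command that executes a notebook in place; the path is quoted."""
    quoted = shlex.quote(notebook_path)
    return f"jupyter nbconvert --to notebook --execute --inplace {quoted}"


def build_notebook_dag(dag_id, notebook_path):
    """Create a manually triggered DAG with one notebook-execution task."""
    # Imported lazily so the helper above works without Airflow installed.
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id=dag_id,
        start_date=datetime(2024, 1, 1),  # placeholder date
        schedule=None,  # run only when triggered
        catchup=False,
    ) as dag:
        BashOperator(
            task_id="run_notebook",
            bash_command=notebook_bash_command(notebook_path),
        )
    return dag
```

In a DAG file, the result needs to be bound to a module-level name so the scheduler can discover it, for example `dag = build_notebook_dag("regulations_dag", "index_regulations.ipynb")` once that notebook exists. Quoting the path matters for notebooks under directories with spaces, such as examples/Image OCR/.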
This document serves as a reproducibility guide for:
- Regenerating the graph database
- Tracking which notebooks and DAGs were used in production
- Understanding the indexing structure and data flow
All contributors are expected to keep this document updated as new indexing processes or DAGs are added.