The stub now returns:
- 1 test document (`test_document.txt`) with stable content
- 1 test URL (`https://med.stanford.edu/irt.html`)

Run the API smoke test (this will test the RAG EM API and purge the namespace):

```bash
python test_rag_em_api.py
```

Trigger the first ingestion run:

```bash
curl -X POST "http://localhost:9090/api/ingest-batch" | python -m json.tool
```

Expected behavior:
- Both doc + URL are NEW → both processed
- Content hashed and stored in database
- Sections ingested to Pinecone (dense + sparse)
- Vector IDs tracked in database
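The hash-and-track mechanics above can be sketched in a few lines. This is a minimal illustration assuming SHA-256 over stripped content; `compute_content_hash` is a hypothetical name, not the pipeline's actual function:

```python
import hashlib

def compute_content_hash(content: str) -> str:
    """Hash normalized content so incidental whitespace doesn't trigger re-ingestion."""
    normalized = content.strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Identical content always yields the same hash...
h1 = compute_content_hash("Section one.\nSection two.")
h2 = compute_content_hash("Section one.\nSection two.")
# ...while any edit changes it, which is what flags a document for re-processing.
h3 = compute_content_hash("Section one.\nSection two.\nNew line.")
```

The stored hash is what later runs compare against to decide whether to skip a document.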
Expected response:

```json
{
  "status": "completed",
  "run_id": "ingest_2026-02-14T...",
  "summary": {
    "documents_processed": 2,
    "sections_ingested": 10,   // depends on content
    "documents_skipped": 0,
    "documents_failed": 0
  }
}
```

Verify the database state:

```bash
docker-compose exec scraper python -c "
from rag_pipeline.database.connection import engine
from sqlalchemy import text

with engine.connect() as conn:
    result = conn.execute(text('''
        SELECT document_id, file_name, url,
               rag_vector_id, rag_ingestion_status,
               sections_processed, sections_total,
               last_processed_at
        FROM document_ingestion_state
        ORDER BY last_processed_at DESC
    '''))
    print('Database Records:')
    print('=' * 80)
    for row in result:
        print(f'Doc: {row[1] or row[2][:50]}')
        print(f'  ID: {row[0]}')
        print(f'  Status: {row[4]}')
        print(f'  Sections: {row[5]}/{row[6]}')
        print(f'  Vector ID: {row[3][:20]}...' if row[3] else '  Vector ID: None')
        print(f'  Processed: {row[7]}')
        print()
"
```

Run the ingestion again without changing anything:

```bash
curl -X POST "http://localhost:9090/api/ingest-batch" | python -m json.tool
```

Expected behavior:
- Hash matches → BOTH skipped
- No processing, no API calls
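The skip decision amounts to a hash comparison. A minimal sketch, assuming a `should_process` helper and a stored-hash argument (both illustrative, not the pipeline's real API):

```python
import hashlib

def should_process(content, stored_hash):
    """Return True when the document is new or its content changed."""
    current = hashlib.sha256(content.strip().encode("utf-8")).hexdigest()
    if stored_hash is None:
        return True                # first ingestion: nothing stored yet
    return current != stored_hash  # re-ingest only on a hash mismatch

stored = hashlib.sha256(b"stable test content").hexdigest()
new_doc = should_process("stable test content", None)             # True: never seen
unchanged = should_process("stable test content", stored)         # False: hash match
modified = should_process("stable test content, edited", stored)  # True: mismatch
```

On the second run both items hit the `False` branch, which is why no processing and no API calls happen.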
Expected response:

```json
{
  "status": "completed",
  "summary": {
    "documents_processed": 0,
    "sections_ingested": 0,
    "documents_skipped": 2,   // ← Both skipped!
    "documents_failed": 0
  }
}
```

Edit the stub to change the document content:
```bash
# Open the file
code /Users/irvins/Work/content_pipeline/rag_pipeline/automation/content_fetcher.py
```

Find the test_doc content (line ~120) and add a new line at the end:

```python
"""
... existing content ...
**MODIFIED CONTENT** - This line was added for re-ingestion testing.
"""
```

Run the ingestion a third time:

```bash
curl -X POST "http://localhost:9090/api/ingest-batch" | python -m json.tool
```

Expected behavior:
- Modified doc → hash mismatch → REPROCESSED ✅
- Same URL → hash match → SKIPPED ✅
- Old vector deleted, new vectors created
- Database updated with new hash + vector ID
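The delete-then-replace flow above can be sketched with in-memory stand-ins for the vector store and tracking table; `reingest` and the dict shapes are assumptions for illustration, not the pipeline's real interfaces:

```python
def reingest(doc_id, new_sections, new_hash, vector_store, db):
    """Replace a document's vectors and update its tracking record."""
    old = db.get(doc_id, {})
    # 1. Delete the stale vectors from the previous ingestion.
    for vid in old.get("vector_ids", []):
        vector_store.pop(vid, None)
    # 2. Write one vector per section under fresh IDs.
    new_ids = []
    for i, section in enumerate(new_sections):
        vid = f"{doc_id}#chunk{i}"
        vector_store[vid] = section
        new_ids.append(vid)
    # 3. Record the new hash and vector IDs so the next run can skip unchanged docs.
    db[doc_id] = {"content_hash": new_hash, "vector_ids": new_ids}
    return new_ids

store, db = {}, {}
reingest("doc1", ["old section"], "hash-v1", store, db)
ids = reingest("doc1", ["new section A", "new section B"], "hash-v2", store, db)
```

Deleting before upserting is what prevents orphaned vectors when a re-ingested document produces fewer sections than before.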
Expected response:

```json
{
  "status": "completed",
  "summary": {
    "documents_processed": 1,   // ← Only modified doc!
    "sections_ingested": 8,
    "documents_skipped": 1,     // ← URL skipped!
    "documents_failed": 0
  }
}
```

Check database again - you should see:
- New `rag_vector_id` for the modified doc
- New `last_processed_at` timestamp
- Same data for the URL (unchanged)
Change the test URL:

Edit content_fetcher.py line ~135:

```python
test_urls = [
    "https://your-test-url.com"
]
```

Change the test document content:

Edit content_fetcher.py line ~120-130 (the content= field).

Add more test items:

```python
return [test_doc, test_doc2], [url1, url2, url3]
```

Once SharePoint integration is ready, revert the stub:
```python
def fetch_content_sources_stub():
    logger.warning("Using stub content fetcher - returning empty lists")
    return [], []
```

And switch the orchestrator to use the real fetcher:

```python
# In orchestrator.py line ~193
return fetch_content_sources()  # Instead of fetch_content_sources_stub()
```

No documents processed:
- Check logs: `docker-compose logs scraper`
- Verify the stub is being used (you should see "Using stub content fetcher")
Documents fail to ingest:
- Check that the RAG EM is running
- Verify `REDCAP_API_TOKEN` is set
- Test the RAG EM API: `python test_rag_em_api.py`
Hashes don't match on re-run:
- Content might have trailing whitespace differences
- Check `content_hash` in the database vs the computed hash
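As a quick one-off check (hypothetical, not a pipeline utility), hashing the raw and stripped content side by side shows how even a single trailing newline changes the hash entirely:

```python
import hashlib

raw = "Test content with a trailing newline\n"
stripped = raw.strip()

raw_hash = hashlib.sha256(raw.encode("utf-8")).hexdigest()
stripped_hash = hashlib.sha256(stripped.encode("utf-8")).hexdigest()

# A single trailing newline is enough to produce a completely different hash.
mismatch = raw_hash != stripped_hash
```

If the two hashes differ like this, normalize whitespace consistently before hashing on both the ingestion and comparison sides.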
After all steps:
- ✅ First run: Both items processed
- ✅ Second run: Both items skipped (hash match)
- ✅ Third run: Only modified doc processed, URL skipped
- ✅ Database has correct hashes, vector IDs, timestamps
- ✅ Old vectors deleted on re-ingestion
- ✅ No errors in logs