Research project examining systematic patterns in ethnicity classification within public arrest records using computer vision models.
Status: IRB review and legal review in progress. Software infrastructure is being built in anticipation of approval.
See docs/claude.md for AI assistant guidelines and project context.
See docs/planning/system_architecture.md for detailed technical architecture.
correct-crime-data/
├── src/ # Source code (Python package)
│ ├── scrapers/ # Web scraping modules for data collection
│ ├── preprocessing/ # Image preprocessing and quality assessment
│ ├── inference/ # Model inference (API and local)
│ ├── analysis/ # Statistical analysis and reporting
│ ├── validation/ # Model validation and benchmarking
│ ├── database/ # Database models and migrations
│ └── utils/ # Shared utilities and helpers
│
├── tests/ # Test suite
│ ├── unit/ # Unit tests
│ ├── integration/ # Integration tests
│ └── fixtures/ # Test fixtures and mock data
│
├── scripts/ # Standalone scripts
│ ├── pilot/ # Phase 1 pilot scripts (simple, self-contained)
│ ├── data_collection/ # Data collection utilities
│ ├── quality_checks/ # Data quality validation scripts
│ └── migrations/ # Database migration scripts
│
├── config/ # Configuration files
│ ├── development/ # Dev environment configs
│ ├── staging/ # Staging environment configs
│ └── production/ # Production environment configs
│
├── data/ # Data storage (gitignored, except .gitkeep)
│ ├── raw/ # Raw scraped data
│ ├── processed/ # Processed and cleaned data
│ ├── validation_set/ # Gold-standard validation data
│ └── exports/ # Analysis outputs and reports
│
├── models/ # Model weights and configs
│ ├── weights/ # Model weight files (gitignored)
│ └── configs/ # Model configuration files
│
├── notebooks/ # Jupyter notebooks
│ ├── exploratory/ # Exploratory data analysis
│ ├── analysis/ # Statistical analysis notebooks
│ └── validation/ # Model validation notebooks
│
├── docker/ # Dockerfiles
│ ├── cpu-worker/ # CPU-bound worker (scraping, preprocessing)
│ ├── gpu-worker/ # GPU-bound worker (inference)
│ └── scraper/ # Standalone scraper container
│
├── .github/ # GitHub Actions workflows
│ └── workflows/ # CI/CD pipeline definitions
│
├── compliance/ # Compliance and governance
│ ├── irb/ # IRB documentation and approvals
│ ├── legal/ # Legal reviews and opinions
│ └── source_registry/ # Registry of approved data sources
│
├── docs/ # Documentation
│ ├── claude.md # AI assistant guidelines
│ ├── planning/ # Planning and architecture docs
│ ├── api/ # API documentation
│ └── user-guides/ # User guides and tutorials
│
├── mlruns/ # MLflow experiment tracking (gitignored)
├── outputs/ # Runtime outputs (gitignored)
│ ├── reports/ # Generated reports
│ ├── visualizations/ # Generated plots and charts
│ └── statistics/ # Statistical analysis outputs
│
└── logs/ # Application logs (gitignored)
Phase 1 (Pilot):
- Goal: Validate the research question with minimal investment
- Scope: 100-500 records, single jurisdiction
- Budget: <$1,000
- Duration: 4 weeks
- Tech: SQLite, API-based inference (Claude/GPT-4V), simple Python scripts (see the sketch below)
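The pilot's API-based inference can be as small as a single function. Below is a minimal sketch, assuming the `anthropic` Python SDK and an `ANTHROPIC_API_KEY` loaded from `.env`; the model name, prompt, and return handling are illustrative placeholders, not the project's actual `run_inference.py`.

```python
# Minimal sketch of Phase 1 API-based inference (illustrative only).
import base64
import os

import anthropic  # assumes `pip install anthropic`


def classify_image(image_path: str, model: str = "claude-3-5-sonnet-latest") -> str:
    """Send one image to the Claude API and return the raw text response."""
    with open(image_path, "rb") as f:
        image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

    client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
    response = client.messages.create(
        model=model,
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64", "media_type": "image/jpeg", "data": image_b64}},
                # The actual classification prompt is defined by the approved protocol.
                {"type": "text", "text": "<prompt defined by the approved research protocol>"},
            ],
        }],
    )
    return response.content[0].text
```

At pilot scale (≤500 records), a simple loop over a function like this with results written to SQLite keeps the moving parts to a minimum.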
Phase 2 (Multi-Jurisdiction Validation):
- Goal: Validate findings across multiple jurisdictions
- Scope: 2,000-5,000 records, 2-3 jurisdictions
- Budget: $3,000-5,000
- Duration: 8 weeks
- Tech: PostgreSQL, Celery, multi-model comparison (see the sketch below)
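For Phase 2, fanning out (record, model) pairs to worker processes is the natural Celery shape. The sketch below is a rough layout under assumed broker/backend URLs and task names; the actual model-calling and persistence code is elided.

```python
# Hypothetical Phase 2 task layout: one Celery task per (record, model) pair,
# so predictions from different vision models can be compared afterwards.
from celery import Celery

app = Celery(
    "inference",
    broker="redis://localhost:6379/0",                        # assumed broker
    backend="db+postgresql://user:pass@localhost/crimedata",  # assumed result backend
)


@app.task(bind=True, max_retries=3)
def run_model_on_record(self, record_id: int, model_name: str) -> dict:
    """Run one model on one record; retries cover transient API failures."""
    try:
        # Placeholder result: the real task would call the model client here
        # and persist the prediction to PostgreSQL.
        return {"record_id": record_id, "model": model_name, "label": None}
    except Exception as exc:
        raise self.retry(exc=exc, countdown=30)


def enqueue_comparison(record_ids, models=("claude", "gpt-4v")):
    """Fan every record out to every model for later agreement analysis."""
    for rid in record_ids:
        for model in models:
            run_model_on_record.delay(rid, model)
```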
Phase 3 (Production):
- Goal: Large-scale data collection and analysis
- Scope: 50,000-100,000 records, 10-15 jurisdictions
- Budget: $50,000-100,000
- Duration: 14 weeks
- Tech: Local GPU inference, Kafka/Airflow, full observability (see the orchestration sketch below)
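Phase 3 orchestration could be expressed as an Airflow DAG chaining the pipeline stages. The sketch below is hypothetical (it assumes Airflow 2.x; the DAG id, schedule, and task bodies are placeholders) and says nothing about the Kafka topics or GPU inference code themselves.

```python
# Hypothetical Phase 3 orchestration: a daily DAG chaining the pipeline stages.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def scrape(**_): ...         # would wrap src/scrapers
def preprocess(**_): ...     # would wrap src/preprocessing
def run_inference(**_): ...  # would wrap src/inference (local GPU workers)
def analyze(**_): ...        # would wrap src/analysis


with DAG(
    dag_id="pipeline_daily",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    t_scrape = PythonOperator(task_id="scrape", python_callable=scrape)
    t_preprocess = PythonOperator(task_id="preprocess", python_callable=preprocess)
    t_infer = PythonOperator(task_id="inference", python_callable=run_inference)
    t_analyze = PythonOperator(task_id="analyze", python_callable=analyze)

    t_scrape >> t_preprocess >> t_infer >> t_analyze
```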
Prerequisites:
- Python 3.10+
- Git
- Docker (optional, for containerized development)
Setup:
- Clone the repository:
git clone <repository-url>
cd correct-crime-data
- Create a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies:
# Phase 1 (Pilot) minimal dependencies
pip install -r requirements-pilot.txt
# OR Phase 3 (Production) full dependencies
pip install -r requirements.txt
- Set up environment variables:
cp .env.example .env
# Edit .env with your API keys and configuration
- Initialize the database (Phase 1):
python scripts/pilot/init_db.py

Pilot workflow (Phase 1):

# Collect data (from official APIs only)
python scripts/pilot/collect_data.py --city chicago --limit 500
# Run quality assessment
python scripts/pilot/assess_quality.py
# Run inference (API-based)
python scripts/pilot/run_inference.py --model claude
# Analyze results
python scripts/pilot/analyze.py

Running tests:

# Run all tests
pytest
# Run with coverage
pytest --cov=src --cov-report=html
# Run specific test suite
pytest tests/unit/
pytest tests/integration/

Documentation:

- Architecture: docs/planning/system_architecture.md
- AI Assistant Guide: docs/claude.md
- API Docs: Coming soon
- User Guides: Coming soon
Compliance status:
- IRB Review: In progress with the Institutional Review Board
- Legal Review: In progress with legal counsel
- Ethics Consultation: Engaging with stakeholders and civil rights organizations
All code and architecture are designed to be flexible and will be modified to comply with any requirements from the IRB, legal counsel, and ethics reviewers.
- Focus on technical contributions (see docs/claude.md)
- Write tests for all new code (see the example after this list)
- Follow PEP 8 style guidelines
- Document functions with docstrings
- Update relevant documentation
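As a rough illustration of the expected test style (the helper below is a toy stand-in, not real project code; actual tests should target modules under src/):

```python
# tests/unit/test_example.py — illustrative only.
def normalize_city(name: str) -> str:
    """Toy helper standing in for a real src.utils function."""
    return name.strip().lower().replace(" ", "_")


def test_normalize_city_strips_whitespace_and_lowercases():
    assert normalize_city("  New York ") == "new_york"
```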
[To be determined - pending IRB and legal review]
[Project lead contact information]
This project uses:
- Computer vision models for appearance-based ethnicity estimation
- Official public data sources (with proper approval)
- Statistical methods for rigorous analysis (see the sketch after this list)
- Comprehensive data governance and privacy protections
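As an example of the kind of analysis the statistical stage might run, the sketch below compares model-assigned labels against the labels in the official records and tests whether disagreement rates differ across groups. The column names, grouping variable, and chi-square test choice are assumptions, not the project's finalized methodology.

```python
# Illustrative analysis sketch: disagreement rates by group plus a chi-square test.
import pandas as pd
from scipy.stats import chi2_contingency


def disagreement_by_group(df: pd.DataFrame) -> tuple[pd.DataFrame, float]:
    """Expects columns: official_label, model_label, group (all assumed names)."""
    df = df.assign(disagree=df["official_label"] != df["model_label"])
    table = pd.crosstab(df["group"], df["disagree"])
    chi2, p_value, _, _ = chi2_contingency(table)
    rates = df.groupby("group")["disagree"].mean().rename("disagreement_rate")
    return rates.to_frame(), p_value
```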