Research project examining systematic patterns in ethnicity classification within public arrest records using computer vision models.
Status: IRB review and legal review in progress. Software infrastructure is being built in anticipation of approval.
See docs/claude.md for AI assistant guidelines and project context.
See docs/planning/system_architecture.md for detailed technical architecture.
correct-crime-data/
├── src/ # Source code (Python package)
│ ├── scrapers/ # Web scraping modules for data collection
│ ├── preprocessing/ # Image preprocessing and quality assessment
│ ├── inference/ # Model inference (API and local)
│ ├── analysis/ # Statistical analysis and reporting
│ ├── validation/ # Model validation and benchmarking
│ ├── database/ # Database models and migrations
│ └── utils/ # Shared utilities and helpers
│
├── tests/ # Test suite
│ ├── unit/ # Unit tests
│ ├── integration/ # Integration tests
│ └── fixtures/ # Test fixtures and mock data
│
├── scripts/ # Standalone scripts
│ ├── pilot/ # Phase 1 pilot scripts (simple, self-contained)
│ ├── data_collection/ # Data collection utilities
│ ├── quality_checks/ # Data quality validation scripts
│ └── migrations/ # Database migration scripts
│
├── config/ # Configuration files
│ ├── development/ # Dev environment configs
│ ├── staging/ # Staging environment configs
│ └── production/ # Production environment configs
│
├── data/ # Data storage (gitignored, except .gitkeep)
│ ├── raw/ # Raw scraped data
│ ├── processed/ # Processed and cleaned data
│ ├── validation_set/ # Gold-standard validation data
│ └── exports/ # Analysis outputs and reports
│
├── models/ # Model weights and configs
│ ├── weights/ # Model weight files (gitignored)
│ └── configs/ # Model configuration files
│
├── notebooks/ # Jupyter notebooks
│ ├── exploratory/ # Exploratory data analysis
│ ├── analysis/ # Statistical analysis notebooks
│ └── validation/ # Model validation notebooks
│
├── docker/ # Dockerfiles
│ ├── cpu-worker/ # CPU-bound worker (scraping, preprocessing)
│ ├── gpu-worker/ # GPU-bound worker (inference)
│ └── scraper/ # Standalone scraper container
│
├── .github/ # GitHub Actions workflows
│ └── workflows/ # CI/CD pipeline definitions
│
├── compliance/ # Compliance and governance
│ ├── irb/ # IRB documentation and approvals
│ ├── legal/ # Legal reviews and opinions
│ └── source_registry/ # Registry of approved data sources
│
├── docs/ # Documentation
│ ├── claude.md # AI assistant guidelines
│ ├── planning/ # Planning and architecture docs
│ ├── api/ # API documentation
│ └── user-guides/ # User guides and tutorials
│
├── mlruns/ # MLflow experiment tracking (gitignored)
├── outputs/ # Runtime outputs (gitignored)
│ ├── reports/ # Generated reports
│ ├── visualizations/ # Generated plots and charts
│ └── statistics/ # Statistical analysis outputs
│
└── logs/ # Application logs (gitignored)
Phase 1 (Pilot):
- Goal: Validate the research question with minimal investment
- Scope: 100-500 records, single jurisdiction
- Budget: <$1,000
- Duration: 4 weeks
- Tech: SQLite, API-based inference (Claude/GPT-4V), simple Python scripts (see the sketch below)
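The pilot's API-based inference can be as small as a single function. Below is a minimal sketch, assuming the `anthropic` Python SDK and an `ANTHROPIC_API_KEY` loaded from `.env`; the model name, prompt, and return handling are illustrative placeholders, not the project's actual `run_inference.py`.

```python
# Minimal sketch of Phase 1 API-based inference (illustrative only).
import base64
import os

import anthropic  # assumes `pip install anthropic`


def classify_image(image_path: str, model: str = "claude-3-5-sonnet-latest") -> str:
    """Send one image to the Claude API and return the raw text response."""
    with open(image_path, "rb") as f:
        image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

    client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
    response = client.messages.create(
        model=model,
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64", "media_type": "image/jpeg", "data": image_b64}},
                # The actual classification prompt is defined by the approved protocol.
                {"type": "text", "text": "<prompt defined by the approved research protocol>"},
            ],
        }],
    )
    return response.content[0].text
```

At pilot scale (≤500 records), a simple loop over a function like this with results written to SQLite keeps the moving parts to a minimum.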
Phase 2 (Multi-Jurisdiction Validation):
- Goal: Validate findings across multiple jurisdictions
- Scope: 2,000-5,000 records, 2-3 jurisdictions
- Budget: $3,000-5,000
- Duration: 8 weeks
- Tech: PostgreSQL, Celery, multi-model comparison (see the sketch below)
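For Phase 2, fanning out (record, model) pairs to worker processes is the natural Celery shape. The sketch below is a rough layout under assumed broker/backend URLs and task names; the actual model-calling and persistence code is elided.

```python
# Hypothetical Phase 2 task layout: one Celery task per (record, model) pair,
# so predictions from different vision models can be compared afterwards.
from celery import Celery

app = Celery(
    "inference",
    broker="redis://localhost:6379/0",                        # assumed broker
    backend="db+postgresql://user:pass@localhost/crimedata",  # assumed result backend
)


@app.task(bind=True, max_retries=3)
def run_model_on_record(self, record_id: int, model_name: str) -> dict:
    """Run one model on one record; retries cover transient API failures."""
    try:
        # Placeholder result: the real task would call the model client here
        # and persist the prediction to PostgreSQL.
        return {"record_id": record_id, "model": model_name, "label": None}
    except Exception as exc:
        raise self.retry(exc=exc, countdown=30)


def enqueue_comparison(record_ids, models=("claude", "gpt-4v")):
    """Fan every record out to every model for later agreement analysis."""
    for rid in record_ids:
        for model in models:
            run_model_on_record.delay(rid, model)
```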
Phase 3 (Production):
- Goal: Large-scale data collection and analysis
- Scope: 50,000-100,000 records, 10-15 jurisdictions
- Budget: $50,000-100,000
- Duration: 14 weeks
- Tech: Local GPU inference, Kafka/Airflow, full observability (see the orchestration sketch below)
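Phase 3 orchestration could be expressed as an Airflow DAG chaining the pipeline stages. The sketch below is hypothetical (it assumes Airflow 2.x; the DAG id, schedule, and task bodies are placeholders) and says nothing about the Kafka topics or GPU inference code themselves.

```python
# Hypothetical Phase 3 orchestration: a daily DAG chaining the pipeline stages.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def scrape(**_): ...         # would wrap src/scrapers
def preprocess(**_): ...     # would wrap src/preprocessing
def run_inference(**_): ...  # would wrap src/inference (local GPU workers)
def analyze(**_): ...        # would wrap src/analysis


with DAG(
    dag_id="pipeline_daily",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    t_scrape = PythonOperator(task_id="scrape", python_callable=scrape)
    t_preprocess = PythonOperator(task_id="preprocess", python_callable=preprocess)
    t_infer = PythonOperator(task_id="inference", python_callable=run_inference)
    t_analyze = PythonOperator(task_id="analyze", python_callable=analyze)

    t_scrape >> t_preprocess >> t_infer >> t_analyze
```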
Prerequisites:
- Python 3.10+
- Git
- Docker (optional, for containerized development)
Setup:
- Clone the repository:
git clone <repository-url>
cd correct-crime-data
- Create a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies:
# Phase 1 (Pilot) minimal dependencies
pip install -r requirements-pilot.txt
# OR Phase 3 (Production) full dependencies
pip install -r requirements.txt
- Set up environment variables:
cp .env.example .env
# Edit .env with your API keys and configuration
- Initialize the database (Phase 1):
python scripts/pilot/init_db.py

Pilot workflow (Phase 1):

# Collect data (from official APIs only)
python scripts/pilot/collect_data.py --city chicago --limit 500
# Run quality assessment
python scripts/pilot/assess_quality.py
# Run inference (API-based)
python scripts/pilot/run_inference.py --model claude
# Analyze results
python scripts/pilot/analyze.py

Running tests:

# Run all tests
pytest
# Run with coverage
pytest --cov=src --cov-report=html
# Run specific test suite
pytest tests/unit/
pytest tests/integration/

Documentation:

- Architecture: docs/planning/system_architecture.md
- AI Assistant Guide: docs/claude.md
- API Docs: Coming soon
- User Guides: Coming soon
Compliance status:
- IRB Review: In progress with the Institutional Review Board
- Legal Review: In progress with legal counsel
- Ethics Consultation: Engaging with stakeholders and civil rights organizations
All code and architecture are designed to be flexible and will be modified to comply with any requirements from the IRB, legal counsel, and ethics reviewers.
- Focus on technical contributions (see docs/claude.md)
- Write tests for all new code (see the example after this list)
- Follow PEP 8 style guidelines
- Document functions with docstrings
- Update relevant documentation
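As a rough illustration of the expected test style (the helper below is a toy stand-in, not real project code; actual tests should target modules under src/):

```python
# tests/unit/test_example.py — illustrative only.
def normalize_city(name: str) -> str:
    """Toy helper standing in for a real src.utils function."""
    return name.strip().lower().replace(" ", "_")


def test_normalize_city_strips_whitespace_and_lowercases():
    assert normalize_city("  New York ") == "new_york"
```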
[To be determined - pending IRB and legal review]
[Project lead contact information]
This project uses:
- Computer vision models for appearance-based ethnicity estimation
- Official public data sources (with proper approval)
- Statistical methods for rigorous analysis (see the sketch after this list)
- Comprehensive data governance and privacy protections
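As an example of the kind of analysis the statistical stage might run, the sketch below compares model-assigned labels against the labels in the official records and tests whether disagreement rates differ across groups. The column names, grouping variable, and chi-square test choice are assumptions, not the project's finalized methodology.

```python
# Illustrative analysis sketch: disagreement rates by group plus a chi-square test.
import pandas as pd
from scipy.stats import chi2_contingency


def disagreement_by_group(df: pd.DataFrame) -> tuple[pd.DataFrame, float]:
    """Expects columns: official_label, model_label, group (all assumed names)."""
    df = df.assign(disagree=df["official_label"] != df["model_label"])
    table = pd.crosstab(df["group"], df["disagree"])
    chi2, p_value, _, _ = chi2_contingency(table)
    rates = df.groupby("group")["disagree"].mean().rename("disagreement_rate")
    return rates.to_frame(), p_value
```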