v0.2.0 - Major Pipeline Optimization and Infrastructure Improvements

@PhilipMathieu released this 09 Nov 20:25

Major Changes

Performance Optimizations

  • Migrate from NetworkX to rustworkx for 5-10x faster graph operations
  • Replace shapefiles/GeoJSON with GeoParquet for 2-5x faster I/O operations
  • Replace CSV with Parquet for tabular data storage
  • Replace multiple ego_graph calls with a single Dijkstra pass
  • Add a distance-bounded Dijkstra algorithm for efficient walk time computation
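
To illustrate the distance-bounding idea, here is a minimal pure-Python sketch. It is not the project's rustworkx-based implementation; the function name `bounded_dijkstra`, the adjacency-list layout, and the sample weights are all assumptions made for this example. The search drops any path whose accumulated cost exceeds the bound, so it never explores nodes outside the walk time limit.

```python
import heapq


def bounded_dijkstra(adjacency, source, max_cost):
    """Return {node: cost} for every node reachable from source within max_cost.

    adjacency is a hypothetical layout: {node: [(neighbor, edge_cost), ...]}.
    """
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry; a cheaper path was already found
        for neighbor, weight in adjacency.get(node, ()):
            new_cost = cost + weight
            if new_cost > max_cost:
                continue  # the bound: never expand past the limit
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                heapq.heappush(heap, (new_cost, neighbor))
    return best


# Illustrative walk times in minutes between four made-up nodes.
graph = {
    "a": [("b", 4.0), ("c", 10.0)],
    "b": [("a", 4.0), ("c", 3.0), ("d", 12.0)],
    "c": [("a", 10.0), ("b", 3.0)],
    "d": [("b", 12.0)],
}
times = bounded_dijkstra(graph, "a", max_cost=10.0)
```

One bounded pass from a source yields the costs to every node inside the limit, which is the kind of result that would otherwise take several neighborhood queries at different radii.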

Infrastructure Improvements

  • Migrate data versioning from DVC to Git LFS and add an automated data update system
  • Automated data source updates with version checking and metadata tracking
  • Data validation scripts for schema and data quality checks
  • Graph conversion caching for improved performance
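
As a rough illustration of conversion caching, the sketch below stores the result of an expensive conversion on disk, keyed by a hash of the source file, so that unchanged inputs are not converted again. The names `cached_convert` and `parse_edges`, the pickle format, and the cache layout are assumptions made for this example, not the project's actual mechanism.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


def cached_convert(source_path, convert, cache_dir):
    """Return convert(source_path), reusing an on-disk result when the file is unchanged."""
    source_path = Path(source_path)
    # Key the cache on file contents so an edited source invalidates the entry.
    digest = hashlib.sha256(source_path.read_bytes()).hexdigest()
    cache_file = Path(cache_dir) / f"{source_path.stem}-{digest[:16]}.pkl"
    if cache_file.exists():
        with cache_file.open("rb") as fh:
            return pickle.load(fh)
    result = convert(source_path)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with cache_file.open("wb") as fh:
        pickle.dump(result, fh)
    return result


calls = []


def parse_edges(path):
    """Stand-in for an expensive graph conversion: parse 'u v' edge lines."""
    calls.append(path.name)
    return [tuple(line.split()) for line in path.read_text().splitlines()]


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "edges.txt"
    source.write_text("a b\nb c\n")
    first = cached_convert(source, parse_edges, Path(tmp) / "cache")
    second = cached_convert(source, parse_edges, Path(tmp) / "cache")
```

The second call reads the pickled result instead of running the conversion again. Hashing the contents costs one full read of the source file, which is usually far cheaper than rebuilding a graph from it.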

Testing & Development

  • Comprehensive testing framework with pytest and pytest-cov
  • Enhanced project dependencies including statsmodels for statistical analysis
  • Parallel processing support for walk time calculations
  • Migration scripts for converting existing data files to new formats

Documentation

  • Data dictionary (DATA_DICTIONARY.md) with comprehensive details on data files and workflows
  • Project backlog (BACKLOG.md) for tracking technical debt and feature requests
  • CEJST workflow documentation (README_CEJST.md)
  • H3 implementation details and Census API key setup instructions
  • Enhanced README with updated guidance

Pipeline Enhancements

  • Enhanced pipeline scripts (run_pipeline.sh, run_pipeline.py) with logging and error handling
  • Jupyter notebooks for walk times and merging analysis
  • Updated validation to support Parquet files
  • Improved code organization and consistency across modules