Epic: initial-implementation #68

@jeremymanning

Description

Epic: Initial HTFA Implementation

Overview

Create a complete, production-ready HTFA toolbox with automatic BIDS dataset processing, core TFA/HTFA algorithms, rich visualization capabilities, and comprehensive validation through synthetic data testing. The implementation follows scikit-learn patterns and provides both high-level (htfa.fit()) and low-level (TFA.fit(), HTFA.fit()) APIs.

Architecture Decisions

Core Algorithm Design

  • Optimization Framework: SciPy's non-linear least squares for factor estimation, ridge regression for weights
  • Initialization Strategy: K-means clustering of spatial coordinates for robust starting points
  • Convergence Detection: Parameter change monitoring with configurable tolerance
  • Multi-subject Handling: Hierarchical optimization with global template and factor matching
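The initialization strategy above can be sketched in a few lines. This is illustrative, not the toolbox's actual code: `init_factor_centers` is a hypothetical name, but the approach — K-means over voxel coordinates to get well-spread starting centers for the non-linear least-squares fit — is exactly what the bullet describes.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_factor_centers(coords, n_factors, seed=0):
    """K-means over (V, 3) voxel coordinates yields well-spread
    starting centers for the non-linear factor optimization."""
    km = KMeans(n_clusters=n_factors, n_init=10, random_state=seed)
    km.fit(coords)
    return km.cluster_centers_

# Demo: 500 voxels scattered in a 100 mm cube, 5 factors.
rng = np.random.default_rng(0)
coords = rng.uniform(0, 100, size=(500, 3))
centers = init_factor_centers(coords, 5)
```

Because K-means centers are means of actual voxel positions, every starting point is guaranteed to lie inside the brain's coordinate range, which avoids degenerate starts far from the data.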

API Design Strategy

  • Input Detection: Automatic detection of BIDS directories vs NumPy arrays in .fit() method
  • Interface Pattern: Scikit-learn BaseEstimator for consistency with ML ecosystem
  • Results Container: Rich HTFAResults class with built-in visualization and export
  • Error Handling: Comprehensive validation with clear, actionable error messages
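A minimal sketch of the input-detection logic, assuming the BIDS check keys on `dataset_description.json` (which the BIDS spec requires at every dataset root); `detect_input_kind` is a hypothetical helper name:

```python
from pathlib import Path

import numpy as np

def detect_input_kind(data):
    """Route fit() input: in-memory arrays go straight to the solver;
    paths are validated as BIDS roots, which must contain
    dataset_description.json."""
    if isinstance(data, np.ndarray):
        return "array"
    root = Path(data)
    if root.is_dir() and (root / "dataset_description.json").exists():
        return "bids"
    raise ValueError(
        f"fit() expects a NumPy array or a BIDS root directory, got {data!r}; "
        "a BIDS root must contain dataset_description.json"
    )
```

The error message names both accepted input types and the missing file, matching the "clear, actionable error messages" goal above.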

Data Pipeline Architecture

  • BIDS Integration: pybids for dataset parsing, nilearn for preprocessing
  • Preprocessing: Configurable pipeline with sensible defaults
  • Output Format: BIDS derivatives specification compliance
  • Validation Strategy: Synthetic data generation using HTFA generative process
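One way the "configurable pipeline with sensible defaults" could surface in the API is a small config object. The field names below mirror common nilearn masker parameters (`smoothing_fwhm`, `standardize`, `detrend`, `high_pass`, `mask_strategy`), but the class itself and its default values are assumptions, not the toolbox's actual interface:

```python
from dataclasses import dataclass

@dataclass
class PreprocessConfig:
    """Hypothetical defaults for the configurable preprocessing pipeline."""
    smoothing_fwhm: float = 6.0   # mm, spatial smoothing kernel
    standardize: bool = True      # z-score each voxel time series
    detrend: bool = True          # remove linear drift
    high_pass: float = 0.008      # Hz, temporal high-pass cutoff
    mask_strategy: str = "epi"    # brain-mask computation strategy

default = PreprocessConfig()
```

Users override only what they need (e.g. `PreprocessConfig(smoothing_fwhm=8.0)`) while everything else keeps a documented default.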

Technical Approach

Core Algorithm Components

  • TFA Class: Single-subject spatial factor analysis with iterative optimization
  • HTFA Class: Multi-subject hierarchical analysis with global template computation
  • Initialization Module: K-means clustering and parameter initialization utilities
  • Optimization Engine: Robust numerical optimization with convergence monitoring
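The core factorization these components implement is Y ≈ W F, where each spatial factor is a radial basis function and the temporal weights have a closed-form ridge solution. A minimal numerical sketch (function names are illustrative):

```python
import numpy as np

def rbf_factors(coords, centers, widths):
    """Spatial factor images F[k, v] = exp(-||x_v - mu_k||^2 / lambda_k),
    evaluated at V voxel coordinates for K factors."""
    d2 = ((coords[None, :, :] - centers[:, None, :]) ** 2).sum(-1)  # (K, V)
    return np.exp(-d2 / widths[:, None])

def ridge_weights(Y, F, lam=1e-3):
    """Closed-form ridge solve for W in Y ~ W F: the fast half of the
    alternating TFA update (the slow half re-fits centers and widths)."""
    K = F.shape[0]
    G = F @ F.T + lam * np.eye(K)
    return np.linalg.solve(G, F @ Y.T).T  # W = Y F^T (F F^T + lam I)^{-1}

# Round-trip check: build noiseless data from known weights, recover them.
rng = np.random.default_rng(0)
coords = rng.uniform(0, 50, (400, 3))
centers = np.array([[10.0, 10.0, 10.0], [35.0, 30.0, 25.0]])
widths = np.array([40.0, 60.0])
F = rbf_factors(coords, centers, widths)   # (2, 400)
W_true = rng.normal(size=(30, 2))
W_hat = ridge_weights(W_true @ F, F)
```

Splitting the update this way — linear solve for weights, non-linear least squares for centers/widths — is what keeps per-iteration cost manageable.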

BIDS Integration Layer

  • Input Parser: Automatic detection and validation of BIDS vs array inputs
  • Preprocessing Pipeline: HTFAPreprocessor class with configurable steps
  • Metadata Handling: Preserve and propagate BIDS metadata through analysis
  • Output Writer: BIDS derivatives-compliant results export
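For the output writer, a derivatives-style path might be assembled as below. The exact entities and suffixes the toolbox will emit are not specified here, so treat the naming (`desc-` entity, `statmap` suffix) as an illustrative approximation of the BIDS derivatives convention, not its definitive form:

```python
from pathlib import Path

def derivative_path(out_root, pipeline, subject, desc, suffix, ext=".nii.gz"):
    """Assemble a BIDS-derivatives style output path, e.g.
    <out_root>/derivatives/htfa/sub-01/sub-01_desc-factors_statmap.nii.gz"""
    sub = f"sub-{subject}"
    name = f"{sub}_desc-{desc}_{suffix}{ext}"
    return Path(out_root) / "derivatives" / pipeline / sub / name

p = derivative_path("/data/study", "htfa", "01", "factors", "statmap")
```

Centralizing name construction in one helper keeps every exported file consistent with the spec and makes later entity changes a one-line fix.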

Visualization and Results

  • HTFAResults Container: Comprehensive results storage with metadata
  • Brain Plotting: Leverage nilearn for professional brain visualizations
  • Time Series Plots: matplotlib/seaborn for temporal weight visualization
  • Export Functions: NIfTI reconstruction and BIDS derivatives output
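A stripped-down sketch of what the results container could look like, omitting the plotting and NIfTI export described above (the attribute names are assumptions):

```python
from dataclasses import dataclass, field

import numpy as np

@dataclass
class HTFAResults:
    """Minimal results container: global factors, per-subject weights."""
    factors: np.ndarray   # (K, V) global spatial factor images
    weights: list         # per-subject (T_i, K) temporal weight matrices
    metadata: dict = field(default_factory=dict)

    @property
    def n_factors(self) -> int:
        return self.factors.shape[0]

    def reconstruct(self, subject: int = 0) -> np.ndarray:
        """Low-rank reconstruction of one subject's data matrix, W @ F."""
        return self.weights[subject] @ self.factors

res = HTFAResults(
    factors=np.ones((3, 100)),
    weights=[np.ones((20, 3))],
    metadata={"n_subjects": 1},
)
```

Methods like `reconstruct()` give users direct programmatic access to the fitted model alongside the plotting and export conveniences.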

Testing and Validation Framework

  • Synthetic Data Generator: HTFA generative process implementation
  • Parameter Recovery Tests: Validate algorithm accuracy on known ground truth
  • BIDS Test Datasets: Synthetic BIDS-formatted data for integration testing
  • Performance Benchmarking: Runtime and memory usage measurement
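The synthetic data generator follows directly from the generative model: sample RBF factor centers and widths, Gaussian temporal weights, then add observation noise. A simplified single-subject version (parameter names and ranges are illustrative):

```python
import numpy as np

def simulate_subject(n_voxels=300, n_timepoints=40, n_factors=3,
                     noise_sd=0.05, seed=0):
    """Draw one subject from a simplified HTFA generative process.
    Ground-truth factors and weights are returned so recovery tests
    can compare fitted parameters against known values."""
    rng = np.random.default_rng(seed)
    coords = rng.uniform(0, 50, (n_voxels, 3))
    centers = rng.uniform(10, 40, (n_factors, 3))
    widths = rng.uniform(30, 80, n_factors)
    d2 = ((coords[None] - centers[:, None]) ** 2).sum(-1)
    F = np.exp(-d2 / widths[:, None])               # (K, V) spatial factors
    W = rng.normal(size=(n_timepoints, n_factors))  # (T, K) temporal weights
    Y = W @ F + noise_sd * rng.normal(size=(n_timepoints, n_voxels))
    return {"Y": Y, "coords": coords, "F": F, "W": W,
            "centers": centers, "widths": widths}

data = simulate_subject()
```

Because the noise level and factor count are parameters, the same generator drives both the parameter-recovery tests (low noise, known truth) and the robustness tests (high noise, ill-conditioned settings).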

Implementation Strategy

Development Approach

  • Core-First: Implement and validate algorithms before integration layers
  • Test-Driven: Synthetic data validation drives correctness verification
  • Incremental Integration: Add BIDS support after core algorithms are stable
  • Modular Design: Clear separation between algorithms, preprocessing, and visualization

Risk Mitigation Strategy

  • Algorithm Validation: Extensive testing with synthetic data before real data
  • Performance Monitoring: Early profiling to identify optimization needs
  • API Evolution: Design for extensibility without breaking changes
  • Error Recovery: Comprehensive input validation and graceful failure handling

Quality Assurance

  • Continuous Testing: >90% coverage with synthetic and edge case testing
  • Type Safety: Full mypy compliance for algorithm correctness
  • Performance Baselines: Establish benchmarks for future optimization
  • Documentation Standards: Google-style docstrings with usage examples

Task Breakdown Preview

High-level task categories that will be created:

  • Core TFA Algorithm: K-means initialization, non-linear optimization, convergence detection
  • Hierarchical HTFA Algorithm: Multi-subject optimization, global template, factor matching
  • Input Detection and BIDS Integration: Automatic parsing, validation, preprocessing pipeline
  • HTFAResults and Visualization: Results container, brain plotting, export functionality
  • Synthetic Data Generation: HTFA generative process, BIDS formatting, parameter recovery
  • API Integration and Polish: High-level API, error handling, documentation
  • Performance Validation: Benchmarking, memory profiling, optimization identification
  • Testing Infrastructure: Comprehensive test suite, CI integration, quality gates

Dependencies

External Package Dependencies

  • NumPy/SciPy: Core numerical computation and optimization
  • scikit-learn: BaseEstimator interface and clustering algorithms
  • nilearn: Neuroimaging preprocessing, visualization, and NIfTI handling
  • pybids: BIDS dataset parsing and validation
  • matplotlib/seaborn: Plotting and visualization
  • pandas: Tabular data handling for metadata

Internal Codebase Dependencies

  • Existing Package Structure: Build upon current htfa/ directory layout
  • Poetry Configuration: Extend current dependency management
  • Testing Framework: Enhance existing pytest infrastructure
  • Linting Setup: Use configured black, mypy, and quality tools

Research and Validation Dependencies

  • Technical Design Document: Algorithm specifications and implementation guidance
  • HTFA Mathematical Foundation: Generative process for synthetic data creation
  • Scikit-learn Patterns: API consistency and interface design
  • BIDS Specification: Output format compliance and metadata handling

Success Criteria (Technical)

Algorithm Correctness

  • Parameter Recovery: >95% accuracy on synthetic datasets across noise levels
  • Convergence Reliability: Stable optimization for >99% of valid input datasets
  • Numerical Stability: Robust performance with ill-conditioned data matrices
  • Multi-subject Consistency: Factor alignment and global template accuracy
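A parameter-recovery check of the kind described above can be written directly against `scipy.optimize.least_squares`: simulate one RBF factor with a known center and width, then verify the optimizer recovers them. This sketch uses generous starting values and box bounds to keep the width positive; tolerances and problem sizes here are illustrative, not the project's acceptance thresholds.

```python
import numpy as np
from scipy.optimize import least_squares

# Ground truth: a single RBF factor with known center and width.
rng = np.random.default_rng(1)
coords = rng.uniform(0, 50, (500, 3))
true_center = np.array([20.0, 25.0, 30.0])
true_width = 45.0
y = np.exp(-((coords - true_center) ** 2).sum(1) / true_width)
y += 0.01 * rng.normal(size=y.shape)  # low observation noise

def residuals(params):
    center, width = params[:3], params[3]
    return np.exp(-((coords - center) ** 2).sum(1) / width) - y

fit = least_squares(
    residuals,
    x0=np.r_[coords.mean(0), 30.0],                  # crude start
    bounds=([0, 0, 0, 1.0], [50, 50, 50, 500.0]),    # keep width positive
)
center_error = np.abs(fit.x[:3] - true_center).max()
```

Sweeping `noise_sd` and re-running this check across many seeds is how the ">95% accuracy across noise levels" criterion would be operationalized.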

Performance Benchmarks

  • Runtime Efficiency: Analysis time within 2x of BrainIAK baseline
  • Memory Efficiency: Linear scaling with dataset size, handle 100+ subjects
  • Preprocessing Speed: BIDS dataset loading and preprocessing < 25% of total runtime
  • Visualization Response: Plot generation < 5 seconds for typical factors

Code Quality Metrics

  • Test Coverage: >90% line coverage across all modules
  • Type Safety: 100% mypy compliance with no type ignores
  • Linting Compliance: Pass all black, isort, and style checks
  • Documentation Coverage: Google-style docstrings for all public APIs

User Experience Benchmarks

  • Installation Time: Complete setup in < 2 minutes on clean environment
  • API Learning Curve: Single-line analysis without prior knowledge
  • Error Clarity: Self-explanatory error messages with actionable solutions
  • Result Accessibility: Publication-ready visualizations without configuration

Estimated Effort

Overall Timeline

  • Core Development: 3-4 weeks of focused implementation
  • Integration and Testing: 1-2 weeks of comprehensive validation
  • Polish and Documentation: 1 week of API refinement and docs

Resource Requirements

  • Primary Developer: 1 full-time equivalent for algorithm implementation
  • Testing Support: 0.5 FTE for synthetic data generation and validation
  • Integration Expertise: 0.25 FTE for BIDS specification compliance

Critical Path Items

  1. TFA Algorithm Implementation: Foundation for all other components
  2. Synthetic Data Generation: Required for comprehensive testing
  3. HTFA Hierarchical Optimization: Most complex algorithmic component
  4. BIDS Integration: Essential for user adoption and workflow integration

Stats

  • Total tasks: 8
  • Parallel tasks: 3 (can be worked on simultaneously)
  • Sequential tasks: 5 (have dependencies)
  • Estimated total effort: 14-19 days (275 hours)
