Epic: initial-implementation #68

@jeremymanning

Description

Epic: Initial HTFA Implementation

Overview

Create a complete, production-ready HTFA toolbox with automatic BIDS dataset processing, core TFA/HTFA algorithms, rich visualization capabilities, and comprehensive validation through synthetic data testing. The implementation follows scikit-learn patterns and provides both high-level (htfa.fit()) and low-level (TFA.fit(), HTFA.fit()) APIs.

Architecture Decisions

Core Algorithm Design

  • Optimization Framework: SciPy's non-linear least squares for factor estimation, ridge regression for weights
  • Initialization Strategy: K-means clustering of spatial coordinates for robust starting points
  • Convergence Detection: Parameter change monitoring with configurable tolerance
  • Multi-subject Handling: Hierarchical optimization with global template and factor matching
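The initialization strategy above can be sketched in a few lines. This is illustrative, not the toolbox's actual code: `init_factor_centers` is a hypothetical name, but the approach — K-means over voxel coordinates to get well-spread starting centers for the non-linear least-squares fit — is exactly what the bullet describes.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_factor_centers(coords, n_factors, seed=0):
    """K-means over (V, 3) voxel coordinates yields well-spread
    starting centers for the non-linear factor optimization."""
    km = KMeans(n_clusters=n_factors, n_init=10, random_state=seed)
    km.fit(coords)
    return km.cluster_centers_

# Demo: 500 voxels scattered in a 100 mm cube, 5 factors.
rng = np.random.default_rng(0)
coords = rng.uniform(0, 100, size=(500, 3))
centers = init_factor_centers(coords, 5)
```

Because K-means centers are means of actual voxel positions, every starting point is guaranteed to lie inside the brain's coordinate range, which avoids degenerate starts far from the data.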

API Design Strategy

  • Input Detection: Automatic detection of BIDS directories vs NumPy arrays in .fit() method
  • Interface Pattern: Scikit-learn BaseEstimator for consistency with ML ecosystem
  • Results Container: Rich HTFAResults class with built-in visualization and export
  • Error Handling: Comprehensive validation with clear, actionable error messages
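A minimal sketch of the input-detection logic, assuming the BIDS check keys on `dataset_description.json` (which the BIDS spec requires at every dataset root); `detect_input_kind` is a hypothetical helper name:

```python
from pathlib import Path

import numpy as np

def detect_input_kind(data):
    """Route fit() input: in-memory arrays go straight to the solver;
    paths are validated as BIDS roots, which must contain
    dataset_description.json."""
    if isinstance(data, np.ndarray):
        return "array"
    root = Path(data)
    if root.is_dir() and (root / "dataset_description.json").exists():
        return "bids"
    raise ValueError(
        f"fit() expects a NumPy array or a BIDS root directory, got {data!r}; "
        "a BIDS root must contain dataset_description.json"
    )
```

The error message names both accepted input types and the missing file, matching the "clear, actionable error messages" goal above.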

Data Pipeline Architecture

  • BIDS Integration: pybids for dataset parsing, nilearn for preprocessing
  • Preprocessing: Configurable pipeline with sensible defaults
  • Output Format: BIDS derivatives specification compliance
  • Validation Strategy: Synthetic data generation using HTFA generative process
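One way the "configurable pipeline with sensible defaults" could surface in the API is a small config object. The field names below mirror common nilearn masker parameters (`smoothing_fwhm`, `standardize`, `detrend`, `high_pass`, `mask_strategy`), but the class itself and its default values are assumptions, not the toolbox's actual interface:

```python
from dataclasses import dataclass

@dataclass
class PreprocessConfig:
    """Hypothetical defaults for the configurable preprocessing pipeline."""
    smoothing_fwhm: float = 6.0   # mm, spatial smoothing kernel
    standardize: bool = True      # z-score each voxel time series
    detrend: bool = True          # remove linear drift
    high_pass: float = 0.008      # Hz, temporal high-pass cutoff
    mask_strategy: str = "epi"    # brain-mask computation strategy

default = PreprocessConfig()
```

Users override only what they need (e.g. `PreprocessConfig(smoothing_fwhm=8.0)`) while everything else keeps a documented default.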

Technical Approach

Core Algorithm Components

  • TFA Class: Single-subject spatial factor analysis with iterative optimization
  • HTFA Class: Multi-subject hierarchical analysis with global template computation
  • Initialization Module: K-means clustering and parameter initialization utilities
  • Optimization Engine: Robust numerical optimization with convergence monitoring
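The core factorization these components implement is Y ≈ W F, where each spatial factor is a radial basis function and the temporal weights have a closed-form ridge solution. A minimal numerical sketch (function names are illustrative):

```python
import numpy as np

def rbf_factors(coords, centers, widths):
    """Spatial factor images F[k, v] = exp(-||x_v - mu_k||^2 / lambda_k),
    evaluated at V voxel coordinates for K factors."""
    d2 = ((coords[None, :, :] - centers[:, None, :]) ** 2).sum(-1)  # (K, V)
    return np.exp(-d2 / widths[:, None])

def ridge_weights(Y, F, lam=1e-3):
    """Closed-form ridge solve for W in Y ~ W F: the fast half of the
    alternating TFA update (the slow half re-fits centers and widths)."""
    K = F.shape[0]
    G = F @ F.T + lam * np.eye(K)
    return np.linalg.solve(G, F @ Y.T).T  # W = Y F^T (F F^T + lam I)^{-1}

# Round-trip check: build noiseless data from known weights, recover them.
rng = np.random.default_rng(0)
coords = rng.uniform(0, 50, (400, 3))
centers = np.array([[10.0, 10.0, 10.0], [35.0, 30.0, 25.0]])
widths = np.array([40.0, 60.0])
F = rbf_factors(coords, centers, widths)   # (2, 400)
W_true = rng.normal(size=(30, 2))
W_hat = ridge_weights(W_true @ F, F)
```

Splitting the update this way — linear solve for weights, non-linear least squares for centers/widths — is what keeps per-iteration cost manageable.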

BIDS Integration Layer

  • Input Parser: Automatic detection and validation of BIDS vs array inputs
  • Preprocessing Pipeline: HTFAPreprocessor class with configurable steps
  • Metadata Handling: Preserve and propagate BIDS metadata through analysis
  • Output Writer: BIDS derivatives-compliant results export
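For the output writer, a derivatives-style path might be assembled as below. The exact entities and suffixes the toolbox will emit are not specified here, so treat the naming (`desc-` entity, `statmap` suffix) as an illustrative approximation of the BIDS derivatives convention, not its definitive form:

```python
from pathlib import Path

def derivative_path(out_root, pipeline, subject, desc, suffix, ext=".nii.gz"):
    """Assemble a BIDS-derivatives style output path, e.g.
    <out_root>/derivatives/htfa/sub-01/sub-01_desc-factors_statmap.nii.gz"""
    sub = f"sub-{subject}"
    name = f"{sub}_desc-{desc}_{suffix}{ext}"
    return Path(out_root) / "derivatives" / pipeline / sub / name

p = derivative_path("/data/study", "htfa", "01", "factors", "statmap")
```

Centralizing name construction in one helper keeps every exported file consistent with the spec and makes later entity changes a one-line fix.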

Visualization and Results

  • HTFAResults Container: Comprehensive results storage with metadata
  • Brain Plotting: Leverage nilearn for professional brain visualizations
  • Time Series Plots: matplotlib/seaborn for temporal weight visualization
  • Export Functions: NIfTI reconstruction and BIDS derivatives output
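A stripped-down sketch of what the results container could look like, omitting the plotting and NIfTI export described above (the attribute names are assumptions):

```python
from dataclasses import dataclass, field

import numpy as np

@dataclass
class HTFAResults:
    """Minimal results container: global factors, per-subject weights."""
    factors: np.ndarray   # (K, V) global spatial factor images
    weights: list         # per-subject (T_i, K) temporal weight matrices
    metadata: dict = field(default_factory=dict)

    @property
    def n_factors(self) -> int:
        return self.factors.shape[0]

    def reconstruct(self, subject: int = 0) -> np.ndarray:
        """Low-rank reconstruction of one subject's data matrix, W @ F."""
        return self.weights[subject] @ self.factors

res = HTFAResults(
    factors=np.ones((3, 100)),
    weights=[np.ones((20, 3))],
    metadata={"n_subjects": 1},
)
```

Methods like `reconstruct()` give users direct programmatic access to the fitted model alongside the plotting and export conveniences.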

Testing and Validation Framework

  • Synthetic Data Generator: HTFA generative process implementation
  • Parameter Recovery Tests: Validate algorithm accuracy on known ground truth
  • BIDS Test Datasets: Synthetic BIDS-formatted data for integration testing
  • Performance Benchmarking: Runtime and memory usage measurement
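The synthetic data generator follows directly from the generative model: sample RBF factor centers and widths, Gaussian temporal weights, then add observation noise. A simplified single-subject version (parameter names and ranges are illustrative):

```python
import numpy as np

def simulate_subject(n_voxels=300, n_timepoints=40, n_factors=3,
                     noise_sd=0.05, seed=0):
    """Draw one subject from a simplified HTFA generative process.
    Ground-truth factors and weights are returned so recovery tests
    can compare fitted parameters against known values."""
    rng = np.random.default_rng(seed)
    coords = rng.uniform(0, 50, (n_voxels, 3))
    centers = rng.uniform(10, 40, (n_factors, 3))
    widths = rng.uniform(30, 80, n_factors)
    d2 = ((coords[None] - centers[:, None]) ** 2).sum(-1)
    F = np.exp(-d2 / widths[:, None])               # (K, V) spatial factors
    W = rng.normal(size=(n_timepoints, n_factors))  # (T, K) temporal weights
    Y = W @ F + noise_sd * rng.normal(size=(n_timepoints, n_voxels))
    return {"Y": Y, "coords": coords, "F": F, "W": W,
            "centers": centers, "widths": widths}

data = simulate_subject()
```

Because the noise level and factor count are parameters, the same generator drives both the parameter-recovery tests (low noise, known truth) and the robustness tests (high noise, ill-conditioned settings).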

Implementation Strategy

Development Approach

  • Core-First: Implement and validate algorithms before integration layers
  • Test-Driven: Synthetic data validation drives correctness verification
  • Incremental Integration: Add BIDS support after core algorithms are stable
  • Modular Design: Clear separation between algorithms, preprocessing, and visualization

Risk Mitigation Strategy

  • Algorithm Validation: Extensive testing with synthetic data before real data
  • Performance Monitoring: Early profiling to identify optimization needs
  • API Evolution: Design for extensibility without breaking changes
  • Error Recovery: Comprehensive input validation and graceful failure handling

Quality Assurance

  • Continuous Testing: >90% coverage with synthetic and edge case testing
  • Type Safety: Full mypy compliance for algorithm correctness
  • Performance Baselines: Establish benchmarks for future optimization
  • Documentation Standards: Google-style docstrings with usage examples

Task Breakdown Preview

High-level task categories that will be created:

  • Core TFA Algorithm: K-means initialization, non-linear optimization, convergence detection
  • Hierarchical HTFA Algorithm: Multi-subject optimization, global template, factor matching
  • Input Detection and BIDS Integration: Automatic parsing, validation, preprocessing pipeline
  • HTFAResults and Visualization: Results container, brain plotting, export functionality
  • Synthetic Data Generation: HTFA generative process, BIDS formatting, parameter recovery
  • API Integration and Polish: High-level API, error handling, documentation
  • Performance Validation: Benchmarking, memory profiling, optimization identification
  • Testing Infrastructure: Comprehensive test suite, CI integration, quality gates

Dependencies

External Package Dependencies

  • NumPy/SciPy: Core numerical computation and optimization
  • scikit-learn: BaseEstimator interface and clustering algorithms
  • nilearn: Neuroimaging preprocessing, visualization, and NIfTI handling
  • pybids: BIDS dataset parsing and validation
  • matplotlib/seaborn: Plotting and visualization
  • pandas: Tabular data handling for metadata

Internal Codebase Dependencies

  • Existing Package Structure: Build upon current htfa/ directory layout
  • Poetry Configuration: Extend current dependency management
  • Testing Framework: Enhance existing pytest infrastructure
  • Linting Setup: Use configured black, mypy, and quality tools

Research and Validation Dependencies

  • Technical Design Document: Algorithm specifications and implementation guidance
  • HTFA Mathematical Foundation: Generative process for synthetic data creation
  • Scikit-learn Patterns: API consistency and interface design
  • BIDS Specification: Output format compliance and metadata handling

Success Criteria (Technical)

Algorithm Correctness

  • Parameter Recovery: >95% accuracy on synthetic datasets across noise levels
  • Convergence Reliability: Stable optimization for >99% of valid input datasets
  • Numerical Stability: Robust performance with ill-conditioned data matrices
  • Multi-subject Consistency: Factor alignment and global template accuracy
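A parameter-recovery check of the kind described above can be written directly against `scipy.optimize.least_squares`: simulate one RBF factor with a known center and width, then verify the optimizer recovers them. This sketch uses generous starting values and box bounds to keep the width positive; tolerances and problem sizes here are illustrative, not the project's acceptance thresholds.

```python
import numpy as np
from scipy.optimize import least_squares

# Ground truth: a single RBF factor with known center and width.
rng = np.random.default_rng(1)
coords = rng.uniform(0, 50, (500, 3))
true_center = np.array([20.0, 25.0, 30.0])
true_width = 45.0
y = np.exp(-((coords - true_center) ** 2).sum(1) / true_width)
y += 0.01 * rng.normal(size=y.shape)  # low observation noise

def residuals(params):
    center, width = params[:3], params[3]
    return np.exp(-((coords - center) ** 2).sum(1) / width) - y

fit = least_squares(
    residuals,
    x0=np.r_[coords.mean(0), 30.0],                  # crude start
    bounds=([0, 0, 0, 1.0], [50, 50, 50, 500.0]),    # keep width positive
)
center_error = np.abs(fit.x[:3] - true_center).max()
```

Sweeping `noise_sd` and re-running this check across many seeds is how the ">95% accuracy across noise levels" criterion would be operationalized.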

Performance Benchmarks

  • Runtime Efficiency: Analysis time within 2x of BrainIAK baseline
  • Memory Efficiency: Linear scaling with dataset size, handle 100+ subjects
  • Preprocessing Speed: BIDS dataset loading and preprocessing < 25% of total runtime
  • Visualization Response: Plot generation < 5 seconds for typical factors

Code Quality Metrics

  • Test Coverage: >90% line coverage across all modules
  • Type Safety: 100% mypy compliance with no type ignores
  • Linting Compliance: Pass all black, isort, and style checks
  • Documentation Coverage: Google-style docstrings for all public APIs

User Experience Benchmarks

  • Installation Time: Complete setup in < 2 minutes on clean environment
  • API Learning Curve: Single-line analysis without prior knowledge
  • Error Clarity: Self-explanatory error messages with actionable solutions
  • Result Accessibility: Publication-ready visualizations without configuration

Estimated Effort

Overall Timeline

  • Core Development: 3-4 weeks of focused implementation
  • Integration and Testing: 1-2 weeks of comprehensive validation
  • Polish and Documentation: 1 week of API refinement and docs

Resource Requirements

  • Primary Developer: 1 full-time equivalent for algorithm implementation
  • Testing Support: 0.5 FTE for synthetic data generation and validation
  • Integration Expertise: 0.25 FTE for BIDS specification compliance

Critical Path Items

  1. TFA Algorithm Implementation: Foundation for all other components
  2. Synthetic Data Generation: Required for comprehensive testing
  3. HTFA Hierarchical Optimization: Most complex algorithmic component
  4. BIDS Integration: Essential for user adoption and workflow integration

Stats

  • Total tasks: 8
  • Parallel tasks: 3 (can be worked on simultaneously)
  • Sequential tasks: 5 (have dependencies)
  • Estimated total effort: 14-19 days (275 hours)
