|
| 1 | +# Pydala Codebase Refactoring Plan |
| 2 | + |
| 3 | +## Overview |
| 4 | +This document outlines a comprehensive refactoring plan for the Pydala codebase, focusing on improving code quality, reducing complexity, and enhancing maintainability while ensuring full backward compatibility. |
| 5 | + |
| 6 | +## Analysis Summary |
| 7 | + |
| 8 | +The codebase analysis revealed several areas for improvement: |
| 9 | +- **Code Quality**: Large blocks of commented code, bare exception clauses |
| 10 | +- **Complexity**: Methods with high cyclomatic complexity, duplicate code |
| 11 | +- **Type Safety**: Missing type hints, insufficient input validation |
| 12 | +- **Maintainability**: Magic numbers, scattered configuration |
| 13 | + |
| 14 | +## Priority 1: Code Cleanup and Bug Fixes |
| 15 | + |
| 16 | +### 1. Remove Commented Code in misc.py |
| 17 | +- **Location**: `pydala/helpers/misc.py:15-370` |
| 18 | +- **Issue**: ~355 lines of commented code (90% of file) |
| 19 | +- **Action**: Remove all commented code blocks |
| 20 | +- **Impact**: |
| 21 | + - Reduces file size from ~400 lines to ~45 lines |
| 22 | + - Improves code readability and navigation |
| 23 | + - Eliminates confusion about deprecated functionality |
| 24 | + |
| 25 | +### 2. Fix Critical Bug in catalog.py |
| 26 | +- **Location**: `pydala/catalog.py:63-66` |
| 27 | +- **Current Code**: |
| 28 | + ```python |
| 29 | + # Update the catalog with itself? |
| 30 | + catalog_dict.update(self.to_dict()) # This line is suspicious |
| 31 | + ``` |
| 32 | +- **Issue**: Redundant self-update potentially causing data corruption |
| 33 | +- **Fix**: Remove or properly document this update logic |
| 34 | + |
| 35 | +### 3. Fix Bare Exception Clause |
| 36 | +- **Location**: `pydala/io.py:49` |
| 37 | +- **Current Code**: |
| 38 | + ```python |
| 39 | + except Exception: |
| 40 | + pass |
| 41 | + ``` |
| 42 | +- **Issue**: Masks all errors without logging |
| 43 | +- **Fix**: |
| 44 | + ```python |
| 45 | + except (OSError, IOError) as e: |
| 46 | + logger.warning(f"Failed to process {file_path}: {e}") |
| 47 | + ``` |
| 48 | + |
| 49 | +## Priority 2: Reduce Complexity |
| 50 | + |
| 51 | +### 4. Simplify _get_sort_by Method |
| 52 | +- **Location**: `pydala/table.py:30-93` |
| 53 | +- **Issue**: 63-line method with high cyclomatic complexity |
| 54 | +- **Action**: Extract helper functions: |
| 55 | + - `_parse_string_sort(value: str) -> SortKey` |
| 56 | + - `_parse_callable_sort(value: Callable) -> SortKey` |
| 57 | + - `_validate_sort_key(key: SortKey) -> None` |
| 58 | +- **Benefits**: |
| 59 | + - Improves testability |
| 60 | + - Reduces cognitive complexity |
| 61 | + - Enables better error messages |
| 62 | + |
| 63 | +### 5. Remove Duplicate Scanner Method |
| 64 | +- **Location**: `pydala/table.py` |
| 65 | +- **Issue**: `to_scanner()` and `scanner()` are identical (lines 748-755, 757-764) |
| 66 | +- **Action**: |
| 67 | + - Keep `scanner()` as primary method |
| 68 | + - Mark `to_scanner()` as deprecated with warning |
| 69 | + - Update documentation |
| 70 | + |
| 71 | +### 6. Extract Scanner Parameters |
| 72 | +- **Issue**: Scanner configuration duplicated across 10+ methods |
| 73 | +- **Action**: Create `ScannerConfig` dataclass: |
| 74 | + ```python |
| 75 | + @dataclass |
| 76 | + class ScannerConfig: |
| 77 | + batch_size: int = 131072 |
| 78 | + buffer_size: int = 65536 |
| 79 | + prefetch: int = 2 |
| 80 | + num_threads: int = 4 |
| 81 | + ``` |
| 82 | +- **Files to Update**: `table.py`, `dataset.py`, `scanner.py` |
| 83 | +- **Benefits**: |
| 84 | + - Centralized configuration |
| 85 | + - Easier parameter management |
| 86 | + - Consistent defaults across the codebase |
| 87 | + |
| 88 | +## Priority 3: Improve Type Safety |
| 89 | + |
| 90 | +### 7. Add Type Hints |
| 91 | +- **Focus Areas**: |
| 92 | + - All public methods in core modules |
| 93 | + - Complex methods in `table.py`, `dataset.py` |
| 94 | + - Helper functions in `misc.py` |
| 95 | +- **Example**: |
| 96 | + ```python |
| 97 | + def head( |
| 98 | + self, |
| 99 | + n: int = 5, |
| 100 | + columns: Optional[List[str]] = None |
| 101 | + ) -> "Table": |
| 102 | + ... |
| 103 | + ``` |
| 104 | + |
| 105 | +### 8. Add Input Validation |
| 106 | +- **Approach**: |
| 107 | + - Add parameter validation at public API boundaries |
| 108 | + - Use clear, descriptive error messages |
| 109 | + - Validate types, ranges, and constraints |
| 110 | +- **Example**: |
| 111 | + ```python |
| 112 | + if not isinstance(n, int) or n < 0: |
| 113 | + raise ValueError("n must be a non-negative integer") |
| 114 | + ``` |
| 115 | + |
| 116 | +## Priority 4: Maintainability Improvements |
| 117 | + |
| 118 | +### 9. Create Constants File |
| 119 | +- **New File**: `pydala/constants.py` |
| 120 | +- **Content**: |
| 121 | + ```python |
| 122 | + # Performance tuning |
| 123 | + DEFAULT_BATCH_SIZE = 131072 |
| 124 | + DEFAULT_BUFFER_SIZE = 65536 |
| 125 | + DEFAULT_PREFETCH_COUNT = 2 |
| 126 | + DEFAULT_THREAD_COUNT = 4 |
| 127 | + |
| 128 | + # Validation |
| 129 | + MAX_COLUMN_NAME_LENGTH = 255 |
| 130 | + MIN_PARTITION_SIZE = 1024 |
| 131 | + ``` |
| 132 | +- **Benefits**: |
| 133 | + - Single source of truth |
| 134 | + - Easier tuning |
| 135 | + - Better documentation |
| 136 | + |
| 137 | +### 10. Fix Test Infrastructure |
| 138 | +- **Location**: `tests/test_table.py` |
| 139 | +- **Issues**: |
| 140 | + - Missing imports |
| 141 | + - Incomplete test coverage |
| 142 | + - Outdated test cases |
| 143 | +- **Action**: |
| 144 | + - Fix import statements |
| 145 | + - Add tests for refactored methods |
| 146 | + - Ensure 90%+ code coverage |
| 147 | + |
| 148 | +## Implementation Strategy |
| 149 | + |
| 150 | +### Phase 1: Code Cleanup (Days 1-2) |
| 151 | +1. Remove commented code |
| 152 | +2. Fix critical bugs |
| 153 | +3. Update exception handling |
| 154 | + |
| 155 | +### Phase 2: Complexity Reduction (Days 3-5) |
| 156 | +1. Extract helper methods |
| 157 | +2. Remove duplicate code |
| 158 | +3. Create configuration classes |
| 159 | + |
| 160 | +### Phase 3: Type Safety (Days 6-7) |
| 161 | +1. Add type hints |
| 162 | +2. Add input validation |
| 163 | +3. Update documentation |
| 164 | + |
| 165 | +### Phase 4: Maintainability (Days 8-9) |
| 166 | +1. Create constants file |
| 167 | +2. Update tests |
| 168 | +3. Performance optimization |
| 169 | + |
| 170 | +### Phase 5: Testing and Documentation (Day 10) |
| 171 | +1. Run full test suite |
| 172 | +2. Update API documentation |
| 173 | +3. Create migration guide |
| 174 | + |
| 175 | +## Backward Compatibility Guarantees |
| 176 | + |
| 177 | +1. **API Stability**: No breaking changes to public APIs |
| 178 | +2. **Method Signatures**: All existing parameters remain supported |
| 179 | +3. **Return Types**: No changes to return types |
| 180 | +4. **Error Handling**: Same exceptions thrown for same conditions |
| 181 | +5. **Deprecation Path**: Deprecated methods will warn but continue to work |
| 182 | + |
| 183 | +## Risk Assessment |
| 184 | + |
| 185 | +### Low Risk |
| 186 | +- Removing commented code |
| 187 | +- Adding type hints |
| 188 | +- Creating constants file |
| 189 | + |
| 190 | +### Medium Risk |
| 191 | +- Extracting helper methods |
| 192 | +- Adding input validation |
| 193 | +- Fixing exception handling |
| 194 | + |
| 195 | +### High Risk |
| 196 | +- Bug fixes in core logic |
| 197 | +- Removing duplicate methods |
| 198 | +- Performance optimizations |
| 199 | + |
| 200 | +## Success Metrics |
| 201 | + |
| 202 | +1. **Code Quality**: |
| 203 | + - Reduce cyclomatic complexity by 40% |
| 204 | + - Eliminate all commented code |
| 205 | + - Fix all linting issues |
| 206 | + |
| 207 | +2. **Maintainability**: |
| 208 | + - Achieve 90%+ test coverage |
| 209 | + - Add type hints to 100% of public methods |
| 210 | + - Reduce code duplication by 30% |
| 211 | + |
| 212 | +3. **Performance**: |
| 213 | + - No performance regression |
| 214 | + - 10% improvement in memory usage for large datasets |
| 215 | + |
| 216 | +## Rollback Plan |
| 217 | + |
| 218 | +1. Git tags will be created before each major change |
| 219 | +2. Each commit will be atomic and revertable |
| 220 | +3. Continuous integration will catch regressions early |
| 221 | +4. Feature flags for performance optimizations |
| 222 | + |
| 223 | +## Next Steps |
| 224 | + |
| 225 | +1. Create feature branch for refactoring |
| 226 | +2. Set up continuous integration |
| 227 | +3. Begin with Phase 1 (Code Cleanup) |
| 228 | +4. Regular progress updates to stakeholders |
| 229 | +5. Code reviews for each change |
| 230 | + |
| 231 | +This plan provides a structured approach to improving the Pydala codebase while ensuring stability and backward compatibility throughout the process. |
0 commit comments