| 1 | +```markdown |
1 | 2 | # Ingest Module Migration |
2 | 3 |
3 | 4 | This document records the migration of the Rust ingest code from `rust_ingest/` to |
@@ -39,3 +40,158 @@ This document records the migration of the Rust ingest code from `rust_ingest/` |
39 | 40 | 2. The code is organized by domain rather than technology |
40 | 41 | 3. Module boundaries are clearer in the new structure |
41 | 42 | 4. Future functionality can be added to the `src` directory with consistent organization |
| 43 | + |
| 44 | +## Shell Scripts Migration Plan |
| 45 | + |
| 46 | +This section outlines the plan to migrate essential shell scripts to Rust. The goal is to replace critical bash scripts with more maintainable, performant, and type-safe Rust implementations. |
| 47 | + |
| 48 | +### Migration Candidates (Prioritized) |
| 49 | + |
| 50 | +1. **text_chunker.sh** - Text chunking utility |
| 51 | +2. **ingest_chunked.sh** - Character-based chunking for Marvel character data |
| 52 | +3. **test_embedding_size.sh** - Embedding size testing utility |
| 53 | +4. **status.sh** - System status checking |
| 54 | +5. **JSON tools** (format_json.sh, validate_json_schema.sh) |
| 55 | + |
| 56 | +### Migration Strategy |
| 57 | + |
| 58 | +#### Phase 1: Core Text Chunking Utilities |
| 59 | + |
| 60 | +1. **Create text chunking module** in `src/utils/chunking.rs` |
| 61 | + - Implement character-based chunking |
| 62 | + - Implement size-based chunking |
| 63 | + - Implement semantic chunking |
| 64 | + - Support all functionality from text_chunker.sh |
| 65 | + |
| 66 | +2. **API Design**: |
| 67 | + |
| 68 | + ```rust |
| 69 | + pub enum ChunkingStrategy { |
| 70 | + Size(usize), // max_size |
| 71 | + Character(usize), // target_size |
| 72 | + Semantic, // natural breaks |
| 73 | + } |
| 74 | + |
| 75 | + pub struct ChunkerOptions { |
| 76 | + strategy: ChunkingStrategy, |
| 77 | + preserve_whitespace: bool, |
| 78 | + delimiter: Option<String>, |
| 79 | + } |
| 80 | + |
| 81 | + pub fn chunk_text(text: &str, options: ChunkerOptions) -> Vec<String>; |
| 82 | + ``` |
| 83 | + |
| 84 | +#### Phase 2: Embedding API Integration |
| 85 | + |
| 86 | +1. **Create API module** in `src/core/embedding/ollama_api.rs`: |
| 87 | + - Implement functions for checking Ollama API status |
| 88 | + - Implement embedding generation with proper error handling |
| 89 | +   - Support request timeouts and chunking of larger texts |
| 90 | + |
| 91 | +2. **API Design**: |
| 92 | + |
| 93 | + ```rust |
| 94 | + pub struct OllamaApiClient { |
| 95 | + base_url: String, |
| 96 | + timeout: Duration, |
| 97 | + } |
| 98 | + |
| 99 | + impl OllamaApiClient { |
| 100 | + pub fn new(base_url: &str) -> Self; |
| 101 | + pub fn check_status(&self) -> Result<bool>; |
| 102 | + pub fn generate_embedding(&self, text: &str, model: &str) -> Result<Vec<f32>>; |
| 103 | + } |
| 104 | + ``` |
| 105 | + |
| 106 | +#### Phase 3: Ingest Chunked Implementation |
| 107 | + |
| 108 | +1. **Extend existing ingest module**: |
| 109 | + - Add character-based chunking to `src/ingest/ingest.rs` |
| 110 | +   - Support JSON field extraction and processing |
| 111 | +   - Add progress logging and status reporting |
| 112 | + |
| 113 | +2. **Command-line interface extensions**: |
| 114 | + |
| 115 | + ```rust |
| 116 | + // CLI Options for chunked ingestion |
| 117 | + pub struct ChunkedIngestOptions { |
| 118 | + source_file: PathBuf, |
| 119 | + chunk_size: usize, |
| 120 | + model: String, |
| 121 | + log_file: Option<PathBuf>, |
| 122 | + } |
| 123 | + ``` |
| 124 | + |
| 125 | +#### Phase 4: CLI Enhancements |
| 126 | + |
| 127 | +1. **Unified CLI interface** in `src/main.rs`: |
| 128 | + |
| 129 | + ```rust |
| 130 | + fn main() { |
| 131 | + let matches = clap::Command::new("harald") |
| 132 | + .subcommand( |
| 133 | + clap::Command::new("chunk") |
| 134 | + .about("Chunk text using various strategies") |
| 135 | + // options |
| 136 | + ) |
| 137 | + .subcommand( |
| 138 | + clap::Command::new("ingest") |
| 139 | + .about("Ingest data into the vector database") |
| 140 | + // options |
| 141 | + ) |
| 142 | + // other subcommands |
| 143 | + .get_matches(); |
| 144 | + |
| 145 | + // handle commands |
| 146 | + } |
| 147 | + ``` |
| 148 | + |
| 149 | +### Implementation Roadmap |
| 150 | + |
| 151 | +1. **Week 1**: Implement text chunking module |
| 152 | + - ✅ Create chunking module in src/utils/chunking.rs |
| 153 | + - ✅ Create binary wrapper in src/utils/chunker_bin.rs |
| 154 | + - ✅ Update build configuration in Cargo.toml |
| 155 | + - ✅ Create compatibility wrapper scripts |
| 156 | + |
| 157 | +2. **Week 2**: Implement Ollama API client |
| 158 | + - ✅ Create Ollama API client module in src/core/embedding/ollama_api.rs |
| 159 | + - ⏳ Implement chunking-aware embedding generation |
| 160 | + - ⏳ Add proper error handling and logging |
| 161 | + - ⏳ Create test cases for API client |
| 162 | + |
| 163 | +3. **Week 3**: Extend ingest module with chunked ingestion |
| 164 | + - ⏳ Integrate text chunking with ingest process |
| 165 | + - ⏳ Implement character-based chunking for large fields |
| 166 | + - ⏳ Support semantic chunking for description fields |
| 167 | + - ⏳ Add progress reporting and better error messages |
| 168 | + |
| 169 | +4. **Week 4**: Create unified CLI and compatibility wrappers |
| 170 | + - ⏳ Design comprehensive CLI interface |
| 171 | + - ⏳ Implement subcommands (ingest, query, chunk, etc.) |
| 172 | + - ⏳ Create compatibility wrappers for all scripts |
| 173 | + - ⏳ Update documentation |
| 174 | + |
| 175 | +### Current Status (July 21, 2025) |
| 176 | + |
| 177 | +- Successfully migrated text_chunker.sh to Rust |
| 178 | +- Created a compatibility wrapper to maintain script interface |
| 179 | +- Implemented both character-based and semantic chunking strategies |
| 180 | +- Started work on the Ollama API client module |
| 181 | +- Compiled and tested the text_chunker binary successfully |
| 182 | + |
| 183 | +### Testing Strategy |
| 184 | + |
| 185 | +1. Create unit tests for each component |
| 186 | +2. Create integration tests that compare output with existing shell scripts |
| 187 | +3. Benchmark performance against shell script implementations |
| 188 | + |
| 189 | +### Compatibility Considerations |
| 190 | + |
| 191 | +During the transition period: |
| 192 | + |
| 193 | +1. Maintain shell script wrappers that call the Rust implementations |
| 194 | +2. Ensure consistent output formats and logging |
| 195 | +3. Document migration details for users |
| 196 | + |
| 197 | +``` |
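
For reference, the Phase 1 `chunk_text` API in this diff could be implemented roughly as below. This is a minimal sketch using only the standard library, not the code in `src/utils/chunking.rs`. The type and function names come from the plan, but the behavior of each strategy is an assumption: `Size` is treated as a hard cap in characters, `Character` as greedy word packing up to a target length, and `Semantic` as a split on blank lines (or on `delimiter` when one is given). The `pack_words` helper is hypothetical.

```rust
/// Names mirror the Phase 1 API design; the semantics of each variant
/// are assumptions made for this illustration.
#[derive(Debug, Clone, PartialEq)]
pub enum ChunkingStrategy {
    Size(usize),      // assumed: hard cap, at most this many characters per chunk
    Character(usize), // assumed: pack whole words up to this many characters
    Semantic,         // assumed: split on a delimiter (blank line by default)
}

#[derive(Debug, Clone)]
pub struct ChunkerOptions {
    pub strategy: ChunkingStrategy,
    pub preserve_whitespace: bool,
    pub delimiter: Option<String>,
}

/// Splits `text` into chunks according to `options.strategy`.
/// Empty chunks are dropped; chunks are trimmed unless
/// `preserve_whitespace` is set.
pub fn chunk_text(text: &str, options: ChunkerOptions) -> Vec<String> {
    let raw: Vec<String> = match &options.strategy {
        ChunkingStrategy::Size(max) => {
            // Work on chars rather than bytes so multi-byte UTF-8 is never cut.
            let max = (*max).max(1);
            let chars: Vec<char> = text.chars().collect();
            chars.chunks(max).map(|c| c.iter().collect()).collect()
        }
        // Word packing normalizes whitespace to single spaces, so
        // `preserve_whitespace` has no effect for this strategy.
        ChunkingStrategy::Character(target) => pack_words(text, (*target).max(1)),
        ChunkingStrategy::Semantic => {
            let delim = options.delimiter.as_deref().unwrap_or("\n\n");
            text.split(delim).map(str::to_string).collect()
        }
    };

    raw.into_iter()
        .map(|c| {
            if options.preserve_whitespace {
                c
            } else {
                c.trim().to_string()
            }
        })
        .filter(|c| !c.is_empty())
        .collect()
}

/// Hypothetical helper: greedily joins whitespace-separated words with single
/// spaces, starting a new chunk when adding a word would exceed `target`.
/// A single word longer than `target` becomes its own oversized chunk.
fn pack_words(text: &str, target: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    let mut current_len = 0usize;

    for word in text.split_whitespace() {
        let word_len = word.chars().count();
        if current_len > 0 && current_len + 1 + word_len > target {
            chunks.push(std::mem::take(&mut current));
            current_len = 0;
        }
        if current_len > 0 {
            current.push(' ');
            current_len += 1;
        }
        current.push_str(word);
        current_len += word_len;
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

Calling `chunk_text` with `ChunkingStrategy::Semantic` and no delimiter on two paragraphs separated by a blank line yields one chunk per paragraph. Taking `ChunkerOptions` by value matches the signature in the plan; borrowing it instead would avoid rebuilding the options for every call.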