Commit 990481d

Refactor project structure and implement text chunking utility
- Renamed package from "harald_ingest" to "harald" and updated metadata in Cargo.toml.
- Added a new binary for text chunking functionality, with CLI support.
- Migrated text chunking logic from a shell script to a Rust module.
- Implemented embedding generation with Ollama API client.
- Created core modules for embedding, ingesting data, and API endpoints.
- Added utility functions for text chunking with multiple strategies (size-based, character-based, semantic).
- Established a wrapper script for backward compatibility during migration.
- Updated main ingest module to reflect new structure and imports.
- Added tests for chunking functionality to ensure reliability.
1 parent c434172 commit 990481d

16 files changed: +1201 -316 lines changed


README.md

Lines changed: 2 additions & 1 deletion
@@ -76,7 +76,8 @@ pragmatic execution, and narrative continuity.
- [Directory Structure](docs/migration/RECOMMENDED-STRUCTURE.md) - Organization standards
- [Implementation Plan](docs/migration/IMPLEMENTATION-PLAN.md) - Migration strategy
- [Ingest Migration](docs/migration/INGEST-MIGRATION.md) - Rust code migration notes
-- [Directory Reorganization](docs/migration/DIRECTORY-REORGANIZATION.md) - File reorganization details
+- [Directory Reorganization](docs/migration/DIRECTORY-REORGANIZATION.md) - File
+  reorganization details

## Ethics & Consent

docs/migration/INGEST-MIGRATION.md

Lines changed: 156 additions & 0 deletions
@@ -1,3 +1,4 @@
```markdown
# Ingest Module Migration

This document records the migration of the Rust ingest code from `rust_ingest/` to
@@ -39,3 +40,158 @@ This document records the migration of the Rust ingest code from `rust_ingest/`
2. The code is organized by domain rather than technology
3. Module boundaries are clearer in the new structure
4. Future functionality can be added to the `src` directory with consistent organization

## Shell Scripts Migration Plan

This section outlines the plan to migrate essential shell scripts to Rust. The goal is to replace critical bash scripts with more maintainable, performant, and type-safe Rust implementations.

### Migration Candidates (Prioritized)

1. **text_chunker.sh** - Text chunking utility
2. **ingest_chunked.sh** - Character-based chunking for Marvel character data
3. **test_embedding_size.sh** - Embedding size testing utility
4. **status.sh** - System status checking
5. **JSON tools** (format_json.sh, validate_json_schema.sh)

### Migration Strategy

#### Phase 1: Core Text Chunking Utilities

1. **Create text chunking module** in `src/utils/chunking.rs`
   - Implement character-based chunking
   - Implement size-based chunking
   - Implement semantic chunking
   - Support all functionality from text_chunker.sh

2. **API Design**:

```rust
pub enum ChunkingStrategy {
    Size(usize),      // max_size
    Character(usize), // target_size
    Semantic,         // natural breaks
}

pub struct ChunkerOptions {
    strategy: ChunkingStrategy,
    preserve_whitespace: bool,
    delimiter: Option<String>,
}

pub fn chunk_text(text: &str, options: ChunkerOptions) -> Vec<String>;
```
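
The signature above could be filled in roughly as follows. This is a minimal sketch, not the code from `src/utils/chunking.rs`: it assumes `Size` is a byte limit, `Character` is a count of `char`s, and `Semantic` splits on blank lines unless a `delimiter` is supplied. The helper `split_by_bytes` is hypothetical, and the fields are made `pub` only so the example can be called from outside the module.

```rust
/// Mirrors the enum from the API design; the unit of each size is an assumption.
#[derive(Debug, Clone, PartialEq)]
pub enum ChunkingStrategy {
    Size(usize),      // max_size, read here as a byte limit
    Character(usize), // target_size, read here as a count of `char`s
    Semantic,         // natural breaks, read here as delimiter-separated paragraphs
}

pub struct ChunkerOptions {
    pub strategy: ChunkingStrategy,
    pub preserve_whitespace: bool,
    pub delimiter: Option<String>,
}

pub fn chunk_text(text: &str, options: ChunkerOptions) -> Vec<String> {
    let raw: Vec<String> = match options.strategy {
        ChunkingStrategy::Size(max_size) => split_by_bytes(text, max_size.max(1)),
        ChunkingStrategy::Character(target_size) => {
            let chars: Vec<char> = text.chars().collect();
            chars
                .chunks(target_size.max(1))
                .map(|piece| piece.iter().collect::<String>())
                .collect()
        }
        ChunkingStrategy::Semantic => {
            // Assumption: a blank line marks a natural break unless a delimiter is given.
            let delimiter = options.delimiter.as_deref().unwrap_or("\n\n");
            text.split(delimiter).map(str::to_string).collect()
        }
    };

    // Trim each chunk unless whitespace must be preserved, then drop empty chunks.
    raw.into_iter()
        .map(|chunk| {
            if options.preserve_whitespace {
                chunk
            } else {
                chunk.trim().to_string()
            }
        })
        .filter(|chunk| !chunk.is_empty())
        .collect()
}

/// Hypothetical helper: splits on byte offsets without cutting a UTF-8 character.
fn split_by_bytes(text: &str, max: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < text.len() {
        let mut end = (start + max).min(text.len());
        // Back up to a character boundary so multi-byte characters stay whole.
        while end > start && !text.is_char_boundary(end) {
            end -= 1;
        }
        if end == start {
            // The limit is smaller than the next character; take that character whole.
            end = start + text[start..].chars().next().map_or(1, char::len_utf8);
        }
        chunks.push(text[start..end].to_string());
        start = end;
    }
    chunks
}
```

With `preserve_whitespace` set to `false`, whitespace-only chunks disappear entirely, which may or may not match the behavior of text_chunker.sh; that is worth checking against the shell script's output.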

#### Phase 2: Embed API Integration

1. **Create API module** in `src/core/embedding/ollama_api.rs`:
   - Implement functions for checking Ollama API status
   - Implement embedding generation with proper error handling
   - Support timeout and chunking for larger texts

2. **API Design**:

```rust
pub struct OllamaApiClient {
    base_url: String,
    timeout: Duration,
}

impl OllamaApiClient {
    pub fn new(base_url: &str) -> Self;
    pub fn check_status(&self) -> Result<bool>;
    pub fn generate_embedding(&self, text: &str, model: &str) -> Result<Vec<f32>>;
}
```
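
One way to make this client testable without a running server is to put the HTTP call behind a small trait. The sketch below uses only the standard library and rests on several assumptions: the `Transport` trait, the `CannedTransport` test double, the 30-second default timeout, and the hand-rolled `json_string` and `parse_embedding` helpers are all hypothetical and do not appear in the design above. It targets Ollama's `POST /api/embeddings` endpoint (request fields `model` and `prompt`, response field `embedding`); check the current Ollama API documentation before relying on it. A real implementation would more likely use an HTTP crate and a JSON library.

```rust
use std::time::Duration;

/// Hypothetical seam (not in the design above) so the client can be tested
/// without a running Ollama server; a real transport would wrap an HTTP library.
pub trait Transport {
    fn post_json(&self, url: &str, body: &str, timeout: Duration) -> Result<String, String>;
}

pub struct OllamaApiClient<T: Transport> {
    base_url: String,
    timeout: Duration,
    transport: T,
}

impl<T: Transport> OllamaApiClient<T> {
    pub fn new(base_url: &str, transport: T) -> Self {
        Self {
            base_url: base_url.trim_end_matches('/').to_string(),
            timeout: Duration::from_secs(30), // assumed default, not from the source
            transport,
        }
    }

    pub fn generate_embedding(&self, text: &str, model: &str) -> Result<Vec<f32>, String> {
        let url = format!("{}/api/embeddings", self.base_url);
        let body = format!(
            "{{\"model\":{},\"prompt\":{}}}",
            json_string(model),
            json_string(text)
        );
        let response = self.transport.post_json(&url, &body, self.timeout)?;
        parse_embedding(&response)
    }
}

/// Encodes `s` as a JSON string literal, escaping quotes and control characters.
pub fn json_string(s: &str) -> String {
    let mut out = String::from("\"");
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Pulls the numbers out of `"embedding": [...]`; a stand-in for a real JSON parser.
pub fn parse_embedding(response: &str) -> Result<Vec<f32>, String> {
    let key = response.find("\"embedding\"").ok_or("missing `embedding` field")?;
    let open = response[key..].find('[').ok_or("missing `[` after `embedding`")? + key;
    let close = response[open..].find(']').ok_or("missing closing `]`")? + open;
    response[open + 1..close]
        .split(',')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(|s| s.parse::<f32>().map_err(|e| format!("bad number `{s}`: {e}")))
        .collect()
}

/// Test double that ignores the request and returns a canned response body.
pub struct CannedTransport(pub String);

impl Transport for CannedTransport {
    fn post_json(&self, _url: &str, _body: &str, _timeout: Duration) -> Result<String, String> {
        Ok(self.0.clone())
    }
}
```

The generic parameter changes the shape of `new` compared with the design; if that is undesirable, the same seam could be hidden behind a boxed trait object or a feature-gated mock.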

#### Phase 3: Ingest Chunked Implementation

1. **Extend existing ingest module**:
   - Add character-based chunking to `src/ingest/ingest.rs`
   - Support for JSON field extraction and processing
   - Progress logging and status reporting

2. **Command-line interface extensions**:

```rust
// CLI Options for chunked ingestion
pub struct ChunkedIngestOptions {
    source_file: PathBuf,
    chunk_size: usize,
    model: String,
    log_file: Option<PathBuf>,
}
```
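
To show how these options might drive the Phase 3 steps, here is a rough sketch of a chunked ingest loop with progress logging. Everything beyond the `ChunkedIngestOptions` fields is an assumption for illustration: `ingest_fields`, `chunk_chars`, the `name#index` chunk identifiers, and the log line format are hypothetical, the fields are assumed to have been extracted from JSON already, and the embedding call is passed in as a closure so the example stays independent of any HTTP client.

```rust
use std::io::Write;
use std::path::PathBuf;

// CLI options for chunked ingestion (fields made `pub` for the example).
pub struct ChunkedIngestOptions {
    pub source_file: PathBuf,
    pub chunk_size: usize,
    pub model: String,
    pub log_file: Option<PathBuf>,
}

/// Splits `text` into pieces of at most `chunk_size` characters.
fn chunk_chars(text: &str, chunk_size: usize) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    chars
        .chunks(chunk_size.max(1))
        .map(|piece| piece.iter().collect::<String>())
        .collect()
}

/// Hypothetical driver: chunks each (name, text) field, embeds every chunk
/// through `embed(chunk, model)`, and writes progress lines to `log`.
/// Failed chunks are logged and skipped rather than aborting the run.
pub fn ingest_fields<E, W>(
    fields: &[(&str, &str)],
    options: &ChunkedIngestOptions,
    mut embed: E,
    log: &mut W,
) -> std::io::Result<Vec<(String, Vec<f32>)>>
where
    E: FnMut(&str, &str) -> Result<Vec<f32>, String>,
    W: Write,
{
    let mut embedded = Vec::new();
    for (i, &(name, text)) in fields.iter().enumerate() {
        let chunks = chunk_chars(text, options.chunk_size);
        writeln!(log, "[{}/{}] {}: {} chunk(s)", i + 1, fields.len(), name, chunks.len())?;
        for (j, chunk) in chunks.iter().enumerate() {
            match embed(chunk.as_str(), options.model.as_str()) {
                Ok(vector) => embedded.push((format!("{name}#{j}"), vector)),
                Err(error) => writeln!(log, "  chunk {j} failed: {error}")?,
            }
        }
    }
    Ok(embedded)
}
```

Writing to a generic `Write` sink means the same function can log to the file named in `log_file`, to standard error, or to an in-memory buffer in tests.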

#### Phase 4: CLI Enhancements

1. **Unified CLI interface** in `src/main.rs`:

```rust
fn main() {
    let matches = clap::Command::new("harald")
        .subcommand(
            clap::Command::new("chunk")
                .about("Chunk text using various strategies")
                // options
        )
        .subcommand(
            clap::Command::new("ingest")
                .about("Ingest data into the vector database")
                // options
        )
        // other subcommands
        .get_matches();

    // handle commands
}
```

### Implementation Roadmap

1. **Week 1**: Implement text chunking module
   - ✅ Create chunking module in src/utils/chunking.rs
   - ✅ Create binary wrapper in src/utils/chunker_bin.rs
   - ✅ Update build configuration in Cargo.toml
   - ✅ Create compatibility wrapper scripts

2. **Week 2**: Implement Ollama API client
   - ✅ Create Ollama API client module in src/core/embedding/ollama_api.rs
   - ⏳ Implement chunking-aware embedding generation
   - ⏳ Add proper error handling and logging
   - ⏳ Create test cases for API client

3. **Week 3**: Extend ingest module with chunked ingestion
   - ⏳ Integrate text chunking with ingest process
   - ⏳ Implement character-based chunking for large fields
   - ⏳ Support semantic chunking for description fields
   - ⏳ Add progress reporting and better error messages

4. **Week 4**: Create unified CLI and compatibility wrappers
   - ⏳ Design comprehensive CLI interface
   - ⏳ Implement subcommands (ingest, query, chunk, etc.)
   - ⏳ Create compatibility wrappers for all scripts
   - ⏳ Update documentation

### Current Status (July 21, 2025)

- Successfully migrated text_chunker.sh to Rust
- Created a compatibility wrapper to maintain script interface
- Implemented both character-based and semantic chunking strategies
- Started work on the Ollama API client module
- Compiled and tested the text_chunker binary successfully

### Testing Strategy

1. Create unit tests for each component
2. Create integration tests that compare output with existing shell scripts
3. Benchmark performance against shell script implementations
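
For the comparison tests in item 2, a small helper that reports the first differing line can make failures easier to read than a bare equality check. The sketch below is hypothetical: the name `first_mismatch`, the choice to ignore trailing whitespace, and the `<missing>` placeholder are all assumptions, and capturing the two outputs (for example with `std::process::Command`) is left to the test that calls it.

```rust
/// Compares two program outputs line by line, ignoring trailing whitespace.
/// Returns the 1-based line number and both lines at the first difference,
/// or `None` when the outputs match. A line absent from one side is
/// reported as "<missing>".
pub fn first_mismatch(expected: &str, actual: &str) -> Option<(usize, String, String)> {
    let mut expected_lines = expected.lines();
    let mut actual_lines = actual.lines();
    let mut line_no = 0;
    loop {
        line_no += 1;
        match (expected_lines.next(), actual_lines.next()) {
            (None, None) => return None,
            (e, a) => {
                if e.map(str::trim_end) != a.map(str::trim_end) {
                    return Some((
                        line_no,
                        e.unwrap_or("<missing>").to_string(),
                        a.unwrap_or("<missing>").to_string(),
                    ));
                }
            }
        }
    }
}
```

An integration test could run the shell script and the Rust binary on the same input, pass both outputs to this helper, and assert that the result is `None`.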

### Compatibility Considerations

During the transition period:

1. Maintain shell script wrappers that call the Rust implementations
2. Ensure consistent output formats and logging
3. Document migration details for users

```
