Commit a016d4c

Refactor embedding size testing and migrate shell scripts to Rust
- Replaced the original shell script for embedding size testing with a Rust implementation in `src/core/embedding/embedding_bin.rs`.
- Added a new wrapper script `embedding_tool.sh` for embedding generation functionality using the Ollama API.
- Updated `test_embedding_size.sh` to ensure it calls the new Rust binary and handles command-line arguments for size testing.
- Implemented logging for embedding size tests and improved error handling.
- Created a verification script `verify_rust_migration.sh` to compare outputs of legacy shell scripts with new Rust implementations.
- Added test data file for migration verification and updated documentation regarding migration status and cleanup plans.
- Ensured Rust binaries are built before execution and added checks for the Ollama API status.
1 parent 990481d commit a016d4c

22 files changed (+1207, -131 lines changed)

README.md

Lines changed: 9 additions & 5 deletions
Original file line number | Diff line number | Diff line change
@@ -73,16 +73,20 @@ pragmatic execution, and narrative continuity.
7373
- Personality Models
7474
- Workflows
7575
- [JSONL Format for Vector Embedding](docs/vector-search/jsonl-ingestion.md)
76-
- [Directory Structure](docs/migration/RECOMMENDED-STRUCTURE.md) - Organization standards
77-
- [Implementation Plan](docs/migration/IMPLEMENTATION-PLAN.md) - Migration strategy
78-
- [Ingest Migration](docs/migration/INGEST-MIGRATION.md) - Rust code migration notes
76+
- [Directory Structure](docs/migration/RECOMMENDED-STRUCTURE.md) - Organization
77+
standards
78+
- [Implementation Plan](docs/migration/IMPLEMENTATION-PLAN.md) - Migration
79+
strategy
80+
- [Ingest Migration](docs/migration/INGEST-MIGRATION.md) - Rust code migration
81+
notes
7982
- [Directory Reorganization](docs/migration/DIRECTORY-REORGANIZATION.md) - File
8083
reorganization details
8184

8285
## Ethics & Consent
8386

8487
HeraldStack operates on consent-based principles and follows clear ethical
85-
guidelines including those defined in [LawsOfRobotics.json](config/ethics/LawsOfRobotics.json).
88+
guidelines including those defined in
89+
[LawsOfRobotics.json](config/ethics/LawsOfRobotics.json).
8690

8791
## Development Tools
8892

@@ -98,4 +102,4 @@ See docs/DETAILED.md for more information.
98102

99103
---
100104

101-
© 2025 Bryan Chasko
105+
Shared under MIT Open License 2025 Bryan Chasko

docs/migration/DIRECTORY-REORGANIZATION.md

Lines changed: 11 additions & 11 deletions
@@ -5,17 +5,17 @@ HARALD project directory structure.
55

66
## Files Moved
77

8-
| File | Original Location | New Location |
9-
|------|------------------|-------------|
10-
| `LawsOfRobotics.json` | Root | `config/ethics/` |
11-
| `Modelfile` | Root | `config/models/` |
12-
| `rustfmt.toml` | Root | `src/` |
13-
| `test_single_character.json` | Root | `tests/fixtures/` |
14-
| `GITHUB.md` | Root | `docs/` |
15-
| `IMPLEMENTATION-PLAN.md` | Root | `docs/migration/` |
16-
| `RECOMMENDED-STRUCTURE.md` | Root | `docs/migration/` |
17-
| `SCRIPT-MIGRATION.md` | Root | `docs/migration/` |
18-
| `INGEST-MIGRATION.md` | Root | `docs/migration/` |
8+
| File | Original Location | New Location |
9+
| ---------------------------- | ----------------- | ----------------- |
10+
| `LawsOfRobotics.json` | Root | `config/ethics/` |
11+
| `Modelfile` | Root | `config/models/` |
12+
| `rustfmt.toml` | Root | `src/` |
13+
| `test_single_character.json` | Root | `tests/fixtures/` |
14+
| `GITHUB.md` | Root | `docs/` |
15+
| `IMPLEMENTATION-PLAN.md` | Root | `docs/migration/` |
16+
| `RECOMMENDED-STRUCTURE.md` | Root | `docs/migration/` |
17+
| `SCRIPT-MIGRATION.md` | Root | `docs/migration/` |
18+
| `INGEST-MIGRATION.md` | Root | `docs/migration/` |
1919

2020
## README Updates
2121

docs/migration/INGEST-MIGRATION.md

Lines changed: 60 additions & 10 deletions
@@ -1,8 +1,8 @@
1-
```markdown
1+
````markdown
22
# Ingest Module Migration
33

4-
This document records the migration of the Rust ingest code from `rust_ingest/` to
5-
`src/ingest/`.
4+
This document records the migration of the Rust ingest code from `rust_ingest/`
5+
to `src/ingest/`.
66

77
## Migration Steps Completed
88

@@ -24,6 +24,7 @@ This document records the migration of the Rust ingest code from `rust_ingest/`
2424
```bash
2525
./scripts/dev/build-ingest.sh --test
2626
```
27+
````
2728

2829
2. Update any scripts that referenced the old `rust_ingest` directory
2930

@@ -39,11 +40,14 @@ This document records the migration of the Rust ingest code from `rust_ingest/`
3940
1. All source code is now in the `src` directory, following standard conventions
4041
2. The code is organized by domain rather than technology
4142
3. Module boundaries are clearer in the new structure
42-
4. Future functionality can be added to the `src` directory with consistent organization
43+
4. Future functionality can be added to the `src` directory with consistent
44+
organization
4345

4446
## Shell Scripts Migration Plan
4547

46-
This section outlines the plan to migrate essential shell scripts to Rust. The goal is to replace critical bash scripts with more maintainable, performant, and type-safe Rust implementations.
48+
This section outlines the plan to migrate essential shell scripts to Rust. The
49+
goal is to replace critical bash scripts with more maintainable, performant, and
50+
type-safe Rust implementations.
4751

4852
### Migration Candidates (Prioritized)
4953

@@ -141,7 +145,7 @@ This section outlines the plan to migrate essential shell scripts to Rust. The g
141145
)
142146
// other subcommands
143147
.get_matches();
144-
148+
145149
// handle commands
146150
}
147151
```
@@ -172,13 +176,57 @@ This section outlines the plan to migrate essential shell scripts to Rust. The g
172176
- ⏳ Create compatibility wrappers for all scripts
173177
- ⏳ Update documentation
174178

175-
### Current Status (July 21, 2025)
179+
### Current Status (Updated)
180+
181+
**Successfully migrated text_chunker.sh to Rust**
176182

177-
- Successfully migrated text_chunker.sh to Rust
178183
- Created a compatibility wrapper to maintain script interface
179-
- Implemented both character-based and semantic chunking strategies
180-
- Started work on the Ollama API client module
184+
- Implemented character-based, size-based, and semantic chunking strategies
181185
- Compiled and tested the text_chunker binary successfully
186+
- Original script backed up as `text_chunker.sh.legacy`
187+
- New implementation at `src/utils/chunking.rs` and `src/utils/chunker_bin.rs`
188+
189+
**Successfully migrated test_embedding_size.sh to Rust**
190+
191+
- Implemented as part of a comprehensive embedding_tool CLI
192+
- Added test-sizes command with flexible configuration
193+
- Created detailed logging and reporting functionality
194+
- Original script backed up as `test_embedding_size.sh.legacy`
195+
- New implementation at `src/core/embedding/embedding_bin.rs`
196+
197+
**Created a robust Ollama API client module**
198+
199+
- Implemented check_status functionality
200+
- Added embedding generation with timeout handling
201+
- Added support for chunked embeddings for long text
202+
- Implemented proper error handling and reporting
203+
- New implementation at `src/core/embedding/ollama_api.rs`
204+
205+
**Created wrapper scripts for backwards compatibility**
206+
207+
- `text_chunker.sh` - Now a wrapper around the Rust implementation
208+
- `test_embedding_size.sh` - Now a wrapper around the Rust implementation
209+
- Automatic Rust binary rebuilding when source changes
210+
- Error handling and fallback mechanisms
211+
212+
📝 **Created cleanup documentation**
213+
214+
- Migration tracking document at `docs/migration/SCRIPT-CLEANUP-PLAN.md`
215+
- Implementation timeline and roadmap
216+
- Testing and verification strategies
217+
218+
### Scripts Pending Migration
219+
220+
The following scripts are still pending migration to Rust:
221+
222+
1. 🔄 `ingest_chunked.sh` - Character-based chunking for data ingestion (High
223+
Priority)
224+
2. 🔄 `ingest_marvelai.sh` - Marvel AI data ingestion (Medium Priority)
225+
3. 🔄 `test_basic_embedding.sh` - Basic embedding testing (Medium Priority)
226+
4. 🔄 `ingest.sh` - Main ingestion script (High Priority)
227+
5. 🔄 `test_text_chunker.sh` - Tests for text chunking (Low Priority)
228+
6. 🔄 `ingest_single_character.sh` - Single character ingestion (Medium
229+
Priority)
182230

183231
### Testing Strategy
184232

@@ -195,3 +243,5 @@ During the transition period:
195243
3. Document migration details for users
196244

197245
```
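
Review note: the Current Status section above lists a character-based chunking strategy but the diff does not show its code. The following is a minimal sketch of what such a strategy could look like, not the contents of `src/utils/chunking.rs`. The function name `chunk_by_chars`, its `max_chars` and `overlap` parameters, and the overlap behavior itself are assumptions made for illustration.

```rust
/// Splits `text` into chunks of at most `max_chars` characters.
/// Consecutive chunks share `overlap` characters (hypothetical behavior,
/// not confirmed by the commit). Works on `char`s, so multi-byte UTF-8
/// characters are never split in the middle.
fn chunk_by_chars(text: &str, max_chars: usize, overlap: usize) -> Vec<String> {
    assert!(max_chars > 0, "max_chars must be positive");
    assert!(overlap < max_chars, "overlap must be smaller than max_chars");

    let chars: Vec<char> = text.chars().collect();
    let step = max_chars - overlap;
    let mut chunks: Vec<String> = Vec::new();
    let mut start = 0;

    while start < chars.len() {
        let end = (start + max_chars).min(chars.len());
        chunks.push(chars[start..end].iter().collect::<String>());
        if end == chars.len() {
            // The final chunk reached the end of the input.
            break;
        }
        start += step;
    }
    chunks
}
```

For example, `chunk_by_chars("abcdefghij", 4, 1)` yields `abcd`, `defg`, `ghij`. Collecting into a `Vec<char>` first keeps the indexing simple at the cost of one extra allocation, which seems acceptable for a sketch.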
246+
247+
```
Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
1+
# Shell Script to Rust Migration Verification
2+
3+
This document outlines the process for verifying that our Rust implementations
4+
correctly match the behavior of the original shell scripts.
5+
6+
## Overview
7+
8+
As part of our effort to migrate shell scripts to Rust, we need to ensure
9+
that the new implementations maintain the same behavior and output as the
10+
original scripts. This document explains how to use the verification tools
11+
to test the migration.
12+
13+
## Verification Process
14+
15+
To verify the migration from shell scripts to Rust, use the verification script:
16+
17+
```bash
18+
# Run all available tests
19+
./scripts/verify_rust_migration.sh
20+
21+
# Clean previous test outputs before running
22+
./scripts/verify_rust_migration.sh --clean
23+
```
24+
25+
The verification script compares:
26+
27+
- Outputs from original shell scripts (`.legacy` files)
28+
- Against the new Rust implementations
29+
30+
to ensure they behave identically.
31+
32+
## Test Data and Outputs
33+
34+
- Test input data is stored in the `tests/data/` directory
35+
- Test outputs are stored in the `tests/output/` directory
36+
- The verification script automatically creates test data if needed
37+
38+
## Current Migration Status
39+
40+
The following shell scripts have been migrated to Rust:
41+
42+
- `text_chunker.sh` → `src/utils/chunking.rs`
43+
- `test_embedding_size.sh` → `src/core/embedding/embedding_bin.rs`
44+
45+
See the full migration status in `docs/migration/INGEST-MIGRATION.md` and the
46+
cleanup plan in `docs/migration/SCRIPT-CLEANUP-PLAN.md`.
47+
48+
## Extending the Verification Process
49+
50+
To add tests for newly migrated scripts:
51+
52+
1. Update the `verify_rust_migration.sh` script to include the new tests
53+
2. Add appropriate test data to `tests/data/` if needed
54+
3. Run the verification to ensure the new implementation matches the original
55+
56+
## Troubleshooting
57+
58+
If verification tests fail with output differences:
59+
60+
1. Check the output files in `tests/output/` to see the differences
61+
2. Look for whitespace, line ending, or formatting differences
62+
3. Ensure the Rust implementation handles edge cases properly
63+
4. Update the implementation as needed to match the original behavior
64+
65+
## Best Practices
66+
67+
- Always run verification tests before removing original scripts
68+
- Keep legacy scripts until verification is complete and successful
69+
- When making changes to Rust implementations, re-verify against legacy scripts
70+
- Document any intentional behavioral differences between implementations
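
Review note: the Troubleshooting steps above point at whitespace and line-ending differences as a common cause of verification failures. The sketch below shows one way to compare a legacy script's output with the Rust output after normalizing those differences. It is an illustration only: `verify_rust_migration.sh` is a shell script whose contents are not shown in this diff, and the names `normalize` and `first_difference` are hypothetical.

```rust
/// Normalizes output for comparison: strips trailing whitespace
/// (including a carriage return) from every line and drops trailing
/// blank lines.
fn normalize(output: &str) -> String {
    output
        .lines()
        .map(|line| line.trim_end())
        .collect::<Vec<_>>()
        .join("\n")
        .trim_end()
        .to_string()
}

/// Returns the 1-based number of the first line at which the two
/// normalized outputs differ, or `None` if they match.
fn first_difference(legacy: &str, rust: &str) -> Option<usize> {
    let a = normalize(legacy);
    let b = normalize(rust);
    if a == b {
        return None;
    }
    let mut legacy_lines = a.lines();
    let mut rust_lines = b.lines();
    let mut line_no = 1;
    loop {
        match (legacy_lines.next(), rust_lines.next()) {
            (Some(x), Some(y)) if x == y => line_no += 1,
            // Either the lines differ or one output ended early.
            _ => return Some(line_no),
        }
    }
}
```

Reporting only the first differing line keeps the output short; a real verification run would probably also want to show both lines or write a full diff to `tests/output/`.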

docs/migration/README.md

Lines changed: 5 additions & 5 deletions
@@ -1,7 +1,7 @@
11
# Migration Documentation
22

3-
This directory contains documentation related to the project restructuring
4-
and migration efforts.
3+
This directory contains documentation related to the project restructuring and
4+
migration efforts.
55

66
## Files
77

@@ -12,6 +12,6 @@ and migration efforts.
1212

1313
## Purpose
1414

15-
These documents track the organizational improvements made to the HARALD
16-
project structure. They provide context on decisions made during the
17-
migration process and serve as a reference for future development.
15+
These documents track the organizational improvements made to the HARALD project
16+
structure. They provide context on decisions made during the migration process
17+
and serve as a reference for future development.

docs/migration/RECOMMENDED-STRUCTURE.md

Lines changed: 3 additions & 3 deletions
@@ -1,7 +1,7 @@
11
# Recommended Project Structure
22

3-
This document outlines recommended organizational improvements for the
4-
HARALD project based on developer best practices.
3+
This document outlines recommended organizational improvements for the HARALD
4+
project based on developer best practices.
55

66
## Current vs Future Structure
77

@@ -28,7 +28,7 @@ HARALD/
2828
│ ├── api/ # API endpoints and handlers
2929
│ ├── core/ # Core application logic
3030
│ │ ├── embedding/ # Embedding-related logic
31-
│ │ ├── entities/ # Entity management logic
31+
│ │ ├── entities/ # Entity management logic
3232
│ │ └── memory/ # Memory handling logic
3333
│ ├── ingest/ # Ingestion pipeline (from rust_ingest)
3434
│ └── utils/ # Shared utilities and helpers
Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@
1+
# Script Migration Cleanup Plan
2+
3+
This document outlines the plan for cleaning up the migration of shell scripts
4+
to Rust once we're confident the implementations work correctly.
5+
6+
## Current Status
7+
8+
We've successfully migrated the following scripts to Rust:
9+
10+
- `text_chunker.sh` → Now a wrapper around `src/target/release/text_chunker`
11+
- `test_embedding_size.sh` → Now a wrapper around
12+
`src/target/release/embedding_tool`
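
Review note: INGEST-MIGRATION.md states that the wrappers rebuild the Rust binary automatically when the source changes, but this diff does not show how a change is detected. One plausible approach is to compare modification times, sketched here in Rust for consistency with the migrated code. The names `is_stale` and `needs_rebuild` are hypothetical, and checking a single source file is a simplification.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::time::SystemTime;

/// A binary is stale if it does not exist yet (`None`) or if the
/// source was modified after the binary was built.
fn is_stale(source_mtime: SystemTime, binary_mtime: Option<SystemTime>) -> bool {
    match binary_mtime {
        None => true,
        Some(built) => source_mtime > built,
    }
}

/// Reads the modification times from disk and applies `is_stale`.
/// A missing or unreadable binary is treated as "needs rebuild";
/// a missing source file is reported as an error.
fn needs_rebuild(source: &Path, binary: &Path) -> io::Result<bool> {
    let source_mtime = fs::metadata(source)?.modified()?;
    let binary_mtime = fs::metadata(binary).and_then(|m| m.modified()).ok();
    Ok(is_stale(source_mtime, binary_mtime))
}
```

In practice `cargo build` already performs its own freshness check, so a wrapper could also simply invoke it unconditionally and let Cargo decide whether any work is needed.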
13+
14+
The original scripts have been backed up as:
15+
16+
- `text_chunker.sh.legacy`
17+
- `test_embedding_size.sh.legacy`
18+
19+
## Pending Migrations
20+
21+
The following scripts still need to be migrated:
22+
23+
1. `ingest_chunked.sh` → Migrate to `src/ingest/chunked_ingest.rs`
24+
2. `ingest_marvelai.sh` → Integrate into unified ingest tool
25+
3. `test_basic_embedding.sh` → Replace with embedding_tool commands
26+
4. `ingest.sh` → Migrate to core ingest functionality
27+
5. `test_text_chunker.sh` → Replace with proper unit tests
28+
6. `ingest_single_character.sh` → Integrate into unified ingest tool
29+
30+
## Cleanup Steps
31+
32+
Once we're confident that the Rust implementations work correctly, we should:
33+
34+
### 1. Update Documentation
35+
36+
- Document the new Rust-based workflow
37+
- Update README files to point to the new tools
38+
- Add examples of using the new tools
39+
40+
### 2. Remove Backup Scripts
41+
42+
- Delete `.legacy` backup files
43+
- Ensure all references to the original scripts are updated
44+
45+
### 3. Simplify Wrapper Scripts
46+
47+
- Remove temporary migration code
48+
- Simplify error handling
49+
- Update comments to reflect permanent status
50+
51+
## Implementation Schedule
52+
53+
### Phase 1 (Current)
54+
55+
- Complete migrations of text chunking and embedding size testing
56+
- Document the migration process
57+
- Test the new Rust implementations
58+
59+
### Phase 2 (Next 2 weeks)
60+
61+
- Migrate ingest-related scripts to Rust
62+
- Create unified ingest CLI tool
63+
- Update documentation and examples
64+
65+
### Phase 3 (Final)
66+
67+
- Migrate remaining test scripts
68+
- Remove backup files
69+
- Complete final documentation
70+
71+
## Success Criteria
72+
73+
By the end of this process, we should have:
74+
75+
1. A unified Rust-based CLI for all operations
76+
2. Minimal shell script wrappers where needed for backward compatibility
77+
3. Complete documentation of the new tools
78+
4. Comprehensive test suite for all functionality
79+
5. No legacy scripts in active use
80+
81+
## Remaining Migrations
82+
83+
The following scripts still need to be migrated:
84+
85+
1. `ingest_chunked.sh` → Migrate to `src/ingest/chunked_ingest.rs`
86+
2. `ingest_marvelai.sh` → Integrate into unified ingest tool
87+
3. `test_basic_embedding.sh` → Replace with embedding_tool commands
88+
4. `ingest.sh` → Migrate to core ingest functionality
89+
5. `test_text_chunker.sh` → Replace with proper unit tests
90+
6. `ingest_single_character.sh` → Integrate into unified ingest tool
91+
92+
## Timeline
93+
94+
- **Week 3**: Complete migrations of ingest-related scripts
95+
- **Week 4**: Complete migrations of test-related scripts
96+
- **Week 5**: Cleanup and finalize documentation
97+
98+
## Final Deliverable
99+
100+
By the end of this process, we should have:
101+
102+
1. A unified Rust-based CLI for all operations
103+
2. Minimal shell script wrappers where needed for backward compatibility
104+
3. Complete documentation of the new tools
105+
4. Comprehensive test suite for all functionality
106+
5. No legacy scripts in active use

docs/migration/SCRIPT-MIGRATION.md

Lines changed: 2 additions & 1 deletion
@@ -1,6 +1,7 @@
11
# Script Migration Plan
22

3-
This document outlines how to migrate existing scripts to the new directory structure.
3+
This document outlines how to migrate existing scripts to the new directory
4+
structure.
45

56
## Development Scripts
67
