Commit dfd9fc3

Refactor ingestion pipeline: modularize chunking, enhance error handling, and improve API interaction
- Removed deprecated `embed` function and `create_config` from `embed.rs`.
- Introduced `chunking_utils.rs` for shared chunking logic, ensuring fields are chunked to ≤250 characters.
- Enhanced `marvelai_ingest.rs` with configurable retry and delay for embedding API calls, added debug output, and improved error messages.
- Updated `single_character_ingest.rs` to streamline processing of the 'Vision' character, including chunking and embedding with detailed logging.
- Created `ingest_utils.rs` for JSONL validation and error handling.
- Added a workspace structure in `Cargo.toml` for better project organization.
- Included a sample `vision.jsonl` file for testing and validation.
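
The ≤250-character chunking rule mentioned above can be illustrated with a minimal sketch. This is not the code in `chunking_utils.rs`, which is not shown on this page; the function name `chunk_text` and the whitespace-first splitting strategy are assumptions made for illustration only.

```rust
/// Hypothetical helper (not taken from `chunking_utils.rs`): splits `text`
/// into chunks of at most `max_chars` characters, breaking at whitespace
/// where possible. Length is counted in Unicode scalar values, not bytes.
fn chunk_text(text: &str, max_chars: usize) -> Vec<String> {
    assert!(max_chars > 0, "max_chars must be positive");
    let mut chunks: Vec<String> = Vec::new();
    let mut current = String::new();
    let mut current_len = 0usize;

    for word in text.split_whitespace() {
        let word_len = word.chars().count();

        // A single word longer than the limit is hard-split into pieces.
        if word_len > max_chars {
            if !current.is_empty() {
                chunks.push(std::mem::take(&mut current));
                current_len = 0;
            }
            let chars: Vec<char> = word.chars().collect();
            for piece in chars.chunks(max_chars) {
                chunks.push(piece.iter().collect());
            }
            continue;
        }

        // Length of the current chunk if this word (plus a space) is appended.
        let needed = if current.is_empty() {
            word_len
        } else {
            current_len + 1 + word_len
        };

        // Flush the current chunk when the word would overflow it.
        if needed > max_chars {
            chunks.push(std::mem::take(&mut current));
            current_len = 0;
        }
        if !current.is_empty() {
            current.push(' ');
            current_len += 1;
        }
        current.push_str(word);
        current_len += word_len;
    }

    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

Calling `chunk_text(field, 250)` on each text field would yield pieces that respect the limit. Note that this sketch normalizes runs of whitespace to single spaces, which the real implementation may or may not do.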
1 parent 0cfbc0c commit dfd9fc3

10 files changed: +669 -487 lines changed

Cargo.toml

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+[workspace]
+members = [
+"src",
+"rust_ingest"
+]
+resolver = "2"

rust_ingest/Cargo.toml

Lines changed: 6 additions & 0 deletions
@@ -7,10 +7,15 @@ edition = "2021"
 name = "rust_ingest"
 path = "src/main.rs"

+[[bin]]
+name = "marvelai_ingest"
+path = "../src/ingest/marvelai_ingest.rs"
+
 [lib]
 name = "rust_ingest"
 path = "src/lib.rs"

+
 [dependencies]
 reqwest = { version = "0.12", features = ["json", "stream"] }
 tokio = { version = "1", features = ["rt-multi-thread", "macros"] }
@@ -22,6 +27,7 @@ walkdir = "2"
 clap = { version = "4", features = ["derive"] }
 anyhow = "1"
 bincode = "1.3"
+tempfile = "3.6.0"

 [features]
 integration-tests = []

src/ingest/INGEST.md

Lines changed: 51 additions & 1 deletion
@@ -4,6 +4,35 @@ This directory contains the Rust implementation of the HARALD ingestion
 pipeline, migrated from the previous `rust_ingest` directory as part of our
 organizational improvements.

+## Important Notes & Best Practices
+
+- **Chunking Limits:** All text fields are chunked to ≤250 characters for
+reliable embedding. This is enforced in the code and recommended for API
+stability and resource management.
+- **Error Handling:** The pipeline includes robust error handling and
+diagnostics. Common error messages (e.g., "No chunks were processed") are
+documented below.
+- **Debug Output:** Enable debug output to see chunk sizes and counts. This
+helps diagnose chunking and embedding issues.
+- **CLI Features:** Some binaries require specific features (e.g.,
+`--features="cli"`). Always check the documentation for required flags.
+- **Input/Output Formats:** Ingestion expects JSONL input and outputs vector
+data to the `data/` directory. Validate your input and output files for
+correct structure.
+
+## Troubleshooting
+
+- **No chunks were processed:** This message means the input file was empty or
+incorrectly formatted. Check your JSONL file and ensure it contains valid
+character objects.
+- **Network/API errors:** If embedding fails, check Ollama API status and logs.
+Retry logic is built-in for transient failures.
+- **Debugging chunking:** Use the debug output to verify chunk sizes and counts.
+If chunking does not match expectations, review the chunking logic and input
+data.
+
+## Key Components
+
 ## Key Components

 - `main.rs` - CLI entry point for ingest and query operations
@@ -20,13 +49,25 @@ From the `src` directory:
 # Build the project
 cargo build

-# Run ingestion
+# Run ingestion (with CLI features if required)
+cargo run --bin single_character_ingest --features="cli"
+
+# Run main ingest
 cargo run -- ingest

 # Run query
 cargo run -- query "your query here"
 ```

+## Debugging & Validation
+
+- To enable debug output for chunking, run the test harness or ingest binary and
+review the printed chunk sizes/counts.
+- Always validate your JSONL input before running ingestion. Use `jq` or similar
+tools to check structure.
+
+## Migration Notes
+
 ## Migration Notes

 This code was migrated from the original `rust_ingest` directory as part of the
@@ -38,3 +79,12 @@ updated module structure that better fits our overall architecture.
 - Outputs vector data to the `data/` directory
 - Reads from project directories based on configuration
 - Provides both a library API and command-line interface
+- Expects JSONL input for ingestion; output files are validated and summarized
+in logs
+
+## Common Pitfalls
+
+- Forgetting to enable required CLI features (e.g., `--features="cli"`)
+- Incorrectly formatted JSONL input (empty or missing fields)
+- Not keeping chunk sizes ≤250 characters, leading to API errors
+- Overlooking debug output for diagnostics
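
The commit message and `INGEST.md` both mention built-in retry with a configurable delay for transient embedding failures. The actual logic in `marvelai_ingest.rs` is not shown in this diff, so the following is only a minimal sketch of the general pattern. The name `retry_with_delay` and its parameters are hypothetical, and the sketch is synchronous and uses only the standard library, whereas the real pipeline depends on `tokio` and `reqwest` and is presumably async.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical helper (not taken from `marvelai_ingest.rs`): runs `op` up to
/// `max_attempts` times, sleeping for `delay` between failed attempts.
/// `op` receives the 1-based attempt number so callers can log it.
fn retry_with_delay<T, E, F>(max_attempts: u32, delay: Duration, mut op: F) -> Result<T, E>
where
    F: FnMut(u32) -> Result<T, E>,
{
    assert!(max_attempts > 0, "max_attempts must be positive");
    let mut attempt = 1;
    loop {
        match op(attempt) {
            // Success: return immediately.
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(err) if attempt >= max_attempts => return Err(err),
            // Otherwise wait, then try again.
            Err(_) => {
                sleep(delay);
                attempt += 1;
            }
        }
    }
}
```

A caller would wrap the embedding request in the closure, for example `retry_with_delay(3, Duration::from_secs(2), |n| send_embedding_request(n))`, where `send_embedding_request` stands in for whatever function performs the API call. A fixed delay keeps the sketch simple; exponential backoff is a common alternative when the API is rate-limited.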
