Commit a738407

docs: enhance README and add GitHub workflow documentation
1 parent f450409 commit a738407

5 files changed

+195
-40
lines changed


GITHUB.md

Whitespace-only changes.

docs/GITHUB.md

Lines changed: 138 additions & 0 deletions
@@ -0,0 +1,138 @@
1+
# GitHub Setup & Workflow
2+
3+
This document outlines how the HeraldStack project uses GitHub for development,
4+
issue tracking, and project management.
5+
6+
<!-- filepath: docs/GITHUB.md -->
7+
8+
## Labels System
9+
10+
We use a color-blind-friendly labeling system to categorize
11+
issues and pull requests:
12+
13+
### Core Technical Areas
14+
15+
| Label | Description | Color |
16+
|-------|-------------|-------|
17+
| `rust` | Rust codebase implementation | #0052CC |
18+
| `ingest` | File ingestion and indexing pipeline | #006644 |
19+
| `query` | Search and retrieval functionality | #5319E7 |
20+
| `embed` | Vector embedding generation and processing | #E05D44 |
21+
| `memory` | Memory storage and retrieval systems | #F9A03F |
22+
23+
### Issue Types
24+
25+
| Label | Description | Color |
26+
|-------|-------------|-------|
27+
| `bug` | Functionality issues requiring fixes | #D93F0B |
28+
| `enhancement` | New features and improvements | #0E8A16 |
29+
| `refactor` | Code restructuring without behavior change | #1D76DB |
30+
| `documentation` | Documentation updates and improvements | #FFC01F |
31+
| `testing` | Test coverage and infrastructure | #8250DF |
32+
33+
### Priority & Status
34+
35+
| Label | Description | Color |
36+
|-------|-------------|-------|
37+
| `critical` | Requires immediate attention | #B60205 |
38+
| `high` | High priority for current sprint | #D93F0B |
39+
| `medium` | Standard priority task | #FBCA04 |
40+
| `low` | Nice to have, not time-sensitive | #C5DEF5 |
41+
| `in-progress` | Actively being worked on | #0E8A16 |
42+
| `blocked` | Waiting on dependencies or decisions | #D876E3 |
43+
44+
### Architecture Components
45+
46+
| Label | Description | Color |
47+
|-------|-------------|-------|
48+
| `entity-system` | Entity framework and routing | #6F42C1 (Purple) |
49+
| `vector-store` | Pinecone and vector storage integration | #1A73E8 (Google Blue) |
50+
| `infrastructure` | AWS and deployment infrastructure | #FF6D00 (Dark Orange) |
51+
| `cli` | Command-line interface | #795548 (Brown) |
52+
| `security` | Security and authentication concerns | #EE0701 (Bright Red) |
53+
54+
### Housekeeping
55+
56+
| Label | Description | Color |
57+
|-------|-------------|-------|
58+
| `duplicate` | Issue already exists elsewhere | #CCCCCC (Light Grey) |
59+
| `invalid` | Issue doesn't apply or is incorrect | #E4E669 (Pale Yellow) |
60+
| `question` | Requires clarification or discussion | #D876E3 (Pink) |
61+
| `wontfix` | Decision made not to fix or implement | #FBBF24 (Gold) |
62+
63+
## Branch Strategy
64+
65+
We follow a simplified GitHub flow:
66+
67+
1. `main` branch is always deployable
68+
2. Feature branches named `feature/<description>` branch off from `main`
69+
3. Bug fix branches named `fix/<issue-number>-<description>` branch off from `main`
70+
4. Pull requests merge back to `main` after review
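
The naming rules above can be checked mechanically, for example in a pre-push hook or a CI step. The following is a minimal sketch in Rust, not part of this commit. The names `BranchKind`, `classify_branch`, and `is_slug` are hypothetical, and the rule that a description is made of lowercase ASCII letters, digits, and hyphens is an assumption; the document itself only fixes the `feature/` and `fix/<issue-number>-` prefixes.

```rust
/// Branch categories described in the Branch Strategy list.
#[derive(Debug, PartialEq)]
enum BranchKind {
    Main,
    Feature,
    Fix { issue: u32 },
}

/// Assumption: a description is a non-empty run of lowercase ASCII letters,
/// digits, and hyphens. The source document does not specify this.
fn is_slug(s: &str) -> bool {
    !s.is_empty()
        && s.chars()
            .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '-')
}

/// Classifies a branch name, returning `None` when it matches no convention.
fn classify_branch(name: &str) -> Option<BranchKind> {
    if name == "main" {
        return Some(BranchKind::Main);
    }
    if let Some(description) = name.strip_prefix("feature/") {
        return if is_slug(description) {
            Some(BranchKind::Feature)
        } else {
            None
        };
    }
    if let Some(rest) = name.strip_prefix("fix/") {
        // Expected shape: `<issue-number>-<description>`.
        let (issue, description) = rest.split_once('-')?;
        let digits_only = !issue.is_empty() && issue.bytes().all(|b| b.is_ascii_digit());
        if !digits_only || !is_slug(description) {
            return None;
        }
        // `parse` can still fail on overflow; treat that as "no match".
        return issue.parse().ok().map(|issue| BranchKind::Fix { issue });
    }
    None
}
```

With this sketch, `classify_branch("fix/42-index-save")` yields `Some(BranchKind::Fix { issue: 42 })`, while `classify_branch("fix/index-save")` yields `None` because the issue number is missing.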
71+
72+
## Pull Request Process
73+
74+
1. Create a branch for your changes
75+
2. Make your changes with descriptive commits
76+
3. Open a pull request with:
77+
78+
- Clear description of changes
79+
- Reference to related issues
80+
- Screenshots if UI changes are involved
81+
82+
4. Request review from appropriate team members
83+
5. Address any review comments
84+
6. Merge when approved (squash commits)
85+
86+
## Issue Templates
87+
88+
We use issue templates for common types:
89+
90+
- Bug reports
91+
- Feature requests
92+
- Documentation updates
93+
94+
## Project Boards
95+
96+
Our development is organized into project boards:
97+
98+
- **Rust Migration MVP**: Core functionality migration from Python to Rust
99+
- **Entity Framework**: Development of the entity system
100+
- **Infrastructure**: AWS and deployment configuration
101+
102+
## Automation
103+
104+
We use GitHub Actions for:
105+
106+
- Continuous Integration testing
107+
- Documentation generation
108+
- Weekly status reports
109+
110+
## Using GitHub CLI
111+
112+
Common commands for working with our repository:
113+
114+
```bash
115+
# Create a new issue
116+
gh issue create --title "Issue title" --body "Description" --label "rust,bug"
117+
118+
# Check out a PR
119+
gh pr checkout 123
120+
121+
# Create a PR
122+
gh pr create --title "PR title" --body "Description" --label "enhancement"
123+
124+
# Apply labels
125+
gh issue edit 123 --add-label "high,in-progress"
126+
127+
# View project status (gh expects the project number, not its title)
128+
gh project view <project-number> --owner <owner>
129+
```
130+
131+
## Weekly Reviews
132+
133+
Every Monday, we conduct a GitHub review:
134+
135+
1. Triage new issues
136+
2. Update priority labels
137+
3. Close completed items
138+
4. Prioritize, define, and plan upcoming work

rust_ingest/rustREADME.md

Lines changed: 16 additions & 23 deletions
@@ -21,31 +21,24 @@ cargo run --release -- query "hello world" # ask
2121

2222
## 💡 History
2323

24-
2025-07-15 – Forked from Python FAISS script → Rust for speed & single-binary
25-
deploy.
26-
27-
2025-07-17 – Switched to hnsw_rs – smaller binary, no native BLAS.
28-
29-
2025-07-18 – Async embedding pipeline, 5× throughput on M3 Max.
24+
2025-07-15 – Started by taking an existing Python script that used FAISS for
25+
vector search and rewriting it in Rust. The goal was to make it faster and
26+
easier to deploy as a single, self-contained binary, without needing Python
27+
or extra dependencies.
28+
29+
2025-07-17 – Switched to hnsw_rs, a Rust library for fast vector search
30+
using Hierarchical Navigable Small World (HNSW) graphs. This change made
31+
the compiled program ("binary") smaller and removed the need for BLAS
32+
(Basic Linear Algebra Subprograms) libraries, which are external
33+
dependencies often used for mathematical operations in other
34+
vector search tools.
35+
36+
2025-07-18 – Changed the embedding process to run asynchronously (so it
37+
doesn't wait for each file to finish before starting the next). This made
38+
the process about five times faster when tested on a MacBook with an M3 Max
39+
chip.
3040
```
3141
3242
```text
3343
3444
```
35-
36-
```text
37-
2025-07-15 – Forked from Python FAISS script → Rust for speed & single-binary
38-
deploy.
39-
40-
2025-07-17 – Switched to hnsw_rs – smaller binary, no native BLAS.
41-
42-
2025-07-18 – Async embedding pipeline, 5× throughput on M3 Max.
43-
```
44-
45-
2025-07-17 – Switched to hnsw_rs – smaller binary, no native BLAS.
46-
47-
2025-07-18 – Async embedding pipeline, 5× throughput on M3 Max.
48-
49-
```text
50-
51-
```

rust_ingest/src/ingest.rs

Lines changed: 36 additions & 12 deletions
@@ -1,4 +1,3 @@
1-
use hnsw_rs::prelude::AnnT;
21
//! File ingestion module for semantic search indexing.
32
//!
43
//! This module handles the ingestion of files into a searchable vector index.
@@ -13,11 +12,10 @@ use hnsw_rs::prelude::AnnT;
1312
//! - This is a "module source file" - a unit of compilation within our crate
1413
//! - Part of the flat module style (modern) vs ingest/mod.rs (legacy)
1514
16-
use std::{fs::File, path::PathBuf};
17-
1815
use anyhow::{Context, Result};
19-
use hnsw_rs::{dist::DistCosine, prelude::*};
16+
use hnsw_rs::prelude::*;
2017
use serde_json::json;
18+
use std::{fs::File, path::PathBuf};
2119
use walkdir::WalkDir;
2220

2321
use crate::embed;
@@ -42,6 +40,11 @@ const MAX_FILE_CHARS: usize = 800;
4240
const MAX_EMBEDDING_TOKENS: usize = 600;
4341

4442
/// HNSW index construction parameters optimized for semantic search.
43+
///
44+
/// - `MAX_CONNECTIONS`: Maximum connections per node, controls index quality and memory usage
45+
/// - `EF_CONSTRUCTION`: Size of dynamic candidate list during construction, higher = better quality but slower build
46+
/// - `MAX_LAYER`: Maximum layer in the hierarchical structure, influences search performance
47+
/// - `EF_SEARCH`: Size of dynamic candidate list during search, higher = more accurate but slower search
4548
const HNSW_MAX_CONNECTIONS: usize = 16;
4649
const HNSW_EF_CONSTRUCTION: usize = 200;
4750
const HNSW_MAX_LAYER: usize = 16;
@@ -152,21 +155,26 @@ fn create_http_client() -> Result<reqwest::Client> {
152155
}
153156

154157
/// Creates and configures an HNSW index for vector similarity search.
155-
fn create_hnsw_index() -> Hnsw<f32, DistCosine> {
156-
Hnsw::<f32, DistCosine>::new(
158+
///
159+
/// # Returns
160+
/// A new HNSW index configured with optimal parameters for semantic search.
161+
// Fix: Add 'static lifetime to the HNSW index
162+
fn create_hnsw_index() -> Hnsw<'static, f32, DistanceType> {
163+
Hnsw::<'static, f32, DistanceType>::new(
157164
HNSW_MAX_CONNECTIONS,
158165
HNSW_EF_CONSTRUCTION,
159166
HNSW_MAX_LAYER,
160167
HNSW_EF_SEARCH,
161-
DistCosine::default(),
168+
DistanceType::Cosine,
162169
)
163170
}
164171

165172
/// Processes all files in the directory tree.
166173
async fn process_directory_tree(
167174
config: &IngestConfig,
168175
client: &reqwest::Client,
169-
index: &Hnsw<f32, DistCosine>,
176+
// Fix: Add explicit lifetime to the HNSW index reference
177+
index: &Hnsw<'_, f32, DistanceType>,
170178
file_metadata: &mut Vec<PathBuf>,
171179
stats: &mut IngestStats,
172180
) -> Result<()> {
@@ -216,11 +224,26 @@ fn is_supported_file(path: &std::path::Path) -> bool {
216224
}
217225

218226
/// Processes a single file and adds it to the index.
227+
///
228+
/// # Arguments
229+
/// * `path` - Path to the file being processed
230+
/// * `config` - Configuration settings for ingestion
231+
/// * `client` - HTTP client for embedding API requests
232+
/// * `index` - HNSW index to insert embeddings into
233+
/// * `file_metadata` - Collection of file paths to track processed files
234+
/// * `file_id` - Unique identifier for this file in the index
235+
///
236+
/// # Returns
237+
/// Success if the file was processed and added to the index.
238+
///
239+
/// # Errors
240+
/// Returns error if file reading or embedding generation fails.
219241
async fn process_single_file(
220242
path: &std::path::Path,
221243
config: &IngestConfig,
222244
client: &reqwest::Client,
223-
index: &Hnsw<f32, DistCosine>,
245+
// Fix: Add explicit lifetime to the HNSW index reference
246+
index: &Hnsw<'_, f32, DistanceType>,
224247
file_metadata: &mut Vec<PathBuf>,
225248
file_id: usize,
226249
) -> Result<()> {
@@ -255,18 +278,19 @@ fn truncate_content(content: &str, max_chars: usize) -> &str {
255278

256279
/// Persists the HNSW index and file metadata to disk.
257280
fn persist_index_data(
258-
index: &Hnsw<f32, DistCosine>,
281+
index: &Hnsw<'_, f32, DistanceType>,
259282
file_metadata: &[PathBuf],
260283
output_dir: &std::path::Path,
261284
) -> Result<()> {
262285
// Create output directory
263286
std::fs::create_dir_all(output_dir)
264287
.with_context(|| format!("Failed to create output directory: {}", output_dir.display()))?;
265288

266-
// Save HNSW index
289+
// Save HNSW index - use save() instead of dump()
267290
let index_path = output_dir.join("index");
268291
index
269-
.dump(&index_path)
292+
.save(index_path.to_str().unwrap()) // Use save() instead of dump()
293+
.map_err(|e| anyhow::anyhow!("Failed to save HNSW index: {}", e))
270294
.with_context(|| format!("Failed to save HNSW index to: {}", index_path.display()))?;
271295

272296
// Save metadata as JSON
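
Review note on the last hunk: `index_path.to_str().unwrap()` panics if the output path is not valid UTF-8, which bypasses the `Result` the function already returns. Below is a minimal sketch of a non-panicking conversion. The helper name `index_path_string` is hypothetical, and a plain `String` error is used only to keep the example free of external crates; in the project this would presumably be an `anyhow` error carrying the same message.

```rust
use std::path::Path;

/// Builds `<output_dir>/index` and returns it as an owned UTF-8 string.
///
/// Returns an error instead of panicking when the path cannot be
/// represented as UTF-8 (possible on both Unix and Windows).
fn index_path_string(output_dir: &Path) -> Result<String, String> {
    let index_path = output_dir.join("index");
    index_path
        .to_str()
        .map(str::to_owned)
        .ok_or_else(|| format!("Index path is not valid UTF-8: {}", index_path.display()))
}
```

The caller can then propagate the failure with `?` before handing the string to whichever persistence call the index type actually provides.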

rust_ingest/src/query.rs

Lines changed: 5 additions & 5 deletions
@@ -155,13 +155,13 @@ pub async fn run_with_config(query: &str, config: QueryConfig) -> Result<QueryRe
155155
}
156156

157157
/// Loads the HNSW index and file metadata from disk.
158-
fn load_index_and_metadata(config: &QueryConfig) -> Result<(Hnsw<f32, DistCosine>, Vec<PathBuf>)> {
158+
fn load_index_and_metadata(config: &QueryConfig) -> Result<(Hnsw<'_, f32, DistCosine>, Vec<PathBuf>)> {
159159
let data_dir = config.root_dir.join("data");
160160

161161
// Load the HNSW index using the correct API
162-
let index: Hnsw<f32, DistCosine> = Hnsw::file_load(&data_dir, "index")
163-
.context("Failed to load HNSW index - ensure ingestion has been run")?
164-
.0; // Extract the index from the tuple
162+
// The file_load function doesn't exist; use the proper loading function
163+
let index: Hnsw<'_, f32, DistCosine> = hnsw_rs::Hnsw::load_hnsw(&data_dir.join("index"))
164+
.context("Failed to load HNSW index - ensure ingestion has been run")?;
165165

166166
// Load file metadata
167167
let metadata_file = fs::File::open(data_dir.join("meta.json"))
@@ -185,7 +185,7 @@ async fn perform_semantic_search(
185185
query: &str,
186186
config: &QueryConfig,
187187
client: &reqwest::Client,
188-
index: &Hnsw<f32, DistCosine>,
188+
index: &Hnsw<'_, f32, DistCosine>,
189189
) -> Result<Vec<Neighbour>> {
190190
// Convert query to embedding vector
191191
let query_embedding = embed::embed(query, config.max_query_tokens, client)
