Commit 0bc23c8

Refactor migration documentation and cleanup plans
- Updated JSON-TOOLS.md to reference new migration documentation.
- Revised TESTS.md to link to DEVELOPMENT-PRINCIPLES.md for cleanup plans.
- Added comprehensive Copilot instructions for HeraldStack, detailing project overview, architecture, developer workflows, and best practices.
- Created INGEST-MIGRATION-MODULAR-PLAN.md to outline the refactor to a modular ingestion architecture.
- Established a Migration Archive with historical documents and guidelines for future migrations.
- Documented the Script Migration Cleanup Plan, detailing the process for cleaning up shell scripts post-migration.
- Compiled a Script Migration to Rust Plan, outlining core principles and successfully migrated scripts.
- Summarized Shell Script Prevention strategies to prevent the creation of new shell scripts and promote automated tools.
1 parent b9d9507 commit 0bc23c8

26 files changed: +512 -619 lines changed

.github/copilot-instructions.md

Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
1+
# Copilot Instructions for HeraldStack
2+
3+
## Project Overview
4+
- **HeraldStack** is a Rust-based, context-aware AI framework for personal and professional productivity, integrating memory, emotion, and modular execution across a cohort of AI entities.
5+
- The system is migrating all application logic from shell scripts to Rust binaries. Shell scripts are only used for infrastructure orchestration (e.g., deployment, CI/CD, AWS CLI).
6+
7+
## Architecture & Key Components
8+
- **src/**: Core Rust code for all application logic, including data processing, JSON/JSONL tools, embedding, and ingestion.
9+
- **ai-entities/**: Definitions and metadata for AI personalities (see `entity-registry.json`, individual `.md` files).
10+
- **config/**: Schemas, ethics, and model configuration.
11+
- **docs/**: System design, migration, and vector search documentation.
12+
- **scripts/**: Shell scripts for deployment and validation (do not add new app logic here).
13+
- **rust_ingest/**: Rust CLI tools for ingestion and embedding (see `marvelai_ingest.rs`, `ingest.rs`).
14+
- **data/**: Vector store registry and ingested data.
15+
16+
## Developer Workflows
17+
- **Build Rust Binaries:**
18+
```bash
19+
cd src && cargo build --release --features cli
20+
# Binaries: format_json, validate_json_schema, ingest_chunked, embedding_tool, text_chunker
21+
```
22+
- **Run Ingestion Pipeline:**
23+
- Use Rust binaries for all data ingestion and embedding.
24+
- Chunking and embedding logic is in `src/ingest/` and `rust_ingest/`.
25+
- Embeddings are saved as `.emb.json` files alongside chunked data.
26+
- **Deploy:**
27+
```bash
28+
./scripts/deploy/deploy.sh [--build-only|prod|staging|--no-tests]
29+
```
30+
- **Validation & Linting:**
31+
```bash
32+
./scripts/validation/check-json.sh
33+
./scripts/validation/check-rust.sh
34+
./src/target/release/format_md
35+
```
36+
37+
## Project-Specific Conventions
38+
- **No new shell scripts for app logic.** All new features must be implemented in Rust.
39+
- **Chunking for Embedding:**
40+
- All text for embedding must be chunked to ≤250 characters (see `chunking_utils.rs`).
41+
- Embeddings are generated per sub-chunk and saved as `<chunkfile>.<entity>.<field>.<start>-<end>.emb.json`.
42+
- **JSONL as Standard:**
43+
- All ingestion and embedding pipelines use JSONL for input/output.
44+
- **Error Handling:**
45+
- Use Rust's `anyhow` and `Result` for error propagation.
46+
- All ingestion failures must be logged with actionable diagnostics.
47+
- **Documentation:**
48+
- Update `README.md` and relevant `.md` files for any new workflow or tool.
49+
- See `docs/vector-search/ollama-embedding-limits.md` for API constraints.
50+
51+
## Integration & External Dependencies
52+
- **Ollama API** for embedding (see `docs/vector-search/ollama-embedding-limits.md`).
53+
- **AWS Lambda, S3, DynamoDB, Pinecone** for compute, storage, state, and semantic memory.
54+
- **No direct shell logic in Rust binaries.**
55+
56+
## Examples
57+
- To add a new ingestion feature, extend `src/ingest/marvelai_ingest.rs` and document the CLI usage.
58+
- To validate JSON, use `src/utils/json_tools/` Rust binaries, not shell scripts.
59+
- To add a new entity, update `ai-entities/entity-registry.json` and add a corresponding `.md` file.
60+
61+
## References
62+
- [README.md](../README.md)
63+
- [docs/vector-search/character-based-chunking.md](../docs/vector-search/character-based-chunking.md)
64+
- [docs/vector-search/ollama-embedding-limits.md](../docs/vector-search/ollama-embedding-limits.md)
65+
- [docs/CONTRIBUTING.md](../docs/CONTRIBUTING.md)
66+
67+
---
68+
69+
## Copilot Chat Preferences
70+
71+
If you are unsure about a workflow or convention, check the documentation including the Root "HARALD" directory README.md and the appropriate .md file in the working directory, as well as general documentation in `docs/` paying special attention to the `migration/` directory for migration guidelines and the `CONTRIBUTING.md` for contribution standards.
72+
73+
- When suggesting code or pipeline changes, do not offer multiple alternatives or open‑ended questions. Clearly state the proposed fix and tell me you will apply it after I say “yes.” Only ask me to reply “yes” when confirmation is absolutely necessary before executing the change.
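
The "Chunking for Embedding" convention in the instructions above (text chunked to ≤250 characters, one embedding file per sub-chunk named `<chunkfile>.<entity>.<field>.<start>-<end>.emb.json`) could be sketched roughly as follows. This is an illustrative sketch only, not the contents of `chunking_utils.rs`: the function names `chunk_by_chars` and `emb_file_name`, the use of character (not byte) offsets, the exclusive end offset, and the sample file, entity, and field names are all assumptions.

```rust
/// Splits `text` into pieces of at most `max_chars` characters.
/// Returns `(start, end, chunk)` tuples, where `start`/`end` are character
/// offsets into `text` and `end` is exclusive (an assumption; the real
/// pipeline may count offsets differently).
fn chunk_by_chars(text: &str, max_chars: usize) -> Vec<(usize, usize, String)> {
    assert!(max_chars > 0, "max_chars must be positive");
    // Collect chars first so multi-byte UTF-8 characters are never split.
    let chars: Vec<char> = text.chars().collect();
    chars
        .chunks(max_chars)
        .enumerate()
        .map(|(i, piece)| {
            let start = i * max_chars;
            (start, start + piece.len(), piece.iter().collect())
        })
        .collect()
}

/// Builds an embedding file name following the documented pattern
/// `<chunkfile>.<entity>.<field>.<start>-<end>.emb.json`.
fn emb_file_name(chunk_file: &str, entity: &str, field: &str, start: usize, end: usize) -> String {
    format!("{chunk_file}.{entity}.{field}.{start}-{end}.emb.json")
}

fn main() {
    // Hypothetical input: 600 characters of filler text.
    let text = "x".repeat(600);
    for (start, end, chunk) in chunk_by_chars(&text, 250) {
        let name = emb_file_name("chunks.jsonl", "vision", "description", start, end);
        println!("{} ({} chars)", name, chunk.chars().count());
    }
}
```

Splitting on `char` boundaries rather than bytes keeps every chunk valid UTF-8; a production chunker would likely also prefer word or sentence boundaries, which this sketch does not attempt.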

Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
+
 [workspace]
 members = [
-"src",
 "rust_ingest"
 ]
 resolver = "2"

README.md

Lines changed: 22 additions & 3 deletions
@@ -140,7 +140,13 @@ handling.
140140
Rust
141141
- [Project Structure](docs/migration/RECOMMENDED-STRUCTURE.md) - Recommended
142142
organization
143-
- [Migration Documentation](docs/migration/) - Detailed migration information
143+
- [Migration Documentation](docs/migration/) - Shell-to-Rust migration details
144+
- **Ingestion/Embedding Architecture:** All ingestion and embedding logic must
145+
follow the
146+
[Modular Ingest Refactor Plan](docs/migration/INGEST-MIGRATION-MODULAR-PLAN.md).
147+
This plan defines the canonical, reusable ingest library and the pattern for
148+
domain-specific wrappers (e.g., marvelai_ingest.rs). All new pipelines and
149+
refactors must use this architecture and update documentation accordingly.
144150

145151
## Operating Model
146152

@@ -159,6 +165,9 @@ pragmatic execution, and narrative continuity.
159165
- Workflows
160166
- [JSONL Format for Vector Embedding](docs/vector-search/jsonl-ingestion.md)
161167
- [Migration Documentation](docs/migration/) - Shell-to-Rust migration details
168+
- [Modular Ingest Refactor Plan](docs/migration/INGEST-MIGRATION-MODULAR-PLAN.md)
169+
– Step-by-step plan for refactoring to a reusable, component-based ingestion
170+
architecture
162171

163172
## Ethics & Consent
164173

@@ -175,9 +184,19 @@ guidelines including those defined in
175184
- **Test Data**: Test fixtures are available in `tests/fixtures/` (see
176185
[FIXTURES.md](tests/fixtures/FIXTURES.md) for details)
177186

178-
## Further Information
187+
## Directory Structure Overview
188+
189+
For a detailed, canonical description of the project’s directory structure, see:
190+
191+
- [docs/DETAILED.md](docs/DETAILED.md)**Directory Structure and Naming Best
192+
Practices** (includes a `tree` overview and rationale)
193+
- [docs/naming-conventions.md](docs/naming-conventions.md)**Directory and
194+
file naming conventions**
179195

180-
See docs/DETAILED.md for more information.
196+
Other structure-related documents in `docs/migration/` (such as
197+
`RECOMMENDED-STRUCTURE.md`, `DIRECTORY-REORGANIZATION.md`, and
198+
`IMPLEMENTATION-PLAN.md`) are project planning artifacts and will be moved to a
199+
`docs/project-planning/` subdirectory.
181200

182201
---
183202

docs/CONTRIBUTING.md

Lines changed: 17 additions & 8 deletions
@@ -1,5 +1,9 @@
11
# Contributing to HARALD
22

3+
**Created**: July 2025
4+
**Last Updated**: July 24, 2025
5+
**Version**: 1.1
6+
37
This document provides practical guidance for contributing to the HARALD
48
project. For high-level development philosophy and principles, see
59
[DEVELOPMENT-PRINCIPLES.md](DEVELOPMENT-PRINCIPLES.md).
@@ -16,7 +20,15 @@ Before contributing, please familiarize yourself with our
1620
- **Automation over manual work** - Use and extend our automated tools
1721

1822
For the complete migration strategy and decision framework, refer to the
19-
[Development Principles](DEVELOPMENT-PRINCIPLES.md#migration-strategy--guidelines) document.
23+
[Development Principles](DEVELOPMENT-PRINCIPLES.md#-rust-vs-shell-decision-framework)
24+
document.
25+
26+
**Ingestion/Embedding Architecture:** All ingestion and embedding logic must
27+
follow the
28+
[Modular Ingest Refactor Plan](migration/INGEST-MIGRATION-MODULAR-PLAN.md). This
29+
plan defines the canonical, reusable ingest library and the pattern for
30+
domain-specific wrappers (e.g., marvelai_ingest.rs). All new pipelines and
31+
refactors must use this architecture and update documentation accordingly.
2032

2133
## 🔧 Automated Cleanup Tools First
2234

@@ -141,13 +153,10 @@ standards when working with others:
141153
preferred over long comment threads.
142154
- **Listen past the first answer**—follow-up questions deepen understanding.
143155

144-
> "Empathy is a muscle: left unused, it atrophies; put to work, it grows."
145-
> — Jamil Zaki, Stanford
146-
> "Minds are mirrors to one another."
147-
> — David Hume
148-
> "Seeing the world through the eyes of the other, not seeing your world
149-
> reflected in their eyes."
150-
> — Carl Rogers
156+
> "Empathy is a muscle: left unused, it atrophies; put to work, it grows." —
157+
> Jamil Zaki, Stanford "Minds are mirrors to one another." — David Hume "Seeing
158+
> the world through the eyes of the other, not seeing your world reflected in
159+
> their eyes." — Carl Rogers
151160
152161
For more on our collaboration philosophy, see
153162
[DEVELOPMENT-PRINCIPLES.md](DEVELOPMENT-PRINCIPLES.md).

docs/DETAILED.md

Lines changed: 51 additions & 17 deletions
@@ -18,7 +18,7 @@ _exclusively for Bryan Chasko_. It integrates memory, emotion, and modular
1818
execution across a trusted cohort of entities to restore momentum, anchor
1919
decisions, and evolve alongside Bryan's ongoing personal and professional arcs.
2020

21-
### Core Components:
21+
### Core Components
2222

2323
- **🦊 HARALD** – The default entity. Serves as an emotional mirror, decision
2424
anchor, and continuity manager—especially effective during moments of
@@ -175,7 +175,9 @@ relationships, as well as routines and behavioral patterns.
175175
With explicit consent, HeraldStack observes and logs sleep patterns,
176176
conversations (when dual-consent is recorded), random thoughts, and important
177177
insights. All thoughts are automatically tagged and categorized (e.g., #idea,
178-
#todo, #relationship, #coding), with full access to raw logs for auditing,
178+
179+
# todo, #relationship, #coding), with full access to raw logs for auditing
180+
179181
tuning, or retraining.
180182

181183
### Calendar Intelligence
@@ -294,22 +296,54 @@ each component.
294296
and [GitHub repository](https://github.com/jean-pierreBoth/hnswlib-rs) for
295297
details and updates.
296298

297-
herald-stack/ # Project root (kebab-case) . ├── ai-entities │ ├── ellow.md │ ├──
298-
entity-registry.json │ ├── harald.md │ ├── kade-vox.md │ ├── liora.md │ ├──
299-
myrren.md │ ├── Orin.md │ ├── solan.md │ └── stratia.md ├── docs │ ├──
300-
architecture-decisions │ │ └── 001-entity-cohort-design.md │ ├── changelog.md │
301-
├── roadmap.md │ └── weekly-reviews ├── infrastructure │ ├── aws-stack.md │ ├──
302-
cost-monitoring.md │ ├── deployment-guide.md │ └── pinecone-schemas.md ├──
303-
integration-guides │ ├── agentic-orchestration.md │ ├──
299+
HARALD/ # Project root (kebab-case) ├── ai-entities/ # AI entity
300+
definitions and metadata │ ├── entity-registry.json # Entity registry (all
301+
entities) │ ├── harald.md # Entity: HARALD │ ├── stratia.md # Entity: Stratia │
302+
├── myrren.md # Entity: Myrren │ ├── liora.md # Entity: Liora │ ├──
303+
kade-vox.md # Entity: Kade Vox │ ├── solan.md # Entity: Solan │ ├── ellow.md #
304+
Entity: Ellow │ ├── orin.md # Entity: Orin │ └── prompts/ # Prompt templates for
305+
entities ├── config/ # Schemas, ethics, and model configs │ ├── CONFIG.md #
306+
Config documentation │ ├── ethics/ # Ethical guidelines (e.g.,
307+
LawsOfRobotics.json) │ │ └── LawsOfRobotics.json │ ├── models/ # Model
308+
configuration files │ └── schemas/ # Data schemas for validation ├── data/ #
309+
Vector store registry, ingested data │ ├── vector-stores-registry.json │ └──
310+
schemas/ # Data schemas (if present) ├── datasets/ # Source datasets for
311+
ingestion/embedding ├── docs/ # System, migration, and vector search docs │ ├──
312+
CONTRIBUTING.md # Contribution guidelines │ ├── DETAILED.md # This file
313+
(detailed docs) │ ├── DEVELOPMENT-PRINCIPLES.md │ ├── naming-conventions.md │
314+
├── migration/ # Shell-to-Rust migration plans │ │ ├── RECOMMENDED-STRUCTURE.md
315+
│ │ ├── DIRECTORY-REORGANIZATION.md │ │ └── IMPLEMENTATION-PLAN.md │ └──
316+
vector-search/ # Vector search and embedding docs │ ├──
317+
character-based-chunking.md │ ├── ollama-embedding-limits.md │ └──
318+
jsonl-ingestion.md ├── infrastructure/ # Cloud and deployment infrastructure
319+
docs │ ├── aws-stack.md │ ├── cost-monitoring.md │ ├── deployment-guide.md │ └──
320+
pinecone-schemas.md ├── integration-guides/ # Integration docs for external
321+
APIs/services │ ├── agentic-orchestration.md │ ├──
304322
amazon-voice-interoperability.md │ ├── anthropic.md │ ├── aws.md │ ├──
305323
bedrock.md │ ├── cohere.md │ ├── google.md │ ├── griptape.md │ ├── haystack.md │
306324
├── hugging-face.md │ ├── microsoft.md │ ├── open-ai.md │ └── pinecone.md ├──
307-
LawsOfRobotics.json ├── memory-schemas │ ├── conversation-metadata.json │ ├──
325+
logs/ # Ingestion and embedding logs │ ├── embedding*size_test*_.log │ ├──
326+
ingest*log*_.log │ └── embedding_api/ # API-specific logs ├── memory-schemas/ #
327+
Schemas for memory and context │ ├── conversation-metadata.json │ ├──
308328
emotion-vectors.json │ ├── entity-context.json │ └── narrative-arc.json ├──
309-
personality-archetypes │ ├── mythological │ │ ├── celtic │ │ ├──
310-
human-inspired.md │ │ └── norse │ │ ├── Heralds.json │ │ └── heralds.md │ └──
311-
pop-culture │ ├── bojack-horseman │ │ └── Bojack.json │ ├── literary │ ├──
312-
marvel │ │ ├── MarvelAIs.json │ │ ├── pop-culture-ai-references.md │ │ └──
313-
VictorMancha.json │ └── marvel.md ├── README.md └── workflows ├──
314-
consent-logging.md ├── entity-routing.md ├── task-orchestration.md └──
315-
weekly-review.md
329+
personality-archetypes/ # Archetype definitions and docs │ ├── Heralds.json │
330+
├── heralds.md │ ├── mythological/ │ │ ├── celtic/ │ │ ├── norse/ │ │ └──
331+
human-inspired.md │ └── pop-culture/ │ ├── bojack-horseman/ │ │ └── Bojack.json
332+
│ ├── literary/ │ └── marvel/ │ ├── MarvelAIs.json │ ├──
333+
pop-culture-ai-references.md │ └── VictorMancha.json ├── rust_ingest/ # Rust CLI
334+
tools for ingestion/embedding │ ├── Cargo.toml │ ├── Cargo.lock │ ├──
335+
rustREADME.md │ ├── src/ │ └── target/ ├── scripts/ # Shell scripts for
336+
deployment/validation only │ ├── build_rust_tools.sh │ └── validation/ │ ├──
337+
check-json.sh │ └── check-rust.sh │ └── deploy/ │ └── deploy.sh ├── src/ # Core
338+
Rust code (all app logic) │ ├── ingest/ # Ingestion pipeline logic │ │ ├──
339+
marvelai_ingest.rs # Domain-specific ingest wrapper │ │ ├── ingest.rs # Core
340+
ingest logic │ │ ├── chunking_utils.rs # Character-based chunking │ │ ├──
341+
embedding.rs # Embedding API integration │ │ └── ... │ ├── utils/ │ │ ├──
342+
json_tools/ │ │ │ ├── format_json.rs │ │ │ ├── validate_json_schema.rs │ │ │ └──
343+
... │ │ └── ... │ └── target/ # Rust build output (release/debug) ├── target/ #
344+
Rust build output (workspace root) ├── tests/ # Test fixtures and test code │
345+
├── fixtures/ │ │ └── FIXTURES.md │ ├── ingest_tests.rs # Ingestion/embedding
346+
tests │ ├── utils_tests.rs # Utility function tests │ └── ... ├── workflows/ #
347+
CI/CD and automation configs │ ├── rust.yml # Rust build/test workflow │ ├──
348+
lint.yml # Linting/formatting workflow │ └── ... ├── README.md # Project
349+
overview, build, and dev standards └── Cargo.toml # Rust workspace config (root)
