# Tutorial 1: Document Indexing and Search
**Build a full document search system in under 30 minutes—from zero to a working web interface and API.**
TalkPipe lets you prototype searchable document systems without external databases or custom code. This tutorial shows how: you’ll generate test content, index it with full-text search, and expose it via a web UI and REST API—all using TalkPipe’s built-in components.
---
## Why This Tutorial?
- **Start fast**: No database setup—Whoosh is included with TalkPipe.
- **Real results**: You get a working search UI and API, not just a demo.
- **Learn the basics**: Foundation for Tutorials 2 (semantic search) and 3 (report generation).
- **Reuse it**: The same patterns work for docs, wikis, tickets, and other text collections.
---
## What You'll Build
A complete document search system in three steps:

| Step | Goal | Outcome |
|------|------|---------|
| **1** | Create content | 50 AI-generated stories in `stories.json` |
| **2** | Build the index | Full-text search index in `full_text_index/` |
| **3** | Add search | Web form + REST API for searching |
---
## Prerequisites

- **Step 1 only**: Ollama installed locally with the `llama3.2` model (or adjust the script to use another model)
- **Steps 2 and 3**: No external databases or services required; Whoosh, a pure-Python full-text search library, is included with TalkPipe
> **Tip:** If you skip Step 1, you can use the included `stories.json` and go straight to Step 2.
---
## Quick Start
All commands must be run from the tutorial directory:
```bash
cd docs/tutorials/Tutorial_1-Document_Indexing
```
| Step | Command | Time |
|------|---------|------|
| 1 | `./Step_1_CreateSyntheticData.sh` or `chatterlang_script --script Step_1_CreateSyntheticData.script` | ~5–10 min |
| 2 | `./Step_2_IndexStories.sh` or `chatterlang_script --script Step_2_IndexStories.script` | ~5 sec |
| 3 | `./Step_3_SearchStories.sh` or `chatterlang_serve --form-config story_search_ui.yml --title "Story Search" --display-property query --script Step_3_SearchStories.script` | Starts server |
Step 3 starts a web server. Use the URL printed in the terminal—append `/stream` for the search form, or POST to the base URL for the REST API.
---
## Step 1: Creating Synthetic Data
*Generating realistic test content for indexing*
### The Problem
You need documents to search. In real projects you might use papers, wikis, or tickets. For prototyping, synthetic data is safer and easier to control.
### The Solution
A ChatterLang pipeline in `Step_1_CreateSyntheticData.script` generates 50 short stories using an LLM:
```
INPUT FROM "Generating 50 synthetic stories into stories.json" | print;
INPUT FROM echo[data="Write a fictitious five sentence story about technology development.", n=50]
    | llmPrompt[source="ollama", model="llama3.2", multi_turn=False]
    | toDict[field_list="_:content"]
    | llmPrompt[system_prompt="Write exactly one title for this story in plain text with no markdown", field="content", set_as="title", multi_turn=False]
    | progressTicks[tick_count=1, print_count=True]
    | dumpsJsonl
    | writeString[fname="stories.json"];
```
The `echo` source generates 50 copies of the prompt, which are then sent to the LLM to generate 50 different stories. The `multi_turn=False` parameter ensures each story is independent, preventing the AI from building on previous stories.
The `progressTicks` segment provides visual feedback during processing. Finally, each complete document (with both content and title) is formatted as JSON Lines (JSONL) and written to `stories.json` using `writeString`.
### Pipeline Breakdown
| Segment | Purpose |
|---------|---------|
| `echo[data="...", n=50]` | Sends 50 copies of the prompt to the pipeline |
| `llmPrompt[source="ollama", model="llama3.2", multi_turn=False]` | Generates 50 different stories (no context between calls) |
| `toDict[field_list="_:content"]` | Puts each story into a dict with a `content` field |
| `llmPrompt[..., field="content", set_as="title", ...]` | Adds a title for each story |
| `dumpsJsonl` \| `writeString[fname="stories.json"]` | Writes JSONL to `stories.json` |
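After Step 1 runs, `stories.json` contains one JSON object per line. A quick way to inspect the result in plain Python (the sample records below are hypothetical; real ones come from the LLM):

```python
import json

# Hypothetical records in the shape Step 1 writes: one JSON object
# per line, each with "content" and "title" fields.
sample_jsonl = "\n".join([
    json.dumps({"content": "A lab builds a self-cooling microchip.", "title": "The Cool Chip"}),
    json.dumps({"content": "An AI learns to schedule satellite launches.", "title": "Launch Planner"}),
])

# Parse the JSONL text line by line, skipping blank lines.
stories = [json.loads(line) for line in sample_jsonl.splitlines() if line.strip()]
titles = [story["title"] for story in stories]
```

To inspect the real file, replace `sample_jsonl.splitlines()` with iteration over `open("stories.json")`.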
---
## Step 2: Indexing the Stories
*Making documents searchable with full-text indexing*
### The Problem
Raw text files don’t support queries like “find stories about quantum computing.” You need an index: terms → documents, with relevance scoring.
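Conceptually, a full-text index maps each term to the documents that contain it, so a query is a lookup rather than a scan. A toy sketch of the idea (illustration only; Whoosh's actual storage, tokenization, and scoring are far more sophisticated):

```python
from collections import defaultdict

docs = {
    1: "a breakthrough in quantum computing",
    2: "a startup ships a solar powered drone",
}

# Build an inverted index: term -> set of document ids containing it.
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted[term].add(doc_id)

# Querying is now a dictionary lookup instead of reading every document.
matches = inverted["quantum"]
```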
### The Solution
The pipeline in `Step_2_IndexStories.script` reads `stories.json` with `readJsonl`, converting each line back into a Python object, and passes each story to `indexWhoosh`, which builds the index in the `./full_text_index` directory, makes both the `content` and `title` fields searchable, and overwrites any existing index (`overwrite=True`).
### What Whoosh Provides

Whoosh is a pure-Python search library that provides:

- **Full-text search capabilities** - Find documents containing specific terms
- **Boolean queries** - Combine terms with AND, OR, NOT operators
- **Phrase searching** - Find exact phrases within documents
- **Wildcard support** - Search with partial matches using `*` and `?`
- **Relevance scoring** - Rank results by how well they match queries
TalkPipe includes Whoosh, so there’s no extra install. For larger deployments, you can swap in Elasticsearch or Solr and keep the rest of the pipeline.
### Data Flow
- **Input**: Filename `"stories.json"`
- **After `readJsonl`**: One JSON object per story
- **After `indexWhoosh`**: Same objects, plus a searchable index in `full_text_index/`
---
## Step 3: Implementing Search
*Web interface and REST API*
### The Problem
You need both a UI for humans and an API for other systems. TalkPipe provides both from one configuration.
### The Solution
`chatterlang_serve` starts a server with both a search form and a JSON API from a single command. The pipeline in `Step_3_SearchStories.script` queries the index, selects the relevant information for each hit (title, content, and relevance score), and prepares the results for both API responses and web display.
The server prints the base URL when it starts (default: `http://localhost:2025`).
**Example API call:**
```bash
curl -X POST http://localhost:2025/process \
  -H "Content-Type: application/json" \
  -d '{"query": "quantum energy"}'
```
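The same call from Python, using only the standard library (a sketch; the `/process` path follows the curl example above, so adjust it if your server prints a different endpoint):

```python
import json
import urllib.request

def search_stories(query, base_url="http://localhost:2025"):
    """POST a search query to the running chatterlang_serve endpoint."""
    payload = json.dumps({"query": query}).encode("utf-8")
    request = urllib.request.Request(
        base_url + "/process",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the Step 3 server running, `search_stories("quantum energy")` returns the parsed JSON results.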
### Form Configuration
`story_search_ui.yml` defines the web form (fields, layout, theme). You can change the UI without touching the search logic.
---
## Next Steps
- **Tutorial 2**: Add semantic search and RAG with vector embeddings.
- **Tutorial 3**: Build report generation from search results.
- **Customize**: Swap prompts, models, or indexes; the pipeline structure stays the same.