Commit 8989695: Updated Documentation
1 parent (ad87ca6)

4 files changed: +318 −520 lines
CHANGELOG.md (Lines changed: 3 additions & 0 deletions)

# Changelog

## In Development

- Improved Tutorial 1 (Document Indexing), Tutorial 2 (Search by Example and RAG), and Tutorial 3
  (Report Generation) documentation: clearer structure, prerequisites and quick-start section,
  value-focused intro, and step-by-step guidance for running the tutorials.
- Fixed documentation examples: corrected lazy-loading.md ChatterLang syntax (range as source
  needs INPUT FROM, uses lower/upper not start/stop; scale uses multiplier not factor);
  corrected protocol.md assign_property example (nested paths require direct assignment).
Lines changed: 104 additions & 181 deletions
# Tutorial 1: Document Indexing and Search

**Build a full document search system in under 30 minutes—from zero to a working web interface and API.**

TalkPipe lets you prototype searchable document systems without external databases or custom code. This tutorial shows how: you'll generate test content, index it with full-text search, and expose it via a web UI and REST API—all using TalkPipe's built-in components.

---

## Why This Tutorial?

- **Start fast**: No database setup—Whoosh is included with TalkPipe.
- **Real results**: You get a working search UI and API, not just a demo.
- **Learn the basics**: Foundation for Tutorials 2 (semantic search) and 3 (report generation).
- **Reuse it**: The same patterns work for docs, wikis, tickets, and other text collections.

---

## What You'll Build

| Step | Goal | Outcome |
|------|------|---------|
| **1** | Create content | 50 AI-generated stories in `stories.json` |
| **2** | Build the index | Full-text search index in `full_text_index/` |
| **3** | Add search | Web form + REST API for searching |

---

## Prerequisites

- **TalkPipe** installed: `pip install talkpipe[ollama]` (or `talkpipe[all]`)
- **Step 1 only**: Ollama installed locally with the `llama3.2` model (or adjust the script to use another model)

> **Tip:** If you skip Step 1, you can use the included `stories.json` and go straight to Step 2.

---

## Quick Start

All commands must be run from the tutorial directory:

```bash
cd docs/tutorials/Tutorial_1-Document_Indexing
```

| Step | Command | Time |
|------|---------|------|
| 1 | `./Step_1_CreateSyntheticData.sh` or `chatterlang_script --script Step_1_CreateSyntheticData.script` | ~5–10 min |
| 2 | `./Step_2_IndexStories.sh` or `chatterlang_script --script Step_2_IndexStories.script` | ~5 sec |
| 3 | `./Step_3_SearchStories.sh` or `chatterlang_serve --form-config story_search_ui.yml --title "Story Search" --display-property query --script Step_3_SearchStories.script` | Starts server |

Step 3 starts a web server. Use the URL printed in the terminal: append `/stream` for the search form, or POST to `/process` for the REST API.
---

## Step 1: Creating Synthetic Data

*Generating realistic test content for indexing*

### The Problem

You need documents to search. In real projects you might use papers, wikis, or tickets. For prototyping, synthetic data is safer and easier to control.

### The Solution

A ChatterLang pipeline in `Step_1_CreateSyntheticData.script` generates 50 short stories using an LLM:

```
INPUT FROM "Generating 50 synthetic stories into stories.json" | print;
INPUT FROM echo[data="Write a fictitious five sentence story about technology development in an imaginary country.", n=50]
| llmPrompt[source="ollama", model="llama3.2", multi_turn=False]
| toDict[field_list="_:content"]
| llmPrompt[system_prompt="Write exactly one title for this story in plain text with no markdown", field="content", set_as="title", multi_turn=False]
| progressTicks[tick_count=1, eol_count=10, print_count=true]
| dumpsJsonl
| writeString[fname="stories.json"];
```

**Run it:**

```bash
chatterlang_script --script Step_1_CreateSyntheticData.script
```

### Pipeline Breakdown

| Segment | Purpose |
|---------|---------|
| `echo[data="...", n=50]` | Sends 50 copies of the prompt into the pipeline |
| `llmPrompt[source="ollama", model="llama3.2", multi_turn=False]` | Generates 50 different stories (no context shared between calls) |
| `toDict[field_list="_:content"]` | Puts each story into a dict with a `content` field |
| `llmPrompt[..., field="content", set_as="title", ...]` | Adds a title for each story |
| `progressTicks[tick_count=1, eol_count=10, print_count=true]` | Prints progress ticks while stories are generated |
| `dumpsJsonl` \| `writeString[fname="stories.json"]` | Writes JSONL to `stories.json` |
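
Each record that `dumpsJsonl` writes is one JSON object per line. A minimal Python sketch of reading such a file back with the standard library (the record shape follows the pipeline above; the sample story text is invented for illustration):

```python
import json

# One line of stories.json, shaped like the pipeline's output
# (the story text here is made up for the example).
sample_line = '{"content": "In Veloria, engineers taught rivers to compute.", "title": "Rivers That Compute"}'

def read_jsonl(lines):
    """Parse JSON Lines: one JSON object per line, skipping blanks."""
    return [json.loads(line) for line in lines if line.strip()]

records = read_jsonl([sample_line])
print(records[0]["title"])  # Rivers That Compute
```

In a real run you would pass `open("stories.json")` instead of the sample list.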

---

## Step 2: Indexing the Stories

*Making documents searchable with full-text indexing*

### The Problem

Raw text files don't support queries like "find stories about quantum computing." You need an index: terms mapped to documents, with relevance scoring.

### The Solution

Whoosh creates a full-text index from `stories.json`. The pipeline is defined in `Step_2_IndexStories.script`:

```
INPUT FROM "stories.json"
| readJsonl
| progressTicks[tick_count=1, print_count=True]
| indexWhoosh[index_path="./full_text_index", field_list="content,title", overwrite=True]
```

**Run it:**

```bash
chatterlang_script --script Step_2_IndexStories.script
```

### What Whoosh Provides

- Full-text search, phrase search, wildcards (`*`, `?`)
- Boolean queries (AND, OR, NOT)
- Relevance scoring

TalkPipe includes Whoosh, so there's no extra install. For larger deployments, you can swap in Elasticsearch or Solr and keep the rest of the pipeline.

### Data Flow

- **Input**: the filename `"stories.json"`
- **After `readJsonl`**: one JSON object per story
- **After `indexWhoosh`**: the same objects, plus a searchable index in `full_text_index/`

---

## Step 3: Implementing Search

*Web interface and REST API*

### The Problem

You need both a UI for humans and an API for other systems. TalkPipe provides both from one configuration.

### The Solution

`chatterlang_serve` starts a server with a search form and a JSON API. The pipeline in `Step_3_SearchStories.script`:

```
| searchWhoosh[index_path="full_text_index", field="query"]
| formatItem[field_list="document.title:Title,document.content:Content,score:Score"]
```

**Run it:**

```bash
chatterlang_serve --form-config story_search_ui.yml --title "Story Search" --display-property query --script Step_3_SearchStories.script
```

### What You Get

| Interface | Path | Use |
|-----------|------|-----|
| **Web form** | `{base_url}/stream` | Search in a browser |
| **REST API** | `{base_url}/process` (POST) | Call from other apps |

The server prints the base URL when it starts (default: `http://localhost:2025`).

**Example API call:**

```bash
curl -X POST http://localhost:2025/process \
  -H "Content-Type: application/json" \
  -d '{"query": "quantum energy"}'
```
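
The same call can be made from Python with only the standard library. This sketch mirrors the curl example above (`/process` path, JSON body); adjust the base URL to whatever the server prints:

```python
import json
from urllib import request

def build_search_request(query, base_url="http://localhost:2025"):
    """Build the POST request used by the tutorial's search API."""
    payload = json.dumps({"query": query}).encode("utf-8")
    return request.Request(
        f"{base_url}/process",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

if __name__ == "__main__":
    # Requires the Step 3 server to be running.
    with request.urlopen(build_search_request("quantum energy")) as resp:
        print(resp.read().decode("utf-8"))
```

Separating request construction from sending makes the payload easy to test without a live server.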

### Form Configuration

`story_search_ui.yml` defines the web form (fields, layout, theme). You can change the UI without touching the search logic.

---

## Next Steps

- **Tutorial 2**: Add semantic search and RAG with vector embeddings.
- **Tutorial 3**: Build report generation from search results.
- **Customize**: Swap prompts, models, or indexes; the pipeline structure stays the same.

---

*Last reviewed: 20260212*
