# Tutorial 1: Document Indexing and Search
**Build a full document search system in under 30 minutes—from zero to a working web interface and API.**
TalkPipe lets you prototype searchable document systems without external databases or custom code. This tutorial shows how: you’ll generate test content, index it with full-text search, and expose it via a web UI and REST API—all using TalkPipe’s built-in components.
---
## Why This Tutorial?
- **Start fast**: No database setup—Whoosh is included with TalkPipe.
- **Real results**: You get a working search UI and API, not just a demo.
- **Learn the basics**: Foundation for Tutorials 2 (semantic search) and 3 (report generation).
- **Reuse it**: The same patterns work for docs, wikis, tickets, and other text collections.
---
## What You'll Build
A complete document search system in three steps:

| Step | Goal | Outcome |
|------|------|---------|
| **1** | Create content | 50 AI-generated stories in `stories.json` |
| **2** | Build the index | Full-text search index in `full_text_index/` |
| **3** | Add search | Web form + REST API for searching |
---
## Prerequisites

- **Step 1 only**: Ollama installed locally with the `llama3.2` model (or adjust the script to use another model)
- **Steps 2 and 3**: No external databases or services required; Whoosh, a pure-Python full-text search library, is included with TalkPipe
> **Tip:** If you skip Step 1, you can use the included `stories.json` and go straight to Step 2.
---
## Quick Start
All commands must be run from the tutorial directory:
```bash
cd docs/tutorials/Tutorial_1-Document_Indexing
```
| Step | Command | Time |
|------|---------|------|
| 1 | `./Step_1_CreateSyntheticData.sh` or `chatterlang_script --script Step_1_CreateSyntheticData.script` | ~5–10 min |
| 2 | `./Step_2_IndexStories.sh` or `chatterlang_script --script Step_2_IndexStories.script` | ~5 sec |
| 3 | `./Step_3_SearchStories.sh` or `chatterlang_serve --form-config story_search_ui.yml --title "Story Search" --display-property query --script Step_3_SearchStories.script` | Starts server |
Step 3 starts a web server. Use the URL printed in the terminal—append `/stream` for the search form, or POST to the base URL for the REST API.
---
## Step 1: Creating Synthetic Data
*Generating realistic test content for indexing*
### The Problem
You need documents to search. In real projects you might use papers, wikis, or tickets. For prototyping, synthetic data is safer and easier to control.
### The Solution
A ChatterLang pipeline in `Step_1_CreateSyntheticData.script` generates 50 short stories using an LLM:
```
INPUT FROM "Generating 50 synthetic stories into stories.json" | print;
INPUT FROM echo[data="Write a fictitious five sentence story about technology development.", n=50]
    | llmPrompt[source="ollama", model="llama3.2", multi_turn=False]
    | toDict[field_list="_:content"]
    | llmPrompt[system_prompt="Write exactly one title for this story in plain text with no markdown", field="content", set_as="title", multi_turn=False]
    | progressTicks[tick_count=1, print_count=True]
    | dumpsJsonl
    | writeString[fname="stories.json"];
```
The `echo` source generates 50 copies of the prompt, which are then sent to the LLM to generate 50 different stories. The `multi_turn=False` parameter ensures each story is independent, preventing the AI from building on previous stories.
The `progressTicks` segment provides visual feedback during processing. Finally, each complete document (with both content and title) is formatted as JSON Lines (JSONL) and written to `stories.json` using `writeString`.
### Pipeline Breakdown
| Segment | Purpose |
|---------|---------|
| `echo[data="...", n=50]` | Sends 50 copies of the prompt to the pipeline |
| `llmPrompt[source="ollama", model="llama3.2", multi_turn=False]` | Generates 50 different stories (no context between calls) |
| `toDict[field_list="_:content"]` | Puts each story into a dict with a `content` field |
| `llmPrompt[..., field="content", set_as="title", ...]` | Adds a title for each story |
| `dumpsJsonl` \| `writeString[fname="stories.json"]` | Writes JSONL to `stories.json` |
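After Step 1 runs, `stories.json` contains one JSON object per line. A quick way to inspect the result in plain Python (the sample records below are hypothetical; real ones come from the LLM):

```python
import json

# Hypothetical records in the shape Step 1 writes: one JSON object
# per line, each with "content" and "title" fields.
sample_jsonl = "\n".join([
    json.dumps({"content": "A lab builds a self-cooling microchip.", "title": "The Cool Chip"}),
    json.dumps({"content": "An AI learns to schedule satellite launches.", "title": "Launch Planner"}),
])

# Parse the JSONL text line by line, skipping blank lines.
stories = [json.loads(line) for line in sample_jsonl.splitlines() if line.strip()]
titles = [story["title"] for story in stories]
```

To inspect the real file, replace `sample_jsonl.splitlines()` with iteration over `open("stories.json")`.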
---
## Step 2: Indexing the Stories
*Making documents searchable with full-text indexing*
### The Problem
Raw text files don’t support queries like “find stories about quantum computing.” You need an index: terms → documents, with relevance scoring.
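Conceptually, a full-text index maps each term to the documents that contain it, so a query is a lookup rather than a scan. A toy sketch of the idea (illustration only; Whoosh's actual storage, tokenization, and scoring are far more sophisticated):

```python
from collections import defaultdict

docs = {
    1: "a breakthrough in quantum computing",
    2: "a startup ships a solar powered drone",
}

# Build an inverted index: term -> set of document ids containing it.
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted[term].add(doc_id)

# Querying is now a dictionary lookup instead of reading every document.
matches = inverted["quantum"]
```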
### The Solution
The pipeline in `Step_2_IndexStories.script` reads `stories.json` with `readJsonl`, converting each line back into a Python object, and passes each story to `indexWhoosh`, which builds the index in the `./full_text_index` directory, makes both the `content` and `title` fields searchable, and overwrites any existing index (`overwrite=True`).
### What Whoosh Provides

Whoosh is a pure-Python search library that provides:

- **Full-text search capabilities** - Find documents containing specific terms
- **Boolean queries** - Combine terms with AND, OR, NOT operators
- **Phrase searching** - Find exact phrases within documents
- **Wildcard support** - Search with partial matches using `*` and `?`
- **Relevance scoring** - Rank results by how well they match queries
TalkPipe includes Whoosh, so there’s no extra install. For larger deployments, you can swap in Elasticsearch or Solr and keep the rest of the pipeline.
### Data Flow
- **Input**: Filename `"stories.json"`
- **After `readJsonl`**: One JSON object per story
- **After `indexWhoosh`**: Same objects, plus a searchable index in `full_text_index/`
---
## Step 3: Implementing Search
*Web interface and REST API*
### The Problem
You need both a UI for humans and an API for other systems. TalkPipe provides both from one configuration.
### The Solution
`chatterlang_serve` starts a server with both a search form and a JSON API from a single command. The pipeline in `Step_3_SearchStories.script` queries the index, selects the relevant information for each hit (title, content, and relevance score), and prepares the results for both API responses and web display.
The server prints the base URL when it starts (default: `http://localhost:2025`).
**Example API call:**
```bash
curl -X POST http://localhost:2025/process \
  -H "Content-Type: application/json" \
  -d '{"query": "quantum energy"}'
```
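The same call from Python, using only the standard library (a sketch; the `/process` path follows the curl example above, so adjust it if your server prints a different endpoint):

```python
import json
import urllib.request

def search_stories(query, base_url="http://localhost:2025"):
    """POST a search query to the running chatterlang_serve endpoint."""
    payload = json.dumps({"query": query}).encode("utf-8")
    request = urllib.request.Request(
        base_url + "/process",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the Step 3 server running, `search_stories("quantum energy")` returns the parsed JSON results.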
### Form Configuration
`story_search_ui.yml` defines the web form (fields, layout, theme). You can change the UI without touching the search logic.
---
## Next Steps
- **Tutorial 2**: Add semantic search and RAG with vector embeddings.
- **Tutorial 3**: Build report generation from search results.
- **Customize**: Swap prompts, models, or indexes; the pipeline structure stays the same.