
Commit 843e9a3

Updated tutorials to use lancedb and pull out scripts to a separate file

1 parent: d59de5a

22 files changed: +318 −307 lines changed

CHANGELOG.md

Lines changed: 5 additions & 0 deletions

```diff
@@ -1,5 +1,10 @@
 # Changelog
 
+## 0.9.3 (in development)
+### Improvements
+- Updated tutorials to use lancedb.
+- Moved tutorial scripts out of shell scripts into their own files
+
 ## 0.9.2
 ### Improvements
 - Added ability for sources and segments to have multiple names in chatterlang.
```

docs/tutorials/Tutorial_1-Document_Indexing/README.md

Lines changed: 34 additions & 28 deletions

````diff
@@ -57,20 +57,22 @@ However, for testing and development purposes, we often need synthetic data that
 
 ### The Solution: AI-Generated Stories
 
-The first step uses TalkPipe's ChatterLang scripting language to generate 50 fictional stories about technology development. Here's what happens:
+The first step uses TalkPipe's ChatterLang scripting language to generate 50 fictional stories about technology development. The pipeline is defined in `Step_1_CreateSyntheticData.script`:
 
-```bash
-export TALKPIPE_CHATTERLANG_SCRIPT='
-LOOP 50 TIMES {
-    INPUT FROM "Write a fictitious five sentence story about technology development in an imaginary country."
-    | llmPrompt[source="ollama", model="llama3.2", multi_turn=False]
-    | toDict[field_list="_:content"]
-    | llmPrompt[source="ollama", model="llama3.2", system_prompt="Write exactly one title for this story in plain text with no markdown", field="content", set_as="title", multi_turn=False]
-    | dumpsJsonl | print;
-}
-'
+```
+LOOP 50 TIMES {
+    INPUT FROM "Write a fictitious five sentence story about technology development in an imaginary country."
+    | llmPrompt[source="ollama", model="llama3.2", multi_turn=False]
+    | toDict[field_list="_:content"]
+    | llmPrompt[source="ollama", model="llama3.2", system_prompt="Write exactly one title for this story in plain text with no markdown", field="content", set_as="title", multi_turn=False]
+    | dumpsJsonl | print;
+}
+```
+
+To run this script:
 
-python -m talkpipe.app.chatterlang_script --script CHATTERLANG_SCRIPT > stories.json
+```bash
+chatterlang_script --script Step_1_CreateSyntheticData.script > stories.json
 ```
 
 ### Breaking Down the Pipeline
@@ -166,17 +168,19 @@ Think of an index like the index at the back of a book, but much more sophistica
 
 ### The Solution: Whoosh Indexing
 
-Step 2 takes our generated stories and creates a searchable index using the Whoosh library:
+Step 2 takes our generated stories and creates a searchable index using the Whoosh library. The pipeline is defined in `Step_2_IndexStories.script`:
 
-```bash
-export TALKPIPE_CHATTERLANG_SCRIPT='
-INPUT FROM "stories.json"
-| readJsonl
-| progressTicks[tick_count=1, print_count=True]
-| indexWhoosh[index_path="./full_text_index", field_list="content,title", overwrite=True]
-'
+```
+INPUT FROM "stories.json"
+| readJsonl
+| progressTicks[tick_count=1, print_count=True]
+| indexWhoosh[index_path="./full_text_index", field_list="content,title", overwrite=True]
+```
+
+To run this script:
 
-python -m talkpipe.app.chatterlang_script --script CHATTERLANG_SCRIPT
+```bash
+chatterlang_script --script Step_2_IndexStories.script
 ```
 
 ### Understanding the Indexing Pipeline
@@ -283,15 +287,17 @@ Most search implementations require significant custom development, but TalkPipe
 
 ### The Solution: Dual Interface Search
 
-Step 3 creates both an API endpoint and a web interface using a single command:
+Step 3 creates both an API endpoint and a web interface using a single command. The pipeline is defined in `Step_3_SearchStories.script`:
 
-```bash
-export TALKPIPE_CHATTERLANG_SCRIPT='
-| searchWhoosh[index_path="full_text_index", field="query"]
-| formatItem[field_list="document.title:Title,document.content:Content,score:Score"]
-'
+```
+| searchWhoosh[index_path="full_text_index", field="query"]
+| formatItem[field_list="document.title:Title,document.content:Content,score:Score"]
+```
 
-python -m talkpipe.app.chatterlang_serve --form-config story_search_ui.yml --title "Story Search" --display-property query --script CHATTERLANG_SCRIPT
+To run this script:
+
+```bash
+chatterlang_serve --form-config story_search_ui.yml --title "Story Search" --display-property query --script Step_3_SearchStories.script
 ```
 
 ### Understanding the Search System
````
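The Step 1 pipeline described in this README emits one JSON object per line, with a `content` field (from `toDict[field_list="_:content"]`) and a `title` field (from `set_as="title"`). As a rough illustration only, here is a minimal Python sketch of consuming such JSONL output; the sample records are invented, not actual tutorial output:

```python
import json

# Hypothetical records in the shape the Step 1 pipeline emits: one JSON
# object per line, with "content" and "title" fields.
sample_jsonl = "\n".join([
    json.dumps({"content": "A five sentence story...", "title": "Story One"}),
    json.dumps({"content": "Another short story...", "title": "Story Two"}),
])

def read_jsonl(text):
    """Decode one JSON object per non-empty line."""
    for line in text.splitlines():
        if line.strip():
            yield json.loads(line)

records = list(read_jsonl(sample_jsonl))
print([r["title"] for r in records])  # ['Story One', 'Story Two']
```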
docs/tutorials/Tutorial_1-Document_Indexing/Step_1_CreateSyntheticData.script

Lines changed: 26 additions & 0 deletions

```diff
@@ -0,0 +1,26 @@
+###################################################################################
+# Step 1: Create Synthetic Data
+#
+# This script generates synthetic data for document indexing.
+# "chatterlang_script" is a command installed with talkpipe that allows you
+# to run Chatterlang scripts from the command line.
+#
+# This particular script generates a set of fictitious stories that we'll use
+# to test the document indexing and search.
+# The pipeline is:
+# 1. Loop 50 times
+# 2. "INPUT FROM..." issues a prompt to the LLM to generate a five-sentence story
+#    about technology development in an imaginary country.
+# 3. The output is processed to create a dictionary with the story content.
+# 4. A second LLM prompt generates a title for the story.
+# 5. The results are formatted as JSONL and printed to the console.
+# 6. The output is redirected to a file named "stories.json".
+###################################################################################
+
+LOOP 50 TIMES {
+    INPUT FROM "Write a fictitious five sentence story about technology development in an imaginary country."
+    | llmPrompt[source="ollama", model="llama3.2", multi_turn=False]
+    | toDict[field_list="_:content"]
+    | llmPrompt[source="ollama", model="llama3.2", system_prompt="Write exactly one title for this story in plain text with no markdown", field="content", set_as="title", multi_turn=False]
+    | dumpsJsonl | print;
+}
```
Lines changed: 2 additions & 30 deletions

```diff
@@ -1,30 +1,2 @@
-###################################################################################
-# Step 1: Create Synthetic Data
-#
-# This script generates synthetic data for document indexing.
-# "chatterlang_script" is a command installed with talkpipe that allows you
-# to run Chatterlang scripts from the command line.
-#
-# This particular script generates a set of fictitious stories that we'll use
-# to test the document indexing and search.
-# The pipeline is:
-# 1. Loop 50 times
-# 2. "INPUT FROM..." issues a prompt to the LLM to generate a five-sentence story
-#    about technology development in an imaginary country.
-# 3. The output is processed to create a dictionary with the story content.
-# 4. A second LLM prompt generates a title for the story.
-# 5. The results are formatted as JSONL and printed to the console.
-# 6. The output is redirected to a file named "stories.json".
-###################################################################################
-
-export TALKPIPE_CHATTERLANG_SCRIPT='
-LOOP 50 TIMES {
-    INPUT FROM "Write a fictitious five sentence story about technology development in an imaginary country."
-    | llmPrompt[source="ollama", model="llama3.2", multi_turn=False]
-    | toDict[field_list="_:content"]
-    | llmPrompt[source="ollama", model="llama3.2", system_prompt="Write exactly one title for this story in plain text with no markdown", field="content", set_as="title", multi_turn=False]
-    | dumpsJsonl | print;
-}
-'
-#chatterlang_script --script "
-python -m talkpipe.app.chatterlang_script --script CHATTERLANG_SCRIPT > stories.json
+#!/bin/bash
+chatterlang_script --script Step_1_CreateSyntheticData.script > stories.json
```
docs/tutorials/Tutorial_1-Document_Indexing/Step_2_IndexStories.script

Lines changed: 22 additions & 0 deletions

```diff
@@ -0,0 +1,22 @@
+###################################################################################
+# Step 2: Index Stories
+#
+# This script indexes the stories generated in Step 1 using the Whoosh library.
+# It reads the JSON file created in Step 1 and indexes the content and titles of the stories.
+# The indexed data can then be used for full-text search.
+#
+# As a side note, the first half of this script issues a single piece of data, the
+# filename. The next segment, `readJsonl`, reads the JSONL file line by line and
+# issues one decoded JSON object at a time. This is a good example of how the
+# constitution of the data being processed can change as it flows through the pipeline.
+#
+# The pipeline is:
+# 1. Read the JSONL file "stories.json" created in Step 1.
+# 2. Use the `indexWhoosh` segment to index the content and title fields.
+# 3. The index is stored in the specified path "./full_text_index".
+###################################################################################
+
+INPUT FROM "stories.json"
+| readJsonl
+| progressTicks[tick_count=1, print_count=True]
+| indexWhoosh[index_path="./full_text_index", field_list="content,title", overwrite=True]
```
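The script's comment notes how the shape of the data changes mid-pipeline: one filename goes in, and many decoded JSON objects come out. The following is a toy Python sketch of that generator-pipeline idea; the function names are illustrative, not TalkPipe's actual API, and a `StringIO` with invented records stands in for `stories.json`:

```python
import json
from io import StringIO

# Stand-in for stories.json (hypothetical records, not tutorial output).
STORIES = StringIO(
    '{"title": "A", "content": "first story"}\n'
    '{"title": "B", "content": "second story"}\n'
)

def input_from(item):
    # Source: issues exactly one piece of data (here, the "file").
    yield item

def read_jsonl(items):
    # Fan-out: one incoming buffer becomes many decoded JSON objects.
    for buf in items:
        for line in buf:
            if line.strip():
                yield json.loads(line)

def progress_ticks(items, tick_count=1):
    # Count items as they stream past, then pass them along unchanged.
    n = 0
    for item in items:
        n += 1
        if n % tick_count == 0:
            print(f"processed {n}")
        yield item

records = list(progress_ticks(read_jsonl(input_from(STORIES))))
print(len(records))  # 2
```

Because each stage is a generator, items stream through one at a time rather than being materialized as intermediate lists.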
Lines changed: 2 additions & 27 deletions

```diff
@@ -1,27 +1,2 @@
-###################################################################################
-# Step 2: Index Stories
-#
-# This script indexes the stories generated in Step 1 using the Whoosh library.
-# It reads the JSON file created in Step 1 and indexes the content and titles of the stories.
-# The indexed data can then be used for full-text search.
-#
-# As a side note, the first half of this script issues a single piece of data, the
-# filename. The next segment, `readJsonl`, reads the JSONL file line by line and
-# issues one decoded JSON object at a time. This is a good example of how the
-# constitution of the data being processed can change as it flows through the pipeline.
-#
-# The pipeline is:
-# 1. Read the JSONL file "stories.json" created in Step 1.
-# 2. Use the `indexWhoosh` segment to index the content and title fields.
-# 3. The index is stored in the specified path "./full_text_index".
-###################################################################################
-
-export TALKPIPE_CHATTERLANG_SCRIPT='
-INPUT FROM "stories.json"
-| readJsonl
-| progressTicks[tick_count=1, print_count=True]
-| indexWhoosh[index_path="./full_text_index", field_list="content,title", overwrite=True]
-'
-
-#chatterlang_script --script "
-python -m talkpipe.app.chatterlang_script --script CHATTERLANG_SCRIPT
+#!/bin/bash
+chatterlang_script --script Step_2_IndexStories.script
```
docs/tutorials/Tutorial_1-Document_Indexing/Step_3_SearchStories.script

Lines changed: 16 additions & 0 deletions

```diff
@@ -0,0 +1,16 @@
+###################################################################################
+# Step 3: Search Stories
+# This script allows users to search for indexed stories using the Whoosh library.
+# It opens two interfaces. The first is an API endpoint that accepts search queries
+# and returns matching stories. The second is a command-line interface that allows
+# users to enter search terms interactively.
+#
+# It accomplishes this by using the chatterlang_serve to serve a pipeline that
+# reads queries in the form of JSON objects, processes them, and returns results.
+# The same application provides a search-like interface, configured by a yaml file,
+# that makes it easy for a user to create the JSON sent to the endpoint without
+# needing to write any code.
+###################################################################################
+
+| searchWhoosh[index_path="full_text_index", field="query"]
+| formatItem[field_list="document.title:Title,document.content:Content,score:Score"]
```
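The `formatItem` field_list appears to map dotted paths on each search hit to display labels (e.g. `document.title` rendered under the label `Title`). This is an interpretation of the syntax as it appears in the scripts, not TalkPipe's documented behavior; a small Python sketch of that interpretation, with a hypothetical search hit:

```python
def get_path(obj, dotted):
    """Walk a dotted path like "document.title" through nested dicts."""
    for key in dotted.split("."):
        obj = obj[key]
    return obj

def format_item(item, field_list):
    """Render an item from "path:Label" pairs, e.g.
    "document.title:Title,document.content:Content,score:Score"."""
    lines = []
    for spec in field_list.split(","):
        path, label = spec.strip().split(":")
        lines.append(f"{label}: {get_path(item, path)}")
    return "\n".join(lines)

# A hypothetical search hit shaped like the field_list implies:
# a nested "document" plus a top-level "score".
hit = {"document": {"title": "Story One", "content": "..."}, "score": 0.87}
print(format_item(hit, "document.title:Title,score:Score"))
```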
Lines changed: 2 additions & 19 deletions

```diff
@@ -1,19 +1,2 @@
-###################################################################################
-# Step 3: Search Stories
-# This script allows users to search for indexed stories using the Whoosh library.
-# It opens two interface. The first is an API endpoint that accepts search queries
-# and returns matching stories. The second is a command-line interface that allows
-# users to enter search terms interactively.
-#
-# It accomplishes this by using the chatterlang_serve to serve a pipeline that
-# reads queries in the form of JSON objects, processes them, and returns results.
-# The same application provides a search-like interface, configured by a yaml file,
-# that makes it easy for a user to create the JSON sent to the endpoint without
-# needing to write any code.
-###################################################################################
-
-export TALKPIPE_CHATTERLANG_SCRIPT='
-| searchWhoosh[index_path="full_text_index", field="query"]
-| formatItem[field_list="document.title:Title,document.content:Content,score:Score"]
-'
-chatterlang_serve --form-config story_search_ui.yml --title \"Story\ Search\" --display-property query --script CHATTERLANG_SCRIPT
+#!/bin/bash
+chatterlang_serve --form-config story_search_ui.yml --title "Story Search" --display-property query --script Step_3_SearchStories.script
```

docs/tutorials/Tutorial_2-Search_by_Example_and_RAG/README.md

Lines changed: 44 additions & 32 deletions

````diff
@@ -34,16 +34,20 @@ Vector embeddings solve this by converting text into high-dimensional mathematic
 
 ### The Implementation
 
+The pipeline is defined in `Step_1_CreateVectorDatabase.script`:
+
+```
+INPUT FROM "../Tutorial_1-Document_Indexing/stories.json"
+| readJsonl
+| progressTicks[tick_count=1, print_count=True]
+| llmEmbed[field="content", source="ollama", model="mxbai-embed-large", set_as="vector"]
+| addToLanceDB[path="./vector_index", table_name="stories", vector_field="vector", metadata_field_list="title,content", overwrite=True]
+```
+
+To run this script:
+
 ```bash
-export TALKPIPE_CHATTERLANG_SCRIPT='
-INPUT FROM "../Tutorial_1-Document_Indexing/stories.json"
-| readJsonl
-| progressTicks[tick_count=1, print_count=True]
-| llmEmbed[field="content", source="ollama", model="mxbai-embed-large", set_as="vector"]
-| addVector[path="./vector_index", vector_field="vector", metadata_field_list="title,content", overwrite=True]
-'
-
-python -m talkpipe.app.chatterlang_script --script CHATTERLANG_SCRIPT
+chatterlang_script --script Step_1_CreateVectorDatabase.script
 ```
 
 ### Breaking Down the Pipeline
@@ -67,12 +71,12 @@ The `mxbai-embed-large` model is specifically designed for semantic search - it'
 
 **3. Building the Index**
 ```
-| addVector[path="./vector_index", vector_field="vector", metadata_field_list="title,content", overwrite=True]
+| addToLanceDB[path="./vector_index", table_name="stories", vector_field="vector", metadata_field_list="title,content", overwrite=True]
 ```
 This creates a specialized index that:
-- Stores vectors for similarity search
+- Stores vectors for similarity search in a LanceDB table named "stories"
 - Preserves original metadata (title and content) for retrieval
-- Enables fast nearest-neighbor queries
+- Enables fast nearest-neighbor queries using LanceDB's efficient vector search capabilities
 
 ### Real-World Applications
 
@@ -96,15 +100,19 @@ Your users don't always know the right keywords. Sometimes they have an example
 
 ### The Solution: Semantic Search Interface
 
+The pipeline is defined in `Step_2_SearchByExample.script`:
+
+```
+| copy
+| llmEmbed[field="example", source="ollama", model="mxbai-embed-large", set_as="vector"]
+| searchLanceDB[field="vector", path="./vector_index", table_name="stories", limit=10]
+| formatItem[field_list="document.title:Title, document.content:Content, score:Score"]
+```
+
+To run this script:
+
 ```bash
-export TALKPIPE_CHATTERLANG_SCRIPT='
-| copy
-| llmEmbed[field="example", source="ollama", model="mxbai-embed-large", set_as="vector"]
-| searchVector[vector_field="vector", path="./vector_index"]
-| formatItem[field_list="document.title:Title, document.content:Content, score:Score"]
-'
-
-python -m talkpipe.app.chatterlang_serve --form-config story_by_example_ui.yml --display-property example --script CHATTERLANG_SCRIPT
+chatterlang_serve --form-config story_by_example_ui.yml --display-property example --script Step_2_SearchByExample.script
 ```
 
 ### Understanding the Search Pipeline
@@ -123,9 +131,9 @@ The user's example text is converted to a vector using the same model that index
 
 **3. Vector Search**
 ```
-| searchVector[vector_field="vector", path="./vector_index"]
+| searchLanceDB[field="vector", path="./vector_index", table_name="stories", limit=10]
 ```
-This finds the documents whose vectors are closest to the query vector - literally the nearest neighbors in high-dimensional space.
+This finds the documents whose vectors are closest to the query vector - literally the nearest neighbors in high-dimensional space. LanceDB provides efficient approximate nearest neighbor search for fast retrieval.
 
 **4. Result Formatting**
 ```
@@ -167,23 +175,27 @@ Finding relevant documents is helpful, but what users often really want is an an
 
 ### The RAG Implementation
 
+The pipeline is defined in `Step_3_SpecializedRag.script`:
+
+```
+| copy
+| llmEmbed[field="example", source="ollama", model="mxbai-embed-large", set_as="vector"]
+| searchLanceDB[field="vector", path="./vector_index", table_name="stories", all_results_at_once=True, set_as="results"]
+| ragPrompt
+| llmPrompt[source="ollama", model="llama3.2"]
+```
+
+To run this script:
+
 ```bash
-export TALKPIPE_CHATTERLANG_SCRIPT='
-| copy
-| llmEmbed[field="example", source="ollama", model="mxbai-embed-large", set_as="vector"]
-| searchVector[vector_field="vector", path="./vector_index", all_results_at_once=True, set_as="results"]
-| ragPrompt
-| llmPrompt[source="ollama", model="llama3.2"]
-'
-
-python -m talkpipe.app.chatterlang_serve --form-config story_by_example_ui.yml --load-module step_3_extras.py --display-property example --script CHATTERLANG_SCRIPT
+chatterlang_serve --form-config story_by_example_ui.yml --load-module step_3_extras.py --display-property example --script Step_3_SpecializedRag.script
 ```
 
 ### What's Different in the RAG Pipeline
 
 **1. Batch Results Collection**
 ```
-| searchVector[..., all_results_at_once=True, set_as="results"]
+| searchLanceDB[..., all_results_at_once=True, set_as="results"]
 ```
 Instead of processing results one by one, we collect all search results together. This allows the next step to see the full context.
 
````
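Both the search-by-example and RAG steps in this README rest on nearest-neighbor lookup over embedding vectors. Here is a self-contained Python sketch of the idea using cosine similarity over toy 3-dimensional vectors; real embeddings from a model such as mxbai-embed-large are far higher-dimensional, and LanceDB uses approximate rather than exhaustive search:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" keyed by an invented topic label.
index = {
    "robots": [0.9, 0.1, 0.0],
    "farming": [0.1, 0.9, 0.2],
    "space": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]

# Exhaustive nearest-neighbor search: pick the stored vector most
# similar to the query vector.
best = max(index, key=lambda k: cosine(query, index[k]))
print(best)  # robots
```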
docs/tutorials/Tutorial_2-Search_by_Example_and_RAG/Step_1_CreateVectorDatabase.script

Lines changed: 17 additions & 0 deletions

```diff
@@ -0,0 +1,17 @@
+###################################################################################
+# This script creates a vector database using the provided configuration file.
+# It uses the `chatterlang_script` command to run a Chatterlang script that
+# uses the synthetic data generated in the previous tutorial.
+#
+# The pipeline is:
+# 1. Read the JSONL file "stories.json" created in the previous tutorial.
+# 2. Use the `llmEmbed` segment to generate embeddings for the content field
+#    using the specified model.
+# 3. The embeddings are stored in a vector index at the specified path.
+###################################################################################
+
+INPUT FROM "../Tutorial_1-Document_Indexing/stories.json"
+| readJsonl
+| progressTicks[tick_count=1, print_count=True]
+| llmEmbed[field="content", source="ollama", model="mxbai-embed-large", set_as="vector"]
+| addToLanceDB[path="./vector_index", table_name="stories", vector_field="vector", metadata_field_list="title,content", overwrite=True]
```

0 commit comments
