
Commit 606d918

docs for regextextextractor (#10202)
1 parent 69a4661 commit 606d918

9 files changed: +410 -5 lines changed

docs-website/docs/pipeline-components/extractors.mdx

Lines changed: 2 additions & 1 deletion
@@ -10,4 +10,5 @@ slug: "/extractors"
 | --- | --- |
 | [LLMDocumentContentExtractor](extractors/llmdocumentcontentextractor.mdx) | Extracts textual content from image-based documents using a vision-enabled Large Language Model (LLM). |
 | [LLMMetadataExtractor](extractors/llmmetadataextractor.mdx) | Extracts metadata from documents using a Large Language Model. The metadata is extracted by providing a prompt to an LLM that generates it. |
-| [NamedEntityExtractor](extractors/namedentityextractor.mdx) | Extracts predefined entities out of a piece of text and writes them into documents’ meta field. |
+| [NamedEntityExtractor](extractors/namedentityextractor.mdx) | Extracts predefined entities out of a piece of text and writes them into documents' meta field. |
+| [RegexTextExtractor](extractors/regextextextractor.mdx) | Extracts text from chat messages or strings using a regular expression pattern. |
Lines changed: 133 additions & 0 deletions
@@ -0,0 +1,133 @@
---
title: "RegexTextExtractor"
id: regextextextractor
slug: "/regextextextractor"
description: "Extracts text from chat messages or strings using a regular expression pattern."
---

# RegexTextExtractor

Extracts text from chat messages or strings using a regular expression pattern.

<div className="key-value-table">

| | |
| --- | --- |
| **Most common position in a pipeline** | After a [Chat Generator](../generators.mdx) to parse structured output from LLM responses |
| **Mandatory init variables** | `regex_pattern`: The regular expression pattern used to extract text |
| **Mandatory run variables** | `text_or_messages`: A string or a list of `ChatMessage` objects to search through |
| **Output variables** | `captured_text`: The text captured by the first capture group, or the entire match if the pattern has no capture group |
| **API reference** | [Extractors](/reference/extractors-api) |
| **GitHub link** | https://github.com/deepset-ai/haystack/blob/main/haystack/components/extractors/regex_text_extractor.py |

</div>

## Overview

`RegexTextExtractor` searches a string or a `ChatMessage` for a regular expression pattern and returns the text captured by the first capture group. This is useful for pulling structured information out of LLM outputs that follow a known format, such as XML-like tags.

The component accepts both plain strings and lists of `ChatMessage` objects. When given a list of messages, it processes only the last message.

The regex pattern should include at least one capture group (a subpattern enclosed in parentheses) to specify which text to extract. If the pattern has no capture group, the entire match is returned instead.
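
The selection rules described above can be sketched with Python's standard `re` module. This is only an illustration of the documented behavior, not the component's actual code: the `Message` class and the `extract` helper are hypothetical names, and the use of `re.search` (first match anywhere in the text) is an assumption.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Message:
    """Hypothetical stand-in for ChatMessage; only the text matters here."""
    text: str


def extract(pattern: str, text_or_messages: Union[str, List[Message]]) -> Optional[str]:
    """Sketch of the described behavior (assumed, not the real implementation)."""
    if isinstance(text_or_messages, str):
        text = text_or_messages
    elif text_or_messages:
        # Only the last message in the list is searched.
        text = text_or_messages[-1].text
    else:
        return None
    match = re.search(pattern, text)
    if match is None:
        return None
    # match.groups() is empty when the pattern has no capture group.
    return match.group(1) if match.groups() else match.group(0)


history = [Message("<answer>first</answer>"), Message("<answer>second</answer>")]
from_last = extract(r"<answer>(.*?)</answer>", history)
whole_match = extract(r"<answer>.*?</answer>", "<answer>42</answer>")
print(from_last)    # second
print(whole_match)  # <answer>42</answer>
```

Earlier messages are ignored even when they contain a match, so a pattern that is missing from the final message yields no result in this sketch.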

### Handling no matches

By default, when the pattern doesn't match, the component returns an empty dictionary `{}`. To return `{'captured_text': ''}` instead, set the `return_empty_on_no_match` parameter to `False`:

```python
from haystack.components.extractors import RegexTextExtractor

# Default behavior: returns an empty dict when there is no match
extractor_default = RegexTextExtractor(regex_pattern=r'<answer>(.*?)</answer>')
result = extractor_default.run(text_or_messages="No answer tags here")
print(result)  # Output: {}

# Alternative behavior: returns an empty string under 'captured_text' when there is no match
extractor_explicit = RegexTextExtractor(
    regex_pattern=r'<answer>(.*?)</answer>',
    return_empty_on_no_match=False,
)
result = extractor_explicit.run(text_or_messages="No answer tags here")
print(result)  # Output: {'captured_text': ''}
```

:::note
The default behavior of returning `{}` when no match is found is deprecated and will change in a future release to return `{'captured_text': ''}` instead. Set `return_empty_on_no_match=False` explicitly if you want the new behavior now.
:::
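
Because the no-match result can take either shape until the default changes, downstream code may want to read the output defensively. A minimal sketch, assuming only the two result shapes described above; `captured_or_default` is a hypothetical helper, not part of Haystack:

```python
from typing import Dict


def captured_or_default(result: Dict[str, str], default: str = "") -> str:
    """Hypothetical helper: read 'captured_text' whether or not the key is present."""
    return result.get("captured_text", default)


no_match_current = {}                    # shape returned by the current default
no_match_future = {"captured_text": ""}  # shape after the announced change
matched = {"captured_text": "42"}

print(captured_or_default(no_match_current) == captured_or_default(no_match_future))  # True
print(captured_or_default(matched))  # 42
```

Reading the key with `dict.get` keeps calling code unchanged when the default flips.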

## Usage

### On its own

This example extracts a URL from an XML-like tag structure:

```python
from haystack.components.extractors import RegexTextExtractor

# Create an extractor with a pattern that captures the URL value
extractor = RegexTextExtractor(regex_pattern='<issue url="(.+?)">')

# Extract from a string
result = extractor.run(text_or_messages='<issue url="github.com/example/issue/123">Issue description</issue>')
print(result)
# Output: {'captured_text': 'github.com/example/issue/123'}
```

### With ChatMessages

When working with LLM outputs in chat pipelines, you can extract structured data from `ChatMessage` objects:

```python
from haystack.components.extractors import RegexTextExtractor
from haystack.dataclasses import ChatMessage

extractor = RegexTextExtractor(regex_pattern=r'```json\s*(.*?)\s*```', return_empty_on_no_match=False)

# Simulating an LLM response with JSON in a code block
messages = [
    ChatMessage.from_user("Extract the data"),
    ChatMessage.from_assistant('Here is the data:\n```json\n{"name": "Alice", "age": 30}\n```'),
]

result = extractor.run(text_or_messages=messages)
print(result)
# Output: {'captured_text': '{"name": "Alice", "age": 30}'}
```
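
The captured text is still a plain string, so a likely next step is to parse it. The sketch below uses only the standard `json` module and starts from the result shape shown in the example above; the error handling is an assumption about what a caller might want, not something the component does.

```python
import json

# Shape of the result shown in the example above.
result = {"captured_text": '{"name": "Alice", "age": 30}'}

captured = result.get("captured_text", "")
try:
    # An empty capture means nothing matched, so there is nothing to parse.
    data = json.loads(captured) if captured else None
except json.JSONDecodeError:
    # LLM output is not guaranteed to be valid JSON.
    data = None

print(data)  # {'name': 'Alice', 'age': 30}
```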

### In a pipeline

This example extracts a specific section from a structured LLM response. The pipeline asks an LLM to analyze a topic and format its response with XML-like tags for different sections. The `RegexTextExtractor` then pulls out only the summary and discards the rest of the response.

The LLM generates a full response with both `<analysis>` and `<summary>` sections, but only the content inside the `<summary>` tags is extracted and returned.

```python
from haystack import Pipeline
from haystack.components.builders import ChatPromptBuilder
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.extractors import RegexTextExtractor
from haystack.dataclasses import ChatMessage

pipe = Pipeline()
pipe.add_component("prompt_builder", ChatPromptBuilder())
pipe.add_component("llm", OpenAIChatGenerator())
pipe.add_component("extractor", RegexTextExtractor(regex_pattern=r'<summary>(.*?)</summary>', return_empty_on_no_match=False))

pipe.connect("prompt_builder.prompt", "llm.messages")
pipe.connect("llm.replies", "extractor.text_or_messages")

# Instruct the LLM to use a specific structured format
messages = [
    ChatMessage.from_system(
        "Respond using this exact format:\n"
        "<analysis>Your detailed analysis here</analysis>\n"
        "<summary>A one-sentence summary</summary>"
    ),
    ChatMessage.from_user("What are the main benefits and drawbacks of remote work?"),
]

# Run the pipeline (requires the OPENAI_API_KEY environment variable)
result = pipe.run({"prompt_builder": {"template": messages}})
print(result["extractor"]["captured_text"])
# Example output (varies between runs): Remote work offers flexibility and eliminates commuting but can lead to isolation and blurred work-life boundaries.
```
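
One thing to watch with a pattern like `<summary>(.*?)</summary>`: in Python's `re` module, `.` does not match a newline by default, so a summary that spans several lines would not match. Whether the component passes any flags to `re` is not stated here, so the sketch below shows plain `re` behavior only. The sample reply is invented for illustration, and using the inline `(?s)` flag inside `regex_pattern` is an assumption to verify against the component.

```python
import re

# Invented reply whose summary spans two lines.
reply = "<analysis>Pros and cons.</analysis>\n<summary>Flexible,\nbut isolating.</summary>"

# Default flags: "." stops at the newline, so the closing tag is never reached.
single_line = re.search(r"<summary>(.*?)</summary>", reply)

# Inline (?s) flag (DOTALL): "." also matches newlines.
multi_line = re.search(r"(?s)<summary>(.*?)</summary>", reply)

print(single_line)          # None
print(multi_line.group(1))  # prints the two-line summary
```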

docs-website/sidebars.js

Lines changed: 1 addition & 0 deletions
@@ -344,6 +344,7 @@ export default {
 'pipeline-components/extractors/llmdocumentcontentextractor',
 'pipeline-components/extractors/llmmetadataextractor',
 'pipeline-components/extractors/namedentityextractor',
+'pipeline-components/extractors/regextextextractor',
 ],
 },
 {

docs-website/versioned_docs/version-2.20/pipeline-components/extractors.mdx

Lines changed: 2 additions & 1 deletion
@@ -10,4 +10,5 @@ slug: "/extractors"
 | --- | --- |
 | [LLMDocumentContentExtractor](extractors/llmdocumentcontentextractor.mdx) | Extracts textual content from image-based documents using a vision-enabled Large Language Model (LLM). |
 | [LLMMetadataExtractor](extractors/llmmetadataextractor.mdx) | Extracts metadata from documents using a Large Language Model. The metadata is extracted by providing a prompt to an LLM that generates it. |
-| [NamedEntityExtractor](extractors/namedentityextractor.mdx) | Extracts predefined entities out of a piece of text and writes them into documents’ meta field. |
+| [NamedEntityExtractor](extractors/namedentityextractor.mdx) | Extracts predefined entities out of a piece of text and writes them into documents' meta field. |
+| [RegexTextExtractor](extractors/regextextextractor.mdx) | Extracts text from chat messages or strings using a regular expression pattern. |
Lines changed: 133 additions & 0 deletions
@@ -0,0 +1,133 @@
---
title: "RegexTextExtractor"
id: regextextextractor
slug: "/regextextextractor"
description: "Extracts text from chat messages or strings using a regular expression pattern."
---

# RegexTextExtractor

Extracts text from chat messages or strings using a regular expression pattern.

<div className="key-value-table">

| | |
| --- | --- |
| **Most common position in a pipeline** | After a [Chat Generator](../generators.mdx) to parse structured output from LLM responses |
| **Mandatory init variables** | `regex_pattern`: The regular expression pattern used to extract text |
| **Mandatory run variables** | `text_or_messages`: A string or a list of `ChatMessage` objects to search through |
| **Output variables** | `captured_text`: The text captured by the first capture group, or the entire match if the pattern has no capture group |
| **API reference** | [Extractors](/reference/extractors-api) |
| **GitHub link** | https://github.com/deepset-ai/haystack/blob/main/haystack/components/extractors/regex_text_extractor.py |

</div>

## Overview

`RegexTextExtractor` searches a string or a `ChatMessage` for a regular expression pattern and returns the text captured by the first capture group. This is useful for pulling structured information out of LLM outputs that follow a known format, such as XML-like tags.

The component accepts both plain strings and lists of `ChatMessage` objects. When given a list of messages, it processes only the last message.

The regex pattern should include at least one capture group (a subpattern enclosed in parentheses) to specify which text to extract. If the pattern has no capture group, the entire match is returned instead.

### Handling no matches

By default, when the pattern doesn't match, the component returns an empty dictionary `{}`. To return `{'captured_text': ''}` instead, set the `return_empty_on_no_match` parameter to `False`:

```python
from haystack.components.extractors import RegexTextExtractor

# Default behavior: returns an empty dict when there is no match
extractor_default = RegexTextExtractor(regex_pattern=r'<answer>(.*?)</answer>')
result = extractor_default.run(text_or_messages="No answer tags here")
print(result)  # Output: {}

# Alternative behavior: returns an empty string under 'captured_text' when there is no match
extractor_explicit = RegexTextExtractor(
    regex_pattern=r'<answer>(.*?)</answer>',
    return_empty_on_no_match=False,
)
result = extractor_explicit.run(text_or_messages="No answer tags here")
print(result)  # Output: {'captured_text': ''}
```

:::note
The default behavior of returning `{}` when no match is found is deprecated and will change in a future release to return `{'captured_text': ''}` instead. Set `return_empty_on_no_match=False` explicitly if you want the new behavior now.
:::

## Usage

### On its own

This example extracts a URL from an XML-like tag structure:

```python
from haystack.components.extractors import RegexTextExtractor

# Create an extractor with a pattern that captures the URL value
extractor = RegexTextExtractor(regex_pattern='<issue url="(.+?)">')

# Extract from a string
result = extractor.run(text_or_messages='<issue url="github.com/example/issue/123">Issue description</issue>')
print(result)
# Output: {'captured_text': 'github.com/example/issue/123'}
```

### With ChatMessages

When working with LLM outputs in chat pipelines, you can extract structured data from `ChatMessage` objects:

```python
from haystack.components.extractors import RegexTextExtractor
from haystack.dataclasses import ChatMessage

extractor = RegexTextExtractor(regex_pattern=r'```json\s*(.*?)\s*```', return_empty_on_no_match=False)

# Simulating an LLM response with JSON in a code block
messages = [
    ChatMessage.from_user("Extract the data"),
    ChatMessage.from_assistant('Here is the data:\n```json\n{"name": "Alice", "age": 30}\n```'),
]

result = extractor.run(text_or_messages=messages)
print(result)
# Output: {'captured_text': '{"name": "Alice", "age": 30}'}
```

### In a pipeline

This example extracts a specific section from a structured LLM response. The pipeline asks an LLM to analyze a topic and format its response with XML-like tags for different sections. The `RegexTextExtractor` then pulls out only the summary and discards the rest of the response.

The LLM generates a full response with both `<analysis>` and `<summary>` sections, but only the content inside the `<summary>` tags is extracted and returned.

```python
from haystack import Pipeline
from haystack.components.builders import ChatPromptBuilder
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.extractors import RegexTextExtractor
from haystack.dataclasses import ChatMessage

pipe = Pipeline()
pipe.add_component("prompt_builder", ChatPromptBuilder())
pipe.add_component("llm", OpenAIChatGenerator())
pipe.add_component("extractor", RegexTextExtractor(regex_pattern=r'<summary>(.*?)</summary>', return_empty_on_no_match=False))

pipe.connect("prompt_builder.prompt", "llm.messages")
pipe.connect("llm.replies", "extractor.text_or_messages")

# Instruct the LLM to use a specific structured format
messages = [
    ChatMessage.from_system(
        "Respond using this exact format:\n"
        "<analysis>Your detailed analysis here</analysis>\n"
        "<summary>A one-sentence summary</summary>"
    ),
    ChatMessage.from_user("What are the main benefits and drawbacks of remote work?"),
]

# Run the pipeline (requires the OPENAI_API_KEY environment variable)
result = pipe.run({"prompt_builder": {"template": messages}})
print(result["extractor"]["captured_text"])
# Example output (varies between runs): Remote work offers flexibility and eliminates commuting but can lead to isolation and blurred work-life boundaries.
```

docs-website/versioned_docs/version-2.21/pipeline-components/extractors.mdx

Lines changed: 2 additions & 1 deletion
@@ -10,4 +10,5 @@ slug: "/extractors"
 | --- | --- |
 | [LLMDocumentContentExtractor](extractors/llmdocumentcontentextractor.mdx) | Extracts textual content from image-based documents using a vision-enabled Large Language Model (LLM). |
 | [LLMMetadataExtractor](extractors/llmmetadataextractor.mdx) | Extracts metadata from documents using a Large Language Model. The metadata is extracted by providing a prompt to an LLM that generates it. |
-| [NamedEntityExtractor](extractors/namedentityextractor.mdx) | Extracts predefined entities out of a piece of text and writes them into documents’ meta field. |
+| [NamedEntityExtractor](extractors/namedentityextractor.mdx) | Extracts predefined entities out of a piece of text and writes them into documents' meta field. |
+| [RegexTextExtractor](extractors/regextextextractor.mdx) | Extracts text from chat messages or strings using a regular expression pattern. |
