Commit eb816b5

Merge branch 'main' into nb-failures-debug-1
2 parents: 1dd3229 + 4855635

File tree: 10 files changed (+375, -165 lines)


docs/concepts/async_streaming.ipynb

Lines changed: 4 additions & 2 deletions
@@ -4,9 +4,11 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Async Stream-validate LLM responses\n",
+    "# Async stream-validate LLM responses\n",
     "\n",
-    "Asynchronous behavior is generally useful in LLM applciations. It allows multiple, long-running LLM requests to execute at once. Adding streaming to this situation allows us to make non-blocking, iterative validations over each stream as chunks arrive. This document explores how to implement this behavior using the Guardrails framework.\n",
+    "Asynchronous behavior is generally useful in LLM applications. It allows multiple, long-running LLM requests to execute at once. \n",
+    "\n",
+    "With streaming, you can make non-blocking, iterative validations over each stream as chunks arrive. This document explores how to implement this behavior using the Guardrails framework.\n",
     "\n",
     "**Note**: learn more about streaming [here](./streaming).\n"
    ]
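
The flow this notebook introduces looks roughly like the following sketch; the validator choice, model name, and exact call signature are assumptions for illustration, not the notebook's verbatim cells.

```python
# Minimal sketch: async stream-validation with AsyncGuard.
# Assumes `guardrails hub install hub://guardrails/toxic_language` has been run
# and that the guard forwards LiteLLM-style kwargs; adjust to the real notebook.
import asyncio

import litellm
from guardrails import AsyncGuard
from guardrails.hub import ToxicLanguage  # assumed Hub validator import


async def main() -> None:
    guard = AsyncGuard().use(ToxicLanguage(on_fail="fix"))

    # With stream=True, the call yields validated chunks as they arrive, so each
    # chunk is checked without blocking other in-flight requests.
    fragments = await guard(
        litellm.acompletion,
        model="gpt-4o-mini",  # hypothetical model choice
        messages=[{"role": "user", "content": "Write a short product blurb."}],
        stream=True,
    )
    async for fragment in fragments:
        print(fragment.validated_output, end="")


asyncio.run(main())
```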

docs/concepts/deploying.md

Lines changed: 54 additions & 21 deletions
Large diffs are not rendered by default.

docs/concepts/performance.md

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
+# Performance
+
+Performance for Gen AI apps can mean two things:
+
+* Application performance: The total time taken to return a response to a user request
+* Accuracy: How often a given LLM returns an accurate answer
+
+This document addresses application performance and strategies to minimize latency in responses. For tracking accuracy, see our [Telemetry](/docs/concepts/telemetry) page.
+
+## Basic application performance
+
+Guardrails consist of a guard and a series of validators that the guard uses to validate LLM responses. Generally, a guard runs in sub-10ms performance. Validators should only add around 100ms of additional latency when configured correctly.
+
+The largest latency and performance issues will come from your selection of LLM. It's important to capture metrics around LLM usage and assess how different LLMs handle different workloads in terms of both performance and result accuracy. [Guardrails AI's LiteLLM support](https://www.guardrailsai.com/blog/guardrails-litellm-validate-llm-output) makes it easy to switch out LLMs with minor changes to your guard calls.
+
+## Performance tips
+
+Here are a few tips to get the best performance out of your Guardrails-enabled applications.
+
+**Use async guards for the best performance**. Use the `AsyncGuard` class to make concurrent calls to multiple LLMs and process the response chunks as they arrive. For more information, see [Async stream-validate LLM responses](/docs/async-streaming).
+
+**Use a remote server for heavy workloads**. More compute-intensive workloads, such as remote inference endpoints, work best when run with dedicated memory and CPU. For example, guards that use a single Machine Learning (ML) model for validation can run in milliseconds on GPU-equipped machines, while they may take tens of seconds on normal CPUs. However, guardrailing orchestration itself performs better on general compute.
+
+To account for this, offload performance-critical validation work by:
+
+* Using [Guardrails Server](/docs/concepts/deploying) to run certain guard executions on a dedicated server
+* Leveraging [remote validation inference](/docs/concepts/remote_validation_inference) to configure validators to call a REST API for inference results instead of running them locally
+
+The Guardrails client/server model is hosted via Flask. For best performance, [follow our guidelines on configuring your WSGI servers properly](/docs/concepts/deploying) for production.
+
+**Use purpose-built LLMs for re-validators**. When a guard fails, you can decide how to handle it by setting the appropriate OnFail action. The `OnFailAction.REASK` and `OnFailAction.FIX_REASK` actions will ask the LLM to correct its output, with `OnFailAction.FIX_REASK` running re-validation on the revised output. In general, re-validation works best when using a small, purpose-built LLM fine-tuned to your use case.
+
+## Measure performance using telemetry
+
+Guardrails supports OpenTelemetry (OTEL) and a number of OTEL-compatible telemetry providers. You can use telemetry to measure the performance and accuracy of Guardrails AI-enabled applications, as well as the performance of your LLM calls.
+
+For more, read our [Telemetry](/docs/concepts/telemetry) documentation.
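
To make the LiteLLM tip above concrete, here is a minimal, hedged sketch of swapping the backing LLM without changing the guard definition; the validator, model IDs, and the `model=`/`messages=` keyword arguments are assumptions based on the call convention described in these docs, not verbatim project code.

```python
# Sketch: the same guard validating output from two different LLMs.
# Model IDs are placeholders; the guard call signature is assumed to follow
# the LiteLLM-style convention referenced above.
from guardrails import Guard
from guardrails.hub import ToxicLanguage  # assumes this Hub validator is installed

guard = Guard().use(ToxicLanguage(on_fail="fix"))

for model in ("gpt-4o-mini", "claude-3-haiku-20240307"):  # illustrative model IDs
    outcome = guard(
        model=model,
        messages=[{"role": "user", "content": "Summarize our refund policy in one line."}],
    )
    # Record per-model latency and pass/fail rates here to compare workloads.
    print(model, outcome.validation_passed)
```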

docs/concepts/remote_validation_inference.ipynb

Lines changed: 23 additions & 18 deletions
@@ -6,15 +6,22 @@
    "source": [
     "# Remote Validation Inference\n",
     "\n",
-    "## The Need\n",
+    "## The problem\n",
     "\n",
-    "As a concept, guardrailing has a few areas which, when unoptimized, can be extremely latency and resource expensive to run. The main two areas are in guardrailing orchestration and in the ML models used for validating a single guard. These two are resource heavy in slightly different ways. ML models can run with really low latency on GPU-equipped machines, while guardrailing orchestration benefits from general memory and compute resources. Some ML models used for validation run in tens of seconds on CPUs, while they run in milliseconds on GPUs.\n",
+    "As a concept, [guardrailing](https://www.guardrailsai.com/docs/concepts/guard) has a few areas that, when unoptimized, can introduce latency and be extremely resource-expensive. The main two areas are: \n",
+    "\n",
+    "* Guardrailing orchestration; and\n",
+    "* ML models that validate a single guard\n",
+    "\n",
+    "These are resource-heavy in slightly different ways. ML models can run with low latency on GPU-equipped machines. (Some ML models used for validation run in tens of seconds on CPUs, while they run in milliseconds on GPUs.) Meanwhile, guardrailing orchestration benefits from general memory and compute resources. \n",
     "\n",
     "## The Guardrails approach\n",
     "\n",
-    "The Guardrails library tackles this problem by providing an interface that allows users to separate the execution of orchestraion from the exeuction of ML-based validation.\n",
+    "The Guardrails library tackles this problem by providing an interface that allows users to separate the execution of orchestration from the execution of ML-based validation.\n",
     "\n",
-    "The layout of this solution is a simple upgrade to validator libraries themselves. Instead of *always* downloading and installing ML models, they can be configured to reach out to a remote endpoint. This remote endpoint hosts the ML model behind an API that has a uninfied interface for all validator models. Guardrails hosts some of these as a preview feature for free, and users can host their own models as well by following the same interface.\n",
+    "The layout of this solution is a simple upgrade to validator libraries themselves. Instead of *always* downloading and installing ML models, you can configure them to call a remote endpoint. This remote endpoint hosts the ML model behind an API that presents a unified interface for all validator models. \n",
+    "\n",
+    "Guardrails hosts some of these for free as a preview feature. Users can host their own models by following the same interface.\n",
     "\n",
     "\n",
     ":::note\n",
@@ -26,15 +33,15 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Using Guardrails Inferencing Endpoints\n",
+    "## Using Guardrails inferencing endpoints\n",
     "\n",
-    "To use an guardrails endpoint, you simply need to find a validator that has implemented support. Validators with a Guardrails hosted endpoint are labeled as such on the [Validator Hub](https://hub.guardrailsai.com). One example is ToxicLanguage.\n",
+    "To use a guardrails endpoint, find a validator that has implemented support. Validators with a Guardrails-hosted endpoint are labeled as such on the [Validator Hub](https://hub.guardrailsai.com). One example is [Toxic Language](https://hub.guardrailsai.com/validator/guardrails/toxic_language).\n",
     "\n",
     "\n",
     ":::note\n",
-    "To use remote inferencing endpoints, you need to have a Guardrails API key. You can get one by signing up at [the Guardrails Hub](https://hub.guardrailsai.com).\n",
+    "To use remote inferencing endpoints, you need a Guardrails API key. You can get one by signing up at [the Guardrails Hub](https://hub.guardrailsai.com). \n",
     "\n",
-    "Then, run `guardrails configure`\n",
+    "Then, run `guardrails configure`.\n",
     ":::"
    ]
   },
@@ -79,7 +86,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "The major benefit of hosting a validator inference endpoint is the increase in speed and throughput compared to running locally. This implementation makes use cases such as streaming much more viable!\n"
+    "The benefit of hosting a validator inference endpoint is the increase in speed and throughput compared to running locally. This implementation makes use cases such as [streaming](https://www.guardrailsai.com/docs/concepts/streaming) much more viable in production.\n"
    ]
   },
   {
@@ -114,11 +121,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Toggling Remote Inferencing\n",
-    "\n",
-    "To enable/disable remote inferencing, you can run the cli command `guardrails configure` or modify your `~/.guardrailsrc`.\n",
+    "## Toggling remote inferencing\n",
     "\n",
-    "\n"
+    "To enable/disable remote inferencing, you can run the CLI command `guardrails configure` or modify your `~/.guardrailsrc`."
    ]
   },
   {
@@ -142,10 +147,10 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "To disable remote inferencing from a specific validator, you can add a `use_local` kwarg to the validator's initializer\n",
+    "To disable remote inferencing from a specific validator, add a `use_local` kwarg to the validator's initializer. \n",
     "\n",
     ":::note\n",
-    "When runnning locally, you may need to reinstall the validator with the --install-local-models flag.\n",
+    "When running locally, you may need to reinstall the validator with the `--install-local-models` flag.\n",
     ":::"
    ]
   },
@@ -172,9 +177,9 @@
    "source": [
     "## Hosting your own endpoint\n",
     "\n",
-    "Validators are able to point to any endpoint that implements the interface that Guardrails validators expect. This interface can be found in the `_inference_remote` method of the validator.\n",
+    "Validators can point to any endpoint that implements the interface that Guardrails validators expect. This interface can be found in the `_inference_remote` method of the validator.\n",
     "\n",
-    "After implementing this interface, you can host your own endpoint (for example, using gunicorn and Flask) and point your validator to it by setting the `validation_endpoint` constructor argument.\n"
+    "After implementing this interface, you can host your own endpoint (for example, [using gunicorn and Flask](https://flask.palletsprojects.com/en/stable/deploying/gunicorn/)) and point your validator to it by setting the `validation_endpoint` constructor argument.\n"
    ]
   },
   {
@@ -225,7 +230,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.11.9"
+   "version": "3.12.7"
   }
  },
  "nbformat": 4,

docs/how_to_guides/continuous_integration_continuous_deployment.md

Lines changed: 2 additions & 1 deletion
@@ -639,7 +639,7 @@ resource "aws_lb_listener" "app_lb_listener" {
 resource "aws_lb_target_group" "app_lb" {
   name        = "${local.deployment_name}-nlb-tg"
   protocol    = "TCP"
-  port        = 80
+  port        = var.backend_server_port
   vpc_id      = aws_vpc.backend.id
   target_type = "ip"
 
@@ -650,6 +650,7 @@ resource "aws_lb_target_group" "app_lb" {
     timeout             = "3"
     unhealthy_threshold = "3"
     path                = "/"
+    port                = var.backend_server_port
   }
 
   lifecycle {

docs/integrations/llama_index.ipynb

Lines changed: 22 additions & 18 deletions
@@ -38,7 +38,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 13,
+   "execution_count": 3,
    "metadata": {},
    "outputs": [
     {
@@ -50,15 +50,14 @@
      "\n",
      "\n",
      "Installing hub:\u001b[35m/\u001b[0m\u001b[35m/guardrails/\u001b[0m\u001b[95mcompetitor_check...\u001b[0m\n",
-     "✅Successfully installed guardrails/competitor_check version \u001b[1;36m0.0\u001b[0m.\u001b[1;36m1\u001b[0m!\n",
+     "✅Successfully installed guardrails/competitor_check!\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
-   "! guardrails hub install hub://guardrails/detect_pii --no-install-local-models -q\n",
-   "! guardrails hub install hub://guardrails/competitor_check --no-install-local-models -q"
+   "! guardrails hub install hub://guardrails/detect_pii hub://guardrails/competitor_check --no-install-local-models -q"
   ]
  },
  {
@@ -70,7 +69,7 @@
  },
  {
   "cell_type": "code",
-  "execution_count": 6,
+  "execution_count": 4,
   "metadata": {},
  "outputs": [
   {
@@ -79,7 +78,7 @@
    "text": [
     " % Total % Received % Xferd Average Speed Time Time Time Current\n",
     " Dload Upload Total Spent Left Speed\n",
-    "100 75042 100 75042 0 0 959k 0 --:--:-- --:--:-- --:--:-- 964k\n"
+    "100 75042 100 75042 0 0 353k 0 --:--:-- --:--:-- --:--:-- 354k\n"
    ]
   }
  ],
@@ -99,7 +98,7 @@
  },
  {
   "cell_type": "code",
-  "execution_count": 7,
+  "execution_count": 1,
   "metadata": {},
  "outputs": [],
  "source": [
@@ -136,7 +135,7 @@
  },
  {
   "cell_type": "code",
-  "execution_count": 8,
+  "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
@@ -148,7 +147,12 @@
   "        competitors=[\"Fortran\", \"Ada\", \"Pascal\"],\n",
   "        on_fail=\"fix\"\n",
   "    )\n",
-  ").use(DetectPII(pii_entities=\"pii\", on_fail=\"fix\"))"
+  ").use(\n",
+  "    DetectPII(\n",
+  "        pii_entities=[\"PERSON\", \"EMAIL_ADDRESS\"], \n",
+  "        on_fail=\"fix\"\n",
+  "    )\n",
+  ")"
   ]
  },
  {
@@ -162,21 +166,21 @@
  },
  {
   "cell_type": "code",
-  "execution_count": 9,
+  "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
-     "The author worked on writing short stories and programming, starting with early attempts on an IBM 1401 using Fortran in 9th grade, and later transitioning to microcomputers like the TRS-80 and Apple II to write games, rocket prediction programs, and a word processor.\n"
+     "The author is Paul Graham. Growing up, he worked on writing short stories and programming, starting with the IBM 1401 in 9th grade using an early version of Fortran. Later, he transitioned to microcomputers like the TRS-80 and began programming more extensively, creating simple games and a word processor.\n"
     ]
    }
   ],
   "source": [
   "# Use index on it's own\n",
   "query_engine = index.as_query_engine()\n",
-  "response = query_engine.query(\"What did the author do growing up?\")\n",
+  "response = query_engine.query(\"Who is the author and what did they do growing up?\")\n",
   "print(response)"
   ]
  },
@@ -189,14 +193,14 @@
  },
  {
   "cell_type": "code",
-  "execution_count": 11,
+  "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
-     "The author worked on writing short stories and programming, starting with early attempts on an IBM 1401 using [COMPETITOR] in 9th <URL>er, the author transitioned to microcomputers, building a Heathkit kit and eventually getting a TRS-80 to write simple games and <URL>spite enjoying programming, the author initially planned to study philosophy in college but eventually switched to AI due to a lack of interest in philosophy courses.\n"
+     "The author is <PERSON>. Growing up, he worked on writing short stories and programming, starting with the IBM 1401 in 9th grade using an early version of [COMPETITOR]. Later, he transitioned to microcomputers like the TRS-80 and Apple II, where he wrote simple games, programs, and a word processor. \n"
     ]
    }
   ],
@@ -206,7 +210,7 @@
   "\n",
   "guardrails_query_engine = GuardrailsQueryEngine(engine=query_engine, guard=guard)\n",
   "\n",
-  "response = guardrails_query_engine.query(\"What did the author do growing up?\")\n",
+  "response = guardrails_query_engine.query(\"Who is the author and what did they do growing up?\")\n",
   "print(response)\n",
   "    "
  ]
@@ -220,14 +224,14 @@
  },
  {
   "cell_type": "code",
-  "execution_count": 12,
+  "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
-     "The author worked on writing short stories and programming while growing <URL>ey started with early attempts on an IBM 1401 using [COMPETITOR] in 9th <URL>er, they transitioned to microcomputers, building simple games and a word processor on a TRS-80 in <DATE_TIME>.\n"
+     "The author is <PERSON>. Growing up, he worked on writing short stories and programming. He started with early attempts on an IBM 1401 using [COMPETITOR] in 9th grade. Later, he transitioned to microcomputers, building a Heathkit kit and eventually getting a TRS-80 to write simple games and programs. Despite enjoying programming, he initially planned to study philosophy in college but eventually switched to AI due to a lack of interest in philosophy courses. \n"
     ]
    }
   ],
@@ -237,7 +241,7 @@
   "\n",
   "chat_engine = index.as_chat_engine()\n",
   "guardrails_chat_engine = GuardrailsChatEngine(engine=chat_engine, guard=guard)\n",
   "\n",
-  "response = guardrails_chat_engine.chat(\"Tell me what the author did growing up.\")\n",
+  "response = guardrails_chat_engine.chat(\"Tell me who the author is and what they did growing up.\")\n",
   "print(response)"
  ]
