diff --git a/docs/concepts/async_streaming.ipynb b/docs/concepts/async_streaming.ipynb index 1ff42e65a..d6ad5b0ce 100644 --- a/docs/concepts/async_streaming.ipynb +++ b/docs/concepts/async_streaming.ipynb @@ -4,9 +4,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Async Stream-validate LLM responses\n", + "# Async stream-validate LLM responses\n", "\n", - "Asynchronous behavior is generally useful in LLM applciations. It allows multiple, long-running LLM requests to execute at once. Adding streaming to this situation allows us to make non-blocking, iterative validations over each stream as chunks arrive. This document explores how to implement this behavior using the Guardrails framework.\n", + "Asynchronous behavior is generally useful in LLM applications. It allows multiple, long-running LLM requests to execute at once. \n", + "\n", + "With streaming, you can make non-blocking, iterative validations over each stream as chunks arrive. This document explores how to implement this behavior using the Guardrails framework.\n", "\n", "**Note**: learn more about streaming [here](./streaming).\n" ] diff --git a/docs/concepts/deploying.md b/docs/concepts/deploying.md index f7f871fde..0b6f64c92 100644 --- a/docs/concepts/deploying.md +++ b/docs/concepts/deploying.md @@ -1,22 +1,28 @@ # Deploying Guardrails -This document is a guide on our preffered method to deploy Guardrails to production. We will discuss the new client/server model and the benefits this approach gains us. We'll also look at some patterns we find useful when deploying to a production environment as well as some practices to keep in mind when developing with this new pattern. +This document is a guide on our preferred method to deploy Guardrails to production. We discuss the Guardrails client/server model and the benefits it provides. We also look at some patterns we find useful when deploying to a production environment, as well as some practices to keep in mind when developing with this new pattern. :::note Read the quick start guide on using Guardrails on the server [here](https://www.guardrailsai.com/docs/getting_started/guardrails_server) ::: -## The Client/Server Model +## The Client/server model ### Guardrails As A Service -As part of the v0.5.0 release, we introduced the `guardrails-api`. The Guardrails API offers a way to offload the tasks of initializing and executing Guards to a dedicated server. As of the time of writing this document, this is a simple Flask application that uses the core Guardrails validation engine to interact with Guards over HTTP(S) via a RESTful pattern. There are two main ways you can specify Guards for use on the server: -1. By writing a simple python config file for most use cases + +As part of the v0.5.0 release, we introduced the `guardrails-api`. The Guardrails API offers a way to offload the tasks of initializing and executing Guards to a dedicated server. Currently, this is a simple Flask application that uses the core Guardrails validation engine to interact with Guards over HTTP(S) via a RESTful pattern. + +There are two main ways you can specify Guards for use on the server: + +1. By writing a simple Python config file for most use cases 2. By adding a PostgreSQL database for advanced use cases -We will focus on the first use case in this document. +Here, we'll focus on the first use case. + +### Using the Guardrails API as a dev server + +0. *Optional:* We generally recommend utilizing virtual environments. 
If you don't already have a preference, you can use Python's built [venv](https://docs.python.org/3/library/venv.html) module for this. -### A Quick Demonstration of Using the Guardrails API as a Dev Server -0. *Optional:* We generally recommend utilizing virtual environments. If you don't already have a preference, you can use python's built [venv](https://docs.python.org/3/library/venv.html) module for this. 1. Install Guardrails with the `api` extra: ```sh pip install "guardrails-ai[api]" @@ -58,39 +64,66 @@ We will focus on the first use case in this document. validation_outcome = name_guard.validate("John Doe") ``` -### Why a Client/Server Model? -Moving the computational load of validation off of the client application and onto a dedicated server has many benefits. Beyond enabling potential future enhancements such as proxy implementations and supporting other programming languages with client SDKs, it also solves many of the problems we've encountered ourselves when considering how to best put Guardrails in production. +### Why a client/server model? + +Moving the computational load of validation off of the client application and onto a dedicated server has many benefits. Beyond enabling potential future enhancements such as proxy implementations and supporting other programming languages with client SDKs, it also solves many of the problems we've encountered ourselves when considering how to best put Guardrails in production. + +These benefits include: + +**Slimmer deployments**. Previously, if you were using Guardrails in your application you were also including all of the validators that were part of the core library as well as the models of any of those validators you utilized. This added significant overhead to the storage required to run Guardrails on a production server. + +Beyond this, you also had to account for the resources required to run these models efficiently as part of your production server. As an extreme example, when setting up a server to facilitate the Guardrails Hub playground which included every public validator available for use, the resulting image was over 6GB which is clearly not sustainable. + +In version 0.5.0 as part of the client/server model, we removed all validators from the main repo in favor of selecting only the ones you need from the Guardrails Hub. With this new approach that removes the baggage of maintaining validators within the core Guardrails package, a Docker image of a client application built on python:3.12-slim with a `venv` environment comes in at ~350MB uncompressed. This will continue to decrease as we introduce optimized install profiles for various use cases. -Previously, if you were using Guardrails in your application you were also including all of the validators that were part of the core library as well as the models of any of those validators you utilized. This added a lot of overhead to the storage required to run Guardrails on a production server. Beyond this, you also had to account for the resources required to run these models effieciently as part of your production server. As an extreme example, when setting up a server to facilitate the Guardrails Hub playground which included every public validator available for use, the resulting image was over 6GB which is clearly not sustainable. +Another feature that reduces deployment size is remote validation for select validators. You can utilize certain validators _without_ downloading or running their underlying models on your hardware. 
Instead, after a one time configuration, they can offload the heavy lifting to dedicated, remotely hosted models without the need to change the way you interact with the Guardrails package. You can read more about this [here](/concepts/remote_validation_inference). -In version 0.5.0 as part of the client/server model, all validators were removed from the main repo in favor of selecting only the ones you need from the Guardrails Hub. With this new approach that removes the baggage of maintaining validators within the core Guardrails package, a Docker image of a client application built on python:3.12-slim with a `venv` environment comes in at ~350MB uncompressed. This will continue to decrease as we introduce optimized install profiles for various use cases. +**Scaling**. So far, we've talked a lot about reducing the resources necessary to run Guardrails in production. Another important factor we considered when shifting to the new client/server paradigm is how this pattern enables better scaling practices. -One last notable improvement we're making to help reduce the impact validator models have on your deployables is introducting remote validation for select validators. With this feature, you can utilize certain validators _without_ downloading or running their underlying models on your hardware. Instead, after a one time configuration, they can offload the heavy lifting to dedicated, remotely hosted models without the need to change the way you interact with the Guardrails package. You can read more about this [here](/concepts/remote_validation_inference). +Since the Guardrails API is now separate and distinct from your application code, you can scale both separately and according to their own needs. This means that if your client application needs to scale it can do so without accounting for the additional resources required for validation. -So far, we've talked a lot about reducing the resources necessary to run Guardrails in production. Another important factor we considered when shifting to the new client/server paradigm is how this pattern enables better scaling practices. Since the Guardrails API is now separate and distinct from your application code, you can scale both separately and according to their own needs. This means that if your client application needs to scale it can do so without accounting for the additional resources required for validation. Likewise, if the validation traffic is the limiting factor, you client application can stay small on fewer instances while the Guardrails API scales out to meet the demand. You're also not limited to only one Guardrails API deployable giving you the option to scale more heavily utilized use-cases independently of those less frequently used. +Likewise, if the validation traffic is the limiting factor, your client application can stay small on fewer instances while the Guardrails API scales out to meet the demand. You're also not limited to only one Guardrails API deployable giving you the option to scale more heavily utilized use cases independently of those less frequently used. -## Considerations for Productionization -As previously mentioned, the Guardrails API is currently a simple Flask application. This means you'll want a WSGI server to serve the application for a production environment. There are many options out there and we do not particularly endorse one of another. For demonstration purposes we will show using Gunicorn since it is a common choice in the industry. 
+## Considerations for productionization + +### Configuring your WSGI server properly + +As previously mentioned, the Guardrails API is currently a simple Flask application. This means you'll want a WSGI server to serve the application for a production environment. + +There are many options. We don't endorse one over another. For demonstration purposes, we'll use Gunicorn since it's a common choice in the industry. + +Previously, we showed how to start the Guardrails API as a dev server using the `guardrails start` command. When launching the Guardrails API with a WSGI server, you reference the underlying `guardrails_api` module instead. + +For example, when we Dockerize the Guardrails API for internal use, our final line is: -Previously we showed how to start the Guardrails API as a dev server using the `guardrails start` command. When launching the Guardrails API with a WSGI server, you will reference the underlying `guardrails_api` module instead. For example, when we Dockerize the Guardrails API for internal use, our final line is: ```Dockerfile CMD gunicorn --bind 0.0.0.0:8000 --timeout=90 --workers=4 'guardrails_api.app:create_app(None, "config.py")' ``` -This line starts the Guardrails API Flask application with a gunicorn WSGI server. It specifies what port to bind the server to, as well as the timeout for workers and the maximum number of worker threads for handling requests. We typically use the `gthread` worker class with gunicorn because of compatibility issues between how some async workers try to monkeypatch dependencies and how some libraries specify optional imports. +This line starts the Guardrails API Flask application with a gunicorn WSGI server. It specifies what port to bind the server to, as well as the timeout for workers and the maximum number of worker threads for handling requests. -The [Official Gunicorn Documentation](https://docs.gunicorn.org/en/latest/design.html#how-many-workers) recommends setting the number of threads/workers to (2 x num_cores) + 1, though this may prove to be too resource intensive, depending on the choice of models in validators. Specifying `--threads=` instead of `--workers=` will cause gunicorn to use multithreading instead of multiprocessing. Threads will be lighter weight, as they can share the models loaded at startup from `config.py`, but [risk hitting race conditions](https://github.com/guardrails-ai/guardrails/discussions/899) when manipulating history. For cases that have several larger models, need longer to process requests, have square-wave-like traffic, or have sustained high traffic, `--threads` may prove to be a desirable tradeoff. +We typically use the `gthread` worker class with gunicorn. This is because of compatibility issues between how some async workers try to monkeypatch dependencies and how some libraries specify optional imports. + +The [Official Gunicorn Documentation](https://docs.gunicorn.org/en/latest/design.html#how-many-workers) recommends setting the number of threads/workers to (2 x num_cores) + 1, though this may prove to be too resource intensive, depending on the choice of models in validators. Specifying `--threads=` instead of `--workers=` will cause gunicorn to use multithreading instead of multiprocessing. + +Threads will be lighter weight, as they can share the models loaded at startup from `config.py`, but [risk hitting race conditions](https://github.com/guardrails-ai/guardrails/discussions/899) when manipulating history. 
For cases that have several larger models, need longer to process requests, have square-wave-like traffic, or have sustained high traffic, `--threads` may prove to be a desirable tradeoff. For further reference, you can find a bare-bones example of Dockerizing the Guardrails API here: https://github.com/guardrails-ai/guardrails-lite-server +### Types of validators + +When selecting a deployment environment, consider what types of validators you plan to use. If most of the validators you require are static or LLM based, the Guardrails API can perform well in a serverless environment. + +However, if you use multiple ML-based validators, these models will have a large memory footprint. You'll also be forced to load them on init. These are good reasons to choose a more persistent hosting option. + +When utilizing a containerized hosting option that allows for auto-scaling, we find that under load the tasks are generally more CPU bound than memory bound. In other words, they benefit more from scaling on CPU utilization or request queue depth. -When selecting a deployment environment it is important to consider what types of validators you plan to use. If most of the validators you require are static or LLM based, the Guardrails API can perform well in a serverless environment. However if you make use of multiple ML based validators, the sheer memory footprint the underlying models bring and the need to load the models on init are good reasons to choose a more persistent hosting option. When utilizing a containerized hosting option that allows for auto-scaling, we find that under load the tasks are generally more CPU bound than memory bound and therefore benefit more from scaling on CPU utilization or request queue depth. +## Patterns and practices for using the client/server model -## Patterns and Practices for Using the Client/Server Model When considering what to put where when splitting your Guardrails implementation between your client application and the Guardrails API, it mostly comes down to shifting the heavy lifting to the server and keeping your implementation on the client side to a minimum. -For example, you should define your Guards in the `config.py` that is loaded onto the server, not in your client application. Additionally validators from the Guardrails HUB should also be installed on the server since that is where they will be executed; no need to install these in the client application. This also means that _generally_ any extras you need alongside Guardrails would also be installed server side; that is, you would only want to install `guardails-ai` in your application whereas you would install `guardrails-ai[api]` on the server. This keeps additional dependencies where they belong. +For example, you should define your Guards in the `config.py` that is loaded onto the server, not in your client application. Additionally validators from the Guardrails HUB should also be installed on the server since that is where they will be executed; no need to install these in the client application. This also means that _generally_ any extras you need alongside Guardrails would also be installed server side; that is, you would only want to install `guardrails-ai` in your application whereas you would install `guardrails-ai[api]` on the server. This keeps additional dependencies where they belong. 
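+
+As a rough sketch of that split, the server-side `config.py` is where the guard and its Hub validator live. The guard name, validator, and regex below are illustrative (they echo the name-case guard from the dev server walkthrough above) and assume the validator has already been installed from the Hub on the server:
+
+```python
+# config.py -- loaded by the Guardrails API server, not by your client application
+from guardrails import Guard
+from guardrails.hub import RegexMatch
+
+# Illustrative guard: enforce Name Case strings on the server side.
+name_guard = Guard(name="name-case").use(
+    RegexMatch(regex=r"^(?:[A-Z][a-z]*\s?)+$")
+)
+```
+
+Your client application then keeps only the thin call shown earlier (for example, `name_guard.validate("John Doe")`); the validator itself executes on the server.
+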
## Next Steps
diff --git a/docs/concepts/performance.md b/docs/concepts/performance.md
new file mode 100644
index 000000000..cafd5f66f
--- /dev/null
+++ b/docs/concepts/performance.md
@@ -0,0 +1,37 @@
+# Performance
+
+Performance for Gen AI apps can mean two things:
+
+* Application performance: The total time taken to return a response to a user request
+* Accuracy: How often a given LLM returns an accurate answer
+
+This document addresses application performance and strategies to minimize latency in responses. For tracking accuracy, see our [Telemetry](/docs/concepts/telemetry) page.
+
+## Basic application performance
+
+A Guardrails setup consists of a guard and a series of validators that the guard uses to validate LLM responses. Generally, a guard runs in under 10ms. Validators should only add around 100ms of additional latency when configured correctly.
+
+The largest latency and performance issues will come from your selection of LLM. It's important to capture metrics around LLM usage and assess how different LLMs handle different workloads in terms of both performance and result accuracy. [Guardrails AI's LiteLLM support](https://www.guardrailsai.com/blog/guardrails-litellm-validate-llm-output) makes it easy to switch out LLMs with minor changes to your guard calls.
+
+## Performance tips
+
+Here are a few tips to get the best performance out of your Guardrails-enabled applications.
+
+**Use async guards for the best performance**. Use the `AsyncGuard` class to make concurrent calls to multiple LLMs and process the response chunks as they arrive. For more information, see [Async stream-validate LLM responses](/docs/async-streaming). A minimal sketch appears at the end of this page.
+
+**Use a remote server for heavy workloads**. More compute-intensive workloads, such as remote inference endpoints, work best when run with dedicated memory and CPU. For example, guards that use a single Machine Learning (ML) model for validation can run in milliseconds on GPU-equipped machines, while they may take tens of seconds on normal CPUs. However, guardrailing orchestration itself performs better on general compute.
+
+To account for this, offload performance-critical validation work by:
+
+* Using [Guardrails Server](/docs/concepts/deploying) to run certain guard executions on a dedicated server
+* Leveraging [remote validation inference](/docs/concepts/remote_validation_inference) to configure validators to call a REST API for inference results instead of running them locally
+
+The Guardrails server runs as a Flask application. For best performance, [follow our guidelines on configuring your WSGI servers properly](/docs/concepts/deploying) for production.
+
+**Use purpose-built LLMs for re-validators**. When a guard fails, you can decide how to handle it by setting the appropriate OnFail action. The `OnFailAction.REASK` and `OnFailAction.FIX_REASK` actions will ask the LLM to correct its output, with `OnFailAction.FIX_REASK` running re-validation on the revised output. In general, re-validation works best when using a small, purpose-built LLM fine-tuned to your use case.
+
+## Measure performance using telemetry
+
+Guardrails supports OpenTelemetry (OTEL) and a number of OTEL-compatible telemetry providers. You can use telemetry to measure the performance and accuracy of Guardrails AI-enabled applications, as well as the performance of your LLM calls.
+
+For more, read our [Telemetry](/docs/concepts/telemetry) documentation.
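+
+To make the async guard tip above concrete, here is a minimal sketch. It assumes a Hub validator is installed locally and that `AsyncGuard` exposes the same `validate` entry point as `Guard`; treat it as a starting point rather than a drop-in recipe:
+
+```python
+import asyncio
+
+from guardrails import AsyncGuard
+from guardrails.hub import ToxicLanguage  # illustrative validator choice
+
+guard = AsyncGuard().use(ToxicLanguage(on_fail="fix"))
+
+async def validate_all(outputs):
+    # Validate several LLM outputs concurrently instead of one at a time.
+    return await asyncio.gather(*(guard.validate(text) for text in outputs))
+
+results = asyncio.run(validate_all(["first draft", "second draft", "third draft"]))
+```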
\ No newline at end of file
diff --git a/docs/integrations/llamaindex.md b/docs/integrations/llamaindex.md
new file mode 100644
index 000000000..5dc253d0b
--- /dev/null
+++ b/docs/integrations/llamaindex.md
@@ -0,0 +1,126 @@
+# LlamaIndex
+
+LlamaIndex is an open source data orchestration framework that simplifies integrating private and public data into applications built on Large Language Models (LLMs). With this integration, you can use Guardrails AI to validate the output of LlamaIndex LLM calls with minimal code changes to your application.
+
+The sample below walks through setting up a single vector database for Retrieval-Augmented Generation (RAG) and then querying the index, using Guardrails AI to ensure the answer doesn't contain [Personally Identifiable Information (PII)](https://www.investopedia.com/terms/p/personally-identifiable-information-pii.asp) and doesn't mention competitive products.
+
+Guardrails AI works with both LlamaIndex's [query engine](https://docs.llamaindex.ai/en/stable/module_guides/deploying/query_engine/) and its [chat engine](https://docs.llamaindex.ai/en/stable/module_guides/deploying/chat_engines/). The query engine is a generic natural language interface for asking questions of data. The chat engine is a higher-level interface that enables a conversation around your data over time, leveraging both the general language capabilities of an LLM and your own private data to generate accurate, up-to-date responses.
+
+## Prerequisites
+
+This document assumes you have set up Guardrails AI. You should also be familiar with foundational Guardrails AI concepts, such as [guards](https://www.guardrailsai.com/docs/concepts/guard) and [validators](https://www.guardrailsai.com/docs/concepts/validators). For more information, see [Quickstart: In-Application](/docs/getting_started/quickstart).
+
+You should be familiar with the basic concepts of RAG. For the basics, see our blog post on [reducing hallucination issues in GenAI apps](https://www.guardrailsai.com/blog/reduce-ai-hallucinations-provenance-guardrails).
+
+This walkthrough downloads files from Guardrails Hub, our public directory of free validators. If you haven't already, create a [Guardrails Hub API key](https://hub.guardrailsai.com/keys) and run `guardrails configure` to set it. For more information on Guardrails Hub, see the [Guardrails Hub documentation](/docs/concepts/hub).
+
+Unless you specify another LLM, LlamaIndex uses OpenAI for natural language queries as well as to generate vector embeddings. This requires generating and setting an [OpenAI API key](https://platform.openai.com/api-keys), which you can do on Linux using:
+
+```bash
+export OPENAI_API_KEY=KEY_VALUE
+```
+
+And in Windows Cmd or PowerShell using:
+
+```powershell
+set OPENAI_API_KEY=KEY_VALUE
+```
+
+You will need sufficient OpenAI credits to run this sample (between USD $0.01 and $0.03 per run-through). To use another hosted LLM or a locally-running LLM, [see the LlamaIndex documentation](https://docs.llamaindex.ai/en/stable/understanding/using_llms/using_llms/).
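+
+If you prefer not to export the key in your shell, you can also set it for the current process from Python before running the samples below (a minimal sketch; `KEY_VALUE` is the same placeholder used above, not a real key):
+
+```python
+import os
+
+# Make the OpenAI API key available to LlamaIndex for this process only.
+os.environ["OPENAI_API_KEY"] = "KEY_VALUE"
+```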
+ +## Install LlamaIndex + +Install the LlamaIndex package: + +```bash +pip install llama-index -q +``` + +Next, install the [Detect PII](https://hub.guardrailsai.com/validator/guardrails/detect_pii) and [Competitor Check](https://hub.guardrailsai.com/validator/guardrails/competitor_check) validators if you don't already have them installed: + +```bash +guardrails hub install hub://guardrails/detect_pii --no-install-local-models -q +guardrails hub install hub://guardrails/competitor_check --no-install-local-models -q +``` + +## Set up your data + +Next, we'll need some sample data to feed into a vector database. [Download the essay located here](https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt) using curl on the command line (or read it with your browser and save it): + +```bash +mkdir -p ./data +curl https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt > ./data/paul_graham_essay.txt +``` + +This essay from Paul Graham, "What I Worked On," contains details of Graham's life and career, some of which qualify as PII. Additionally, Graham mentions several programming languages - Fortran, Pascal, and BASIC - which, for the purposes of this tutorial, we'll treat as "competitors." + +Next, use the following code [from the LlamaIndex starter tutorial](https://docs.llamaindex.ai/en/stable/getting_started/starter_example/) to create vector embeddings for this document and store the vector database to disk: + +```python +import os.path +from llama_index.core import ( + VectorStoreIndex, + SimpleDirectoryReader, + StorageContext, + load_index_from_storage, +) + +# check if storage already exists +PERSIST_DIR = "./storage" +if not os.path.exists(PERSIST_DIR): + # load the documents and create the index + documents = SimpleDirectoryReader("data").load_data() + index = VectorStoreIndex.from_documents(documents) + # store it for later + index.storage_context.persist(persist_dir=PERSIST_DIR) +else: + # load the existing index + storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR) + index = load_index_from_storage(storage_context) +``` + +By default, this will save the embeddings in a simple document store on disk. You can also use [an open-source or commercial vector database](https://docs.llamaindex.ai/en/stable/module_guides/storing/vector_stores/). + +## Validate LlamaIndex calls + +Next, call LlamaIndex without any guards to see what values it returns if you don't validate the output. + +```python +query_engine = index.as_query_engine() +response = query_engine.query("What did the author do growing up?") +print(response) +``` + +You should get back a response like this: + +```bash +The author is Paul Graham. Growing up, he worked on writing short stories and programming, starting with the IBM 1401 in 9th grade using an early version of Fortran. Later, he transitioned to microcomputers like the TRS-80 and began programming more extensively, creating simple games and a word processor. 
+```
+
+Now, create a guard that combines the PII and competitor check validators and run the same call through it. The PII entities and competitor list below are illustrative; adjust them to your own use case:
+
+```python
+from guardrails import Guard
+from guardrails.hub import CompetitorCheck, DetectPII
+from guardrails.integrations.llama_index import GuardrailsQueryEngine
+
+# Illustrative guard built from the two validators installed earlier.
+guard = Guard().use_many(DetectPII(pii_entities=["PERSON"], on_fail="fix"),
+                         CompetitorCheck(competitors=["Fortran", "Pascal", "BASIC"], on_fail="fix"))
+
+guardrails_query_engine = GuardrailsQueryEngine(engine=query_engine, guard=guard)
+
+response = guardrails_query_engine.query("What did the author do growing up?")
+print(response)
+```
+
+This replaces the call to LlamaIndex's query engine with the `guardrails.integrations.llama_index.GuardrailsQueryEngine` class, which is a thin wrapper around the LlamaIndex query engine. The response will look something like this:
+
+```
+The author is . Growing up, he worked on writing short stories and programming, starting with the IBM 1401 in 9th grade using an early version of [COMPETITOR]. Later, he transitioned to microcomputers like the TRS-80 and Apple II, where he wrote simple games, programs, and a word processor.
+```
+
+Note that the author's name has been redacted and the competitor mention has been replaced with `[COMPETITOR]`.
+
+To use Guardrails AI validators with LlamaIndex's chat engine, use the `GuardrailsChatEngine` class instead:
+
+```python
+from guardrails.integrations.llama_index import GuardrailsChatEngine
+chat_engine = index.as_chat_engine()
+guardrails_chat_engine = GuardrailsChatEngine(engine=chat_engine, guard=guard)
+
+response = guardrails_chat_engine.chat("Tell me what the author did growing up.")
+print(response)
+```
\ No newline at end of file
diff --git a/docusaurus/sidebars.js b/docusaurus/sidebars.js
index 10640e644..f3d922651 100644
--- a/docusaurus/sidebars.js
+++ b/docusaurus/sidebars.js
@@ -55,6 +55,7 @@ const sidebars = {
    "concepts/validator_on_fail_actions",
    // "concepts/guardrails",
    "concepts/hub",
+    "concepts/performance",
    "concepts/deploying",
    "concepts/remote_validation_inference",
    {
@@ -104,7 +105,7 @@ const sidebars = {
  integrations: [
    // "integrations/azure_openai",
    "integrations/langchain",
-    "integrations/llama_index",
+    "integrations/llamaindex",
    {
      type: "category",
      label: "Telemetry",