guardrails-ai/toxic_language_llm

Overview

  • Developed by: Guardrails AI
  • Date of development: Mar 2026
  • Validator type: Moderation
  • Blog
  • License: Apache 2
  • Input/Output: Output

Description

Intended Use

This validator detects toxic language in LLM-generated text using an LLM as the detection backbone (via LiteLLM). It is a clean, LLM-based alternative to the model-based ToxicLanguage validator, which relies on the Detoxify toxic-bert model.

Instead of downloading and running a local classification model, this validator sends text to an LLM that evaluates it across seven toxicity categories:

  • toxicity - general toxic content
  • severe_toxicity - extremely toxic content
  • obscene - obscene language
  • threat - threatening language
  • insult - insulting language
  • identity_attack - identity-based attacks
  • sexual_explicit - sexually explicit content
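As a rough illustration of how per-category scores relate to the threshold (this is a hypothetical helper, not the validator's actual internals), a category is flagged when its score meets or exceeds the threshold:

```python
# Hypothetical sketch of per-category thresholding; the real validator's
# internal scoring logic may differ.
def flagged_categories(scores: dict[str, float], threshold: float = 0.5) -> list[str]:
    """Return the categories whose score meets or exceeds the threshold."""
    return [cat for cat, score in scores.items() if score >= threshold]

scores = {"toxicity": 0.81, "insult": 0.66, "threat": 0.05}
print(flagged_categories(scores))  # ['toxicity', 'insult']
```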

The validator supports two validation modes:

  • sentence (default): Evaluates each sentence individually. Toxic sentences are removed while clean sentences are preserved in the fix_value.
  • full: Evaluates the entire text as a whole. If any toxicity is detected, the entire text fails validation.
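The sentence-mode behavior can be sketched as follows. This is a self-contained illustration with a stand-in scorer (`toxic_score` is hypothetical; the real validator obtains scores from an LLM), showing how a `fix_value` would keep only the clean sentences:

```python
import re

def toxic_score(sentence: str) -> float:
    """Stand-in scorer; the real validator asks an LLM for this score."""
    return 0.9 if "terrible" in sentence else 0.0

def fix_value(text: str, threshold: float = 0.5) -> str:
    """Sketch of sentence mode: drop toxic sentences, keep the rest."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    clean = [s for s in sentences if toxic_score(s) < threshold]
    return " ".join(clean)

print(fix_value("The demo went well. You are a terrible person."))
# The demo went well.
```

In full mode there is no partial fix to compute: any detected toxicity fails the whole text.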

Requirements

  • Dependencies:

    • guardrails-ai>=0.4.0
    • litellm
  • Foundation model access keys:

    • ANTHROPIC_API_KEY (required for the default Claude Haiku model)
    • Or the appropriate API key for your chosen model (e.g., OPENAI_API_KEY for OpenAI models)
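Whichever model you choose, the matching key must be set in the environment before the guard runs, for example (placeholder value):

```shell
# Placeholder value; substitute your own credential.
export ANTHROPIC_API_KEY="your-api-key"
```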

Installation

guardrails hub install hub://guardrails/toxic_language_llm

Usage Examples

Validating string output via Python

In this example, we apply the validator to a string output generated by an LLM.

# Import Guard and Validator
from guardrails.hub import ToxicLanguageLLM
from guardrails import Guard

# Use with default settings (sentence mode, threshold 0.5, Claude Haiku)
guard = Guard().use(ToxicLanguageLLM)

guard.validate("The weather is beautiful today.")  # Validator passes
guard.validate("You are a terrible person.")  # Validator fails

Customizing threshold and validation method

from guardrails.hub import ToxicLanguageLLM
from guardrails import Guard

# Strict full-text validation with a lower threshold
guard = Guard().use(
    ToxicLanguageLLM,
    threshold=0.3,
    validation_method="full",
    on_fail="exception",
)

guard.validate("The project is going well.")  # Validator passes

Using a different LLM model

from guardrails.hub import ToxicLanguageLLM
from guardrails import Guard

# Use OpenAI model instead of the default Claude Haiku
guard = Guard().use(
    ToxicLanguageLLM,
    model="openai/gpt-4o-mini",
    on_fail="fix",
)

result = guard.validate("Clean sentence. Toxic sentence here.")
# result.validated_output contains only the clean sentences

Validating JSON output via Python

In this example, we apply the validator to a string field of a JSON output generated by an LLM.

# Import Guard and Validator
from pydantic import BaseModel, Field
from guardrails.hub import ToxicLanguageLLM
from guardrails import Guard

# Initialize Validator
val = ToxicLanguageLLM(threshold=0.5, validation_method="sentence")

# Create Pydantic BaseModel
class ChatResponse(BaseModel):
    user_name: str
    message: str = Field(validators=[val])

# Create a Guard to check for valid Pydantic output
guard = Guard.from_pydantic(output_class=ChatResponse)

# Run LLM output generating JSON through guard
guard.parse("""
{
    "user_name": "Alice",
    "message": "Hello, how are you today?"
}
""")

API Reference

__init__(self, threshold=0.5, validation_method="sentence", model=None, on_fail="noop")

    Initializes a new instance of the ToxicLanguageLLM class.

    Parameters

    • threshold (float): Confidence score threshold for toxicity classification. Scores at or above this value are flagged as toxic. Defaults to 0.5.
    • validation_method (str): Either "sentence" to evaluate individual sentences or "full" to evaluate the entire text. Defaults to "sentence".
    • model (str, optional): LiteLLM model identifier to use for toxicity detection. Defaults to the latest Claude Haiku model (anthropic/claude-haiku-4-5-20251001).
  • on_fail (str, Callable): The policy to enact when the validator fails. If a string, it must be one of `reask`, `fix`, `filter`, `refrain`, `noop`, `exception`, or `fix_reask`. Otherwise, it must be a function that is called when the validator fails.
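A string policy covers most cases, but `on_fail` can also be a callable. The exact arguments Guardrails passes to the handler vary by version; as a sketch, assuming the handler receives the failing value and a failure-result object:

```python
# Sketch of a custom on_fail handler; the exact signature Guardrails
# passes (value plus one or more failure results) varies by version.
def redact_on_fail(value, fail_result):
    """Replace failing output with a fixed safe message."""
    return "[removed: flagged as toxic]"

# Standalone call with stand-in arguments, to show the behavior:
print(redact_on_fail("You are a terrible person.", None))
# [removed: flagged as toxic]
```

Such a handler would be passed as `on_fail=redact_on_fail` in place of a string policy.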

validate(self, value, metadata) -> ValidationResult

    Validates the given `value` for toxic language using the configured LLM, relying on the `metadata` provided to customize the validation process. This method is automatically invoked by `guard.parse(...)` or `guard.validate(...)`.

    Note:

    1. This method should not be called directly by the user. Instead, invoke guard.parse(...) or guard.validate(...) where this method will be called internally for each associated Validator.
    2. When invoking guard.parse(...), be sure to pass the appropriate metadata dictionary that includes the keys and values required by this validator. If the guard is associated with multiple validators, combine all necessary metadata into a single dictionary.

    Parameters

    • value (Any): The input text to validate.
    • metadata (dict): A dictionary containing metadata. This validator does not require any specific metadata keys.

About

Uses an LLM to determine whether text is toxic.