guardrails-ai/toxic_language_llm

Overview

  • Developed by: Guardrails AI
  • Date of development: Mar 2026
  • Validator type: Moderation
  • Blog
  • License: Apache 2
  • Input/Output: Output

Description

Intended Use

This validator detects toxic language in LLM-generated text using an LLM as the detection backbone (via LiteLLM). It is a clean, LLM-based alternative to the model-based ToxicLanguage validator, which relies on the Detoxify toxic-bert model.

Instead of downloading and running a local classification model, this validator sends text to an LLM that evaluates it across seven toxicity categories:

  • toxicity - general toxic content
  • severe_toxicity - extremely toxic content
  • obscene - obscene language
  • threat - threatening language
  • insult - insulting language
  • identity_attack - identity-based attacks
  • sexual_explicit - sexually explicit content
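As a rough illustration of how per-category scores relate to the threshold (this is a hypothetical helper, not the validator's actual internals), a category is flagged when its score meets or exceeds the threshold:

```python
# Hypothetical sketch of per-category thresholding; the real validator's
# internal scoring logic may differ.
def flagged_categories(scores: dict[str, float], threshold: float = 0.5) -> list[str]:
    """Return the categories whose score meets or exceeds the threshold."""
    return [cat for cat, score in scores.items() if score >= threshold]

scores = {"toxicity": 0.81, "insult": 0.66, "threat": 0.05}
print(flagged_categories(scores))  # ['toxicity', 'insult']
```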

The validator supports two validation modes:

  • sentence (default): Evaluates each sentence individually. Toxic sentences are removed while clean sentences are preserved in the fix_value.
  • full: Evaluates the entire text as a whole. If any toxicity is detected, the entire text fails validation.
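The sentence-mode behavior can be sketched as follows. This is a self-contained illustration with a stand-in scorer (`toxic_score` is hypothetical; the real validator obtains scores from an LLM), showing how a `fix_value` would keep only the clean sentences:

```python
import re

def toxic_score(sentence: str) -> float:
    """Stand-in scorer; the real validator asks an LLM for this score."""
    return 0.9 if "terrible" in sentence else 0.0

def fix_value(text: str, threshold: float = 0.5) -> str:
    """Sketch of sentence mode: drop toxic sentences, keep the rest."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    clean = [s for s in sentences if toxic_score(s) < threshold]
    return " ".join(clean)

print(fix_value("The demo went well. You are a terrible person."))
# The demo went well.
```

In full mode there is no partial fix to compute: any detected toxicity fails the whole text.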

Requirements

  • Dependencies:

    • guardrails-ai>=0.4.0
    • litellm
  • Foundation model access keys:

    • ANTHROPIC_API_KEY (required for the default Claude Haiku model)
    • Or the appropriate API key for your chosen model (e.g., OPENAI_API_KEY for OpenAI models)
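Whichever model you choose, the matching key must be set in the environment before the guard runs, for example (placeholder value):

```shell
# Placeholder value; substitute your own credential.
export ANTHROPIC_API_KEY="your-api-key"
```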

Installation

guardrails hub install hub://guardrails/toxic_language_llm

Usage Examples

Validating string output via Python

In this example, we apply the validator to a string output generated by an LLM.

# Import Guard and Validator
from guardrails.hub import ToxicLanguageLLM
from guardrails import Guard

# Use with default settings (sentence mode, threshold 0.5, Claude Haiku)
guard = Guard().use(ToxicLanguageLLM)

guard.validate("The weather is beautiful today.")  # Validator passes
guard.validate("You are a terrible person.")  # Validator fails

Customizing threshold and validation method

from guardrails.hub import ToxicLanguageLLM
from guardrails import Guard

# Strict full-text validation with a lower threshold
guard = Guard().use(
    ToxicLanguageLLM,
    threshold=0.3,
    validation_method="full",
    on_fail="exception",
)

guard.validate("The project is going well.")  # Validator passes

Using a different LLM model

from guardrails.hub import ToxicLanguageLLM
from guardrails import Guard

# Use OpenAI model instead of the default Claude Haiku
guard = Guard().use(
    ToxicLanguageLLM,
    model="openai/gpt-4o-mini",
    on_fail="fix",
)

result = guard.validate("Clean sentence. Toxic sentence here.")
# result.validated_output contains only the clean sentences

Validating JSON output via Python

In this example, we apply the validator to a string field of a JSON output generated by an LLM.

# Import Guard and Validator
from pydantic import BaseModel, Field
from guardrails.hub import ToxicLanguageLLM
from guardrails import Guard

# Initialize Validator
val = ToxicLanguageLLM(threshold=0.5, validation_method="sentence")

# Create Pydantic BaseModel
class ChatResponse(BaseModel):
    user_name: str
    message: str = Field(validators=[val])

# Create a Guard to check for valid Pydantic output
guard = Guard.from_pydantic(output_class=ChatResponse)

# Run LLM output generating JSON through guard
guard.parse("""
{
    "user_name": "Alice",
    "message": "Hello, how are you today?"
}
""")

API Reference

__init__(self, threshold=0.5, validation_method="sentence", model=None, on_fail="noop")

    Initializes a new instance of the ToxicLanguageLLM class.

    Parameters

    • threshold (float): Confidence score threshold for toxicity classification. Scores at or above this value are flagged as toxic. Defaults to 0.5.
    • validation_method (str): Either "sentence" to evaluate individual sentences or "full" to evaluate the entire text. Defaults to "sentence".
    • model (str, optional): LiteLLM model identifier to use for toxicity detection. Defaults to the latest Claude Haiku model (anthropic/claude-haiku-4-5-20251001).
  • on_fail (str, Callable): The policy to enact when the validator fails. If a string, it must be one of `reask`, `fix`, `filter`, `refrain`, `noop`, `exception`, or `fix_reask`. Otherwise, it must be a function that is called when the validator fails.
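A string policy covers most cases, but `on_fail` can also be a callable. The exact arguments Guardrails passes to the handler vary by version; as a sketch, assuming the handler receives the failing value and a failure-result object:

```python
# Sketch of a custom on_fail handler; the exact signature Guardrails
# passes (value plus one or more failure results) varies by version.
def redact_on_fail(value, fail_result):
    """Replace failing output with a fixed safe message."""
    return "[removed: flagged as toxic]"

# Standalone call with stand-in arguments, to show the behavior:
print(redact_on_fail("You are a terrible person.", None))
# [removed: flagged as toxic]
```

Such a handler would be passed as `on_fail=redact_on_fail` in place of a string policy.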

validate(self, value, metadata) -> ValidationResult

    Validates the given `value` for toxic language using the configured LLM, relying on the `metadata` provided to customize the validation process. This method is automatically invoked by `guard.parse(...)` or `guard.validate(...)`.

    Note:

    1. This method should not be called directly by the user. Instead, invoke guard.parse(...) or guard.validate(...) where this method will be called internally for each associated Validator.
    2. When invoking guard.parse(...), be sure to pass the appropriate metadata dictionary that includes the keys and values required by this validator. If the guard is associated with multiple validators, combine all necessary metadata into a single dictionary.

    Parameters

    • value (Any): The input text to validate.
    • metadata (dict): A dictionary containing metadata. This validator does not require any specific metadata keys.

About

Uses an LLM to determine whether text is toxic.