| Developed by | Guardrails AI |
|---|---|
| Date of development | Mar 2026 |
| Validator type | Moderation |
| Blog | |
| License | Apache 2 |
| Input/Output | Output |
This validator detects toxic language in LLM-generated text using an LLM as the detection backbone (via LiteLLM). It is a clean, LLM-based alternative to the model-based ToxicLanguage validator, which relies on the Detoxify toxic-bert model.
Instead of downloading and running a local classification model, this validator sends text to an LLM that evaluates it across seven toxicity categories:
- `toxicity` - general toxic content
- `severe_toxicity` - extremely toxic content
- `obscene` - obscene language
- `threat` - threatening language
- `insult` - insulting language
- `identity_attack` - identity-based attacks
- `sexual_explicit` - sexually explicit content
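Conceptually, the LLM judge returns a confidence score per category, and the text is flagged as toxic when any category meets or exceeds the threshold. A minimal sketch of that thresholding step (the `scores` dict below is illustrative; the validator's actual prompt and response parsing are internal):

```python
# Illustrative only: per-category confidence scores as an LLM judge
# might return them for a given piece of text.
scores = {
    "toxicity": 0.82,
    "severe_toxicity": 0.10,
    "obscene": 0.05,
    "threat": 0.02,
    "insult": 0.71,
    "identity_attack": 0.03,
    "sexual_explicit": 0.01,
}

threshold = 0.5  # the validator's default

# The text is flagged when any category meets or exceeds the threshold.
flagged = {cat: s for cat, s in scores.items() if s >= threshold}
is_toxic = bool(flagged)

print(is_toxic)         # True
print(sorted(flagged))  # ['insult', 'toxicity']
```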
The validator supports two validation modes:
- `sentence` (default): Evaluates each sentence individually. Toxic sentences are removed while clean sentences are preserved in the `fix_value`.
- `full`: Evaluates the entire text as a whole. If any toxicity is detected, the entire text fails validation.
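Sentence mode's fix behavior can be pictured as: split the text into sentences, score each one, drop those that cross the threshold, and rejoin the rest. A rough sketch, assuming a naive regex splitter and a stand-in `is_toxic` check (the validator's real sentence splitting and LLM scoring are internal):

```python
import re

def is_toxic(sentence: str) -> bool:
    # Stand-in for the per-sentence LLM call; flags sentences
    # containing a placeholder word, purely for demonstration.
    return "terrible" in sentence.lower()

def fix_sentences(text: str) -> str:
    # Naive splitter: break on sentence-ending punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    clean = [s for s in sentences if not is_toxic(s)]
    return " ".join(clean)

print(fix_sentences("The weather is nice. You are a terrible person. Have a good day."))
# The weather is nice. Have a good day.
```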
Dependencies:
- `guardrails-ai>=0.4.0`
- `litellm`
Foundation model access keys:
- `ANTHROPIC_API_KEY` (required for the default Claude Haiku model)
- Or the appropriate API key for your chosen model (e.g., `OPENAI_API_KEY` for OpenAI models)
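For example, with the default model you would export the Anthropic key before running (the key value here is a placeholder):

```shell
export ANTHROPIC_API_KEY="sk-ant-..."   # or OPENAI_API_KEY for OpenAI models
```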
```bash
guardrails hub install hub://guardrails/toxic_language_llm
```

In this example, we apply the validator to a string output generated by an LLM.
```python
# Import Guard and Validator
from guardrails.hub import ToxicLanguageLLM
from guardrails import Guard

# Use with default settings (sentence mode, threshold 0.5, Claude Haiku)
guard = Guard().use(ToxicLanguageLLM)

guard.validate("The weather is beautiful today.")  # Validator passes
guard.validate("You are a terrible person.")  # Validator fails
```

```python
from guardrails.hub import ToxicLanguageLLM
from guardrails import Guard

# Strict full-text validation with a lower threshold
guard = Guard().use(
    ToxicLanguageLLM,
    threshold=0.3,
    validation_method="full",
    on_fail="exception",
)

guard.validate("The project is going well.")  # Validator passes
```

```python
from guardrails.hub import ToxicLanguageLLM
from guardrails import Guard

# Use an OpenAI model instead of the default Claude Haiku
guard = Guard().use(
    ToxicLanguageLLM,
    model="openai/gpt-4o-mini",
    on_fail="fix",
)

result = guard.validate("Clean sentence. Toxic sentence here.")
# result.validated_output contains only the clean sentences
```

In this example, we apply the validator to a string field of a JSON output generated by an LLM.
```python
# Import Guard and Validator
from pydantic import BaseModel, Field
from guardrails.hub import ToxicLanguageLLM
from guardrails import Guard

# Initialize Validator
val = ToxicLanguageLLM(threshold=0.5, validation_method="sentence")

# Create Pydantic BaseModel
class ChatResponse(BaseModel):
    user_name: str
    message: str = Field(validators=[val])

# Create a Guard to check for valid Pydantic output
guard = Guard.from_pydantic(output_class=ChatResponse)

# Run LLM output generating JSON through guard
guard.parse("""
{
    "user_name": "Alice",
    "message": "Hello, how are you today?"
}
""")
```

`__init__(self, threshold=0.5, validation_method="sentence", model=None, on_fail="noop")`
Initializes a new instance of the ToxicLanguageLLM class.

Parameters

- `threshold` (float): Confidence score threshold for toxicity classification. Scores at or above this value are flagged as toxic. Defaults to `0.5`.
- `validation_method` (str): Either `"sentence"` to evaluate individual sentences or `"full"` to evaluate the entire text. Defaults to `"sentence"`.
- `model` (str, optional): LiteLLM model identifier to use for toxicity detection. Defaults to the latest Claude Haiku model (`anthropic/claude-haiku-4-5-20251001`).
- `on_fail` (str, Callable): The policy to enact when a validator fails. If `str`, must be one of `reask`, `fix`, `filter`, `refrain`, `noop`, `exception`, or `fix_reask`. Otherwise, must be a function that is called when the validator fails.
`validate(self, value, metadata) -> ValidationResult`

Validates the given `value` for toxic language using the configured LLM, relying on the `metadata` provided to customize the validation process. This method is automatically invoked by `guard.parse(...)` or `guard.validate(...)`.

Note:

1. This method should not be called directly by the user. Instead, invoke `guard.parse(...)` or `guard.validate(...)` where this method will be called internally for each associated Validator.
2. When invoking `guard.parse(...)`, ensure to pass the appropriate `metadata` dictionary that includes keys and values required by this validator. If `guard` is associated with multiple validators, combine all necessary metadata into a single dictionary.

Parameters

- `value` (Any): The input text to validate.
- `metadata` (dict): A dictionary containing metadata. This validator does not require any specific metadata keys.