Toxic Language

16 min

the toxic language guardrail detects harmful content including hate speech, threats, insults, and other communication that could damage your community or brand unlike simple keyword filters, this guardrail uses ai to understand context, tone, and intent when to use this guardrail you should use toxic language detection when you need to protect your community from harmful interactions, prevent your ai from receiving poisoned context that could influence its behavior, or maintain brand safety in ai generated outputs this guardrail is particularly valuable in applications with user generated content like forums, chat systems, and comment sections, as well as in customer service scenarios where you need to catch hostile messages before they reach agents or ai systems the key advantage over simple keyword filtering is that this guardrail understands nuance someone can be hostile without using profanity, and they can use strong language without being hostile the guardrail analyzes intent and tone, not just word choice understanding sensitivity levels the sensitivity setting controls how strict the guardrail is think of it as adjusting the threshold for what constitutes a violation this is one of the most important configuration choices you'll make because it fundamentally changes what content passes or fails low sensitivity flags only severe violations like explicit threats and extreme hate speech when you set sensitivity to low, you're saying that you expect strong opinions and robust disagreement, and you only want to block content that crosses a clear line into threatening or hateful territory for example, "i strongly disagree with this approach and think it's misguided" would pass at low sensitivity, as would "this is a terrible idea " only content like "i will find you and hurt you" or explicit hate speech would fail low sensitivity works well for public forums where debate is expected, professional feedback environments where directness is valued, and technical communities where people discuss contentious topics the tradeoff is that some content that makes people uncomfortable might still pass through medium sensitivity is the default setting and represents a balanced approach at this level, the guardrail flags clear violations including insults and hostile language while still allowing professional disagreement and constructive criticism a message like "i disagree with your reasoning" would pass, but "you're an idiot" or "people like you are the problem" would fail medium sensitivity works well for most applications including customer service systems, business communications, collaborative tools, and social platforms it strikes a balance between allowing meaningful discourse and maintaining a respectful environment high sensitivity creates the strictest environment by flagging any potentially toxic content including mild rudeness and dismissive language at this level, even content like "whatever, dude" or "that's pretty dumb" would fail only respectful, neutral content passes at high sensitivity high sensitivity is appropriate for children's applications where you need maximum protection, educational platforms where you want to model respectful communication, safe spaces and support communities where people need to feel secure, and compliance critical contexts where any potential issue needs to be caught configuration options the toxic language guardrail accepts several configuration options that let you tune its behavior for your specific use case // available options // sensitivity "low", "medium", or "high" (default "medium") // model model identifier (default claude 3 5 haiku) // temperature 0 1, lower values = more consistent (default 0 1) // maxtokens response length limit (default 200) await abv guardrails toxiclanguage validate(text, { sensitivity "medium", model "model name", temperature 0 1, maxtokens 200, }); \# available options \# sensitivity "low", "medium", or "high" (default "medium") \# model model identifier (default claude 3 5 haiku) \# temperature 0 1, lower values = more consistent (default 0 1) \# maxtokens response length limit (default 200) abv guardrails toxic language validate(text, { "sensitivity" "medium", "model" "model name", "temperature" 0 1, "maxtokens" 200 }) the sensitivity option is the most important and you'll use it frequently the model, temperature, and maxtokens options are advanced settings that you typically won't need to change the default model is optimized for guardrail tasks and provides the best balance of speed, accuracy, and cost the default temperature of 0 1 ensures consistent results the default maxtokens of 200 is sufficient for the explanation field real world examples let's look at concrete examples of how different sensitivity levels handle various types of content understanding these patterns will help you choose the right sensitivity for your application consider a message like "i disagree with your approach to this problem " this is professional disagreement and passes at all sensitivity levels the language is neutral and respectful despite expressing disagreement now consider "this is a terrible idea and shows poor judgment " this passes at low and medium sensitivity because while it's critical, it focuses on the idea rather than attacking the person however, it might fail at high sensitivity because "terrible" and "poor judgment" could be seen as dismissive a message like "you don't know what you're talking about" fails at medium and high sensitivity because it attacks the person's competence directly it might pass at low sensitivity since it doesn't contain explicit threats or hate speech, though it's borderline content like "you're an idiot" or "people like you are the problem" fails at all sensitivity levels these are clear personal attacks with no constructive value finally, explicit threats like "i will find you and hurt you" fail at all sensitivity levels with maximum confidence this is unambiguous toxic content implementation patterns here's how you'd typically use toxic language detection in different parts of your application for input validation, you check user messages before sending them to your ai or displaying them to other users async function validateusermessage(message string) promise\<boolean> { const result = await abv guardrails toxiclanguage validate( message, { sensitivity "medium" } ); if (result status === "pass") { return true; } // log the reason for monitoring, but don't expose it to the user console log("blocked message ", result reason); return false; } // usage in your message handler if (await validateusermessage(userinput)) { await processmessage(userinput); } else { return { error "your message violates our community guidelines " }; } async def validate user message(message str) > bool result = await abv guardrails toxic language validate async( message, {"sensitivity" "medium"} ) if result\["status"] == "pass" return true \# log the reason for monitoring, but don't expose it to the user print(f"blocked message {result\['reason']}") return false \# usage in your message handler if await validate user message(user input) await process message(user input) else return {"error" "your message violates our community guidelines "} for output validation, you check ai generated responses before showing them to users async function generatesaferesponse(prompt string) promise\<string> { // generate initial response let response = await callai(prompt); // validate the response const validation = await abv guardrails toxiclanguage validate( response, { sensitivity "high" } ); // if toxic, regenerate with explicit safety instruction if (validation status === "fail") { response = await callai( prompt + "\n\nimportant respond in a professional, respectful tone " ); } return response; } async def generate safe response(prompt str) > str \# generate initial response response = await call ai(prompt) \# validate the response validation = await abv guardrails toxic language validate async( response, {"sensitivity" "high"} ) \# if toxic, regenerate with explicit safety instruction if validation\["status"] == "fail" response = await call ai( f"{prompt}\n\nimportant respond in a professional, respectful tone " ) return response for handling ambiguous cases, you might implement a review queue for unsure results async function handleusercontent(content string) { const result = await abv guardrails toxiclanguage validate( content, { sensitivity "medium" } ); if (result status === "pass") { // content is clearly acceptable await publishcontent(content); } else if (result status === "fail" && result confidence > 0 8) { // high confidence violation, auto reject await rejectcontent(content, "community guidelines violation"); } else { // low confidence or unsure flag for human review await flagformoderation(content, result); } } async def handle user content(content str) result = await abv guardrails toxic language validate async( content, {"sensitivity" "medium"} ) if result\["status"] == "pass" \# content is clearly acceptable await publish content(content) elif result\["status"] == "fail" and result\["confidence"] > 0 8 \# high confidence violation, auto reject await reject content(content, "community guidelines violation") else \# low confidence or unsure flag for human review await flag for moderation(content, result) performance optimization since toxic language detection uses ai, it takes one to three seconds per check and consumes tokens you can optimize performance by running a fast rule based check first to catch obvious violations before making the expensive ai call async function efficienttoxiccheck(text string) promise\<boolean> { // quick check for explicitly forbidden terms (under 10ms, free) const quickcheck = await abv guardrails containsstring validate( text, { strings \["explicit slur", "forbidden term"], mode "none", } ); // if quick check fails, no need for expensive ai check if (quickcheck status === "fail") { return false; } // only run ai check if quick check passed const deepcheck = await abv guardrails toxiclanguage validate(text); return deepcheck status === "pass"; } async def efficient toxic check(text str) > bool \# quick check for explicitly forbidden terms (under 10ms, free) quick check = await abv guardrails contains string validate async( text, { "strings" \["explicit slur", "forbidden term"], "mode" "none" } ) \# if quick check fails, no need for expensive ai check if quick check\["status"] == "fail" return false \# only run ai check if quick check passed deep check = await abv guardrails toxic language validate async(text) return deep check\["status"] == "pass" security best practices never expose the reason field to end users the reason explains why content failed validation, and exposing this information helps bad actors learn how to evade your guardrails instead, use generic error messages while logging the detailed reason internally for monitoring and improvement // bad exposes validation logic if (result status === "fail") { return { error result reason }; // don't do this! } // good generic message, internal logging if (result status === "fail") { logger info("blocked toxic content", { reason result reason }); return { error "your message violates our community guidelines " }; } \# bad exposes validation logic if result\["status"] == "fail" return {"error" result\["reason"]} # don't do this! \# good generic message, internal logging if result\["status"] == "fail" logger info(f"blocked toxic content {result\['reason']}") return {"error" "your message violates our community guidelines "} choosing the right sensitivity here's a decision framework for choosing sensitivity based on your application type if you're building for children or vulnerable populations, always use high sensitivity the potential harm from allowing toxic content through far outweighs the cost of false positives if you're building a customer facing application like customer service, social media, or collaborative tools, medium sensitivity is usually appropriate it catches clear violations while allowing professional disagreement if you're building for professional or technical audiences where robust debate is expected, consider low sensitivity technical forums, code review systems, and professional feedback tools benefit from allowing strong opinions you can also adjust sensitivity based on user context authenticated users with good history might get lower sensitivity while anonymous users get higher sensitivity users who identify as minors automatically get high sensitivity regardless of the default setting next steps the toxic language guardrail is often used alongside other guardrails for comprehensive content validation consider combining it with biased language detection for a more complete content safety solution you might also want to use contains string to quickly catch explicit forbidden terms before running the more expensive toxic language check for more detailed implementation guidance, see the best practices documentation which covers optimization strategies, error handling, and monitoring approaches