Harm categories and severity levels in Microsoft Foundry

Note

This article refers to the Microsoft Foundry (classic) portal.


Guardrails in Microsoft Foundry help ensure that AI-generated outputs align with ethical guidelines and safety standards. Guardrails classify harmful content into four categories (hate, sexual, violence, and self-harm), each graded at four severity levels (safe, low, medium, and high) for both text and image content. Use these categories and levels to configure guardrail controls that detect and mitigate risks associated with harmful content in your model deployments and agents.

For an overview of how guardrails work, see Guardrails and controls overview.

The content safety system uses neural multiclass classification models to detect and filter harmful content for both text and image. Content detected at the "safe" severity level is labeled in annotations but isn't subject to filtering and isn't configurable.
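As an illustration of how those per-category annotations can be consumed, the sketch below parses a sample annotation and maps numeric severities to the four labels. The response shape and the numeric scale (0, 2, 4, 6) are assumptions modeled on the Azure AI Content Safety text-analysis output; field names and scales may differ in your deployment, so treat this as a minimal sketch rather than the service's actual contract.

```python
# Minimal sketch: labeling a content-safety annotation.
# The annotation shape and the 0/2/4/6 severity scale are assumptions
# modeled on the Azure AI Content Safety text-analysis response.
sample_annotation = {
    "categoriesAnalysis": [
        {"category": "Hate", "severity": 0},
        {"category": "Sexual", "severity": 0},
        {"category": "Violence", "severity": 2},
        {"category": "SelfHarm", "severity": 0},
    ]
}

SEVERITY_LABELS = {0: "safe", 2: "low", 4: "medium", 6: "high"}

def label_severities(annotation):
    """Map each category's numeric severity to its label."""
    return {
        item["category"]: SEVERITY_LABELS.get(item["severity"], "unknown")
        for item in annotation["categoriesAnalysis"]
    }

print(label_severities(sample_annotation))
```

Note that every category is annotated even when no harm is detected; "safe" entries are informational only and never trigger filtering.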

Note

The text content safety models for the hate, sexual, violence, and self-harm categories are trained and tested on the following languages: English, German, Japanese, Spanish, French, Italian, Portuguese, and Chinese. The service can work in many other languages, but quality might vary. In all cases, do your own testing to ensure that the filtering works for your application.

Harm category descriptions

The following table summarizes the harm categories supported by Foundry guardrails:

Hate and fairness
Hate and fairness-related harms refer to any content that attacks or uses discriminatory language with reference to a person or identity group based on certain differentiating attributes of these groups.

This category includes, but isn't limited to:
• Race, ethnicity, nationality
• Gender identity groups and expression
• Sexual orientation
• Religion
• Personal appearance and body size
• Disability status
• Harassment and bullying

Sexual
Sexual describes language related to anatomical organs and genitals, romantic relationships and sexual acts, and acts portrayed in erotic or affectionate terms, including those portrayed as an assault or a forced sexual violent act against one's will.

This category includes, but isn't limited to:
• Vulgar content
• Prostitution
• Nudity and pornography
• Abuse
• Child exploitation, child abuse, child grooming

Violence
Violence describes language related to physical actions intended to hurt, injure, damage, or kill someone or something, and language that describes weapons, guns, and related entities.

This category includes, but isn't limited to:
• Weapons
• Bullying and intimidation
• Terrorist and violent extremism
• Stalking

Self-harm
Self-harm describes language related to physical actions intended to purposely hurt, injure, or damage one's body, or to kill oneself.

This category includes, but isn't limited to:
• Eating disorders
• Bullying and intimidation

Severity levels

The content safety system classifies harmful content at four severity levels:

Safe: No harmful material detected. Annotated but never filtered.
Low: Mild harmful material. Includes prejudiced views, mild depictions in fictional contexts, or personal experiences.
Medium: Moderate harmful material. Includes graphic depictions, bullying, or content that promotes harmful acts.
High: Severe harmful material. Includes extremist content, explicit depictions, or content that endorses serious harm.

How severity levels map to guardrail configuration

When you configure a guardrail control for a harm category, you set a severity threshold that determines which content is flagged:

Off: Detection is disabled for this category. No content is flagged or blocked.
Low: Flags content at low severity and higher. This is the least restrictive setting.
Medium: Flags content at medium severity and higher.
High: Flags only the most severe content. This is the most restrictive setting.

Content at the "safe" level is always annotated but never blocked, regardless of threshold setting. To configure these thresholds, see How to configure guardrails and controls.
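The threshold behavior described above can be sketched as a small decision function. This is not the Foundry implementation, only an illustration of the mapping the table implies: "safe" content is never flagged, an "off" threshold disables detection, and otherwise content is flagged when its detected severity meets or exceeds the configured threshold.

```python
# Illustrative sketch of the severity-threshold logic described in the
# table above; not the actual Foundry guardrail implementation.
LEVEL_ORDER = {"safe": 0, "low": 1, "medium": 2, "high": 3}

def is_flagged(detected_level: str, threshold: str) -> bool:
    """Return True if content at detected_level is flagged under threshold.

    threshold is one of "off", "low", "medium", or "high".
    Content at the "safe" level is annotated but never flagged.
    """
    if threshold == "off" or detected_level == "safe":
        return False
    return LEVEL_ORDER[detected_level] >= LEVEL_ORDER[threshold]

print(is_flagged("medium", "low"))   # medium severity meets a low threshold
print(is_flagged("low", "medium"))   # low severity is below a medium threshold
```

For example, a "low" threshold flags low, medium, and high content, while a "high" threshold flags only high-severity content.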

Detailed severity definitions for text

The following tables provide detailed descriptions and examples for each severity level within each harm category for text content.

Warning

This section contains examples of harmful content that might be disturbing to some readers.

Detailed severity definitions for images

The following tables provide detailed descriptions and examples for each severity level within each harm category for image content.

Warning

This section contains examples of harmful content that might be disturbing to some readers.

Next steps