Clear cases of false positives - incorrect flagging of harmless content - from Azure OpenAI

Venkatesan, Sangeetha
2025-05-15T17:40:26.44+00:00

Let's start with why content safety is needed. Sure, we need to guardrail the input prompts being sent to the model. But it comes down to the system becoming unusable if there are consistent false positives - incorrect identification of harmless content, where user prompts are wrongly flagged as jailbreaks and fail with BadRequestError: Content Filter Triggered (a minimal handling sketch follows below).
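For reference, here is a minimal sketch of how that error surfaces with the openai Python SDK (v1) against Azure OpenAI, and how to inspect which filter category fired. The endpoint, key, and deployment name are placeholders, and the error-body check (code / innererror / content_filter_result) is based on Azure's documented content-filter error shape, so adjust it to what your application actually receives:

```python
from openai import AzureOpenAI, BadRequestError

client = AzureOpenAI(
    azure_endpoint="https://my-resource.openai.azure.com",  # placeholder endpoint
    api_key="<api-key>",                                    # placeholder key
    api_version="2024-06-01",
)

try:
    resp = client.chat.completions.create(
        model="gpt-4o-mini-deployment",  # hypothetical deployment name
        messages=[{"role": "user", "content": "Suggest follow-up questions for my last question."}],
    )
    print(resp.choices[0].message.content)
except BadRequestError as e:
    body = e.body if isinstance(e.body, dict) else {}
    if body.get("code") == "content_filter":
        # Azure reports the per-category results (hate, jailbreak, etc.) under
        # innererror.content_filter_result; log it to see what actually fired.
        inner = body.get("innererror") or {}
        print("Content filter tripped:", inner.get("content_filter_result"))
    else:
        raise
```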

Consider that I have a simple prompt that generates a list of follow-up questions based on the user question and chat_history - a very simple prompt (sketched below).
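To make this concrete, here is a hypothetical reconstruction of that kind of prompt; the names build_followup_messages, user_question, and chat_history are my assumptions, not the actual application code:

```python
# Hypothetical reconstruction of the follow-up-question prompt described above.
def build_followup_messages(user_question: str, chat_history: list[dict]) -> list[dict]:
    system = (
        "You are an assistant that suggests follow-up questions. "
        "Given the conversation so far and the user's latest question, "
        "return a short list of three relevant follow-up questions."
    )
    return [
        {"role": "system", "content": system},
        *chat_history,  # prior turns, e.g. [{"role": "user", "content": "..."}, ...]
        {"role": "user", "content": user_question},
    ]
```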

I use the GPT-4o mini model, and users send questions to the application. Almost 10 percent of calls end up in this false-positive bucket.

Questions that need clear explanation in the documentation:

  1. What should the prompt look like, if it is essentially always flagged as a jailbreak?
  2. If the prompt regularly contains multilingual or code-based questions, I see a much larger chance of it being falsely flagged.
  3. For false positives, a retry is a quick band-aid the app team can apply (see the sketch after this list), but how does MSFT work on these false alarms, and how can we get a realistically deterministic view of the classification model? I understand it is probabilistic, but here the predictions are very stochastic in nature.
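For completeness, here is the retry band-aid from item 3 as a minimal sketch, assuming the openai Python SDK; since the classifier is probabilistic, an identical prompt that gets filtered sometimes passes on a second attempt. The deployment name and the error-body check are assumptions to adapt to your setup:

```python
import time

from openai import BadRequestError


def create_with_retry(client, messages, deployment="gpt-4o-mini-deployment",
                      max_attempts=3, backoff_s=1.0):
    """Retry only on content-filter 400s; re-raise anything else immediately."""
    for attempt in range(1, max_attempts + 1):
        try:
            return client.chat.completions.create(model=deployment, messages=messages)
        except BadRequestError as e:
            body = e.body if isinstance(e.body, dict) else {}
            if body.get("code") != "content_filter" or attempt == max_attempts:
                raise  # a different 400, or out of retries
            time.sleep(backoff_s * attempt)  # simple linear backoff before retrying
```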

Note: I have read all the documentation related to Azure content filtering and the Azure AI Content Safety resource.
