Clear cases of false positives (harmless content incorrectly flagged as harmful) from Azure OpenAI
Let's start with why there is a need for content safety. Sure, we need to guardrail the input prompts being sent to the model. But it comes down to the system becoming unusable when there are consistent false positives - harmless content incorrectly identified as harmful, with user prompts incorrectly flagged as jailbreaks: `BadRequestError: Content Filter Triggered`.
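For context, this is how the failure surfaces in the `openai` Python SDK. A minimal sketch, assuming the usual Azure environment variables and a hypothetical deployment name, and assuming the error body carries the documented `content_filter` code:

```python
import os

from openai import AzureOpenAI, BadRequestError

# Assumes the standard env vars; the deployment name below is hypothetical.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
)

try:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # Azure deployment name
        messages=[{"role": "user", "content": "A perfectly harmless question"}],
    )
    print(response.choices[0].message.content)
except BadRequestError as e:
    # Azure rejects the request before the model runs when the prompt
    # filter fires; the error body carries code "content_filter".
    if e.code == "content_filter":
        print("Harmless prompt blocked by the content filter (false positive)")
```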
Consider I have a simple prompt that generates a list of follow-up questions based on the user question and `chat_history` - a very simple prompt.
I use the GPT-4o mini model, and users send questions to the application. Almost 10 percent of calls end up in this false-positive bucket.
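To make the setup concrete, here is a minimal sketch of that follow-up-question call, reusing the `client` from the sketch above. The system prompt wording and function name are my paraphrase, not the exact production prompt:

```python
def generate_followup_questions(question: str, chat_history: list[dict]) -> str:
    # Illustrative prompt only; the real production prompt differs.
    system_prompt = (
        "Given the user's latest question and the chat history, "
        "generate a short list of relevant follow-up questions."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            *chat_history,  # prior turns as {"role": ..., "content": ...} dicts
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```

Nothing in this prompt asks the model to do anything unsafe, yet roughly 1 in 10 calls is blocked as a jailbreak.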
Questions that need clear explanation in the documentation:
- What does the prompt need to look like, if in essence it is almost always flagged as a jailbreak?
- When a prompt routinely contains multilingual or code-based questions, I see a much larger window of chances of it being falsely flagged.
- For false positives, a retry is a quick band-aid the app team can apply (see the sketch after this list). How does MSFT work on these false alarms, and how can we get a realistic, more deterministic view of the classification model? I can understand it is probabilistic, but here the predictions are very stochastic in nature.
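As said in the last point, a retry is only a band-aid, but it is what the app team can ship today. A minimal sketch, where the attempt count and backoff are arbitrary choices and `generate_followup_questions` is the illustrative helper from above; because the filter's predictions are stochastic, resending the identical payload often succeeds:

```python
import time

from openai import BadRequestError


def followups_with_retry(question: str, chat_history: list[dict],
                         max_attempts: int = 3) -> str:
    for attempt in range(1, max_attempts + 1):
        try:
            return generate_followup_questions(question, chat_history)
        except BadRequestError as e:
            # Only retry content-filter blocks; other 400s are genuine
            # client errors and should surface immediately.
            if e.code != "content_filter" or attempt == max_attempts:
                raise
            # Same payload, new roll of the dice against a stochastic filter.
            time.sleep(2 ** attempt)
```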
Note: I have read all the documentation related to Azure OpenAI content filtering and the Azure AI Content Safety resource.