Azure OpenAI API: Inconsistent false positive jailbreak detection for benign user prompt

Nikola Gavranović 40 Reputation points
2025-04-07T14:25:21.7433333+00:00

I'm experiencing inconsistent false positive jailbreak detections in Azure OpenAI when using basic scheduling commands in my agentic system.

The issue:

  • The same prompt sometimes works perfectly and other times triggers the jailbreak filter
  • This happens with completely benign and simple user requests
  • The system is built using LangChain with Azure OpenAI integration

Example conversation that sometimes works fine but occasionally fails with jailbreak detection:

User: "I want to set reporter to send me automatic daily reports!"

Bot: "The reporter has been successfully set up to send you automatic daily reports. You will receive them at 9:00 UTC each day."

User: "Disable it"

Bot: "The reporter has been successfully disabled. You will no longer receive automatic daily reports."

User: "Can you enable it now, but each 10th in the month"

Bot: ERROR - Content filter triggered with jailbreak detection

The error received:

openai.BadRequestError: Error code: 400 - {'error': {'message': "The response was filtered due to the prompt triggering Azure OpenAI's content management policy. Please modify your prompt and retry. To learn more about our content filtering policies please read our documentation: https://go.microsoft.com/fwlink/?linkid=2198766", 'type': None, 'param': 'prompt', 'code': 'content_filter', 'status': 400, 'innererror': {'code': 'ResponsibleAIPolicyViolation', 'content_filter_result': {'hate': {'filtered': False, 'severity': 'safe'}, 'jailbreak': {'filtered': True, 'detected': True}, 'self_harm': {'filtered': False, 'severity': 'safe'}, 'sexual': {'filtered': False, 'severity': 'safe'}, 'violence': {'filtered': False, 'severity': 'safe'}}}}}

This is causing reliability issues in our application as users receive unexpected errors during normal usage. There appears to be nothing in these prompts that could reasonably be interpreted as attempting to bypass safety measures or guidelines.

Technical details:

  • Model: gpt-4o
  • API version: 2024-08-01-preview
  • Temperature: 0.5

Note: This is an agentic system built with LangChain that includes additional system prompts and context beyond what I've shared here, but the issue occurs with these simple user interactions.

I would appreciate any guidance on how to address these inconsistent false positives or if this is a known issue with the content filtering system.


Accepted answer
  1. Prashanth Veeragoni 5,170 Reputation points Microsoft External Staff Moderator
    2025-04-07T16:26:22.67+00:00

    Hi ngavranovic,

    Even though your prompt seems benign, there are three main reasons Azure OpenAI might throw a false jailbreak detection:

    1. Contextual System Prompt Leakage

    LangChain agents often append:

      • System prompts
      • Tool call traces
      • History (which may include previous queries or intermediate outputs)
    If any prior content in the full context (not just the current user message) hints at "bypassing" behavior, the content filter might flag the entire sequence.

    Check the full prompt history and system prompt. They may contain words like:

      • “bypass”
      • “disable filters”
      • “simulate”
      • “override”

    Even words like “enable” and “disable” in certain formats can get misconstrued if prior context seems suspicious.

    2. Jailbreak Pattern Detection Is Probabilistic

    Azure uses pattern recognition trained to flag sequences that resemble jailbreak attempts. This includes:

      • Certain command patterns (e.g., “enable X”, “disable Y”)
      • Polite imperative tones (like “can you do X now…”)
      • Loops and conditions (like “each 10th in the month”) that look like scheduled automation; this can sometimes resemble the behavior-scripting patterns that jailbreaks use.

    Since these models work probabilistically, they can occasionally:

      • Flag a benign message
      • Not flag the same message another time (hence the inconsistency)

    3. Agentic Tools Can Introduce Risky Traces

    If LangChain tools, chains, or memory modules include debug output, agent traces, or partial completions, they may contain pattern-like signals that Azure's safety filter misinterprets.

    Solution Strategies:

    Inspect Full Prompt Sent to Azure:

    Attach a callback manager with a streaming handler so you can watch each run as it executes:

    from langchain.callbacks.manager import CallbackManager
    from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

    callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

    Or turn on LangChain's global debug output, which prints every prompt before it is sent:

    from langchain.globals import set_debug

    set_debug(True)
    response = llm.predict(_input)

    Log the exact request payload (system + history + user input) that triggers the block.

    Look for:

      • Previous content that might resemble jailbreak triggers
      • Phrases like “disable filters”, “simulate behavior”, “ignore instructions”
      • Long chain memory with too much agent trace
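
    If it helps to step outside LangChain for a moment, a minimal sketch along these lines (using the openai v1 Python SDK; the endpoint, key, and deployment name are placeholders) can log the exact payload and dump the per-category filter results that come back with the 400 error:

    import json

    from openai import AzureOpenAI, BadRequestError

    # Placeholder endpoint, key, and API version for illustration only.
    client = AzureOpenAI(
        azure_endpoint="https://<your-resource>.openai.azure.com",
        api_key="<your-key>",
        api_version="2024-08-01-preview",
    )

    def debug_call(messages, deployment="gpt-4o"):
        # Log the exact payload (system + history + user input) before sending it.
        print(json.dumps(messages, indent=2, ensure_ascii=False))
        try:
            return client.chat.completions.create(model=deployment, messages=messages)
        except BadRequestError as e:
            # e.body carries the innererror with the content_filter_result,
            # including whether 'jailbreak' was detected for this request.
            print(json.dumps(e.body, indent=2, default=str))
            raise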

    Reduce Context Size or Strip Tool Outputs:

    In LangChain, make sure the memory only carries the chat history, for example:

    from langchain.memory import ConversationBufferMemory

    memory = ConversationBufferMemory(return_messages=True, memory_key="chat_history")
    # ...but sanitize the stored messages before they are sent (see below)
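
    If the buffer grows long, it can also help to drop tool traces and keep only the most recent turns before each call. A rough sketch, assuming the standard langchain_core message classes (the helper name and turn limit are illustrative):

    from langchain_core.messages import AIMessage, HumanMessage

    def trim_history(messages, max_turns=6):
        # Keep only plain human/AI turns (dropping tool and function traces),
        # then keep just the most recent ones.
        chat_only = [m for m in messages if isinstance(m, (HumanMessage, AIMessage))]
        return chat_only[-max_turns * 2:]

    # e.g. trimmed = trim_history(memory.chat_memory.messages)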
    

    Consider manual sanitization:

    def sanitize_prompt(text):
        # Remove phrases that commonly trip the jailbreak heuristics
        # (case-sensitive substring replacement).
        banned_keywords = ["ignore", "bypass", "simulate", "jailbreak", "override", "disable safety"]
        for word in banned_keywords:
            text = text.replace(word, "")
        return text

    Then apply it to all previous messages or intermediate outputs.
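
    For example, applied to a ConversationBufferMemory it could look like this sketch (attribute names assume the standard buffer memory; adapt for your own memory class):

    # Rewrite the stored conversation in place before the next model call.
    for message in memory.chat_memory.messages:
        message.content = sanitize_prompt(message.content)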

    Use a Structured System Prompt to Reduce Ambiguity:

    Sometimes giving more structure helps:

    prompt = """
    You are a scheduling assistant. Only follow safe instructions.
    If a user asks for something unusual or unclear, ask for clarification.
    """
    

    Giving the model a structured system prompt reduces the chance of misinterpretation.
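
    Wired into a chat call, that might look like the following sketch (openai v1 SDK; the endpoint, key, and deployment name are placeholders):

    from openai import AzureOpenAI

    client = AzureOpenAI(
        azure_endpoint="https://<your-resource>.openai.azure.com",
        api_key="<your-key>",
        api_version="2024-08-01-preview",
    )

    messages = [
        {"role": "system", "content": (
            "You are a scheduling assistant. Only follow safe instructions. "
            "If a user asks for something unusual or unclear, ask for clarification."
        )},
        {"role": "user", "content": "Can you enable it now, but each 10th in the month"},
    ]

    # 'gpt-4o' stands in for your Azure deployment name.
    response = client.chat.completions.create(model="gpt-4o", messages=messages)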

    Retry Strategy with Backoff:

    Because the flag is inconsistent, implementing a retry fallback helps:

    import time

    import openai

    def call_with_retry(api_func, retries=2):
        # Retry only when the 400 error was caused by the jailbreak filter;
        # re-raise anything else immediately.
        for attempt in range(retries):
            try:
                return api_func()
            except openai.BadRequestError as e:
                if "jailbreak" in str(e):
                    print("Retrying due to content filter...")
                    time.sleep(1.5 * (attempt + 1))  # simple linear backoff
                    continue
                raise
        raise Exception("All retries failed.")
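
    Used with the client and messages from the sketches above, the wrapper takes a zero-argument callable (illustrative only):

    response = call_with_retry(
        lambda: client.chat.completions.create(model="gpt-4o", messages=messages),
        retries=3,
    )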
    

    Hope this helps. If you have any follow-up questions, please let me know. I would be happy to help.


    Please do not forget to "Accept the answer" and "up-vote" wherever the information provided helps you, this can be beneficial to other community members.

    Thank you! 

