Azure OpenAI API: Inconsistent false positive jailbreak detection for benign user prompt

Nikola Gavranović 40 Reputation points
2025-04-07T14:25:21.7433333+00:00

I'm experiencing inconsistent false positive jailbreak detections in Azure OpenAI when using basic scheduling commands in my agentic system.

The issue:

  • The same prompt sometimes works perfectly and other times triggers the jailbreak filter
  • This happens with completely benign and simple user requests
  • The system is built using LangChain with Azure OpenAI integration

Example conversation that sometimes works fine but occasionally fails with jailbreak detection:

User: "I want to set reporter to send me automatic daily reports!"

Bot: "The reporter has been successfully set up to send you automatic daily reports. You will receive them at 9:00 UTC each day."

User: "Disable it"

Bot: "The reporter has been successfully disabled. You will no longer receive automatic daily reports."

User: "Can you enable it now, but each 10th in the month"

Bot: ERROR - Content filter triggered with jailbreak detection

The error received:

openai.BadRequestError: Error code: 400 - {'error': {'message': "The response was filtered due to the prompt triggering Azure OpenAI's content management policy. Please modify your prompt and retry. To learn more about our content filtering policies please read our documentation: https://go.microsoft.com/fwlink/?linkid=2198766", 'type': None, 'param': 'prompt', 'code': 'content_filter', 'status': 400, 'innererror': {'code': 'ResponsibleAIPolicyViolation', 'content_filter_result': {'hate': {'filtered': False, 'severity': 'safe'}, 'jailbreak': {'filtered': True, 'detected': True}, 'self_harm': {'filtered': False, 'severity': 'safe'}, 'sexual': {'filtered': False, 'severity': 'safe'}, 'violence': {'filtered': False, 'severity': 'safe'}}}}}

This is causing reliability issues in our application as users receive unexpected errors during normal usage. There appears to be nothing in these prompts that could reasonably be interpreted as attempting to bypass safety measures or guidelines.

Technical details:

  • Model: gpt-4o
  • API version: 2024-08-01-preview
  • Temperature: 0.5

Note: This is an agentic system built with LangChain that includes additional system prompts and context beyond what I've shared here, but the issue occurs with these simple user interactions.

I would appreciate any guidance on how to address these inconsistent false positives or if this is a known issue with the content filtering system.


Accepted answer
  1. Prashanth Veeragoni 5,170 Reputation points Microsoft External Staff Moderator
    2025-04-07T16:26:22.67+00:00

    Hi ngavranovic,

    Even though your prompt seems benign, there are three main reasons Azure OpenAI might throw a false jailbreak detection:

    1. Contextual System Prompt Leakage

    LangChain agents often append:

      • System prompts
      • Tool call traces
      • History (which may include previous queries or intermediate outputs)
    If any prior content in the full context (not just the current user message) hints at "bypassing" behavior, the content filter might flag the entire sequence.

    Check the full prompt history and system prompt. They may contain words like:

      • “bypass”
      • “disable filters”
      • “simulate”
      • “override”

    Even words like “enable” and “disable” in certain formats can get misconstrued if prior context seems suspicious.

    2. Jailbreak Pattern Detection Is Probabilistic

    Azure uses pattern recognition trained to flag sequences that resemble jailbreak attempts. This includes:

      • Certain command patterns (e.g., “enable X”, “disable Y”)
      • Polite imperative tones (like “can you do X now…”)
      • Loops and conditions (like “each 10th in the month”) that look like scheduled automation; this can sometimes resemble the behavior-scripting patterns that jailbreaks use.

    Since these models work probabilistically, they can occasionally:

      • Flag a benign message
      • Not flag the same message another time (hence the inconsistency)

    3. Agentic Tools Can Introduce Risky Traces

    If LangChain tools, chains, or memory modules include debug output, agent traces, or partial completions, they may contain pattern-like signals that Azure's safety filter misinterprets.

    Solution Strategies:

    Inspect Full Prompt Sent to Azure:

    Attach a callback manager with a streaming handler so you can watch each run as it executes:

    from langchain.callbacks.manager import CallbackManager
    from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

    callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

    Or turn on LangChain's global debug output, which prints every prompt before it is sent:

    from langchain.globals import set_debug

    set_debug(True)
    response = llm.predict(_input)

    Log the exact request payload (system + history + user input) that triggers the block.

    Look for:

      • Previous content that might resemble jailbreak triggers
      • Phrases like “disable filters”, “simulate behavior”, “ignore instructions”
      • Long chain memory with too much agent trace
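
    If it helps to step outside LangChain for a moment, a minimal sketch along these lines (using the openai v1 Python SDK; the endpoint, key, and deployment name are placeholders) can log the exact payload and dump the per-category filter results that come back with the 400 error:

    import json

    from openai import AzureOpenAI, BadRequestError

    # Placeholder endpoint, key, and API version for illustration only.
    client = AzureOpenAI(
        azure_endpoint="https://<your-resource>.openai.azure.com",
        api_key="<your-key>",
        api_version="2024-08-01-preview",
    )

    def debug_call(messages, deployment="gpt-4o"):
        # Log the exact payload (system + history + user input) before sending it.
        print(json.dumps(messages, indent=2, ensure_ascii=False))
        try:
            return client.chat.completions.create(model=deployment, messages=messages)
        except BadRequestError as e:
            # e.body carries the innererror with the content_filter_result,
            # including whether 'jailbreak' was detected for this request.
            print(json.dumps(e.body, indent=2, default=str))
            raise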

    Reduce Context Size or Strip Tool Outputs:

    In LangChain, make sure the memory only carries the chat history, for example:

    from langchain.memory import ConversationBufferMemory

    memory = ConversationBufferMemory(return_messages=True, memory_key="chat_history")
    # ...but sanitize the stored messages before they are sent (see below)
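
    If the buffer grows long, it can also help to drop tool traces and keep only the most recent turns before each call. A rough sketch, assuming the standard langchain_core message classes (the helper name and turn limit are illustrative):

    from langchain_core.messages import AIMessage, HumanMessage

    def trim_history(messages, max_turns=6):
        # Keep only plain human/AI turns (dropping tool and function traces),
        # then keep just the most recent ones.
        chat_only = [m for m in messages if isinstance(m, (HumanMessage, AIMessage))]
        return chat_only[-max_turns * 2:]

    # e.g. trimmed = trim_history(memory.chat_memory.messages)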
    

    Consider manual sanitization:

    def sanitize_prompt(text):
        # Remove phrases that commonly trip the jailbreak heuristics
        # (case-sensitive substring replacement).
        banned_keywords = ["ignore", "bypass", "simulate", "jailbreak", "override", "disable safety"]
        for word in banned_keywords:
            text = text.replace(word, "")
        return text

    Then apply it to all previous messages or intermediate outputs.
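
    For example, applied to a ConversationBufferMemory it could look like this sketch (attribute names assume the standard buffer memory; adapt for your own memory class):

    # Rewrite the stored conversation in place before the next model call.
    for message in memory.chat_memory.messages:
        message.content = sanitize_prompt(message.content)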

    Use a Structured System Prompt to Reduce Ambiguity:

    Sometimes giving more structure helps:

    prompt = """
    You are a scheduling assistant. Only follow safe instructions.
    If a user asks for something unusual or unclear, ask for clarification.
    """
    

    Giving the model a structured system prompt reduces the chance of misinterpretation.
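
    Wired into a chat call, that might look like the following sketch (openai v1 SDK; the endpoint, key, and deployment name are placeholders):

    from openai import AzureOpenAI

    client = AzureOpenAI(
        azure_endpoint="https://<your-resource>.openai.azure.com",
        api_key="<your-key>",
        api_version="2024-08-01-preview",
    )

    messages = [
        {"role": "system", "content": (
            "You are a scheduling assistant. Only follow safe instructions. "
            "If a user asks for something unusual or unclear, ask for clarification."
        )},
        {"role": "user", "content": "Can you enable it now, but each 10th in the month"},
    ]

    # 'gpt-4o' stands in for your Azure deployment name.
    response = client.chat.completions.create(model="gpt-4o", messages=messages)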

    Retry Strategy with Backoff:

    Because the flag is inconsistent, implementing a retry fallback helps:

    import time

    import openai

    def call_with_retry(api_func, retries=2):
        # Retry only when the 400 error was caused by the jailbreak filter;
        # re-raise anything else immediately.
        for attempt in range(retries):
            try:
                return api_func()
            except openai.BadRequestError as e:
                if "jailbreak" in str(e):
                    print("Retrying due to content filter...")
                    time.sleep(1.5 * (attempt + 1))  # simple linear backoff
                    continue
                raise
        raise Exception("All retries failed.")
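
    Used with the client and messages from the sketches above, the wrapper takes a zero-argument callable (illustrative only):

    response = call_with_retry(
        lambda: client.chat.completions.create(model="gpt-4o", messages=messages),
        retries=3,
    )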
    

    Hope this helps. If you have any follow-up questions, please let me know. I would be happy to help.


    Please do not forget to "Accept the answer" and "up-vote" wherever the information provided helps you, this can be beneficial to other community members.

    Thank you! 

