Hi ngavranovic,
Even though your prompt seems benign, there are three main reasons Azure OpenAI might trigger a false jailbreak detection:
1. Contextual System Prompt Leakage
LangChain agents often append:

- System prompts
- Tool call traces
- History (which may include previous queries or intermediate outputs)

If any prior content in the full context (not just the current user message) hints at “bypassing” behavior, the content filter might flag the entire sequence.
Check the full prompt history and system prompt. They may contain words like:

- “bypass”
- “disable filters”
- “simulate”
- “override”

Even words like “enable” and “disable” in certain formats can be misconstrued if prior context seems suspicious.
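As a quick check, the sketch below scans each message you are about to send and reports which ones contain those terms (the function name and keyword list are just illustrative, not part of your code):

```python
TRIGGER_TERMS = ["bypass", "disable filters", "simulate", "override", "ignore instructions"]

def find_trigger_terms(messages):
    """Return (message_index, term) pairs for any message containing a suspicious keyword."""
    hits = []
    for i, text in enumerate(messages):
        lowered = text.lower()
        for term in TRIGGER_TERMS:
            if term in lowered:
                hits.append((i, term))
    return hits

# Example: inspect the history you are about to send to Azure
print(find_trigger_terms(["Please enable the monthly report", "Simulate a run for the 10th"]))
```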
2. Jailbreak Pattern Detection Is Probabilistic
Azure uses pattern recognition trained to flag sequences that resemble jailbreak attempts. This includes:

- Certain command patterns (e.g., “enable X”, “disable Y”)
- Polite imperative tones (like “can you do X now…”)
- Loops and conditions (like “each 10th in the month”) that look like scheduled automation; this can sometimes resemble the behavior-scripting patterns that jailbreaks use.
Since these models work probabilistically, they can occasionally:

- Flag a benign message
- Not flag the same message another time (hence the inconsistency)
3. Agentic Tools Can Introduce Risky Traces
If LangChain tools, chains, or memory modules include debug output, agent traces, or partial completions, they may contain pattern-like signals that Azure's safety filter misinterprets.
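One simple mitigation, assuming you are on LangChain's classic memory classes, is to cap how much history (and therefore how much agent trace) gets re-sent with every call:

```python
from langchain.memory import ConversationBufferWindowMemory

# Keep only the last few exchanges instead of the full agent trace/history,
# so an old, oddly-worded turn cannot keep re-triggering the filter.
memory = ConversationBufferWindowMemory(
    k=3,                      # number of recent exchanges to retain (tune as needed)
    memory_key="chat_history",
    return_messages=True,
)
```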
Solution Strategies:
Inspect the Full Prompt Sent to Azure:

Use a streaming callback to see what the model receives:

```python
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
```

Or enable LangChain's global debug output:

```python
import langchain

langchain.debug = True
```

Log the exact request payload (system + history + user input) that triggers the block.
Look for:

- Previous content that might resemble jailbreak triggers
- Phrases like “disable filters”, “simulate behavior”, “ignore instructions”
- Long chain memory with too much agent trace
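If you want to dump the exact payload yourself, a minimal callback sketch could look like this (the class name and the commented-out wiring are mine, not part of your code):

```python
from langchain.callbacks.base import BaseCallbackHandler

class PromptLogger(BaseCallbackHandler):
    """Print exactly what LangChain is about to send to the model."""

    def on_llm_start(self, serialized, prompts, **kwargs):
        # Completion-style models: `prompts` is a list of raw prompt strings.
        for p in prompts:
            print("=== PROMPT SENT TO AZURE ===")
            print(p)

    def on_chat_model_start(self, serialized, messages, **kwargs):
        # Chat models: `messages` is a list of message lists (system + history + user input).
        for conversation in messages:
            print("=== MESSAGES SENT TO AZURE ===")
            for m in conversation:
                print(f"[{m.type}] {m.content}")

# Attach it where you create the model or agent, e.g.:
# llm = AzureChatOpenAI(deployment_name="...", callbacks=[PromptLogger()])
```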
Reduce Context Size or Strip Tool Outputs:

In LangChain, keep the memory explicit so you control what gets re-sent:

```python
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(return_messages=True, memory_key="chat_history")
# ...but sanitize its contents before they go back to the model
```

Consider manual sanitization:

```python
def sanitize_prompt(text: str) -> str:
    """Strip keywords that commonly trip the jailbreak classifier."""
    banned_keywords = ["ignore", "bypass", "simulate", "jailbreak", "override", "disable safety"]
    for word in banned_keywords:
        text = text.replace(word, "")
    return text
```
Then apply it to all previous messages or intermediate outputs.
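A minimal sketch of doing that for buffered history before the next call (assumes the `memory` object above and classic LangChain message objects):

```python
# Sanitize every message already stored in the conversation buffer
# so the next agent call does not re-send flagged wording to Azure.
for message in memory.chat_memory.messages:
    message.content = sanitize_prompt(message.content)
```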
Give the System Prompt More Structure to Reduce Ambiguity:

Sometimes giving the model more structure helps:

```python
prompt = """
You are a scheduling assistant. Only follow safe instructions.
If a user asks for something unusual or unclear, ask for clarification.
"""
```

A clear, structured system prompt reduces the chance of misinterpretation.
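A sketch of passing that system prompt explicitly on each call (the deployment name and API version below are placeholders, and Azure credentials are assumed to be set via environment variables):

```python
from langchain.chat_models import AzureChatOpenAI
from langchain.schema import HumanMessage, SystemMessage

llm = AzureChatOpenAI(
    deployment_name="your-deployment",   # placeholder
    openai_api_version="2023-05-15",     # placeholder
)

messages = [
    SystemMessage(content=prompt),  # the structured system prompt above
    HumanMessage(content="Enable the report job for the 10th of each month."),
]
response = llm(messages)
print(response.content)
```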
Retry Strategy with Backoff:
Because the flag is inconsistent, implementing a retry fallback helps:
```python
import time

import openai

def call_with_retry(api_func, retries=2):
    """Retry when the content filter raises a (possibly false-positive) jailbreak block."""
    for attempt in range(retries):
        try:
            return api_func()
        except openai.BadRequestError as e:
            if "jailbreak" in str(e):
                print("Retrying due to content filter...")
                time.sleep(1.5 * (attempt + 1))  # simple linear backoff
                continue
            raise
    raise Exception("All retries failed.")
```
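Because the classifier is probabilistic, a second identical call often passes. Usage with the earlier `llm` and `messages` (names are illustrative):

```python
result = call_with_retry(lambda: llm(messages))
```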
Hope this helps. If you have any follow-up questions, please let me know. I would be happy to help.
Please do not forget to “Accept the answer” and “up-vote” wherever the information provided helps you, as this can be beneficial to other community members.
Thank you!