The provided data failed validation: "contains CAPTCHA"

Lin, Fanghe (Emily) 20 Reputation points
2025-03-13T14:33:20.7566667+00:00

Hello, I'm trying to fine-tune GPT-4o with image data, but I'm encountering an issue: most of the images in my training file are being flagged as "containing CAPTCHAs". The images look fine. This is the first time I've seen this problem in training data validation.

I retested the old training files that I used to fine-tune GPT-4o a month ago without any issues, and now they also flag most images as CAPTCHAs. Have there been any changes to the CAPTCHA detection threshold this month?

Azure AI Content Safety
An Azure service that enables users to identify content that is potentially offensive, risky, or otherwise undesirable. Previously known as Azure Content Moderator.

Accepted answer
  Prashanth Veeragoni 4,930 Reputation points Microsoft External Staff Moderator
    2025-03-14T06:18:31.2333333+00:00

    Hi Lin, Fanghe (Emily),

    This issue likely occurs because the data validation and content moderation policy for fine-tuning has been updated, making CAPTCHA detection stricter. Check the official fine-tuning documentation for the current rules.

    In that document, under Fine-Tuning - Content moderation policy, it states that images containing the following will be excluded from your dataset and not used for training:

    - CAPTCHAs
    - People
    - Faces
    - Children

    Remove such images. For now, we cannot fine-tune models with images containing these entities.
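    Before re-uploading a dataset, it can help to find out which training examples actually contain images, so they can be reviewed or removed. Below is a minimal sketch, assuming the chat fine-tuning JSONL format where image inputs appear as content parts of type "image_url"; `partition_examples` is a hypothetical helper, not an official tool, and it does not itself detect CAPTCHAs — it only separates image-bearing examples for manual triage.

    ```python
    import json

    def partition_examples(jsonl_text):
        """Split chat fine-tuning examples into (with_images, text_only).

        Assumes each JSONL line is a {"messages": [...]} object and that
        image inputs appear as content parts with "type": "image_url".
        """
        with_images, text_only = [], []
        for line in jsonl_text.splitlines():
            if not line.strip():
                continue
            example = json.loads(line)
            has_image = any(
                part.get("type") == "image_url"
                for msg in example.get("messages", [])
                # A message's content is either a plain string or a list of parts.
                for part in (msg["content"] if isinstance(msg.get("content"), list) else [])
            )
            (with_images if has_image else text_only).append(example)
        return with_images, text_only
    ```

    The image-bearing examples can then be inspected by hand (or dropped) before uploading the cleaned JSONL for fine-tuning.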

    Why?

    Allowing AI models to train on images containing CAPTCHAs poses serious security risks:

    Bypassing Security Measures:

    CAPTCHAs are specifically designed to block automated systems. If an AI model is trained to recognize and solve them, it could potentially be used to circumvent security systems, making websites and services vulnerable to bot attacks.

    Facilitating Malicious Use Cases:

    Cybercriminals could exploit AI trained on CAPTCHAs to automate attacks, such as:

    Credential stuffing (automated login attempts using leaked username/password pairs).

    Spamming and phishing by automating bot-driven form submissions.

    Scraping protected content from websites that use CAPTCHAs as a defense.

    Legal and Ethical Concerns:

    Many providers (Google reCAPTCHA, Cloudflare, etc.) have terms of service that prohibit training AI models on CAPTCHA data.

    Hope this helps. Do let us know if you have any further queries.

    ------------- 

    If this answers your query, please click Accept Answer and Yes for "Was this answer helpful".

    Thank you. 

