Create a custom analyzer via REST APIs

Content Understanding analyzers define how to process and extract insights from your content. They ensure uniform processing and output structure across all your content to deliver reliable and predictable results. We offer prebuilt analyzers for common use cases. This guide shows how these analyzers can be customized to better fit your needs.

In this guide, we use the cURL command line tool. If it isn't installed, you can download the appropriate version for your dev environment.
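To confirm cURL is available in your shell, print its version:

# Prints the installed cURL version; an error here means cURL isn't on your PATH.
curl --version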

Prerequisites

To get started, make sure you have the following resources and permissions:

  • An Azure subscription. If you don't have an Azure subscription, create a free account.
  • Once you have your Azure subscription, create a Microsoft Foundry resource in the Azure portal. Be sure to create it in a supported region.
    • This resource is listed under Foundry > Foundry in the portal.
  • Set up default model deployments for your Content Understanding resource. Setting defaults creates a connection between Content Understanding and the Foundry models it uses for requests. To set them up:
    1. Go to the Content Understanding settings page.
    2. Select the "+ Add resource" button in the upper left.
    3. Select the Foundry resource that you want to use, select Next, then Save.
      • Make sure to leave "Enable autodeployment for required models if no defaults are available." checked. This ensures your resource is fully set up with the required GPT-4.1, GPT-4.1-mini, and text-embedding-3-large models. Different prebuilt analyzers require different models.
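The cURL examples in this guide use {endpoint} and {key} placeholders for your resource's endpoint and key. If you plan to run many requests, you can export them once as shell variables (the variable names here are illustrative, not required by the service) and substitute $CU_ENDPOINT and $CU_KEY wherever the placeholders appear:

# Illustrative variable names; copy both values from your Foundry resource in the Azure portal.
export CU_ENDPOINT="<your-endpoint>"
export CU_KEY="<your-key>"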

Define an analyzer schema

To create a custom analyzer, define a field schema that describes the structured data you want to extract. In the following example, we create an analyzer based on the prebuilt document analyzer (prebuilt-document) for processing a receipt.

Create a JSON file named receipt.json with the following content:

{
  "description": "Sample receipt analyzer",
  "baseAnalyzerId": "prebuilt-document",
  "models": {
      "completion": "gpt-4.1",
      "embedding": "text-embedding-ada-002"

    },
  "config": {
    "returnDetails": true,
    "enableFormula": false,
    "disableContentFiltering": false,
    "estimateFieldSourceAndConfidence": true,
    "tableFormat": "html"
  },
 "fieldSchema": {
    "fields": {
      "VendorName": {
        "type": "string",
        "method": "extract",
        "description": "Vendor issuing the receipt"
      },
      "Items": {
        "type": "array",
        "method": "extract",
        "items": {
          "type": "object",
          "properties": {
            "Description": {
              "type": "string",
              "method": "extract",
              "description": "Description of the item"
            },
            "Amount": {
              "type": "number",
              "method": "extract",
              "description": "Amount of the item"
            }
          }
        }
      }
    }
  }
}
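Before creating the analyzer, it can help to confirm the file parses as JSON. A minimal check, assuming Python 3 is installed:

# Fails with a parse error if receipt.json isn't valid JSON.
python3 -m json.tool receipt.json > /dev/null && echo "receipt.json is valid JSON"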

If you need to process various types of documents but want to categorize them and analyze only the receipts, you can create an analyzer that first categorizes each document and then routes receipts to the analyzer you created above, using the following schema.

Create a JSON file named categorize.json with the following content:

{
  // Use the base analyzer to invoke the document-specific capabilities.
  "baseAnalyzerId": "prebuilt-document",

  // Specify the models the analyzer should use: one of the supported completion models and one of the supported embedding models. The specific deployment used during analysis is set on the resource or provided in the analyze request.
  "models": {
    "completion": "gpt-4.1",
    "embedding": "text-embedding-3-large"
  },
  "config": {
    // Enable splitting of the input into segments. Set this property to false if you expect only a single document within the input file. When enableSegment is false, the whole content is classified into one of the categories.
    "enableSegment": false,

    "contentCategories": {
      // Category name.
      "receipt": {
        // Description to help with classification and splitting.
        "description": "Any images or documents of receipts",

        // Define the analyzer that any content classified as a receipt should be routed to.
        "analyzerId": "receipt"
      },
      "invoice": {
        "description": "Any images or documents of invoices",
        "analyzerId": "prebuilt-invoice"
      },
      "policeReport": {
        "description": "A police or law enforcement report detailing the events that led to the loss."
        // Don't perform analysis for this category.
      }
    },

    // Omit the original content object and only return content objects from additional analysis.
    "omitContent": true
  }

  // You can use fieldSchema here to define fields that are needed from the entire input content.
}
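The // annotations in this sample are explanatory only and aren't valid JSON, so remove them before sending the file. Because every comment above sits on its own line, a simple filter is enough (a sketch, assuming a POSIX shell; the cleaned file name is illustrative):

# Drop full-line // comments; this doesn't handle comments that share a line with JSON.
grep -v '^[[:space:]]*//' categorize.json > categorize.clean.json

# Optional: confirm the cleaned file parses as JSON.
python3 -m json.tool categorize.clean.json > /dev/null && echo "categorize.clean.json is valid JSON"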

Create analyzer

PUT request

Create the receipt analyzer first, because the categorize analyzer routes to it by ID, and then create the categorize analyzer. In the following request, replace {endpoint} and {key} with the values from your Foundry resource and {analyzerId} with the ID you want to assign.

curl -i -X PUT "{endpoint}/contentunderstanding/analyzers/{analyzerId}?api-version=2025-11-01" \
  -H "Ocp-Apim-Subscription-Key: {key}" \
  -H "Content-Type: application/json" \
  -d @receipt.json
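For example, because categorize.json routes receipts to an analyzer named receipt, create that analyzer under the ID receipt first. The categorizer's ID (receipt-categorizer below) is illustrative; choose any name you like:

# Create the receipt analyzer under the exact ID that categorize.json references.
curl -i -X PUT "{endpoint}/contentunderstanding/analyzers/receipt?api-version=2025-11-01" \
  -H "Ocp-Apim-Subscription-Key: {key}" \
  -H "Content-Type: application/json" \
  -d @receipt.json

# Then create the categorizing analyzer from the comment-free file.
curl -i -X PUT "{endpoint}/contentunderstanding/analyzers/receipt-categorizer?api-version=2025-11-01" \
  -H "Ocp-Apim-Subscription-Key: {key}" \
  -H "Content-Type: application/json" \
  -d @categorize.clean.json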

PUT response

The 201 Created response includes an Operation-Location header containing a URL that you can use to track the status of this asynchronous analyzer creation operation.

201 Created
Operation-Location: {endpoint}/contentunderstanding/analyzers/{analyzerId}/operations/{operationId}?api-version=2025-11-01
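If you're scripting these calls, you can capture that header into a variable (a sketch, assuming bash; tr strips the carriage returns that terminate HTTP header lines):

# Capture the Operation-Location header from the PUT response.
OPERATION_URL=$(curl -si -X PUT "{endpoint}/contentunderstanding/analyzers/{analyzerId}?api-version=2025-11-01" \
  -H "Ocp-Apim-Subscription-Key: {key}" \
  -H "Content-Type: application/json" \
  -d @receipt.json | tr -d '\r' | awk 'tolower($1) == "operation-location:" { print $2 }')
echo "$OPERATION_URL"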

Once the operation completes, an HTTP GET on the Operation-Location URL returns "status": "succeeded".

curl -i -X GET "{endpoint}/contentunderstanding/analyzers/{analyzerId}/operations/{operationId}?api-version=2025-11-01" \
  -H "Ocp-Apim-Subscription-Key: {key}"

Analyze file

Send file

You can now use the custom analyzer you created to process files and extract the fields you defined in the schema.

Before running the cURL command, make the following changes to the HTTP request:

  1. Replace {endpoint} and {key} with the endpoint and key values from your Azure portal Foundry instance.
  2. Replace {analyzerId} with the name of the custom analyzer you created with the categorize.json file.
  3. Replace {fileUrl} with a publicly accessible URL of the file to analyze, such as a path to an Azure Storage Blob with a shared access signature (SAS) or the sample URL https://github.com/Azure-Samples/azure-ai-content-understanding-python/raw/refs/heads/main/data/receipt.png.

POST request

This example uses the custom analyzer you created with the categorize.json file to analyze a receipt.

curl -i -X POST "{endpoint}/contentunderstanding/analyzers/{analyzerId}:analyze?api-version=2025-11-01" \
  -H "Ocp-Apim-Subscription-Key: {key}" \
  -H "Content-Type: application/json" \
  -d '{
        "inputs":[
          {
            "url": "https://github.com/Azure-Samples/azure-ai-content-understanding-python/raw/refs/heads/main/data/receipt.png"
          }          
        ]
      }'  

POST response

The 202 Accepted response includes an Operation-Location header and the {resultId}, which you can use to track the status of this asynchronous operation.

{
  "id": {resultId},
  "status": "Running",
  "result": {
    "analyzerId": {analyzerId},
    "apiVersion": "2025-11-01",
    "createdAt": "YYYY-MM-DDTHH:MM:SSZ",
    "warnings": [],
    "contents": []
  }
}
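In a script, you can capture {resultId} from the response body instead of copying it by hand (a sketch, assuming jq; -i is omitted so the output is pure JSON):

# Capture the result ID from the POST response body.
RESULT_ID=$(curl -s -X POST "{endpoint}/contentunderstanding/analyzers/{analyzerId}:analyze?api-version=2025-11-01" \
  -H "Ocp-Apim-Subscription-Key: {key}" \
  -H "Content-Type: application/json" \
  -d '{"inputs":[{"url":"https://github.com/Azure-Samples/azure-ai-content-understanding-python/raw/refs/heads/main/data/receipt.png"}]}' \
  | jq -r '.id')
echo "$RESULT_ID"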

Get analyze result

Use the Operation-Location header from the POST response to retrieve the result of the analysis.

GET request

curl -i -X GET "{endpoint}/contentunderstanding/analyzerResults/{resultId}?api-version=2025-11-01" \
  -H "Ocp-Apim-Subscription-Key: {key}"

GET response

A 200 OK response includes a status field that shows the operation's progress.

  • status is Succeeded if the operation completed successfully.
  • If it's Running or NotStarted, call the API again manually or with a script, such as the polling loop below; wait at least one second between requests.
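A minimal polling loop, assuming bash, jq, and the $RESULT_ID captured earlier; status values are lowercased before comparison in case their casing varies:

# Poll until the analysis completes, waiting at least one second between requests.
STATUS="running"
while [ "$STATUS" = "running" ] || [ "$STATUS" = "notstarted" ]; do
  sleep 1
  STATUS=$(curl -s "{endpoint}/contentunderstanding/analyzerResults/${RESULT_ID}?api-version=2025-11-01" \
    -H "Ocp-Apim-Subscription-Key: {key}" | jq -r '.status | ascii_downcase')
  echo "status: $STATUS"
done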

Sample response

{
  "id": {resultId},
  "status": "Succeeded",
  "result": {
    "analyzerId": {analyzerId},
    "apiVersion": "2025-11-01",
    "createdAt": "YYYY-MM-DDTHH:MM:SSZ",
    "warnings": [],
    "contents": [
      {
        "path": "input1/segment1",
        "category": "receipt",
        "markdown": "Contoso\n\n123 Main Street\nRedmond, WA 98052\n\n987-654-3210\n\n6/10/2019 13:59\nSales Associate: Paul\n\n\n<table>\n<tr>\n<td>2 Surface Pro 6</td>\n<td>$1,998.00</td>\n</tr>\n<tr>\n<td>3 Surface Pen</td>\n<td>$299.97</td>\n</tr>\n</table> ...",
        "fields": {
          "VendorName": {
            "type": "string",
            "valueString": "Contoso",
            "spans": [{"offset": 0,"length": 7}],
            "confidence": 0.996,
            "source": "D(1,774.0000,72.0000,974.0000,70.0000,974.0000,111.0000,774.0000,113.0000)"
          },
          "Items": {
            "type": "array",
            "valueArray": [
              {
                "type": "object",
                "valueObject": {
                  "Description": {
                    "type": "string",
                    "valueString": "2 Surface Pro 6",
                    "spans": [ { "offset": 115, "length": 15}],
                    "confidence": 0.423,
                    "source": "D(1,704.0000,482.0000,875.0000,482.0000,875.0000,508.0000,704.0000,508.0000)"
                  },
                  "Amount": {
                    "type": "number",
                    "valueNumber": 1998,
                    "spans": [{ "offset": 140,"length": 9}
                    ],
                    "confidence": 0.957,
                    "source": "D(1,952.0000,482.0000,1048.0000,482.0000,1048.0000,508.0000,952.0000,509.0000)"
                  }
                }
              }, ...
            ]
          }
        },
        "kind": "document",
        "startPageNumber": 1,
        "endPageNumber": 1,
        "unit": "pixel",
        "pages": [
          {
            "pageNumber": 1,
            "angle": -0.0944,
            "width": 1743,
            "height": 878
          }
        ],
        "analyzerId": "{analyzerId}",
        "mimeType": "image/png"
      }
    ]
  },
  "usage": {
    "documentPages": 1,
    "tokens": {
      "contextualization": 1000
    }
  }
}
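Once the status is Succeeded, you can pull individual field values out of the result with jq; for example, the vendor name and each line item amount from the first content object (the paths follow the sample response above):

# Extract the vendor name and the amount of every line item.
curl -s "{endpoint}/contentunderstanding/analyzerResults/{resultId}?api-version=2025-11-01" \
  -H "Ocp-Apim-Subscription-Key: {key}" \
  | jq -r '.result.contents[0].fields.VendorName.valueString,
           (.result.contents[0].fields.Items.valueArray[].valueObject.Amount.valueNumber)'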

Next steps