How to use AI Search to extract a portion of content from a document in Azure?

Scorpio X 0 Reputation points
2024-11-21T02:00:43.86+00:00

The document itself contains a lot of content, but I only want to extract a part of it that meets the conditions.

For example, document content:

XXXXXXXXXXXX Quantity of apples: 200. XXXXXXXXXXXXXXXXX

How to extract only the quantity of apples?

The only result I want is "200 apples", rather than the whole document that meets the conditions.

Thank you so much!

Azure AI Search

1 answer

  1. Amira Bedhiafi 41,121 Reputation points Volunteer Moderator
    2025-10-14T20:04:52.7466667+00:00

    Hello!

    Thank you for posting on Microsoft Learn Q&A.

    You need to do the extraction at indexing time with a skillset and store just the value you need in a dedicated index field. Your query then returns that small field instead of the whole chunk of text.

    The first solution is to use a regex in a custom Web API skill, which works when the text pattern is predictable. Start by adding fields for the extracted value to your index:

    {
      "name": "docs",
      "fields": [
        { "name": "id", "type": "Edm.String", "key": true, "filterable": true },
        { "name": "content", "type": "Edm.String", "searchable": true },
        { "name": "applesQuantity", "type": "Edm.Int32", "filterable": true, "sortable": true, "facetable": true, "retrievable": true },
        { "name": "applesText", "type": "Edm.String", "retrievable": true }
      ]
    }
    

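    Then create a skillset that calls your Azure Function, which runs the regex. The Document Extraction skill takes file_data as its input, so the indexer must set allowSkillsetToReadFileData to true in its parameters.configuration. If the indexer already cracks your documents into /document/content, you can omit that skill:

    {
      "name": "docs-skillset",
      "skills": [
        {
          "@odata.type": "#Microsoft.Skills.Util.DocumentExtractionSkill",
          "inputs": [{ "name": "file_data", "source": "/document/file_data" }],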
          "outputs": [{ "name": "content", "targetName": "content" }]
        },
        {
          "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
          "name": "#extract-apples",
          "description": "Extract 'Quantity of apples' as a number and as 'N apples'",
          "context": "/document",
          "uri": "https://<your-function>.azurewebsites.net/api/extractApples",
          "httpMethod": "POST",
          "inputs": [
            { "name": "text", "source": "/document/content" }
          ],
          "outputs": [
            { "name": "applesQuantity", "targetName": "applesQuantity" },
            { "name": "applesText", "targetName": "applesText" }
          ]
        }
      ]
    }
    

    Your function only needs to read the text input, apply a regex such as Quantity\s*of\s*apples\s*[:=]\s*(\d+), and return something like:

    { "values": [ { "recordId": "1", "data": { "applesQuantity": 200, "applesText": "200 apples" } } ] }
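
    As a rough illustration of that function's core logic, here is a minimal Python sketch. It assumes the custom skill request arrives as a "values" array whose records carry a "text" input, and that the text looks like "Quantity of apples: 200". The name extract_apples, the case-insensitive flag, and the choice to return nulls on a miss are assumptions for illustration, not part of the original answer; the Azure Functions HTTP wrapper is left out.

```python
import json
import re

# Assumed pattern for text such as "Quantity of apples: 200"; adjust to your documents.
APPLES_PATTERN = re.compile(r"Quantity\s*of\s*apples\s*[:=]\s*(\d+)", re.IGNORECASE)


def extract_apples(body: dict) -> dict:
    """Turn a custom skill request body into a custom skill response body."""
    results = []
    for record in body.get("values", []):
        text = record.get("data", {}).get("text") or ""
        match = APPLES_PATTERN.search(text)
        if match:
            quantity = int(match.group(1))
            data = {"applesQuantity": quantity, "applesText": f"{quantity} apples"}
        else:
            # No match: return nulls so the index fields stay empty for this document.
            data = {"applesQuantity": None, "applesText": None}
        # Each output record must echo the recordId of its input record.
        results.append({"recordId": record["recordId"], "data": data})
    return {"values": results}


if __name__ == "__main__":
    request = {
        "values": [
            {"recordId": "1", "data": {"text": "XXXX Quantity of apples: 200. XXXX"}}
        ]
    }
    print(json.dumps(extract_apples(request)))
```

    In a real function you would parse the HTTP request body as JSON, pass it to a helper like this, and return the result as the JSON response.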
    

    Then map the skill outputs to index fields in the indexer:

    {
      "name": "docs-indexer",
      "dataSourceName": "docs-ds",
      "targetIndexName": "docs",
      "skillsetName": "docs-skillset",
      "outputFieldMappings": [
        { "sourceFieldName": "/document/applesQuantity", "targetFieldName": "applesQuantity" },
        { "sourceFieldName": "/document/applesText", "targetFieldName": "applesText" }
      ]
    }
    

    Finally, query only the extracted value:

    POST https://<service>.search.windows.net/indexes/docs/docs/search?api-version=2025-09-01
    Content-Type: application/json
    api-key: <key>
    {
      "search": "*",
      "select": "id,applesText,applesQuantity",
      "filter": "applesQuantity gt 0"
    }
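
    If you prefer to send that query from code, a minimal Python sketch using only the standard library could look like the following. The helper names build_search_request and run_search are hypothetical, and the service name and key are placeholders you would supply yourself.

```python
import json
import urllib.request


def build_search_request(service: str, api_key: str, index: str = "docs") -> urllib.request.Request:
    """Build the POST request that returns only the extracted fields."""
    url = (
        f"https://{service}.search.windows.net"
        f"/indexes/{index}/docs/search?api-version=2025-09-01"
    )
    payload = {
        "search": "*",
        "select": "id,applesText,applesQuantity",
        "filter": "applesQuantity gt 0",
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )


def run_search(request: urllib.request.Request) -> list:
    """Send the request and return the matching documents from the 'value' array."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)["value"]
```

    Calling run_search(build_search_request("<service>", "<key>")) would then give you a list of small documents containing only id, applesText and applesQuantity.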
    

    The second solution is to use the Entity Recognition skill when you can't rely on a fixed string but still want to pull numbers out of prose.

    Add the Entity Recognition (v3) skill with categories set to ["Quantity"]:

    {
      "@odata.type": "#Microsoft.Skills.Text.V3.EntityRecognitionSkill",
      "context": "/document",
      "categories": [ "Quantity" ],
      "inputs": [{ "name": "text", "source": "/document/content" }],
      "outputs": [{ "name": "namedEntities", "targetName": "entities" }]
    }
    

    You'll get entities with category "Quantity". You can then post-process them, either with Conditional and Shaper skills or in your app, to keep the one nearest the term "apples" and map it into applesQuantity.
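
    For the post-processing in your app, one possible approach is to pick the Quantity entity whose offset is closest to the keyword. The Python sketch below assumes each entity is a dict with "category", "text" and "offset" keys and that offsets index into the same text string; the function name nearest_quantity and the "closest character offset" rule are illustrative assumptions, not something the skill provides.

```python
import re
from typing import Optional


def nearest_quantity(text: str, entities: list, keyword: str = "apples") -> Optional[int]:
    """Return the integer from the Quantity entity closest to `keyword`, or None."""
    keyword_positions = [
        m.start() for m in re.finditer(re.escape(keyword), text, re.IGNORECASE)
    ]
    if not keyword_positions:
        return None

    best = None  # (distance, value) of the closest quantity seen so far
    for entity in entities:
        if entity.get("category") != "Quantity":
            continue
        digits = re.search(r"\d+", entity.get("text", ""))
        if not digits:
            continue  # e.g. a quantity written out in words
        distance = min(abs(entity["offset"] - pos) for pos in keyword_positions)
        if best is None or distance < best[0]:
            best = (distance, int(digits.group()))
    return best[1] if best else None


if __name__ == "__main__":
    sample = "Pears in stock: 35. Quantity of apples: 200. Thanks."
    found = [
        {"category": "Quantity", "text": "35", "offset": sample.index("35")},
        {"category": "Quantity", "text": "200", "offset": sample.index("200")},
    ]
    print(nearest_quantity(sample, found))
```

    A character-distance rule is crude; if your documents mention several fruits in one sentence, the regex approach or a stricter rule may be more reliable.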

    Azure AI Search also supports Lucene regex queries (set queryType=full), but a query-time regex only helps with matching; it doesn't return capture groups. You'd still have to fetch the text and extract the number in your app. Index-time extraction is cleaner and cheaper to serve.

