Create indexer with data source as a field in json document inside index

Question

I have an index containing JSON documents in Azure Search Service.

Index Schema

{
"name": "product-api",
"defaultScoringProfile": null,
"fields": [
    {
        "name": "upcid",
        "type": "Edm.String",
        "searchable": true,
        "filterable": false,
        "retrievable": true,
        "sortable": true,
        "facetable": false,
        "key": true,
        "indexAnalyzer": null,
        "searchAnalyzer": null,
        "analyzer": null,
        "synonymMaps": []
    },
    {
        "name": "productName",
        "type": "Edm.String",
        "searchable": true,
        "filterable": false,
        "retrievable": true,
        "sortable": false,
        "facetable": false,
        "key": false,
        "indexAnalyzer": null,
        "searchAnalyzer": null,
        "analyzer": null,
        "synonymMaps": []
    },
    {
        "name": "imageUrl",
        "type": "Edm.String",
        "searchable": false,
        "filterable": false,
        "retrievable": true,
        "sortable": false,
        "facetable": false,
        "key": false,
        "indexAnalyzer": null,
        "searchAnalyzer": null,
        "analyzer": null,
        "synonymMaps": []
    },
    {
        "name": "ocrText",
        "type": "Edm.String",
        "searchable": false,
        "filterable": false,
        "retrievable": true,
        "sortable": false,
        "facetable": false,
        "key": false,
        "indexAnalyzer": null,
        "searchAnalyzer": null,
        "analyzer": null,
        "synonymMaps": []
    }
],
"scoringProfiles": [],
"corsOptions": {
    "allowedOrigins": [
        "*"
    ],
    "maxAgeInSeconds": null
},
"suggesters": [],
"analyzers": [],
"tokenizers": [],
"tokenFilters": [],
"charFilters": [],
"encryptionKey": null,
"similarity": {
    "@odata.type": "#Microsoft.Azure.Search.ClassicSimilarity"
}
}

My requirement

Create an indexer that uses the imageUrl field as its data source (the images are not stored in Azure Storage), applies Microsoft.Skills.Vision.OcrSkill as a skill, and maps the output to the ocrText field.

Problem

From what I have read in the docs, the data source (in my case, the image) must be in Azure Blob Storage to create an indexer.

Has anyone done something similar? Or does anyone know of a direct or indirect way to achieve this?

Any leads would be appreciated; I could not find anything related to this on the Internet.

Answer

@Ashish Kumar Thanks for raising this question! Firstly, apologies for the delay in responding here and any inconvenience this issue may have caused.

You can pass the URL of an image to the OCR skill. Normalization is applied to the image before the skill is invoked, which can cause the skill to fail if the image does not conform to the skill limits.

Here's a sample of how you can invoke the skill:
{
      "description": "Extracts text (plain and structured) from image.",
      "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
      "context": "/document",
      "defaultLanguageCode": null,
      "detectOrientation": true,
      "inputs": [
        {
          "name": "url",
          "source": "/document/url"
        }
      ],  
      "outputs": [  
        {  
          "name": "text",  
          "targetName": "myText"  
        },  
        {  
          "name": "layoutText",  
          "targetName": "myLayoutText"  
        }  
      ]  
    }  
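
To land the OCR output in the ocrText field, the indexer that runs the skillset also needs an output field mapping from the enrichment node to the index field. Below is a minimal Python sketch that builds such an indexer definition, assuming the skill's text output is written to /document/myText as in the sample above. The indexer, data source, and skillset names are hypothetical placeholders, not values from the question.

```python
import json


def build_indexer(name, data_source, skillset, index, mappings):
    """Build an indexer definition whose outputFieldMappings copy
    enrichment nodes (skill outputs) into index fields."""
    return {
        "name": name,
        "dataSourceName": data_source,
        "skillsetName": skillset,
        "targetIndexName": index,
        "outputFieldMappings": [
            {"sourceFieldName": source, "targetFieldName": target}
            for source, target in mappings.items()
        ],
    }


# Every name except "product-api" and "ocrText" is an assumed placeholder.
indexer = build_indexer(
    name="product-ocr-indexer",
    data_source="product-datasource",
    skillset="product-ocr-skillset",
    index="product-api",
    mappings={"/document/myText": "ocrText"},
)
print(json.dumps(indexer, indent=2))
```

The printed JSON is meant to be used as the body of a create-indexer request. Note that ocrText is defined with "searchable": false in the index schema above, so the extracted text would be retrievable but not full-text searchable unless that field definition is changed.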
  
  
Alternatively, you can use the document extraction skill to extract normalized images, and then add the OCR skill with the extracted normalized images as its input:
{  
      "@odata.type": "#Microsoft.Skills.Util.DocumentExtractionSkill",  
      "name": "#1",  
      "description": null,  
      "context": "/document",  
      "parsingMode": "default",  
      "dataToExtract": "contentAndMetadata",  
      "inputs": [  
        {  
          "name": "file_data",  
          "sourceContext": "/document",  
          "inputs": [  
            {  
              "name": "$type",  
              "source": "= 'file'"  
            },  
            {  
              "name": "url",  
              "source": "= $(/document/FileURL)"  
            }  
          ]  
        }  
      ],  
      "outputs": [  
        {  
          "name": "content",  
          "targetName": "content"  
        },  
        {  
          "name": "normalized_images",  
          "targetName": "extracted_normalized_images"  
        }  
      ],  
      "configuration": {  
        "imageAction": "generateNormalizedImages",  
        "normalizedImageMaxWidth": 2000,  
        "normalizedImageMaxHeight": 2000  
      }  
    }
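
The second half of this approach, feeding the extracted images to OCR, could look roughly like the following. This is a sketch in Python that builds the OCR skill definition, assuming the document extraction skill above wrote its images to the extracted_normalized_images node; the myText target name is reused from the earlier sample.

```python
import json


def build_ocr_skill(images_node="extracted_normalized_images"):
    """Build an OCR skill that runs once per image under /document/<images_node>."""
    context = f"/document/{images_node}/*"
    return {
        "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
        "context": context,
        "defaultLanguageCode": None,  # serialized as JSON null
        "detectOrientation": True,
        # The "image" input points at each extracted normalized image.
        "inputs": [{"name": "image", "source": context}],
        "outputs": [{"name": "text", "targetName": "myText"}],
    }


skill = build_ocr_skill()
print(json.dumps(skill, indent=2))
```

Because the skill context ends in /*, the output is one myText node per image (at /document/extracted_normalized_images/*/myText) rather than a single string. Mapping that into a single Edm.String field such as ocrText will likely need an extra step, for example merging the per-image text first.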

Additional information: A blob indexer is used for ingesting content from Azure Blob Storage into a Cognitive Search index. Blob indexers are frequently used in AI enrichment, where an attached skillset adds image and natural language processing to create searchable content. But you can also use blob indexers without AI enrichment, to ingest content from text-based documents such as PDFs, Microsoft Office documents, and other file formats.

Azure Cognitive Search can index JSON documents and arrays in Azure Blob Storage using an indexer that knows how to read semi-structured data. Semi-structured data contains tags or markings which separate content within the data. It splits the difference between unstructured data, which must be fully indexed, and formally structured data that adheres to a data model, such as a relational database schema, that can be indexed on a per-field basis.

The article linked below shows how to configure a blob indexer for either scenario. If you're unfamiliar with indexer concepts, start with "Indexers in Azure Cognitive Search" and "Create a search indexer" before diving into blob indexing.

https://learn.microsoft.com/en-us/azure/search/search-howto-indexing-azure-blob-storage

Index design through the portal enforces requirements and schema rules for specific data types, such as disallowing full text search capabilities on numeric fields. Once you have a workable index, you can copy the JSON from the portal and add it to your solution.

Note: The Import data wizard connects to an external data source using the internal logic provided by Azure Cognitive Search indexers, which are equipped to sample the source, read metadata, crack documents to read content and structure, and serialize contents as JSON for subsequent import to Azure Cognitive Search. You can use Azure Blob Storage or Azure Cosmos DB as the data source (Cosmos DB allows you to choose your consistency level to balance performance and cost).

If your documents are all JSON-based, consider Cosmos DB instead of Azure Search plus Blob Storage; you can achieve your result using only one service. Store the JSON from your blob as a document, and then query over it using the APIs.

I'm pleased to announce that Blob index tags is now generally available! Blob index tags solves a problem that today requires customers to add a separate DB in addition to blob storage to provide query capabilities to find specific objects. Blob index tags reduces the complexity of having to add a DB and think through concurrency by allowing user metadata to be indexed and used directly on the blobs themselves. Blob index tags is one of the app-dev primitives that will make it easier for 1st and 3rd party customers to develop solutions on top of blob storage. Capabilities such as this speed up adoption of blob storage for application development.

If you still have any questions, please let us know. I would like to work closely with you on this issue.

Hope this helps!
Kindly let us know if the above helps or you need further assistance on this issue.

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Please do not forget to "Accept the answer" and "up-vote" wherever the information provided helps you; this can be beneficial to other community members.
