Large files (>16 MB) in Azure AI Search

Malte Martienßen 65 Reputation points
2024-08-02T13:22:32.85+00:00

I want to build a RAG system using Azure AI Search and Blob Storage with files that are larger than 16 MB, which is the file size limit of Azure AI Search.

Some of the files exceed the limit only because they are scanned PDFs and would be smaller as text PDFs. Some text PDFs also exceed it.

I can upload the files to Blob Storage and could use skills in AI Search to extract the text from the scanned documents or split them to reduce the file size, but the maximum file size of 16 MB prevents me from doing these operations.

Is there a way within Azure (Blob Storage) to extract the text or split the files before importing them into AI Search, so that I can still work with these files?


Accepted answer
  1. ajkuma 28,036 Reputation points Microsoft Employee Moderator
    2024-08-13T14:36:17.0566667+00:00

    Edit (updating the answer from the comments, to benefit the community):

    Malte Martienßen analyzed the documents with Document Intelligence and uploaded the resulting .md files to AI Search. This involved splitting larger files into smaller parts, analyzing each part, and then recombining the results.
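The split, analyze, and recombine flow described above could be sketched roughly as follows. This is only an illustration, not the code that was actually used: it assumes the split is done by page range, and `analyze_part` is a hypothetical stand-in for whatever call sends one part to Document Intelligence and returns its Markdown.

```python
from typing import Callable, Iterator, Tuple


def page_ranges(total_pages: int, pages_per_part: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive, 1-based (start, end) page ranges that cover the document."""
    if pages_per_part < 1:
        raise ValueError("pages_per_part must be at least 1")
    for start in range(1, total_pages + 1, pages_per_part):
        yield start, min(start + pages_per_part - 1, total_pages)


def analyze_in_parts(
    total_pages: int,
    pages_per_part: int,
    analyze_part: Callable[[int, int], str],  # hypothetical: returns Markdown for one part
) -> str:
    """Analyze each part separately and join the Markdown results in page order."""
    parts = [analyze_part(start, end) for start, end in page_ranges(total_pages, pages_per_part)]
    # Drop empty results and separate the parts with a blank line.
    return "\n\n".join(part.strip() for part in parts if part.strip())
```

The combined string can then be saved as a single .md file and uploaded. How many pages go into each part is a judgment call that depends on how large the scanned pages are.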

    -

    Malte Martienßen, first, apologies for the delayed response.

    Based on my understanding of your scenario description, are you referring to the doc section "Document size limits per API call"?

    The maximum document size when calling an Index API is approximately 16 megabytes. Document size is actually a limit on the size of the Index API request body. Since you can pass a batch of multiple documents to the Index API at once, the size limit realistically depends on how many documents are in the batch. For a batch with a single document, the maximum document size is 16 MB of JSON. When estimating document size, remember to consider only those fields that add value to your search scenarios, and exclude any source fields that have no purpose in the queries you intend to run.
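To illustrate how the batch size interacts with the request-body limit, here is a minimal sketch that groups documents into batches by their serialized JSON size. The helper names and the exact byte value are assumptions for illustration (the limit is only described as approximately 16 MB), and the sketch ignores the small overhead of the request envelope, so leave some headroom in practice.

```python
import json
from typing import Dict, Iterable, Iterator, List

# Assumed value for the sketch; the documented limit is "approximately 16 megabytes".
MAX_REQUEST_BYTES = 16 * 1024 * 1024


def json_size(doc: Dict) -> int:
    """Size in bytes of one document when serialized as UTF-8 JSON."""
    return len(json.dumps(doc, ensure_ascii=False).encode("utf-8"))


def batch_by_size(docs: Iterable[Dict], max_bytes: int = MAX_REQUEST_BYTES) -> Iterator[List[Dict]]:
    """Group documents so that each batch stays within max_bytes of JSON."""
    batch: List[Dict] = []
    batch_bytes = 0
    for doc in docs:
        size = json_size(doc)
        if size > max_bytes:
            # A single oversized document cannot be sent; it has to be trimmed or split first.
            raise ValueError(f"document of {size} bytes exceeds the {max_bytes}-byte limit")
        if batch and batch_bytes + size > max_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(doc)
        batch_bytes += size
    if batch:
        yield batch
```

Each yielded batch would then go into one Index API request. A document that raises `ValueError` here is the case from the question: its extracted content has to be reduced or split before it can be indexed.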

    In Azure AI Search, the body of a request is subject to an upper limit of 16 MB, imposing a practical limit on the contents of individual fields or collections that aren't otherwise constrained by theoretical limits (see Supported data types for more information about field composition and restrictions).

    If you are using a lower tier of Azure AI Search, consider upgrading to a higher SKU whose indexer limits allow larger files. This would let you handle larger files more efficiently.

    Indexer limits: [screenshot of the indexer limits table]

    Please share more details about your scenario and requirements, and I'll follow up with you further.


    If the answer helped (pointed you in the right direction), please click Accept Answer, or share the requested information so we can help you further.

