"Image file is truncated" when using Azure Document Intelligence to extract images from a PDF as bytes

Tran Minh 20 Reputation points
2024-09-25T08:48:33.6633333+00:00

I'm working on a project that extracts images from PDF documents and saves them to Azure Blob Storage.
I referred to the Azure Python SDK for Document Intelligence (document intelligence sdk) and was able to extract images from a PDF and save them to a local folder.

 import os

 response = document_intelligence_client.get_analyze_result_figure(
     model_id=result.model_id,
     result_id=operation_id,
     figure_id=figure['id'],  # or figure.name
 )

 # Build the full output path instead of concatenating strings
 f_save = os.path.join(output_folder, f"{fig_fill}.jpg")
 with open(f_save, "wb") as writer:
     writer.writelines(response)  # writes every chunk yielded by the response

Now I want to get the bytes from the response and save them to a specific blob storage container.

 response = document_intelligence_client.get_analyze_result_figure(
     model_id=result.model_id,
     result_id=operation_id,
     figure_id=figure['id'],  # or figure.name
 )

When the call succeeds, the response looks like this:

<generator object HttpResponseImpl.iter_bytes at 0x7f65c466bf20>

I want to get the bytes from the response, so I did this:

 rp_elem = []
 for ith in response:
    rp_elem.append(ith)
 print(rp_elem[0])

but the result is:

An error occurred: image file is truncated

What should I do now?

Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.

2 answers

  1. Sina Salam 11,206 Reputation points
    2024-09-25T14:02:54.33+00:00

    Hello Tran Minh,

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    I understand that you are getting an "image file is truncated" error when using Azure Document Intelligence.

    The "image file is truncated" error typically occurs when the image data is incomplete or corrupted. The response itself may contain more information that indicates why the image data is incomplete.

    To resolve the error, first confirm that your code reads the entire response from the generator object, then validate that the image data you receive is complete and valid. For example:

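       from io import BytesIO
       from PIL import Image

       # image_data must be the complete image as a single bytes object,
       # for example b"".join(response), not just the first chunk
       image = Image.open(BytesIO(image_data))
       image.verify()  # This will raise an exception if the image is not valid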
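As a cheap pre-check before handing the data to Pillow, the sketch below tests whether a bytes object looks like a complete PNG by checking its signature and its closing IEND chunk. It uses only the standard library. The function name looks_like_complete_png is made up for this illustration, and the check is PNG-specific; it does not replace image.verify().

```python
import struct
import zlib

# Every PNG file starts with this 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# Every PNG file ends with an empty IEND chunk:
# 4-byte length (0), the chunk type "IEND", and the CRC of the chunk type.
PNG_TRAILER = struct.pack(">I", 0) + b"IEND" + struct.pack(">I", zlib.crc32(b"IEND"))


def looks_like_complete_png(data: bytes) -> bool:
    """Return True if data starts with the PNG signature and ends with IEND.

    Hypothetical helper: it only inspects the first and last bytes, so it
    catches truncation but not corruption in the middle of the file.
    """
    return data.startswith(PNG_SIGNATURE) and data.endswith(PNG_TRAILER)
```

If only the first chunk of the response is passed in, the function returns False because the IEND trailer is missing, which is the same situation that makes Pillow report a truncated file.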
    

    Secondly, implement a retry mechanism in your code in case network issues or temporary service disruptions cause incomplete data retrieval.
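One possible shape for such a retry is sketched below, using exponential backoff and only the standard library. The names fetch_with_retry, fetch, attempts and base_delay are hypothetical; fetch stands for any zero-argument callable that returns an iterable of bytes chunks, for example a lambda wrapping the get_analyze_result_figure call. Catching every Exception is a simplification; real code should catch only the errors it considers transient.

```python
import time
from typing import Callable, Iterable


def fetch_with_retry(
    fetch: Callable[[], Iterable[bytes]],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch() and join all its chunks, retrying on failure.

    Hypothetical helper, not part of any Azure SDK.
    Waits base_delay, then 2 * base_delay, and so on between attempts.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            # Joining inside the try block also catches errors raised
            # while the chunks are being streamed.
            return b"".join(fetch())
        except Exception as exc:  # assumption: narrow this in real code
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

The sleep parameter is injectable only so the helper can be exercised without real waiting; in normal use it can be left at its default.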

    Also, if you are using the azure-storage-blob library to upload the image data, make sure it is uploaded correctly. Below is a sample implementation:

       from azure.storage.blob import BlobServiceClient
       blob_service_client = BlobServiceClient.from_connection_string("your_connection_string")
       blob_client = blob_service_client.get_blob_client(container="your_container", blob=f"{fig_fill}.jpg")
       blob_client.upload_blob(image_data, overwrite=True)
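One detail worth checking when naming the blob: the sample output posted later in this thread starts with the PNG signature, so a hard-coded ".jpg" extension may not match the actual data. The sketch below guesses the extension and MIME type from the leading bytes. The function name guess_image_type is made up for this illustration, and it only distinguishes PNG and JPEG.

```python
from typing import Tuple


def guess_image_type(data: bytes) -> Tuple[str, str]:
    """Return (file extension, MIME type) based on the leading magic bytes.

    Hypothetical helper: only PNG and JPEG are recognized; anything else
    falls back to a generic binary type.
    """
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return ".png", "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return ".jpg", "image/jpeg"
    return ".bin", "application/octet-stream"
```

The returned extension could be used when building the blob name, for example f"{fig_fill}{extension}". As far as I know, the MIME type can also be stored on the blob by passing content_settings=ContentSettings(content_type=...) to upload_blob, with ContentSettings imported from azure.storage.blob.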
    

    I hope this is helpful! Do not hesitate to let me know if you have any other questions.

    Please don't forget to close the thread by upvoting and accepting this as an answer if it is helpful.


  2. Tran Minh 20 Reputation points
    2024-09-26T07:11:19.3366667+00:00

    Hi, I solved this problem. I will break it down as follows.

    • The get_analyze_result_figure method of document_intelligence_client returns the bytes of each figure it extracted. The bytes do not arrive in one piece: the response yields them as a sequence of chunks, something like this:
        b'\x89PNG\r\n\x1a\n
        b'\xf4\x18R\x87\xf3B
        b'\x0f\xa4\x02\x
      
    • Then I used a Python bytes object to collect all the byte chunks from the response:
        from PIL import Image
        from io import BytesIO
        
        response = document_intelligence_client.get_analyze_result_figure(
                            model_id=result.model_id, 
                            result_id=operation_id, 
                            figure_id=figure['id'] # or figure.name
                        )
        
        response_bytes = bytes()
        
        for ith in response:
          response_bytes += ith
        
        image = Image.open(BytesIO(response_bytes))
        
        # response_bytes now holds the complete image as bytes, so it can be sent anywhere
        
        from azure.storage.blob import BlobServiceClient
        blob_service_client = BlobServiceClient.from_connection_string("your_connection_string")
        blob_client = blob_service_client.get_blob_client(container="your_container", blob=f"{fig_name}.jpg")
        # The post is cut off after "upload_blob("; the arguments below are assumed
        blob_client.upload_blob(response_bytes, overwrite=True)
      
      I tested this, and it works.
    • About the SDK: the examples are excellent, but I think it would be better if the documentation said more about what the functions return.
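The chunked behaviour described in this answer can be reproduced without the SDK. In the sketch below, fake_figure_response is a hypothetical stand-in for the object returned by get_analyze_result_figure, and read_all is a made-up helper name; the chunk contents are placeholders, not real image data. It shows that the first chunk alone is only a fragment, and that a generator can be consumed only once.

```python
from typing import Iterable, Iterator


def read_all(chunks: Iterable[bytes]) -> bytes:
    """Join every chunk from an iterable of bytes into one bytes object."""
    return b"".join(chunks)


def fake_figure_response() -> Iterator[bytes]:
    # Hypothetical stand-in for the SDK response: it yields the image
    # piece by piece instead of returning it all at once.
    yield b"\x89PNG\r\n\x1a\n"
    yield b"<placeholder chunk 2>"
    yield b"<placeholder chunk 3>"


# Taking only the first element gives just the first fragment of the image.
first_chunk = next(fake_figure_response())

# Joining all chunks gives the complete payload.
response = fake_figure_response()
image_bytes = read_all(response)

# The generator is now exhausted, so reading it a second time yields nothing.
leftover = read_all(response)
```

Using b"".join is equivalent to the += loop above, but avoids building a new bytes object on every iteration. If the same data is needed both for a local file and for a blob upload, it is safer to read it once into image_bytes and reuse that value.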

    Thanks for all your support and guidance.

