" Image file is truncated" when use Azure Document Inteligence to extract image from PDF as bytecode

Question

Tran Minh 25

I'm working on a project that extracts images from PDF documents and saves them to Azure Blob Storage.
I referred to the Azure Python SDK for Document Intelligence (document intelligence sdk), extracted the images from a PDF, and saved them to a local folder.

  import os

  response = document_intelligence_client.get_analyze_result_figure(
      model_id=result.model_id,
      result_id=operation_id,
      figure_id=figure['id'],  # or figure.name
  )

  # Build the full output path: <output_folder>/<fig_fill>.jpg
  f_save = os.path.join(output_folder, f"{fig_fill}.jpg")
  with open(f_save, "wb") as writer:
      writer.writelines(response)  # writes every chunk the response yields

Now I want to get the bytes from the response and save them to a specified blob in Blob Storage.

  document_intelligence_client.get_analyze_result_figure(
      model_id=result.model_id,
      result_id=operation_id,
      figure_id=figure['id'],  # or figure.name
  )

When the call succeeds, the response looks like this:

<generator object HttpResponseImpl.iter_bytes at 0x7f65c466bf20>

I want to get the bytes from the response, so I did this:

 rp_elem = []
 for ith in response:
    rp_elem.append(ith)
 print(rp_elem[0])

but the result is

An error occurred: image file is truncated

What should I do now?

romungi-MSFT 48,906 Reputation points Microsoft Employee Moderator

2024-09-25T15:43:25.0666667+00:00

@Tran Minh I agree with the response below; I think the response itself contains no image data. Does the actual analyze result show the figure ID in its figures list?
Tran Minh 25 Reputation points

2024-09-26T00:55:40.7966667+00:00
Hi @romungi-MSFT
About the response itself: I think it does contain image data, because I can save the response as an image file.

  for figure in result.figures:
      if figure['id']:
          response = document_intelligence_client.get_analyze_result_figure(
              model_id=result.model_id,
              result_id=operation_id,
              figure_id=figure['id'],  # or figure.name
          )
          f_save = os.path.join(output_folder, f"{fig_fill}.jpg")
          with open(f_save, "wb") as writer:
              writer.writelines(response)

Calling writelines(response) on a file opened with "wb" means that the response from get_analyze_result_figure contains the image bytes, right?

So the question is why I can save the image as a file but cannot get the bytes directly into a BytesIO.
romungi-MSFT 48,906 Reputation points Microsoft Employee Moderator

2024-09-26T07:07:44.27+00:00

@Tran Minh OK. At this point I am not sure how to decode the data from the response. I had assumed the image was never received in the response in the first place. Since you are able to save the file, maybe you can try PIL's Image.tobytes to get the byte data.

Answer 1

Hello Tran Minh,

Welcome to the Microsoft Q&A and thank you for posting your questions here.

I understand that you are getting an "Image file is truncated" error when using Azure Document Intelligence.

The "image file is truncated" error typically occurs when the image data is incomplete or corrupted. There should be more information in the response indicating why the image data is incomplete.

To resolve the error, first confirm that your code reads the entire response from the generator object, then validate that the image data you receive is valid and complete. For example:

   from io import BytesIO
   from PIL import Image

   # image_data must hold the complete image as a single bytes object
   image = Image.open(BytesIO(image_data))
   image.verify()  # raises an exception if the image is not valid; reopen the image before using it again

Secondly, implement a retry mechanism in your code in case network issues or temporary service disruptions cause incomplete data retrieval.
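
A retry wrapper along the following lines is one way to do that. This is a minimal sketch and not part of the SDK: `fetch_with_retry`, `fetch`, `attempts`, and `base_delay` are hypothetical names, and `fetch` stands for any zero-argument callable that returns an iterable of byte chunks (for example, a lambda wrapping `get_analyze_result_figure`).

```python
import time
from typing import Callable, Iterable


def fetch_with_retry(
    fetch: Callable[[], Iterable[bytes]],
    attempts: int = 3,
    base_delay: float = 1.0,
) -> bytes:
    """Call fetch() and join its chunks, retrying with exponential backoff."""
    last_error = None
    for attempt in range(attempts):
        try:
            # Consume the whole iterable inside the try block, because a
            # streamed download can also fail partway through iteration.
            return b"".join(fetch())
        except Exception as exc:  # narrow this to the SDK's error types in real code
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"download failed after {attempts} attempts") from last_error
```

Catching the broad `Exception` keeps the sketch short; in real code it is probably better to retry only on transient errors so that genuine bugs surface immediately.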

Also, if you are using the azure-storage-blob library to upload the image data, make sure it is uploaded correctly. Below is a sample implementation:

   from azure.storage.blob import BlobServiceClient
   blob_service_client = BlobServiceClient.from_connection_string("your_connection_string")
   blob_client = blob_service_client.get_blob_client(container="your_container", blob=f"{fig_fill}.jpg")
   blob_client.upload_blob(image_data, overwrite=True)

I hope this is helpful! Do not hesitate to let me know if you have any other questions.

Please don't forget to close the thread by upvoting this answer and accepting it if it is helpful.

Tran Minh 25 Reputation points

2024-09-26T01:42:16.5466667+00:00

Hi @Sina Salam

I called writelines(response) on a file opened with "wb" and the image from the PDF was saved completely, but when I tried to get the bytes from the response, it said "Image file is truncated", and the exception from image.verify() is

incomplete checksum in b'IDAT'

I don't know why I can save the image as a file but cannot get the bytes directly into a BytesIO.
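
The difference is likely explained by how the chunks are consumed: `writelines` writes every chunk the generator yields, whereas `rp_elem[0]` is only the first chunk, so an image decoder given that chunk alone sees a truncated file. The sketch below reproduces this with a stand-in generator. `fake_figure_stream` and its chunk contents are made up for illustration and are not SDK output.

```python
from io import BytesIO


def fake_figure_stream():
    """Stand-in for the SDK response: yields one file as several byte chunks."""
    yield b"\x89PNG\r\n\x1a\n"   # PNG signature
    yield b"<rest of header>"    # placeholder bytes, not real PNG data
    yield b"<pixel data>"        # placeholder bytes, not real PNG data


# Writing every chunk, as writelines() does, reproduces the whole file.
buffer = BytesIO()
buffer.writelines(fake_figure_stream())
whole_file = buffer.getvalue()

# Keeping only the first chunk yields just a fragment of the file.
chunks = list(fake_figure_stream())
first_chunk = chunks[0]
print(len(first_chunk), len(whole_file))

# A generator can be consumed only once: a second pass yields nothing.
stream = fake_figure_stream()
first_pass = b"".join(stream)
second_pass = b"".join(stream)
print(first_pass == whole_file, second_pass == b"")
```

The last part matters in practice: if the response has already been iterated once (for example, written to a file), iterating it again produces no data, so the bytes need to be collected on the first pass.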

Answer 2

Hi, I solved this problem. I will break it down as follows.

The get_analyze_result_figure method of document_intelligence_client returns a response containing the bytes of each figure it extracts. The bytes of a figure are split into several chunks, so the response looks something like this:
```
  b'\x89PNG\r\n\x1a\n
  b'\xf4\x18R\x87\xf3B
  b'\x0f\xa4\x02\x
```
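
The first chunk in this sample begins with the PNG signature (`\x89PNG\r\n\x1a\n`), which suggests the figure is a PNG even though the code in this thread saves it with a `.jpg` extension. A small sketch for choosing the extension from the leading bytes follows. `guess_extension` is a hypothetical helper that recognizes only PNG and JPEG, and this thread does not establish whether the service can return anything other than PNG.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
JPEG_SIGNATURE = b"\xff\xd8\xff"


def guess_extension(data: bytes, default: str = ".bin") -> str:
    """Return a file extension based on the leading magic bytes."""
    if data.startswith(PNG_SIGNATURE):
        return ".png"
    if data.startswith(JPEG_SIGNATURE):
        return ".jpg"
    return default
```

The check needs the joined bytes (or at least the first chunk), so it would be applied after the chunks have been collected.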

Then I use a Python bytes object to collect the byte chunks from the response:

  from PIL import Image
  from io import BytesIO

  response = document_intelligence_client.get_analyze_result_figure(
      model_id=result.model_id,
      result_id=operation_id,
      figure_id=figure['id'],  # or figure.name
  )

  # Concatenate every chunk the response yields into one bytes object
  response_bytes = bytes()
  for ith in response:
      response_bytes += ith

  # BytesIO is imported directly above, so it is used without the "io." prefix
  image = Image.open(BytesIO(response_bytes))

  # response_bytes now holds the complete image as bytes, so we can pass it anywhere

  from azure.storage.blob import BlobServiceClient
  blob_service_client = BlobServiceClient.from_connection_string("your_connection_string")
  blob_client = blob_service_client.get_blob_client(container="your_container", blob=f"{fig_name}.jpg")
  blob_client.upload_blob(

I tested this, and it works.
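
The same idea can be written more compactly with `b"".join(...)`, which avoids building a new bytes object on every `+=`. The sketch below is an illustration under assumptions, not the code tested in this answer: `upload_figure` is a hypothetical helper, and `blob_client` is expected to be an `azure.storage.blob.BlobClient` or any object exposing a compatible `upload_blob(data, overwrite=True)` method.

```python
from typing import Iterable


def upload_figure(chunks: Iterable[bytes], blob_client) -> int:
    """Join streamed chunks into one bytes object and upload it.

    Returns the number of bytes uploaded.
    """
    data = b"".join(chunks)  # consumes the generator exactly once
    if not data:
        raise ValueError("the response contained no data")
    blob_client.upload_blob(data, overwrite=True)
    return len(data)
```

It would be called as `upload_figure(response, blob_client)`, with `response` coming straight from `get_analyze_result_figure` and not yet iterated.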

About the SDK: the examples are excellent, but it would be better if the documentation said more about what the functions return.

Thanks for all your support and guidance.

romungi-MSFT 48,906 Reputation points Microsoft Employee Moderator

2024-09-26T07:43:52.4666667+00:00

Thanks @Tran Minh for posting the solution.
