Hi,
I am working on extracting images from PDFs and using Azure Computer Vision OCR to extract the text in each image. I am using the azure-cognitiveservices-vision-computervision library in Python 3.11, and my Vision resource is on the Free tier.
The way my program works is as follows:
- I use the pymupdf library (i.e. fitz) to extract images from the PDF:
import io
import fitz
import PIL.Image

# doc is the open fitz document and page is one of its pages
image_list = page.get_image_info(xrefs=True)
for img in image_list:
    xref = img["xref"]  # get the XREF of the image
    if not xref:
        continue  # if the block is not an image, skip to the next block
    image = fitz.Pixmap(doc, xref)
    # Correct the color space
    if image.colorspace.name not in (fitz.csGRAY.name, fitz.csRGB.name):
        image = fitz.Pixmap(fitz.csRGB, image)
    # Load the extracted image as a PIL Image
    image = PIL.Image.open(io.BytesIO(image.tobytes()))
- Perform preprocessing on the image and return the bytes as a BytesIO:
buffer = io.BytesIO()
x, y = image.size
# Clamp each dimension of the image to between 50 and 16000 pixels
new_size = (max(50, min(x, 16000)), max(50, min(y, 16000)))
resized_image = image.resize(new_size)
resized_image.save(buffer, "PNG")
buffer.seek(0)
- Perform OCR using the Azure Computer Vision Python API:
result = self.computervision_client.recognize_printed_text_in_stream(buffer)
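To make the clamping in the preprocessing step concrete, here is a minimal standalone sketch of the same arithmetic. The helper name `clamp_size` and its `lo`/`hi` parameters are hypothetical and not part of the actual program; the function only mirrors the `max(50, min(x, 16000))` expression used for `new_size`.

```python
def clamp_size(width, height, lo=50, hi=16000):
    """Clamp each dimension independently to the range [lo, hi].

    Hypothetical helper: mirrors the new_size expression in the
    preprocessing step, nothing more.
    """
    return max(lo, min(width, hi)), max(lo, min(height, hi))


# Both dimensions already in range: the size is returned unchanged.
print(clamp_size(59, 93))      # (59, 93)
# Too narrow and too tall: each dimension is clamped separately.
print(clamp_size(30, 20000))   # (50, 16000)
```

Because each dimension is clamped on its own, an image with only one dimension out of range is stretched or squeezed rather than scaled proportionally.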
The steps above work well for most images in PDFs. However, when I tested them on new documents, I ran into a failure.
One of my test documents contains the following extracted image:
The Azure Computer Vision API threw this exception: azure.cognitiveservices.vision.computervision.models._models_py3.ComputerVisionErrorResponseException: (InvalidRequest) Input data is not a valid image.
The image is sent to Azure CV as a PNG file; it has dimensions of 59 x 93 pixels and is 58.9 KB. I suspected the image might be corrupted or empty, but when I save the contents of the buffer to disk, the image is viewable.
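For anyone trying to reproduce this, a check along the following lines may help confirm what the buffer actually contains, beyond the fact that the file opens in a viewer. It is a minimal sketch using only the standard library; `describe_png` is a hypothetical helper name, and it assumes the buffer holds a standard PNG whose first chunk is IHDR, as the PNG specification requires. It reports the width, height, bit depth, and color type.

```python
import io
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# Color type codes defined by the PNG specification.
COLOR_TYPES = {
    0: "grayscale",
    2: "RGB",
    3: "palette",
    4: "grayscale+alpha",
    6: "RGB+alpha",
}


def describe_png(buffer):
    """Read the PNG signature and IHDR chunk from a binary stream.

    Hypothetical diagnostic helper. The stream position is restored
    afterwards so the buffer can still be passed on to another consumer.
    """
    start = buffer.tell()
    try:
        # 8 signature + 4 length + 4 chunk type + 13 IHDR data + 4 CRC
        header = buffer.read(33)
    finally:
        buffer.seek(start)

    if len(header) < 33 or header[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG stream")

    length, chunk_type = struct.unpack(">I4s", header[8:16])
    if chunk_type != b"IHDR" or length != 13:
        raise ValueError("first chunk is not a valid IHDR")

    width, height, bit_depth, color_type = struct.unpack(">IIBB", header[16:26])
    return {
        "width": width,
        "height": height,
        "bit_depth": bit_depth,
        "color_type": COLOR_TYPES.get(color_type, f"unknown ({color_type})"),
    }
```

Calling `print(describe_png(buffer))` right after `buffer.seek(0)` would show these fields for the failing image. The reason this seems worth checking: 59 x 93 pixels at 8-bit RGBA is only about 21 KB uncompressed, so a 58.9 KB PNG suggests the file carries more than plain 8-bit pixel data. That is only an observation from the numbers, not a confirmed cause of the error.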
Any help would be appreciated.