Extracting text and interleaved figures from a scanned PDF

Question

I'm using material from a scanned book in an experiment (with publisher permission, of course). The book predates ebooks, so what I have is high-quality scans of every page. The book is novel in that (a) there is on average at least one image per page, often several, and (b) the images are not delimited by boxes, nor do they have figure numbers. It's a popular science book, so, for example, textual labels in the images are handwritten. I'm trying to figure out a good way of extracting both the text and the images from each page, ideally into something civilized like JSON or XML, with the rough sequential ordering on the page preserved. Does anyone know of a good method for this? Thanks.

Answer

Hello,

The Computer Vision Read API is Azure's latest OCR technology. It extracts printed text (in several languages), handwritten text (English only), digits, and currency symbols from images and multi-page PDF documents. It is optimized for text-heavy images and documents with mixed languages, and it can detect both printed and handwritten text in the same image or document.

The Read API includes the following features.

Print text extraction in 73 languages
Handwritten text extraction in English
Text lines and words with location and confidence scores
No language identification required
Support for mixed languages, mixed mode (print and handwritten)
Select pages and page ranges from large, multi-page documents
Natural reading order for text lines
Handwriting classification for text lines
Available as a Distroless Docker container for on-premises deployment
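
To connect the features above to the JSON output asked for in the question, here is a minimal sketch that flattens a Read-style result into an ordered list of line records. It assumes the result has already been fetched and parsed into a Python dict, and that it follows the shape I believe the v3.2 REST response uses (analyzeResult, readResults, lines, boundingBox, appearance); please check those field names against the current documentation. The function name flatten_read_result and the sample data are made up for illustration, and no call to the service is made here.

```python
import json


def flatten_read_result(result):
    """Flatten a Read-style result dict into line records, keeping the order given."""
    records = []
    pages = result.get("analyzeResult", {}).get("readResults", [])
    for page in pages:
        for index, line in enumerate(page.get("lines", [])):
            # Assumed field: appearance.style.name is "handwriting" for handwritten lines.
            style = line.get("appearance", {}).get("style", {})
            records.append({
                "page": page.get("page"),
                "order": index,  # position of the line within its page
                "text": line.get("text", ""),
                "bounding_box": line.get("boundingBox", []),
                "handwritten": style.get("name") == "handwriting",
            })
    return records


# Hypothetical sample shaped like the assumed response, not real service output.
sample = {
    "status": "succeeded",
    "analyzeResult": {
        "readResults": [
            {
                "page": 1,
                "lines": [
                    {
                        "text": "Chapter 1",
                        "boundingBox": [10, 10, 200, 10, 200, 40, 10, 40],
                        "appearance": {"style": {"name": "other"}},
                    },
                    {
                        "text": "wing",
                        "boundingBox": [50, 300, 120, 300, 120, 330, 50, 330],
                        "appearance": {"style": {"name": "handwriting"}},
                    },
                ],
            }
        ]
    },
}

print(json.dumps(flatten_read_result(sample), indent=2))
```

Each record keeps the bounding box, so handwritten labels that fall inside a figure could later be separated from body text by position. Note that this covers the text only; the figures themselves would still need to be cropped from the page images in a separate step.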

I think this is a good fit for your book, since a book is a text-heavy document.

https://learn.microsoft.com/en-us/azure/cognitive-services/computer-vision/overview-ocr#read-api

Regards,
Yutong
