Text content missing from the returned PageResult in Java SDK

Haiquan Li 1 Reputation point
2021-03-17T13:43:31.773+00:00

Hi, I am testing the Computer Vision Java SDK with two similar PDF files (one English, one French). Some strings that are visible in the PDF are not reported in the PageResult for one version of the file, although they are read correctly from the other version. I would expect every visible string to be reported in the PageResult object. For example, at the end of each page there is a file version string, 3885A (11/20), in both PDF files. The Computer Vision Java SDK returns this string for v1.pdf but not for v2.pdf. Could someone help with this issue and find out why some strings are missing even though they are visible in the PDF? v1.pdf and v2.pdf are attached for reference.

Thanks,
Jonathan

[78822-v1.pdf][1] [78823-v2.pdf][2]

[1]: /api/attachments/78822-v1.pdf?platform=QnA
[2]: /api/attachments/78823-v2.pdf?platform=QnA
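
One way to pin down exactly which strings are dropped is to compare the extracted text lines against a list of strings known to be visible on the page. The following is only a minimal sketch: it assumes the text of each PageResult has already been collected into a plain `List<String>`, and the class and method names (`MissingTextCheck`, `missingFrom`, `normalize`) are hypothetical helpers, not part of the Computer Vision SDK. The sample lines in `main` are made-up stand-ins, not actual output from v1.pdf or v2.pdf.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the Computer Vision SDK.
final class MissingTextCheck {

    // Collapse whitespace and case so small spacing differences do not count as misses.
    static String normalize(String s) {
        return s.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    // Return the expected strings that do not occur in any extracted line.
    static List<String> missingFrom(List<String> extractedLines, List<String> expected) {
        List<String> missing = new ArrayList<>();
        for (String want : expected) {
            String needle = normalize(want);
            boolean found = false;
            for (String line : extractedLines) {
                if (normalize(line).contains(needle)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                missing.add(want);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Made-up lines standing in for the text extracted from each PDF.
        List<String> v1Lines = Arrays.asList("Page 1 of 2", "3885A (11/20)");
        List<String> v2Lines = Arrays.asList("Page 1 de 2");
        List<String> expected = Arrays.asList("3885A (11/20)");

        System.out.println("Missing in v1: " + missingFrom(v1Lines, expected));
        System.out.println("Missing in v2: " + missingFrom(v2Lines, expected));
    }
}
```

Running the same check on both files gives a concrete list of missing strings to include in a bug report, rather than a visual comparison of the two results.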

Azure Computer Vision
An Azure artificial intelligence service that analyzes content in images and video.

1 answer

  1. YutongTie-MSFT 24,636 Reputation points Microsoft Employee
    2021-03-18T01:04:15.813+00:00

    Thanks for reaching out to us, but I cannot open your PDF files. Could you please upload them again?

    Also, here are two products I would recommend if you are trying to extract text from PDF files:

    1. Form Recognizer https://azure.microsoft.com/en-us/services/cognitive-services/form-recognizer/
    2. Read API https://learn.microsoft.com/en-us/azure/cognitive-services/computer-vision/concept-recognizing-text#read-api

    Thanks.

    Regards,
    Yutong
