How to Redirect Users to a Specific Page or Section in a PDF from Azure Cognitive Search?

Anonymous
2024-08-06T06:58:12.82+00:00

Hello,

I have developed a chatbot application where company PDFs are stored in SharePoint Online, converted to Azure Blob Storage, and indexed using Azure Cognitive Search. The application processes user queries, generates responses based on the content in these PDFs, and displays them in a React UI with citation links to the PDFs.

While the system successfully generates responses and displays citation links, I am facing an issue with redirecting users to the exact page or section within the PDF from which the content was extracted. Azure Cognitive Search does not provide a default metadata_storage_page field. The metadata_storage_* fields include path, name, size, last modified date, and content type, but they do not include page numbers.

Is there any way to retrieve the page number from Azure Cognitive Search along with the content and other metadata fields? How can I implement a feature that allows users to click on a citation link and be redirected to the exact page or section within the PDF that contains the extracted content?

Any guidance or solutions would be greatly appreciated.

Thank you!

Azure AI Search
An Azure search service with built-in artificial intelligence capabilities that enrich information to help identify and explore relevant content at scale.
Azure Blob Storage
An Azure service that stores unstructured data in the cloud as blobs.

1 answer

  1. Vinodh247 34,661 Reputation points MVP Volunteer Moderator
    2024-08-07T12:40:45.0933333+00:00

    Hi Anu Priya Narahari,

    Thanks for reaching out to Microsoft Q&A.

    Azure Cognitive Search does not provide page numbers by default. You can implement a custom solution that preprocesses the PDFs, so that when a user clicks a citation link, they are taken to the exact page or section of the PDF that contains the extracted content.

    Preprocess the PDFs:

    1. Use a tool or library like PyMuPDF/PDFMiner to extract text from PDFs along with page numbers.
    2. Store the extracted text in a structured format, such as JSON, where each text block is associated with its page number.
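The preprocessing step above could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: it assumes PyMuPDF is installed (imported as `fitz`), and the function names, the record keys (`source`, `page`, `content`), and the sample file name are hypothetical choices made for this example.

```python
import json
from typing import Iterable


def extract_page_texts(pdf_path: str) -> list[str]:
    """Return the text of each page, in order, using PyMuPDF."""
    # Imported lazily so the rest of the sketch runs without PyMuPDF installed.
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]


def build_page_records(source_name: str, page_texts: Iterable[str]) -> list[dict]:
    """Pair each non-empty page text with its 1-based page number."""
    records = []
    for page_number, text in enumerate(page_texts, start=1):
        cleaned = text.strip()
        if cleaned:  # skip blank pages; numbering of later pages is unaffected
            records.append(
                {"source": source_name, "page": page_number, "content": cleaned}
            )
    return records


if __name__ == "__main__":
    # Stand-in page texts; with a real file, pass extract_page_texts("handbook.pdf").
    sample_pages = ["Leave policy overview", "", "Carry-over rules"]
    print(json.dumps(build_page_records("handbook.pdf", sample_pages), indent=2))
```

Page numbers are kept 1-based here because that is what a `#page=` link expects later on.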

    Create a Custom Skill:

    1. Create a custom skill in Azure Cognitive Search to process the extracted JSON and index the content along with the page numbers.
    2. Define an output field for the page number and map it to a field in your index schema, e.g., metadata_storage_page (a custom field name, not one of the built-in metadata_storage_* fields).
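As a rough illustration of the custom skill, the function below handles the request and response envelope that a custom Web API skill exchanges with the indexer (a `values` array of records, each with `recordId`, `data`, `errors`, and `warnings`). The input name `pages` and the output name `chunks` are assumptions made for this sketch, and hosting the function behind an HTTP endpoint is left out.

```python
def run_page_skill(request_body: dict) -> dict:
    """Turn a custom-skill request into a response that keeps page numbers."""
    results = []
    for record in request_body.get("values", []):
        record_id = record.get("recordId")
        pages = record.get("data", {}).get("pages")  # hypothetical input name

        if not isinstance(pages, list):
            # Report the problem for this record only; other records still succeed.
            results.append({
                "recordId": record_id,
                "data": {},
                "errors": [{"message": "Expected 'pages' to be a list."}],
                "warnings": [],
            })
            continue

        # Keep only pages that actually contain text.
        chunks = [
            {"page": p.get("page"), "content": p.get("content")}
            for p in pages
            if p.get("content")
        ]
        results.append({
            "recordId": record_id,
            "data": {"chunks": chunks},  # hypothetical output name
            "errors": [],
            "warnings": [],
        })
    return {"values": results}
```

Each response record echoes the `recordId` it was given, which is how the indexer matches results back to documents.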

    Update Index Schema:

    1. Add a new field in your Azure Cognitive Search index for the page number, such as metadata_storage_page.
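One possible shape for the schema change is shown below as a Python dictionary that mirrors the JSON body of an index definition. The index name `pdf-pages` and the `id` and `content` fields are assumptions for this sketch. Only `metadata_storage_page` comes from the answer, and it is a custom field you define yourself.

```python
import json

# Field definitions in the shape used by the index REST API.
page_chunk_fields = [
    {"name": "id", "type": "Edm.String", "key": True, "filterable": True},
    {"name": "content", "type": "Edm.String", "searchable": True},
    {"name": "metadata_storage_path", "type": "Edm.String", "filterable": True},
    {
        "name": "metadata_storage_page",  # custom field holding the page number
        "type": "Edm.Int32",
        "filterable": True,
        "sortable": True,
        "retrievable": True,
    },
]

# "pdf-pages" is a hypothetical index name.
index_definition = {"name": "pdf-pages", "fields": page_chunk_fields}

print(json.dumps(index_definition, indent=2))
```

Storing the page as an integer keeps it filterable and sortable, for example to list hits from one document in page order.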

    Modify the Indexing Pipeline:

    1. Modify your data ingestion and indexing pipeline to include the page numbers along with the text content.
    2. Ensure that when a document is processed, the page number is included in the fields indexed by Azure Cognitive Search.
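If you push documents to the index yourself rather than relying on an indexer, the pipeline change might look like the sketch below. It builds the JSON body for a documents batch, with one search document per PDF page. The field names match the hypothetical schema used in this thread (`id`, `content`, `metadata_storage_path`, `metadata_storage_page`), and the key format is an assumption. Sending the batch to the service is not shown.

```python
import base64


def make_document_key(blob_path: str, page: int) -> str:
    """Build a key from the URL-safe Base64 of the path plus the page number.

    Document keys may only contain letters, digits, dashes, underscores,
    and equal signs, so the raw blob URL cannot be used directly.
    """
    encoded = base64.urlsafe_b64encode(blob_path.encode("utf-8")).decode("ascii")
    return f"{encoded.rstrip('=')}-p{page}"


def build_upload_batch(blob_path: str, page_records: list[dict]) -> dict:
    """Create one search document per page, carrying its page number."""
    return {
        "value": [
            {
                "@search.action": "mergeOrUpload",
                "id": make_document_key(blob_path, record["page"]),
                "content": record["content"],
                "metadata_storage_path": blob_path,
                "metadata_storage_page": record["page"],
            }
            for record in page_records
        ]
    }
```

Indexing one document per page means every search hit already carries the page it came from, so the application does not have to work it out afterwards.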

    Update Search Results with Page Numbers:

    1. When retrieving search results, include the page number in the response.
    2. Ensure your application captures this information and uses it to create links pointing to the specific pages in the PDFs.
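On the retrieval side, a sketch like the following could request the page field and turn hits into citation records. It only builds the search request body and parses a response of the usual shape (a `value` array of hits). The field names are the hypothetical ones used in this thread, and the HTTP call itself is omitted.

```python
def build_search_request(query: str, top: int = 3) -> dict:
    """Request body that asks for the page number alongside the content."""
    return {
        "search": query,
        "top": top,
        "select": "content,metadata_storage_path,metadata_storage_page",
    }


def extract_citations(search_response: dict) -> list[dict]:
    """Reduce search hits to what a citation link needs."""
    citations = []
    for hit in search_response.get("value", []):
        page = hit.get("metadata_storage_page")
        if page is None:
            continue  # no page number indexed for this hit; skip it
        citations.append({
            "path": hit.get("metadata_storage_path"),
            "page": page,
            "snippet": (hit.get("content") or "")[:200],
        })
    return citations
```

The chatbot backend would pass these citation records to the React UI together with the generated answer.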

    Generate Links to Specific PDF Pages:

    1. Use the page number information to generate URLs pointing to specific pages in the PDF.
    2. PDF viewers such as Adobe Reader and web-based PDF viewers support URL fragments that open a PDF at a specific page. For example: http://xxxx.com/doc.pdf#page=125.
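A small helper for building such links might look like this. It is a sketch under the assumption that the PDF URL may already carry a query string (for example a SAS token on a blob URL); the function name is hypothetical. The fragment has to come after the query string, which `urllib.parse` handles.

```python
from urllib.parse import urlsplit, urlunsplit


def pdf_page_link(pdf_url: str, page: int) -> str:
    """Return pdf_url with a '#page=N' fragment, keeping any query string."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    parts = urlsplit(pdf_url)
    # Replace any existing fragment; the query string is left untouched.
    return urlunsplit(parts._replace(fragment=f"page={page}"))


print(pdf_page_link("https://example.com/doc.pdf", 125))
```

Note that the fragment is only honoured when the PDF is opened in a viewer (for example, rendered inline by the browser). If the file is served as a download, the page hint is likely to be ignored.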

