Thanks for reaching out to Microsoft Q&A.
Azure Cognitive Search does not provide page numbers by default. You can implement a custom solution that preprocesses the PDFs so that, when a user clicks a citation link, they are redirected to the exact page or section of the PDF that contains the extracted content.
Preprocess the PDFs:
- Use a tool or library like PyMuPDF/PDFMiner to extract text from PDFs along with page numbers.
- Store the extracted text in a structured format, such as JSON, where each text block is associated with its page number.
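A minimal sketch of this preprocessing step, assuming PyMuPDF is installed (imported as fitz). The helper names and the record keys (source, page, content) are illustrative choices, not part of any Azure or PyMuPDF convention.

```python
import json
from typing import Iterable


def extract_page_texts(pdf_path: str) -> list[str]:
    """Return the plain text of each page, in page order (requires PyMuPDF)."""
    import fitz  # PyMuPDF; imported lazily so the helper below works without it

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]


def build_page_records(source: str, page_texts: Iterable[str]) -> list[dict]:
    """Pair each non-empty page text with its 1-based page number."""
    records = []
    for page_number, text in enumerate(page_texts, start=1):
        text = text.strip()
        if text:  # skip blank pages; numbering still follows the PDF
            records.append({"source": source, "page": page_number, "content": text})
    return records


if __name__ == "__main__":
    # Demo with in-memory strings; for a real file you would pass
    # extract_page_texts("report.pdf") instead of the literal list.
    demo = build_page_records("report.pdf", ["Intro text", "   ", "Results text"])
    print(json.dumps(demo, indent=2))
```

Keeping one record per page (rather than one per document) is what later lets each search hit carry its own page number.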
Create a Custom Skill:
- Create a custom skill (for example, a Web API skill in your skillset) that processes the extracted JSON and emits the content together with its page number, so that both can be indexed.
- Define an output field in your index schema for page numbers, e.g., metadata_storage_page.
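As a rough sketch, the core of such a custom skill could look like the function below. It follows the request and response envelope that custom Web API skills use (a values array of records, each with a recordId and data). The input names page and content are assumptions about how you wire the skill inputs, and the function would still need to be hosted behind an HTTP endpoint (for example an Azure Function).

```python
def run_page_skill(payload: dict) -> dict:
    """Handle one custom-skill request body and return the response body."""
    results = []
    for record in payload.get("values", []):
        data = record.get("data", {})
        page = data.get("page")  # assumed input name
        out = {"recordId": record["recordId"], "data": {}, "errors": [], "warnings": []}
        if isinstance(page, int) and page >= 1:
            out["data"] = {
                "content": data.get("content", ""),  # assumed input name
                "metadata_storage_page": page,
            }
        else:
            # Report the problem per record instead of failing the whole batch.
            out["errors"].append({"message": "Missing or invalid page number"})
        results.append(out)
    return {"values": results}


if __name__ == "__main__":
    request = {"values": [{"recordId": "0", "data": {"page": 3, "content": "Results text"}}]}
    print(run_page_skill(request))
```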
Update Index Schema:
- Add a new field to your Azure Cognitive Search index for the page number, such as metadata_storage_page.
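For illustration, an index definition with a page-number field might look like the following, written as a Python dict that serializes to the JSON body used when creating an index. The index name and the id, content, and source_url fields are hypothetical; only metadata_storage_page comes from the steps above.

```python
import json

index_definition = {
    "name": "pdf-pages",  # hypothetical index name
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True, "filterable": True},
        {"name": "content", "type": "Edm.String", "searchable": True},
        {"name": "source_url", "type": "Edm.String", "retrievable": True},
        {
            "name": "metadata_storage_page",
            "type": "Edm.Int32",  # integer, so results can be filtered or sorted by page
            "filterable": True,
            "sortable": True,
            "retrievable": True,
        },
    ],
}

if __name__ == "__main__":
    print(json.dumps(index_definition, indent=2))
```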
Modify the Indexing Pipeline:
- Modify your data ingestion and indexing pipeline to include the page numbers along with the text content.
- Ensure that when a document is processed, the page number is included in the fields indexed by Azure Cognitive Search.
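If you push documents to the index yourself rather than relying on an indexer, the batch could be built roughly as below. This is a sketch: the record shape (source, page, content), the source_url field, and the base URL are assumptions, and the resulting dict is meant to be sent as the JSON body of a documents index request.

```python
import base64


def make_document_key(source: str, page: int) -> str:
    """Build a per-page key from the file name and page number.

    Document keys may only contain letters, digits, underscore, dash, or
    equal sign; URL-safe base64 stays inside that character set.
    """
    raw = f"{source}#{page}".encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def build_upload_batch(records: list[dict], base_url: str) -> dict:
    """Turn per-page records into an upload batch that includes the page number."""
    actions = []
    for rec in records:
        actions.append({
            "@search.action": "upload",
            "id": make_document_key(rec["source"], rec["page"]),
            "content": rec["content"],
            "source_url": f"{base_url.rstrip('/')}/{rec['source']}",
            "metadata_storage_page": rec["page"],
        })
    return {"value": actions}


if __name__ == "__main__":
    sample = [{"source": "report.pdf", "page": 2, "content": "Results text"}]
    print(build_upload_batch(sample, "https://example.com/files/"))
```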
Update Search Results with Page Numbers:
- When retrieving search results, include the page number in the response.
- Ensure your application captures this information and uses it to create links pointing to the specific pages in the PDFs.
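On the application side, pulling the page number out of a search response could be sketched like this. It assumes the response has already been parsed from JSON into a dict with a value array of hits, and that the index uses the hypothetical content and source_url fields alongside metadata_storage_page.

```python
def extract_citations(search_response: dict, snippet_length: int = 200) -> list[dict]:
    """Collect what the UI needs to render one citation per search hit."""
    citations = []
    for hit in search_response.get("value", []):
        citations.append({
            "source_url": hit.get("source_url"),
            "page": hit.get("metadata_storage_page"),  # None if the field is absent
            "score": hit.get("@search.score"),
            "snippet": (hit.get("content") or "")[:snippet_length],
        })
    return citations


if __name__ == "__main__":
    response = {
        "value": [
            {
                "@search.score": 1.7,
                "content": "Results text",
                "source_url": "https://example.com/files/report.pdf",
                "metadata_storage_page": 2,
            }
        ]
    }
    print(extract_citations(response))
```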
Generate Links to Specific PDF Pages:
- Use the page number information to generate URLs pointing to specific pages in the PDF.
- PDF viewers such as Adobe Reader and web-based PDF viewers support URL fragments that open a PDF at a specific page. For example, http://xxxx.com/doc.pdf#page=125 opens the document at page 125.
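A small helper for building such links, as a sketch; the function name is illustrative. It uses the standard library to set the fragment, so any query string on the PDF URL (for example a SAS token) is preserved and any existing fragment is replaced.

```python
from urllib.parse import urlsplit, urlunsplit


def pdf_page_url(pdf_url: str, page: int) -> str:
    """Return pdf_url with its fragment set to open the given 1-based page."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    parts = urlsplit(pdf_url)
    return urlunsplit(parts._replace(fragment=f"page={page}"))


if __name__ == "__main__":
    print(pdf_page_url("http://xxxx.com/doc.pdf", 125))
    # prints: http://xxxx.com/doc.pdf#page=125
```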