Extract styles and images using DocumentAnalysisClient

Dmytro Kurdomanenko 0 Reputation points
2024-10-04T13:57:12.16+00:00

Hi there.
I need to create a PDF parser that can extract text with styles (font, font style, font size, and hyperlinks) and images for RAG.
I created a basic parser that extracts paragraphs and tables in a structured way. But I also need images and styles.

I attach my code:

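    from azure.ai.formrecognizer import (
        AnalysisFeature,
        AnalyzeResult,
        DocumentAnalysisClient,
    )

    # endpoint, credential and filepath are defined elsewhere
    document_analysis_client = DocumentAnalysisClient(
        endpoint=endpoint,
        credential=credential,
    )

    with open(filepath, "rb") as f:
        poller = document_analysis_client.begin_analyze_document(
            "prebuilt-document",
            document=f,
            features=[
                AnalysisFeature.OCR_HIGH_RESOLUTION,
                AnalysisFeature.STYLE_FONT,  # requests font properties
                AnalysisFeature.BARCODES,
                AnalysisFeature.FORMULAS,
            ],
        )

    # Block until the analysis has finished
    result: AnalyzeResult = poller.result()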

Is there a way to modify it while keeping the "prebuilt-document" model and the separate table extraction with row/column spans?

Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.

1 answer

  1. YutongTie-MSFT 52,861 Reputation points
    2024-10-06T22:28:35.0233333+00:00

    Hello Dmytro,

    Thanks for reaching out. To extract text with styles, consider enabling the font property extraction add-on capability: https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/concept-add-on-capabilities?view=doc-intel-4.0.0&tabs=rest-api#font-property-extraction

    Font property extraction

    The ocr.font capability extracts all font properties of text extracted in the styles collection as a top-level object under content. Each style object specifies a single font property, the text span it applies to, and its corresponding confidence score. The existing style property is extended with more font properties such as similarFontFamily for the font of the text, fontStyle for styles such as italic and normal, fontWeight for bold or normal, color for color of the text, and backgroundColor for color of the text bounding box.
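
The code in the question already passes AnalysisFeature.STYLE_FONT, so the font properties described above should show up in result.styles. Below is a minimal sketch of how each style could be mapped back to the text it covers. The helper name styled_runs is hypothetical, and the attribute names (similar_font_family, font_style, font_weight, color, background_color, spans, confidence) are assumed to match the DocumentStyle model in azure-ai-formrecognizer 3.3; check them against your installed SDK version. The demo uses stand-in objects instead of a real service response.

```python
from types import SimpleNamespace

# Font attributes assumed to exist on each style object (SDK 3.3 naming).
FONT_PROPS = (
    "similar_font_family",
    "font_style",
    "font_weight",
    "color",
    "background_color",
)


def styled_runs(content, styles):
    """Return one dict per style span: the covered text plus the font properties set on it."""
    runs = []
    for style in styles or []:
        # Keep only the font properties that are actually present on this style.
        props = {}
        for name in FONT_PROPS:
            value = getattr(style, name, None)
            if value is not None:
                props[name] = value
        if not props:
            continue  # e.g. a handwriting-only style with no font information
        for span in style.spans:
            text = content[span.offset : span.offset + span.length]
            runs.append({"text": text, "confidence": style.confidence, **props})
    return runs


# Stand-in data shaped like result.content / result.styles (not a real response).
demo_content = "Hello bold world"
demo_styles = [
    SimpleNamespace(
        font_weight="bold",
        confidence=0.9,
        spans=[SimpleNamespace(offset=6, length=4)],
    ),
    SimpleNamespace(is_handwritten=True, confidence=0.5, spans=[]),
]
demo_runs = styled_runs(demo_content, demo_styles)
```

With a real response, the call would be styled_runs(result.content, result.styles). Each returned entry carries a single property because, as the documentation quoted above says, each style object specifies one font property for its span, so overlapping spans would still need to be merged to get the full style of a text run.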

    I hope this helps.

    Regards,

    Yutong

