How to read the output of Document Intelligence Prebuilt Layout Model?

Gavin Gao 20 Reputation points
2023-10-03T15:48:30.5633333+00:00

Hi there,

I'm currently working on an Azure Document Intelligence project, and I have a few questions regarding the pre-built Layout model. Specifically, I'm wondering:

  1. Is there any documentation that provides a full specification of the JSON response for this model? I've discovered that some fields such as kind and rowSpan are optional for cell, and role is optional for paragraph. However, I would like to have a complete understanding of all fields and their options.
  2. What is the meaning of the span field in the model output? I understand that it's a fundamental element, but I'm unsure of its significance. In particular, what do the inner fields length and offset represent?

Thanks, Gavin

Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.

Accepted answer
  1. dupammi 7,955 Reputation points Microsoft Vendor
    2023-10-04T11:04:03.88+00:00

    Hi @Gavin Gao ,

    Thank you for reaching out to the Azure community forum!

    I understand that you are working on Azure Document Intelligence and have a few questions on the significance of each element in the response JSON of the pre-built Layout model. I will be happy to assist you regarding this.

    A span tells you where an element's text sits inside the top-level content string of the response. It does not describe a position within a single line or sentence.

    Each span has two inner fields: offset is the zero-based index in content where the element's text starts, and length is the number of characters it covers. Both are counted in the unit reported by stringIndexType. Taking length characters of content, starting at offset, gives you the exact text of that word, line, paragraph, or table.
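    To make the offset and length fields concrete, here is a minimal sketch that resolves spans against the top-level content string. The analyze_result dictionary is hand-written sample data shaped like a Layout response, and text_for_span is a hypothetical helper name, not part of any SDK. The sketch assumes the text is plain ASCII, so Python string indexes line up with the service offsets.

```python
# Hand-written fragment shaped like a Layout analyzeResult (hypothetical data).
analyze_result = {
    "content": "Invoice 42\nTotal due",
    "pages": [
        {
            "pageNumber": 1,
            "words": [
                {"content": "Invoice", "span": {"offset": 0, "length": 7}},
                {"content": "42", "span": {"offset": 8, "length": 2}},
                {"content": "Total", "span": {"offset": 11, "length": 5}},
                {"content": "due", "span": {"offset": 17, "length": 3}},
            ],
            "lines": [
                {"content": "Invoice 42", "spans": [{"offset": 0, "length": 10}]},
                {"content": "Total due", "spans": [{"offset": 11, "length": 9}]},
            ],
        }
    ],
}


def text_for_span(content, span):
    """Return the slice of the top-level content that a span points at."""
    start = span["offset"]
    return content[start:start + span["length"]]


content = analyze_result["content"]
for word in analyze_result["pages"][0]["words"]:
    # Each word's span resolves back to the word's own text.
    print(word["content"], "->", repr(text_for_span(content, word["span"])))
```

    Note that words carry a single span, while lines carry a list of spans, so a line's text is the concatenation of its span slices.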

    Table elements:

    The rowSpan and columnSpan fields give the number of rows or columns that a table cell spans. Both are optional: when they are omitted, the cell spans a single row or column (a value of 1). Values greater than 1 indicate merged cells.

    The kind field describes the role of the cell within the table, not the type of content it holds. The documented values are content, rowHeader, columnHeader, stubHead, and description. When kind is omitted, treat the cell as ordinary content. You can use it to separate header cells from data cells when processing a table.
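    As an illustration of how the optional cell fields can be handled, the sketch below rebuilds a table as a list of rows. It assumes the REST-style field names rowIndex, columnIndex, rowSpan, columnSpan, and kind, and treats a missing span as 1 and a missing kind as content. The sample table is made up, and table_to_grid and header_cells are hypothetical helper names.

```python
# Hand-written table shaped like one entry of "tables" (hypothetical data).
sample_table = {
    "rowCount": 3,
    "columnCount": 2,
    "cells": [
        {"rowIndex": 0, "columnIndex": 0, "columnSpan": 2,
         "kind": "columnHeader", "content": "Totals"},
        {"rowIndex": 1, "columnIndex": 0, "content": "Net"},
        {"rowIndex": 1, "columnIndex": 1, "content": "10"},
        {"rowIndex": 2, "columnIndex": 0, "content": "Tax"},
        {"rowIndex": 2, "columnIndex": 1, "content": "2"},
    ],
}


def table_to_grid(table):
    """Expand cells into a rowCount x columnCount grid of strings.

    A merged cell's content is repeated in every position it covers.
    """
    grid = [[""] * table["columnCount"] for _ in range(table["rowCount"])]
    for cell in table["cells"]:
        row_span = cell.get("rowSpan", 1)        # optional, defaults to 1
        col_span = cell.get("columnSpan", 1)     # optional, defaults to 1
        for r in range(cell["rowIndex"], cell["rowIndex"] + row_span):
            for c in range(cell["columnIndex"], cell["columnIndex"] + col_span):
                grid[r][c] = cell["content"]
    return grid


def header_cells(table):
    """Return the content of cells marked as column headers."""
    return [c["content"] for c in table["cells"]
            if c.get("kind", "content") == "columnHeader"]


grid = table_to_grid(sample_table)
for row in grid:
    print(row)
```

    Repeating merged content across the covered positions is only one possible design; leaving the extra positions empty works equally well if downstream code expects that.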

    Paragraphs & Role field:

    The role field is optional because not every paragraph has a recognized logical role. When it is present, it adds context about the paragraph. The roles documented for the Layout model are title, sectionHeading, footnote, pageHeader, pageFooter, and pageNumber. For example, a paragraph with the role sectionHeading is a section heading, while a paragraph with the role pageFooter may contain copyright information.
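    Because role is optional, code that reads paragraphs should not assume the key exists. The following sketch groups paragraph text by role. The sample paragraphs are made up, group_by_role is a hypothetical helper, and the "body" label for paragraphs without a role is my own convention, not a value returned by the service.

```python
from collections import defaultdict

# Hand-written entries shaped like the "paragraphs" list (hypothetical data).
sample_paragraphs = [
    {"role": "title", "content": "Quarterly Report"},
    {"role": "sectionHeading", "content": "1. Overview"},
    {"content": "Revenue grew in the third quarter."},
    {"role": "pageFooter", "content": "Copyright Contoso"},
]


def group_by_role(paragraphs, default_role="body"):
    """Group paragraph text by role; paragraphs without a role use default_role."""
    groups = defaultdict(list)
    for paragraph in paragraphs:
        groups[paragraph.get("role", default_role)].append(paragraph["content"])
    return dict(groups)


for role, texts in group_by_role(sample_paragraphs).items():
    print(role, texts)
```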

    The following illustration shows the typical components of a sample page.

    Illustration of document layout example.

    Below is a high-level explanation of the main elements in the JSON response and what each one means:

    "apiVersion": Indicates the REST API version used for this response.

    "modelId": Specifies the ID of the model used for the analysis. For the Layout model this is prebuilt-layout.

    "stringIndexType": Names the character unit used for span offsets and lengths: textElements, unicodeCodePoint, or utf16CodeUnit.

    "content": Contains the text extracted from the whole document, including line breaks. Every span refers back to this string.

    "pages": Represents a list of pages analyzed within the document.

    "spans" within "pages": The parts of the top-level content that belong to the page.

    "pageNumber": The 1-based number of the page in the input document.

    "angle": The general orientation of the content on the page, in degrees, measured clockwise.

    "width" and "height": The page dimensions, expressed in the unit given by "unit".

    "unit": The unit used for width, height, and polygon coordinates: pixel for images and inch for PDFs.

    "words": Contains a list of extracted words on the page, along with their positions and confidence scores.

    "span" within "words": The location of the word in the top-level content. Each word has a single span.

    "selectionMarks": Lists selection marks (e.g., checkboxes) on the page, including their state (selected or unselected), positions, and confidence scores.

    "span" within "selectionMarks": The location of the selection mark in the top-level content. Each selection mark has a single span.

    "lines": Contains a list of lines on the page, which may include both words and selection marks.

    "spans" within "lines": The parts of the top-level content covered by the line.

    "tables": Represents a list of extracted tables, including their row and column counts.

    "spans" within "tables": The parts of the top-level content covered by the table.

    "cells": Contains details about the cells within a table, including their row and column index, content, and the optional kind, rowSpan, and columnSpan fields.

    "keyValuePairs": Lists extracted key-value pairs, including the key, value, and extraction confidence.

    "spans" within "keyValuePairs": The parts of the top-level content covered by the key or by the value of a pair.

    "styles": Represents different styles of content, such as handwritten or printed, with associated spans and confidence scores.

    "documents": Contains information about extracted documents, including their type, bounding regions, and spans.

    "fields": Provides details about extracted fields within a document, including their type, value, content, and confidence.

    Note that "keyValuePairs", "documents", and "fields" belong to the response schema shared by all models and are filled in by other models, such as the general document or invoice models. A plain prebuilt-layout response typically does not include them.

    These elements and their associated attributes help structure and provide detailed information about the content, layout, styles, and extracted data within the document being analyzed.
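    To show how the page-level elements described above fit together, here is a small sketch that walks "pages" and summarizes each one. The sample result is hand-written and only contains the keys the sketch reads; page_text and summarize_pages are hypothetical helper names. It again assumes plain ASCII text, so that Python indexes match the span offsets.

```python
# Hand-written fragment shaped like an analyzeResult (hypothetical data).
sample_result = {
    "content": "Hello world\nSecond page",
    "pages": [
        {"pageNumber": 1, "width": 8.5, "height": 11, "unit": "inch",
         "spans": [{"offset": 0, "length": 11}],
         "words": [{"content": "Hello"}, {"content": "world"}],
         "lines": [{"content": "Hello world"}]},
        {"pageNumber": 2, "width": 8.5, "height": 11, "unit": "inch",
         "spans": [{"offset": 12, "length": 11}],
         "words": [{"content": "Second"}, {"content": "page"}],
         "lines": [{"content": "Second page"}]},
    ],
}


def page_text(content, page):
    """Concatenate the slices of the top-level content covered by a page."""
    return "".join(
        content[s["offset"]:s["offset"] + s["length"]]
        for s in page.get("spans", [])
    )


def summarize_pages(result):
    """Return one small summary dictionary per analyzed page."""
    summary = []
    for page in result.get("pages", []):
        summary.append({
            "pageNumber": page["pageNumber"],
            "size": (page.get("width"), page.get("height"), page.get("unit")),
            "wordCount": len(page.get("words", [])),
            "lineCount": len(page.get("lines", [])),
            "text": page_text(result["content"], page),
        })
    return summary


for entry in summarize_pages(sample_result):
    print(entry)
```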

    For more information about the fields in the output of the Azure Document Intelligence pre-built Layout model, you can refer to the official documentation provided by Microsoft:

    How-to: Migrate Document Intelligence (formerly Form Recognizer) applications to v3.1. - Azure AI services | Microsoft Learn

    Document layout analysis - Document Intelligence (formerly Form Recognizer) - Azure AI services | Microsoft Learn

    I hope this information helps!

    If this answers your query, please click Accept Answer and select Yes for "Was this answer helpful". If you have any further questions, do let us know.


1 additional answer

  1. dupammi 7,955 Reputation points Microsoft Vendor
    2023-10-06T08:31:48.8033333+00:00

    Hi @Gavin Gao ,

    I'm following up to check whether my answer above helped. Do let us know if you have any further questions.

    To reiterate the resolution, here is the gist of that answer.

    A span locates an element's text inside the top-level content string of the Layout response.

    Each span has two inner fields: offset is the zero-based index in content where the element's text starts, and length is the number of characters it covers, both counted in the unit reported by stringIndexType. Slicing content from offset to offset plus length returns the exact text of the word, line, paragraph, or table.
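    One practical detail: the unit of offset and length depends on the stringIndexType reported in the response. Python strings are indexed by Unicode code point, so if a response reports utf16CodeUnit, a direct slice can be off for text that contains characters outside the Basic Multilingual Plane (many emoji, for example). The sketch below shows one way to handle that case. slice_utf16 is a hypothetical helper and the sample string is made up; check the stringIndexType value in your own response before applying it.

```python
def slice_utf16(content, offset, length):
    """Slice content using offsets counted in UTF-16 code units.

    Each UTF-16 code unit is 2 bytes in the "utf-16-le" encoding, so the
    code-unit offsets are doubled to get byte positions.
    """
    encoded = content.encode("utf-16-le")
    return encoded[2 * offset:2 * (offset + length)].decode("utf-16-le")


# The emoji occupies 2 UTF-16 code units but only 1 Python index position.
sample = "\U0001F600 ok"
utf16_span = {"offset": 3, "length": 2}   # points at "ok" in UTF-16 units

naive = sample[utf16_span["offset"]:utf16_span["offset"] + utf16_span["length"]]
fixed = slice_utf16(sample, utf16_span["offset"], utf16_span["length"])
print(repr(naive), repr(fixed))
```

    For responses that report unicodeCodePoint, ordinary Python slicing already matches the offsets.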

    For deeper insight into these terms, please refer to the Azure documentation links mentioned in the first answer of this thread.

    Please 'Accept as answer' and 'Upvote' if it helped, so that it can help others in the community looking for help on similar topics.