How to label dynamic tables in Azure AI Document Intelligence (for clinical research use case)

Matt Koch 0 Reputation points
2025-11-30T20:11:37.8+00:00

I'm training a Custom Extraction model in Azure Document Intelligence to extract Schedule-of-Assessments (SoA) tables from clinical trial protocols. These tables are highly dynamic:

  • Variable timepoints (0h, 1.5h, Day −1, Visit 3, etc.), i.e. the times or days on which visits occur and the procedures are performed
  • Multi-row or merged headers in some tables
  • Irregular column counts
  • Cells containing “X”, continuous monitoring notes, or blank values

I'm trying to work out the correct labeling strategy for training a custom model on this type of extraction. The prebuilt models produce subpar extractions that need a lot of manual cleanup, which makes them impractical for any meaningful downstream use.

The ideal outcome is a reasonably good extraction that shows which procedures occur during the trial and when. We want to be able to ask questions like "When will the patient undergo an EKG?"

Desired Output (simplified):


{
  "procedure": "Physical exam",
  "columns": [
    { "header": "Screening", "value": "X" },
    { "header": "0h", "value": "X" },
    { "header": "1.5h", "value": "" }
  ]
}

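To make the target shape concrete, here is a minimal sketch of how a flat list of table cells could be reshaped into one record per procedure. It is not tied to any SDK: the cell keys (`row_index`, `column_index`, `content`, `column_span`) are assumptions chosen to resemble a typical layout-style table result, and `reshape_soa`, `timepoints_for`, and the sample cells are hypothetical names and data used only for illustration. The number of header rows is supplied by the caller, and header cells that span several rows are not handled.

```python
from collections import defaultdict


def reshape_soa(cells, header_rows=1):
    """Reshape flat table cells into one record per procedure row.

    Assumed cell shape (hypothetical): a dict with "row_index",
    "column_index", "content", and an optional "column_span".
    """
    header_parts = defaultdict(list)  # column -> header text, top to bottom
    body = defaultdict(dict)          # row -> {column: cell text}

    # Sort so stacked header text is joined in top-to-bottom order.
    for cell in sorted(cells, key=lambda c: (c["row_index"], c["column_index"])):
        row, col = cell["row_index"], cell["column_index"]
        text = cell.get("content", "").strip()
        if row < header_rows:
            # A merged header cell applies to every column it spans.
            for spanned in range(col, col + cell.get("column_span", 1)):
                if text:
                    header_parts[spanned].append(text)
        else:
            body[row][col] = text

    headers = {col: " / ".join(parts) for col, parts in header_parts.items()}

    records = []
    for row in sorted(body):
        values = body[row]
        records.append({
            # Column 0 is assumed to hold the procedure name.
            "procedure": values.get(0, ""),
            "columns": [
                {"header": headers[col], "value": values.get(col, "")}
                for col in sorted(headers)
                if col != 0
            ],
        })
    return records


def timepoints_for(records, procedure):
    """Return the headers of all non-empty cells for the named procedure."""
    wanted = procedure.lower()
    return [
        col["header"]
        for rec in records
        if rec["procedure"].lower() == wanted
        for col in rec["columns"]
        if col["value"]
    ]


# Made-up sample: two header rows, with "Treatment" merged over two columns.
cells = [
    {"row_index": 0, "column_index": 0, "content": "Procedure"},
    {"row_index": 0, "column_index": 1, "content": "Screening"},
    {"row_index": 0, "column_index": 2, "content": "Treatment", "column_span": 2},
    {"row_index": 1, "column_index": 2, "content": "0h"},
    {"row_index": 1, "column_index": 3, "content": "1.5h"},
    {"row_index": 2, "column_index": 0, "content": "Physical exam"},
    {"row_index": 2, "column_index": 1, "content": "X"},
    {"row_index": 2, "column_index": 2, "content": "X"},
    {"row_index": 2, "column_index": 3, "content": ""},
]

records = reshape_soa(cells, header_rows=2)
print(timepoints_for(records, "physical exam"))
```

Joining stacked header text with " / " keeps the parent group (for example a study period) attached to each timepoint, so a question like "when does this procedure happen?" becomes a simple filter over non-empty cells.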

Has anyone successfully trained a custom model to extract these types of tables, or other scientific tables with dynamic headers? Any tips or examples would be appreciated. I'm attaching a sample protocol PDF (all data here is publicly available on clinicaltrials.gov).

You can see an example of the table(s) containing this data starting on page 19 (section 1.3) -- Pfizer-1.pdf

Azure AI Document Intelligence