How to parse XML files with optional fields in ADF?

Ridham 21 Reputation points
2022-09-26T07:36:46.353+00:00

I have a XML file where we need flattening. The problem is some fields in the XML are optional(as per XSD). The field might be present or might not be. So fixed mapping based on XML is not working. We need mapping based on XSD so every scenario is covered.

For example:-

In this XML file, we have <Audit Record> inside ItemData, but there maybe a possibility that <Audit Record> might be present directly under <ItemGroupData> and <ItemData> is not there at all. Or some other field is present.

<ItemGroupData ItemGroupOID="ABC" mdsol:RecordId="123">
<ItemData ItemOID="DEF" TransactionType="Upsert">
<AuditRecord>
<UserRef UserOID="1233"/>
<LocationRef LocationOID="12345" mdsol:StudyEnvSiteNumber="12344"/>
<DateTimeStamp>2015-09-28T14:21:21</DateTimeStamp>
<ReasonForChange/>
<SourceID>00000</SourceID>
</AuditRecord>
</ItemData>
</ItemGroupData>

Is there any way we can do dynamic mapping based on XSDs?

Azure Synapse Analytics
Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
Azure Data Factory
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.
0 comments No comments
{count} votes

Answer accepted by question author
  1. MartinJaffer-MSFT 26,161 Reputation points
    2022-09-27T20:35:45.007+00:00

    Hello @Ridham and welcome to Microsoft Q&A.

    As I understand there are two asks here:

    • How to flatten and get the optional fields which might not always appear in the result of "import schema"
    • Use the XSD to inform the schema mapping

    I haven't worked with XSD before. From what I am learning, it basically IS the schema definition. In both Copy activity and Data flow, there are options to validate the XML data against your choice of XSD or DTD. Given this validation feature, it is made clear we do somehow read the XSD. It would make sense to have a feature where we import schema from the validation ... but since you are asking this question, I'm guessing it is not being used. Unless "Allow Schema Drift" causes "import schema" to ignore the XSD in favor of looking at the data instead.

    Anyway, to the useful information. Use of rule-based mapping in flatten transformation allows to catch optional fields and fields not found in the schema.

    245247-image.png

    Here is my working example:

    Input Data:

    <?xml version='1.0'?>  
    <root>  
        <items>  
            <id>1</id>  
            <name>first</name>  
        </items>  
        <items>  
            <id>2</id>  
            <name>second</name>  
        </items>  
        <items>  
            <id>3</id>  
            <name>optional</name>  
            <additional>true</additional>  
        </items>  
        <items>  
            <id>4</id>  
            <name>fourth</name>  
        </items>  
    </root>  
    

    Data Flow script. Note that the second line where the source schema is defined, field "additional" is missing.

    source(output(  
    		root as (items as (id as short, name as string)[])  
    	),  
    	allowSchemaDrift: true,  
    	validateSchema: false,  
    	ignoreNoFilesFound: false,  
    	format: 'xml',  
    	fileSystem: 'data',  
    	folderPath: 'input',  
    	fileName: 'additional.xml',  
    	validationMode: 'none',  
    	namespaces: false) ~> source1  
    source1 foldDown(unroll(root.items),  
    	mapColumn(  
    		every(match(true()))  
    	),  
    	skipDuplicateMapInputs: false,  
    	skipDuplicateMapOutputs: false) ~> flatten1  
    flatten1 sink(allowSchemaDrift: true,  
    	validateSchema: false,  
    	format: 'delimited',  
    	fileSystem: 'data',  
    	folderPath: 'out',  
    	columnDelimiter: ',',  
    	escapeChar: '\\',  
    	quoteChar: '\"',  
    	columnNamesAsHeader: true,  
    	umask: 0022,  
    	preCommands: [],  
    	postCommands: [],  
    	skipDuplicateMapInputs: true,  
    	skipDuplicateMapOutputs: true) ~> sink1  
    
    0 comments No comments

0 additional answers

Sort by: Most helpful

Your answer

Answers can be marked as 'Accepted' by the question author and 'Recommended' by moderators, which helps users know the answer solved the author's problem.