Issue with ADF Data Flow, xml source and optional nodes or attributes

Michael Payne 121 Reputation points
2020-09-10T08:35:37.18+00:00

Hi,

I need some assistance with ADF Data Flow using an xml data source. We have a data lake full of xml which I need to process. We do have a schema for the xml, but the xml isn't validated against it. We have some optional attributes and nodes in the xml. I have created a Data Flow and imported the schema from the Data Source Projection Tab. In my mappings I map all the fields manually through to the Sink.

Everything works fine if all the attributes or nodes are present, however if the attributes are not in one of the xml files, then everything falls over. I have seen a few videos of using schema-less Data Flow and use late binding to map everything using rule based mapping or pattern based mapping, but nothing is working with xml and I need a little help with what to do and syntax so that I can pull out all the nodes and attributes from the xml. Some of the nodes are repeated so they become arrays in the Data Flow which need to be flattened also.

So if I have the 2 xml files below how can I do it schema-less using rule based mappings and pattern matching? Note that at the entrant level teamname and result are optional and in the second example they are not present.

<data xmlns="http://www.mycompany.co.uk/feed/master/cg" xmlns:cg="http://www.mycompany.co.uk/feed/master/cg" category="AA" country="XX" date="2020-07-14" id="1234" name="test" type="master" version="1.0.0">
<game code="AB001" date="2020-07-14" name="Test Gane" sportcode="AB" version="1">
<event date="2020-07-14" id="AB00020" inplay="yes" numberofperiods="4" progresscode="F" result="86-89" starttime="18:19" status="O" title="Player1 v Player2">
<periods>
<period actualendtime="2020-07-14T17:26:02Z" actualstarttime="2020-07-14T17:20:14Z" number="1" result="14-19" status="Finished" type="quarter"/>
<period actualendtime="2020-07-14T17:33:39Z" actualstarttime="2020-07-14T17:27:02Z" number="2" result="18-24" status="Finished" type="quarter"/>
<period actualendtime="2020-07-14T17:40:25Z" actualstarttime="2020-07-14T17:34:33Z" number="3" result="26-22" status="Finished" type="quarter"/>
<period actualendtime="2020-07-14T17:47:38Z" actualstarttime="2020-07-14T17:40:27Z" number="4" result="28-24" status="Finished" type="quarter"/>
</periods>
<entrants>
<entrant avatar="TeamA" id="1" isteam="no" number="1" score="86" teamname="Team1" result="Loser">
<player firstname="Fred" form="L-L-W-L-L" gamerhandle="Gamer1" id="9" rank="9" surname="Bloggs"/>
</entrant>
<entrant avatar="TeamB" id="2" isteam="no" number="2" score="89" teamname="Team2" result="winner">
<player firstname="John" form="L-L-L-L-L" gamerhandle="Gamer2" id="14" rank="14" surname="Doe"/>
</entrant>
</entrants>
</event>
</game>
</data>

<data xmlns="http://www.mycompany.co.uk/feed/master/cg" xmlns:cg="http://www.mycompany.co.uk/feed/master/cg" category="AA" country="XX" date="2020-07-15" id="1235" name="test" type="master" version="1.0.0">
<game code="AB002" date="2020-07-14" name="Test Gane" sportcode="AB" version="1">
<event date="2020-07-15" id="AB00021" inplay="yes" numberofperiods="4" progresscode="O" result="86-89" starttime="18:19" status="O" title="Player1 v Player2">
<periods>
<period actualendtime="2020-07-15T17:26:02Z" actualstarttime="2020-07-15T17:20:14Z" number="1" result="14-19" status="Finished" type="quarter"/>
<period actualendtime="2020-07-15T17:33:39Z" actualstarttime="2020-07-15T17:27:02Z" number="2" result="18-24" status="Finished" type="quarter"/>
<period actualendtime="2020-07-15T17:40:25Z" actualstarttime="2020-07-15T17:34:33Z" number="3" result="26-22" status="Finished" type="quarter"/>
<period actualendtime="2020-07-15T17:47:38Z" actualstarttime="2020-07-15T17:40:27Z" number="4" result="28-24" status="Finished" type="quarter"/>
</periods>
<entrants>
<entrant avatar="TeamA" id="1" isteam="no" number="1" score="90">
<player firstname="Fred" gamerhandle="Gamer1" id="9" rank="9" surname="Bloggs"/>
</entrant>
<entrant avatar="TeamB" id="2" isteam="no" number="2" score="88">
<player firstname="John" gamerhandle="Gamer2" id="14" rank="14" surname="Doe"/>
</entrant>
</entrants>,
</event>
</game>
</data>

If I add the source xml without schema, and then preview the data I see things like data{} and below that @category, @country, @apl , etc but if I try and use a Select or Derived Column Mapping to map the attribute it doesn't pick it up and map it, and the same goes for @teamname and @result. I have tried things like toString(byName("@teamname")) in a column mapping with and without the '@' sign but I can't pick anything out of the xml.

Azure Data Factory
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.
11,623 questions
{count} votes

Accepted answer
  1. HimanshuSinha-msft 19,486 Reputation points Microsoft Employee Moderator
    2020-09-17T18:30:28.063+00:00

    Hello ,

    We got some help from the internal team and this is what they have to say .

    The suggested practice is to have a sample xml to include all possible attributes and elements in all levels, and then import projection using this sample file in dataflow source. Then you and design your transformations with the schema.
    Regarding to your question on how the select property like '@category'. The data{} you saw in preview means there is a single column named 'data' for your xml files, and this column has a complex type with sub properties, while '@category' is one its sub property, that's the reason why you can't use the byName('@category') to find it. Instead, you can have something shown below:
    It means: for every sub property of the "data" column, it the name includes '@', then add a new column with name by appending '_new' to the property name, while the value is string converted from the value of the sub property.
    25606-xmlissue.png

    Please do let us know if this info helps , do let us know if you have any queries .

    Thanks
    Himanshu


0 additional answers

Sort by: Most helpful

Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.