How do I select only the first element of a xml?

Jose Martinez 0 Reputation points
2023-12-08T08:45:56.8933333+00:00

Dear all,

I am trying to obtain historic data from the ECB exchange rate site. They publish the last 90 days of data in the form of xml, like this:

90 days ecb fx

However, I just need the latest day from there. I am downloading the whole xml data and with a dataflow I intend to get only the latest day and transform it to have a JSON file at the end. I have the following dataflow:

just first element

I have a flat activity to get the dates and their associated currencies and rates:

User's image

But I just need the entries for the latest day:

User's image

Can I add a previous step to the Flat activity? I also thoght to filter the rows by taking those where date== max(date), but I cant use max function in a filter activity. When trying to use max in an aggregation it does not work either. I would really like to know how to get just a particular element of the xml before the flattening.

Thanks!

Azure Data Factory
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.
11,134 questions
{count} votes

2 answers

Sort by: Most helpful
  1. Amira Bedhiafi 27,681 Reputation points
    2023-12-08T23:35:02.8666667+00:00

    I would go for Python instead. ADF doesn't have direct XML processing capabilities, but you can integrate it with Azure Functions or Data Lake Analytics to process XML files.

    Star using an XML parser like ElementTree in Python to parse the XML file. You'll want to navigate through the XML tree and find the first 'Cube' element with a 'time' attribute (which represents the latest day).

    Once you have the first 'Cube' element, you can extract its child elements and their attributes (currencies and rates).

    import xml.etree.ElementTree as ET
    import json
    
    
    xml_data = '''<gesmes:Envelope xmlns:gesmes="http://www.gesmes.org/xml/2002-08-01" ... </gesmes:Envelope>'''
    
    
    root = ET.fromstring(xml_data)
    
    
    for cube in root.findall('.//{http://www.ecb.int/vocabulary/2002-08-01/eurofxref}Cube[@time]'):
        latest_cube = cube
        break
    
    
    currencies = []
    for currency in latest_cube:
        currency_data = {
            'currency': currency.get('currency'),
            'rate': currency.get('rate')
        }
        currencies.append(currency_data)
    
    
    json_data = json.dumps(currencies, indent=4)
    
    print(json_data)
    
    
    1 person found this answer helpful.

  2. KranthiPakala-MSFT 46,602 Reputation points Microsoft Employee
    2023-12-14T22:37:59.71+00:00

    @Jose Martinez Welcome to Microsoft Q&A forum and thanks for reaching out here.

    You can also achieve this in ADF Mapping data flow. You were very close to the implementation.

    After flattening the XML data, you will have to use the Aggregate transformation to find the maximum date value in the date column as shown below.

    User's image

    Under Aggregates section configure as below to find the maxDate value. User's image

    Then use a Join transformation to join the FetchhXML output with the output of the Aggregate transformation on the date column.

    Next, in the Join transformation settings, select the "Inner" join type and select the date column as the join key.

    User's image

    This join will output only the records that have the maxDate values.

    Then have a sink transformation and configure as needed and in this transformation, you can select only the column that you would like to copy to your destination.

    Hope this helps. Let me know if you have any questions.


    Please don’t forget to Accept Answer and Yes for "was this answer helpful" wherever the information provided helps you, this can be beneficial to other community members.

    0 comments No comments

Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.