How can I convert XML files to Parquet without losing any data? In my scenario the XML file has a complex nested format and I am unable to capture all fields.

Moka Rupesh 0 Reputation points
2024-04-22T08:47:27.5233333+00:00


Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

1 answer

  1. Amira Bedhiafi 20,176 Reputation points
    2024-04-22T14:36:43.83+00:00

    I found this project: https://github.com/blackrock/xml_to_parquet

    It converts XML files into Apache Parquet format using just an XSD schema and the XML file. The tool uses the XSD to map all content from the XML into a corresponding Parquet file, preserving nested data structures that mirror the XML paths.

    Example: converting a small XML file to a Parquet file:
    python xml_to_parquet.py -x PurchaseOrder.xsd PurchaseOrder.xml
    
    INFO - 2021-01-21 12:32:38 - Parsing XML Files..
    INFO - 2021-01-21 12:32:38 - Processing 1 files
    DEBUG - 2021-01-21 12:32:38 - Generating schema from PurchaseOrder.xsd
    DEBUG - 2021-01-21 12:32:38 - Parsing PurchaseOrder.xml
    DEBUG - 2021-01-21 12:32:38 - Saving to file PurchaseOrder.xml.parquet
    DEBUG - 2021-01-21 12:32:38 - Completed PurchaseOrder.xml
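
If no XSD is available, the idea of columns that replicate the XML paths can be approximated by hand. The following is a minimal sketch, not how xml_to_parquet works internally. The names `flatten`, `xml_to_rows`, `write_parquet` and the sample document are hypothetical. It assumes each record lives under a repeating element (here `order`), stores every column as a list of strings so repeated elements are not overwritten, and ignores namespaces and text that follows a child element. Writing the file assumes the third-party `pyarrow` package is installed.

```python
import xml.etree.ElementTree as ET


def flatten(elem, path, row):
    """Collect attributes and text of elem and its descendants into row, keyed by XML path."""
    # Attributes become columns named path/@attr so they are not dropped.
    for name, value in elem.attrib.items():
        row.setdefault(f"{path}/@{name}", []).append(value)
    text = (elem.text or "").strip()
    if text:
        row.setdefault(path, []).append(text)
    # Repeated children share a path, so their values accumulate in one list.
    for child in elem:
        flatten(child, f"{path}/{child.tag}", row)
    return row


def xml_to_rows(xml_text, row_tag):
    """Return one flattened dict per element named row_tag."""
    root = ET.fromstring(xml_text)
    return [flatten(elem, row_tag, {}) for elem in root.iter(row_tag)]


def write_parquet(rows, out_path):
    """Write the rows to Parquet; every column is a list of strings (null where absent)."""
    # Imported here so the flattening code above runs without pyarrow installed.
    import pyarrow as pa
    import pyarrow.parquet as pq

    columns = sorted({key for row in rows for key in row})
    table = pa.table({col: [row.get(col) for row in rows] for col in columns})
    pq.write_table(table, out_path)


# Hypothetical sample document, used only for illustration.
SAMPLE = """<orders>
  <order id="1">
    <customer><name>Ann</name></customer>
    <item sku="A1">Pen</item>
    <item sku="B2">Ink</item>
  </order>
  <order id="2">
    <customer><name>Bob</name></customer>
    <item sku="C3">Pad</item>
  </order>
</orders>"""

rows = xml_to_rows(SAMPLE, "order")
```

Calling `write_parquet(rows, "orders.parquet")` would then produce one Parquet row per `order`. Because values are grouped by path, the link between an `item` and its own `sku` is kept only by list position, so deeply nested or irregular files are usually better served by an XSD-driven tool like the one above.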
    

    More links:

    https://stackoverflow.com/questions/36289548/is-there-a-way-to-create-parquet-file-from-xml-json-input-file-without-avsc-fil

    https://medium.com/@sonradata/convert-xml-with-spark-to-parquet-1c5ba561b193
