How can I convert XML files to Parquet without losing any data? In my scenario the XML file has a complex nested structure, and I am unable to capture all of the fields.

Moka Rupesh 0 Reputation points


Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.
9,727 questions

1 answer

  1. Amira Bedhiafi 16,146 Reputation points

    I found this project:

    Convert XML files into Apache Parquet format using just an XSD schema and XML file. This process uses the XSD to transform all content from the XML into a corresponding Parquet file, maintaining nested data structures that replicate the XML paths.

    Convert a small XML file to a Parquet file:
    python -x PurchaseOrder.xsd PurchaseOrder.xml
    INFO - 2021-01-21 12:32:38 - Parsing XML Files..
    INFO - 2021-01-21 12:32:38 - Processing 1 files
    DEBUG - 2021-01-21 12:32:38 - Generating schema from PurchaseOrder.xsd
    DEBUG - 2021-01-21 12:32:38 - Parsing PurchaseOrder.xml
    DEBUG - 2021-01-21 12:32:38 - Saving to file PurchaseOrder.xml.parquet
    DEBUG - 2021-01-21 12:32:38 - Completed PurchaseOrder.xml
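If it helps to see the general idea, here is a minimal sketch of the nested-structure approach in plain Python. It is not the code of the project above. The function names, the `repeated` parameter, and the sample XML are made up for illustration. The `repeated` set stands in for what an XSD would tell you (which elements may occur more than once), so that those elements always become lists and every record ends up with the same shape. Attributes are kept under `@name` keys and mixed text under `#text`, so nothing is dropped. All values stay strings; type conversion is left out.

```python
import xml.etree.ElementTree as ET


def element_to_record(elem, repeated=frozenset()):
    """Turn an XML element into a nested dict that keeps attributes,
    text, and child elements, so no field is silently dropped."""
    record = {}
    # Attributes go under "@name" keys so they cannot clash with child tags.
    for name, value in elem.attrib.items():
        record["@" + name] = value
    for child in elem:
        value = element_to_record(child, repeated)
        if child.tag in repeated:
            # Declared as repeatable (hypothetical stand-in for XSD maxOccurs > 1).
            record.setdefault(child.tag, []).append(value)
        elif child.tag in record:
            # Undeclared repeat: promote the existing value to a list.
            existing = record[child.tag]
            if not isinstance(existing, list):
                record[child.tag] = [existing]
            record[child.tag].append(value)
        else:
            record[child.tag] = value
    text = (elem.text or "").strip()
    if not record:
        # Leaf element without attributes: return its text, or None if empty.
        return text or None
    if text:
        record["#text"] = text
    return record


def write_parquet(records, path):
    """Write a list of nested dicts to Parquet.
    Assumes the third-party pyarrow package is installed."""
    import pyarrow as pa
    import pyarrow.parquet as pq

    pq.write_table(pa.Table.from_pylist(records), path)


# Made-up sample document, only for illustration.
SAMPLE = """
<PurchaseOrder OrderDate="2021-01-21">
  <ShipTo country="US"><Name>Alice</Name></ShipTo>
  <Items>
    <Item PartNumber="A1"><Quantity>2</Quantity></Item>
  </Items>
</PurchaseOrder>
""".strip()

record = element_to_record(ET.fromstring(SAMPLE), repeated={"Item"})
```

Calling `write_parquet([record], "PurchaseOrder.parquet")` would then store the nested dicts as struct and list columns. Schema inference in pyarrow can still fail when records disagree on shape or an element is always empty, which is why a schema derived from the XSD is the more robust route for complex files.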

