Importing parquet files with SSIS

Caristina, Jose 1 Reputation point
2021-09-03T09:20:08.807+00:00

Hi all, I need to import a parquet file with SSIS. I have read several forums and it seems that it is not possible. I am not looking for Azure-related answers, since I already know it is possible with ADF.
Does anyone know how to do this with SSIS? I am even interested in programmatic ways (C# or .NET) through SSIS.
Any ideas will be much appreciated. Thank you all!

SQL Server Integration Services
A Microsoft platform for building enterprise-level data integration and data transformations solutions.

5 answers

  1. Igor Gelin 21 Reputation points
    2021-09-03T20:47:38.777+00:00

    You can use an SSIS Script Task to process a parquet file.
    The link below has an example of C# code that converts a parquet file to CSV.

    https://stackoverflow.com/questions/62094616/how-to-convert-parquet-file-to-csv-using-net-core


  2. ZoeHui-MSFT 37,746 Reputation points
    2021-09-06T06:00:13.107+00:00

    Hi anonymous user,

    I did not find a good way to import parquet files in SSIS without ADF.

    You may refer to the link Igor Gelin provided to see if it is useful.

    Regards,

    Zoe




  3. Caristina, Jose 1 Reputation point
    2021-09-06T08:39:43.093+00:00

    Thank you very much @Igor Gelin and @ZoeHui-MSFT for your kind replies. I will try to follow @Igor Gelin's guidance and get back to this conversation with an answer. If anyone else has insights on this matter, please continue providing answers. Thank you both again!


  4. Caristina, Jose 1 Reputation point
    2021-09-07T10:27:54.283+00:00

    Hi @Igor Gelin, I went through the suggested documentation, and it is about converting parquet files into CSV files using the Cinchoo ETL library. I read Cinchoo ETL's documentation and it doesn't seem to work with SQL; it converts JSON to CSV or Parquet to CSV. I need a way of loading Parquet files into SQL Server tables through SSIS. I apologize if I am misunderstanding how to use the Cinchoo ETL framework.
    Thanks again!


  5. Ian Posner 1 Reputation point
    2021-09-07T16:22:25.333+00:00

    The reason ADF supports Parquet is that the engine is based upon Spark, which uses Parquet as its intermediate storage format. It does so because Parquet supports partitioning and is designed for use on HDFS, which distributes 256 MB blocks of data to different processing nodes for parallel processing. Since these 256 MB blocks hold compressed data, the underlying raw size of this data is likely to be 1-2.5 GB per block.

    Therefore you should ask yourself whether the raw data you hold in Parquet files is large enough to justify the Parquet format.

    If the parquet files are not several multiples of 256MB in size, then it is likely that the file format is inappropriate for the volume of data. In this case, consider converting the data to a supported format before using SSIS. As a rule, SSIS can usually process 50,000-100,000 rows per second for a single non-blocking dataflow with a startup time of 2-3 seconds. So you should be able to estimate how long an SSIS package should take to process the number of rows you have per file.
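
    The estimate described above can be sketched as simple arithmetic. This is only a rough illustration that takes the figures quoted in this answer (50,000-100,000 rows per second, 2-3 seconds of startup) at face value; the function names and the 10,000,000-row file are hypothetical, and real throughput depends on row width, transformations, and the destination.

```python
def estimate_ssis_seconds(row_count, rows_per_second, startup_seconds):
    """Estimated wall-clock seconds for one non-blocking data flow."""
    if rows_per_second <= 0:
        raise ValueError("rows_per_second must be positive")
    return startup_seconds + row_count / rows_per_second


def estimate_range(row_count):
    """Return (best_case, worst_case) seconds for the given row count."""
    # Assumed figures, taken from the answer above:
    # 50,000-100,000 rows/s throughput and 2-3 s startup.
    best = estimate_ssis_seconds(row_count, 100_000, 2.0)
    worst = estimate_ssis_seconds(row_count, 50_000, 3.0)
    return best, worst


if __name__ == "__main__":
    rows = 10_000_000  # hypothetical number of rows in one file
    best, worst = estimate_range(rows)
    print(f"{rows:,} rows: roughly {best:.0f}-{worst:.0f} seconds")
```

    Under these assumptions, a file of 10,000,000 rows would take somewhere between about 102 and 203 seconds, so the startup time only matters for small files.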

    Another option is either to write a custom SSIS source component or to purchase a third-party parquet file source.

    You should compare SSIS with ADF, which may take between 30 and 60 seconds to start up and is really suited to files of 1 GB or more, processing large parquet files in parallel.

