Output SQL Server data into Parquet files

Srikanth Sudharma 11 Reputation points
2022-01-25T05:15:48.447+00:00

Hello,

My requirements need me to export data from a SQL Server (on-prem) database to an external source. Currently, the need is to create Parquet files of this data for consumption. My question is:
Is this possible using PolyBase / external file format / OPENROWSET? If yes, are there articles indicating how? I was reading about PolyBase and external files, but most of these examples indicate reading data from external files/data sources, not writing to them.

Thanks
Sri

SQL Server | Other

4 answers

  1. YufeiShao-msft 7,156 Reputation points
    2022-01-25T06:33:51.913+00:00

    Hi @Srikanth Sudharma,

    Azure Data Factory is a good way to do this if you have it available; see One Way to Create a Parquet File from SQL Server Data.

    Or try to use Spark.
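
    A minimal sketch of the Spark route, assuming PySpark with the Microsoft SQL Server JDBC driver available on Spark's classpath; the server, database, table, credentials, and output path below are placeholders:

    # PySpark sketch: read a SQL Server table over JDBC, write it as Parquet
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("sqlserver-to-parquet")
             .getOrCreate())

    df = (spark.read.format("jdbc")
          .option("url", "jdbc:sqlserver://myserver:1433;databaseName=mydatabase")
          .option("dbtable", "dbo.mytable")
          .option("user", "myuser")
          .option("password", "mypassword")
          .load())

    # each Spark partition becomes one Parquet file under the target directory
    df.write.mode("overwrite").parquet("/out/mytable")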

    -------------

    If the answer is the right solution, please click "Accept Answer" and kindly upvote it. If you have extra questions about this answer, please click "Comment".

    Note: Please follow the steps in our documentation to enable e-mail notifications if you want to receive the related email notification for this thread.

    1 person found this answer helpful.

  2. rferraton 0 Reputation points
    2026-03-22T01:05:38.1533333+00:00

    The simplest way to export from SQL Server to local Parquet files is with FastBCP.

    It is really fast.
    It will export your table in chunks (faster for export and for re-import).
    * Prefer to use the first column of a clustered index as the distributekeycolumn (if the column has enough distinct values, of course).
    * If your table ID is an identity column, switch from the Ntile method to RangeId (pass --parallelmethod "RangeId" instead of "Ntile" in the commands below).

    .\FastBCP.exe `
      --connectiontype "mssql" `
      --server "myserver,1433" `
      --trusted `
      --database "mydatabase" `
      --query "SELECT * FROM dbo.mytable where mycond = true" `
      --fileoutput "myfile.parquet" `
      --directory "d:\out\{sourceschema}\{sourcetable}" `
      --parallelmethod "Ntile" `
      --distributekeycolumn "myid" `
      --merge false `
      --license "C:\MyFreeTrialLicense.lic"
    

    If you prefer to generate one file per month, switch to the Timepartition method:

    .\FastBCP.exe `
      --connectiontype "mssql" `
      --server "myserver,1433" `
      --trusted `
      --database "mydatabase" `
      --query "SELECT * FROM dbo.mytable where mycond = true" `
      --fileoutput "myfile.parquet" `
      --directory "d:\out\{sourceschema}\{sourcetable}" `
      --parallelmethod "Timepartition" `
      --distributekeycolumn "(mydatecolumn,year,month)" `
      --merge false `
      --license "C:\MyFreeTrialLicense.lic"
    

    For small tables:

    .\FastBCP.exe `
      --connectiontype "mssql" `
      --server "myserver,1433" `
      --trusted `
      --database "mydatabase" `
      --sourceschema "dbo" `
      --sourcetable "mysmalltable" `
      --fileoutput "{sourcetable}.parquet" `
      --directory "d:\out\{sourceschema}\{sourcetable}" `
      --parallelmethod "None" `
      --merge false `
      --license "C:\MyFreeTrialLicense.lic"
    
    

    Note: it also works on Linux.


  3. AntoineGiraud 0 Reputation points
    2024-09-06T17:44:30.6133333+00:00

    Another Pythonic option 👨‍💻😎

    With pandas, pymssql (via SQLAlchemy) & DuckDB 🦆

    import os
    import yaml
    from time import time
    from sqlalchemy import create_engine
    import pandas as pd
    import duckdb
    
    config = yaml.safe_load(open("conf.yml"))
    db = config['db']
    
    engine = create_engine(f"mssql+pymssql://{db['user']}:{db['password']}@{db['host']}:{db['port']}/{db['dbname']}")
    con = engine.connect().execution_options(stream_results=True)
    
    os.makedirs('output', exist_ok=True)  # make sure the target folder exists
    
    def get_tables_in_schema(schema):
        """ fetch the table list in the given schema of our mssql db """
        sql = f"""select tab.name as [table]
                  from sys.tables tab
                  where schema_name(tab.schema_id) = '{schema}'"""
        df = pd.read_sql(sql, engine)
        return df['table'].values.tolist()
    
    def export_table_to_parquet(schema, table):
        """ export data from the given table to parquet, in chunks of 1M rows """
        time_step = time()
        print("Let's export", table)
        sql = f"SELECT * FROM {schema}.{table}"
        lines = 0
        i = 0  # so the summary below works even if the table is empty
        for i, df in enumerate(pd.read_sql(sql, con, chunksize=1000000)):
            t_step = time()
            # one file per 1M-row chunk: table.parquet, table_1m.parquet, ...
            file_name = table + ('' if i == 0 else f'_{i}m')
            # DuckDB sees the local DataFrame df and writes it out as Parquet
            duckdb.sql(f"copy df to 'output/{file_name}.parquet' (format parquet)")
            lines += df.shape[0]
            print('  ', file_name, df.shape[0], f'lines ({round(time() - t_step, 2)}s)')
        print("  ", lines, f"lines exported{'' if i == 0 else f' in {i + 1} files'} ({round(time() - time_step, 2)}s)")
    
    schema = 'myschema'
    for table in get_tables_in_schema(schema):  # to export ALL the tables
        export_table_to_parquet(schema, table)
    
    # export_table_to_parquet(schema, 'stations')
    # export_table_to_parquet(schema, 'rentals')
    

  4. Fritz 0 Reputation points
    2024-07-02T10:27:43.1366667+00:00

    You can do this with Sling (https://slingdata.io). See below.

    # set connection via env var
    export mssql='sqlserver://...'
    
    # test connection
    sling conns test mssql
    
    # run export for many tables
    sling run --src-conn mssql --src-stream 'my_schema.*' --tgt-object 'file://{stream_schema}/{stream_table}.parquet'
    
    # run export for one table
    sling run --src-conn mssql --src-stream 'my_schema.my_table' --tgt-object 'file://my_folder/my_table.parquet'
    
    # run export for custom SQL
    sling run --src-conn mssql --src-stream 'select col1, col2 from my_schema.my_table where col3 > 0' --tgt-object 'file://my_folder/my_table.parquet'
    
