Output SQL server data into Parquet files

Srikanth Sudharma 11 Reputation points
2022-01-25T05:15:48.447+00:00

Hello,

My requirements need me to export data from sql server ( on prem) db to an external source. Currently, the need is to create parquet files of this data for consumption. My question is,
Is this possible using polybase/External file format/OpenRowset ? If yes, are there articles indicating how to? I was reading about polybase and external files but most of these examples indicates reading data from external files/ data sources and not writing to them.

Thanks
Sri

SQL Server
SQL Server
A family of Microsoft relational database management and analysis systems for e-commerce, line-of-business, and data warehousing solutions.
13,947 questions
0 comments No comments
{count} votes

3 answers

Sort by: Most helpful
  1. YufeiShao-msft 7,121 Reputation points
    2022-01-25T06:33:51.913+00:00

    Hi @Srikanth Sudharma ,

    Data factory is a good way to do this if you have
    One Way to Create a Parquet File from SQL Server Data

    or try to use Spark

    -------------

    If the answer is the right solution, please click "Accept Answer" and kindly upvote it. If you have extra questions about this answer, please click "Comment".

    Note: Please follow the steps in our documentation to enable e-mail notifications if you want to receive the related email notification for this thread.

    1 person found this answer helpful.

  2. Fritz 0 Reputation points
    2024-07-02T10:27:43.1366667+00:00

    You can do this with Sling (https://slingdata.io). See below.

    # set connection via env var
    export mssql='sqlserver://...'
    
    # test connection
    sling conns test mssql
    
    # run export for many tables
    sling run --src-conn mssql --src-stream 'my_schema.*' --tgt-object 'file://{stream_schema}/{stream_table}.parquet'
    
    # run export for one table
    sling run --src-conn mssql --src-stream 'my_schema.my_table' --tgt-object 'file://my_folder/my_table.parquet'
    
    # run export for custom SQL
    sling run --src-conn mssql --src-stream 'select col1, col2 from my_schema.my_table where col3 > 0' --tgt-object 'file://my_folder/my_table.parquet'
    
    0 comments No comments

  3. AntoineGiraud 0 Reputation points
    2024-09-06T17:44:30.6133333+00:00

    An other pythonic option πŸ‘¨β€πŸ’»πŸ˜Ž

    With Pandas, PyMssql (ODBC) & DuckDB πŸ¦†

    import yaml
    from time import time
    from sqlalchemy import create_engine
    import pandas as pd
    import duckdb
    
    config = yaml.safe_load(open("conf.yml"))
    db = config['db']
    
    engine = create_engine(f"mssql+pymssql://{db['user']}:{db['password']}@{db['host']}:{db['port']}/{db['dbname']}")
    con = engine.connect().execution_options(stream_results=True)
    
    def get_tables_in_schema(schema):
        """ fetch table list in given schema of our mssql db """
        sql = f"""select tab.name as [table]
    		      from sys.tables tab
    		      where schema_name(tab.schema_id)='{schema}'"""
        df = pd.read_sql(sql, engine)
        return df['table'].values.tolist()
    
    def export_table_to_parquet(schema, table):
        """ export data from given table to parquet """
        time_step = time()
        print("Let's export", table)
        sql = f"SELECT * FROM {schema}.{table}"
        lines = 0
        for i, df in enumerate(pd.read_sql(sql, con, chunksize=1000000)):
    		# by chunk of 1M rows if needed
            t_step = time()
            file_name = table + ('' if i==0 else f'_{i}m')
            duckdb.sql(f"copy df to 'output/{file_name}.parquet' (format parquet)")
            lines += df.shape[0]
            print('  ', file_name, df.shape[0], f'lines ({round(time() - t_step, 2)}s)')
        print("  ", lines, f"lines exported {'' if i==0 else f' in {i} files'} ({round(time() - time_step, 2)}s)")
    
    schema = 'myschema'
    for table in get_tables_in_schema(schema): # pour exporter TOUTES les tables
        export_table_to_parquet(schema, table)
    
    # export_table_to_parquet(schema, 'stations')
    # export_table_to_parquet(schema, 'rentals')
    
    0 comments No comments

Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.