Output SQL server data into Parquet files

Question

Output SQL server data into Parquet files

Srikanth Sudharma 11

Hello,

My requirements need me to export data from sql server ( on prem) db to an external source. Currently, the need is to create parquet files of this data for consumption. My question is,
Is this possible using polybase/External file format/OpenRowset ? If yes, are there articles indicating how to? I was reading about polybase and external files but most of these examples indicates reading data from external files/ data sources and not writing to them.

Thanks
Sri

3 answers

Your answer

Answer 1

YufeiShao-msft 7,146

Hi @Srikanth Sudharma ,

Data factory is a good way to do this if you have
One Way to Create a Parquet File from SQL Server Data

or try to use Spark

-------------

If the answer is the right solution, please click "Accept Answer" and kindly upvote it. If you have extra questions about this answer, please click "Comment".

Note: Please follow the steps in our documentation to enable e-mail notifications if you want to receive the related email notification for this thread.

Srikanth Sudharma 11 Reputation points

2022-01-25T13:43:18.02+00:00

My current set up is entire on premise Sql server. Is the understanding then correct that using just sql server and its tools like openrowset and polybase or even SSIS, there is no way to generate a parquet file for the data in the system?

While, its possible to read from parquet files, external tables from multiple source formats, writing out isnt an option yet?
YufeiShao-msft 7,146 Reputation points

2022-02-01T08:15:55.523+00:00

At least in Azure SQL Server

Answer 2

You can do this with Sling (https://slingdata.io). See below.

# set connection via env var
export mssql='sqlserver://...'

# test connection
sling conns test mssql

# run export for many tables
sling run --src-conn mssql --src-stream 'my_schema.*' --tgt-object 'file://{stream_schema}/{stream_table}.parquet'

# run export for one table
sling run --src-conn mssql --src-stream 'my_schema.my_table' --tgt-object 'file://my_folder/my_table.parquet'

# run export for custom SQL
sling run --src-conn mssql --src-stream 'select col1, col2 from my_schema.my_table where col3 > 0' --tgt-object 'file://my_folder/my_table.parquet'

Answer 3

An other pythonic option 👨‍💻😎

With Pandas, PyMssql (ODBC) & DuckDB 🦆

import yaml
from time import time
from sqlalchemy import create_engine
import pandas as pd
import duckdb

config = yaml.safe_load(open("conf.yml"))
db = config['db']

engine = create_engine(f"mssql+pymssql://{db['user']}:{db['password']}@{db['host']}:{db['port']}/{db['dbname']}")
con = engine.connect().execution_options(stream_results=True)

def get_tables_in_schema(schema):
    """ fetch table list in given schema of our mssql db """
    sql = f"""select tab.name as [table]
		      from sys.tables tab
		      where schema_name(tab.schema_id)='{schema}'"""
    df = pd.read_sql(sql, engine)
    return df['table'].values.tolist()

def export_table_to_parquet(schema, table):
    """ export data from given table to parquet """
    time_step = time()
    print("Let's export", table)
    sql = f"SELECT * FROM {schema}.{table}"
    lines = 0
    for i, df in enumerate(pd.read_sql(sql, con, chunksize=1000000)):
		# by chunk of 1M rows if needed
        t_step = time()
        file_name = table + ('' if i==0 else f'_{i}m')
        duckdb.sql(f"copy df to 'output/{file_name}.parquet' (format parquet)")
        lines += df.shape[0]
        print('  ', file_name, df.shape[0], f'lines ({round(time() - t_step, 2)}s)')
    print("  ", lines, f"lines exported {'' if i==0 else f' in {i} files'} ({round(time() - time_step, 2)}s)")

schema = 'myschema'
for table in get_tables_in_schema(schema): # pour exporter TOUTES les tables
    export_table_to_parquet(schema, table)

# export_table_to_parquet(schema, 'stations')
# export_table_to_parquet(schema, 'rentals')

Share via

Output SQL server data into Parquet files

3 answers

Your answer