Output SQL Server data into Parquet files

Srikanth Sudharma 11 Reputation points
2022-01-25T05:15:48.447+00:00

Hello,

My requirements need me to export data from a SQL Server (on-prem) database to an external source. Currently, the need is to create Parquet files of this data for consumption. My question is:
Is this possible using PolyBase / external file format / OPENROWSET? If yes, are there articles indicating how? I was reading about PolyBase and external files, but most of these examples indicate reading data from external files/data sources, not writing to them.

Thanks
Sri

SQL Server | Other

4 answers

  1. YufeiShao-msft 7,156 Reputation points
    2022-01-25T06:33:51.913+00:00

    Hi @Srikanth Sudharma,

    Azure Data Factory is a good way to do this if you have it available; see One Way to Create a Parquet File from SQL Server Data.

    Or try to use Spark.
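
    A minimal sketch of the Spark route, assuming PySpark with the Microsoft SQL Server JDBC driver available on Spark's classpath; the server, database, table, credentials, and output path below are placeholders:

    # PySpark sketch: read a SQL Server table over JDBC, write it as Parquet
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("sqlserver-to-parquet")
             .getOrCreate())

    df = (spark.read.format("jdbc")
          .option("url", "jdbc:sqlserver://myserver:1433;databaseName=mydatabase")
          .option("dbtable", "dbo.mytable")
          .option("user", "myuser")
          .option("password", "mypassword")
          .load())

    # each Spark partition becomes one Parquet file under the target directory
    df.write.mode("overwrite").parquet("/out/mytable")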

    -------------

    If the answer is the right solution, please click "Accept Answer" and kindly upvote it. If you have extra questions about this answer, please click "Comment".

    Note: Please follow the steps in our documentation to enable e-mail notifications if you want to receive the related email notification for this thread.

    1 person found this answer helpful.

  2. rferraton 0 Reputation points
    2026-03-22T01:05:38.1533333+00:00

    The simplest way to export from SQL Server to local Parquet files is with FastBCP.

    It is really fast.
    It will export your table in chunks (faster for export and for re-import).
    * Prefer to use the first column of a clustered index as the distributekeycolumn (if the column has enough distinct values, of course).
    * If your table ID is an identity column, switch from the Ntile method to RangeId (pass --parallelmethod "RangeId" instead of "Ntile" in the commands below).

    .\FastBCP.exe `
      --connectiontype "mssql" `
      --server "myserver,1433" `
      --trusted `
      --database "mydatabase" `
      --query "SELECT * FROM dbo.mytable where mycond = true" `
      --fileoutput "myfile.parquet" `
      --directory "d:\out\{sourceschema}\{sourcetable}" `
      --parallelmethod "Ntile" `
      --distributekeycolumn "myid" `
      --merge false `
      --license "C:\MyFreeTrialLicense.lic"
    

    If you prefer to generate one file per month, switch to the Timepartition method:

    .\FastBCP.exe `
      --connectiontype "mssql" `
      --server "myserver,1433" `
      --trusted `
      --database "mydatabase" `
      --query "SELECT * FROM dbo.mytable where mycond = true" `
      --fileoutput "myfile.parquet" `
      --directory "d:\out\{sourceschema}\{sourcetable}" `
      --parallelmethod "Timepartition" `
      --distributekeycolumn "(mydatecolumn,year,month)" `
      --merge false `
      --license "C:\MyFreeTrialLicense.lic"
    

    For small tables:

    .\FastBCP.exe `
      --connectiontype "mssql" `
      --server "myserver,1433" `
      --trusted `
      --database "mydatabase" `
      --sourceschema "dbo" `
      --sourcetable "mysmalltable" `
      --fileoutput "{sourcetable}.parquet" `
      --directory "d:\out\{sourceschema}\{sourcetable}" `
      --parallelmethod "None" `
      --merge false `
      --license "C:\MyFreeTrialLicense.lic"
    
    

    Note: it also works on Linux.


  3. AntoineGiraud 0 Reputation points
    2024-09-06T17:44:30.6133333+00:00

    Another Pythonic option 👨‍💻😎

    With pandas, pymssql (via SQLAlchemy) & DuckDB 🦆

    import os
    import yaml
    from time import time
    from sqlalchemy import create_engine
    import pandas as pd
    import duckdb
    
    config = yaml.safe_load(open("conf.yml"))
    db = config['db']
    
    engine = create_engine(f"mssql+pymssql://{db['user']}:{db['password']}@{db['host']}:{db['port']}/{db['dbname']}")
    con = engine.connect().execution_options(stream_results=True)
    
    os.makedirs('output', exist_ok=True)  # make sure the target folder exists
    
    def get_tables_in_schema(schema):
        """ fetch the table list in the given schema of our mssql db """
        sql = f"""select tab.name as [table]
                  from sys.tables tab
                  where schema_name(tab.schema_id) = '{schema}'"""
        df = pd.read_sql(sql, engine)
        return df['table'].values.tolist()
    
    def export_table_to_parquet(schema, table):
        """ export data from the given table to parquet, in chunks of 1M rows """
        time_step = time()
        print("Let's export", table)
        sql = f"SELECT * FROM {schema}.{table}"
        lines = 0
        i = 0  # so the summary below works even if the table is empty
        for i, df in enumerate(pd.read_sql(sql, con, chunksize=1000000)):
            t_step = time()
            # one file per 1M-row chunk: table.parquet, table_1m.parquet, ...
            file_name = table + ('' if i == 0 else f'_{i}m')
            # DuckDB sees the local DataFrame df and writes it out as Parquet
            duckdb.sql(f"copy df to 'output/{file_name}.parquet' (format parquet)")
            lines += df.shape[0]
            print('  ', file_name, df.shape[0], f'lines ({round(time() - t_step, 2)}s)')
        print("  ", lines, f"lines exported{'' if i == 0 else f' in {i + 1} files'} ({round(time() - time_step, 2)}s)")
    
    schema = 'myschema'
    for table in get_tables_in_schema(schema):  # to export ALL the tables
        export_table_to_parquet(schema, table)
    
    # export_table_to_parquet(schema, 'stations')
    # export_table_to_parquet(schema, 'rentals')
    

  4. Fritz 0 Reputation points
    2024-07-02T10:27:43.1366667+00:00

    You can do this with Sling (https://slingdata.io). See below.

    # set connection via env var
    export mssql='sqlserver://...'
    
    # test connection
    sling conns test mssql
    
    # run export for many tables
    sling run --src-conn mssql --src-stream 'my_schema.*' --tgt-object 'file://{stream_schema}/{stream_table}.parquet'
    
    # run export for one table
    sling run --src-conn mssql --src-stream 'my_schema.my_table' --tgt-object 'file://my_folder/my_table.parquet'
    
    # run export for custom SQL
    sling run --src-conn mssql --src-stream 'select col1, col2 from my_schema.my_table where col3 > 0' --tgt-object 'file://my_folder/my_table.parquet'
    
