Events
Mar 31, 11 PM - Apr 2, 11 PM
The ultimate Microsoft Fabric, Power BI, SQL, and AI community-led event. March 31 to April 2, 2025.
Register todayThis browser is no longer supported.
Upgrade to Microsoft Edge to take advantage of the latest features, security updates, and technical support.
Traditional SMP data warehouses use an Extract, Transform, and Load (ETL) process for loading data. Azure SQL pool is a massively parallel processing (MPP) architecture that takes advantage of the scalability and flexibility of compute and storage resources. An Extract, Load, and Transform (ELT) process can take advantage of built-in distributed query processing capabilities and eliminate resources needed to transform the data before loading.
While SQL pool supports many loading methods including non-Polybase options such as BCP and SQL BulkCopy API, the fastest and most scalable way to load data is through PolyBase. PolyBase is a technology that accesses external data stored in Azure Blob storage or Azure Data Lake Store via the T-SQL language.
Extract, Load, and Transform (ELT) is a process by which data is extracted from a source system, loaded into a data warehouse, and then transformed.
The basic steps for implementing a PolyBase ELT for dedicated SQL pool are:
For a loading tutorial, see Use PolyBase to load data from Azure blob storage to Azure Synapse Analytics.
For more information, see Loading patterns blog.
Getting data out of your source system depends on the storage location. The goal is to move the data into PolyBase supported delimited text files.
PolyBase loads data from UTF-8 and UTF-16 encoded delimited text files. PolyBase also loads from the Hadoop file formats RC File, ORC, and Parquet. PolyBase can also load data from Gzip and Snappy compressed files. PolyBase currently does not support extended ASCII, fixed-width format, and nested formats such as WinZip, JSON, and XML.
If you're exporting from SQL Server, you can use bcp command-line tool to export the data into delimited text files. The Parquet to Azure Synapse Analytics data type mapping is as follows:
Parquet Data Type | SQL Data Type |
---|---|
tinyint | tinyint |
smallint | smallint |
int | int |
bigint | bigint |
boolean | bit |
double | float |
float | real |
double | money |
double | smallmoney |
string | nchar |
string | nvarchar |
string | char |
string | varchar |
binary | binary |
binary | varbinary |
timestamp | date |
timestamp | smalldatetime |
timestamp | datetime2 |
timestamp | datetime |
timestamp | time |
date | date |
decimal | decimal |
To land the data in Azure storage, you can move it to Azure Blob storage or Azure Data Lake Store. In either location, the data should be stored in text files. PolyBase can load from either location.
Tools and services you can use to move data to Azure Storage:
You might need to prepare and clean the data in your storage account before loading it into dedicated SQL pool. Data preparation can be performed while your data is in the source, as you export the data to text files, or after the data is in Azure Storage. It is easiest to work with the data as early in the process as possible.
Before you can load data, you need to define external tables in your data warehouse. PolyBase uses external tables to define and access the data in Azure Storage. An external table is similar to a database view. The external table contains the table schema and points to data that is stored outside the data warehouse.
Defining external tables involves specifying the data source, the format of the text files, and the table definitions. What follows are the T-SQL syntax topics that you'll need:
Once the external objects are defined, you need to align the rows of the text files with the external table and file format definition. The data in each row of the text file must align with the table definition. To format the text files:
It is best practice to load data into a staging table. Staging tables allow you to handle errors without interfering with the production tables. A staging table also gives you the opportunity to use SQL pool built-in distributed query processing capabilities for data transformations before inserting the data into production tables.
To load data with PolyBase, you can use any of these loading options:
If your data is not compatible with PolyBase, you can use bcp or the SQLBulkCopy API. BCP loads directly to dedicated SQL pool without going through Azure Blob storage, and is intended only for small loads. Note, the load performance of these options is slower than PolyBase.
While data is in the staging table, perform transformations that your workload requires. Then move the data into a production table.
The INSERT INTO ... SELECT statement moves the data from the staging table to the permanent table.
As you design an ETL process, try running the process on a small test sample. Try extracting 1000 rows from the table to a file, move it to Azure, and then try loading it into a staging table.
Many of our partners have loading solutions. To find out more, see a list of our solution partners.
For loading guidance, see Guidance for load data.
Events
Mar 31, 11 PM - Apr 2, 11 PM
The ultimate Microsoft Fabric, Power BI, SQL, and AI community-led event. March 31 to April 2, 2025.
Register todayTraining
Module
Use Azure Synapse serverless SQL pools to transform data in a data lake - Training
Use Azure Synapse serverless SQL pools to transform data in a data lake
Certification
Microsoft Certified: Azure Data Engineer Associate - Certifications
Demonstrate understanding of common data engineering tasks to implement and manage data engineering workloads on Microsoft Azure, using a number of Azure services.
Documentation
Use external tables with Synapse SQL - Azure Synapse Analytics
Reading or writing data files with external tables in Synapse SQL
Data loading best practices for dedicated SQL pools - Azure Synapse Analytics
Recommendations and performance optimizations for loading data into a dedicated SQL pool in Azure Synapse Analytics.
Design tables using Synapse SQL - Azure Synapse Analytics
Introduction to designing tables in Synapse SQL.