Traditional SMP dedicated SQL pools use an Extract, Transform, and Load (ETL) process for loading data. Synapse SQL, within Azure Synapse Analytics, uses a distributed query processing architecture that takes advantage of the scalability and flexibility of compute and storage resources.
An Extract, Load, and Transform (ELT) process takes advantage of these built-in distributed query processing capabilities and eliminates the resources needed for data transformation before loading.
While dedicated SQL pools support many loading methods, including popular SQL Server options such as bcp and the SqlBulkCopy API, the fastest and most scalable way to load data is through PolyBase external tables and the COPY statement.
With PolyBase and the COPY statement, you can access external data stored in Azure Blob storage or Azure Data Lake Store via the T-SQL language. For the most flexibility when loading, we recommend using the COPY statement.
Extract, Load, and Transform (ELT) is a process by which data is extracted from a source system, loaded into a dedicated SQL pool, and then transformed.
The basic steps for implementing ELT are:

1. Extract the source data into text files.
2. Land the data into Azure Blob storage or Azure Data Lake Store.
3. Prepare the data for loading.
4. Load the data into staging tables with PolyBase or the COPY statement.
5. Transform the data.
6. Insert the data into production tables.

For a loading tutorial, see loading data from Azure Blob storage.
Getting data out of your source system depends on the storage location. The goal is to move the data into supported delimited text or CSV files.
With PolyBase and the COPY statement, you can load data from UTF-8 and UTF-16 encoded delimited text or CSV files. In addition to delimited text and CSV files, they load from Hadoop file formats such as ORC and Parquet. PolyBase and the COPY statement can also load data from Gzip and Snappy compressed files.
Extended ASCII, fixed-width format, and nested formats such as WinZip or XML aren't supported. If you're exporting from SQL Server, you can use the bcp command-line tool to export the data into delimited text files.
To land the data in Azure storage, you can move it to Azure Blob storage or Azure Data Lake Store Gen2. In either location, the data should be stored in text files. PolyBase and the COPY statement can load from either location.
Tools and services you can use to move data to Azure Storage:

- Azure ExpressRoute improves network throughput, performance, and predictability when moving data into Azure.
- The AzCopy utility copies data to Azure Storage over the public internet.
- Azure Data Factory (ADF) can copy data from on-premises and cloud sources into Azure Storage.
You might need to prepare and clean the data in your storage account before loading. Data preparation can be performed while your data is in the source, as you export the data to text files, or after the data is in Azure Storage. It's easiest to work with the data as early in the process as possible.
When using the COPY statement, first define the tables you're loading to in your dedicated SQL pool.
If you're using PolyBase, you need to define external tables in your dedicated SQL pool before loading. PolyBase uses external tables to define and access the data in Azure Storage. An external table is similar to a database view. The external table contains the table schema and points to data that is stored outside the dedicated SQL pool.
Defining external tables involves specifying the data source, the format of the text files, and the table definitions. The T-SQL syntax reference articles that you'll need are:

- CREATE EXTERNAL DATA SOURCE
- CREATE EXTERNAL FILE FORMAT
- CREATE EXTERNAL TABLE
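To show how those three definitions fit together, here's a minimal sketch for delimited text files in Azure Blob storage. The storage account, container, object names, and columns are placeholders for illustration, not values from this article.

```sql
-- Hypothetical names and paths for illustration only.
-- Data source: points to the storage account and container that hold the files.
-- For storage that isn't publicly accessible, also specify CREDENTIAL = <database scoped credential>.
CREATE EXTERNAL DATA SOURCE AzureStorage
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://mycontainer@mystorageaccount.blob.core.windows.net'
);

-- File format: describes the layout of the delimited text files.
CREATE EXTERNAL FILE FORMAT TextFileFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|', STRING_DELIMITER = '"', FIRST_ROW = 2)
);

-- External table: the schema, plus a pointer to the folder in the data source.
CREATE EXTERNAL TABLE dbo.Customer_external
(
    CustomerKey  int           NOT NULL,
    CustomerName nvarchar(100) NOT NULL,
    CreatedDate  date          NULL
)
WITH (
    LOCATION = '/customer/',
    DATA_SOURCE = AzureStorage,
    FILE_FORMAT = TextFileFormat
);
```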
Use the following SQL data type mapping when loading Parquet files:
Parquet type | Parquet logical type (annotation) | SQL data type |
---|---|---|
BOOLEAN | | bit |
BINARY / BYTE_ARRAY | | varbinary |
DOUBLE | | float |
FLOAT | | real |
INT32 | | int |
INT64 | | bigint |
INT96 | | datetime2 |
FIXED_LEN_BYTE_ARRAY | | binary |
BINARY | UTF8 | nvarchar |
BINARY | STRING | nvarchar |
BINARY | ENUM | nvarchar |
BINARY | UUID | uniqueidentifier |
BINARY | DECIMAL | decimal |
BINARY | JSON | nvarchar(MAX) |
BINARY | BSON | varbinary(MAX) |
FIXED_LEN_BYTE_ARRAY | DECIMAL | decimal |
BYTE_ARRAY | INTERVAL | varchar(MAX) |
INT32 | INT(8, true) | smallint |
INT32 | INT(16, true) | smallint |
INT32 | INT(32, true) | int |
INT32 | INT(8, false) | tinyint |
INT32 | INT(16, false) | int |
INT32 | INT(32, false) | bigint |
INT32 | DATE | date |
INT32 | DECIMAL | decimal |
INT32 | TIME (MILLIS) | time |
INT64 | INT(64, true) | bigint |
INT64 | INT(64, false) | decimal(20,0) |
INT64 | DECIMAL | decimal |
INT64 | TIME (MILLIS) | time |
INT64 | TIMESTAMP (MILLIS) | datetime2 |
Complex type | LIST | varchar(max) |
Complex type | MAP | varchar(max) |
Important

You might see the following error if data types are mismatched between the Parquet file and the SQL table definition, or if the Parquet file contains data types that aren't supported:

HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: ClassCastException:...
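As a quick illustration of the mapping, here's a sketch of a Parquet file format and an external table whose column types follow the table above. The object names and folder path are placeholders, and the data source is the hypothetical AzureStorage definition from the earlier sketch.

```sql
-- Hypothetical Parquet example; column types follow the mapping table above.
CREATE EXTERNAL FILE FORMAT ParquetFileFormat
WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);

CREATE EXTERNAL TABLE dbo.Orders_external
(
    OrderId      bigint,         -- Parquet INT64
    OrderDate    date,           -- Parquet INT32 with DATE annotation
    CustomerName nvarchar(200),  -- Parquet BINARY with UTF8/STRING annotation
    OrderTotal   decimal(18,2)   -- Parquet FIXED_LEN_BYTE_ARRAY with DECIMAL annotation
)
WITH (
    LOCATION = '/orders/',
    DATA_SOURCE = AzureStorage,       -- defined in the earlier sketch
    FILE_FORMAT = ParquetFileFormat
);
```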
For an example of creating external objects, see Create external tables.
If you're using PolyBase, the external objects you define need to align the rows of the text files with the external table and file format definitions. The data in each row of the text file must align with the table definition.
To format the text files:

- If your data comes from a non-relational source, transform it into rows and columns. Whether the data comes from a relational or non-relational source, it must align with the column definitions of the table you plan to load it into.
- Format the data in the text file to align with the columns and data types of the destination table. Misalignment between the data types in the text files and the dedicated SQL pool table causes rows to be rejected during the load.
- Separate fields in the text file with a terminator. Be sure to use a character or character sequence that doesn't appear in your source data, and use the same terminator you specify in CREATE EXTERNAL FILE FORMAT.
It is best practice to load data into a staging table. Staging tables allow you to handle errors without interfering with the production tables. A staging table also gives you the opportunity to use the dedicated SQL pool parallel processing architecture for data transformations before inserting the data into production tables.
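As an illustration, a staging table is often created as a round-robin distributed heap so the load itself is as fast as possible; the table name and columns below are placeholders, not prescriptions from this article.

```sql
-- Hypothetical staging table: a round-robin heap keeps the load fast;
-- the data is distributed and indexed later, when it moves to a production table.
CREATE TABLE dbo.Customer_stage
(
    CustomerKey  int           NOT NULL,
    CustomerName nvarchar(100) NOT NULL,
    CreatedDate  date          NULL
)
WITH (
    HEAP,
    DISTRIBUTION = ROUND_ROBIN
);
```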
To load data, you can use any of these loading options:

- The COPY statement (recommended; it's the most flexible option, and a minimal sketch follows this list)
- PolyBase with T-SQL, using external tables
- PolyBase or the COPY statement with Azure Data Factory (ADF)
- PolyBase with SSIS
- Azure Databricks
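Here's a minimal COPY statement sketch that loads CSV files from Blob storage into the hypothetical staging table above. The storage URL, SAS token, and file layout are placeholders you would replace with your own values.

```sql
-- Hypothetical COPY load into the staging table; replace the URL and
-- credential with your own storage account and authentication method.
COPY INTO dbo.Customer_stage
FROM 'https://mystorageaccount.blob.core.windows.net/mycontainer/customer/*.csv'
WITH (
    FILE_TYPE = 'CSV',
    CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = '<sas-token>'),
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '0x0A',
    FIRSTROW = 2        -- skip the header row
);
```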
For step-by-step walkthroughs of these options, review the available loading tutorials.
In addition to PolyBase and the COPY statement, you can use bcp or the SqlBulkCopy API. The bcp utility loads directly to the database without going through Azure Blob storage, and is intended only for small loads.
Note
The load performance of these options is slower than PolyBase and the COPY statement.
While data is in the staging table, perform transformations that your workload requires. Then move the data into a production table.
The INSERT INTO ... SELECT statement moves the data from the staging table to the permanent table.
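For example, here's a sketch of this final step using the hypothetical staging table from earlier; the production table name and the expressions in the SELECT list are placeholders for whatever transformations your workload requires.

```sql
-- Hypothetical move from staging to production; the expressions stand in
-- for the transformations your workload actually needs.
INSERT INTO dbo.DimCustomer (CustomerKey, CustomerName, CreatedDate)
SELECT CustomerKey,
       UPPER(CustomerName),            -- example transformation
       ISNULL(CreatedDate, GETDATE())
FROM dbo.Customer_stage;
```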
As you design an ELT process, try running the process on a small test sample. Try extracting 1,000 rows from the table to a file, moving the file to Azure, and then loading it into a staging table.
Many of our partners have loading solutions. To find out more, see a list of our solution partners.