Load data from Azure Data Lake Storage into dedicated SQL pools in Azure Synapse Analytics
This guide outlines how to use the COPY statement to load data from Azure Data Lake Storage. For quick examples on using the COPY statement across all authentication methods, visit the following documentation: Securely load data using dedicated SQL pools.
To provide feedback or report issues on the COPY statement, send an email to the following distribution list: email@example.com.
- Create the target table to load data from Azure Data Lake Storage.
- Create the COPY statement to load data into the data warehouse.
If you don't have an Azure subscription, create a free Azure account before you begin.
Before you begin
Before you begin this tutorial, download and install the newest version of SQL Server Management Studio (SSMS).
To run this tutorial, you need:
- A dedicated SQL pool. See Create a dedicated SQL pool and query data.
- A Data Lake Storage account. See Get started with Azure Data Lake Storage. For this storage account, you will need to configure or specify one of the following credentials to load: A storage account key, shared access signature (SAS) key, an Azure Directory Application user, or an Azure AD user that has the appropriate Azure role to the storage account.
- Currently, ingesting data using the COPY command into an Azure Storage account that is using the new Azure Storage DNS partition feature results in an error. Provision a storage account in a subscription that does not use DNS partitioning for this tutorial.
Create the target table
Connect to your dedicated SQL pool and create the target table you will load to. In this example, we are creating a product dimension table.
-- A: Create the target table -- DimProduct CREATE TABLE [dbo].[DimProduct] ( [ProductKey] [int] NOT NULL, [ProductLabel] [nvarchar](255) NULL, [ProductName] [nvarchar](500) NULL ) WITH ( DISTRIBUTION = HASH([ProductKey]), CLUSTERED COLUMNSTORE INDEX --HEAP );
Create the COPY statement
Connect to your SQL dedicated pool and run the COPY statement. For a complete list of examples, visit the following documentation: Securely load data using dedicated SQL pools.
-- B: Create and execute the COPY statement COPY INTO [dbo].[DimProduct] --The column list allows you map, omit, or reorder input file columns to target table columns. --You can also specify the default value when there is a NULL value in the file. --When the column list is not specified, columns will be mapped based on source and target ordinality ( ProductKey default -1 1, ProductLabel default 'myStringDefaultWhenNull' 2, ProductName default 'myStringDefaultWhenNull' 3 ) --The storage account location where you data is staged FROM 'https://storageaccount.blob.core.windows.net/container/directory/' WITH ( --CREDENTIAL: Specifies the authentication method and credential access your storage account CREDENTIAL = (IDENTITY = '', SECRET = ''), --FILE_TYPE: Specifies the file type in your storage account location FILE_TYPE = 'CSV', --FIELD_TERMINATOR: Marks the end of each field (column) in a delimited text (CSV) file FIELDTERMINATOR = '|', --ROWTERMINATOR: Marks the end of a record in the file ROWTERMINATOR = '0x0A', --FIELDQUOTE: Specifies the delimiter for data of type string in a delimited text (CSV) file FIELDQUOTE = '', ENCODING = 'UTF8', DATEFORMAT = 'ymd', --MAXERRORS: Maximum number of reject rows allowed in the load before the COPY operation is canceled MAXERRORS = 10, --ERRORFILE: Specifies the directory where the rejected rows and the corresponding error reason should be written ERRORFILE = '/errorsfolder', ) OPTION (LABEL = 'COPY: ADLS tutorial');
Optimize columnstore compression
By default, tables are defined as a clustered columnstore index. After a load completes, some of the data rows might not be compressed into the columnstore. There's a variety of reasons why this can happen. To learn more, see manage columnstore indexes.
To optimize query performance and columnstore compression after a load, rebuild the table to force the columnstore index to compress all the rows.
ALTER INDEX ALL ON [dbo].[DimProduct] REBUILD;
It is best to create single-column statistics immediately after a load. There are some choices for statistics. For example, if you create single-column statistics on every column it might take a long time to rebuild all the statistics. If you know certain columns are not going to be in query predicates, you can skip creating statistics on those columns.
If you decide to create single-column statistics on every column of every table, you can use the stored procedure code sample
prc_sqldw_create_stats in the statistics article.
The following example is a good starting point for creating statistics. It creates single-column statistics on each column in the dimension table, and on each joining column in the fact tables. You can always add single or multi-column statistics to other fact table columns later on.
You have successfully loaded data into your data warehouse. Great job!
Loading data is the first step to developing a data warehouse solution using Azure Synapse Analytics. Check out our development resources.
For more loading examples and references, view the following documentation:
Submit and view feedback for