Fast Data Loading in Azure SQL DB using Azure Databricks

Azure Databricks and Azure SQL Database work amazingly well together. This repo will help you use the latest connector to load data into Azure SQL as fast as possible, using table partitioning, columnstore indexes, and all the known best practices.
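
To give a sense of what the samples cover, here is a minimal sketch of a bulk load using the Apache Spark connector for SQL Server and Azure SQL (format `com.microsoft.sqlserver.jdbc.spark`). The server, database, credentials, table name, and path are placeholders, not values from this repo; `spark` is the SparkSession a Databricks notebook provides.

```python
# Minimal bulk-load sketch; all connection details below are placeholders.
df = spark.read.parquet("/mnt/tpch/lineitem")  # hypothetical path to the Parquet file

(df.write
    .format("com.microsoft.sqlserver.jdbc.spark")
    .mode("overwrite")
    .option("url", "jdbc:sqlserver://<server>.database.windows.net;databaseName=<db>")
    .option("dbtable", "dbo.LINEITEM")
    .option("user", "<username>")
    .option("password", "<password>")
    # tableLock=true enables a minimally logged bulk insert when loading a heap
    .option("tableLock", "true")
    .option("batchsize", "100000")
    .save())
```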

Samples

All the samples start from a partitioned Parquet file, created with data generated by the famous TPC-H benchmark. Free tools are available on the TPC-H website to generate a dataset of the size you want:

http://www.tpc.org/tpch/
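
If you need to build the Parquet file yourself, here is a hypothetical sketch: the TPC-H `dbgen` tool emits pipe-delimited `.tbl` files, which Spark can convert into a partitioned Parquet file. The paths and the partition count are illustrative assumptions.

```python
# Convert a raw TPC-H .tbl file (pipe-delimited) into partitioned Parquet.
lineitem = (spark.read
    .option("sep", "|")
    .csv("/mnt/tpch/raw/lineitem.tbl", inferSchema=True))  # hypothetical path

(lineitem
    .repartition(64)  # one Spark partition per parallel writer task
    .write
    .mode("overwrite")
    .parquet("/mnt/tpch/lineitem"))
```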

Once the Parquet file is available, the samples will guide you through the most common scenarios. All samples also show how to correctly load a table when it already has indexes, or when you want to use a columnstore index in Azure SQL; a sketch of the columnstore case follows.
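
For example, here is a hedged sketch of appending to a table that already has a clustered columnstore index. Batches of at least 102,400 rows are compressed directly into rowgroups, bypassing the delta store. The table name and connection details are placeholders, and `df` is the DataFrame loaded from the Parquet file above.

```python
url = "jdbc:sqlserver://<server>.database.windows.net;databaseName=<db>"  # placeholder

(df.write
    .format("com.microsoft.sqlserver.jdbc.spark")
    .mode("append")
    .option("url", url)
    .option("dbtable", "dbo.LINEITEM_CCI")  # hypothetical columnstore table
    .option("user", "<username>")
    .option("password", "<password>")
    # no table lock, so each Spark task can load its own rowgroups concurrently
    .option("tableLock", "false")
    .option("batchsize", "1048576")  # the maximum columnstore rowgroup size
    .save())
```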

Bonus Samples: Reading data as fast as possible

Though this repo focuses on writing data into Azure SQL as fast as possible, you may also want to know how to do the opposite: read data as fast as possible from Azure SQL into Apache Spark / Azure Databricks. For this reason, in the folder notebooks/read-from-azure-sql you will find two samples that show how to do exactly that.
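
As a taste of what those samples do, here is a sketch of a parallel read: the connector builds on the standard Spark JDBC data source, so the usual partitioning options split the table into ranges that separate tasks read concurrently. The partition column, bounds, and connection details are assumptions for a TPC-H LINEITEM table, not values from this repo.

```python
url = "jdbc:sqlserver://<server>.database.windows.net;databaseName=<db>"  # placeholder

df = (spark.read
    .format("com.microsoft.sqlserver.jdbc.spark")
    .option("url", url)
    .option("dbtable", "dbo.LINEITEM")
    .option("user", "<username>")
    .option("password", "<password>")
    .option("partitionColumn", "L_ORDERKEY")  # numeric column used to split ranges
    .option("lowerBound", "1")
    .option("upperBound", "6000000")
    .option("numPartitions", "16")  # number of concurrent read tasks
    .load())
```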

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.