Understand the use cases for SQL and Spark pools integration

The Apache Spark and SQL integration that is available within Azure Synapse Analytics provides several benefits:

  • You can take advantage of the big data computational power that Apache Spark offers
  • You have the flexibility to use either Apache Spark or SQL languages and frameworks on one platform
  • The integration is seamless and doesn’t require a complex setup
  • SQL and Apache Spark share the same underlying metadata store to transfer data easily

As a result, when you deploy an Azure Synapse Apache Spark pool, the Azure Data Lake Storage Gen2 capability enables you to store Apache Spark SQL tables within it. If you use Apache Spark SQL tables, you can query them by using Transact-SQL without needing to use commands like CREATE EXTERNAL TABLE. Within Azure Synapse Analytics, these queries integrate natively with data files that are stored in Apache Parquet format.
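As a minimal sketch of how this might look, the following Scala cell in a Synapse Spark notebook creates a Parquet-backed Spark SQL table. It assumes `spark` is the pre-created SparkSession in the notebook, and the `salesdb` database and `dailysales` table names are purely illustrative:

```scala
import spark.implicits._

// Illustrative sample data built in the notebook.
val salesDf = Seq(
  ("2023-01-01", "Bike", 240.0),
  ("2023-01-02", "Helmet", 35.0)
).toDF("order_date", "product", "amount")

spark.sql("CREATE DATABASE IF NOT EXISTS salesdb")

// Saving as a managed table writes Parquet files (Spark's default format) to the
// workspace's Data Lake Storage Gen2 account and registers the table in the
// metadata store that SQL and Spark share.
salesDf.write
  .mode("overwrite")
  .saveAsTable("salesdb.dailysales")
```

Because the table is stored as Parquet and registered in the shared metadata store, it can then be queried with a standard Transact-SQL SELECT statement, without defining an external table first.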

The integration can be helpful in use cases where you perform an extract, transform, and load (ETL) process predominantly by using SQL, but need to call on the computational power of Apache Spark to perform a portion of the process more efficiently.

Let's say you would like to write data to a SQL pool after you've performed engineering tasks in Apache Spark. You can reference the dedicated SQL pool data as a source to join with an Apache Spark DataFrame that contains data from other files. The method makes use of the Azure Synapse Apache Spark to Synapse SQL connector to efficiently transfer data between the Apache Spark and SQL pools, as shown in the sketch below.
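As an illustration only, the following Scala sketch shows the general shape of that pattern. The pool name, storage account, table, and column names are placeholders, and the `sqlanalytics` method and `Constants.INTERNAL` option reflect the connector's earlier Scala API; newer versions of the connector expose a `synapsesql` method instead:

```scala
import com.microsoft.spark.sqlanalytics.utils.Constants
import org.apache.spark.sql.SqlAnalyticsConnector._

// Read a dedicated SQL pool table into a Spark DataFrame (names are placeholders).
val customersDf = spark.read.sqlanalytics("SQLPool01.dbo.Customers")

// Join it with data engineered in Spark from files in the data lake.
val ordersDf = spark.read.parquet("abfss://data@mydatalake.dfs.core.windows.net/orders/")
val enrichedDf = ordersDf.join(customersDf, Seq("CustomerId"))

// Write the result back to the dedicated SQL pool as an internal (managed) table.
enrichedDf.write.sqlanalytics("SQLPool01.dbo.EnrichedOrders", Constants.INTERNAL)
```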

The Azure Synapse Apache Spark pool to Synapse SQL connector is a data source implementation for Apache Spark. It uses Azure Data Lake Storage Gen2 and PolyBase in SQL pools to efficiently transfer data between the Spark cluster and the Synapse SQL instance.

Beyond the capabilities described above, Azure Synapse Studio gives you an integrated notebook experience. Within these notebooks, you can attach a SQL or Apache Spark pool, and develop and run transformation pipelines by using Python, Scala, and native Spark SQL.
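As a rough sketch of such a transformation step, a notebook cell might combine the DataFrame API with native Spark SQL like this (the storage path and table names are placeholders):

```scala
// Load raw files from the data lake and expose them to Spark SQL as a temporary view.
val rawDf = spark.read.parquet("abfss://data@mydatalake.dfs.core.windows.net/raw/sales/")
rawDf.createOrReplaceTempView("raw_sales")

// Native Spark SQL runs against the temporary view registered above.
val summaryDf = spark.sql("""
  SELECT product, SUM(amount) AS total_amount
  FROM raw_sales
  GROUP BY product
""")

// Persist the result as a table in the shared metadata store.
summaryDf.write.mode("overwrite").saveAsTable("salesdb.sales_summary")
```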