How to leverage an existing Spark cluster in a Synapse workspace

Catherine Meng 41 Reputation points
2021-02-22T13:29:17.087+00:00

We have some legacy computing resources in Cosmos (Spark on Cosmos). I'd like to know whether we can connect to these existing computing resources on Cosmos from a Synapse workspace.

Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.

Accepted answer
  Samara Soucy - MSFT 5,131 Reputation points
    2021-02-23T02:49:46.22+00:00

    It depends on what your goals are. If you would like to create notebooks in your Synapse workspace and have them run on your HDInsight or Databricks clusters, then the answer is no. You would need to migrate your jobs to a cluster maintained within Synapse. You can connect to Cosmos DB from within Synapse just as you would in Databricks or HDInsight.
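    As a rough sketch of that last point: if the Cosmos DB account has Azure Synapse Link enabled, a Synapse Spark notebook can read the analytical store through a linked service. The linked service and container names below are placeholders, not anything from your environment:

    ```python
    # Sketch: read a Cosmos DB analytical store from a Synapse Spark notebook.
    # Assumes Synapse Link is enabled on the Cosmos DB account and a linked
    # service (here called "MyCosmosDbLinkedService") exists in the workspace.
    df = (spark.read
          .format("cosmos.olap")  # Synapse Link (analytical store) connector
          .option("spark.synapse.linkedService", "MyCosmosDbLinkedService")
          .option("spark.cosmos.container", "MyContainer")  # placeholder name
          .load())

    df.show()
    ```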

    If your goal is to run jobs on Databricks and then use the results within Synapse, or use data in Synapse within Databricks, then yes, this is possible.

    To move data in and out of Synapse from Databricks you will need a blob storage account that both Databricks and Synapse have permission to read and write; this is used as a temporary common storage area for the two services.

    In Python you would use something similar to the following code in Databricks to move the data between the two services:

    spark.conf.set(  
      "fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net",  
      "<your-storage-account-access-key>")  
      
    # Get some data from an Azure Synapse table.  
    df = spark.read \  
      .format("com.databricks.spark.sqldw") \  
      .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \  
      .option("tempDir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>") \  
      .option("forwardSparkAzureStorageCredentials", "true") \  
      .option("dbTable", "<your-table-name>") \  
      .load()  
      
    # Or load data from an Azure Synapse query instead of a whole table.  
    df = spark.read \  
      .format("com.databricks.spark.sqldw") \  
      .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \  
      .option("tempDir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>") \  
      .option("forwardSparkAzureStorageCredentials", "true") \  
      .option("query", "select x, count(*) as cnt from table group by x") \  
      .load()  
      
    # Apply some transformations to the data, then use the  
    # Data Source API to write the data back to another table in Azure Synapse.  
      
    df.write \  
      .format("com.databricks.spark.sqldw") \  
      .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \  
      .option("forwardSparkAzureStorageCredentials", "true") \  
      .option("dbTable", "<your-table-name>") \  
      .option("tempDir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>") \  
      .save()  
    

    Does that answer your question?


0 additional answers