Synapse Python/Spark Notebook code with SQL to query data from linked service to SQL Server

ylycfj88 5 Reputation points
2024-04-24T13:45:40.7233333+00:00

In Synapse, I have a linked service to a SQL Server database. I'm looking for sample notebook code, in either Python or Spark, that runs a complex SQL query to get data from multiple SQL Server tables, puts the results in a dataframe, and then saves it to a Gen2 data set.

This should be the most basic step that many analysts/data scientists need when working with databases, but I can't find any examples in the Synapse gallery.

Thanks !!

Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
SQL Server
A family of Microsoft relational database management and analysis systems for e-commerce, line-of-business, and data warehousing solutions.

1 answer

  1. Smaran Thoomu 9,760 Reputation points Microsoft Vendor
    2024-04-25T05:15:43.54+00:00

    Hi ylycfj88,

    Thanks for the question and using MS Q&A platform.

    To query data from a SQL Server database linked service in Synapse using Python or Spark, you can use the following steps:

    • First, you need a Synapse workspace with a linked service to your SQL Server database. You can follow the instructions in this doc to create the workspace and the linked service.
    • Once your SQL Server database is linked to Synapse, create a new notebook in Synapse Studio. In the notebook, you can use PySpark to connect to the database and execute SQL queries. Here's an example code snippet that queries data from a SQL Server database, puts the results in a dataframe, and saves it to a Gen2 data set:
    # Import the necessary libraries
    from pyspark.sql import SparkSession
    
    # Create a SparkSession (in a Synapse notebook, a session named `spark` already exists)
    spark = SparkSession.builder.appName("Query SQL Server").getOrCreate()
    
    # Define the SQL query; for JDBC it must be wrapped in parentheses and given an alias
    query = "(SELECT * FROM table1 JOIN table2 ON table1.id = table2.id) AS joined_tables"
    
    # Define the connection properties
    connectionProperties = {
      "user": "<username>",
      "password": "<password>",
      "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    }
    
    # Define the JDBC URL
    jdbcUrl = "jdbc:sqlserver://<server>:<port>;database=<database>"
    
    # Load the query results into a dataframe
    df = spark.read.jdbc(url=jdbcUrl, table=query, properties=connectionProperties)
    
    # Save the dataframe to the Gen2 data set as snappy-compressed Avro
    # (Spark 2.4+ ships a built-in "avro" format; the external com.databricks.spark.avro package is not needed)
    df.write.format("avro").option("compression", "snappy").mode("overwrite").save("<Gen2 data set path>")
    

    In this code snippet, replace <username>, <password>, <server>, <port>, <database>, and <Gen2 data set path> with your specific values.
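    To make the placeholder substitution less error-prone, the string-building steps above can be sketched as small helper functions. The names (`wrap_subquery`, `build_jdbc_url`, `build_abfss_path`) are illustrative only, not part of any Synapse or Spark API; this is a minimal sketch assuming a standard SQL Server JDBC URL and an ADLS Gen2 `abfss://` path.

```python
# Illustrative helpers (hypothetical names, not a Synapse API) for building
# the strings passed to spark.read.jdbc and to the Gen2 save path.

def wrap_subquery(sql: str, alias: str = "subq") -> str:
    """JDBC's `table` argument expects a table name or a parenthesized,
    aliased subquery, so wrap a raw SELECT accordingly."""
    return f"({sql}) AS {alias}"

def build_jdbc_url(server: str, database: str, port: int = 1433) -> str:
    """Assemble a SQL Server JDBC URL in the form used in the snippet above."""
    return f"jdbc:sqlserver://{server}:{port};database={database}"

def build_abfss_path(container: str, account: str, folder: str) -> str:
    """Assemble an ADLS Gen2 path of the form
    abfss://<container>@<account>.dfs.core.windows.net/<folder>."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{folder.lstrip('/')}"

# Example usage with made-up placeholder values:
query = wrap_subquery(
    "SELECT * FROM table1 JOIN table2 ON table1.id = table2.id",
    alias="joined_tables",
)
jdbc_url = build_jdbc_url("myserver.database.windows.net", "mydb")
gen2_path = build_abfss_path("mycontainer", "mystorageacct", "exports/joined")
```

    These helpers only format strings; authentication and network access are still governed by the linked service and workspace configuration.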

    You can refer to this video for more information about using PySpark to query data from SQL Server in Azure Synapse Analytics.

    Hope this helps. Do let us know if you have any further queries.


    If this answers your query, do click Accept Answer and Yes for was this answer helpful. And, if you have any further queries, do let us know.