How to use a linked service in a Notebook with PySpark

Shao Peng Sun 81 Reputation points
2023-03-31T04:24:40.7333333+00:00

I have a PySpark script in a Notebook that reads and writes data in ADLS Gen2; a sample of the script is below. In Synapse, the only connection I have to this ADLS Gen2 account is a linked service created with a service principal, so I need the notebook to use that linked service to make the connection. How can I do this in the notebook?

df.write.mode("overwrite").parquet("abfss://container_name@xxxxxxxxxxxx.dfs.core.windows.net/test")

Accepted answer
  1. BhargavaGunnam-MSFT 25,876 Reputation points Microsoft Employee
    2023-03-31T19:19:34.4633333+00:00

    Hello Shao Peng Sun,

    Welcome to the MS Q&A platform.

    You can supply the service principal credentials held by your linked service through Spark configuration in the notebook, and then read and write with the abfss:// path as usual. (Note that the storage_options={'linked_service': ...} parameter is honored by the Pandas readers in Synapse, not by spark.read.) You can modify your script as below to use those credentials:

    
    
    # Set the service principal credentials (the same ones configured in the linked service)
    # for the storage account through Spark configuration before reading or writing.
    spark.conf.set("fs.azure.account.auth.type.<your-storage-account-name>.dfs.core.windows.net", "OAuth")
    spark.conf.set("fs.azure.account.oauth.provider.type.<your-storage-account-name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.conf.set("fs.azure.account.oauth2.client.id.<your-storage-account-name>.dfs.core.windows.net", "<your-service-principal-client-id>")
    spark.conf.set("fs.azure.account.oauth2.client.secret.<your-storage-account-name>.dfs.core.windows.net", "<your-service-principal-client-secret>")
    spark.conf.set("fs.azure.account.oauth2.client.endpoint.<your-storage-account-name>.dfs.core.windows.net", "https://login.microsoftonline.com/<your-tenant-id>/oauth2/token")
    
    # read data
    df = spark.read.parquet("abfss://demo@bhargavasynapsegen2.dfs.core.windows.net/NYCTrip.parquet")
    df.show()
    
    # write data
    df.write.mode("overwrite").parquet("abfss://container_name@xxxxxxxxxxxx.dfs.core.windows.net/test")
    

    Please see the screenshot below for reference. Using this configuration, I was able to read my parquet file.

    [Screenshot: reading the parquet file in the notebook]
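
    If you prefer not to paste the client secret into the notebook, one option is to keep it in Azure Key Vault and read it at run time with mssparkutils. Below is a minimal sketch; the Key Vault name, secret name, and Key Vault linked service name are placeholders, not values from this thread:

    from notebookutils import mssparkutils  # available in Synapse notebooks

    # Placeholder names: replace with your Key Vault, secret, and Key Vault linked service.
    client_secret = mssparkutils.credentials.getSecret("your-key-vault-name", "sp-client-secret", "AzureKeyVaultLinkedService")

    # Use the retrieved secret instead of hard-coding it in the configuration.
    spark.conf.set("fs.azure.account.oauth2.client.secret.<your-storage-account-name>.dfs.core.windows.net", client_secret)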

    I hope this helps.

    If this answers your question, please consider accepting the answer and up-voting, as it helps the community find answers to similar questions.
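
    As a side note, the Synapse Spark runtime also documents a linked-service based token provider that lets the ABFS driver obtain tokens from the linked service directly, so the service principal secret never has to appear in the notebook. Below is a minimal sketch of that pattern; the configuration keys and provider class name follow the documented pattern and should be verified against your Synapse runtime version:

    # Point the ABFS driver at the linked service for this storage account.
    storage_host = "<your-storage-account-name>.dfs.core.windows.net"
    linked_service_name = "your_linked_service_name"

    spark.conf.set(f"spark.storage.synapse.{storage_host}.linkedServiceName", linked_service_name)
    spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_host}", "com.microsoft.azure.synapse.tokenlibrary.LinkedServiceBasedTokenProvider")

    # Reads and writes against this account now authenticate via the linked service.
    df = spark.read.parquet(f"abfss://container_name@{storage_host}/test")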


1 additional answer

  1. Sullivan Paul - Basel 15 Reputation points
    2023-05-23T18:02:42.9+00:00

    So we still have to use the storage account name. That means, from the point of view of deploying environment-aware linked services and using them from environment-unaware code, this approach is pointless.

    3 people found this answer helpful.
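
    One way to keep the notebook code itself environment-unaware is to resolve the storage endpoint from the linked service definition at run time instead of hard-coding it. The sketch below assumes the linked service properties returned by mssparkutils include an Endpoint field (the exact fields depend on the linked service type), and the linked service name is a placeholder:

    import json
    from notebookutils import mssparkutils  # available in Synapse notebooks

    linked_service_name = "your_linked_service_name"  # placeholder

    # getPropertiesAll returns the linked service definition as a JSON string.
    props = json.loads(mssparkutils.credentials.getPropertiesAll(linked_service_name))
    endpoint = props.get("Endpoint", "")  # e.g. "https://<account>.dfs.core.windows.net/"
    account_host = endpoint.replace("https://", "").rstrip("/")

    path = f"abfss://container_name@{account_host}/test"
    df = spark.read.parquet(path)  # or: df.write.mode("overwrite").parquet(path)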