How to use a linked service in a Notebook with PySpark

Shao Peng Sun 81 Reputation points
2023-03-31T04:24:40.7333333+00:00

I have a PySpark script in a Notebook that reads and writes data in ADLS Gen2; a sample of the script is below. In Synapse, the only connection I have to this ADLS Gen2 account is a linked service created with a service principal, so I need the notebook to use that linked service to make the connection. How can I do this in the notebook?

df.write.mode("overwrite").parquet("abfss://container_name@xxxxxxxxxxxx.dfs.core.windows.net/test")

Accepted answer
  1. BhargavaGunnam-MSFT 25,876 Reputation points Microsoft Employee
    2023-03-31T19:19:34.4633333+00:00

    Hello Shao Peng Sun,

    Welcome to the MS Q&A platform.

    You can supply the service principal credentials held by your linked service through Spark configuration in the notebook, and then read and write with the abfss:// path as usual. (Note that the storage_options={'linked_service': ...} parameter is honored by the Pandas readers in Synapse, not by spark.read.) You can modify your script as below to use those credentials:

    
    
    # Set the service principal credentials (the same ones configured in the linked service)
    # for the storage account through Spark configuration before reading or writing.
    spark.conf.set("fs.azure.account.auth.type.<your-storage-account-name>.dfs.core.windows.net", "OAuth")
    spark.conf.set("fs.azure.account.oauth.provider.type.<your-storage-account-name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.conf.set("fs.azure.account.oauth2.client.id.<your-storage-account-name>.dfs.core.windows.net", "<your-service-principal-client-id>")
    spark.conf.set("fs.azure.account.oauth2.client.secret.<your-storage-account-name>.dfs.core.windows.net", "<your-service-principal-client-secret>")
    spark.conf.set("fs.azure.account.oauth2.client.endpoint.<your-storage-account-name>.dfs.core.windows.net", "https://login.microsoftonline.com/<your-tenant-id>/oauth2/token")
    
    # read data
    df = spark.read.parquet("abfss://demo@bhargavasynapsegen2.dfs.core.windows.net/NYCTrip.parquet")
    df.show()
    
    # write data
    df.write.mode("overwrite").parquet("abfss://container_name@xxxxxxxxxxxx.dfs.core.windows.net/test")
    

    Please see the screenshot below for reference. Using this configuration, I was able to read my parquet file.

    [Screenshot: reading the parquet file in the notebook]
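
    If you prefer not to paste the client secret into the notebook, one option is to keep it in Azure Key Vault and read it at run time with mssparkutils. Below is a minimal sketch; the Key Vault name, secret name, and Key Vault linked service name are placeholders, not values from this thread:

    from notebookutils import mssparkutils  # available in Synapse notebooks

    # Placeholder names: replace with your Key Vault, secret, and Key Vault linked service.
    client_secret = mssparkutils.credentials.getSecret("your-key-vault-name", "sp-client-secret", "AzureKeyVaultLinkedService")

    # Use the retrieved secret instead of hard-coding it in the configuration.
    spark.conf.set("fs.azure.account.oauth2.client.secret.<your-storage-account-name>.dfs.core.windows.net", client_secret)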

    I hope this helps.

    If this answers your question, please consider accepting the answer and up-voting, as it helps the community find answers to similar questions.
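
    As a side note, the Synapse Spark runtime also documents a linked-service based token provider that lets the ABFS driver obtain tokens from the linked service directly, so the service principal secret never has to appear in the notebook. Below is a minimal sketch of that pattern; the configuration keys and provider class name follow the documented pattern and should be verified against your Synapse runtime version:

    # Point the ABFS driver at the linked service for this storage account.
    storage_host = "<your-storage-account-name>.dfs.core.windows.net"
    linked_service_name = "your_linked_service_name"

    spark.conf.set(f"spark.storage.synapse.{storage_host}.linkedServiceName", linked_service_name)
    spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_host}", "com.microsoft.azure.synapse.tokenlibrary.LinkedServiceBasedTokenProvider")

    # Reads and writes against this account now authenticate via the linked service.
    df = spark.read.parquet(f"abfss://container_name@{storage_host}/test")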


1 additional answer

  1. Sullivan Paul - Basel 15 Reputation points
    2023-05-23T18:02:42.9+00:00

    So we still have to use the storage account name. That means, from the point of view of deploying environment-aware linked services and using them from environment-unaware code, this approach is pointless.

    3 people found this answer helpful.
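
    One way to keep the notebook code itself environment-unaware is to resolve the storage endpoint from the linked service definition at run time instead of hard-coding it. The sketch below assumes the linked service properties returned by mssparkutils include an Endpoint field (the exact fields depend on the linked service type), and the linked service name is a placeholder:

    import json
    from notebookutils import mssparkutils  # available in Synapse notebooks

    linked_service_name = "your_linked_service_name"  # placeholder

    # getPropertiesAll returns the linked service definition as a JSON string.
    props = json.loads(mssparkutils.credentials.getPropertiesAll(linked_service_name))
    endpoint = props.get("Endpoint", "")  # e.g. "https://<account>.dfs.core.windows.net/"
    account_host = endpoint.replace("https://", "").rstrip("/")

    path = f"abfss://container_name@{account_host}/test"
    df = spark.read.parquet(path)  # or: df.write.mode("overwrite").parquet(path)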