How to run one Synapse pipeline across tenants from different domains (Entra ID)
Hi, I have a Synapse pipeline with some Spark notebooks that perform ETL transformations. I need to run this pipeline and save data to different ADLS accounts. All of them are in different subscriptions, and some of them belong to a different Entra ID tenant than the workspace where my Synapse pipeline lives. What are my options?
- To run the Synapse pipeline, I need to grant the workspace's managed identity the Storage Blob Data Contributor role on all of the ADLS accounts, but I can't see the subscriptions that are not related to the same (my) Entra ID tenant.
- Is it possible to use SAS tokens for that purpose? If yes, what is the best practice for connecting to ADLS from a notebook with the SAS tokens kept in Key Vault? (I don't want to use Databricks.)
- If all my ADLS accounts were related to one Entra ID tenant, would I just need the managed identity and no SAS token?
Many thanks for an answer!
Peter Michalik
Azure Data Lake Storage
Azure Synapse Analytics
-
phemanth 5,840 Reputation points • Microsoft Vendor
2024-02-20T10:10:18.9033333+00:00 @peter michalik Thanks for reaching out to Microsoft Q&A.
- Running a Synapse pipeline across different subscriptions: If you have different subscriptions under different Entra ID tenants, you might need to set up a DevOps pipeline to access multiple subscriptions. However, this might not be straightforward, as a DevOps pipeline is typically linked to one subscription. Refer to: https://stackoverflow.com/questions/61444258/devops-pipeline-task-to-access-multiple-subscriptions
- Using SAS tokens: Yes, you can use Shared Access Signature (SAS) tokens to access Azure Data Lake Storage (ADLS). SAS tokens give you a way to grant clients limited access to objects in your storage account without exposing your account key. You can store these SAS tokens in Azure Key Vault for secure access. In a Synapse notebook, you can use the following code to connect to ADLS Gen2 with a SAS token:
spark.conf.set("fs.azure.account.auth.type.<ACCOUNT>.dfs.core.windows.net", "SAS") spark.conf.set("fs.azure.sas.token.provider.type.<ACCOUNT>.dfs.core.windows.net", "com.microsoft.azure.synapse.tokenlibrary.ConfBasedSASProvider") spark.conf.set("spark.storage.synapse.<CONTAINER>.<ACCOUNT>.dfs.core.windows.net.sas", "<SAS KEY>")
Replace <ACCOUNT>, <CONTAINER>, and <SAS KEY> with your actual account name, container name, and SAS key, respectively. Refer to: https://learn.microsoft.com/en-us/azure/databricks/connect/storage/azure-storage
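On the Key Vault part of your question: rather than pasting the SAS key into the notebook, you can fetch it from Azure Key Vault at runtime. A minimal sketch, assuming a Key Vault linked service is registered in the workspace; the vault name "my-key-vault", secret name "adls-sas-token", and linked service name "MyKeyVaultLS" are hypothetical placeholders:

```python
# Minimal sketch: fetch the SAS token from Azure Key Vault at runtime instead
# of hard-coding it. Vault name, secret name, and linked service name are
# hypothetical placeholders.
from notebookutils import mssparkutils

sas_token = mssparkutils.credentials.getSecret(
    "my-key-vault",      # Key Vault name
    "adls-sas-token",    # secret holding the SAS token
    "MyKeyVaultLS",      # Key Vault linked service in the workspace
)

spark.conf.set("fs.azure.account.auth.type.<ACCOUNT>.dfs.core.windows.net", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type.<ACCOUNT>.dfs.core.windows.net",
               "com.microsoft.azure.synapse.tokenlibrary.ConfBasedSASProvider")
spark.conf.set("spark.storage.synapse.<CONTAINER>.<ACCOUNT>.dfs.core.windows.net.sas", sas_token)
```

This keeps the token out of the notebook source and lets you rotate it in Key Vault without touching code.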
- Using managed identity with one Entra ID tenant: If all your ADLS accounts are related to one Entra ID tenant, you can indeed use a managed identity and wouldn't need a SAS token (see the sketch below). Managed identities provide an automatically managed identity in Microsoft Entra ID for applications to use when connecting to resources that support Microsoft Entra authentication, which eliminates the need for developers to manage credentials.
Refer to: https://learn.microsoft.com/en-us/entra/identity/managed-identities-azure-resources/tutorial-windows-vm-access-datalake Please note that you should review the availability status of managed identities for your resource and any known issues before you begin. Also, remember to grant your VM access to the Azure Data Lake Store. Hope this helps. Do let us know if you have any further queries.
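To illustrate the single-tenant case in a Synapse notebook: once the workspace managed identity has the role, you can read through an ADLS Gen2 linked service that authenticates as that identity. A sketch, where "MyADLSLinkedService" and the abfss path are hypothetical placeholders:

```python
# Sketch: read from ADLS Gen2 through a linked service that authenticates with
# the workspace managed identity. Linked service name and path are placeholders.
spark.conf.set("spark.storage.synapse.linkedServiceName", "MyADLSLinkedService")
spark.conf.set("fs.azure.account.oauth.provider.type",
               "com.microsoft.azure.synapse.tokenlibrary.LinkedServiceBasedTokenProvider")

df = spark.read.csv("abfss://<CONTAINER>@<ACCOUNT>.dfs.core.windows.net/<PATH>")
df.show(10)
```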
-
Peter Michalik 20 Reputation points
2024-02-20T11:41:46.4133333+00:00 Thanks for the answers, @phemanth. To be clearer, I would add some questions:
- I am going to use a Synapse pipeline with Spark notebooks. Synapse runs pipelines with its managed identity. How can I grant my Synapse workspace's managed identity access to an ADLS account in a different subscription/Entra ID tenant? Is it even possible?
- If Synapse runs only with the managed identity, does that basically mean I can't use a SAS token to read data from ADLS via Synapse notebooks?
- What do you mean by "availability status of managed identities"?
- When I want to read data from an external ADLS account into a dataframe in a Synapse notebook with a SAS token (not in a pipeline), do I still need to assign the Storage Blob Data Contributor role to my Synapse workspace?
- I created a linked service to the external ADLS with auth type "storage key", clicked the "verify connection" button, and everything was OK. When I try to access it from a notebook using the storage key, it fails with a permission issue. I used the code below.
- I also tried the code using SAS; it fails with "Server failed to authenticate the request". When I use the same SAS/storage key in a data flow (not a notebook), everything is OK.
I tried three ways to do it; everything in the notebook fails on permissions.
sc._jsc.hadoopConfiguration().set("fs.azure.account.auth.type.dpassdaastest.dfs.core.windows.net", "SAS")
sc._jsc.hadoopConfiguration().set("fs.azure.sas.token.provider.type", "com.microsoft.azure.synapse.tokenlibrary.ConfBasedSASProvider")
spark.conf.set("spark.storage.synapse.dpassdaastest.dfs.core.windows.net.sas", "?xxxxxxxxx")
df = spark.read.csv('abfss://testcontainer@dpassdaastest.dfs.core.windows.net/SampleCSVFile_2kb.csv')
display(df.limit(10))
spark.conf.set("fs.azure.account.auth.type.dpassdaastest.dfs.core.windows.net", "SAS") spark.conf.set("fs.azure.sas.token.provider.type.dpassdaastest.dfs.core.windows.net", "com.microsoft.azure.synapse.tokenlibrary.ConfBasedSASProvider") spark.conf.set("spark.storage.synapse.testcontainer.dpassdaastest.dfs.core.windows.net.sas", "?xxxxxxxxx")
spark.conf.set("spark.storage.synapse.dpassdaastest.linkedServiceName", "DpaaSTenant") df = spark.read.csv("abfss://testcontainer@dpassdaastest.dfs.core.windows.net/SampleCSVFile_2kb.csv", header=True)
Many thanks for your answers
Peter Michalik
-
phemanth 5,840 Reputation points • Microsoft Vendor
2024-02-21T05:34:54.57+00:00 @Peter Michalik
Let me address your questions one by one.
- Setting a managed identity for ADLS from a different subscription/tenant for the Synapse workspace: Yes, it is possible to set a managed identity for Azure Data Lake Storage (ADLS) from a different subscription or tenant for your Synapse workspace. The managed identity of the Azure Synapse workspace needs the Storage Blob Data Contributor role on the ADLS Gen2 storage account. If the workspace creator is also the owner of the ADLS Gen2 storage account, then Azure Synapse will assign the Storage Blob Data Contributor role to the managed identity.
- Using a SAS token to read data from ADLS via Synapse notebooks: Yes, you can use a Shared Access Signature (SAS) token to read data from ADLS via Synapse notebooks. You can connect to ADLS Gen2 storage directly by using a SAS key: use the ConfBasedSASProvider and provide the SAS key to the spark.storage.synapse.sas configuration setting (see the sketch after this list).
- Availability status of managed identities: This refers to the support status of managed identities for the different Azure services. Each Azure service that supports managed identities for Azure resources follows its own timeline, so it's important to review the availability status of managed identities for your resource, and any known issues, before you begin.
- Assigning the Storage Blob Data Contributor role to the Synapse workspace for reading data from an external ADLS into a dataframe in a Synapse notebook with a SAS token: Yes, the Azure Synapse managed identity needs the Storage Blob Data Contributor role on the ADLS Gen2 storage account. This role is required to successfully launch Spark pools in the Azure Synapse workspace.
- Permission issue when accessing ADLS from a notebook using a storage key: The permission issue could be due to insufficient access rights. The user or the managed identity running the code should have the Storage Blob Data Contributor role on the storage account. If you're using a SAS token, make sure it's correctly formed and not expired. If you're still facing issues, please check the firewall restrictions on the ADLS account and ensure that your Synapse workspace IP and client IP are whitelisted.
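On the SAS point, and regarding the third snippet you posted (which set spark.storage.synapse.dpassdaastest.linkedServiceName): per the Synapse token-library documentation, the linked-service route for SAS scopes the setting to the full storage host name and also needs the SAS provider type set. A sketch using your "DpaaSTenant" linked service; treat the exact configuration keys as an assumption to verify against the documentation for your runtime version:

```python
# Sketch: read via a SAS-authenticated ADLS Gen2 linked service. Both settings
# are scoped to the full storage host name; "DpaaSTenant" is the linked service
# named in this thread. Verify the exact keys against the token-library docs.
account = "dpassdaastest.dfs.core.windows.net"
spark.conf.set(f"spark.storage.synapse.{account}.linkedServiceName", "DpaaSTenant")
spark.conf.set(f"fs.azure.sas.token.provider.type.{account}",
               "com.microsoft.azure.synapse.tokenlibrary.LinkedServiceBasedSASProvider")

df = spark.read.csv(f"abfss://testcontainer@{account}/SampleCSVFile_2kb.csv", header=True)
```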
-
Peter Michalik 20 Reputation points
2024-02-21T07:03:43.37+00:00 @phemanth Let me explain my scenario:
- I have a Synapse workspace in one subscription, related to a different Entra ID/Active Directory tenant than the ADLS where I want to store data.
- I am not able to assign the Storage Blob Data Contributor role on the ADLS to my Synapse workspace's managed identity, because the workspace belongs to a different Entra ID tenant. When I try to assign that role on the ADLS, I do not see the subscription where my workspace is (which makes sense to me, because it belongs to a different Microsoft identity, i.e. Entra ID/Active Directory).
- Therefore I was thinking I need to use a SAS token to access the "external" ADLS. When I work with a linked service and then data flows in Synapse, everything works fine: I am able to access the external ADLS with the generated token, etc. (no IP address problems, no permission issues). But when I want to access that same ADLS, with the same token I use in data flows, from a Synapse notebook, I always get some error related to permissions, "not authorized", invalid config, etc. I did it according to your advice to use ConfBasedSASProvider and followed this documentation: https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-secure-credentials-with-tokenlibrary?pivots=programming-language-python.
To your answer on question 1, "If the workspace creator is also the owner of the ADLS Gen2 storage account, then Azure Synapse will assign the Storage Blob Data Contributor role to the managed identity": I do not understand that point. The creator of the workspace IS NOT the owner of the ADLS, of course. The owners of the ADLS are our customers from different domains/AAD/Entra ID tenants.

sc._jsc.hadoopConfiguration().set("fs.azure.account.auth.type.adlsasa.dfs.core.windows.net", "SAS")
sc._jsc.hadoopConfiguration().set("fs.azure.sas.token.provider.type.adlsasa.dfs.core.windows.net", "com.microsoft.azure.synapse.tokenlibrary.ConfBasedSASProvider")
spark.conf.set("spark.storage.synapse.mycontainer.adlsasa.dfs.core.windows.net.sas", "?sv=2022-11-02&ss=MQ%3D")
df = spark.read.csv('abfss://mycontainer@adlsasa.dfs.core.windows.net/SampleCSVFile_2kb.csv')
-
phemanth 5,840 Reputation points • Microsoft Vendor
2024-02-22T05:17:19.6566667+00:00 @Peter Michalik I understand your scenario and the challenges you’re facing. Let’s break down the issues and potential solutions:
- Setting the Storage Blob Data Contributor role on ADLS for the Synapse workspace: If the Synapse workspace and the ADLS are in different Azure Active Directories (AAD), it can indeed be challenging to set the Storage Blob Data Contributor role. However, it is possible to grant permissions to the managed identity of the Synapse workspace: the owner of the ADLS Gen2 storage account can manually assign the Storage Blob Data Contributor role to the managed identity.
- Using SAS token for accessing “external” ADLS: If you’re able to access the external ADLS with a SAS token in data flows but not in Synapse notebooks, it might be due to some configuration issues. The code snippet you provided seems correct, but the error suggests that there might be an issue with the SAS token or the way it’s being used.
Here are a few things you could check:
- Ensure that the SAS token has the necessary permissions for the operations you’re trying to perform.
- Make sure the SAS token hasn’t expired.
- Verify that the SAS token is correctly set in the spark.storage.synapse.mycontainer.adlsasa.dfs.core.windows.net.sas configuration setting (a consolidated sketch follows this list).
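For reference while checking these points, here is a consolidated version of your snippet with all three settings scoped to the same account and set uniformly through spark.conf. The SAS value is a placeholder; the token needs at least Read and List permissions on the container for spark.read.csv to succeed:

```python
# Sketch: per-account SAS configuration with all three keys targeting the same
# account. "<SAS token>" is a placeholder with Read + List container permissions.
account = "adlsasa.dfs.core.windows.net"
spark.conf.set(f"fs.azure.account.auth.type.{account}", "SAS")
spark.conf.set(f"fs.azure.sas.token.provider.type.{account}",
               "com.microsoft.azure.synapse.tokenlibrary.ConfBasedSASProvider")
spark.conf.set(f"spark.storage.synapse.mycontainer.{account}.sas", "<SAS token>")

df = spark.read.csv(f"abfss://mycontainer@{account}/SampleCSVFile_2kb.csv")
df.show(10)
```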
If you continue to face issues, I recommend checking the Microsoft documentation.
Regarding your last point, the statement “If the workspace creator is also the owner of the ADLS Gen2 storage account, then Azure Synapse will assign the Storage Blob Data Contributor role to the managed identity” means that if the same person or entity that created the Synapse workspace also owns the ADLS Gen2 storage account, then Azure Synapse can automatically assign the necessary role. However, in your case, since the owners of the ADLS are your customers from different domains/AAD/entraID, this automatic assignment won’t happen, and the role needs to be manually assigned by the owner of the ADLS Gen2 storage account.
-
Peter Michalik 20 Reputation points
2024-02-22T09:15:56.6733333+00:00 @phemanth "The owner of the ADLS Gen2 storage account can manually assign the Storage Blob Data Contributor role to the managed identity."
How can we do that "manual assignment" with the customer if the customer does not see the subscription where our Synapse workspace is located?
-
phemanth 5,840 Reputation points • Microsoft Vendor
2024-02-23T12:27:32.1766667+00:00 If the customer does not see the subscription where your Synapse workspace is located, you might need to share access to the specific resource or resource group with them. You can do this by adding their account as a guest in your Azure Active Directory and assigning appropriate permissions.
Here are the steps to manually assign the role:
- Add Guest User: In your Azure Active Directory, add the customer's account as a guest user (see the sketch after these steps).
- Assign Permissions: Assign the necessary permissions to the guest user for the specific resource or resource group where your Synapse workspace is located.
- Role Assignment: Now, the customer should be able to see the subscription and manually assign the “Storage Blob Data Contributor” role to the managed identity of the Synapse workspace.
Please note that these steps involve certain permissions and changes in your Azure settings, so they should be performed by an administrator or someone with sufficient privileges.
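As an illustration of step 1, the guest invitation can also be automated through the Microsoft Graph invitations API instead of the portal. A hedged sketch in Python: the email address, redirect URL, and access token are placeholders, and the caller needs an appropriate Graph permission (such as User.Invite.All):

```python
# Hedged sketch: invite the customer's admin as a B2B guest user via the
# Microsoft Graph invitations API. Token, email, and URL are placeholders.
import requests

token = "<graph-access-token>"  # acquired out of band, e.g. via azure-identity
resp = requests.post(
    "https://graph.microsoft.com/v1.0/invitations",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "invitedUserEmailAddress": "admin@customer.example",
        "inviteRedirectUrl": "https://portal.azure.com",
    },
)
resp.raise_for_status()
print(resp.json()["inviteRedeemUrl"])  # redemption link to send to the customer
```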
-
Peter Michalik 20 Reputation points
2024-02-23T14:35:41.6533333+00:00 @phemanth Well, I did it. Now my customer is able to switch between directories and see my Synapse workspace, but that does not solve my issue, because I need to see all resources at the same time (mine and also the customer's) to pick my Synapse workspace in the IAM properties of their ADLS. That's not possible, because I am either in my directory or in the customer's: I can see everything, but not at the same time.
-
phemanth 5,840 Reputation points • Microsoft Vendor
2024-02-26T10:51:17.71+00:00 @Peter Michalik I understand your challenge. Unfortunately, the Azure portal does not currently support viewing resources across multiple directories simultaneously. You can switch between directories, but you cannot view resources from multiple directories at the same time. However, there are a few potential workarounds:
- Azure Lighthouse: This service offers cross-tenant management, allowing service providers to view and manage Azure resources across all their customers from a single control plane. This could potentially allow you to manage resources that reside in the customer’s tenant.
- Azure Policy: If the customer has the necessary permissions, they can use Azure Policy to assign roles at a management group or subscription level. This policy will apply to all resources within the scope of the management group or subscription.
- Azure Resource Graph: This is a service in Azure that is designed to extend Azure Resource Management by providing efficient and performant resource exploration with the ability to query at scale across a given set of subscriptions. You can use it to explore your Azure resources.
- PowerShell or Azure CLI: You can use scripting tools like PowerShell or the Azure CLI to list resources across multiple subscriptions, then filter and manipulate this data as needed (see the sketch below).
Please note that these are potential solutions, and the exact implementation might vary based on your specific setup and requirements. Always ensure to follow best practices for security and compliance in your organization.
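To illustrate the scripting option with Python and the Azure SDK (a hedged sketch; a credential is scoped to one tenant, so you would run it once per tenant after signing in to that tenant):

```python
# Hedged sketch: list storage accounts across every subscription visible to the
# signed-in credential. Requires azure-identity and azure-mgmt-resource; run
# once per tenant (e.g. after `az login --tenant <tenant-id>`).
from azure.identity import AzureCliCredential
from azure.mgmt.resource import ResourceManagementClient, SubscriptionClient

credential = AzureCliCredential()
for sub in SubscriptionClient(credential).subscriptions.list():
    resources = ResourceManagementClient(credential, sub.subscription_id)
    for account in resources.resources.list(
            filter="resourceType eq 'Microsoft.Storage/storageAccounts'"):
        print(f"{sub.display_name}: {account.name}")
```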
-
phemanth 5,840 Reputation points • Microsoft Vendor
2024-02-28T06:53:11.0666667+00:00 @Peter Michalik We haven't heard from you on the last response and were just checking back to see if you have a resolution yet. If you do have a resolution, please share it with the community, as it can be helpful to others. Otherwise, we will respond with more details and try to help.
-
Peter Michalik 20 Reputation points
2024-02-29T08:56:20.6266667+00:00 @phemanth Thanks for all of your support. Currently I'm working with my team to try one of the solutions you shared with me. Unfortunately, I don't have admin rights to do it on my own, so I always need to communicate with our DevOps teams. It will take some time to confirm whether we have found a solution, but I will definitely share the results with the community!
But anyway, is there a chance to schedule a call with you, with screen sharing etc.?
Peter
-
phemanth 5,840 Reputation points • Microsoft Vendor
2024-03-06T08:41:10.8766667+00:00 sorry it is not possible . I hope above solution may helpful. let us know how it goes and if you any further queries.