Using pySpark in Azure Synapse to read files from a non-Data Lake storage account

Paul Beare 25 Reputation points
2023-05-01T21:14:39.3233333+00:00

Hi, I have a storage account where the logs of websites go. This is a Gen2 storage account but NOT a Data Lake. I want to read the log files from this account from our Synapse workspace using PySpark (or any of the other languages) to filter/process the data before loading the results into a data lake attached to the workspace, then use serverless SQL to make the data available to Power BI report(s). I have managed to get the files in via a pipeline; however, this generated a lot of data movement expense. Is there a better way of doing this?

Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.

Accepted answer
  1. FERGUS ESSO KETCHA ASSAM 120 Reputation points Student Ambassador
    2023-05-09T09:36:29.9133333+00:00

    If you just want to read log files from the Gen2 storage account in your Synapse workspace using PySpark and save the results to an ADLS Gen2 account, here is an example code snippet:

    # Import required modules
    from pyspark.sql.functions import col
    # Define source storage account credentials
    storage_account_name = '<storage_account_name>'
    storage_account_key = '<storage_account_key>'
    container_name = '<container_name>'
    folder_path = '<folder_path>'
    # Register the account key so Spark can authenticate against the Blob endpoint
    spark.conf.set(f'fs.azure.account.key.{storage_account_name}.blob.core.windows.net', storage_account_key)
    # Create a PySpark DataFrame from the log files (CSV with header, schema inferred)
    df_logs = spark.read.format('csv').option('header', True).option('inferSchema', True).load(f'wasbs://{container_name}@{storage_account_name}.blob.core.windows.net/{folder_path}')
    # Filter and process the data: keep successful requests and count hits per URL
    df_filtered = df_logs.filter(col('status') == 200).groupBy(col('url')).count().orderBy(col('count').desc())
    # Write the filtered data to the data lake attached to the workspace
    # (note the container name before the '@' in the abfss URI)
    df_filtered.write.format('parquet').mode('overwrite').option('compression', 'snappy').save('abfss://<container_name>@<data_lake_name>.dfs.core.windows.net/<data_lake_folder>')
    

    Here we set the source account key in the Spark configuration, then read the log files from the specified folder path in the container using spark.read. We then filter the data based on a condition, group it, and count the results. Finally, we write the filtered data to the data lake attached to the Synapse workspace using df_filtered.write.
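
    If you prefer not to put the full account key in the notebook, a SAS token scoped to the source container should also work with the WASB driver. A minimal sketch, assuming a placeholder '<sas_token>' with read/list permissions:

    # Sketch: authenticate to the source Blob container with a SAS token instead of the account key
    spark.conf.set(f'fs.azure.sas.{container_name}.{storage_account_name}.blob.core.windows.net', '<sas_token>')
    # The wasbs:// read path is the same as above
    df_logs = spark.read.format('csv').option('header', True).load(f'wasbs://{container_name}@{storage_account_name}.blob.core.windows.net/{folder_path}')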

    Hope this helps


1 additional answer

  1. FERGUS ESSO KETCHA ASSAM 120 Reputation points Student Ambassador
    2023-05-09T09:33:39.5+00:00

    To list all blobs and subdirectories in a given storage account, you can use the Azure Storage SDK for Python to enumerate the containers and blobs in the storage account. Here's an example code snippet:

    from azure.storage.blob import BlobServiceClient

    # Create a BlobServiceClient object
    conn_str = "DefaultEndpointsProtocol=https;AccountName=<your_account_name>;AccountKey=<your_account_key>;EndpointSuffix=core.windows.net"
    blob_service_client = BlobServiceClient.from_connection_string(conn_str)

    # List all containers in the storage account
    containers = blob_service_client.list_containers()

    # Loop through all containers
    for container in containers:
        print("Container name: " + container.name)

        # List all blobs (including those under virtual subdirectories) in the container
        container_client = blob_service_client.get_container_client(container.name)
        blobs = container_client.list_blobs()

        # Loop through all blobs
        for blob in blobs:
            print("Blob name: " + blob.name)
    
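    If the logs always land under a known prefix, you can also filter while enumerating and hand the matching paths straight to Spark. A minimal sketch, assuming a hypothetical 'weblogs/' prefix and '.log' file names, reusing the placeholder account and container names from above:

    # Sketch: enumerate only blobs under a prefix and build wasbs:// paths for Spark
    # ('weblogs/' and the '.log' suffix are assumptions for illustration)
    container_client = blob_service_client.get_container_client("<container_name>")
    log_paths = [
        f"wasbs://<container_name>@<your_account_name>.blob.core.windows.net/{b.name}"
        for b in container_client.list_blobs(name_starts_with="weblogs/")
        if b.name.endswith(".log")
    ]
    # spark.read accepts a list of paths, so only the matching files are read
    df_logs = spark.read.text(log_paths)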

    To filter and write the blobs to another storage account, you can use the following steps:

    1. Create a PySpark DataFrame that reads the blobs from the source storage account using the binaryFile data source.
    2. Use the filter method to select the blobs that match the filter criteria.
    3. Write the filtered records to the target storage account using the write method (the binaryFile source is read-only, so save them in a writable format such as Parquet).
    from pyspark.sql.functions import col

    # Read the blobs from the source storage account.
    # The binaryFile source has a fixed schema (path, modificationTime, length, content),
    # so no user-defined schema is needed.
    df = spark.read.format("binaryFile").option("recursiveFileLookup", "true").load("wasbs://<source_container>@<source_account>.blob.core.windows.net/")

    # Filter the blobs based on the file name
    filtered_df = df.filter(col("path").contains("example")).select("path", "content")

    # Write the filtered records to the target storage account.
    # The binaryFile source does not support writing, so persist the path/content columns as Parquet.
    # (As before, the account keys or SAS tokens for both accounts must be set in the Spark configuration.)
    filtered_df.write.format("parquet").mode("overwrite").save("wasbs://<target_container>@<target_account>.blob.core.windows.net/<target_folder>")
    
    
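    Since web logs are text files, the binary 'content' column can be decoded back to strings before further processing. A minimal sketch, assuming the logs are UTF-8 encoded:

    # Sketch: decode the binary 'content' column to text (assumes UTF-8 log files)
    from pyspark.sql.functions import col, decode, explode, split

    df_text = filtered_df.withColumn("text", decode(col("content"), "UTF-8"))
    # Optionally split each file into individual log lines for row-level filtering
    df_lines = df_text.select(col("path"), explode(split(col("text"), "\n")).alias("line"))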
