If you just want to read log files from an ADLS Gen2 storage account in a Synapse workspace using PySpark and save the results to a Data Lake Storage account, here is an example code snippet:
# Import required modules
from pyspark.sql.functions import col

# Define storage account credentials
storage_account_name = '<storage_account_name>'
storage_account_key = '<storage_account_key>'
container_name = '<container_name>'
folder_path = '<folder_path>'

# Register the account key with Spark so it can authenticate against the storage account
spark.conf.set(f'fs.azure.account.key.{storage_account_name}.blob.core.windows.net', storage_account_key)

# Create a PySpark DataFrame from the log files
df_logs = spark.read.format('csv') \
    .option('header', True) \
    .option('inferSchema', True) \
    .load(f'wasbs://{container_name}@{storage_account_name}.blob.core.windows.net/{folder_path}')
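# Optional sketch (an alternative, not part of the original snippet): for large log
# sets, an explicit schema avoids the extra pass over the data that inferSchema makes.
# The column names below are assumptions based on the filter further down.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
log_schema = StructType([
    StructField('url', StringType(), True),
    StructField('status', IntegerType(), True),
])
# df_logs = spark.read.format('csv').option('header', True).schema(log_schema).load(...)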
# Filter and process the data: keep successful requests (status 200),
# then count hits per URL, most-visited first
df_filtered = df_logs.filter(col('status') == 200) \
    .groupBy(col('url')).count() \
    .orderBy(col('count').desc())
# Write the filtered data to the data lake attached to the workspace
df_filtered.write.format('parquet') \
    .mode('overwrite') \
    .option('compression', 'snappy') \
    .save('abfss://<file_system_name>@<data_lake_name>.dfs.core.windows.net/<data_lake_folder>')
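Since the source is a Gen2 account, the ABFS driver (abfss:// on the dfs endpoint) is the recommended way to reach it, rather than the legacy wasbs driver used above. A minimal sketch of the same read over abfss, assuming the account has the hierarchical namespace enabled and key-based auth is acceptable:
# Same read through the ABFS driver (sketch; df_logs_abfss is a new name, not from the snippet above)
spark.conf.set(f'fs.azure.account.key.{storage_account_name}.dfs.core.windows.net', storage_account_key)
df_logs_abfss = spark.read.format('csv') \
    .option('header', True) \
    .option('inferSchema', True) \
    .load(f'abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/{folder_path}')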
Here we read the log files from the specified folder path in the storage account container using spark.read. We then filter the rows to successful requests (status 200), group them by URL, and count the results, sorted by hit count. Finally, we write the filtered data as Parquet to a data lake attached to the Synapse workspace using df_filtered.write.
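If you want to sanity-check the result, a quick sketch (using the same placeholder path as the write above) is to read the Parquet output back and show a few rows:
# Read the written Parquet back and inspect the top rows (placeholder path as above)
df_check = spark.read.format('parquet').load('abfss://<file_system_name>@<data_lake_name>.dfs.core.windows.net/<data_lake_folder>')
df_check.show(10)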
Hope this helps