Binary/Pickle file I/O on storage account blob from Synapse

Vithal Korimilli 1 Reputation point Microsoft Employee
2022-08-16T20:49:08.73+00:00

We need to upload a binary/gz file to an Azure storage account and read it back. We tried several methods, but each one produces an error.

Method #1
import sys
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate()
token_library = sc._jvm.com.microsoft.azure.synapse.tokenlibrary.TokenLibrary
connection_string = token_library.getSecret("....") # using azure keyvault name

mssparkutils.fs.put("synfs:/140/Writing/testingwritetext.txt", local_file_name, True)

This works only for txt/csv files.
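A possible explanation, offered as an assumption rather than a confirmed diagnosis: mssparkutils.fs.put takes the file content as a string, so it suits text but not raw bytes. The sketch below writes and reads a gzip-compressed pickle through ordinary Python file calls. It assumes the "Writing" mount point from the synfs path above is also reachable through the local file API at /synfs/<job id>/Writing. The helper names and the example file name are hypothetical.

```python
import gzip
import pickle


def write_pickle_gz(path, obj):
    """Serialize obj with pickle and write it gzip-compressed to path."""
    with gzip.open(path, "wb") as fh:
        pickle.dump(obj, fh, protocol=pickle.HIGHEST_PROTOCOL)


def read_pickle_gz(path):
    """Read back an object written by write_pickle_gz (trusted data only)."""
    with gzip.open(path, "rb") as fh:
        return pickle.load(fh)


# Hypothetical usage inside a Synapse notebook (not verified here):
#   job_id = mssparkutils.env.getJobId()
#   write_pickle_gz(f"/synfs/{job_id}/Writing/model.pkl.gz", {"a": 1})
#   restored = read_pickle_gz(f"/synfs/{job_id}/Writing/model.pkl.gz")
```

Because both helpers take a plain path, they can be checked against a local temporary directory before pointing them at the mounted storage.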

Method #2

from azure.storage.blob import BlobServiceClient

output_blob_service_client = BlobServiceClient(account_url=f"https://{storage_account}.blob.core.windows.net/", credential=connection_string)
output_blob_client = output_blob_service_client.get_blob_client(container='test-container', blob=local_file_name)
output_blob_client.upload_blob(f, overwrite=True)

Here "f" has to be a binary file for our needs, but upload_blob seems to expect f to be in a closed state.
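For the binary case, upload_blob accepts bytes as well as an open binary stream, so one option is to build the payload in memory, which sidesteps the question of whether a file handle is open or closed. This is a minimal sketch, assuming the blob client is a BlobClient created as in Method #2 and that download_blob().readall() returns the stored bytes. The helper names are hypothetical.

```python
import gzip
import pickle


def to_gzip_pickle(obj):
    """Pickle obj and gzip-compress the result into a bytes payload."""
    return gzip.compress(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


def from_gzip_pickle(payload):
    """Inverse of to_gzip_pickle. Only unpickle data from a trusted source."""
    return pickle.loads(gzip.decompress(payload))


def upload_object(blob_client, obj):
    """Upload obj as a gzip-compressed pickle, replacing any existing blob."""
    blob_client.upload_blob(to_gzip_pickle(obj), overwrite=True)


def download_object(blob_client):
    """Download the blob and restore the original Python object."""
    return from_gzip_pickle(blob_client.download_blob().readall())
```

An existing .gz file on local disk could be sent the same way by reading it with open(path, "rb") and passing the resulting bytes to upload_blob.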

Method #3
Using "fsspec"
Approach #1
fsspec_handle = fsspec.open('abfss://test-container@teststorage.dfs.core.windows.net/TestStorageAccount.csv', account_name=adls_account_name, sas_token=sas_key)
This gives the error "ValueError: Protocol not known: abfss".
Approach #2
If I remove the "s" and use abfs://… in the above command, it gives a different error: "UnicodeError: encoding with 'idna' codec failed".
Approach #3
If we use "///" instead (abfss:///), it gives "HttpResponseError: Public access is not permitted on this storage account."
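These errors may come from the URL form rather than from the credentials. As an assumption to verify against the installed adlfs version: fsspec's Azure backend (adlfs) is commonly addressed as abfs://<container>/<path>, with the account passed separately as account_name, and a missing or older adlfs could explain "Protocol not known: abfss". The sketch below only splits the original URL into those parts using the standard library. split_abfss_url is a hypothetical helper, and the commented fsspec call is untested.

```python
from urllib.parse import urlsplit


def split_abfss_url(url):
    """Split abfs[s]://<container>@<account>.dfs.core.windows.net/<path>.

    Returns (container, account_name, path_in_container).
    """
    parts = urlsplit(url)
    if parts.scheme not in ("abfs", "abfss"):
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")
    if not parts.username or not parts.hostname:
        raise ValueError("expected <container>@<account>.dfs.core.windows.net")
    # The account name is the first label of the host name.
    account_name = parts.hostname.split(".")[0]
    return parts.username, account_name, parts.path.lstrip("/")


container, account_name, blob_path = split_abfss_url(
    "abfss://test-container@teststorage.dfs.core.windows.net/TestStorageAccount.csv"
)
short_url = f"abfs://{container}/{blob_path}"

# Hypothetical follow-up, assuming adlfs is installed and sas_key is valid:
#   with fsspec.open(short_url, "rb", account_name=account_name,
#                    sas_token=sas_key) as fh:
#       data = fh.read()
```

Opening with mode "rb" or "wb" matters for binary and gz content, since a text-mode handle would try to decode the bytes.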

Method #4

df.write.parquet(parquet_path, mode = 'overwrite')
df.write.json(json_path, mode = 'overwrite')
df.write.csv(csv_path, mode = 'overwrite', header = 'true')

This writes a parquet file to Blob storage instead of the actual file.
