Binary/Pickle file I/O on a storage account blob from Synapse

Question

We need to upload a binary/gz file to an Azure storage account and read it back. We tried several methods, but each one fails with an error.

Method #1
from notebookutils import mssparkutils
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate()
token_library = sc._jvm.com.microsoft.azure.synapse.tokenlibrary.TokenLibrary
connection_string = token_library.getSecret("....")  # using Azure Key Vault name

# fs.put writes its second argument to the target path as string content
mssparkutils.fs.put("synfs:/140/Writing/testingwritetext.txt", local_file_name, True)

This works only for txt/csv files.
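One possible workaround for binary data, sketched under the assumption that the mount used in Method #1 is also reachable through the local file system (Synapse documents local paths of the form /synfs/{jobId}/...), is to write the gzip-compressed pickle with ordinary Python file APIs instead of fs.put. The helper names below are hypothetical.

```python
import gzip
import pickle


def write_gzip_pickle(obj, local_path):
    """Serialize obj with pickle and write it gzip-compressed to local_path."""
    with gzip.open(local_path, "wb") as fh:
        pickle.dump(obj, fh, protocol=pickle.HIGHEST_PROTOCOL)


def read_gzip_pickle(local_path):
    """Read back an object written by write_gzip_pickle."""
    with gzip.open(local_path, "rb") as fh:
        return pickle.load(fh)
```

Under that assumption, a call such as write_gzip_pickle(obj, "/synfs/140/Writing/obj.pkl.gz") would target the same mount as Method #1; the file name here is only an illustration.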

Method #2

from azure.storage.blob import BlobServiceClient

output_blob_service_client = BlobServiceClient(account_url=f"https://{storage_account}.blob.core.windows.net/", credential=connection_string)
output_blob_client = output_blob_service_client.get_blob_client(container='test-container', blob=local_file_name)
output_blob_client.upload_blob(f, overwrite=True)

Here “f” needs to be a binary file for our use case, but the call seems to expect f to be in a closed state.
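For reference, upload_blob accepts bytes or a stream opened in binary mode, so one way to pass a binary file is to keep the handle open inside a with block for the duration of the call. The sketch below assumes blob_client is an azure.storage.blob.BlobClient such as output_blob_client above; the two helper names are hypothetical.

```python
def upload_binary_file(blob_client, local_path):
    """Upload a local file as raw bytes via an open binary handle."""
    with open(local_path, "rb") as fh:
        # The handle must stay open while upload_blob reads from it.
        blob_client.upload_blob(fh, overwrite=True)


def download_binary_file(blob_client, local_path):
    """Download the blob's bytes and write them to local_path unchanged."""
    data = blob_client.download_blob().readall()
    with open(local_path, "wb") as fh:
        fh.write(data)
```

Because the bytes are passed through untouched, the same pair of helpers would apply to .gz or pickle files alike.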

Method #3
Using "fsspec"
Approach #1
import fsspec
fsspec_handle = fsspec.open('abfss://test-container@teststorage.dfs.core.windows.net/TestStorageAccount.csv', account_name=adls_account_name, sas_token=sas_key)
This gives the error “ValueError: Protocol not known: abfss”.
Approach #2
If I remove the “s” and use (abfs://…) in the command above, it gives a different error: “UnicodeError: encoding with 'idna' codec failed”.
Approach #3
If we use “///” (abfss:///) instead, it gives “HttpResponseError: Public access is not permitted on this storage account.”
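For comparison, the adlfs package (which registers the abfs protocol with fsspec) documents URLs of the form abfs://container/path, with the account name passed as a separate option rather than embedded in the host. The sketch below converts the Hadoop-style URL used above into that form. Both helper names are hypothetical, and whether this avoids the errors above depends on the installed fsspec/adlfs versions and on the SAS token's permissions, which is an assumption here.

```python
from urllib.parse import urlsplit


def to_adlfs_url(hadoop_url):
    """Split abfs[s]://container@account.dfs.core.windows.net/path
    into (account_name, 'abfs://container/path')."""
    parts = urlsplit(hadoop_url)
    if parts.scheme not in ("abfs", "abfss") or not parts.username or not parts.hostname:
        raise ValueError(f"not a container@account URL: {hadoop_url}")
    account_name = parts.hostname.split(".")[0]
    return account_name, f"abfs://{parts.username}{parts.path}"


def open_blob(hadoop_url, sas_token, mode="rb"):
    """Return an fsspec OpenFile for the blob; assumes adlfs is installed."""
    import fsspec  # imported lazily so the URL helper works without it

    account_name, url = to_adlfs_url(hadoop_url)
    return fsspec.open(url, mode=mode, account_name=account_name, sas_token=sas_token)
```

With this, the handle would be used as "with open_blob(url, sas_key) as fh: data = fh.read()", reading raw bytes because the default mode is "rb".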

Method #4

df.write.parquet(parquet_path, mode='overwrite')
df.write.json(json_path, mode='overwrite')
df.write.csv(csv_path, mode='overwrite', header=True)

This writes a Parquet file to Blob storage instead of the actual file.
