Is there a way to read files in python from ADLS gen2 using Synapse?

Evan Brody 35 Reputation points
2024-01-11T18:14:08.0566667+00:00

I am looking to read in files of different formats with Python in a Synapse notebook. These include .pdf, .pptx, .docx, .msg, and .eml. I would like to be able to read in the files, then parse and manipulate them with Python.
I was able to do this in Databricks using different Python libraries. For example, I can use code like this:

# PowerPoint files
from pptx import Presentation
prs = Presentation(file_name)

# PDF files
from pypdf import PdfReader
reader = PdfReader(open(file_name, 'rb'))

# Word docs
import docx
doc = docx.Document(file_name)

# .eml files
import email
msg = email.message_from_file(open(file_name))

How can I do this in Synapse? I have been getting an error:

FileNotFoundError: [Errno 2] No such file or directory

These file paths work to read in data using Spark or pandas. The format is 'abfs[s]://file_system_name@account_name.dfs.core.windows.net/file_path'

Using mounting, I am able to read in text files but not other formats. https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/synapse-file-mount-api

Azure Synapse Analytics

Accepted answer
  1. Amira Bedhiafi 33,071 Reputation points Volunteer Moderator
    2024-01-15T12:34:19.8466667+00:00

    In Synapse, you can mount file systems from ADLS Gen2 for easier access. You've mentioned that you can read text files using mounting, which is a good start. The process should be similar for other file formats, but binary formats need to be opened as bytes and parsed with a library that understands the specific file type. Suggested libraries by file type:

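    • PDF: You can use libraries like pypdf (the successor to PyPDF2) or pdfplumber to read PDF files.
    • PPTX: The python-pptx library can be used to read PowerPoint files, as you've done in your example.
    • DOCX: For Word documents, python-docx is suitable.
    • EML: This format can be read using the email module in Python's standard library.
    • MSG: Outlook .msg files are a binary format that the email module does not parse, so a third-party library such as extract-msg is typically needed.

    You can adapt your code to this one: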
    from azure.identity import DefaultAzureCredential
    from azure.storage.filedatalake import DataLakeServiceClient
    # Establish the connection to your ADLS Gen2 account
    service_client = DataLakeServiceClient(account_url="https://{account_name}.dfs.core.windows.net", credential=DefaultAzureCredential())
    # Access a specific file system and file path
    file_system_client = service_client.get_file_system_client(file_system="{file_system_name}")
    file_client = file_system_client.get_file_client("file_path")
    # Download the file content
    downloaded_bytes = file_client.download_file().readall()
    # Depending on the file format, use appropriate libraries to read the content
    # This is an example for a DOCX file
    import io
    import docx
    doc_stream = io.BytesIO(downloaded_bytes)
    doc = docx.Document(doc_stream)
    # the 'doc' contains your Word document
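
The same in-memory pattern can be extended to the other formats from the question. The following is only a minimal sketch: the names `parse_downloaded` and `PARSERS` are hypothetical helpers introduced for illustration, and it assumes pypdf, python-pptx and python-docx are installed on the Spark pool (they are imported lazily, so only the library for the requested format is needed).

```python
import io
import os
from email import message_from_bytes, policy


def parse_eml(data):
    # Standard-library parser; policy.default returns a modern EmailMessage.
    return message_from_bytes(data, policy=policy.default)


def parse_pdf(data):
    from pypdf import PdfReader  # assumed to be installed on the pool
    return PdfReader(io.BytesIO(data))


def parse_pptx(data):
    from pptx import Presentation  # python-pptx, assumed to be installed
    return Presentation(io.BytesIO(data))


def parse_docx(data):
    import docx  # python-docx, assumed to be installed
    return docx.Document(io.BytesIO(data))


# Hypothetical lookup table: file extension -> parser for raw bytes.
PARSERS = {
    ".eml": parse_eml,
    ".pdf": parse_pdf,
    ".pptx": parse_pptx,
    ".docx": parse_docx,
}


def parse_downloaded(file_name, data):
    """Pick a parser from the file extension and apply it to the bytes."""
    ext = os.path.splitext(file_name)[1].lower()
    if ext not in PARSERS:
        raise ValueError(f"no parser registered for {ext!r}")
    return PARSERS[ext](data)
```

With the code above, `parse_downloaded("file_path", downloaded_bytes)` would return the parsed object for a supported extension. Outlook .msg files are deliberately left out because they need a separate third-party parser.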
    
    

    I used the Azure SDK for Python to interact with ADLS Gen2. The DefaultAzureCredential class is part of the Azure Identity library, which provides a straightforward way to get credentials from your environment. This is particularly useful in cloud environments like Azure Synapse, where it may be able to pick up credentials already available to the service (for example a managed identity), provided that identity has been granted access to the storage account.
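
If you prefer to keep using the mount approach from the question, the linked file mount documentation describes a local path of the form `/synfs/{jobId}/{mount point}/{file}` for the local file API. A small sketch of building that path follows; `synfs_local_path` is a hypothetical helper, and the mount point `/test` and the file name are placeholders. The key point is to open binary formats in `'rb'` mode.

```python
import posixpath


def synfs_local_path(job_id, mount_point, relative_path):
    """Build the local file path for a file under a Synapse mount point."""
    return posixpath.join(
        "/synfs",
        str(job_id),
        mount_point.strip("/"),
        relative_path.lstrip("/"),
    )


# Inside a Synapse notebook (not runnable elsewhere), usage might look like:
#   from notebookutils import mssparkutils
#   job_id = mssparkutils.env.getJobId()
#   local_path = synfs_local_path(job_id, "/test", "docs/report.pdf")
#   with open(local_path, "rb") as fh:  # binary mode for pdf/docx/pptx
#       reader = PdfReader(fh)
```

Libraries such as pypdf, python-pptx and python-docx work with local paths or file objects and do not understand `abfss://` URLs, which would explain the FileNotFoundError from the question.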

    1 person found this answer helpful.
