In Synapse, you can mount an ADLS Gen2 file system for easier access. You mentioned that you can already read text files through the mount, which is a good start. The process is similar for other file formats, but each one needs its own library to parse the content. By extension:
- PDF: PyPDF2 (now maintained as pypdf) or pdfplumber can read PDF files.
- PPTX: python-pptx can read PowerPoint files, as you've done in your example.
- DOCX: python-docx is suitable for Word documents.
- EML and MSG: EML files can be parsed with Python's built-in email module. MSG files use Outlook's binary format, which the email module does not parse, so they need a third-party library such as extract-msg.
You can adapt your code along these lines:
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient
# Establish the connection to your ADLS Gen2 account
service_client = DataLakeServiceClient(account_url="https://{account_name}.dfs.core.windows.net", credential=DefaultAzureCredential())
# Access a specific file system and file path
file_system_client = service_client.get_file_system_client(file_system="{file_system_name}")
file_client = file_system_client.get_file_client("{file_path}")
# Download the file content
downloaded_bytes = file_client.download_file().readall()
# Depending on the file format, use appropriate libraries to read the content
# This is an example for a DOCX file
import io
import docx
doc_stream = io.BytesIO(downloaded_bytes)
doc = docx.Document(doc_stream)
# the 'doc' contains your Word document
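To handle several formats with the same download step, one option is to pick the parser from the file extension. The following is only a sketch: the names read_docx, read_pptx, read_pdf, read_eml, READERS and extract_text are hypothetical helpers made up for illustration, and it assumes python-docx, python-pptx and pdfplumber are installed on the Spark pool. MSG files are left out because they need a separate library.

```python
import io
from email import message_from_bytes, policy
from pathlib import PurePosixPath


def read_docx(data: bytes) -> str:
    # python-docx is imported lazily so other formats work without it
    import docx
    document = docx.Document(io.BytesIO(data))
    return "\n".join(paragraph.text for paragraph in document.paragraphs)


def read_pptx(data: bytes) -> str:
    from pptx import Presentation
    presentation = Presentation(io.BytesIO(data))
    lines = []
    for slide in presentation.slides:
        for shape in slide.shapes:
            # Only some shapes (text boxes, placeholders) carry text
            if shape.has_text_frame:
                lines.append(shape.text_frame.text)
    return "\n".join(lines)


def read_pdf(data: bytes) -> str:
    import pdfplumber
    with pdfplumber.open(io.BytesIO(data)) as pdf:
        # extract_text() can return None for pages without a text layer
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def read_eml(data: bytes) -> str:
    # policy.default gives an EmailMessage, which provides get_body()
    message = message_from_bytes(data, policy=policy.default)
    body = message.get_body(preferencelist=("plain", "html"))
    return body.get_content() if body is not None else ""


# Hypothetical lookup table: extension -> reader function
READERS = {
    ".docx": read_docx,
    ".pptx": read_pptx,
    ".pdf": read_pdf,
    ".eml": read_eml,
}


def extract_text(file_path: str, data: bytes) -> str:
    """Return the text of a downloaded file, chosen by its extension."""
    extension = PurePosixPath(file_path).suffix.lower()
    reader = READERS.get(extension)
    if reader is None:
        raise ValueError(f"No reader registered for '{extension}' files")
    return reader(data)
```

With the code above, you would call extract_text("{file_path}", downloaded_bytes) after the download step instead of opening the document by hand.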
I used the Azure SDK for Python to interact with ADLS Gen2. The DefaultAzureCredential class is part of the Azure Identity library and tries several credential sources from your environment in turn, such as environment variables, a managed identity, or a developer login like the Azure CLI. This is particularly useful in cloud environments like Azure Synapse, where it can use the identity available to the service instead of credentials hard-coded in the notebook.
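If you need to process a whole folder rather than a single file, the same file system client can list paths and download each match. This is a rough sketch only: iter_files is a hypothetical helper name, and it assumes file_system_client is the FileSystemClient created in the code above.

```python
from pathlib import PurePosixPath
from typing import Iterator, Tuple


def iter_files(file_system_client, directory: str,
               extensions: Tuple[str, ...]) -> Iterator[Tuple[str, bytes]]:
    """Yield (path, content) for files under `directory` with a wanted extension."""
    wanted = {extension.lower() for extension in extensions}
    # get_paths() lists everything under the directory, including subfolders
    for path in file_system_client.get_paths(path=directory, recursive=True):
        if path.is_directory:
            continue
        if PurePosixPath(path.name).suffix.lower() not in wanted:
            continue
        file_client = file_system_client.get_file_client(path.name)
        yield path.name, file_client.download_file().readall()
```

For example, iterating over iter_files(file_system_client, "{directory}", (".docx", ".pdf")) gives you each matching path together with its bytes, which you can then pass to the parser for that format.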