Extract content from PDF files stored in an Azure blob

José Roberto Madureira Junior 1 Reputation point
2021-10-24T17:33:55.437+00:00

I am able to list the files in an Azure Blob container; however, I am not able to read the contents of the PDF files in that container.

# Note: BlockBlobService belongs to the legacy azure-storage-blob 2.x SDK.
from azure.storage.blob import BlockBlobService, PublicAccess

# Create the BlockBlobService that the system uses to call the Blob service for the storage account.
block_blob_service = BlockBlobService(
    account_name='<NAME>', account_key='<KEY>')

# Create a container called 'dataset'.
container_name = 'dataset'
block_blob_service.create_container(container_name)

# Set the permission so the blobs are public.
block_blob_service.set_container_acl(
    container_name, public_access=PublicAccess.Container)

# List the blobs in the container.
print("\nList blobs in the container")
generator = block_blob_service.list_blobs(container_name)
for blob in generator:
    print("\t Blob name: " + blob.name)

How can I list files and extract the contents of a file present in Azure Blob Storage?

Azure Storage

1 answer

  1. Sumarigo-MSFT 47,476 Reputation points Microsoft Employee Moderator
    2021-10-25T10:32:36.237+00:00

    @José Roberto Madureira Junior Welcome to the Microsoft Q&A Forum, and thank you for posting your query here!

    This article shows you how to use query acceleration to retrieve a subset of data from your storage account: https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-query-acceleration-how-to?tabs=azure-powershell%2Cpowershell

    At present, query acceleration can retrieve blob contents only for CSV and JSON; PDF is not supported. The workaround is to download the PDF and do the filtering on the client side, using a third-party PDF reader in your application.
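The download step of this workaround could look roughly like the sketch below. It assumes the legacy `BlockBlobService` client from the question and its `get_blob_to_bytes` method; the helper name `download_pdf_bytes` and the `%PDF-` header check are illustrative additions, not something the answer prescribes.

```python
def download_pdf_bytes(service, container_name, blob_name):
    """Download one blob into memory and check that it looks like a PDF.

    `service` is assumed to be a legacy azure-storage-blob 2.x
    BlockBlobService, whose get_blob_to_bytes() returns a Blob object
    carrying the raw bytes in `.content`.
    """
    blob = service.get_blob_to_bytes(container_name, blob_name)
    data = blob.content
    # A well-formed PDF starts with the "%PDF-" marker; reject anything else early.
    if not data.startswith(b"%PDF-"):
        raise ValueError(f"{blob_name!r} does not look like a PDF file")
    return data
```

With the client from the question, this might be called as `download_pdf_bytes(block_blob_service, 'dataset', '<BLOB NAME>')`; the returned bytes can then be handed to whichever PDF library the application uses.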

    There is a similar thread discussion here; please refer to the suggestions mentioned there and let me know the status.

    Additional information: the easiest way to list and extract the PDF files in Azure Blob Storage is to use the Azure Storage Explorer tool. With Azure PowerShell, you can list the blobs in a container and filter them by name:

    Get-AzStorageBlob -Container 'containerName' -Context $StorageContext | Where-Object { $_.Name -like '*.pdf' }
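
A name filter that keeps only the `.pdf` blobs can also be sketched in Python, to sit alongside the listing loop from the question. The helper name `pdf_blob_names` is hypothetical; it only inspects names and does not touch the storage service.

```python
def pdf_blob_names(blob_names):
    """Return only the names that end in .pdf (case-insensitive)."""
    return [name for name in blob_names if name.lower().endswith(".pdf")]
```

With the question's client, the names could be supplied as `pdf_blob_names(blob.name for blob in block_blob_service.list_blobs(container_name))`.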
    

    If you want to read a PDF from an Azure blob, please refer to the following code:
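
As a minimal sketch of reading a PDF once its bytes have been downloaded from the blob (for example via the legacy SDK's `get_blob_to_bytes(...).content`), the snippet below extracts the text page by page and then filters on the client side. It assumes the third-party `pypdf` package, since the answer does not name a PDF library; the function names `extract_pdf_text` and `pages_containing` are hypothetical.

```python
import io


def extract_pdf_text(pdf_bytes):
    """Return the extracted text of each page of an in-memory PDF."""
    # pypdf is a third-party package (pip install pypdf); its use here is an
    # assumption, since the original answer does not name a PDF library.
    from pypdf import PdfReader

    reader = PdfReader(io.BytesIO(pdf_bytes))
    # Scanned or image-only pages may produce no text at all.
    return [page.extract_text() or "" for page in reader.pages]


def pages_containing(page_texts, keyword):
    """Return 1-based page numbers whose text contains keyword (case-insensitive)."""
    needle = keyword.lower()
    return [number for number, text in enumerate(page_texts, start=1)
            if needle in text.lower()]
```

Text extraction only works for PDFs that carry a text layer; scanned documents would need OCR, which is outside the scope of this sketch.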

    Please let us know if you have any further queries. I’m happy to assist you further.
    Looking forward to your reply!

    -------------------------------------------------------------------------------------------------------------------------------------------------------

    Please do not forget to “Accept the answer” and “up-vote” wherever the information provided helps you; this can be beneficial to other community members.
