SQL query on blob storage

Shambhu Rai 1,401 Reputation points

Hi Expert,

How can I write SQL queries on Azure Blob Storage files from a Databricks notebook, without using Azure Synapse?

Tags: Azure Data Lake Storage, Azure Synapse Analytics, Azure Databricks, Azure Data Factory, Azure Data Lake Analytics

2 answers

  1. Sander van de Velde 26,806 Reputation points MVP

    Hello @Shambhu Rai ,

    To connect to Azure Blob Storage, see the Azure Databricks documentation on connecting to Azure Data Lake Storage Gen2 and Blob Storage.

    Use the fully qualified ABFS URI to access data secured with Unity Catalog.

    A Python example from that page:

    spark.sql("SELECT * FROM parquet.`abfss://container@storageAccount.dfs.core.windows.net/external-location/path/to/data`")

    Pay extra attention to the credentials needed, shown at the bottom of that documentation.
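
    As a concrete illustration, here is a minimal sketch of the same direct-query approach. All account, container, and path names below are placeholders, and the account-key configuration shown is one alternative to Unity Catalog credentials:

```python
# Sketch: query Parquet files directly over abfss://.
# Account, container, path, and key values are placeholders.

def abfss_uri(container: str, account: str, path: str) -> str:
    """Build the fully qualified ABFS URI for a path in a container."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path}"

uri = abfss_uri("mycontainer", "mystorageacct", "external-location/path/to/data")

# On a real cluster (requires Azure credentials; `spark` is predefined there):
# spark.conf.set("fs.azure.account.key.mystorageacct.dfs.core.windows.net",
#                "<access-key>")
# df = spark.sql(f"SELECT * FROM parquet.`{uri}`")
```

    The backticks around the URI in the SQL statement are required, since the path is used directly as a table identifier.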

    If the response helped, please click "Accept Answer". If it doesn't work, please let us know your progress. All community members with similar issues will benefit. Your contribution is highly appreciated.

  2. PRADEEPCHEEKATLA-MSFT 73,651 Reputation points Microsoft Employee

    @Shambhu Rai - Thanks for the question and using MS Q&A platform.

    To write SQL queries on Azure Blob Storage files using Databricks notebook, you can follow the steps below:

    Step1: Create an Azure Databricks workspace, cluster, and notebook.

    Step2: Mount the Azure Blob Storage container to the Databricks file system. You can use the following code snippet to mount the container:

    Replace <application-id>, <application-secret>, <tenant-id>, <container-name>, <storage-account-name>, and <mount-name> with your own values.

    configs = {"fs.azure.account.auth.type": "OAuth",
               "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
               "fs.azure.account.oauth2.client.id": "<application-id>",
               "fs.azure.account.oauth2.client.secret": "<application-secret>",
               "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"}

    dbutils.fs.mount(
        source="abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
        mount_point="/mnt/<mount-name>",
        extra_configs=configs)

    Step3: Read the files from the mounted directory using the spark.read function. You can use the following code snippet to read a CSV file:

    Replace <mount-name> and <file-name> with your own values.

    df = spark.read.format("csv").option("header", "true").load("/mnt/<mount-name>/<file-name>.csv")

    Step4: Register the DataFrame as a temporary view, then run SQL queries on it using the spark.sql function. You can use the following code snippet:

    Replace <table-name> and <condition> with your own values.

    df.createOrReplaceTempView("<table-name>")
    result = spark.sql("SELECT * FROM <table-name> WHERE <condition>")
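
    Putting steps 3 and 4 together, here is a minimal sketch. The mount, file, table, and column names are hypothetical, and the Spark calls are commented out because the `spark` session only exists on a running cluster:

```python
# Hypothetical names for illustration only.
mount_name = "mydata"
file_name = "sales"

# Path to the CSV file under the mount created in Step2.
csv_path = f"/mnt/{mount_name}/{file_name}.csv"

# On a Databricks cluster, where `spark` is predefined:
# df = spark.read.format("csv").option("header", "true").load(csv_path)
# df.createOrReplaceTempView("sales")                     # name used in SQL below
# result = spark.sql("SELECT * FROM sales WHERE amount > 100")
# result.show()
```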

    For more details, refer to Connect to Azure Data Lake Storage Gen2 and Blob Storage.

    Hope this helps. Do let us know if you have any further queries.

    If this answers your query, please click Accept Answer and Yes for "was this answer helpful".