In this tutorial, you'll learn how to analyze data located in a storage account.
So far, we've covered scenarios where data resides in databases in the workspace. Now we'll show you how to work with files in storage accounts. In this scenario, we'll use the workspace's primary storage account and the container that we specified when creating the workspace.
Run the following code in a new code cell in a notebook. It creates a CSV file and a Parquet file in the storage account.
Tip
The nyctaxi.passengercountstats table was created earlier in this quickstart series, in Quickstart: Get started analyzing with Spark.
%%pyspark
# Read the table created earlier in the quickstart into a DataFrame.
df = spark.sql("SELECT * FROM nyctaxi.passengercountstats")
df = df.repartition(1) # This ensures we'll get a single file during write()
# Write the data to the workspace's primary storage account, once as CSV and once as Parquet.
df.write.mode("overwrite").csv("/NYCTaxi/PassengerCountStats_csvformat")
df.write.mode("overwrite").parquet("/NYCTaxi/PassengerCountStats_parquetformat")
You can analyze the data in your workspace's default Azure Data Lake Storage (ADLS) Gen2 account, or you can link an ADLS Gen2 or Blob Storage account to your workspace through Manage > Linked services > New. The next steps refer to the primary ADLS Gen2 account.
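The relative paths in the earlier cell resolve against this primary account and container. To be explicit, or to target a different linked ADLS Gen2 account, you can use a fully qualified abfss:// URI instead. A minimal sketch, assuming the contosolake account and users container shown in the next steps; substitute your own values:
%%pyspark
# Fully qualified form: abfss://<container>@<account>.dfs.core.windows.net/<path>
# 'users' and 'contosolake' are this tutorial's container and account names.
path = "abfss://users@contosolake.dfs.core.windows.net/NYCTaxi/PassengerCountStats_parquetformat"
df = spark.read.parquet(path)
display(df.limit(10))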
In Synapse Studio, go to the Data hub, and then select Linked.
Go to Azure Data Lake Storage Gen2 > myworkspace (Primary - contosolake).
Select users (Primary). You should see the NYCTaxi folder. Inside it, you should see two folders: PassengerCountStats_csvformat and PassengerCountStats_parquetformat.
Open the PassengerCountStats_parquetformat folder. Inside, there's a Parquet file with a name like part-00000-2638e00c-0790-496b-a523-578da9a15019-c000.snappy.parquet.
Right-click the .parquet file, then select New notebook > Load to DataFrame. A new notebook is created with a cell like this:
%%pyspark
abspath = 'abfss://users@contosolake.dfs.core.windows.net/NYCTaxi/PassengerCountStats_parquetformat/part-00000-1f251a58-d8ac-4972-9215-8d528d490690-c000.snappy.parquet'
df = spark.read.load(abspath, format='parquet')
display(df.limit(10))
Attach the notebook to the Spark pool named Spark1, and then run the cell. If you get an error about a lack of cores, another session might be using this Spark pool. Cancel all the existing sessions and retry.
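After the cell completes, df holds the file's contents, and you can explore it with the usual PySpark operations. A minimal, schema-agnostic sketch:
%%pyspark
# Inspect the structure and size of the DataFrame loaded from the Parquet file.
df.printSchema()       # column names and types
print(df.count())      # row count
df.describe().show()   # summary statistics for numeric columns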
Go back to the users folder. Right-click the .parquet file again, and then select New SQL script > SELECT TOP 100 rows. It creates a SQL script like this:
SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://contosolake.dfs.core.windows.net/users/NYCTaxi/PassengerCountStats_parquetformat/part-00000-1f251a58-d8ac-4972-9215-8d528d490690-c000.snappy.parquet',
        FORMAT = 'PARQUET'
    ) AS [result]
In the script window, make sure the Connect to field is set to the Built-in serverless SQL pool.
Run the script.
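The notebook route works for the CSV output as well. Because the earlier write didn't emit a header row, column names won't survive the round trip; a minimal sketch of reading the CSV folder back in PySpark, with schema inference turned on:
%%pyspark
# Read back the CSV output folder. The write above didn't include a header,
# so Spark assigns default column names (_c0, _c1, ...); inferSchema guesses types.
df_csv = spark.read.option("inferSchema", "true").csv("/NYCTaxi/PassengerCountStats_csvformat")
display(df_csv.limit(10))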