In this tutorial, you'll learn how to analyze data located in a storage account.
So far, we've covered scenarios where data resides in databases in the workspace. Now we'll show you how to work with files in storage accounts. In this scenario, we'll use the workspace's primary storage account and the container that we specified when creating the workspace.
Run the following code in a new code cell in a notebook. It creates a CSV file and a Parquet file in the storage account.
Tip
The nyctaxi.passengercountstats table was created earlier in this quickstart series, in Quickstart: Get started analyzing with Spark.
%%pyspark
# Read the table created earlier in the quickstart into a DataFrame.
df = spark.sql("SELECT * FROM nyctaxi.passengercountstats")
df = df.repartition(1) # This ensures we'll get a single file during write()
# Write the data to the workspace's primary storage account, once as CSV and once as Parquet.
df.write.mode("overwrite").csv("/NYCTaxi/PassengerCountStats_csvformat")
df.write.mode("overwrite").parquet("/NYCTaxi/PassengerCountStats_parquetformat")
You can analyze the data in your workspace's default Azure Data Lake Storage (ADLS) Gen2 account, or you can link an ADLS Gen2 or Blob Storage account to your workspace through Manage > Linked services > New. The next steps refer to the primary ADLS Gen2 account.
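The relative paths in the earlier cell resolve against this primary account and container. To be explicit, or to target a different linked ADLS Gen2 account, you can use a fully qualified abfss:// URI instead. A minimal sketch, assuming the contosolake account and users container shown in the next steps; substitute your own values:
%%pyspark
# Fully qualified form: abfss://<container>@<account>.dfs.core.windows.net/<path>
# 'users' and 'contosolake' are this tutorial's container and account names.
path = "abfss://users@contosolake.dfs.core.windows.net/NYCTaxi/PassengerCountStats_parquetformat"
df = spark.read.parquet(path)
display(df.limit(10))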
In Synapse Studio, go to the Data hub, and then select Linked.
Go to Azure Data Lake Storage Gen2 > myworkspace (Primary - contosolake).
Select users (Primary). You should see the NYCTaxi folder. Inside it, you should see two folders: PassengerCountStats_csvformat and PassengerCountStats_parquetformat.
Open the PassengerCountStats_parquetformat folder. Inside, there's a Parquet file with a name like part-00000-2638e00c-0790-496b-a523-578da9a15019-c000.snappy.parquet.
Right-click the .parquet file, then select New notebook > Load to DataFrame. A new notebook is created with a cell like this:
%%pyspark
abspath = 'abfss://users@contosolake.dfs.core.windows.net/NYCTaxi/PassengerCountStats_parquetformat/part-00000-1f251a58-d8ac-4972-9215-8d528d490690-c000.snappy.parquet'
df = spark.read.load(abspath, format='parquet')
display(df.limit(10))
Attach the notebook to the Spark pool named Spark1, and then run the cell. If you get an error about a lack of cores, another session might be using this Spark pool. Cancel all the existing sessions and retry.
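After the cell completes, df holds the file's contents, and you can explore it with the usual PySpark operations. A minimal, schema-agnostic sketch:
%%pyspark
# Inspect the structure and size of the DataFrame loaded from the Parquet file.
df.printSchema()       # column names and types
print(df.count())      # row count
df.describe().show()   # summary statistics for numeric columns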
Go back to the users folder. Right-click the .parquet file again, and then select New SQL script > SELECT TOP 100 rows. It creates a SQL script like this:
SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://contosolake.dfs.core.windows.net/users/NYCTaxi/PassengerCountStats_parquetformat/part-00000-1f251a58-d8ac-4972-9215-8d528d490690-c000.snappy.parquet',
        FORMAT = 'PARQUET'
    ) AS [result]
In the script window, make sure the Connect to field is set to the Built-in serverless SQL pool.
Run the script.
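The notebook route works for the CSV output as well. Because the earlier write didn't emit a header row, column names won't survive the round trip; a minimal sketch of reading the CSV folder back in PySpark, with schema inference turned on:
%%pyspark
# Read back the CSV output folder. The write above didn't include a header,
# so Spark assigns default column names (_c0, _c1, ...); inferSchema guesses types.
df_csv = spark.read.option("inferSchema", "true").csv("/NYCTaxi/PassengerCountStats_csvformat")
display(df_csv.limit(10))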