Sample datasets
Third parties provide a variety of datasets that you can upload to your Azure Databricks workspace and use. Databricks also provides a variety of datasets that are already mounted to DBFS in your Azure Databricks workspace.
Third-party sample datasets
Azure Databricks has built-in tools to quickly upload third-party sample datasets as comma-separated values (CSV) files into Azure Databricks workspaces. The following table lists some popular third-party sample datasets available in CSV format:
| Sample dataset | To download the sample dataset as a CSV file… |
|---|---|
| The Squirrel Census | On the Data webpage, click Park Data, Squirrel Data, or Stories. |
| OWID Dataset Collection | In the GitHub repository, click the datasets folder. Click the subfolder that contains the target dataset, and then click the dataset’s CSV file. |
| Data.gov CSV datasets | On the search results webpage, click the target search result, and next to the CSV icon, click Download. |
| Diamonds (Requires a Kaggle account) | On the dataset’s webpage, on the Data tab, next to diamonds.csv, click the Download icon. |
| NYC Taxi Trip Duration (Requires a Kaggle account) | On the dataset’s webpage, on the Data tab, next to sample_submission.zip, click the Download icon. To find the dataset’s CSV files, extract the contents of the downloaded ZIP file. |
| UFO Sightings (Requires a data.world account) | On the dataset’s webpage, next to nuforc_reports.csv, click the Download icon. |
To use third-party sample datasets in your Azure Databricks workspace, do the following:
- Follow the third party’s instructions to download the dataset as a CSV file to your local machine.
- Upload the CSV file from your local machine into your Azure Databricks workspace.
- To work with the imported data, use Databricks SQL to query the data, or use a notebook to load the data as a DataFrame (see the sketch after this list).
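For the notebook route, the following is a minimal sketch of loading an uploaded CSV into a DataFrame. The file path is an assumption: the workspace upload UI typically places files under /FileStore/tables, and the diamonds.csv file name is only an example; adjust both to match your upload.
Python
# Load an uploaded CSV into a DataFrame. The path assumes the file was
# uploaded through the workspace UI, which stores files under
# /FileStore/tables by default; the file name is an example.
df = spark.read.csv("/FileStore/tables/diamonds.csv", header=True, inferSchema=True)
display(df)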
Databricks datasets (databricks-datasets)
Azure Databricks includes a variety of datasets mounted to DBFS.
Note
The availability and location of Databricks datasets are subject to change without notice.
Browse Databricks datasets
To browse these files from a notebook in Data Science & Engineering or Databricks Machine Learning, you can use Databricks Utilities with Python, Scala, or R. The code in this example lists all of the available Databricks datasets.
Python
display(dbutils.fs.ls('/databricks-datasets'))
Scala
display(dbutils.fs.ls("/databricks-datasets"))
R
%fs ls "/databricks-datasets"
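Each entry in the listing is a folder for one dataset. To drill into a single dataset, pass its folder to the same call; the nyctaxi path below is an assumption based on the names the top-level listing returns.
Python
# Browse the contents of one dataset folder; the nyctaxi path is an
# example taken from the top-level listing above.
display(dbutils.fs.ls('/databricks-datasets/nyctaxi'))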
Unity Catalog datasets
Unity Catalog provides access to a number of sample datasets in the samples catalog. You can review these datasets in the Data Explorer UI and reference them directly using the <catalog_name>.<database_name>.<table_name> pattern.
The nyctaxi database contains the table trips, which has details about taxi rides in New York City stored using Delta Lake. The following code example returns all records in this table:
SELECT * FROM samples.nyctaxi.trips
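From a notebook, you can also read the same table into a DataFrame instead of querying it in the SQL editor. This is a minimal sketch that assumes the attached cluster has access to the samples catalog:
Python
# Read the Unity Catalog sample table into a DataFrame and show a few
# rows; assumes the cluster can access the samples catalog.
trips = spark.read.table("samples.nyctaxi.trips")
display(trips.limit(10))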
The tpch database contains data from the TPC-H Benchmark. To see the tables in this database, run:
SHOW TABLES IN samples.tpch
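Once you know a table name, you can query it with the same three-part pattern. The customer table below is an assumption based on the standard TPC-H schema; substitute any table name that SHOW TABLES returns.
Python
# Query one of the TPC-H sample tables; the customer table name is an
# assumption based on the standard TPC-H schema.
display(spark.sql("SELECT * FROM samples.tpch.customer LIMIT 10"))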
Get information about Databricks datasets
To get more information about a dataset, you can use a local file API to print out the dataset README (if one is available) by using Python, R, or Scala in a notebook in Data Science & Engineering or Databricks Machine Learning, as shown in this code example.
Python
with open('/dbfs/databricks-datasets/README.md', 'r') as f:
    print(f.read())
Scala
scala.io.Source.fromFile("/dbfs/databricks-datasets/README.md").foreach {
print
}
R
library(readr)
f <- read_lines("/dbfs/databricks-datasets/README.md", skip = 0, n_max = -1L)
print(f)
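Individual datasets often ship their own README alongside the data. The following sketch walks the top-level folders through the /dbfs local file mount and prints those that contain a README.md; that file name is an assumption, and some datasets use a different name or none at all.
Python
# Print the dataset folders that contain a README.md. The file name is
# an assumption; some datasets name their documentation differently.
import os

root = '/dbfs/databricks-datasets'
for name in sorted(os.listdir(root)):
    if os.path.isfile(os.path.join(root, name, 'README.md')):
        print(name)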
Create a table based on a Databricks dataset
This code example demonstrates how to use SQL in the Databricks SQL query editor, or how to use Python, Scala, or R in a notebook in Data Science & Engineering or Databricks Machine Learning, to create a table based on a Databricks dataset:
SQL
CREATE TABLE default.people10m OPTIONS (PATH 'dbfs:/databricks-datasets/learning-spark-v2/people/people-10m.delta')
Python
spark.sql("CREATE TABLE default.people10m OPTIONS (PATH 'dbfs:/databricks-datasets/learning-spark-v2/people/people-10m.delta')")
Scala
spark.sql("CREATE TABLE default.people10m OPTIONS (PATH 'dbfs:/databricks-datasets/learning-spark-v2/people/people-10m.delta')")
R
library(SparkR)
sparkR.session()
sql("CREATE TABLE default.people10m OPTIONS (PATH 'dbfs:/databricks-datasets/learning-spark-v2/people/people-10m.delta')")