Download data from the internet
You can use Azure Databricks notebooks to download data from public URLs. Azure Databricks does not provide any native tools for downloading data from the internet, but you can use open source tools in supported languages. If your data is in cloud object storage, reading it directly with Apache Spark provides better results. See Connect to data sources.
Azure Databricks clusters provide general compute, allowing you to run arbitrary code in addition to Apache Spark commands. Arbitrary commands store results on ephemeral storage attached to the driver by default. You must move downloaded data to a new location before reading it with Apache Spark, as Apache Spark cannot read from ephemeral storage. See Work with files on Azure Databricks.
Databricks recommends using Unity Catalog volumes for storing all non-tabular data. You can optionally specify a volume as your destination during download, or move data to a volume after download. Volumes do not support random writes, so download files and unzip them to ephemeral storage before moving them to volumes. See Expand and read Zip compressed files.
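The unzip-to-ephemeral-storage-then-move pattern described above can be sketched with Python's standard library. This is a minimal illustration, not a Databricks API: the helper name and paths are placeholders you would adapt to your own archive and volume.

```python
import os
import shutil
import zipfile

def unzip_then_move(zip_path, tmp_dir, dest_dir):
    """Extract a Zip archive to ephemeral storage, then move the
    extracted files to a final destination (for example, a volume path
    such as /Volumes/my_catalog/my_schema/my_volume)."""
    os.makedirs(tmp_dir, exist_ok=True)
    os.makedirs(dest_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        # Random writes during extraction happen on local ephemeral disk.
        zf.extractall(tmp_dir)
    # Sequentially move each extracted file to the destination.
    for name in os.listdir(tmp_dir):
        shutil.move(os.path.join(tmp_dir, name), os.path.join(dest_dir, name))
```

Extraction requires random writes, so it runs against local disk; only the finished files are moved to the destination afterward.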
Note
Some workspace configurations might prevent access to the public internet. Consult your workspace administrator if you need expanded network access.
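If you are unsure whether your cluster can reach the public internet, a quick check from a notebook can save a confusing download failure. This is an illustrative sketch using Python's standard library; the function name and default URL are placeholders.

```python
import urllib.error
import urllib.request

def internet_reachable(url="https://data.cityofnewyork.us", timeout=5):
    """Return True if the given public URL responds within the timeout.

    A False result can indicate a network configuration that blocks
    public internet access from the cluster.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except (urllib.error.URLError, OSError):
        return False
```

A False result here is a signal to contact your workspace administrator before troubleshooting the download commands themselves.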
Download a file to a volume
Databricks recommends storing all non-tabular data in Unity Catalog volumes.
The following examples use packages for Bash, Python, and Scala to download a file to a Unity Catalog volume:
Bash
%sh curl https://data.cityofnewyork.us/api/views/kk4q-3rt2/rows.csv --output /Volumes/my_catalog/my_schema/my_volume/curl-subway.csv
Python
import urllib
urllib.request.urlretrieve("https://data.cityofnewyork.us/api/views/kk4q-3rt2/rows.csv", "/Volumes/my_catalog/my_schema/my_volume/python-subway.csv")
Scala
import java.net.URL
import java.io.File
import org.apache.commons.io.FileUtils
FileUtils.copyURLToFile(new URL("https://data.cityofnewyork.us/api/views/kk4q-3rt2/rows.csv"), new File("/Volumes/my_catalog/my_schema/my_volume/scala-subway.csv"))
Download a file to ephemeral storage
The following examples use packages for Bash, Python, and Scala to download a file to ephemeral storage attached to the driver:
Bash
%sh curl https://data.cityofnewyork.us/api/views/kk4q-3rt2/rows.csv --output /tmp/curl-subway.csv
Python
import urllib
urllib.request.urlretrieve("https://data.cityofnewyork.us/api/views/kk4q-3rt2/rows.csv", "/tmp/python-subway.csv")
Scala
import java.net.URL
import java.io.File
import org.apache.commons.io.FileUtils
FileUtils.copyURLToFile(new URL("https://data.cityofnewyork.us/api/views/kk4q-3rt2/rows.csv"), new File("/tmp/scala-subway.csv"))
Because these files are downloaded to ephemeral storage attached to the driver, use %sh to see these files, as in the following example:
%sh ls /tmp/
You can use Bash commands to preview the contents of files downloaded this way, as in the following example:
%sh head /tmp/curl-subway.csv
Move data with dbutils
To access data with Apache Spark, you must move it from ephemeral storage to cloud object storage. Databricks recommends using volumes for managing all access to cloud object storage. See Connect to data sources.
The Databricks Utilities (dbutils) allow you to move files from ephemeral storage attached to the driver to other locations, including Unity Catalog volumes. The following example moves data to an example volume:
dbutils.fs.mv("file:/tmp/curl-subway.csv", "/Volumes/my_catalog/my_schema/my_volume/subway.csv")
Read downloaded data
After you move the data to a volume, you can read the data as normal. The following code reads in the CSV data moved to a volume:
df = spark.read.format("csv").option("header", True).load("/Volumes/my_catalog/my_schema/my_volume/subway.csv")
display(df)