Import from Web URL via HTTP

2019-05-06

Important

Support for Machine Learning Studio (classic) will end on 31 August 2024. We recommend you transition to Azure Machine Learning by that date.

Beginning 1 December 2021, you will not be able to create new Machine Learning Studio (classic) resources. Through 31 August 2024, you can continue to use the existing Machine Learning Studio (classic) resources.

See information on moving machine learning projects from ML Studio (classic) to Azure Machine Learning.
Learn more about Azure Machine Learning.

ML Studio (classic) documentation is being retired and may not be updated in the future.

This article describes how to use the Import Data module in Machine Learning Studio (classic) to read data from a public web page for use in a machine learning experiment.

Note

Applies to: Machine Learning Studio (classic) only

Similar drag-and-drop modules are available in Azure Machine Learning designer.

The following restrictions apply to data published on a web page:

Data must be in one of the supported formats: CSV, TSV, ARFF, or SvmLight. Other data will cause errors.
No authentication is required or supported. Data must be publicly available.
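As a rough pre-flight check, the restrictions above can be sketched in a few lines of Python. This is only an illustration: the helper name `check_import_url` and the extension-to-format table are assumptions made for this example, not part of Studio (classic), and a file extension is only a hint about the actual format.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Hypothetical mapping from file extension to a Data format choice.
# This table is an assumption of the sketch; always inspect the data itself.
EXTENSION_TO_FORMAT = {
    ".csv": "CSV",
    ".tsv": "TSV",
    ".arff": "ARFF",
    ".svmlight": "SvmLight",
}


def check_import_url(url):
    """Return (problems, format_guess) for a candidate URL."""
    parts = urlsplit(url)
    problems = []
    if parts.scheme not in ("http", "https"):
        problems.append("scheme must be http or https")
    if not parts.hostname:
        problems.append("URL has no host name")
    # Authentication is not supported, so credentials in the URL cannot work.
    if parts.username or parts.password:
        problems.append("embedded credentials are not supported")
    suffix = PurePosixPath(parts.path).suffix.lower()
    if not suffix:
        problems.append("path should end in a file name with an extension")
    # format_guess is None when the extension is not in the table.
    return problems, EXTENSION_TO_FORMAT.get(suffix)
```

An empty `problems` list means only that the URL has the expected shape. It does not confirm that the server returns the data anonymously or that the content is in a supported format.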

How to import data via HTTP

There are two ways to get data: use the wizard to set up the data source, or configure it manually.

Use the Data Import Wizard

Add the Import Data module to your experiment. You can find the module in Studio (classic), in the Data Input and Output category.
Click Launch Import Data Wizard and select Web URL via HTTP.
Paste in the URL, and select a data format.
When configuration is complete, right-click the module, and select Run Selected.

To edit an existing data connection, start the wizard again. The wizard loads all previous configuration details so that you don't have to start again from scratch.

Manually set properties in the Import Data module

The following steps describe how to manually configure the import source.

Add the Import Data module to your experiment. You can find the module in Studio (classic), in the Data Input and Output category.
For Data source, select Web URL via HTTP.
For URL, type or paste the full URL of the page that contains the data you want to load.

The URL should include the site URL and the full path, with file name and extension, to the page that contains the data to load.

For example, the following page contains the Iris data set from the machine learning repository of the University of California, Irvine:

https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
For Data format, select one of the supported data formats from the list.

We recommend that you always check the data beforehand to determine the format. The UC Irvine page uses the CSV format. Other supported data formats are TSV, ARFF, and SvmLight.
If the data is in CSV or TSV format, use the File has header row option to indicate whether or not the source data includes a header row. The header row is used to assign column names.
Select the Use cached results option if you don't expect the data to change much, or if you want to avoid reloading the data each time you run the experiment.

When this option is selected, the experiment loads the data the first time the module is run, and thereafter uses a cached version of the dataset.

If you want to re-load the dataset each time the experiment runs, deselect the Use cached results option. Results are also re-loaded if there are any changes to the parameters of Import Data.
Run the experiment.
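To make the effect of the URL, Data format, and File has header row settings concrete, the following is a minimal Python sketch of downloading and parsing a public delimited file. It is not how the Import Data module is implemented. The function names are hypothetical, and the `Col1`, `Col2`, ... placeholder column names are an assumption of this sketch.

```python
import csv
import io
import urllib.request


def parse_delimited(text, delimiter=",", has_header=False):
    """Parse CSV or TSV text into (column_names, rows)."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return [], []
    if has_header:
        # The first row supplies the column names.
        return rows[0], rows[1:]
    # No header row: generate placeholder names (an assumption of this sketch).
    names = [f"Col{i}" for i in range(1, len(rows[0]) + 1)]
    return names, rows


def import_from_url(url, delimiter=",", has_header=False):
    """Download a publicly available file over HTTP(S) and parse it."""
    with urllib.request.urlopen(url, timeout=30) as response:
        text = response.read().decode("utf-8")
    return parse_delimited(text, delimiter, has_header)
```

For a TSV source, pass `delimiter="\t"`. Calling `import_from_url` with the UC Irvine URL shown above and `has_header=False` would treat every line of the file as data, assuming the file is still served at that address.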

Results

When complete, click the output dataset and select Visualize to see if the data was imported successfully.

Examples

See these examples in the Azure AI Gallery of machine learning experiments that get data from public web sites:

Letter Recognition sample: Gets a training dataset from the public machine learning repository hosted by UC Irvine.
Download UCI Dataset: Reads a dataset in the CSV format.

Technical notes

This section contains implementation details, tips, and answers to frequently asked questions.

Common questions

Can I filter data as it is being read from the source?

No. That option is not supported with this data source.

After reading the data into Machine Learning Studio (classic), you can split the dataset, use sampling, and so forth to get just the rows you want:

Write some simple R code in the Execute R Script module to get a portion of the data by rows or columns.
Use the Split Data module with a relative expression or a regular expression to isolate the data you want.
If you loaded more data than you need, overwrite the cached dataset by reading a new dataset, and saving it with the same name.
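The kind of row and column filtering described above can be sketched as follows. The example uses plain Python rather than R, purely for illustration. The function name, the column names, and the sample rows are hypothetical.

```python
def select_rows_and_columns(rows, keep_columns, predicate):
    """Keep rows for which predicate(row) is true, then keep only the named columns."""
    return [
        {name: row[name] for name in keep_columns}
        for row in rows
        if predicate(row)
    ]


# Hypothetical rows, as they might look after the import has finished.
sample = [
    {"sepal_length": 5.1, "petal_length": 1.4, "species": "Iris-setosa"},
    {"sepal_length": 7.0, "petal_length": 4.7, "species": "Iris-versicolor"},
    {"sepal_length": 6.3, "petal_length": 6.0, "species": "Iris-virginica"},
]

# Keep only long-petaled flowers, and only two of the three columns.
subset = select_rows_and_columns(
    sample,
    keep_columns=["petal_length", "species"],
    predicate=lambda row: row["petal_length"] > 4.0,
)
```

The filtering happens after the full dataset has been loaded, which matches the answer above: the source itself is always read in its entirety.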

How can I avoid re-loading the same data unnecessarily?

If your source data changes, you can refresh the dataset and add new data by re-running Import Data.

If you don't want to re-read from the source each time you run the experiment, set the Use cached results option to TRUE. When this option is set to TRUE, the module checks whether the experiment has run previously using the same source and same input options. If a previous run is found, the data in the cache is used, instead of re-loading the data from the source.
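The cache check described above can be modeled in a few lines. This is a conceptual sketch only. The `ImportCache` class and its in-memory dictionary are assumptions made for illustration and say nothing about how Studio (classic) actually stores cached results.

```python
import json


class ImportCache:
    """Tiny in-memory stand-in for the module's cached results."""

    def __init__(self, loader):
        self._loader = loader  # function (url, options) -> dataset
        self._store = {}

    def load(self, url, options, use_cached_results=True):
        # The key combines the source and every input option, so changing
        # either one produces a different key and forces a fresh load.
        key = json.dumps({"url": url, "options": options}, sort_keys=True)
        if use_cached_results and key in self._store:
            return self._store[key]
        dataset = self._loader(url, options)
        self._store[key] = dataset
        return dataset
```

With `use_cached_results=False`, the loader is called on every run, which corresponds to deselecting the option in the module.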

Why was an extra row added at the end of my dataset?

If the Import Data module encounters a row of data that is followed by an empty line or a trailing new line character, an extra row is added at the end of the table. This new row contains missing values.

The reason for interpreting a trailing new line as a new row is that Import Data cannot determine the difference between an actual empty line and an empty line that is created by the user pressing ENTER at the end of a file.

Because some machine learning algorithms support missing data and would thus treat this line as a case (which in turn could affect the results), you should use Clean Missing Data to check for missing values (particularly rows that are completely empty), and remove them as needed.

Before you check for empty rows, you might also want to divide the dataset by using Split Data. Doing so keeps rows with partially missing values, which represent actual missing values in the source data, apart from the trailing empty row. Use the Select head N rows option to read the first part of the dataset into a separate container from the last line.
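The difference between a completely empty row and a row with only some values missing can be illustrated with a short Python sketch. The helper names are hypothetical, and the code imitates the behavior described above rather than reproducing the module's parser.

```python
import csv
import io


def read_rows(text, width):
    """Read CSV text, turning empty cells and short rows into None values."""
    rows = []
    for row in csv.reader(io.StringIO(text)):
        cells = [cell if cell != "" else None for cell in row]
        # A blank line yields no cells at all; pad it to the full width.
        cells += [None] * (width - len(cells))
        rows.append(cells)
    return rows


def drop_empty_rows(rows):
    """Remove rows in which every value is missing; keep partial rows."""
    return [row for row in rows if any(cell is not None for cell in row)]


# Two data rows followed by an extra blank line at the end of the file.
raw = "1,2\n3,\n\n"
loaded = read_rows(raw, width=2)
cleaned = drop_empty_rows(loaded)
```

Here `loaded` has three rows, the last of which consists entirely of missing values, while `cleaned` keeps the row `3,` because its missing second value is real data from the source.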

Why are some characters in my source file not displayed correctly?

Machine Learning supports the UTF-8 encoding. If your source file uses another type of encoding, the characters might not be imported correctly.
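One way to avoid this problem is to convert the file to UTF-8 before you publish it. The following Python sketch assumes that you know the original encoding. The function names are hypothetical, and Latin-1 is used only as an example.

```python
def is_valid_utf8(raw_bytes):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def to_utf8(raw_bytes, source_encoding):
    """Re-encode bytes from a known source encoding to UTF-8."""
    return raw_bytes.decode(source_encoding).encode("utf-8")


# "caf\u00e9" stored as Latin-1 is not valid UTF-8 until it is converted.
latin1_bytes = "caf\u00e9".encode("latin-1")
converted = to_utf8(latin1_bytes, "latin-1")
```

Note that `is_valid_utf8` can only detect byte sequences that are illegal in UTF-8. Text in another encoding that happens to decode without error would still display incorrectly.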

Module parameters

Name	Range	Type	Default	Description
Data source	List	Data Source Or Sink	Azure Blob Storage	Data source can be HTTP, FTP, anonymous HTTPS or FTPS, a file in Azure BLOB storage, an Azure table, an Azure SQL Database, an on-premises SQL Server database, a Hive table, or an OData endpoint.
URL	any	String	none	URL for HTTP
Data format	CSV TSV ARFF SvmLight	Data Format	CSV	File type of HTTP source
CSV or TSV has header row	TRUE/FALSE	Boolean	FALSE	Indicates if CSV or TSV file has a header row
Use cached results	TRUE/FALSE	Boolean	FALSE	Module executes only if valid cache does not exist. Otherwise, cached data from previous execution is used.

Outputs

Name	Type	Description
Results dataset	Data Table	Dataset with downloaded data

Exceptions

Exception	Description
Error 0027	An exception occurs when two objects have to be the same size, but they are not.
Error 0003	An exception occurs if one or more inputs are null or empty.
Error 0029	An exception occurs when an invalid URI is passed.
Error 0030	An exception occurs when it is not possible to download a file.
Error 0002	An exception occurs if one or more parameters could not be parsed or converted from the specified type to the type required by the target method.
Error 0048	An exception occurs when it is not possible to open a file.
Error 0046	An exception occurs when it is not possible to create a directory on the specified path.
Error 0049	An exception occurs when it is not possible to parse a file.

For a list of errors specific to Studio (classic) modules, see Machine Learning Error codes.

For a list of API exceptions, see Machine Learning REST API Error Codes.
