Azure Machine Learning: loading external datasets into your experiment
Amy Nicholson and I attended a university event yesterday. Amy's presentation covered Azure ML Studio and how the university could use it effectively in its machine learning teaching, learning and research.
One of the questions we received at the end of the session was how to get large datasets from a local computer into Azure ML.
The size limit for uploading local datasets directly to Azure ML is 1.98 GB.
To overcome this limitation and upload larger files, up to 10 GB, the recommended approach has two steps:
- Stage the data to Microsoft Azure Blob Storage using the AzCopy command-line utility
- Use the Reader module to import the data from Blob Storage into ML Studio
Note that for large files, bringing in datasets can take a long time to complete: 10 minutes or more per GB of data.
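As a rough planning aid, the limits above can be turned into a small Python sketch that picks a route for a file and estimates the import time. The function and constant names are hypothetical and not part of any Azure tool; the numbers (1.98 GB, 10 GB, 10 minutes per GB) are simply the figures quoted in this post, and the time estimate is a lower bound since imports can take longer.

```python
# Figures quoted in this post; the names below are illustrative only.
DIRECT_UPLOAD_LIMIT_GB = 1.98   # direct upload limit in ML Studio
BLOB_IMPORT_LIMIT_GB = 10.0     # limit when staging through Blob Storage
MINUTES_PER_GB = 10.0           # "10 minutes or more per GB"


def choose_upload_route(size_gb: float) -> str:
    """Return which of the approaches described above fits a file size."""
    if size_gb < 0:
        raise ValueError("size_gb must not be negative")
    if size_gb <= DIRECT_UPLOAD_LIMIT_GB:
        return "direct upload"
    if size_gb <= BLOB_IMPORT_LIMIT_GB:
        return "blob storage + Reader module"
    return "too large for either approach"


def estimate_import_minutes(size_gb: float) -> float:
    """Lower-bound estimate of the import time in minutes."""
    return size_gb * MINUTES_PER_GB
```

For example, a 5 GB file would go through Blob Storage and should be expected to take at least 50 minutes to import.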
Step 1: Stage Data to Blob Storage using AzCopy
First install the AzCopy command-line utility on your local computer. Then start a Command Prompt and use AzCopy to upload your file from a local folder to Blob Storage:
cd "C:\Program Files (x86)\Microsoft SDKs\Azure\AzCopy"
.\AzCopy.exe /Source:C:\LocalFolder /Dest:https://mystorage.blob.core.windows.net/mycontainer /DestKey:MyStorageAccountKey /Pattern:myfile.csv
Note: To optimize performance for the next step, use South Central US as the region for your storage account. South Central US is the same region that the Azure ML service uses.
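If you upload files regularly, you may prefer to script the AzCopy call rather than type it each time. The Python sketch below assembles the same command shown above from its parts. The helper names are hypothetical, and the only AzCopy switches used are the ones already shown in this post (/Source, /Dest, /DestKey, /Pattern); it assumes AzCopy.exe is installed at the path you pass in.

```python
import subprocess


def build_azcopy_command(source_dir, dest_url, dest_key, pattern,
                         azcopy_exe="AzCopy.exe"):
    """Build the argument list for the AzCopy upload shown in Step 1."""
    return [
        azcopy_exe,
        f"/Source:{source_dir}",
        f"/Dest:{dest_url}",
        f"/DestKey:{dest_key}",
        f"/Pattern:{pattern}",
    ]


def upload_with_azcopy(source_dir, dest_url, dest_key, pattern,
                       azcopy_exe="AzCopy.exe"):
    """Run AzCopy and raise CalledProcessError if it exits with an error."""
    command = build_azcopy_command(source_dir, dest_url, dest_key,
                                   pattern, azcopy_exe)
    subprocess.run(command, check=True)
```

Passing the arguments as a list avoids quoting problems when a folder name contains spaces. Keep the storage account key out of source control, for example by reading it from an environment variable.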
Step 2: Use Reader Module to Import Data from Blob to ML Studio
Create a new blank experiment in Azure ML Studio. Drag the Reader module onto the experiment canvas, and configure its parameters to read data from the blob created in Step 1:
- Data source: AzureBlobStorage
- Authentication type: Account
- Account name: <mystorage>
- Account key: <MyStorageAccountKey>
- Path to container, directory or blob: <mycontainer>/<myfile.csv>
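The Reader fields above map directly onto the parts of the blob URL used in Step 1. As an illustration, the Python sketch below splits a blob URL into the account name and path values you would type into the module. The function name is hypothetical, the dictionary keys simply mirror the field labels listed above, and it assumes the standard <account>.blob.core.windows.net host format used in this post.

```python
from urllib.parse import urlparse

BLOB_HOST_SUFFIX = ".blob.core.windows.net"


def reader_settings_from_blob_url(blob_url, account_key):
    """Derive the Reader module field values from a blob URL."""
    parsed = urlparse(blob_url)
    host = parsed.hostname or ""
    if not host.endswith(BLOB_HOST_SUFFIX):
        raise ValueError("not an Azure Blob Storage URL: " + blob_url)
    account_name = host[: -len(BLOB_HOST_SUFFIX)]
    path = parsed.path.strip("/")
    if not account_name or not path:
        raise ValueError("URL must include an account and a container")
    return {
        "Data source": "AzureBlobStorage",
        "Authentication type": "Account",
        "Account name": account_name,
        "Account key": account_key,
        "Path to container, directory or blob": path,
    }
```

Calling it with https://mystorage.blob.core.windows.net/mycontainer/myfile.csv gives the account name mystorage and the path mycontainer/myfile.csv, matching the values listed above.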
Run the experiment. Once the experiment has finished, right-click the output port of the Reader module and select “Save as Dataset”. Note that the Reader module will re-read the dataset every time the experiment is run, but saving the dataset will create a static copy that is available from the “Saved Datasets” list in ML Studio.
If you're interested in learning more about Azure Machine Learning:
Here is a short introduction on Preprocessing Data in Azure Machine Learning Studio
Here is a short video on Predictive Modeling with Azure ML Studio
For more videos on Azure ML see https://channel9.msdn.com
Resources
This repository contains a student-focused getting-started document for Azure Machine Learning. It looks at Azure Machine Learning Studio, the Gallery and Notebooks, and takes you end-to-end through building and deploying a model using the cloud service on Azure.
https://github.com/amykatenicho/AzureMLStudentsPython
The Azure Workshop is a series of hands-on coding labs to help computer science faculty and students quickly learn how to deploy solutions to the Azure cloud across common scenarios like Web Dev, App Dev, Internet of Things, and Data Science with Machine Learning using cross-platform technologies. Labs can be completed on a Windows device or through VMs on Mac or Linux. The format is typically one day, instructor-led; however, groups may opt to customize it into 2-hour or 4-hour sessions. Your feedback is welcome in improving these labs.
https://github.com/MSFTImagine/computerscience/tree/master/Workshop
Data science in 5 steps with Microsoft Azure Machine Learning
A set of Machine Learning Resources
https://blogs.msdn.microsoft.com/uk_faculty_connection/?s=Machine+Learning