Connect to data with the Azure Machine Learning studio

In this article, learn how to access your data with the Azure Machine Learning studio. Connect to your data in storage services on Azure with Azure Machine Learning datastores, and then package that data for tasks in your ML workflows with Azure Machine Learning datasets.

The following table defines and summarizes the benefits of datastores and datasets.

Object: Datastores
Description: Securely connect to your storage service on Azure, by storing your connection information, like your subscription ID and token authorization, in your Key Vault associated with the workspace.
Benefits: Because your information is securely stored, you

  • Don't put authentication credentials or original data sources at risk.
  • No longer need to hard code them in your scripts.

Object: Datasets
Description: By creating a dataset, you create a reference to the data source location, along with a copy of its metadata. With datasets, you can

  • Access data during model training.
  • Share data and collaborate with other users.
  • Use open-source libraries, like pandas, for data exploration.

Benefits: Because datasets are lazily evaluated, and the data remains in its existing location, you

  • Keep a single copy of data in your storage.
  • Incur no extra storage cost.
  • Don't risk unintentionally changing your original data sources.
  • Improve ML workflow performance speeds.

To understand where datastores and datasets fit in Azure Machine Learning's overall data access workflow, see the Securely access data article.
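
The idea of a lazily evaluated reference can be illustrated with a minimal sketch that uses only the Python standard library. It is not the Azure Machine Learning implementation: the `LazyTabularReference` class, the temporary CSV file, and the `reads` counter are all hypothetical names introduced for illustration. The object stores only the location of the data and opens the source when rows are actually requested.

```python
import csv
import os
import tempfile


class LazyTabularReference:
    """Hypothetical stand-in for a dataset: stores a location, not the data."""

    def __init__(self, path):
        self.path = path   # reference to where the data lives
        self.reads = 0     # how many times the source was actually opened

    def rows(self):
        # The source is opened only when iteration over the rows starts.
        self.reads += 1
        with open(self.path, newline="") as handle:
            yield from csv.DictReader(handle)


# Create a small CSV file to act as the "storage" for this sketch.
tmp_dir = tempfile.mkdtemp()
source_path = os.path.join(tmp_dir, "sample.csv")
with open(source_path, "w", newline="") as handle:
    csv.writer(handle).writerows([["id", "label"], ["1", "cat"], ["2", "dog"]])

reference = LazyTabularReference(source_path)
reads_before = reference.reads     # creating the reference reads nothing
loaded = list(reference.rows())    # the data is read here, on demand
reads_after = reference.reads
```

Because the object holds only a path, the file stays in its original location and no second copy is created until a consumer asks for the rows.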

    For a code-first experience, see the articles on using the Azure Machine Learning Python SDK to create datastores and to create datasets.

    Prerequisites

    • An Azure subscription. If you don't have an Azure subscription, create a free account before you begin. Try the free or paid version of Azure Machine Learning.

    • Access to Azure Machine Learning studio.

    • An Azure Machine Learning workspace. To create one, see Create workspace resources.

      • When you create a workspace, an Azure blob container and an Azure file share are automatically registered as datastores to the workspace. They're named workspaceblobstore and workspacefilestore, respectively. If blob storage is sufficient for your needs, the workspaceblobstore is set as the default datastore, and already configured for use. Otherwise, you need a storage account on Azure with a supported storage type.

    Create datastores

    You can create datastores from these Azure storage solutions. For unsupported storage solutions, and to save data egress cost during ML experiments, you must move your data to a supported Azure storage solution. Learn more about datastores.

    You can create datastores with credential-based access or identity-based access.

    Create a new datastore in a few steps with the Azure Machine Learning studio.

    Important

    If your data storage account is in a virtual network, additional configuration steps are required to ensure the studio has access to your data. See Network isolation & privacy to ensure the appropriate configuration steps are applied.

    1. Sign in to Azure Machine Learning studio.
    2. Select Data on the left pane under Assets.
    3. At the top, select Datastores.
    4. Select +Create.
    5. Complete the form to create and register a new datastore. The form intelligently updates itself based on your selections for Azure storage type and authentication type. See the storage access and permissions section to understand where to find the authentication credentials you need to populate this form.

    The following example demonstrates what the form looks like when you create an Azure blob datastore:

    (Screenshot: form for a new datastore.)

    Create data assets

    After you create a datastore, create a dataset to interact with your data. Datasets package your data into a lazily evaluated consumable object for machine learning tasks, like training. Learn more about datasets.

    There are two types of datasets, FileDataset and TabularDataset. FileDatasets create references to single or multiple files or public URLs. Whereas TabularDatasets represent your data in a tabular format. You can create TabularDatasets from .csv, .tsv, .parquet, .jsonl files, and from SQL query results.
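
As a rough illustration of the distinction above, the following sketch suggests a dataset type from a file extension. The helper `suggest_dataset_type` and the sample file names are hypothetical and are not part of any Azure Machine Learning API. The sketch only encodes the file formats listed in the text, and files in those formats can still be referenced as a FileDataset if you prefer.

```python
from pathlib import PurePosixPath

# File extensions the text lists as sources for tabular datasets.
TABULAR_EXTENSIONS = {".csv", ".tsv", ".parquet", ".jsonl"}


def suggest_dataset_type(path):
    """Return "Tabular" for the listed tabular formats, otherwise "File"."""
    suffix = PurePosixPath(path).suffix.lower()
    return "Tabular" if suffix in TABULAR_EXTENSIONS else "File"


# Made-up file names, used only to exercise the helper.
suggestions = {
    name: suggest_dataset_type(name)
    for name in ["sales/2024.csv", "events.JSONL", "images/cat.png"]
}
```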

    The following steps describe how to create a dataset in Azure Machine Learning studio.

    Note

    Datasets created through Azure Machine Learning studio are automatically registered to the workspace.

    1. Navigate to Azure Machine Learning studio.

    2. Under Assets in the left navigation, select Data. On the Data assets tab, select Create.

    3. Give your data asset a name and optional description. Then, under Type, select one of the Dataset types, either File or Tabular.

    4. You have a few options for your data source. If your data is already stored in Azure, choose "From Azure storage". If you want to upload data from your local drive, choose "From local files". If your data is stored at a public web location, choose "From web files". You can also create a data asset from a SQL database, or from Azure Open Datasets.

    5. For the file selection step, select where you want your data to be stored in Azure, and what data files you want to use.

      1. Enable skip validation if your data is in a virtual network. Learn more about virtual network isolation and privacy.
    6. Follow the steps to set the data parsing settings and schema for your data asset. The settings are prepopulated based on file type, and you can adjust them before you create the data asset.

    7. When you reach the Review step, select Create.

    Data preview and profile

    After you create your dataset, verify you can view the preview and profile in the studio with the following steps:

    1. Sign in to the Azure Machine Learning studio.
    2. Under Assets in the left navigation, select Data.
    3. Select the name of the dataset you want to view.
    4. Select the Explore tab.
    5. Select the Preview tab to see a preview of the dataset.
    6. Select the Profile tab to see the dataset's column metadata.

    You can get a wide variety of summary statistics across your dataset to verify whether it's ML-ready. For non-numeric columns, the statistics include only basic values like min, max, and error count. For numeric columns, you can also review statistical moments and estimated quantiles.

    Specifically, the data profile of an Azure Machine Learning dataset includes:

    Note

    Blank entries appear for features with irrelevant types.

    Statistic: Description
    Feature: Name of the column that is being summarized.
    Profile: In-line visualization based on the inferred type. For example, strings, booleans, and dates have value counts, while decimals (numerics) have approximated histograms. This gives you a quick understanding of the distribution of the data.
    Type distribution: In-line value count of types within a column. Nulls are their own type, so this visualization is useful for detecting odd or missing values.
    Type: Inferred type of the column. Possible values include strings, booleans, dates, and decimals.
    Min: Minimum value of the column. Blank entries appear for features whose type doesn't have an inherent ordering (like booleans).
    Max: Maximum value of the column.
    Count: Total number of missing and non-missing entries in the column.
    Not missing count: Number of entries in the column that aren't missing. Empty strings and errors are treated as values, so they are included in the not missing count.
    Quantiles: Approximated values at each quantile to provide a sense of the distribution of the data.
    Mean: Arithmetic mean or average of the column.
    Standard deviation: Measure of the amount of dispersion or variation of this column's data.
    Variance: Measure of how far spread out this column's data is from its average value.
    Skewness: Measure of how different this column's data is from a normal distribution.
    Kurtosis: Measure of how heavily tailed this column's data is compared to a normal distribution.
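
To make the numeric statistics concrete, here is a minimal sketch that computes several of them for one column with the Python standard library. It is not how the studio computes its profile: the function name `profile_numeric_column`, the sample column, the use of `None` for missing entries, and the choice of population (rather than sample) moments are all assumptions made for illustration. Kurtosis here is the non-excess form, for which a normal distribution scores 3.

```python
import math
import statistics


def profile_numeric_column(values):
    """Summarize a numeric column; None marks a missing entry (assumption)."""
    present = [v for v in values if v is not None]
    n = len(present)
    mean = statistics.fmean(present)
    # Population moments are an assumption; other tools may use sample estimators.
    variance = statistics.pvariance(present, mu=mean)
    std_dev = math.sqrt(variance)
    # Standardized third and fourth central moments.
    skewness = sum((v - mean) ** 3 for v in present) / (n * std_dev ** 3)
    kurtosis = sum((v - mean) ** 4 for v in present) / (n * std_dev ** 4)
    return {
        "Count": len(values),
        "Not missing count": n,
        "Min": min(present),
        "Max": max(present),
        "Mean": mean,
        "Variance": variance,
        "Standard deviation": std_dev,
        "Skewness": skewness,
        "Kurtosis": kurtosis,
        # Three cut points that split the data into four groups.
        "Quantiles": statistics.quantiles(present, n=4),
    }


# Made-up sample column with one missing entry.
column = [2.0, 4.0, None, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
profile = profile_numeric_column(column)
```

For this sample, Count is 9 and Not missing count is 8, which shows how a gap between the two values points to missing entries. The positive skewness reflects the longer tail toward the larger values.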

    Storage access and permissions

    To ensure you securely connect to your Azure storage service, Azure Machine Learning requires that you have permission to access the corresponding data storage. This access depends on the authentication credentials used to register the datastore.

    Virtual network

    If your data storage account is in a virtual network, extra configuration steps are required to ensure Azure Machine Learning has access to your data. See Use Azure Machine Learning studio in a virtual network to ensure the appropriate configuration steps are applied when you create and register your datastore.

    Access validation

    Warning

    Cross tenant access to storage accounts is not supported. If cross tenant access is needed for your scenario, please reach out to the AzureML Data Support team alias at amldatasupport@microsoft.com for assistance with a custom code solution.

    As part of the initial datastore creation and registration process, Azure Machine Learning automatically validates that the underlying storage service exists and that the user-provided principal (username, service principal, or SAS token) has access to the specified storage.

    After datastore creation, this validation is only performed for methods that require access to the underlying storage container, not each time datastore objects are retrieved. For example, validation happens if you want to download files from your datastore; but if you just want to change your default datastore, then validation doesn't happen.

    To authenticate your access to the underlying storage service, you can provide an account key, a shared access signature (SAS) token, or a service principal, depending on the datastore type you want to create. The storage type matrix lists the supported authentication types that correspond to each datastore type.

    You can find account key, SAS token, and service principal information in the Azure portal.

    • If you plan to use an account key or SAS token for authentication, select Storage Accounts on the left pane, and choose the storage account that you want to register.

      • The Overview page provides information such as the account name, container, and file share name.
        1. For account keys, go to Access keys on the Settings pane.
        2. For SAS tokens, go to Shared access signatures on the Settings pane.
    • If you plan to use a service principal for authentication, go to your App registrations and select which app you want to use.

      • Its corresponding Overview page contains the required information, like tenant ID and client ID.

    Important

    • If you need to change your access keys for an Azure Storage account (account key or SAS token), be sure to sync the new credentials with your workspace and the datastores connected to it. Learn how to sync your updated credentials.

    • If you unregister and re-register a datastore with the same name, and it fails, the Azure Key Vault for your workspace may not have soft-delete enabled. By default, soft-delete is enabled for the key vault instance created by your workspace, but it may not be enabled if you used an existing key vault or have a workspace created prior to October 2020. For information on how to enable soft-delete, see Turn on Soft Delete for an existing key vault.

    Permissions

    For Azure blob container and Azure Data Lake Gen 2 storage, make sure your authentication credentials have Storage Blob Data Reader access. Learn more about Storage Blob Data Reader. An account SAS token defaults to no permissions.

    • For data read access, your authentication credentials must have a minimum of list and read permissions for containers and objects.

    • For data write access, write and add permissions are also required.
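
The permission rules above can be checked before you register a datastore. The following sketch reads the signed permissions field (`sp`) of a SAS token, where `r`, `l`, `w`, and `a` stand for read, list, write, and add. The helper names and the sample tokens are hypothetical, and the tokens carry placeholder signatures that are not valid. The sketch inspects only the permission letters; it doesn't verify the signature, expiry, or resource scope.

```python
from urllib.parse import parse_qs

# SAS permission letters in the `sp` field: r=read, l=list, w=write, a=add.
READ_PERMISSIONS = set("rl")
WRITE_PERMISSIONS = set("rlwa")


def sas_permissions(sas_token):
    """Extract the set of permission letters from a SAS token query string."""
    fields = parse_qs(sas_token.lstrip("?"))
    return set(fields.get("sp", [""])[0])


def allows_read(sas_token):
    """True when the token grants at least list and read."""
    return READ_PERMISSIONS <= sas_permissions(sas_token)


def allows_write(sas_token):
    """True when the token grants list, read, write, and add."""
    return WRITE_PERMISSIONS <= sas_permissions(sas_token)


# Made-up tokens for illustration only; the signatures are placeholders.
read_only_token = "?sv=2022-11-02&ss=b&srt=co&sp=rl&sig=PLACEHOLDER"
read_write_token = "sv=2022-11-02&ss=b&srt=co&sp=rwlac&sig=PLACEHOLDER"
no_permission_token = "sv=2022-11-02&ss=b&srt=co&sig=PLACEHOLDER"
```

With these samples, `allows_read(read_only_token)` is true while `allows_write(read_only_token)` is false, and a token without an `sp` field fails both checks.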

    Train with datasets

    Use your datasets in your machine learning experiments for training ML models. Learn more about how to train with datasets.

    Next steps