Create and explore Azure Machine Learning datasets with labels

10/16/2024

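In this article, you learn how to export data labels from an Azure Machine Learning data labeling project and load them into a popular format, such as a pandas dataframe, for data exploration.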

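What are datasets with labels?

Azure Machine Learning datasets with labels are referred to as labeled datasets. A labeled dataset is a TabularDataset with a dedicated label column, and it is created only as the output of an Azure Machine Learning data labeling project. You create a data labeling project for either image labeling or text labeling. Machine Learning supports data labeling projects for image classification (multi-label or multi-class) and for object identification with bounding boxes.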

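Prerequisites

An Azure subscription. If you don't have an Azure subscription, create a free account before you begin.
The Azure Machine Learning SDK for Python, or access to Azure Machine Learning studio.
A Machine Learning workspace. See Create workspace resources.
Access to an Azure Machine Learning data labeling project. If you don't have a labeling project, first create one for image labeling or text labeling.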

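Export data labels

When you complete a data labeling project, you can export the label data from it. The export captures both the reference to the data and its labels, in either COCO format or as an Azure Machine Learning dataset.

Use the Export button on the Project details page of your labeling project.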

COCO

The COCO file is created in the default blob store of the Azure Machine Learning workspace, in a folder within export/coco.
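
As a rough sketch of how the exported file might be inspected once it has been downloaded from the blob store, the snippet below assumes the export follows the standard COCO layout, with top-level images, annotations, and categories lists. The inline sample document, its file names, and the helper annotations_by_image are hypothetical stand-ins, not output from a real labeling project.

```python
import json
from collections import defaultdict

# Hypothetical, hand-written stand-in for an exported COCO file.
# A real export would be downloaded from the workspace's default blob store.
SAMPLE_COCO = """
{
  "images": [
    {"id": 1, "file_name": "cat_001.jpg", "width": 0.0, "height": 0.0},
    {"id": 2, "file_name": "dog_001.jpg", "width": 0.0, "height": 0.0}
  ],
  "categories": [
    {"id": 1, "name": "cat"},
    {"id": 2, "name": "dog"}
  ],
  "annotations": [
    {"id": 1, "image_id": 1, "category_id": 1, "bbox": [0.1, 0.2, 0.3, 0.4]},
    {"id": 2, "image_id": 2, "category_id": 2, "bbox": [0.5, 0.5, 0.25, 0.25]},
    {"id": 3, "image_id": 2, "category_id": 1, "bbox": [0.0, 0.0, 0.1, 0.1]}
  ]
}
"""


def annotations_by_image(coco):
    """Group (category name, bbox) pairs under each image's file name."""
    names = {c["id"]: c["name"] for c in coco["categories"]}
    files = {i["id"]: i["file_name"] for i in coco["images"]}
    grouped = defaultdict(list)
    for ann in coco["annotations"]:
        grouped[files[ann["image_id"]]].append(
            (names[ann["category_id"]], ann["bbox"])
        )
    return dict(grouped)


coco = json.loads(SAMPLE_COCO)
labels = annotations_by_image(coco)
for file_name, items in labels.items():
    print(file_name, items)
```

Indexing annotations by file name up front keeps later lookups simple when you iterate over images rather than over annotations.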

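Note

In object detection projects, the bbox: [x, y, width, height] values exported in the COCO file are normalized, that is, scaled to the range 0 to 1. For example, in a 640x480 pixel image, a bounding box located at (10, 10) with a width of 30 pixels and a height of 60 pixels is annotated as (0.015625, 0.02083, 0.046875, 0.125). Because the coordinates are normalized, the file shows "width" and "height" as '0.0' for all images. You can get the actual width and height with a Python library such as OpenCV or Pillow (PIL).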
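
The arithmetic in the note can be reversed to recover pixel coordinates. The following is a minimal sketch under the assumption that x and width are scaled by the image width, and y and height by the image height, as the example values imply. The helper name denormalize_bbox is hypothetical.

```python
def denormalize_bbox(bbox, image_width, image_height):
    """Convert a normalized [x, y, width, height] box to pixel units."""
    x, y, w, h = bbox
    return (x * image_width, y * image_height, w * image_width, h * image_height)


# Values from the example in the note: a 640x480 pixel image.
# With Pillow, the size could be read as: width, height = Image.open(path).size
normalized = (0.015625, 0.02083, 0.046875, 0.125)

# Round to whole pixels, because the exported values are truncated decimals.
pixels = tuple(round(v) for v in denormalize_bbox(normalized, 640, 480))
print(pixels)  # (10, 10, 30, 60)
```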

Azure Machine Learning dataset

You can access the exported Azure Machine Learning dataset in the Datasets section of Azure Machine Learning studio. The dataset details page also provides sample code for accessing your labels from Python.

Figure: The exported dataset.

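Tip

After you export your labeled data to an Azure Machine Learning dataset, you can use AutoML to build computer vision models trained on that labeled data. To learn more, see Set up AutoML to train computer vision models with Python.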

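Explore labeled datasets with a pandas dataframe

Load your labeled dataset into a pandas dataframe with the to_pandas_dataframe() method, provided by the azureml-dataprep package, so that you can use popular open-source libraries for data exploration.

Install the package with the following shell command: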

pip install azureml-dataprep

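In the following code, the animal_labels dataset is the output of a labeling project that was previously saved to the workspace. The exported dataset is a TabularDataset.

Applies to: Python SDK azureml v1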

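import azureml.core
from azureml.core import Dataset, Workspace

# Connect to the workspace. This assumes a config.json file for your
# workspace is available locally; adjust to match how you authenticate.
workspace = Workspace.from_config()

# Get the animal_labels dataset from the workspace
animal_labels = Dataset.get_by_name(workspace, 'animal_labels')
animal_pd = animal_labels.to_pandas_dataframe()

import matplotlib.pyplot as plt
import matplotlib.image as mpimg

# Read the first image referenced by the dataset and display it
img = mpimg.imread(animal_pd['image_url'].iloc[0].open())
imgplot = plt.imshow(img)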
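
Once the labels are in a dataframe, ordinary pandas operations apply. The sketch below uses a small hand-made dataframe in place of animal_pd so that it runs without a workspace. The label column name and the sample values are assumptions; check animal_pd.columns for the names your export actually uses.

```python
import pandas as pd

# Hypothetical stand-in for animal_pd; column names and values are assumptions.
sample_pd = pd.DataFrame(
    {
        "image_url": ["cat_001.jpg", "dog_001.jpg", "cat_002.jpg", "bird_001.jpg"],
        "label": ["cat", "dog", "cat", "bird"],
    }
)

# Count how many examples carry each label to spot class imbalance.
label_counts = sample_pd["label"].value_counts()
print(label_counts)

# Select only the rows for one label.
cats = sample_pd[sample_pd["label"] == "cat"]
print(cats["image_url"].tolist())
```

Checking the label distribution before training is a quick way to catch an under-represented class early.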
