The MNIST database of handwritten digits

2024-09-03

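The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. The digits have been size-normalized and centered in a fixed-size image.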

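Note

Microsoft provides Azure Open Datasets on an "as is" basis. Microsoft makes no warranties, express or implied, guarantees, or conditions with respect to your use of the datasets. To the extent permitted under your local law, Microsoft disclaims all liability for any damages or losses, including direct, consequential, special, indirect, incidental, or punitive, resulting from your use of the datasets.

This dataset is provided under the original terms under which Microsoft received the source data. The dataset may include data sourced from Microsoft.

This dataset is sourced from the MNIST database of handwritten digits. It is a subset of the larger NIST Hand-printed Forms and Characters Database published by the National Institute of Standards and Technology.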

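Storage location

Blob account: azureopendatastorage
Container name: mnist

Four files are available directly in the container:

train-images-idx3-ubyte.gz: training set images (9,912,422 bytes)
train-labels-idx1-ubyte.gz: training set labels (28,881 bytes)
t10k-images-idx3-ubyte.gz: test set images (1,648,877 bytes)
t10k-labels-idx1-ubyte.gz: test set labels (4,542 bytes)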
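
The account name, container name, and file names above are enough to form the public HTTPS address of each file. The following is a minimal sketch, assuming the standard `https://<account>.blob.core.windows.net/<container>/<file>` pattern, which matches the URLs used in the direct-download code in this article. The helper `blob_url` is a hypothetical name introduced for illustration only.

```python
# Sketch: build the public HTTPS URL for each raw MNIST file.
# ACCOUNT, CONTAINER and the file names come from the listing above;
# blob_url() is a hypothetical helper, not part of any Azure SDK.
ACCOUNT = "azureopendatastorage"
CONTAINER = "mnist"
FILES = [
    "train-images-idx3-ubyte.gz",
    "train-labels-idx1-ubyte.gz",
    "t10k-images-idx3-ubyte.gz",
    "t10k-labels-idx1-ubyte.gz",
]

def blob_url(file_name, account=ACCOUNT, container=CONTAINER):
    """Return the HTTPS URL of one blob in the container."""
    return f"https://{account}.blob.core.windows.net/{container}/{file_name}"

for name in FILES:
    print(blob_url(name))
```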

Data access

Load MNIST into a data frame using Azure Machine Learning tabular datasets.

For more information on Azure Machine Learning datasets, see Create Azure Machine Learning datasets.

Get the complete dataset into a data frame

from azureml.opendatasets import MNIST

mnist = MNIST.get_tabular_dataset()
mnist_df = mnist.to_pandas_dataframe()
mnist_df.info()

Get the train and test data frames

mnist_train = MNIST.get_tabular_dataset(dataset_filter='train')
mnist_train_df = mnist_train.to_pandas_dataframe()
X_train = mnist_train_df.drop("label", axis=1).astype(int).values/255.0
y_train = mnist_train_df.filter(items=["label"]).astype(int).values

mnist_test = MNIST.get_tabular_dataset(dataset_filter='test')
mnist_test_df = mnist_test.to_pandas_dataframe()
X_test = mnist_test_df.drop("label", axis=1).astype(int).values/255.0
y_test = mnist_test_df.filter(items=["label"]).astype(int).values
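
Because `filter(items=["label"])` returns a one-column data frame, `y_train` and `y_test` above have shape `(n, 1)`. Many estimators expect a one-dimensional label vector instead. The sketch below shows one way to obtain that shape. It runs on a tiny synthetic frame rather than on MNIST itself; the helper name `split_features_labels` and the pixel column names are assumptions made for illustration, and only the `label` column name comes from the code above.

```python
import numpy as np
import pandas as pd

def split_features_labels(df, label_col="label"):
    """Return (X, y): pixels scaled to [0, 1] and labels as a 1-D int array."""
    X = df.drop(columns=[label_col]).to_numpy(dtype=np.float64) / 255.0
    y = df[label_col].to_numpy(dtype=np.int64)
    return X, y

# Tiny synthetic stand-in for the MNIST frame: 3 "images" of 4 pixels each.
# The pixel column names "0".."3" are hypothetical.
toy_df = pd.DataFrame(
    {
        "0": [0, 255, 51],
        "1": [255, 0, 51],
        "2": [0, 0, 51],
        "3": [255, 255, 51],
        "label": [7, 2, 1],
    }
)

X, y = split_features_labels(toy_df)
print(X.shape, y.shape)
```

With the real data, the same call would look like `X_train, y_train = split_features_labels(mnist_train_df)`.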

Plot some images of the digits

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# Show some randomly chosen images from the training set.
count = 0
sample_size = 30
plt.figure(figsize=(16, 6))
for i in np.random.permutation(X_train.shape[0])[:sample_size]:
    count = count + 1
    plt.subplot(1, sample_size, count)
    # Hide the axes so each small subplot shows only the digit.
    plt.axis("off")
    # Print the label above the image.
    plt.text(x=10, y=-10, s=y_train[i], fontsize=18)
    plt.imshow(X_train[i].reshape(28, 28), cmap=plt.cm.Greys)
plt.show()
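
Thirty subplots in a single row leave each digit quite narrow. One possible alternative, sketched here under the assumption that every image has the same height and width, is to tile the flattened images into one two-dimensional mosaic and draw it with a single `imshow` call. The function name `tile_images` is hypothetical, and the demonstration uses small synthetic arrays instead of MNIST.

```python
import numpy as np

def tile_images(flat_images, rows, cols, height, width):
    """Arrange rows*cols flattened images into one (rows*height, cols*width) array."""
    if flat_images.shape != (rows * cols, height * width):
        raise ValueError("expected rows*cols images of height*width pixels each")
    grid = flat_images.reshape(rows, cols, height, width)
    # Bring each image's pixel rows next to its grid row, then merge the axes.
    grid = grid.transpose(0, 2, 1, 3)
    return grid.reshape(rows * height, cols * width)

# Six synthetic 2x2 "images" laid out as 2 rows by 3 columns.
demo_images = np.arange(6 * 4).reshape(6, 4)
mosaic = tile_images(demo_images, rows=2, cols=3, height=2, width=2)
print(mosaic.shape)
```

For MNIST, a call such as `tile_images(X_train[:30], 3, 10, 28, 28)` would produce an 84 by 280 array that can be passed to `plt.imshow(..., cmap=plt.cm.Greys)`.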

Download or mount MNIST raw files using Azure Machine Learning file datasets.

This works only for Linux-based compute. For more information on Azure Machine Learning datasets, see Create Azure Machine Learning datasets.

mnist_file = MNIST.get_file_dataset()
mnist_file

mnist_file.to_path()

Download files to local storage

import os
import tempfile

data_folder = tempfile.mkdtemp()
data_paths = mnist_file.download(data_folder, overwrite=True)
data_paths
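
The byte counts given in the storage section can serve as a quick check that a download completed. The sketch below compares file sizes on disk with those counts. The helper `find_size_mismatches` is a hypothetical name, and the demonstration writes a deliberately short stand-in file to a temporary folder instead of touching real downloads.

```python
import os
import tempfile

# Compressed sizes in bytes, taken from the file listing in this article.
EXPECTED_SIZES = {
    "train-images-idx3-ubyte.gz": 9_912_422,
    "train-labels-idx1-ubyte.gz": 28_881,
    "t10k-images-idx3-ubyte.gz": 1_648_877,
    "t10k-labels-idx1-ubyte.gz": 4_542,
}

def find_size_mismatches(paths, expected=EXPECTED_SIZES):
    """Return {file_name: (actual, expected)} for known files whose size differs."""
    mismatches = {}
    for path in paths:
        name = os.path.basename(path)
        if name in expected:
            actual = os.path.getsize(path)
            if actual != expected[name]:
                mismatches[name] = (actual, expected[name])
    return mismatches

# Demonstration with a deliberately truncated stand-in file.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "t10k-labels-idx1-ubyte.gz")
with open(demo_path, "wb") as f:
    f.write(b"\x00" * 100)
print(find_size_mismatches([demo_path]))
```

Assuming `data_paths` holds the local paths of the downloaded files, as the mounting code in this article treats it, `find_size_mismatches(data_paths)` should return an empty dictionary after a complete download.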

Mount files. Useful when the training job will run on a remote compute.

import gzip
import struct
import pandas as pd
import numpy as np

# load compressed MNIST gz files and return pandas dataframe of numpy arrays
def load_data(filename, label=False):
    with gzip.open(filename) as gz:
        gz.read(4)
        n_items = struct.unpack('>I', gz.read(4))
        if not label:
            n_rows = struct.unpack('>I', gz.read(4))[0]
            n_cols = struct.unpack('>I', gz.read(4))[0]
            res = np.frombuffer(gz.read(n_items[0] * n_rows * n_cols), dtype=np.uint8)
            res = res.reshape(n_items[0], n_rows * n_cols)
        else:
            res = np.frombuffer(gz.read(n_items[0]), dtype=np.uint8)
            res = res.reshape(n_items[0], 1)
    return pd.DataFrame(res)

import sys
mount_point = tempfile.mkdtemp()
print(mount_point)
print(os.path.exists(mount_point))

if sys.platform == 'linux':
    print("start mounting....")
    with mnist_file.mount(mount_point):
        print("list dir...")
        print(os.listdir(mount_point))
        print("get the dataframe info of mounted data...")
        # Read from the mounted folder rather than from the earlier download.
        train_images_df = load_data(os.path.join(mount_point, "train-images-idx3-ubyte.gz"))
        print(train_images_df.info())

Alternatively, download the raw files directly from the public blob container over HTTPS.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

import urllib.request
import os

data_folder = os.path.join(os.getcwd(), 'data')
os.makedirs(data_folder, exist_ok=True)

urllib.request.urlretrieve('https://azureopendatastorage.blob.core.windows.net/mnist/train-images-idx3-ubyte.gz',
                           filename=os.path.join(data_folder, 'train-images.gz'))
urllib.request.urlretrieve('https://azureopendatastorage.blob.core.windows.net/mnist/train-labels-idx1-ubyte.gz',
                           filename=os.path.join(data_folder, 'train-labels.gz'))
urllib.request.urlretrieve('https://azureopendatastorage.blob.core.windows.net/mnist/t10k-images-idx3-ubyte.gz',
                           filename=os.path.join(data_folder, 'test-images.gz'))
urllib.request.urlretrieve('https://azureopendatastorage.blob.core.windows.net/mnist/t10k-labels-idx1-ubyte.gz',
                           filename=os.path.join(data_folder, 'test-labels.gz'))

import gzip
import struct

# load compressed MNIST gz files and return numpy arrays
def load_data(filename, label=False):
    with gzip.open(filename) as gz:
        # skip the 4-byte magic number (big-endian, like every header field)
        struct.unpack('>I', gz.read(4))
        # number of items (images or labels) stored in the file
        n_items = struct.unpack('>I', gz.read(4))
        if not label:
            # image files also store the row and column counts
            n_rows = struct.unpack('>I', gz.read(4))[0]
            n_cols = struct.unpack('>I', gz.read(4))[0]
            res = np.frombuffer(gz.read(n_items[0] * n_rows * n_cols), dtype=np.uint8)
            # one flattened image per row
            res = res.reshape(n_items[0], n_rows * n_cols)
        else:
            res = np.frombuffer(gz.read(n_items[0]), dtype=np.uint8)
            res = res.reshape(n_items[0], 1)
    return res
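
The `load_data` function above skips the first four bytes of each file. In the IDX format those bytes are a magic number: two zero bytes, a data-type byte (0x08 for unsigned bytes), and a byte giving the number of dimensions, followed by one big-endian 32-bit size per dimension. The sketch below reads that header in a general way. The helper name `read_idx_header` is hypothetical, and the demonstration parses a tiny gzip stream built in memory, not a real MNIST file.

```python
import gzip
import io
import struct

IMAGE_MAGIC = 0x00000803  # unsigned bytes, 3 dimensions (items, rows, cols)
LABEL_MAGIC = 0x00000801  # unsigned bytes, 1 dimension (items)

def read_idx_header(stream):
    """Read an IDX header and return (magic, dims), where dims is a tuple of sizes."""
    (magic,) = struct.unpack(">I", stream.read(4))
    n_dims = magic & 0xFF  # the low byte stores the number of dimensions
    dims = struct.unpack(">" + "I" * n_dims, stream.read(4 * n_dims))
    return magic, dims

# Build a tiny in-memory gzip file: 2 images of 3x3 pixels.
payload = struct.pack(">IIII", IMAGE_MAGIC, 2, 3, 3) + bytes(range(18))
buffer = io.BytesIO()
with gzip.GzipFile(fileobj=buffer, mode="wb") as gz:
    gz.write(payload)
buffer.seek(0)

with gzip.open(buffer) as gz:
    magic, dims = read_idx_header(gz)
    pixels = gz.read()
print(hex(magic), dims, len(pixels))
```

Comparing `magic` against `IMAGE_MAGIC` or `LABEL_MAGIC` before reading the payload is one way to catch a labels file being passed where an images file was expected.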

# note we also shrink the intensity values (X) from 0-255 to 0-1. This helps the model converge faster.
X_train = load_data(os.path.join(
    data_folder, 'train-images.gz'), False) / 255.0
X_test = load_data(os.path.join(data_folder, 'test-images.gz'), False) / 255.0
y_train = load_data(os.path.join(
    data_folder, 'train-labels.gz'), True).reshape(-1)
y_test = load_data(os.path.join(
    data_folder, 'test-labels.gz'), True).reshape(-1)

# Show some randomly chosen images from the training set.
count = 0
sample_size = 30
plt.figure(figsize=(16, 6))
for i in np.random.permutation(X_train.shape[0])[:sample_size]:
    count = count + 1
    plt.subplot(1, sample_size, count)
    # Hide the axes so each small subplot shows only the digit.
    plt.axis("off")
    # Print the label above the image.
    plt.text(x=10, y=-10, s=y_train[i], fontsize=18)
    plt.imshow(X_train[i].reshape(28, 28), cmap=plt.cm.Greys)
plt.show()

Azure Databricks


Load MNIST into a data frame using Azure Machine Learning tabular datasets.

For more information on Azure Machine Learning datasets, see Create Azure Machine Learning datasets.

Get the complete dataset into a data frame

# This is a package in preview.
from azureml.opendatasets import MNIST

mnist = MNIST.get_tabular_dataset()
mnist_df = mnist.to_spark_dataframe()

display(mnist_df.limit(5))

Download or mount MNIST raw files using Azure Machine Learning file datasets.

This works only for Linux-based compute. For more information on Azure Machine Learning datasets, see Create Azure Machine Learning datasets.

mnist_file = MNIST.get_file_dataset()
mnist_file

mnist_file.to_path()

Download files to local storage

import os
import tempfile

mount_point = tempfile.mkdtemp()
mnist_file.download(mount_point, overwrite=True)

Mount files. Useful when the training job will run on a remote compute.

import gzip
import struct
import pandas as pd
import numpy as np

# load compressed MNIST gz files and return a pandas dataframe
def load_data(filename, label=False):
    with gzip.open(filename) as gz:
        gz.read(4)
        n_items = struct.unpack('>I', gz.read(4))
        if not label:
            n_rows = struct.unpack('>I', gz.read(4))[0]
            n_cols = struct.unpack('>I', gz.read(4))[0]
            res = np.frombuffer(gz.read(n_items[0] * n_rows * n_cols), dtype=np.uint8)
            res = res.reshape(n_items[0], n_rows * n_cols)
        else:
            res = np.frombuffer(gz.read(n_items[0]), dtype=np.uint8)
            res = res.reshape(n_items[0], 1)
    return pd.DataFrame(res)

import sys
mount_point = tempfile.mkdtemp()
print(mount_point)
print(os.path.exists(mount_point))
print(os.listdir(mount_point))

if sys.platform == 'linux':
    print("start mounting....")
    # Bind the mount context to a name so its mount_point can be printed.
    with mnist_file.mount(mount_point) as context:
        print(context.mount_point)
        print(os.listdir(mount_point))
        train_images_df = load_data(os.path.join(mount_point, 'train-images-idx3-ubyte.gz'))
        print(train_images_df.info())

Next steps

View the rest of the datasets in the Open Datasets catalog.
