The MNIST database of handwritten digits

09/03/2024

The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. The digits have been size-normalized and centered in a fixed-size image.

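Note

Microsoft provides Azure Open Datasets on an "as is" basis. Microsoft makes no warranties, express or implied, and sets no guarantees or conditions with respect to your use of the datasets. To the extent permitted under your local law, Microsoft disclaims all liability for any damages or losses, including direct, consequential, special, indirect, incidental, or punitive, resulting from your use of the datasets.

This dataset is provided under the original terms under which Microsoft received the source data. The dataset may include data sourced from Microsoft.

This dataset is sourced from the MNIST database of handwritten digits. It is a subset of the larger NIST Hand-printed Forms and Characters Database published by the National Institute of Standards and Technology.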

Storage location

Blob account: azureopendatastorage
Container name: mnist

Four files are available directly in the container:

train-images-idx3-ubyte.gz: training set images (9,912,422 bytes)
train-labels-idx1-ubyte.gz: training set labels (28,881 bytes)
t10k-images-idx3-ubyte.gz: test set images (1,648,877 bytes)
t10k-labels-idx1-ubyte.gz: test set labels (4,542 bytes)
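
The byte counts listed above can double as a quick integrity check after a download. The sketch below is a minimal illustration under stated assumptions: it builds each file's public URL from the account and container names, assuming the `https://<account>.blob.core.windows.net/<container>/<file>` pattern, and compares on-disk sizes with the listed values. The names `EXPECTED_SIZES`, `blob_url`, and `size_mismatches` are hypothetical helpers, not part of any Azure SDK.

```python
import os

ACCOUNT = "azureopendatastorage"
CONTAINER = "mnist"

# Compressed sizes in bytes, as listed above.
EXPECTED_SIZES = {
    "train-images-idx3-ubyte.gz": 9912422,
    "train-labels-idx1-ubyte.gz": 28881,
    "t10k-images-idx3-ubyte.gz": 1648877,
    "t10k-labels-idx1-ubyte.gz": 4542,
}


def blob_url(file_name, account=ACCOUNT, container=CONTAINER):
    """Build the public HTTPS URL for a file in the container (assumed URL pattern)."""
    return f"https://{account}.blob.core.windows.net/{container}/{file_name}"


def size_mismatches(folder, expected=EXPECTED_SIZES):
    """Return {file_name: actual_size_or_None} for files that are missing or have the wrong size."""
    problems = {}
    for name, size in expected.items():
        path = os.path.join(folder, name)
        actual = os.path.getsize(path) if os.path.exists(path) else None
        if actual != size:
            problems[name] = actual
    return problems


for name in EXPECTED_SIZES:
    print(blob_url(name))
```

An empty dictionary from `size_mismatches(folder)` means every file is present with the expected size. If the files are saved under different local names, pass a matching `expected` mapping.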

Data access

Azure Notebooks

Load MNIST into a data frame by using Azure Machine Learning tabular datasets.

For more information about Azure Machine Learning datasets, see "Create Azure Machine Learning datasets."

Get the complete dataset into a data frame

from azureml.opendatasets import MNIST

mnist = MNIST.get_tabular_dataset()
mnist_df = mnist.to_pandas_dataframe()
mnist_df.info()

Get the training and test data frames

mnist_train = MNIST.get_tabular_dataset(dataset_filter='train')
mnist_train_df = mnist_train.to_pandas_dataframe()
X_train = mnist_train_df.drop("label", axis=1).astype(int).values/255.0
y_train = mnist_train_df.filter(items=["label"]).astype(int).values

mnist_test = MNIST.get_tabular_dataset(dataset_filter='test')
mnist_test_df = mnist_test.to_pandas_dataframe()
X_test = mnist_test_df.drop("label", axis=1).astype(int).values/255.0
y_test = mnist_test_df.filter(items=["label"]).astype(int).values
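
To see what the feature and label arrays look like without downloading anything, the following sketch applies the same `drop` and `filter` steps to a tiny synthetic data frame. It assumes the tabular dataset exposes one `label` column plus 784 pixel columns (28 × 28, matching the reshape used when plotting); the pixel column names here are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for mnist_train_df: 3 rows, 784 pixel columns, 1 label column.
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(3, 784))
fake_df = pd.DataFrame(pixels, columns=[f"px{i}" for i in range(784)])  # hypothetical column names
fake_df["label"] = [5, 0, 4]

# Same transformation as above: scale pixels to [0, 1], keep labels as integers.
X = fake_df.drop("label", axis=1).astype(int).values / 255.0
y = fake_df.filter(items=["label"]).astype(int).values

print(X.shape, X.dtype)  # features: one row per image, floating point
print(y.shape)           # labels come back as a column vector of shape (n, 1)
print(y.ravel())         # flatten to shape (n,) when a library expects 1-D labels
```

Note that `filter(items=["label"]).values` yields a two-dimensional array; `ravel()` or `reshape(-1)` gives the flat label vector that many training APIs expect.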

Plot some images of the digits

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# Show some randomly chosen images from the training set.
count = 0
sample_size = 30
plt.figure(figsize=(16, 6))
for i in np.random.permutation(X_train.shape[0])[:sample_size]:
    count = count + 1
    plt.subplot(1, sample_size, count)
    plt.axis('off')  # hide axis ticks and frame
    plt.text(x=10, y=-10, s=y_train[i], fontsize=18)
    plt.imshow(X_train[i].reshape(28, 28), cmap=plt.cm.Greys)
plt.show()

Download or mount the MNIST raw files by using Azure Machine Learning file datasets

This works only on Linux-based compute. For more information about Azure Machine Learning datasets, see "Create Azure Machine Learning datasets."

mnist_file = MNIST.get_file_dataset()
mnist_file

mnist_file.to_path()

Download files to local storage

import os
import tempfile

data_folder = tempfile.mkdtemp()
data_paths = mnist_file.download(data_folder, overwrite=True)
data_paths

Mount files (useful when the training job runs on remote compute)

import gzip
import struct
import pandas as pd
import numpy as np

# load compressed MNIST gz files and return pandas dataframe of numpy arrays
def load_data(filename, label=False):
    with gzip.open(filename) as gz:
        gz.read(4)
        n_items = struct.unpack('>I', gz.read(4))
        if not label:
            n_rows = struct.unpack('>I', gz.read(4))[0]
            n_cols = struct.unpack('>I', gz.read(4))[0]
            res = np.frombuffer(gz.read(n_items[0] * n_rows * n_cols), dtype=np.uint8)
            res = res.reshape(n_items[0], n_rows * n_cols)
        else:
            res = np.frombuffer(gz.read(n_items[0]), dtype=np.uint8)
            res = res.reshape(n_items[0], 1)
    return pd.DataFrame(res)

import sys
mount_point = tempfile.mkdtemp()
print(mount_point)
print(os.path.exists(mount_point))

# Mounting is supported only on Linux-based compute.
if sys.platform == 'linux':
    print("start mounting....")
    with mnist_file.mount(mount_point):
        print("list dir...")
        print(os.listdir(mount_point))
        print("get the dataframe info of mounted data...")
        # Read from the mounted files rather than from the earlier download.
        train_images_df = load_data(os.path.join(mount_point, 'train-images-idx3-ubyte.gz'))
        print(train_images_df.info())
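
The `load_data` helper above relies on the IDX layout: a 4-byte magic number, then one big-endian 32-bit integer per dimension, then the raw unsigned bytes. The following sketch writes a tiny synthetic image file in that layout and reads it back with `struct`, using only the standard library. The magic number 2051 is the value carried by the MNIST image files (the label files use 2049); the function names and the 2 × 2 "images" are illustrative only.

```python
import gzip
import os
import struct
import tempfile


def write_idx_images(path, images):
    """Write equally sized 2-D images (lists of rows of ints 0-255) as a gzipped IDX3 file."""
    n_rows, n_cols = len(images[0]), len(images[0][0])
    with gzip.open(path, "wb") as gz:
        # Header: magic number, item count, rows, columns (all big-endian uint32).
        gz.write(struct.pack(">IIII", 2051, len(images), n_rows, n_cols))
        for image in images:
            for row in image:
                gz.write(bytes(row))


def read_idx_images(path):
    """Read a gzipped IDX3 file and return (n_items, n_rows, n_cols, flat_pixel_bytes)."""
    with gzip.open(path, "rb") as gz:
        magic, n_items, n_rows, n_cols = struct.unpack(">IIII", gz.read(16))
        if magic != 2051:
            raise ValueError(f"unexpected magic number: {magic}")
        pixels = gz.read(n_items * n_rows * n_cols)
    return n_items, n_rows, n_cols, pixels


path = os.path.join(tempfile.mkdtemp(), "tiny-images-idx3-ubyte.gz")
write_idx_images(path, [[[0, 255], [128, 64]], [[1, 2], [3, 4]]])
n_items, n_rows, n_cols, pixels = read_idx_images(path)
print(n_items, n_rows, n_cols, list(pixels))
```

Checking the magic number, instead of skipping it, catches the common mistake of passing a label file where an image file is expected.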

# Alternative: download the raw files directly from the public blob container.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

import urllib.request
import os

data_folder = os.path.join(os.getcwd(), 'data')
os.makedirs(data_folder, exist_ok=True)

urllib.request.urlretrieve('https://azureopendatastorage.blob.core.windows.net/mnist/train-images-idx3-ubyte.gz',
                           filename=os.path.join(data_folder, 'train-images.gz'))
urllib.request.urlretrieve('https://azureopendatastorage.blob.core.windows.net/mnist/train-labels-idx1-ubyte.gz',
                           filename=os.path.join(data_folder, 'train-labels.gz'))
urllib.request.urlretrieve('https://azureopendatastorage.blob.core.windows.net/mnist/t10k-images-idx3-ubyte.gz',
                           filename=os.path.join(data_folder, 'test-images.gz'))
urllib.request.urlretrieve('https://azureopendatastorage.blob.core.windows.net/mnist/t10k-labels-idx1-ubyte.gz',
                           filename=os.path.join(data_folder, 'test-labels.gz'))

import gzip
import struct

# load compressed MNIST gz files and return numpy arrays
def load_data(filename, label=False):
    with gzip.open(filename) as gz:
        gz.read(4)  # skip the 4-byte magic number at the start of the IDX file
        n_items = struct.unpack('>I', gz.read(4))
        if not label:
            n_rows = struct.unpack('>I', gz.read(4))[0]
            n_cols = struct.unpack('>I', gz.read(4))[0]
            res = np.frombuffer(gz.read(n_items[0] * n_rows * n_cols), dtype=np.uint8)
            res = res.reshape(n_items[0], n_rows * n_cols)
        else:
            res = np.frombuffer(gz.read(n_items[0]), dtype=np.uint8)
            res = res.reshape(n_items[0], 1)
    return res

# note we also shrink the intensity values (X) from 0-255 to 0-1. This helps the model converge faster.
X_train = load_data(os.path.join(
    data_folder, 'train-images.gz'), False) / 255.0
X_test = load_data(os.path.join(data_folder, 'test-images.gz'), False) / 255.0
y_train = load_data(os.path.join(
    data_folder, 'train-labels.gz'), True).reshape(-1)
y_test = load_data(os.path.join(
    data_folder, 'test-labels.gz'), True).reshape(-1)

# Show some randomly chosen images from the training set.
count = 0
sample_size = 30
plt.figure(figsize=(16, 6))
for i in np.random.permutation(X_train.shape[0])[:sample_size]:
    count = count + 1
    plt.subplot(1, sample_size, count)
    plt.axis('off')  # hide axis ticks and frame
    plt.text(x=10, y=-10, s=y_train[i], fontsize=18)
    plt.imshow(X_train[i].reshape(28, 28), cmap=plt.cm.Greys)
plt.show()
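
The division by 255.0 in the loading step maps 8-bit intensities onto the range [0, 1] and converts the array to floating point. A small sketch on made-up pixel values shows the effect, and how a flattened 784-value row maps back to a 28 × 28 grid for plotting. The array contents are invented for illustration.

```python
import numpy as np

# Two made-up flattened "images" with uint8 intensities, standing in for load_data output.
raw = np.zeros((2, 784), dtype=np.uint8)
raw[0, 0] = 255   # brightest possible pixel
raw[1, 783] = 51  # 51 / 255 is 0.2

scaled = raw / 255.0                 # uint8 -> float64 in [0, 1]
images = scaled.reshape(-1, 28, 28)  # one 28 x 28 grid per row

print(scaled.dtype, scaled.min(), scaled.max())
print(images.shape)
print(images[0, 0, 0], images[1, 27, 27])
```

The first element of a flattened row lands at the top-left corner of the grid and the last element at the bottom-right, which is the row-major order that `reshape(28, 28)` assumes.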

Azure Databricks


Load MNIST into a data frame by using Azure Machine Learning tabular datasets.

For more information about Azure Machine Learning datasets, see "Create Azure Machine Learning datasets."

Get the complete dataset into a data frame

# This is a package in preview.
from azureml.opendatasets import MNIST

mnist = MNIST.get_tabular_dataset()
mnist_df = mnist.to_spark_dataframe()

display(mnist_df.limit(5))

Download or mount the MNIST raw files by using Azure Machine Learning file datasets

mnist_file = MNIST.get_file_dataset()
mnist_file

mnist_file.to_path()

Download files to local storage

import os
import tempfile

mount_point = tempfile.mkdtemp()
mnist_file.download(mount_point, overwrite=True)

Mount files (useful when the training job runs on remote compute)

import gzip
import struct
import pandas as pd
import numpy as np

# load compressed MNIST gz files and return a pandas dataframe
def load_data(filename, label=False):
    with gzip.open(filename) as gz:
        gz.read(4)
        n_items = struct.unpack('>I', gz.read(4))
        if not label:
            n_rows = struct.unpack('>I', gz.read(4))[0]
            n_cols = struct.unpack('>I', gz.read(4))[0]
            res = np.frombuffer(gz.read(n_items[0] * n_rows * n_cols), dtype=np.uint8)
            res = res.reshape(n_items[0], n_rows * n_cols)
        else:
            res = np.frombuffer(gz.read(n_items[0]), dtype=np.uint8)
            res = res.reshape(n_items[0], 1)
    return pd.DataFrame(res)

import sys
mount_point = tempfile.mkdtemp()
print(mount_point)
print(os.path.exists(mount_point))
print(os.listdir(mount_point))

if sys.platform == 'linux':
    print("start mounting....")
    # Bind the mount context so that its mount_point attribute can be printed.
    with mnist_file.mount(mount_point) as context:
        print(context.mount_point)
        print(os.listdir(mount_point))
        train_images_df = load_data(os.path.join(mount_point, 'train-images-idx3-ubyte.gz'))
        print(train_images_df.info())
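
The snippet above assumes the image file sits at the top level of the mount point. If the layout under the mount point is nested or unknown, a small search helper can locate the file first. This is a generic standard-library sketch, not part of the azureml API; `find_file` is a hypothetical name, and the temporary directory only mimics a mount.

```python
import os
import tempfile


def find_file(root, file_name):
    """Return the first path under root whose base name equals file_name, or None."""
    for dir_path, _dir_names, file_names in os.walk(root):
        if file_name in file_names:
            return os.path.join(dir_path, file_name)
    return None


# Demonstration on a temporary directory that mimics a nested mount layout.
root = tempfile.mkdtemp()
nested = os.path.join(root, "mnist")
os.makedirs(nested)
target = os.path.join(nested, "train-images-idx3-ubyte.gz")
open(target, "wb").close()

print(find_file(root, "train-images-idx3-ubyte.gz") == target)
print(find_file(root, "missing.gz"))
```

Inside the `with` block, the returned path could be passed to `load_data` in place of the hard-coded `os.path.join(mount_point, ...)` expression.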

Next steps

View the rest of the datasets in the Open Datasets catalog.
