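The MNIST database of handwritten digits

Article
03/04/2024

The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. The digits have been size-normalized and centered in a fixed-size image.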

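Note

Microsoft provides Azure Open Datasets on an "as is" basis. Microsoft makes no warranties, express or implied, guarantees, or conditions with respect to your use of the datasets. To the extent permitted under your local law, Microsoft disclaims all liability for any damages or losses, including direct, consequential, special, indirect, incidental, or punitive, resulting from your use of the datasets.

This dataset is provided under the original terms under which Microsoft received the source data. The dataset may include data sourced from Microsoft.

This dataset is sourced from the MNIST database of handwritten digits. It is a subset of the larger NIST Hand-printed Forms and Characters Database published by the National Institute of Standards and Technology.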

Storage location

Blob account: azureopendatastorage
Container name: mnist

Four files are available directly in the container:

train-images-idx3-ubyte.gz: training set images (9,912,422 bytes)
train-labels-idx1-ubyte.gz: training set labels (28,881 bytes)
t10k-images-idx3-ubyte.gz: test set images (1,648,877 bytes)
t10k-labels-idx1-ubyte.gz: test set labels (4,542 bytes)
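
The account and container names map onto public HTTPS URLs of the form that the download code later in this article uses. The following is a minimal sketch under that assumption; `blob_url` and `MNIST_FILES` are illustrative names, not part of any Azure SDK.

```python
# Illustrative helper (not an Azure SDK API): build the public HTTPS URL
# for each MNIST file from the Blob account and container names.
BLOB_ACCOUNT = "azureopendatastorage"
CONTAINER = "mnist"

# File names and compressed sizes in bytes, as listed in this article.
MNIST_FILES = {
    "train-images-idx3-ubyte.gz": 9_912_422,
    "train-labels-idx1-ubyte.gz": 28_881,
    "t10k-images-idx3-ubyte.gz": 1_648_877,
    "t10k-labels-idx1-ubyte.gz": 4_542,
}


def blob_url(file_name, account=BLOB_ACCOUNT, container=CONTAINER):
    """Return the HTTPS URL of a blob in a publicly readable container."""
    return f"https://{account}.blob.core.windows.net/{container}/{file_name}"


for name, size in MNIST_FILES.items():
    print(f"{blob_url(name)}  ({size:,} bytes)")
```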

Data access

Azure Notebooks

Load MNIST into a data frame using Azure Machine Learning tabular datasets.

For more information on Azure Machine Learning datasets, see Create Azure Machine Learning datasets.

Get the complete dataset into a data frame

from azureml.opendatasets import MNIST

mnist = MNIST.get_tabular_dataset()
mnist_df = mnist.to_pandas_dataframe()
mnist_df.info()

Get the train and test data frames

mnist_train = MNIST.get_tabular_dataset(dataset_filter='train')
mnist_train_df = mnist_train.to_pandas_dataframe()
X_train = mnist_train_df.drop("label", axis=1).astype(int).values/255.0
y_train = mnist_train_df.filter(items=["label"]).astype(int).values

mnist_test = MNIST.get_tabular_dataset(dataset_filter='test')
mnist_test_df = mnist_test.to_pandas_dataframe()
X_test = mnist_test_df.drop("label", axis=1).astype(int).values/255.0
y_test = mnist_test_df.filter(items=["label"]).astype(int).values

Plot some images of the digits

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# Show some randomly chosen images from the training set.
count = 0
sample_size = 30
plt.figure(figsize=(16, 6))
for i in np.random.permutation(X_train.shape[0])[:sample_size]:
    count = count + 1
    plt.subplot(1, sample_size, count)
    plt.xticks([])  # hide the tick marks and labels on both axes
    plt.yticks([])
    plt.text(x=10, y=-10, s=y_train[i], fontsize=18)
    plt.imshow(X_train[i].reshape(28, 28), cmap=plt.cm.Greys)
plt.show()

Download or mount MNIST raw files with Azure Machine Learning file datasets.

This works only on Linux-based compute. For more information on Azure Machine Learning datasets, see Create Azure Machine Learning datasets.

mnist_file = MNIST.get_file_dataset()
mnist_file

mnist_file.to_path()

Download the files to local storage

import os
import tempfile

data_folder = tempfile.mkdtemp()
data_paths = mnist_file.download(data_folder, overwrite=True)
data_paths
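
Because this article lists the compressed size of each file, a downloaded copy can be checked against those sizes. The following is a sketch only: `find_size_mismatches` and `EXPECTED_SIZES` are hypothetical names introduced here, and the demonstration uses a stand-in file rather than a real download.

```python
import os
import tempfile

# Compressed sizes in bytes, copied from the file list earlier in this article.
EXPECTED_SIZES = {
    "train-images-idx3-ubyte.gz": 9_912_422,
    "train-labels-idx1-ubyte.gz": 28_881,
    "t10k-images-idx3-ubyte.gz": 1_648_877,
    "t10k-labels-idx1-ubyte.gz": 4_542,
}


def find_size_mismatches(paths, expected=EXPECTED_SIZES):
    """Return (path, actual_size, expected_size) for files whose size differs.

    Files whose base name is not a key of `expected` are ignored.
    """
    mismatches = []
    for path in paths:
        want = expected.get(os.path.basename(path))
        if want is None:
            continue
        actual = os.path.getsize(path)
        if actual != want:
            mismatches.append((path, actual, want))
    return mismatches


# Offline demonstration with a stand-in file of the documented size.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "t10k-labels-idx1-ubyte.gz")
with open(demo_path, "wb") as f:
    f.write(b"\0" * 4_542)
ok = find_size_mismatches([demo_path])

with open(demo_path, "ab") as f:
    f.write(b"\0")  # one extra byte simulates a corrupted download
bad = find_size_mismatches([demo_path])
print(ok, bad)
```

Assuming `download` returns local file paths as strings, `find_size_mismatches(data_paths)` would be expected to return an empty list for an intact download.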

Mount the files. This is useful when the training job runs on remote compute.

import gzip
import struct
import pandas as pd
import numpy as np

# load compressed MNIST gz files and return pandas dataframe of numpy arrays
def load_data(filename, label=False):
    with gzip.open(filename) as gz:
        gz.read(4)
        n_items = struct.unpack('>I', gz.read(4))
        if not label:
            n_rows = struct.unpack('>I', gz.read(4))[0]
            n_cols = struct.unpack('>I', gz.read(4))[0]
            res = np.frombuffer(gz.read(n_items[0] * n_rows * n_cols), dtype=np.uint8)
            res = res.reshape(n_items[0], n_rows * n_cols)
        else:
            res = np.frombuffer(gz.read(n_items[0]), dtype=np.uint8)
            res = res.reshape(n_items[0], 1)
    return pd.DataFrame(res)

import sys
mount_point = tempfile.mkdtemp()
print(mount_point)
print(os.path.exists(mount_point))

if sys.platform == 'linux':
  print("start mounting....")
  with mnist_file.mount(mount_point):
    print("list dir...")
    print(os.listdir(mount_point))
    print("get the dataframe info of mounted data...")
    # read from the mounted path rather than the previously downloaded copy
    train_images_df = load_data(os.path.join(mount_point, "train-images-idx3-ubyte.gz"))
    print(train_images_df.info())
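
The `load_data` function skips the first 4 bytes of each file and then reads big-endian counts. In the IDX format those 4 bytes are a magic number: 2051 for image files and 2049 for label files. The sketch below validates that header instead of skipping it. It uses only the standard library and a tiny synthetic file built in memory. `parse_idx` is an illustrative name, not a function from this article or from any Azure package.

```python
import gzip
import struct

IMAGE_MAGIC = 2051  # 0x00000803: unsigned-byte data, 3 dimensions
LABEL_MAGIC = 2049  # 0x00000801: unsigned-byte data, 1 dimension


def parse_idx(raw):
    """Parse a decompressed IDX buffer into (shape, payload_bytes)."""
    magic, n_items = struct.unpack(">II", raw[:8])
    if magic == IMAGE_MAGIC:
        n_rows, n_cols = struct.unpack(">II", raw[8:16])
        shape, offset = (n_items, n_rows, n_cols), 16
    elif magic == LABEL_MAGIC:
        shape, offset = (n_items,), 8
    else:
        raise ValueError(f"unexpected IDX magic number: {magic}")

    # The payload must hold exactly one byte per element.
    expected = 1
    for dim in shape:
        expected *= dim
    payload = raw[offset:]
    if len(payload) != expected:
        raise ValueError(f"expected {expected} data bytes, found {len(payload)}")
    return shape, payload


# Build a tiny synthetic image file in memory: two 2x3 "images".
pixels = bytes(range(12))
raw = struct.pack(">IIII", IMAGE_MAGIC, 2, 2, 3) + pixels
compressed = gzip.compress(raw)

shape, payload = parse_idx(gzip.decompress(compressed))
print(shape, list(payload))
```

Checking the magic number up front makes it harder to pass a label file where an image file is expected, which the `label` flag in `load_data` leaves to the caller.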

Alternatively, download the raw files directly from the public Blob container over HTTPS.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

import urllib.request
import os

data_folder = os.path.join(os.getcwd(), 'data')
os.makedirs(data_folder, exist_ok=True)

urllib.request.urlretrieve('https://azureopendatastorage.blob.core.windows.net/mnist/train-images-idx3-ubyte.gz',
                           filename=os.path.join(data_folder, 'train-images.gz'))
urllib.request.urlretrieve('https://azureopendatastorage.blob.core.windows.net/mnist/train-labels-idx1-ubyte.gz',
                           filename=os.path.join(data_folder, 'train-labels.gz'))
urllib.request.urlretrieve('https://azureopendatastorage.blob.core.windows.net/mnist/t10k-images-idx3-ubyte.gz',
                           filename=os.path.join(data_folder, 'test-images.gz'))
urllib.request.urlretrieve('https://azureopendatastorage.blob.core.windows.net/mnist/t10k-labels-idx1-ubyte.gz',
                           filename=os.path.join(data_folder, 'test-labels.gz'))
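
The four calls above can also be written as a loop that skips files already on disk. This is a sketch, not the article's method: `download_missing`, `FILES`, and the `fetch` parameter are hypothetical names, and `fetch` exists only so the example can run without network access.

```python
import os
import tempfile
import urllib.request

BASE_URL = "https://azureopendatastorage.blob.core.windows.net/mnist/"

# Remote blob name -> local file name, matching the calls above.
FILES = {
    "train-images-idx3-ubyte.gz": "train-images.gz",
    "train-labels-idx1-ubyte.gz": "train-labels.gz",
    "t10k-images-idx3-ubyte.gz": "test-images.gz",
    "t10k-labels-idx1-ubyte.gz": "test-labels.gz",
}


def download_missing(data_folder, fetch=urllib.request.urlretrieve):
    """Fetch each file that is not already in data_folder.

    `fetch(url, filename)` defaults to urllib.request.urlretrieve.
    Returns the list of local paths that were fetched on this call.
    """
    os.makedirs(data_folder, exist_ok=True)
    fetched = []
    for blob_name, local_name in FILES.items():
        dest = os.path.join(data_folder, local_name)
        if os.path.exists(dest):
            continue  # already present, so skip the download
        fetch(BASE_URL + blob_name, dest)
        fetched.append(dest)
    return fetched


# Offline demonstration with a stand-in fetcher that writes an empty file.
def fake_fetch(url, filename):
    with open(filename, "wb"):
        pass


demo_folder = tempfile.mkdtemp()
first_run = download_missing(demo_folder, fetch=fake_fetch)
second_run = download_missing(demo_folder, fetch=fake_fetch)
print(len(first_run), len(second_run))
```

The existence check also skips a partially downloaded file, so delete any incomplete file before retrying.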

import gzip
import struct

# load compressed MNIST gz files and return numpy arrays
def load_data(filename, label=False):
    with gzip.open(filename) as gz:
        gz.read(4)  # skip the 4-byte magic number
        n_items = struct.unpack('>I', gz.read(4))
        if not label:
            n_rows = struct.unpack('>I', gz.read(4))[0]
            n_cols = struct.unpack('>I', gz.read(4))[0]
            res = np.frombuffer(gz.read(n_items[0] * n_rows * n_cols), dtype=np.uint8)
            res = res.reshape(n_items[0], n_rows * n_cols)
        else:
            res = np.frombuffer(gz.read(n_items[0]), dtype=np.uint8)
            res = res.reshape(n_items[0], 1)
    return res

# note we also shrink the intensity values (X) from 0-255 to 0-1. This helps the model converge faster.
X_train = load_data(os.path.join(
    data_folder, 'train-images.gz'), False) / 255.0
X_test = load_data(os.path.join(data_folder, 'test-images.gz'), False) / 255.0
y_train = load_data(os.path.join(
    data_folder, 'train-labels.gz'), True).reshape(-1)
y_test = load_data(os.path.join(
    data_folder, 'test-labels.gz'), True).reshape(-1)

# Show some randomly chosen images from the training set.
count = 0
sample_size = 30
plt.figure(figsize=(16, 6))
for i in np.random.permutation(X_train.shape[0])[:sample_size]:
    count = count + 1
    plt.subplot(1, sample_size, count)
    plt.xticks([])  # hide the tick marks and labels on both axes
    plt.yticks([])
    plt.text(x=10, y=-10, s=y_train[i], fontsize=18)
    plt.imshow(X_train[i].reshape(28, 28), cmap=plt.cm.Greys)
plt.show()

Azure Databricks

Load MNIST into a data frame using Azure Machine Learning tabular datasets.

For more information on Azure Machine Learning datasets, see Create Azure Machine Learning datasets.

Get the complete dataset into a data frame

# This is a package in preview.
from azureml.opendatasets import MNIST

mnist = MNIST.get_tabular_dataset()
mnist_df = mnist.to_spark_dataframe()

display(mnist_df.limit(5))

Download or mount MNIST raw files with Azure Machine Learning file datasets.

This works only on Linux-based compute. For more information on Azure Machine Learning datasets, see Create Azure Machine Learning datasets.

mnist_file = MNIST.get_file_dataset()
mnist_file

mnist_file.to_path()

Download the files to local storage

import os
import tempfile

mount_point = tempfile.mkdtemp()
mnist_file.download(mount_point, overwrite=True)

Mount the files. This is useful when the training job runs on remote compute.

import gzip
import struct
import pandas as pd
import numpy as np

# load compressed MNIST gz files and return a pandas DataFrame
def load_data(filename, label=False):
    with gzip.open(filename) as gz:
        gz.read(4)
        n_items = struct.unpack('>I', gz.read(4))
        if not label:
            n_rows = struct.unpack('>I', gz.read(4))[0]
            n_cols = struct.unpack('>I', gz.read(4))[0]
            res = np.frombuffer(gz.read(n_items[0] * n_rows * n_cols), dtype=np.uint8)
            res = res.reshape(n_items[0], n_rows * n_cols)
        else:
            res = np.frombuffer(gz.read(n_items[0]), dtype=np.uint8)
            res = res.reshape(n_items[0], 1)
    return pd.DataFrame(res)

import sys
mount_point = tempfile.mkdtemp()
print(mount_point)
print(os.path.exists(mount_point))
print(os.listdir(mount_point))

if sys.platform == 'linux':
  print("start mounting....")
  with mnist_file.mount(mount_point):
    print(mount_point)
    print(os.listdir(mount_point))
    train_images_df = load_data(os.path.join(mount_point, 'train-images-idx3-ubyte.gz'))
    print(train_images_df.info())

Next steps

View the rest of the datasets in the Open Datasets catalog.
