The MNIST database of handwritten digits
The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. The digits have been size-normalized and centered in a fixed-size image.
Note
Microsoft provides Azure Open Datasets on an “as is” basis. Microsoft makes no warranties, express or implied, guarantees or conditions with respect to your use of the datasets. To the extent permitted under your local law, Microsoft disclaims all liability for any damages or losses, including direct, consequential, special, indirect, incidental or punitive, resulting from your use of the datasets.
This dataset is provided under the original terms that Microsoft received source data. The dataset may include data sourced from Microsoft.
This dataset is sourced from THE MNIST DATABASE of handwritten digits. It's a subset of the larger NIST Hand-printed Forms and Characters Database published by the National Institute of Standards and Technology (NIST).
Storage location
- Blob account: azureopendatastorage
- Container name: mnist
Four files are available directly in the container:
- train-images-idx3-ubyte.gz: training set images (9,912,422 bytes)
- train-labels-idx1-ubyte.gz: training set labels (28,881 bytes)
- t10k-images-idx3-ubyte.gz: test set images (1,648,877 bytes)
- t10k-labels-idx1-ubyte.gz: test set labels (4,542 bytes)
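If you prefer to work outside the Azure Machine Learning SDK, you can also download the raw files over HTTPS. The following is a minimal sketch that assumes the standard public blob URL pattern for the account and container above (https://azureopendatastorage.blob.core.windows.net/mnist/&lt;file&gt;) and anonymous read access:
import os
import urllib.request
# Assumed public URL pattern for the blob account and container listed above.
base_url = "https://azureopendatastorage.blob.core.windows.net/mnist/"
files = [
    "train-images-idx3-ubyte.gz",
    "train-labels-idx1-ubyte.gz",
    "t10k-images-idx3-ubyte.gz",
    "t10k-labels-idx1-ubyte.gz",
]
os.makedirs("mnist_raw", exist_ok=True)
for name in files:
    urllib.request.urlretrieve(base_url + name, os.path.join("mnist_raw", name))
The downloaded files are gzip-compressed IDX files; the load_data helper shown later in this article can parse them.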
Data access
Azure Notebooks
Load MNIST into a data frame using Azure Machine Learning tabular datasets.
For more information on Azure Machine Learning datasets, see Create Azure Machine Learning datasets.
Get complete dataset into a data frame
from azureml.opendatasets import MNIST
mnist = MNIST.get_tabular_dataset()
mnist_df = mnist.to_pandas_dataframe()
mnist_df.info()
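The full tabular dataset is large; for a quick preview you can take a small slice before converting it to pandas. A minimal sketch, assuming the returned object supports the standard TabularDataset take method:
# Preview the first 1,000 rows instead of materializing the whole dataset.
mnist_sample_df = MNIST.get_tabular_dataset().take(1000).to_pandas_dataframe()
print(mnist_sample_df.shape)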
Get train and test data frames
# load the training split and convert it to numpy arrays
mnist_train = MNIST.get_tabular_dataset(dataset_filter='train')
mnist_train_df = mnist_train.to_pandas_dataframe()
X_train = mnist_train_df.drop("label", axis=1).astype(int).values / 255.0  # scale pixel values to [0, 1]
y_train = mnist_train_df.filter(items=["label"]).astype(int).values
# load the test split and convert it to numpy arrays
mnist_test = MNIST.get_tabular_dataset(dataset_filter='test')
mnist_test_df = mnist_test.to_pandas_dataframe()
X_test = mnist_test_df.drop("label", axis=1).astype(int).values / 255.0
y_test = mnist_test_df.filter(items=["label"]).astype(int).values
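As a quick sanity check, the arrays should contain one row per example and 784 pixel columns (assuming the full train and test splits were loaded):
# Expected: X_train (60000, 784), y_train (60000, 1), X_test (10000, 784), y_test (10000, 1)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)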
Plot some images of the digits
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
# now let's show some randomly chosen images from the training set
count = 0
sample_size = 30
plt.figure(figsize=(16, 6))
for i in np.random.permutation(X_train.shape[0])[:sample_size]:
    count = count + 1
    plt.subplot(1, sample_size, count)
    plt.axis('off')
    plt.text(x=10, y=-10, s=y_train[i][0], fontsize=18)
    plt.imshow(X_train[i].reshape(28, 28), cmap=plt.cm.Greys)
plt.show()
Download or mount MNIST raw files using Azure Machine Learning file datasets.
This works only for Linux-based compute. For more information on Azure Machine Learning datasets, see Create Azure Machine Learning datasets.
mnist_file = MNIST.get_file_dataset()
mnist_file
mnist_file.to_path()
Download files to local storage
import os
import tempfile
data_folder = tempfile.mkdtemp()
data_paths = mnist_file.download(data_folder, overwrite=True)
data_paths
Mount files. This is useful when the training job will run on remote compute.
import gzip
import struct
import pandas as pd
import numpy as np
# load compressed MNIST gz files and return a pandas dataframe of numpy arrays
def load_data(filename, label=False):
    with gzip.open(filename) as gz:
        gz.read(4)  # skip the magic number
        n_items = struct.unpack('>I', gz.read(4))
        if not label:
            n_rows = struct.unpack('>I', gz.read(4))[0]
            n_cols = struct.unpack('>I', gz.read(4))[0]
            res = np.frombuffer(gz.read(n_items[0] * n_rows * n_cols), dtype=np.uint8)
            res = res.reshape(n_items[0], n_rows * n_cols)
        else:
            res = np.frombuffer(gz.read(n_items[0]), dtype=np.uint8)
            res = res.reshape(n_items[0], 1)
    return pd.DataFrame(res)
import sys
mount_point = tempfile.mkdtemp()
print(mount_point)
print(os.path.exists(mount_point))
if sys.platform == 'linux':
    print("start mounting....")
    with mnist_file.mount(mount_point):
        print("list dir...")
        print(os.listdir(mount_point))
        print("get the dataframe info of mounted data...")
        train_images_df = load_data(os.path.join(mount_point, "train-images-idx3-ubyte.gz"))
        print(train_images_df.info())
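If you also want label arrays, you can reuse the load_data helper on the files downloaded earlier. This is a minimal sketch assuming data_paths from the download step above is still in scope; it reads the downloaded copies rather than the mount, which is released when the with block exits:
# Build feature and label arrays from the downloaded raw files.
X_train_raw = load_data(next(p for p in data_paths if p.endswith("train-images-idx3-ubyte.gz"))).values / 255.0
y_train_raw = load_data(next(p for p in data_paths if p.endswith("train-labels-idx1-ubyte.gz")), label=True).values
print(X_train_raw.shape, y_train_raw.shape)  # expected: (60000, 784) (60000, 1)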
Azure Databricks
Load MNIST into a data frame using Azure Machine Learning tabular datasets.
For more information on Azure Machine Learning datasets, see Create Azure Machine Learning datasets.
Get complete dataset into a data frame
# This is a package in preview.
from azureml.opendatasets import MNIST
mnist = MNIST.get_tabular_dataset()
mnist_df = mnist.to_spark_dataframe()
display(mnist_df.limit(5))
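As a quick check on the Spark data frame, you can count how many examples each digit has. This sketch assumes the same label column used by the pandas examples above:
# Count examples per digit label.
display(mnist_df.groupBy("label").count().orderBy("label"))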
Download or mount MNIST raw files using Azure Machine Learning file datasets.
This works only for Linux-based compute. For more information on Azure Machine Learning datasets, see Create Azure Machine Learning datasets.
mnist_file = MNIST.get_file_dataset()
mnist_file
mnist_file.to_path()
Download files to local storage
import os
import tempfile
data_folder = tempfile.mkdtemp()
data_paths = mnist_file.download(data_folder, overwrite=True)
Mount files. This is useful when the training job will run on remote compute.
import gzip
import struct
import pandas as pd
import numpy as np
# load compressed MNIST gz files and return a pandas dataframe of numpy arrays
def load_data(filename, label=False):
    with gzip.open(filename) as gz:
        gz.read(4)  # skip the magic number
        n_items = struct.unpack('>I', gz.read(4))
        if not label:
            n_rows = struct.unpack('>I', gz.read(4))[0]
            n_cols = struct.unpack('>I', gz.read(4))[0]
            res = np.frombuffer(gz.read(n_items[0] * n_rows * n_cols), dtype=np.uint8)
            res = res.reshape(n_items[0], n_rows * n_cols)
        else:
            res = np.frombuffer(gz.read(n_items[0]), dtype=np.uint8)
            res = res.reshape(n_items[0], 1)
    return pd.DataFrame(res)
import sys
mount_point = tempfile.mkdtemp()
print(mount_point)
print(os.path.exists(mount_point))
print(os.listdir(mount_point))
if sys.platform == 'linux':
    print("start mounting....")
    with mnist_file.mount(mount_point):
        print(mount_point)
        print(os.listdir(mount_point))
        train_images_df = load_data(os.path.join(mount_point, 'train-images-idx3-ubyte.gz'))
        print(train_images_df.info())
Next steps
View the rest of the datasets in the Open Datasets catalog.