Microsoft News Recommendation
Microsoft News Dataset (MIND) is a large-scale dataset for news recommendation research. It was collected from anonymized behavior logs of the Microsoft News website. The mission of MIND is to serve as a benchmark dataset for news recommendation and to facilitate research on news recommendation and recommender systems.
MIND contains about 160k English news articles and more than 15 million impression logs generated by 1 million users. Every news article contains rich textual content including title, abstract, body, category, and entities. Each impression log contains the click events, non-click events, and the user's historical news click behaviors before this impression. To protect user privacy, each user was de-linked from the production system by securely hashing the user ID into an anonymized ID. For more detailed information about the MIND dataset, refer to the paper MIND: A Large-scale Dataset for News Recommendation.
Volume
Both the training and validation data are zip-compressed folders, each of which contains four files:
FILE NAME | DESCRIPTION |
---|---|
behaviors.tsv | The click histories and impression logs of users |
news.tsv | The information of news articles |
entity_embedding.vec | The embeddings of entities in news extracted from knowledge graph |
relation_embedding.vec | The embeddings of relations between entities extracted from knowledge graph |
behaviors.tsv
The behaviors.tsv file contains the impression logs and users' news click histories. It has five tab-separated columns:
- Impression ID. The ID of an impression.
- User ID. The anonymous ID of a user.
- Time. The impression time with format “MM/DD/YYYY HH:MM:SS AM/PM”.
- History. The news click history (ID list of clicked news) of this user before this impression.
- Impressions. List of news displayed in this impression and user’s click behaviors on them (1 for click and 0 for non-click).
An example is shown in the table below:
COLUMN | CONTENT |
---|---|
Impression ID | 123 |
User ID | U131 |
Time | 11/13/2019 8:36:57 AM |
History | N11 N21 N103 |
Impressions | N4-1 N34-1 N156-0 N207-0 N198-0 |
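As a sketch, a behaviors.tsv row like the one above can be parsed with the standard library alone. The `parse_behaviors_line` helper name and the returned field layout are our own, not part of any dataset tooling:

```python
def parse_behaviors_line(line):
    """Split one tab-separated behaviors.tsv row into its five fields."""
    impression_id, user_id, time, history, impressions = line.rstrip('\n').split('\t')
    # History is a space-separated list of previously clicked news IDs (may be empty).
    history_ids = history.split()
    # Each impression entry is "<news_id>-<label>": label 1 = click, 0 = non-click.
    clicks = [(item.rsplit('-', 1)[0], int(item.rsplit('-', 1)[1]))
              for item in impressions.split()]
    return impression_id, user_id, time, history_ids, clicks

# Values taken from the example table above.
row = '123\tU131\t11/13/2019 8:36:57 AM\tN11 N21 N103\tN4-1 N34-1 N156-0'
imp_id, user, time, history, clicks = parse_behaviors_line(row)
```

Splitting the impression entries with `rsplit('-', 1)` rather than `split('-')` keeps any hyphens inside an ID intact.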
news.tsv
The news.tsv file contains detailed information about the news articles involved in the behaviors.tsv file. It has eight tab-separated columns:
- News ID
- Category
- Subcategory
- Title
- Abstract
- URL
- Title Entities (entities contained in the title of this news)
- Abstract Entities (entities contained in the abstract of this news)
The full bodies of MSN news articles are not available for download due to licensing constraints. However, for your convenience, we provide a utility script to help parse the news webpages from the MSN URLs in the dataset. Some URLs have expired over time and can no longer be accessed; we are working to address this issue.
An example is shown in the following table:
COLUMN | CONTENT |
---|---|
News ID | N37378 |
Category | sports |
SubCategory | golf |
Title | PGA Tour winners |
Abstract | A gallery of recent winners on the PGA Tour. |
URL | https://www.msn.com/en-us/sports/golf/pga-tour-winners/ss-AAjnQjj?ocid=chopendata |
Title Entities | [{"Label": "PGA Tour", "Type": "O", "WikidataId": "Q910409", "Confidence": 1.0, "OccurrenceOffsets": [0], "SurfaceForms": ["PGA Tour"]}] |
Abstract Entities | [{"Label": "PGA Tour", "Type": "O", "WikidataId": "Q910409", "Confidence": 1.0, "OccurrenceOffsets": [35], "SurfaceForms": ["PGA Tour"]}] |
The descriptions of the dictionary keys in the “Entities” column are listed as follows:
KEYS | DESCRIPTION |
---|---|
Label | The entity name in the Wikidata knowledge graph |
Type | The type of this entity in Wikidata |
WikidataId | The entity ID in Wikidata |
Confidence | The confidence of entity linking |
OccurrenceOffsets | The character-level entity offset in the text of title or abstract |
SurfaceForms | The raw entity names in the original text |
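The entity columns hold JSON-encoded lists of dictionaries with the keys described above, so they can be decoded with the standard json module. A minimal sketch using the sample value from the table:

```python
import json

# JSON string as it appears in the Title Entities column of news.tsv.
title_entities = ('[{"Label": "PGA Tour", "Type": "O", "WikidataId": "Q910409", '
                  '"Confidence": 1.0, "OccurrenceOffsets": [0], '
                  '"SurfaceForms": ["PGA Tour"]}]')

entities = json.loads(title_entities)
for ent in entities:
    # OccurrenceOffsets are character offsets into the title (or abstract) text.
    print(ent['WikidataId'], ent['Label'], ent['OccurrenceOffsets'])
```

An empty entity list is stored as `[]`, which `json.loads` handles without special-casing.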
entity_embedding.vec & relation_embedding.vec
The entity_embedding.vec and relation_embedding.vec files contain the 100-dimensional embeddings of the entities and relations, learned from the subgraph of the Wikidata knowledge graph with the TransE method. In both files, the first column is the ID of the entity/relation, and the remaining columns are the embedding vector values. We hope this data can facilitate research on knowledge-aware news recommendation. An example is shown below:
ID | EMBEDDING VALUES |
---|---|
Q42306013 | 0.014516 -0.106958 0.024590 … -0.080382 |
Because the embeddings are learned from a subgraph of the knowledge graph, a few entities may not have embeddings in the entity_embedding.vec file.
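Each .vec line is an ID followed by the embedding values, tab-separated. A minimal parsing sketch (the `parse_vec_line` helper is illustrative, and the sample line is truncated to three dimensions for brevity):

```python
def parse_vec_line(line):
    """Split one .vec line into (ID, list of float embedding values)."""
    parts = line.rstrip('\n').split('\t')
    # A trailing tab can leave an empty field; drop empties before converting.
    values = [float(v) for v in parts[1:] if v]
    return parts[0], values

# Values taken from the example row above, truncated for brevity.
entity_id, vector = parse_vec_line('Q42306013\t0.014516\t-0.106958\t0.024590')
```

In the real files, a full line yields a 100-element vector.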
Storage location
The data are stored in blobs in the West/East US data center, in the following blob container: 'https://mind201910small.blob.core.windows.net/release/'.
Within the container, the training and validation set are compressed into MINDlarge_train.zip and MINDlarge_dev.zip respectively.
Additional information
The MIND dataset is free to download for research purposes under Microsoft Research License Terms. Contact mind@microsoft.com if you have any questions about the dataset.
Data access
Azure Notebooks
Demo notebook for accessing MIND data on Azure
This notebook provides an example of accessing MIND data from blob storage on Azure.
MIND data are stored in the West/East US data center, so this notebook will run more efficiently on the Azure compute located in West/East US.
Imports and environment
import os
import tempfile
import shutil
import urllib.request
import zipfile
import pandas as pd
# Temporary folder for data we need during execution of this notebook (we'll clean up
# at the end, we promise)
temp_dir = os.path.join(tempfile.gettempdir(), 'mind')
os.makedirs(temp_dir, exist_ok=True)
# The dataset is split into training and validation sets, each with a large and a small version.
# The file format is the same across all four versions.
# For demonstration purposes, we will use the small validation set only.
base_url = 'https://mind201910small.blob.core.windows.net/release'
training_small_url = f'{base_url}/MINDsmall_train.zip'
validation_small_url = f'{base_url}/MINDsmall_dev.zip'
training_large_url = f'{base_url}/MINDlarge_train.zip'
validation_large_url = f'{base_url}/MINDlarge_dev.zip'
Functions
def download_url(url,
                 destination_filename=None,
                 progress_updater=None,
                 force_download=False,
                 verbose=True):
    """
    Download a URL to a temporary file
    """
    if not verbose:
        progress_updater = None
    # This is not intended to guarantee uniqueness, we just know it happens to
    # guarantee uniqueness for this application.
    if destination_filename is None:
        url_as_filename = url.replace('://', '_').replace('/', '_')
        destination_filename = os.path.join(temp_dir, url_as_filename)
    if (not force_download) and (os.path.isfile(destination_filename)):
        if verbose:
            print('Bypassing download of already-downloaded file {}'.format(
                os.path.basename(url)))
        return destination_filename
    if verbose:
        print('Downloading file {} to {}'.format(os.path.basename(url),
                                                 destination_filename),
              end='')
    urllib.request.urlretrieve(url, destination_filename, progress_updater)
    assert os.path.isfile(destination_filename)
    nBytes = os.path.getsize(destination_filename)
    if verbose:
        print('...done, {} bytes.'.format(nBytes))
    return destination_filename
Download and extract the files
# For demonstration purposes, we will use the small validation set only.
# This file is about 30MB.
zip_path = download_url(validation_small_url, verbose=True)
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(temp_dir)
os.listdir(temp_dir)
Read the files with pandas
# The behaviors.tsv file contains the impression logs and users' news click histories.
# It has 5 columns divided by the tab symbol:
# - Impression ID. The ID of an impression.
# - User ID. The anonymous ID of a user.
# - Time. The impression time with format "MM/DD/YYYY HH:MM:SS AM/PM".
# - History. The news click history (ID list of clicked news) of this user before this impression.
# - Impressions. List of news displayed in this impression and user's click behaviors on them (1 for click and 0 for non-click).
behaviors_path = os.path.join(temp_dir, 'behaviors.tsv')
pd.read_table(
    behaviors_path,
    header=None,
    names=['impression_id', 'user_id', 'time', 'history', 'impressions'])
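For modeling, it is often handy to expand each impressions string into one row per displayed article. A sketch of that transform, using a tiny illustrative frame in place of the full behaviors table (the news_id and clicked column names are our own):

```python
import pandas as pd

# A one-row stand-in for the behaviors DataFrame loaded above.
behaviors = pd.DataFrame({
    'impression_id': [123],
    'impressions': ['N4-1 N34-1 N156-0'],
})

# One row per displayed article: split on spaces, then explode.
pairs = (behaviors.assign(impressions=behaviors['impressions'].str.split())
                  .explode('impressions')
                  .reset_index(drop=True))

# Split "<news_id>-<label>" from the right so hyphens inside IDs survive.
split = pairs['impressions'].str.rsplit('-', n=1, expand=True)
pairs['news_id'] = split[0]
pairs['clicked'] = split[1].astype(int)
```

The resulting (impression_id, news_id, clicked) rows are a convenient input for click-prediction training.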
# The news.tsv file contains the detailed information of news articles involved in the behaviors.tsv file.
# It has 8 columns, which are divided by the tab symbol:
# - News ID
# - Category
# - Subcategory
# - Title
# - Abstract
# - URL
# - Title Entities (entities contained in the title of this news)
# - Abstract Entities (entities contained in the abstract of this news)
news_path = os.path.join(temp_dir, 'news.tsv')
pd.read_table(news_path,
              header=None,
              names=[
                  'id', 'category', 'subcategory', 'title', 'abstract', 'url',
                  'title_entities', 'abstract_entities'
              ])
# The entity_embedding.vec file contains the 100-dimensional embeddings
# of the entities learned from the subgraph by TransE method.
# The first column is the ID of entity, and the other columns are the embedding vector values.
entity_embedding_path = os.path.join(temp_dir, 'entity_embedding.vec')
entity_embedding = pd.read_table(entity_embedding_path, header=None)
entity_embedding['vector'] = entity_embedding.iloc[:, 1:101].values.tolist()
entity_embedding = entity_embedding[[0, 'vector']].rename(columns={0: "entity"})
entity_embedding
# The relation_embedding.vec file contains the 100-dimensional embeddings
# of the relations learned from the subgraph by TransE method.
# The first column is the ID of relation, and the other columns are the embedding vector values.
relation_embedding_path = os.path.join(temp_dir, 'relation_embedding.vec')
relation_embedding = pd.read_table(relation_embedding_path, header=None)
relation_embedding['vector'] = relation_embedding.iloc[:, 1:101].values.tolist()
relation_embedding = relation_embedding[[0, 'vector']].rename(columns={0: "relation"})
relation_embedding
Clean up temporary files
shutil.rmtree(temp_dir)
Examples
See the following examples of how to use the Microsoft News Recommendation dataset:
Next steps
Check out several baseline news recommendation models developed on MIND in the Microsoft Recommenders repository.
View the rest of the datasets in the Open Datasets catalog.