Microsoft News Recommendation

Article
01/10/2024

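Microsoft News Dataset (MIND) is a large-scale dataset for news recommendation research. It was collected from anonymized behavior logs of the Microsoft News website. MIND is intended to serve as a benchmark dataset for news recommendation and to facilitate research on news recommendation and recommender systems.

MIND contains about 160,000 English news articles and more than 15 million impression logs generated by 1 million users. Every news article contains rich textual content, including title, abstract, body, category, and entities. Each impression log contains the click events, the non-clicked events, and the historical news click behaviors of the user before that impression. To protect user privacy, each user was de-linked from the production system when securely hashed into an anonymized ID. For more detailed information about the MIND dataset, see the paper MIND: A Large-scale Dataset for News Recommendation.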

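Volume

The training data and the validation data are each a zip-compressed folder that contains four different files:

File name	Description
behaviors.tsv	The click histories and impression logs of users
news.tsv	The information of news articles
entity_embedding.vec	The embeddings of entities in news, extracted from the knowledge graph
relation_embedding.vec	The embeddings of relations between entities, extracted from the knowledge graph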

behaviors.tsv

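The behaviors.tsv file contains the impression logs and the users' news click histories. It has five columns separated by tab characters:

Impression ID. The ID of an impression.
User ID. The anonymous ID of a user.
Time. The impression time, in the format "MM/DD/YYYY HH:MM:SS AM/PM".
History. The news click history (the list of IDs of clicked news) of this user before this impression.
Impressions. The list of news displayed in this impression, with the user's click behavior on each item (1 for click, 0 for non-click).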

An example is shown in the table below:

Column	Content
Impression ID	123
User ID	U131
Time	11/13/2019 8:36:57 AM
History	N11 N21 N103
Impressions	N4-1 N34-1 N156-0 N207-0 N198-0
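
As an illustration of the column layout described above, the following minimal sketch splits one behaviors.tsv row into typed fields. The function name `parse_behavior_row` and the returned dictionary keys are hypothetical, and the handling of an empty history column is an assumption rather than documented behavior.

```python
from datetime import datetime


def parse_behavior_row(line):
    """Split one tab-separated behaviors.tsv row into typed fields."""
    impression_id, user_id, time_str, history, impressions = (
        line.rstrip("\n").split("\t"))
    # Documented time format: MM/DD/YYYY HH:MM:SS AM/PM
    timestamp = datetime.strptime(time_str, "%m/%d/%Y %I:%M:%S %p")
    # Assumption: the history column may be empty for users without prior clicks.
    clicked_before = history.split() if history else []
    # Each impression token is "<news id>-<label>", where label 1 means click.
    shown = []
    for token in impressions.split():
        news_id, label = token.rsplit("-", 1)
        shown.append((news_id, label == "1"))
    return {
        "impression_id": impression_id,
        "user_id": user_id,
        "time": timestamp,
        "history": clicked_before,
        "impressions": shown,
    }


# The sample row from the table above, joined with tab characters.
row = "123\tU131\t11/13/2019 8:36:57 AM\tN11 N21 N103\tN4-1 N34-1 N156-0 N207-0 N198-0"
parsed = parse_behavior_row(row)
clicked = [news_id for news_id, was_clicked in parsed["impressions"] if was_clicked]
print(parsed["time"].isoformat(), clicked)
```

Splitting each token on its last hyphen keeps the news ID intact even if an ID were ever to contain a hyphen itself.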

news.tsv

The news.tsv file contains detailed information about the news articles involved in the behaviors.tsv file. It has eight columns separated by tab characters:

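News ID
Category
Subcategory
Title
Abstract
URL
Title Entities (entities contained in the title of this news)
Abstract Entities (entities contained in the abstract of this news)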

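The full content bodies of MSN news articles are not available for download, due to the licensing structure. However, for convenience, a utility script is provided to help parse the news webpages from the MSN URLs in the dataset. Because of the time that has passed, some URLs have expired and can no longer be accessed. We are currently doing our best to solve this problem.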

An example is shown in the table below:

Column	Content
News ID	N37378
Category	sports
Subcategory	golf
Title	PGA Tour winners
Abstract	A gallery of recent winners on the PGA Tour.
URL	https://www.msn.com/en-us/sports/golf/pga-tour-winners/ss-AAjnQjj?ocid=chopendata
Title Entities	[{"Label": "PGA Tour", "Type": "O", "WikidataId": "Q910409", "Confidence": 1.0, "OccurrenceOffsets": [0], "SurfaceForms": ["PGA Tour"]}]
Abstract Entities	[{"Label": "PGA Tour", "Type": "O", "WikidataId": "Q910409", "Confidence": 1.0, "OccurrenceOffsets": [35], "SurfaceForms": ["PGA Tour"]}]

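The dictionary keys in the "Entities" columns are described below:

Key	Description
Label	The entity name in the Wikidata knowledge graph
Type	The type of this entity in Wikidata
WikidataId	The entity ID in Wikidata
Confidence	The confidence of the entity linking
OccurrenceOffsets	The character-level entity offset in the text of the title or abstract
SurfaceForms	The raw entity names in the original text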
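
Because the entity columns hold JSON text, they can be decoded with the standard `json` module. The sketch below is a hedged illustration, not part of the dataset tooling: the helper name `entity_mentions` is hypothetical, the sample title is taken to be the English string "PGA Tour winners", and treating OccurrenceOffsets as zero-based character indexes is an assumption that happens to agree with the sample row.

```python
import json

# Sample values for the Title and Title Entities columns (assumed English title).
title = "PGA Tour winners"
title_entities = (
    '[{"Label": "PGA Tour", "Type": "O", "WikidataId": "Q910409", '
    '"Confidence": 1.0, "OccurrenceOffsets": [0], "SurfaceForms": ["PGA Tour"]}]'
)


def entity_mentions(text, entities_json):
    """Return (WikidataId, offset, matched surface form or None) per occurrence."""
    mentions = []
    for entity in json.loads(entities_json):
        for offset in entity["OccurrenceOffsets"]:
            # Assumption: offsets are zero-based character positions in the text.
            matched = next(
                (form for form in entity["SurfaceForms"]
                 if text.startswith(form, offset)),
                None)
            mentions.append((entity["WikidataId"], offset, matched))
    return mentions


mentions = entity_mentions(title, title_entities)
print(mentions)
```

A `None` in the third position would indicate that no listed surface form starts at the recorded offset, which is a cheap consistency check before relying on the offsets.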

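entity_embedding.vec and relation_embedding.vec

The entity_embedding.vec and relation_embedding.vec files contain the 100-dimensional embeddings of the entities and relations, learned from the subgraph (taken from the WikiData knowledge graph) with the TransE method. In both files, the first column is the ID of the entity or relation, and the other columns are the embedding vector values. We hope this data can facilitate research on knowledge-aware news recommendation. An example is shown below:

ID	Embedding values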
Q42306013	0.014516 -0.106958 0.024590 ... -0.080382

Because of how the embeddings were learned from the subgraph, a few entities may not have embeddings in the entity_embedding.vec file.
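
Since some entities have no embedding, code that looks up vectors needs a fallback. The following sketch shows one way to load lines in the layout described above and to handle missing IDs. The function names, the toy 3-dimensional vectors, and the sample IDs are all hypothetical, and substituting a zero vector is only one possible design choice.

```python
def load_embeddings(lines):
    """Map each ID (first column) to its list of float embedding values."""
    table = {}
    for line in lines:
        parts = line.split()  # splits on tabs and spaces alike
        if not parts:
            continue  # skip blank lines
        table[parts[0]] = [float(value) for value in parts[1:]]
    return table


def lookup(table, entity_id, dim):
    """Return the stored vector, or a zero vector when the ID is missing."""
    return table.get(entity_id, [0.0] * dim)


# Toy data: hypothetical IDs with 3 values each instead of the real 100.
sample_lines = ["Q1\t0.1\t-0.2\t0.3", "Q2\t0.0\t0.5\t-0.5", ""]
embeddings = load_embeddings(sample_lines)

known = lookup(embeddings, "Q1", dim=3)
missing = lookup(embeddings, "Q999", dim=3)
print(known, missing)
```

With the real files, the same function could be fed an open file handle, for example `load_embeddings(open(path, encoding="utf-8"))`, with `dim=100`.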

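Storage location

The data is stored in blobs in the West/East US data center, in the following blob container: 'https://mind201910small.blob.core.windows.net/release/'.

Within the container, the training and validation sets are compressed into MINDlarge_train.zip and MINDlarge_dev.zip, respectively.

Additional information

The MIND dataset is free to download for research purposes under the Microsoft Research License Terms. If you have any questions about the dataset, contact mind@microsoft.com.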

Data access

Azure Notebooks

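Demo notebook for accessing MIND data on Azure

This notebook provides an example of accessing MIND data from blob storage on Azure.

MIND data is stored in the West/East US data center, so this notebook runs more efficiently on Azure compute located in West/East US.

Imports and environment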

import os
import tempfile
import shutil
import urllib.request  # the submodule must be imported explicitly for urlretrieve
import zipfile
import pandas as pd

# Temporary folder for data we need during execution of this notebook (we'll clean up
# at the end, we promise)
temp_dir = os.path.join(tempfile.gettempdir(), 'mind')
os.makedirs(temp_dir, exist_ok=True)

# The dataset is split into a training set and a validation set, each with a large and a small version.
# All four archives share the same format.
# For demonstration purposes, we will use only the small validation set.
base_url = 'https://mind201910small.blob.core.windows.net/release'
training_small_url = f'{base_url}/MINDsmall_train.zip'
validation_small_url = f'{base_url}/MINDsmall_dev.zip'
training_large_url = f'{base_url}/MINDlarge_train.zip'
validation_large_url = f'{base_url}/MINDlarge_dev.zip'

Functions

def download_url(url,
                 destination_filename=None,
                 progress_updater=None,
                 force_download=False,
                 verbose=True):
    """
    Download a URL to a temporary file
    """
    if not verbose:
        progress_updater = None
    # This is not intended to guarantee uniqueness, we just know it happens to guarantee
    # uniqueness for this application.
    if destination_filename is None:
        url_as_filename = url.replace('://', '_').replace('/', '_')
        destination_filename = \
            os.path.join(temp_dir,url_as_filename)
    if (not force_download) and (os.path.isfile(destination_filename)):
        if verbose:
            print('Bypassing download of already-downloaded file {}'.format(
                os.path.basename(url)))
        return destination_filename
    if verbose:
        print('Downloading file {} to {}'.format(os.path.basename(url),
                                                 destination_filename),
              end='')
    urllib.request.urlretrieve(url, destination_filename, progress_updater)
    assert (os.path.isfile(destination_filename))
    nBytes = os.path.getsize(destination_filename)
    if verbose:
        print('...done, {} bytes.'.format(nBytes))
    return destination_filename

Download and extract the files

# For demonstration purposes, we will use only the small validation set.
# This file is about 30 MB.
zip_path = download_url(validation_small_url, verbose=True)
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(temp_dir)

os.listdir(temp_dir)

Read the files with pandas

# The behaviors.tsv file contains the impression logs and users' news click histories. 
# It has 5 columns divided by the tab symbol:
# - Impression ID. The ID of an impression.
# - User ID. The anonymous ID of a user.
# - Time. The impression time with format "MM/DD/YYYY HH:MM:SS AM/PM".
# - History. The news click history (ID list of clicked news) of this user before this impression.
# - Impressions. List of news displayed in this impression and user's click behaviors on them (1 for click and 0 for non-click).
behaviors_path = os.path.join(temp_dir, 'behaviors.tsv')
pd.read_table(
    behaviors_path,
    header=None,
    names=['impression_id', 'user_id', 'time', 'history', 'impressions'])

# The news.tsv file contains the detailed information of news articles involved in the behaviors.tsv file.
# It has 8 columns, which are divided by the tab symbol:
# - News ID
# - Category
# - Subcategory
# - Title
# - Abstract
# - URL
# - Title Entities (entities contained in the title of this news)
# - Abstract Entities (entities contained in the abstract of this news)
news_path = os.path.join(temp_dir, 'news.tsv')
pd.read_table(news_path,
              header=None,
              names=[
                  'id', 'category', 'subcategory', 'title', 'abstract', 'url',
                  'title_entities', 'abstract_entities'
              ])

# The entity_embedding.vec file contains the 100-dimensional embeddings
# of the entities learned from the subgraph by TransE method.
# The first column is the ID of entity, and the other columns are the embedding vector values.
entity_embedding_path = os.path.join(temp_dir, 'entity_embedding.vec')
entity_embedding = pd.read_table(entity_embedding_path, header=None)
entity_embedding['vector'] = entity_embedding.iloc[:, 1:101].values.tolist()
entity_embedding = entity_embedding[[0,
                                     'vector']].rename(columns={0: "entity"})
entity_embedding

# The relation_embedding.vec file contains the 100-dimensional embeddings
# of the relations learned from the subgraph by TransE method.
# The first column is the ID of relation, and the other columns are the embedding vector values.
relation_embedding_path = os.path.join(temp_dir, 'relation_embedding.vec')
relation_embedding = pd.read_table(relation_embedding_path, header=None)
relation_embedding['vector'] = relation_embedding.iloc[:,
                                                       1:101].values.tolist()
relation_embedding = relation_embedding[[0, 'vector'
                                         ]].rename(columns={0: "relation"})
relation_embedding
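
TransE, the method named in the comments above, treats a relation as a translation in embedding space: for a triple that holds, the head vector plus the relation vector should lie close to the tail vector. The sketch below illustrates that general idea only. The function name `transe_distance` is hypothetical, the vectors are toy 3-dimensional values rather than rows from the files, and the L2 norm is one commonly used choice, not necessarily the one used to train the MIND embeddings.

```python
import math


def transe_distance(head, relation, tail):
    """L2 distance between (head + relation) and tail; smaller means more plausible."""
    return math.sqrt(
        sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


# Toy vectors (hypothetical values, not taken from the dataset).
head = [0.1, 0.2, 0.3]
relation = [0.4, -0.1, 0.0]
plausible_tail = [0.5, 0.1, 0.3]     # approximately head + relation
implausible_tail = [-0.5, 0.9, -0.3]

near = transe_distance(head, relation, plausible_tail)
far = transe_distance(head, relation, implausible_tail)
print(near < far)
```

With the DataFrames built above, the `vector` column of `entity_embedding` and `relation_embedding` could supply the head, relation, and tail lists.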

Clean up temporary files

shutil.rmtree(temp_dir)

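Examples

See the following examples of how to use the Microsoft News Recommender dataset:

Next steps

Check out several baseline news recommendation models developed on MIND in the Microsoft Recommenders repository.

View the rest of the datasets in the Open Datasets catalog.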
