Azure Open Datasets

Improve the accuracy of your machine learning models with publicly available datasets. To save time on data discovery and preparation, use curated datasets that are ready for machine learning projects.

Transportation

| Dataset | Description |
| --- | --- |
| TartanAir: AirSim Simulation Dataset | AirSim autonomous vehicle data generated to solve Simultaneous Localization and Mapping (SLAM). |
| NYC Taxi & Limousine Commission - yellow taxi trip records | The yellow taxi trip records include pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. |
| NYC Taxi & Limousine Commission - green taxi trip records | The green taxi trip records include pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. |
| NYC Taxi & Limousine Commission - For-Hire Vehicle (FHV) trip records | The For-Hire Vehicle trip records include the dispatching base license number and the pick-up date, time, and taxi zone location ID. |
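As a rough illustration of how the yellow taxi trip records listed above might be pulled into a notebook, the sketch below uses the `NycTlcYellow` class from the `azureml-opendatasets` Python package. It assumes that package is installed and that network access is available. The helper names `month_window` and `load_yellow_trips` are hypothetical and exist only for this example, and how the service treats the end boundary of the date range is not specified here.

```python
from datetime import datetime


def month_window(year: int, month: int) -> tuple[datetime, datetime]:
    """Return the first instant of a month and the first instant of the next month."""
    start = datetime(year, month, 1)
    # December rolls over into January of the following year.
    end = datetime(year + (month == 12), month % 12 + 1, 1)
    return start, end


def load_yellow_trips(year: int, month: int):
    """Load roughly one month of yellow taxi trips as a pandas DataFrame.

    Assumption: `pip install azureml-opendatasets` has been run. The import is
    deferred so that `month_window` stays usable without the package.
    """
    from azureml.opendatasets import NycTlcYellow

    start, end = month_window(year, month)
    trips = NycTlcYellow(start_date=start, end_date=end)
    return trips.to_pandas_dataframe()


if __name__ == "__main__":
    # Only the date bounds are computed here; no data is downloaded.
    print(month_window(2018, 5))
```

Requesting a narrow date window keeps the download small, which is usually a sensible starting point before scaling up to a longer period.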

Health and genomics

| Dataset | Description |
| --- | --- |
| COVID-19 Data Lake | A collection of COVID-19 related datasets from various sources, covering data such as testing and patient outcome tracking, social distancing policy, hospital capacity, and mobility. |
| COVID-19 Open Research Dataset | A full-text and metadata dataset of COVID-19 and coronavirus-related scholarly articles, optimized for machine readability and made available for use by the global research community. |
| Genomics Data Lake | The Genomics Data Lake provides various public datasets available for free, ready to integrate into your genomics analysis workflows and applications. The datasets include genome sequences, variant info, and subject/sample metadata in BAM, FASTA, VCF, and CSV file formats. |

Labor and economics

| Dataset | Description |
| --- | --- |
| US Labor Force Statistics | Labor force statistics, labor force participation rates, and the civilian noninstitutional population by age, gender, race, and ethnic group in the United States. |
| US National Employment Hours and Earnings | The Current Employment Statistics (CES) program produces detailed industry estimates of nonfarm employment, hours, and earnings of workers on payrolls in the United States. |
| US State Employment Hours and Earnings | The Current Employment Statistics (CES) program produces detailed industry estimates of nonfarm employment, hours, and earnings of workers on payrolls in the United States. |
| US Local Area Unemployment Statistics | The US Local Area Unemployment Statistics dataset provides monthly and annual employment, unemployment, and labor force data for Census regions and divisions, states, counties, metropolitan areas, and many cities in the United States. |
| US Consumer Price Index | The Consumer Price Index (CPI) measures the average change over time in the prices paid by urban consumers for a market basket of consumer goods and services. |
| US Producer Price Index - Industry | The Producer Price Index (PPI) measures the average change over time in the selling prices received by domestic producers for their output. |
| US Producer Price Index - Commodities | The Producer Price Index (PPI) measures the average change over time in the selling prices received by domestic producers for their commodities. |
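Two quantities named in the descriptions above lend themselves to a short worked example: the change in a price index (CPI or PPI) between two periods, and the labor force participation rate, which is the labor force as a percentage of the civilian noninstitutional population. The sketch below is illustrative only. The function names are hypothetical, the input numbers are made up, and nothing here reflects actual column names or values in the datasets.

```python
def percent_change(index_start: float, index_end: float) -> float:
    """Percentage change in a price index between two periods."""
    if index_start <= 0:
        raise ValueError("index_start must be positive")
    return (index_end - index_start) / index_start * 100.0


def participation_rate(labor_force: float, population: float) -> float:
    """Labor force as a percentage of the civilian noninstitutional population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return labor_force / population * 100.0


if __name__ == "__main__":
    # Hypothetical index levels and population counts, chosen for easy arithmetic.
    print(f"Index change: {percent_change(250.0, 260.0):.1f}%")
    print(f"Participation rate: {participation_rate(160.0, 256.0):.1f}%")
```

Because an index is a ratio to a base period, the meaningful comparison is the relative change between two index levels rather than their absolute difference.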

Population and safety

| Dataset | Description |
| --- | --- |
| US Population by County | US population by gender and race for each US county, sourced from the 2000 and 2010 Decennial Census. This dataset is sourced from the United States Census Bureau. |
| US Population by ZIP Code | US population by gender and race for each US ZIP code, sourced from the 2010 Decennial Census. This dataset is sourced from the United States Census Bureau. |
| Boston Safety Data | Read data about 311 calls reported to the city of Boston. This dataset is stored in Parquet format and receives daily updates. |
| Chicago Safety Data | Read data about 311 calls reported to the city of Chicago. This dataset is stored in Parquet format and receives daily updates. |
| New York City Safety Data | This dataset contains all New York City 311 service requests from 2010 to the present. This dataset is stored in Parquet format and receives daily updates. |
| San Francisco Safety Data | Fire department calls for service and 311 cases in San Francisco. This dataset contains historical records accumulated from 2015 to the present. |
| Seattle Safety Data | Seattle Fire Department 911 dispatches. This dataset is updated daily and contains historical records accumulated from 2010 to the present. |

Supplemental and common datasets

| Dataset | Description |
| --- | --- |
| Diabetes | The Diabetes dataset has 442 samples with 10 features, making it ideal for getting started with machine learning algorithms. |
| OJ Sales Simulated Data | This dataset is derived from the Dominick’s OJ dataset and includes extra simulated data, with the goal of providing a dataset that makes it easy to simultaneously train thousands of models on Azure Machine Learning. |
| MNIST database of handwritten digits | The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. The digits are size-normalized and centered in a fixed-size image. |
| Microsoft News recommendation dataset | Microsoft News Dataset (MIND) is a large-scale dataset for news recommendation research. It serves as a benchmark dataset for news recommendation, and facilitates research in news recommendation and recommender systems. |
| Public holidays | Worldwide public holiday data sourced from the PyPI holidays package and Wikipedia, covering 38 countries or regions from 1970 to 2099. |
| Russian open speech to text | Russian Open STT is a large-scale open speech-to-text dataset for the Russian language. |