Improve the accuracy of your machine learning models with publicly available datasets. To save time on data discovery and preparation, use curated datasets that are ready for machine learning projects.
The yellow taxi trip records include pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.
The green taxi trip records include pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.
The COVID-19 Data Lake is a collection of COVID-19-related datasets from various sources, covering testing and patient outcome tracking data, social distancing policies, hospital capacity, and mobility.
A full-text and metadata dataset of COVID-19 and coronavirus-related scholarly articles, optimized for machine readability and made available for use by the global research community.
The Genomics Data Lake provides various public datasets, available for free and ready to integrate into your genomics analysis workflows and applications. The datasets include genome sequences, variant information, and subject/sample metadata in BAM, FASTA, VCF, and CSV file formats.
The US Labor Force Statistics dataset provides labor force statistics, labor force participation rates, and the civilian noninstitutional population by age, gender, race, and ethnic group in the United States.
The Current Employment Statistics (CES) program produces detailed industry estimates of nonfarm employment, hours, and earnings of workers on payrolls in the United States.
The US Local Area Unemployment Statistics dataset provides monthly and annual employment, unemployment, and labor force data for Census regions and divisions, states, counties, metropolitan areas, and many cities in the United States.
The Consumer Price Index (CPI) measures the average change over time in the prices paid by urban consumers for a market basket of consumer goods and services.
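As a rough illustration of what "average change over time" means for an index like this, the sketch below computes the percentage change between two index levels. The function name `percent_change` and the index values are hypothetical and chosen only for illustration; they are not real CPI figures and are not taken from the dataset.

```python
def percent_change(index_start: float, index_end: float) -> float:
    """Return the percentage change from index_start to index_end."""
    if index_start <= 0:
        raise ValueError("index_start must be positive")
    # Multiply before dividing to keep the arithmetic exact for simple inputs.
    return (index_end - index_start) * 100.0 / index_start


# Hypothetical index levels for two periods (illustrative only, not real CPI data).
hypothetical_index = {"period_1": 200.0, "period_2": 210.0}

change = percent_change(hypothetical_index["period_1"], hypothetical_index["period_2"])
print(f"{change:.1f}%")  # prints "5.0%"
```

A positive result indicates that the index level rose between the two periods; a negative result indicates that it fell.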
US population by gender and race for each US county, based on the 2000 and 2010 Decennial Censuses. This dataset is sourced from the United States Census Bureau.
US population by gender and race for each US ZIP code, based on the 2010 Decennial Census. This dataset is sourced from the United States Census Bureau.
This dataset contains all New York City 311 service requests from 2010 to the present. This dataset is stored in Parquet format and receives daily updates.
This dataset is derived from the Dominick’s OJ dataset and includes extra simulated data, so that you can easily train thousands of models simultaneously on Azure Machine Learning.
The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. The digits are size-normalized and centered in a fixed-size image.
Microsoft News Dataset (MIND) is a large-scale dataset for news recommendation research. It serves as a benchmark for news recommendation and facilitates research in recommender systems more broadly.