Note
Important Update May 2025: Dear Community, We’d like to inform you of an upcoming change regarding the Genomics open datasets currently available through Azure. After careful consideration, we decided to shift our focus to new initiatives that will better serve our community and align with our long-term goals. As such, access to the Genomics open datasets on Azure will be deprecated in the coming months. We understand these datasets were valuable for research, development, and learning, and we deeply appreciate the contributions and engagement from our community over time. Thank you for your understanding and support.
The 1000 Genomes Project ran from 2008 to 2015 to create the largest public catalog of human variation and genotype data. The final data set contains data for 2,504 individuals from 26 populations and 84 million identified variants. For more information, visit the 1000 Genomes Project website and the associated publications.
Visit this resource for more information about the relevant data formats.
[NEW]: The dataset is also available in Parquet format.
Note
Microsoft provides Azure Open Datasets on an “as is” basis. Microsoft makes no warranties, express or implied, guarantees or conditions with respect to your use of the datasets. To the extent permitted under your local law, Microsoft disclaims all liability for any damages or losses, including direct, consequential, special, indirect, incidental or punitive, resulting from your use of the datasets.
This dataset is provided under the original terms under which Microsoft received the source data. The dataset may include data sourced from Microsoft.
Data source
This dataset is a mirror of this FTP resource.
Data volumes and update frequency
This dataset contains approximately 815 TB of data. It receives daily updates.
Storage location
This dataset is stored in the West US 2 and West Central US Azure regions. We recommend locating compute resources in West US 2 or West Central US for affinity.
Data access
West US 2: "https://dataset1000genomes.blob.core.windows.net/dataset"
West Central US: "https://dataset1000genomes-secondary.blob.core.windows.net/dataset"
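As a minimal sketch of how these container URLs might be used, the Python below builds a Blob Storage "List Blobs" REST request for either region and extracts blob names from the XML response. The helper names (`list_blobs_url`, `parse_blob_names`, `fetch_blob_names`) are hypothetical. The sketch assumes the container either permits anonymous listing or that you have a SAS token to append, since no token is shown on this page. Because the dataset is scheduled for deprecation, the endpoints may no longer respond.

```python
import urllib.request
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Container URLs copied from the Data access section.
CONTAINER_URLS = {
    "West US 2": "https://dataset1000genomes.blob.core.windows.net/dataset",
    "West Central US": "https://dataset1000genomes-secondary.blob.core.windows.net/dataset",
}


def list_blobs_url(region, prefix="", max_results=10, sas_token=""):
    """Build a 'List Blobs' REST URL for the container in the given region.

    sas_token is an assumption: pass one only if the container requires it.
    """
    query = {"restype": "container", "comp": "list", "maxresults": str(max_results)}
    if prefix:
        query["prefix"] = prefix
    url = f"{CONTAINER_URLS[region]}?{urlencode(query)}"
    if sas_token:
        # A SAS token is already a query string; strip any leading '?'.
        url += "&" + sas_token.lstrip("?")
    return url


def parse_blob_names(xml_data):
    """Return the blob names found in a 'List Blobs' XML response."""
    root = ET.fromstring(xml_data)
    return [element.text for element in root.findall("./Blobs/Blob/Name")]


def fetch_blob_names(region, prefix="", max_results=10, sas_token=""):
    """Request one page of blob names (performs a network call)."""
    url = list_blobs_url(region, prefix, max_results, sas_token)
    with urllib.request.urlopen(url, timeout=30) as response:
        return parse_blob_names(response.read())
```

Calling `fetch_blob_names("West US 2")` from compute located in the same region would follow the affinity recommendation in the Storage location section. Only that function touches the network, so the URL construction and the parsing can be checked offline.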
Use Terms
Following the final publications, data from the 1000 Genomes Project is publicly available, without embargo, to anyone for use under the terms provided by the dataset source. Use of the data should be cited per the details available in the 1000 Genomes Project FAQ resource.
Contact
For contact information, scroll to the bottom of this resource.
Next steps
View the rest of the datasets in the Open Datasets catalog.