Genomics Data Lake

The Genomics Data Lake provides public datasets that you can access for free and integrate into your genomics analysis workflows and applications. The datasets include genome sequences, variant information, and subject/sample metadata in BAM, FASTA, VCF, and CSV file formats.
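As an illustration of integrating one of these formats into a workflow, the following minimal sketch reads the eight fixed columns of an uncompressed VCF text stream using only the Python standard library. The function name `iter_vcf_records` and the sample records are hypothetical and made up for this example; they are not taken from any dataset in the Genomics Data Lake, and real pipelines would more likely use a dedicated VCF library.

```python
import io
from typing import Dict, Iterator, TextIO


def iter_vcf_records(handle: TextIO) -> Iterator[Dict[str, object]]:
    """Yield one dict per data line of an uncompressed VCF text stream.

    Hypothetical helper for illustration only. It reads the eight fixed
    VCF columns and ignores any per-sample genotype columns.
    """
    for line in handle:
        line = line.rstrip("\r\n")
        # Skip blank lines, meta-information lines (##) and the header line (#CHROM).
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Raises ValueError if a data line has fewer than eight columns.
        chrom, pos, variant_id, ref, alt, qual, filt, info = fields[:8]
        yield {
            "CHROM": chrom,
            "POS": int(pos),        # 1-based reference position
            "ID": variant_id,
            "REF": ref,
            "ALT": alt.split(","),  # multiple alternate alleles are comma-separated
            "QUAL": qual,
            "FILTER": filt,
            "INFO": info,
        }


# Made-up sample content, not real variant data.
SAMPLE_VCF = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t12345\trs0000001\tA\tG,T\t50\tPASS\tDP=100\n"
    "chr2\t67890\t.\tC\tT\t99\tPASS\tDP=42\n"
)

records = list(iter_vcf_records(io.StringIO(SAMPLE_VCF)))
for record in records:
    print(record["CHROM"], record["POS"], record["REF"], record["ALT"])
```

The same function accepts any text file object, so a VCF file downloaded from a dataset could be passed in with `open(path)` once it has been decompressed.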

The Genomics Data Lake is hosted in the West US 2 and West Central US Azure regions. Allocate compute resources in one of these regions so that they are located close to the data.

Note

Use of datasets is subject to terms and conditions set by the dataset owners. See the details page for each dataset for applicable terms and conditions.

Datasets

Dataset                    Description
Illumina Platinum Genomes  Illumina Platinum Genomes
Human Reference Genomes    Human Reference Genomes
ClinVar Annotations        ClinVar Annotations
SnpEff                     SnpEff: Genomic variant annotations and functional effect prediction toolbox
gnomAD                     gnomAD: Genome Aggregation Database
1000 Genomes               1000 Genomes
OpenCravat                 OpenCravat: Open Custom Ranked Analysis of Variants Toolkit
ENCODE                     ENCODE: Encyclopedia of DNA Elements
GATK Resource Bundle       GATK Resource Bundle
TCGA Open Data             TCGA Open Data
Pan UK-Biobank             Pan UK-Biobank

Next steps

View the rest of the datasets in the Open Datasets catalog.