Genomics Data Lake

Article
10/18/2024

The Genomics Data Lake provides public datasets that you can access for free and integrate into your genomics analysis workflows and applications. The datasets include genome sequences, variant information, and subject/sample metadata in BAM, FASTA, VCF, and CSV file formats.
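
As an illustration of working with one of the formats listed above, the following minimal Python sketch reads the eight fixed columns of VCF data lines (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The names `VcfRecord`, `parse_vcf_line`, and `iter_vcf_records` are hypothetical helpers, and `SAMPLE_VCF` is a made-up record used only for demonstration; it is not taken from any dataset in the data lake. For production workflows, a dedicated VCF library is likely a better choice.

```python
from typing import Dict, Iterable, Iterator, List, NamedTuple


class VcfRecord(NamedTuple):
    """The eight fixed columns of a VCF data line (hypothetical helper type)."""
    chrom: str
    pos: int
    variant_id: str
    ref: str
    alt: List[str]
    qual: str
    filter_status: str
    info: Dict[str, str]


def parse_vcf_line(line: str) -> VcfRecord:
    """Parse one tab-separated VCF data line, ignoring any genotype columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 columns, got {len(fields)}")
    chrom, pos, variant_id, ref, alt, qual, filter_status, info = fields[:8]

    info_dict: Dict[str, str] = {}
    if info != ".":
        for entry in info.split(";"):
            # Flag entries such as "DB" have no "=", so their value is "".
            key, _, value = entry.partition("=")
            info_dict[key] = value

    return VcfRecord(chrom, int(pos), variant_id, ref,
                     alt.split(","), qual, filter_status, info_dict)


def iter_vcf_records(lines: Iterable[str]) -> Iterator[VcfRecord]:
    """Yield records from VCF text, skipping header lines and blank lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        yield parse_vcf_line(line)


# Made-up sample content for demonstration only (assumption, not real data).
SAMPLE_VCF = """##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
chr1\t12345\t.\tA\tG,T\t50\tPASS\tDP=100;DB
"""

if __name__ == "__main__":
    for record in iter_vcf_records(SAMPLE_VCF.splitlines()):
        print(record.chrom, record.pos, record.ref, record.alt, record.info)
```

The sketch keeps QUAL as a string because the column may hold `.` for a missing value; convert it only after checking for that case.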

The Genomics Data Lake is hosted in the West US 2 and West Central US Azure regions. For affinity, we recommend allocating compute resources in the same regions.

Note

Use of datasets is subject to terms and conditions set by the dataset owners. See the details page for each dataset for applicable terms and conditions.

Datasets

Dataset	Description
Illumina Platinum Genomes	Illumina Platinum Genomes
Human Reference Genomes	Human Reference Genomes
ClinVar Annotations	ClinVar Annotations
SnpEff	SnpEff: Genomic variant annotations and functional effect prediction toolbox
gnomAD	gnomAD: Genome Aggregation Database
1000 Genomes	1000 Genomes
OpenCravat	OpenCravat: Open Custom Ranked Analysis of Variants Toolkit
ENCODE	ENCODE: Encyclopedia of DNA Elements
GATK Resource Bundle	GATK Resource Bundle
TCGA Open Data	TCGA Open Data
Pan UK-Biobank	Pan UK-Biobank
ImmuneCODE database	ImmuneCODE database
Open Targets dataset	Open Targets dataset

Next steps

View the rest of the datasets in the Open Datasets catalog.
