Human Reference Genomes
Note
Important Update 9/19/2024: All URLs are changing. We are enabling public access to all Genomics Data Lake containers. The existing “signed URLs” (shared access signatures) will be retired at: 2024-11-04T00:00:00Z. After this time, the URLs without a query string will continue to work, however the “signed URLs” will no longer work and will return a 403 HTTP status code. Please plan accordingly to access the public URLs without a query string after this date (remove the ‘?’ and trailing characters).
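The conversion described in this note amounts to dropping everything from the '?' onward. A minimal Python sketch, assuming the signed URL is a well-formed blob URL; the helper name `to_public_url` is illustrative and not part of any Azure SDK, and the token in the example is a placeholder:

```python
from urllib.parse import urlsplit, urlunsplit


def to_public_url(signed_url: str) -> str:
    """Return the URL with its query string (the SAS token) and fragment removed."""
    parts = urlsplit(signed_url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


# Placeholder token for illustration only, not a real signature.
signed = (
    "https://datasetreferencegenomes.blob.core.windows.net/dataset"
    "?sv=2019-02-02&si=prod&sr=c&sig=EXAMPLE"
)
print(to_public_url(signed))
```

URLs that already lack a query string pass through unchanged, so the helper can be applied to a mixed list of old and new URLs.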
This dataset includes two human-genome references assembled by the Genome Reference Consortium: Hg19 and Hg38.
For more information on Hg19 (GRCh37) data, see the GRCh37 report at NCBI.
For more information on Hg38 (GRCh38) data, see the GRCh38 report at NCBI.
Other details about the data can be found at the NCBI RefSeq site.
Note
Microsoft provides Azure Open Datasets on an “as is” basis. Microsoft makes no warranties, express or implied, guarantees or conditions with respect to your use of the datasets. To the extent permitted under your local law, Microsoft disclaims all liability for any damages or losses, including direct, consequential, special, indirect, incidental or punitive, resulting from your use of the datasets.
This dataset is provided under the original terms under which Microsoft received the source data. The dataset may include data sourced from Microsoft.
Data source
This dataset is sourced from two FTP locations:
Blob names are prefixed beginning with the “vertebrate_mammalian” segment of the URI.
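The naming rule can be sketched as a small path transformation. This is an illustration under stated assumptions: the source URI below is hypothetical (the actual FTP locations are not reproduced here), and `blob_name_from_uri` is an illustrative helper, not part of any Azure or NCBI tooling.

```python
from urllib.parse import urlsplit

PREFIX_SEGMENT = "vertebrate_mammalian"


def blob_name_from_uri(source_uri: str) -> str:
    """Return the part of the URI path starting at the 'vertebrate_mammalian' segment."""
    segments = urlsplit(source_uri).path.strip("/").split("/")
    if PREFIX_SEGMENT not in segments:
        raise ValueError(f"{PREFIX_SEGMENT!r} segment not found in {source_uri!r}")
    start = segments.index(PREFIX_SEGMENT)
    return "/".join(segments[start:])


# Hypothetical source URI, used only to illustrate the mapping.
example_uri = "ftp://ftp.example.org/genomes/refseq/vertebrate_mammalian/Homo_sapiens/README.txt"
print(blob_name_from_uri(example_uri))
```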
Data volumes and update frequency
This dataset contains approximately 10 GB of data and is updated daily.
Storage location
This dataset is stored in the West US 2, West Central US, and South Central US Azure regions. Allocating compute resources in one of these regions is recommended for affinity.
Data Access
West US 2: 'https://datasetreferencegenomes.blob.core.windows.net/dataset'
West Central US: 'https://datasetreferencegenomes-secondary.blob.core.windows.net/dataset'
SAS Token: sv=2019-02-02&se=2050-01-01T08%3A00%3A00Z&si=prod&sr=c&sig=JtQoPFqiC24GiEB7v9zHLi4RrA2Kd1r%2F3iFt2l9%2FlV8%3D
South Central US: 'https://datasetreferencegenomesc.blob.core.windows.net/dataset'
SAS Token: sv=2023-01-03&st=2024-02-12T20%3A07%3A21Z&se=2029-02-13T20%3A07%3A00Z&sr=c&sp=rl&sig=ASZYVyhqLOXKsT%2BcTR8MMblFeI4uZ%2Bnno%2FCnQk2RaFs%3D
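One way to pick an endpoint close to your compute is to keep the three container URLs in a lookup table and join the blob name onto the chosen one. A minimal sketch; the dictionary and the `blob_url` helper are illustrative names, and the blob name is the example file used later on this page:

```python
# Container URLs as listed above, keyed by Azure region.
CONTAINER_URLS = {
    "West US 2": "https://datasetreferencegenomes.blob.core.windows.net/dataset",
    "West Central US": "https://datasetreferencegenomes-secondary.blob.core.windows.net/dataset",
    "South Central US": "https://datasetreferencegenomesc.blob.core.windows.net/dataset",
}


def blob_url(region: str, blob_name: str) -> str:
    """Build the public URL of a blob in the container for the given region."""
    base = CONTAINER_URLS[region]  # raises KeyError for an unknown region
    return base + "/" + blob_name.lstrip("/")


example_blob = (
    "vertebrate_mammalian/Homo_sapiens/latest_assembly_versions/"
    "GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_assembly_structure/"
    "genomic_regions_definitions.txt"
)
print(blob_url("South Central US", example_blob))
```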
Use Terms
Data is available without restrictions. For more information and citation details, see the NCBI Reference Sequence Database site.
Contact
For any questions or feedback about this dataset, contact the Genome Reference Consortium.
Data access
Azure Notebooks
Getting the Reference Genomes from Azure Open Datasets
Several public genomics datasets have been uploaded as Azure Open Datasets. The examples below create a blob service linked to this open dataset and show how to read the Reference Genomes data from Azure Open Datasets:
This notebook reads and downloads the following file: 'https://datasetreferencegenomes.blob.core.windows.net/dataset/vertebrate_mammalian/Homo_sapiens/latest_assembly_versions/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_assembly_structure/genomic_regions_definitions.txt'
Important note: To view the data with the Azure ML SDK, users must first log in to their Azure account through the Azure CLI. Downloading the data requires no additional steps.
Calling the data from 'Reference Genome Datasets'
import azureml.core
print("Azure ML SDK Version: ", azureml.core.VERSION)
from azureml.core import Dataset
# reference the public container as a file dataset and create a mount context
reference_dataset = Dataset.File.from_files('https://datasetreferencegenomes.blob.core.windows.net/dataset')
mount = reference_dataset.mount()
import os
# directory inside the container that holds the GRCh38.p13 assembly structure files
REF_DIR = '/vertebrate_mammalian/Homo_sapiens/latest_assembly_versions/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_assembly_structure'
path = mount.mount_point + REF_DIR
# the dataset is mounted only while the with-block is active
with mount:
    print(os.listdir(path))
import pandas as pd
# start the mount (call mount.stop() when finished to release it)
mount.start()
# specify path to genomic_regions_definitions.txt file
REF_DIR = 'vertebrate_mammalian/Homo_sapiens/latest_assembly_versions/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_assembly_structure'
metadata_filename = '{}/{}/{}'.format(mount.mount_point, REF_DIR, 'genomic_regions_definitions.txt')
# read genomic_regions_definitions.txt file
metadata = pd.read_table(metadata_filename)
metadata
Download the specific file
# BlockBlobService is the legacy (pre-v12) azure-storage-blob client
from azure.storage.blob import BlockBlobService
# no SAS token is passed: per the update notice at the top of this page, the signed URLs are retired and the container is publicly readable
blob_service_client = BlockBlobService(account_name='datasetreferencegenomes')
# arguments: container name, blob name, local file path
blob_service_client.get_blob_to_path('dataset', 'vertebrate_mammalian/Homo_sapiens/latest_assembly_versions/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_assembly_structure/genomic_regions_definitions.txt', './genomic_regions_definitions.txt')
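Because the containers are publicly readable (see the update notice at the top of this page), the same file can also be fetched over plain HTTPS without an Azure SDK. A minimal sketch using only the Python standard library; `download_public_blob` and its `opener` parameter are illustrative names, not part of any Azure API, and the sketch assumes the public URL is reachable from your environment:

```python
import shutil
import urllib.request

# Public URL of the example file in the West US 2 container (no query string).
BLOB_URL = (
    "https://datasetreferencegenomes.blob.core.windows.net/dataset/"
    "vertebrate_mammalian/Homo_sapiens/latest_assembly_versions/"
    "GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_assembly_structure/"
    "genomic_regions_definitions.txt"
)


def download_public_blob(url, dest_path, opener=urllib.request.urlopen):
    """Stream the response body for `url` into `dest_path` and return `dest_path`.

    `opener` defaults to urllib's urlopen; it is a parameter so that the
    function can be exercised without network access.
    """
    with opener(url) as response, open(dest_path, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    return dest_path
```

Calling `download_public_blob(BLOB_URL, './genomic_regions_definitions.txt')` would write the file to the current directory. An HTTP error such as 403 surfaces as `urllib.error.HTTPError`.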
Next steps
View the rest of the datasets in the Open Datasets catalog.