RNASeq pipeline

Important

This documentation has been retired and might not be updated. The products, services, or technologies mentioned in this content are no longer supported.

The Databricks Genomics runtime has been deprecated. For open source equivalents, see repos for genomics-pipelines and Glow. Bioinformatics libraries that were part of the runtime have been released as a Docker container, which can be pulled from the ProjectGlow Docker Hub page.

For more information about the Databricks Runtime deprecation policy and schedule, see Supported Databricks runtime releases and support schedule.

Note

The following library versions are packaged in Databricks Runtime 7.0 for Genomics. For libraries included in earlier versions of Databricks Runtime for Genomics, see the release notes.

The Databricks RNASeq pipeline handles short read alignment and quantification using STAR v2.6.1a and ADAM v0.32.0.

Setup

The pipeline is run as an Azure Databricks job. You can set up a cluster policy to save the configuration:

{
  "num_workers": {
    "type": "unlimited",
    "defaultValue": 13
  },
  "node_type_id": {
    "type": "unlimited",
    "defaultValue": "Standard_F32s_v2"
  },
  "spark_env_vars.refGenomeId": {
    "type": "unlimited",
    "defaultValue": "grch38_star"
  },
  "spark_version": {
    "type": "regex",
    "pattern": ".*-hls.*",
    "defaultValue": "7.4.x-hls-scala2.12"
  }
}
  • The task should be the RNASeq notebook provided at the bottom of this page.
  • For best performance, use compute-optimized VMs with at least 60 GB of memory. We recommend Standard_F32s_v2 VMs.
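
As a rough illustration of how the policy above constrains a cluster, the following Python sketch copies the policy definition into a dictionary, collects its default values, and checks a runtime version string against the `spark_version` pattern. The helper names `default_cluster_settings` and `version_allowed` are hypothetical, and the sketch assumes the regex must match the whole version string; it does not call any Databricks API.

```python
import re

# Policy definition copied from the JSON above.
POLICY = {
    "num_workers": {"type": "unlimited", "defaultValue": 13},
    "node_type_id": {"type": "unlimited", "defaultValue": "Standard_F32s_v2"},
    "spark_env_vars.refGenomeId": {"type": "unlimited", "defaultValue": "grch38_star"},
    "spark_version": {
        "type": "regex",
        "pattern": ".*-hls.*",
        "defaultValue": "7.4.x-hls-scala2.12",
    },
}


def default_cluster_settings(policy):
    """Collect the defaultValue of every attribute that declares one."""
    return {
        name: rule["defaultValue"]
        for name, rule in policy.items()
        if "defaultValue" in rule
    }


def version_allowed(policy, spark_version):
    """Check a runtime version against the policy's spark_version rule.

    Assumption: a regex rule must match the entire value.
    """
    rule = policy["spark_version"]
    if rule["type"] != "regex":
        return True
    return re.fullmatch(rule["pattern"], spark_version) is not None


if __name__ == "__main__":
    print(default_cluster_settings(POLICY))
    print(version_allowed(POLICY, "7.4.x-hls-scala2.12"))
    print(version_allowed(POLICY, "7.4.x-scala2.12"))
```

Under these assumptions, only runtime versions containing `-hls` (the Genomics runtime tag used in the policy) pass the check, which is the point of the `regex` rule.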

Reference genomes

You must configure the reference genome using environment variables. To use GRCh37, set the environment variable:

refGenomeId=grch37_star

To use GRCh38 instead, set the environment variable:

refGenomeId=grch38_star
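
If cluster specifications are generated from a script, the environment variable can be set through the cluster's `spark_env_vars` map, the same attribute the policy above refers to. The sketch below is a minimal illustration; `ref_genome_env` is a hypothetical helper, and the two build names are taken from this page.

```python
# Mapping from genome build name to the refGenomeId values listed on this page.
SUPPORTED_REF_GENOMES = {
    "GRCh37": "grch37_star",
    "GRCh38": "grch38_star",
}


def ref_genome_env(build):
    """Return the environment variable mapping for a reference genome build."""
    try:
        return {"refGenomeId": SUPPORTED_REF_GENOMES[build]}
    except KeyError:
        supported = ", ".join(sorted(SUPPORTED_REF_GENOMES))
        raise ValueError(
            f"Unsupported build {build!r}; expected one of: {supported}"
        ) from None


if __name__ == "__main__":
    # Partial cluster specification; other fields are omitted for brevity.
    cluster_spec = {
        "node_type_id": "Standard_F32s_v2",
        "spark_env_vars": ref_genome_env("GRCh37"),
    }
    print(cluster_spec)
```

Rejecting unknown build names early avoids starting a cluster with a `refGenomeId` the pipeline does not recognize.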

Parameters

The pipeline accepts a number of parameters that control its behavior. The most important and commonly changed parameters are documented here; the rest can be found in the RNASeq notebook. After importing the notebook and setting it as a job task, you can set these parameters for all runs or per-run.

| Parameter | Default | Description |
| --- | --- | --- |
| manifest | n/a | The manifest describing the input. |
| output | n/a | The path where pipeline output should be written. |
| replayMode | skip | One of: skip (stages are skipped if output already exists) or overwrite (existing output is deleted). |
| perSampleTimeout | 12h | A timeout applied per sample. After reaching this timeout, the pipeline continues on to the next sample. The value of this parameter must include a timeout unit: ‘s’ for seconds, ‘m’ for minutes, or ‘h’ for hours. For example, ‘60m’ results in a timeout of 60 minutes. |
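
To make the parameter rules concrete, here is a small Python sketch that validates a set of parameters before a run is submitted. The functions `parse_timeout` and `build_params` are hypothetical helpers, not part of the pipeline, and the sketch assumes timeouts are whole numbers followed by a single unit letter; the pipeline itself may accept other forms.

```python
import re

# Seconds per unit, following the units described for perSampleTimeout.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}
_REPLAY_MODES = ("skip", "overwrite")


def parse_timeout(value):
    """Convert a timeout such as '60m' or '12h' to seconds.

    Assumption: the value is an integer followed by one of s, m, or h.
    """
    match = re.fullmatch(r"(\d+)([smh])", value.strip())
    if match is None:
        raise ValueError(f"Timeout {value!r} must be a number followed by s, m, or h")
    amount, unit = match.groups()
    return int(amount) * _UNIT_SECONDS[unit]


def build_params(manifest, output, replay_mode="skip", per_sample_timeout="12h"):
    """Validate the documented parameters and return them as a dictionary."""
    if replay_mode not in _REPLAY_MODES:
        raise ValueError(f"replayMode must be one of {_REPLAY_MODES}, got {replay_mode!r}")
    parse_timeout(per_sample_timeout)  # raises ValueError if the unit is missing
    return {
        "manifest": manifest,
        "output": output,
        "replayMode": replay_mode,
        "perSampleTimeout": per_sample_timeout,
    }


if __name__ == "__main__":
    # Both paths are placeholders chosen for illustration.
    params = build_params("dbfs:/example/manifest.csv", "dbfs:/example/output")
    print(params)
    print(parse_timeout("60m"))
```

The resulting dictionary uses the parameter names from the table, so it could be passed as the notebook parameters of a job run.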

Walkthrough

The pipeline consists of two steps:

  1. Alignment: Map each short read to the reference genome using the STAR aligner.
  2. Quantification: Count how many reads correspond to each reference transcript.
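
The quantification step can be pictured with a toy Python example. This is only a conceptual sketch, not how STAR or ADAM implement it: it assumes alignment has already assigned each read to at most one transcript, and the read and transcript identifiers are made up.

```python
from collections import Counter


def count_reads_per_transcript(alignments):
    """Count aligned reads per transcript.

    alignments: iterable of (read_id, transcript_id) pairs, where
    transcript_id is None for reads that did not align.
    """
    counts = Counter()
    for _read_id, transcript_id in alignments:
        if transcript_id is not None:
            counts[transcript_id] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical alignment results for five reads.
    toy_alignments = [
        ("read1", "transcriptA"),
        ("read2", "transcriptA"),
        ("read3", "transcriptB"),
        ("read4", None),  # unaligned read, ignored
        ("read5", "transcriptA"),
    ]
    print(count_reads_per_transcript(toy_alignments))
```

Real quantification also has to handle reads that map to several transcripts or span splice junctions, which this sketch deliberately leaves out.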

Additional usage info and troubleshooting

The operational aspects of the RNASeq pipeline are very similar to the DNASeq pipeline. For more information about manifest format, output structure, programmatic usage, and common issues, see DNASeq pipeline.