Hail
Hail is a library built on Apache Spark for analyzing large genomic datasets.
Important
- When you use Hail 0.2.65 and above, use Apache Spark version 3.1 (Databricks Runtime 8.x or 9.x)
- Install Hail on Databricks Runtime, not Databricks Runtime for Genomics (deprecated)
- Hail is not supported with Credential passthrough (legacy)
- Hail is not supported with Glow, except when exporting from Hail to Glow
Create a cluster
Install Hail via Docker with Databricks Container Services.
For containers to set up a Hail environment, see the ProjectGlow Dockerhub page.
Use projectglow/databricks-hail:<hail_version>, replacing the tag with an available Hail version.
Create a jobs cluster with Hail
- Set up the Databricks CLI.
- Create a cluster using the Hail Docker container, setting the tag to the desired <hail_version>.
- An example jobs definition is given below. Before creating the job, edit notebook_path, the Databricks Runtime version <databricks_runtime_version>, and <hail_version>.
databricks jobs create --json-file hail-create-job.json
hail-create-job.json:
{
  "name": "hail",
  "notebook_task": {
    "notebook_path": "/Users/<user@organization.com>/hail/docs/hail-tutorial"
  },
  "new_cluster": {
    "spark_version": "<databricks_runtime_version>.x-scala2.12",
    "azure_attributes": {
      "availability": "SPOT_WITH_FALLBACK_AZURE",
      "spot_bid_max_price": -1
    },
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 32,
    "docker_image": {
      "url": "projectglow/databricks-hail:<hail_version>"
    }
  }
}
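The placeholders in hail-create-job.json can also be filled in programmatically before calling the CLI. The following is a minimal sketch, assuming Python 3 is available on the machine where you run the Databricks CLI. The helper names build_hail_job and write_hail_job are hypothetical; the field names and fixed values mirror the jobs definition above.

```python
import json


def build_hail_job(notebook_path, runtime_version, hail_version, num_workers=32):
    """Return the jobs definition as a dict (hypothetical helper).

    runtime_version fills <databricks_runtime_version> and hail_version
    fills <hail_version> in the definition from this page.
    """
    return {
        "name": "hail",
        "notebook_task": {"notebook_path": notebook_path},
        "new_cluster": {
            "spark_version": f"{runtime_version}.x-scala2.12",
            "azure_attributes": {
                "availability": "SPOT_WITH_FALLBACK_AZURE",
                "spot_bid_max_price": -1,
            },
            "node_type_id": "Standard_DS3_v2",
            "num_workers": num_workers,
            "docker_image": {
                "url": f"projectglow/databricks-hail:{hail_version}"
            },
        },
    }


def write_hail_job(path, **job_args):
    """Write the definition to path as valid JSON and return the dict."""
    job = build_hail_job(**job_args)
    with open(path, "w") as f:
        json.dump(job, f, indent=2)
    return job
```

Writing the file with json.dump guarantees well-formed JSON, which avoids problems such as stray trailing commas when the file is edited by hand. The resulting file can then be passed to databricks jobs create --json-file as shown above.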
Use Hail in a notebook
For the most part, Hail in Azure Databricks works as described in the Hail documentation. However, a few modifications are necessary for the Azure Databricks environment.
Initialize Hail
When initializing Hail, pass in the pre-created SparkContext and mark the initialization as idempotent. This setting enables multiple Azure Databricks notebooks to use the same Hail context.
Note
Enable skip_logging_configuration to save logs to the rolling driver log4j output. This setting is supported only in Hail 0.2.39 and above.
import hail as hl
hl.init(sc, idempotent=True, quiet=True, skip_logging_configuration=True)
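Because skip_logging_configuration is supported only in Hail 0.2.39 and above, a notebook that must run against several Hail versions could build the hl.init arguments conditionally. The following is a minimal sketch, assuming the version string starts with a dotted numeric version such as 0.2.65, optionally followed by a suffix after a hyphen. parse_version and hail_init_kwargs are hypothetical helpers, not part of Hail.

```python
def parse_version(version):
    """Turn a string such as '0.2.65' or '0.2.65-abc123' into (0, 2, 65)."""
    numeric = version.split("-")[0]
    return tuple(int(part) for part in numeric.split("."))


def hail_init_kwargs(hail_version):
    """Build keyword arguments for hl.init (hypothetical helper).

    skip_logging_configuration is added only for Hail 0.2.39 and above,
    the versions that this page says support it.
    """
    kwargs = {"idempotent": True, "quiet": True}
    if parse_version(hail_version) >= (0, 2, 39):
        kwargs["skip_logging_configuration"] = True
    return kwargs
```

In a notebook this might be used as hl.init(sc, **hail_init_kwargs(hl.version())), assuming hl.version() returns a string in the format described above.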
Display Bokeh plots
Hail uses the Bokeh library to create plots. The show function built into Bokeh does not work in Azure Databricks. To display a Bokeh plot generated by Hail, you can run a command like:
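from bokeh.embed import file_html
from bokeh.resources import CDN

# mt is an existing Hail MatrixTable with a DP entry field
plot = hl.plot.histogram(mt.DP, range=(0, 30), bins=30, title='DP Histogram', legend='DP')
# Render the plot as a standalone HTML page and display it in the notebook
html = file_html(plot, CDN, "Chart")
displayHTML(html)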
See Bokeh for more information.