This article provides a guide to developing notebooks and jobs in Azure Databricks using the Scala language. The first section provides links to tutorials for common workflows and tasks. The second section provides links to APIs, libraries, and key tools.
Run your code on a cluster: Either create a cluster of your own or ensure that you have permissions to use a shared cluster. Attach your notebook to the cluster and run the notebook.
Beyond this, you can branch out into more specific topics:
The tutorials below provide example code and notebooks to learn about common workflows. See Import a notebook for instructions on importing notebook examples into your workspace.
Manage code with notebooks and Databricks Git folders
Databricks notebooks support Scala. These notebooks provide functionality similar to that of Jupyter, but with additions such as built-in visualizations using big data, Apache Spark integrations for debugging and performance monitoring, and MLflow integrations for tracking machine learning experiments. Get started by importing a notebook. Once you have access to a cluster, you can attach a notebook to the cluster and run the notebook.
Tip
To completely reset the state of your notebook, it can be useful to restart the kernel. For Jupyter users, the “restart kernel” option in Jupyter corresponds to detaching and re-attaching a notebook in Databricks. To restart the kernel in a notebook, click the compute selector in the notebook toolbar and hover over the attached cluster or SQL warehouse in the list to display a side menu. Select Detach & re-attach. This detaches the notebook from your cluster and reattaches it, which restarts the process.
Databricks Git folders allow users to synchronize notebooks and other files with Git repositories. Databricks Git folders help with code versioning and collaboration, and it can simplify importing a full repository of code into Azure Databricks, viewing past notebook versions, and integrating with IDE development. Get started by cloning a remote Git repository. You can then open or create notebooks with the repository clone, attach the notebook to a cluster, and run the notebook.
Clusters and libraries
Azure Databricks compute provides compute management for clusters of any size: from single node clusters up to large clusters. You can customize cluster hardware and libraries according to your needs. Data scientists generally begin work either by creating a cluster or using an existing shared cluster. Once you have access to a cluster, you can attach a notebook to the cluster or run a job on the cluster.
For small workloads which only require single nodes, data scientists can use single node compute for cost savings.
Administrators can set up cluster policies to simplify and guide cluster creation.
Azure Databricks clusters use a Databricks Runtime, which provides many popular libraries out-of-the-box, including Apache Spark, Delta Lake, and more. You can also install additional third-party or custom libraries to use with notebooks and jobs.
The Databricks SDKs allow you to create, edit, and delete jobs programmatically.
The Databricks CLI provides a convenient command line interface for automating jobs.
IDEs, developer tools, and SDKs
In addition to developing Scala code within Azure Databricks notebooks, you can develop externally using integrated development environments (IDEs) such as IntelliJ IDEA. To synchronize work between external development environments and Azure Databricks, there are several options:
Libraries and jobs: You can create libraries externally and upload them to Azure Databricks. Those libraries may be imported within Azure Databricks notebooks, or they can be used to create jobs. See Libraries and Schedule and orchestrate workflows.
Remote machine execution: You can run code from your local IDE for interactive development and testing. The IDE can communicate with Azure Databricks to execute large computations on Azure Databricks clusters. For example, you can use IntelliJ IDEA with Databricks Connect.
Databricks provides a set of SDKs which support automation and integration with external tooling. You can use the Databricks SDKs to manage resources like clusters and libraries, code and other workspace objects, workloads and jobs, and more. See the Databricks SDKs.
For more information on IDEs, developer tools, and SDKs, see Developer tools.
Additional resources
The Databricks Academy offers self-paced and instructor-led courses on many topics.
Manage data ingestion and preparation, model training and deployment, and machine learning solution monitoring with Python, Azure Machine Learning and MLflow.