Azure Databricks for Scala developers
This article provides a guide to developing notebooks and jobs in Azure Databricks using the Scala language. The first section provides links to tutorials for common workflows and tasks. The second section provides links to APIs, libraries, and key tools.
A basic workflow for getting started is:
- Import code and run it using an interactive Databricks notebook: Either import your own code from files or Git repos or try a tutorial listed below.
- Run your code on a cluster: Either create a cluster of your own or ensure that you have permissions to use a shared cluster. Attach your notebook to the cluster and run the notebook.
Beyond this, you can branch out into more specific topics:
- Work with larger data sets using Apache Spark
- Add visualizations
- Automate your workload as a job
- Develop in IDEs
The tutorials below provide example code and notebooks to learn about common workflows. See Import a notebook for instructions on importing notebook examples into your workspace.
- Tutorial: Load and transform data using Apache Spark DataFrames
- Tutorial: Delta Lake provides Scala examples.
- Use XGBoost on Azure Databricks provides a Scala example.
The following subsections list key features and tips to help you begin developing in Azure Databricks with Scala.
These links provide an introduction to and reference for the Apache Spark Scala API.
- Tutorial: Load and transform data using Apache Spark DataFrames
- Query JSON strings
- Introduction to Structured Streaming
- Apache Spark Core API reference
- Apache Spark ML API reference
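As a minimal sketch of the kind of DataFrame and JSON-string operations these references cover, the following builds a small in-memory DataFrame, extracts fields from a JSON string column with `get_json_object`, and averages a value per group. The sample rows, the column names, and the `DataFrameSketch` object are hypothetical. In a Databricks notebook a `spark` session already exists, so the local session created in `main` is an assumption that applies only when running the code elsewhere.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, col, get_json_object}

object DataFrameSketch {
  // Hypothetical sample data: a device id and a JSON payload stored as a string.
  def sampleReadings(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("a", """{"site":"north","temp":20.0}"""),
      ("b", """{"site":"north","temp":22.0}"""),
      ("c", """{"site":"south","temp":30.0}""")
    ).toDF("device", "payload")
  }

  // Extract fields from the JSON string, cast the temperature, and average per site.
  def avgTempBySite(readings: DataFrame): DataFrame =
    readings
      .select(
        get_json_object(col("payload"), "$.site").as("site"),
        get_json_object(col("payload"), "$.temp").cast("double").as("temp")
      )
      .groupBy("site")
      .agg(avg("temp").as("avg_temp"))
      .orderBy("site")

  def main(args: Array[String]): Unit = {
    // Assumption: running outside a notebook, so a local session is created here.
    val spark = SparkSession.builder()
      .appName("dataframe-sketch")
      .master("local[1]")
      .getOrCreate()
    avgTempBySite(sampleReadings(spark)).show()
    spark.stop()
  }
}
```

In a notebook, the two helper functions can be called directly with the existing `spark` session, for example `DataFrameSketch.avgTempBySite(DataFrameSketch.sampleReadings(spark))`.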
Databricks notebooks support Scala. These notebooks provide functionality similar to that of Jupyter, but with additions such as built-in visualizations using big data, Apache Spark integrations for debugging and performance monitoring, and MLflow integrations for tracking machine learning experiments. Get started by importing a notebook. Once you have access to a cluster, you can attach a notebook to the cluster and run the notebook.
Tip
To reset the state of your notebook, restart the kernel. For Jupyter users, the “restart kernel” option in Jupyter corresponds to detaching and reattaching a notebook in Databricks. To restart the kernel in a notebook, click the compute selector in the notebook toolbar and hover over the attached cluster or SQL warehouse in the list to display a side menu. Select Detach & re-attach. This detaches the notebook from your cluster and reattaches it, which restarts the process.
Databricks Git folders allow users to synchronize notebooks and other files with Git repositories. They help with code versioning and collaboration, and they can simplify importing a full repository of code into Azure Databricks, viewing past notebook versions, and integrating with IDE development. Get started by cloning a remote Git repository. You can then open or create notebooks within the repository clone, attach a notebook to a cluster, and run it.
Azure Databricks compute provides compute management for clusters of any size: from single node clusters up to large clusters. You can customize cluster hardware and libraries according to your needs. Data scientists generally begin work either by creating a cluster or using an existing shared cluster. Once you have access to a cluster, you can attach a notebook to the cluster or run a job on the cluster.
- For small workloads that need only a single node, data scientists can use single node compute to save costs.
- For detailed tips, see Compute configuration recommendations.
- Administrators can set up cluster policies to simplify and guide cluster creation.
Azure Databricks clusters use a Databricks Runtime, which provides many popular libraries out-of-the-box, including Apache Spark, Delta Lake, and more. You can also install additional third-party or custom libraries to use with notebooks and jobs.
- Start with the default libraries in the Databricks Runtime. For full lists of pre-installed libraries, see Databricks Runtime release notes versions and compatibility.
- You can also install Scala libraries in a cluster.
- For more details, see Libraries.
Azure Databricks Scala notebooks have built-in support for many types of visualizations. You can also use legacy visualizations.
This section describes features that support interoperability between Scala and SQL.
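One common form of this interoperability in Apache Spark is to register a DataFrame as a temporary view so that SQL can query it, and then to receive the SQL result back in Scala as a DataFrame. The sketch below illustrates that round trip under stated assumptions: the `SqlInteropSketch` object, the `orders` view, and the sample rows are hypothetical, and the local session in `main` is needed only outside a notebook, where `spark` is not already defined.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SqlInteropSketch {
  // Build a DataFrame in Scala, expose it to SQL, and query it with SQL.
  def totalsByCustomer(spark: SparkSession): DataFrame = {
    import spark.implicits._
    // Hypothetical sample data.
    val orders = Seq(("alice", 10.0), ("bob", 5.0), ("alice", 2.5))
      .toDF("customer", "amount")

    // Make the DataFrame visible to SQL under the name "orders".
    orders.createOrReplaceTempView("orders")

    // The SQL result comes back as an ordinary DataFrame.
    spark.sql(
      """SELECT customer, SUM(amount) AS total
        |FROM orders
        |GROUP BY customer
        |ORDER BY customer""".stripMargin
    )
  }

  def main(args: Array[String]): Unit = {
    // Assumption: running outside a notebook, so a local session is created here.
    val spark = SparkSession.builder()
      .appName("sql-interop-sketch")
      .master("local[1]")
      .getOrCreate()
    totalsByCustomer(spark).show()
    spark.stop()
  }
}
```

Because the result of `spark.sql` is a DataFrame, it can be transformed further with the Scala API or registered as another view.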
You can automate Scala workloads as scheduled or triggered jobs in Azure Databricks. Jobs can run notebooks and JARs.
- For details on creating a job via the UI, see Configure and edit Databricks Jobs.
- The Databricks SDKs allow you to create, edit, and delete jobs programmatically.
- The Databricks CLI provides a convenient command line interface for automating jobs.
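Because a job can run a JAR, the Scala code packaged in that JAR needs an entry point. The following is a minimal sketch of one, assuming that job parameters reach `main` as plain strings in a hypothetical `--key=value` form. The `NightlyReportJob` object, the `parseArgs` helper, and the parameter names `date` and `table` are illustrative and are not part of any Databricks API.

```scala
object NightlyReportJob {
  // Hypothetical helper: turn "--key=value" strings into a Map and ignore anything else.
  def parseArgs(args: Array[String]): Map[String, String] =
    args.collect {
      case arg if arg.startsWith("--") && arg.contains("=") =>
        val idx = arg.indexOf('=')
        // Key is the text between "--" and the first "="; value is everything after it.
        arg.substring(2, idx) -> arg.substring(idx + 1)
    }.toMap

  def main(args: Array[String]): Unit = {
    val params = parseArgs(args)
    val date = params.getOrElse("date", "unknown")
    val table = params.getOrElse("table", "unknown")
    // Real job logic (for example, Spark reads and writes) would go here.
    println(s"Running report for date=$date on table=$table")
  }
}
```

Keeping argument parsing in a small pure function, separate from the job logic, makes the entry point easy to unit test before the JAR is uploaded.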
In addition to developing Scala code within Azure Databricks notebooks, you can develop externally using integrated development environments (IDEs) such as IntelliJ IDEA. To synchronize work between external development environments and Azure Databricks, there are several options:
- Code: You can synchronize code using Git. See Git integration for Databricks Git folders.
- Libraries and jobs: You can create libraries externally and upload them to Azure Databricks. Those libraries may be imported within Azure Databricks notebooks, or they can be used to create jobs. See Libraries and Schedule and orchestrate workflows.
- Remote machine execution: You can run code from your local IDE for interactive development and testing. The IDE can communicate with Azure Databricks to execute large computations on Azure Databricks clusters. For example, you can use IntelliJ IDEA with Databricks Connect.
Databricks provides a set of SDKs that support automation and integration with external tooling. You can use the Databricks SDKs to manage resources such as clusters and libraries, code and other workspace objects, workloads and jobs, and more. See the Databricks SDKs.
For more information on IDEs, developer tools, and SDKs, see Developer tools.
- The Databricks Academy offers self-paced and instructor-led courses on many topics.