Introducing SQL Server Big Data Clusters

Članak
10.07.2024.

Applies to: SQL Server 2019 (15.x)

Važno

The Microsoft SQL Server 2019 Big Data Clusters add-on will be retired. Support for SQL Server 2019 Big Data Clusters will end on February 28, 2025. All existing users of SQL Server 2019 with Software Assurance will be fully supported on the platform and the software will continue to be maintained through SQL Server cumulative updates until that time. For more information, see the announcement blog post and Big data options on the Microsoft SQL Server platform.

In SQL Server 2019 (15.x), SQL Server Big Data Clusters allow you to deploy scalable clusters of SQL Server, Spark, and HDFS containers running on Kubernetes. These components are running side by side to enable you to read, write, and process big data from Transact-SQL or Spark, allowing you to easily combine and analyze your high-value relational data with high-volume big data.

Get started

First, see Get started with SQL Server Big Data Clusters deployment
For new features for latest release, see the release notes
For frequently asked questions, see Big Data Clusters FAQ

Big data clusters architecture

The following diagram shows the components of a SQL Server big data cluster:

Controller

The controller provides management and security for the cluster. It contains the control service, the configuration store, and other cluster-level services such as Kibana, Grafana, and Elastic Search.

Compute pool

The compute pool provides computational resources to the cluster. It contains nodes running SQL Server on Linux pods. The pods in the compute pool are divided into SQL Compute instances for specific processing tasks.

Data pool

The data pool is used for data persistence. The data pool consists of one or more pods running SQL Server on Linux. It's used to ingest data from SQL queries or Spark jobs.

Storage pool

The storage pool consists of storage pool pods comprised of SQL Server on Linux, Spark, and HDFS. All the storage nodes in a SQL Server big data cluster are members of an HDFS cluster.

Napojnica

For an in-depth look into big data cluster architecture and installation, see Workshop: Microsoft SQL Server Big Data Clusters Architecture.

App pool

Application deployment enables the deployment of applications on a SQL Server Big Data Clusters by providing interfaces to create, manage, and run applications.

Scenarios and features

SQL Server Big Data Clusters provide flexibility in how you interact with your big data. You can query external data sources, store big data in HDFS managed by SQL Server, or query data from multiple external data sources through the cluster. You can then use the data for AI, machine learning, and other analysis tasks.

Use SQL Server Big Data Clusters to:

Deploy scalable clusters of SQL Server, Spark, and HDFS containers running on Kubernetes.
Read, write, and process big data from Transact-SQL or Spark.
Easily combine and analyze high-value relational data with high-volume big data.
Query external data sources.
Store big data in HDFS managed by SQL Server.
Query data from multiple external data sources through the cluster.
Use the data for AI, machine learning, and other analysis tasks.
Deploy and run applications in Big Data Clusters.
Virtualize data with PolyBase. Query data from external SQL Server, Oracle, Teradata, MongoDB, and generic ODBC data sources with external tables.
Provide high availability for the SQL Server master instance and all databases by using Always On availability group technology.

The following sections provide more information about these scenarios.

Data virtualization

By leveraging PolyBase, SQL Server Big Data Clusters can query external data sources without moving or copying the data. SQL Server 2019 (15.x) introduces new connectors to data sources, for more information see What's new in PolyBase 2019?.

Diagram of Data virtualization.

Data lake

A SQL Server big data cluster includes a scalable HDFS storage pool. This can be used to store big data, potentially ingested from multiple external sources. Once the big data is stored in HDFS in the big data cluster, you can analyze and query the data and combine it with your relational data.

Diagram of Data lake.

Integrated AI and machine learning

SQL Server Big Data Clusters enable AI and machine learning tasks on the data stored in HDFS storage pools and the data pools. You can use Spark as well as built-in AI tools in SQL Server using R, Python, Scala, or Java.

Management and monitoring

Management and monitoring are provided through a combination of command-line tools, APIs, portals, and dynamic management views.

You can use Azure Data Studio to perform a variety of tasks on the big data cluster:

Built-in snippets for common management tasks.
Ability to browse HDFS, upload files, preview files, and create directories.
Ability to create, open, and run Jupyter-compatible notebooks.
Data virtualization wizard to simplify the creation of external data sources (enabled by the Data Virtualization Extension).

Kubernetes concepts

A SQL Server big data cluster is a cluster of Linux containers orchestrated by Kubernetes.

Kubernetes is an open source container orchestrator, which can scale container deployments according to need. The following table defines some important Kubernetes terminology:

Term	Description
Cluster	A Kubernetes cluster is a set of machines, known as nodes. One node controls the cluster and is designated the master node; the remaining nodes are worker nodes. The Kubernetes master is responsible for distributing work between the workers, and for monitoring the health of the cluster.
Node	A node runs containerized applications. It can be either a physical machine or a virtual machine. A Kubernetes cluster can contain a mixture of physical machine and virtual machine nodes.
Pod	A pod is the atomic deployment unit of Kubernetes. A pod is a logical group of one or more containers-and associated resources-needed to run an application. Each pod runs on a node; a node can run one or more pods. The Kubernetes master automatically assigns pods to nodes in the cluster.

In SQL Server Big Data Clusters, Kubernetes is responsible for the state of the cluster. Kubernetes builds and configures the cluster nodes, assigns pods to nodes, and monitors the health of the cluster.

Dodatni resursi

Dokumentacija

Big Data Clusters - Learn how to manage, deploy, and use - SQL Server

Learn how to use SQL Server Big Data Cluster. Tutorials, code examples, installation guides, and other documentation.
Big data options on the Microsoft SQL Server platform - SQL Server

This article discusses migration strategies for SQL Server 2019 Big Clusters
Get started - SQL Server Big Data Clusters

Learn the steps and resources for deploying SQL Server Big Data Clusters.
Deployment guidance - SQL Server Big Data Clusters

Learn how to deploy SQL Server Big Data Clusters on Kubernetes.
Deploy SQL Server Big Data Cluster with high availability - Deploy SQL Server Big Data Cluster with high availability

Learn how to deploy SQL Server Big Data Cluster with high availability.
Install big data tools - SQL Server Big Data Clusters

Learn how to install tools used with SQL Server 2019 big data cluster.
Big Data Clusters in a Nutshell

Join Ben Weissman on a tour through the possibilities of Big Data Clusters (BDC). He will give a brief overview about the general architecture and components of a BDC, followed by a demo where he will integrate data from external and internal sources using the same connection for python and T-SQL, and lastly come up with a real-world analytics scenario that will show you how easy it is to use them. You may not even need petabytes of data to leverage what they have to offer![01:30] It's not your Grandpa's SQ
Ingest data into a SQL Server data pool - SQL Server Big Data Clusters

This tutorial demonstrates how to ingest data into the data pool of a SQL Server 2019 big data cluster.

Obuka

Modul

Understand hybrid data platform on SQL Server 2022 - Training

This module helps users understand the hybrid data platform capabilities of SQL Server 2022.

Certifikacija

Microsoft Certified: Azure Database Administrator Associate - Certifications

Administer an SQL Server database infrastructure for cloud, on-premises and hybrid relational databases using the Microsoft PaaS relational database offerings.

Događaj

SKL na FabCon Vegasu

31. mar 23 - 2. apr 23

Najveći SKL, Fabric i Pover BI događaj učenja. 31. mart – 2. april. Koristite kod FABINSIDER da uštedite $400.

Registrujte se već danas

Deli putem

Introducing SQL Server Big Data Clusters

Get started