This article answers frequently asked questions about SQL Server Big Data Clusters concepts, capabilities, deployment, supportability, and tools.
There is less flexibility in this regard comparing to configuring SQL Server on bare metal machines on Windows or Linux. In the Kubernetes environment these artifacts are abstracted and they need to be portable. Currently, there are 2 persistent volumes (PVs), for data and logs, provided per pod that can be configured. For more information, see Data persistence with SQL Server big data cluster in Kubernetes.
You need to perform log backups only for user databases in SQL Server master instance (depending on recovery model or HA configuration). Data pool databases use SIMPLE recovery model only. Same applies for the DW* databases created for PolyBase.
You can use the existing PolyBase DMVs that were enhanced for Big Data Cluster scenarios. For more information, see Monitor and troubleshoot PolyBase.
Is it possible to configure and manage Big Data Cluster resources directly via kubectl to the Kubernetes API Server?
While you can modify some of the settings using Kubernetes API or kubectl, it is not supported nor recommended. You must execute all the Big Data Cluster management operations via azdata.
You can use any solutions that enable hardware level storage snapshotting or copy/sync via webHDFS. You could also use azdata bdc hdfs cp
, for more information see azdata bdc hdfs.
Is there a way to 'scale out' a stored proc? For example, having it run on compute pool for example?
Not at this time. One option is to deploy SQL Server in an Always On Availability Group. You can then use readable secondary replica(s) to run some processes (ex: ml training/scoring, maintenance activities, etc).
This is not a supported scenario at this time.
Database in the data pool instance does not have any metadata about the external tables - it is like any user database. You can do backup/restore, but to avoid inconsistent results, you must ensure the external table metadata in the metadata database in the SQL Master instance is in sync.
Data pool is a distributed table concept. Sharding is typically referenced as an OLTP concept - this is not currently supported.
The term pool is reserved to describe a collection of homogeneous services or applications. For example, data pool is a set of stateful SQL Server compute and storage and storage pool is a set of HDFS and Spark services. The SQL Server master is either a single-instance or multiple instances that can be configured in an availability group. The SQL Server master instance is a regular SQL Server instance on Linux and you can use any feature available on Linux there. You should start first with the data model, the entities and services/applications that will primarily operate on the entity. All the data doesn't have to be stored in one place like SQL Server or HDFS or data pool. Based on the data analysis, it is possible you store most of the data in HDFS, process the data to more efficient format, and expose to other services. The remaining data would be stored in SQL Master instance.
Does SQL Server Big Data Cluster support GPU-based deep learning libraries and computations (PyTorch, Keras, specific image libraries, etc.)?
This is not a supported scenario at this time.
Each pod can have only two persisted volumes (PVs). You can abstract the volume at OS level and use it for persistent storage. For example, you can create a RAID 0 OS partition using multiple disks and use that for persistent volume using a local storage provisioner. There is no way to use more PVs per pod today. PVs are mapped to directories inside the container and this is fixed. For more information on persisted volumes see, Persistent Volumes in Kubernetes Documentation.
If we configure multiple providers and multiple disks, will the HDFS config be updated with all the data volume claims?
You can configure storage pool to use a specific storage class at deployment time. See Data persistence with SQL Server big data cluster in Kubernetes.
HDFS Tiering allows us to integrate transparently with S3-based protocols. For more information, se How to mount S3 for HDFS tiering in a big data cluster.
Yes, data will be preserved since it is backed by persistent volumes and upgrade just deploys existing pods with new images.
Using HDFS tiering, data is cached withing the local HDFS running in Big Data Cluster to allow users to attach to large data lakes without having to bring all the data in. There is a configurable amount of space allocated to the cache which is defaulted to 2% today. Data is maintained in the cache but will be removed if that threshold is exceeded. Security is also maintained from the lake and all ACLs are applied. For more information, see Configure HDFS tiering on Big Data Clusters.
Can we use SQL Server 2019 to visualize Azure Data Lake Store Gen2? Will this integration take care of folder level permission?
Yes you can virtualize data stored in ADLS Gen2 using HDFS tiering. Once HDFS Tiering is mounted to ADLS Gen2, users gain ability to query the HDFS data and run Spark jobs against it. The mounted storage will appear in the HDFS for Big Data Cluster in the location specified by --mount-path, and users can work with that mount path as if working with a local storage. See more details here: Configure HDFS tiering on Big Data Cluster. For more information on HDFS tier permissions, see Manage HDFS permissions for SQL Server Big Data Clusters.
What's the default high-availability and/or redundancy setting for the master node on Azure Kubernetes Service (AKS)?
The AKS control plane supports uptime SLA guarantees 99.95% availability. The AKS cluster nodes (worker nodes) use Availability Zones, for more information see AKS Availability Zones. An Availability Zone (AZ) is a high availability offering from Azure that protects applications and data from datacenter failures. AKS supports 99.9% availability for clusters that don't use Availability Zones. For more information, please refer to SLA for Azure Kubernetes Service (AKS).
Restarting sparkhead won't cause the logs to be lost, these logs are in HDFS. You should still see Spark history logs from the /gateway/default/sparkhistory UI. For Yarn container logs, you won't see those apps in Yarn UI because Yarn RM restarts, but those yarn logs are still in HDFS and you can link to them from Spark history server. You should always use Spark history server as the entry point to diagnose their Spark apps.
By default, 1% of the total HDFS storage will be reserved for caching of mounted data. Caching is a global setting across mounts. Currently, there is not an exposed way to turn it off, however, the percentage can be configured via the hdfs-site.dfs.provided.cache.capacity.fraction setting. This setting controls the fraction of the total capacity in the cluster that can be used to cache data from Provided stores. To modify, see How to configure Big Data Cluster settings post deployment. For more information, see Configure HDFS tiering on SQL Server Big Data Clusters.
You can use the SQL Server Agent service in the SQL Server master instance of the big data cluster.
Does Big Data Cluster support native time series data scenarios, such as generated by IoT use-cases?
At this time InfluxDB in a Big Data Cluster is used only for storing monitoring data collected within the Big Data Cluster and is not exposed as an external endpoint.
At this time InfluxDB in a Big Data Cluster is used only for storing monitoring data collected within the Big Data Cluster and is not exposed as an external endpoint.
In Big Data Cluster, the HA configuration creates an availability group called containedag which also includes system databases that are replicated across replicas. Databases created as result of a CREATE DATABASE or RESTORE workflows are automatically added to the contained AG and seeded. Prior to SQL Server 2019 (15.0) CU2, you have to connect to the physical instance in Big Data Cluster, restore the database and add it to the containedag. For more information, see Deploy SQL Server Big Data Cluster with high availability.
At this time, you can set memory for the SQL instances using sp_configure, just like in SQL Server. For cores, you can use ALTER SERVER CONFIGURATION SET PROCESS AFFINITY
. By default, containers see all CPUs on the host and we don't have a way to specify resource limits using Kubernetes at this time. For compute pool/data pool/storage pool, the configuration can be done using EXECUTE AT DATA_SOURCE statement from SQL Server master instance.
Pods that are not affinitized to the respective worker node will be moved to another node in the Kubernetes cluster provided there are sufficient resources. Otherwise, the pod(s) will be unavailable causing outages.
This action depends only on Kubernetes. Apart from pod placement using node labels, there is no other mechanism to control re-balancing Kubernetes resources from within Big Data Cluster.
What is the consequence on Big Data Cluster resources when I remove a node from the Kubernetes cluster?
This action is equivalent to the host node being shutdown. There are mechanisms to orchestrate this in Kubernetes using a tainting process and this is typically followed for upgrade or node maintenance. For more information, see Kubernetes documentation for Taints and Tolerations.
Yes, replication factor is one of the available configurations for HDFS. For more information see Configure Persistent Volumes.
It depends on your use cases and requirements. Big Data Cluster provides a full SQL Server surface area in addition to Microsoft-supported Spark and HDFS, on-premises. Big Data Cluster enables the SQL Server customer to be able to integrate into analytics/big data. Azure Synapse is purely an analytical platform offering a first class experience for customers as a managed service in the cloud, with a focus on scale out analytics. Azure Synapse is not targeting an operational workload as part of that. Big Data Cluster is aiming to provide in database analytical scenarios, much closer to the operational store.
The SQL Server instance's database files are not stored in HDFS, however, SQL Server can query HDFS using external table interface.
What are the available distribution options for storing data in the distributed tables in each data pool?
ROUND_ROBIN and REPLICATED. ROUND_ROBIN is the default. HASH is not available.
Does Big Data Cluster have the Spark Thrift Server included? If so, is ODBC endpoint exposed to connect to Hive Metastore tables?
We currently expose the Hive Metastore (HMS) via the Thrift protocol. We document the protocol but haven't opened up an ODBC endpoint at this time. You can access it via the Hive Metastore HTTP protocol, for more information see Hive Metastore HTTP Protocol.
SQL Server on Linux (applies to the SQL Server Master instance in Big Data Cluster too) does not support the generic ODBC data source which allows you to install a 3rd party ODBC driver (SnowFlake, DB2, PostgreSQL etc) and query those. This feature is currently available only in SQL Server 2019 (15.0) on Windows. In Big Data Cluster, you can read the data via Spark using JDBC and ingest into SQL Server using the MSSQL Spark Connector.
SQL Server on Linux (applies to SQL Server Master instance in Big Data Cluster too) does not support the generic ODBC data source which allows you to install a 3rd party ODBC driver (SnowFlake, DB2, PostgreSQL etc) and query those.
How can you import data to the same table using PolyBase CTAS instead of creating NEW table every time you run the CTAS?
You can use INSERT..SELECT
approach to avoid the need a new table every time.
What would be the advantage/considerations to load data into Data pool instead of directly into the Master Instance as local tables?
If your SQL Server Master instance has enough resources to satisfy your analytic workload then it is always the fastest option. Data pool helps if you want to offload execution to other SQL instances for your distributed queries. You can also use data pool to ingest data from Spark executors in parallel to different SQL instances – so load performance for large datasets that is being generated from the Hadoop Distributed File System (HDFS) will typically be better than going into a single SQL Server instance. However, this is also hard to say since you could still have multiple tables in a SQL Server and insert into parallel if you want. Performance depends on many factors and there is no single guidance or recommendation in that regard.
You can use EXECUTE AT to query DMVs like sys.dm_db_partition_stats to get the data in each local table.
No, you can use azdata bdc hdfs cp. If you provide the root directory the command will recursively copy the whole tree. You can copy in/out using this command just by changing what is the source/target paths.
You can use MSSQL Spark connector library to help with SQL and data pool ingestion. For a guided walk-through, see Tutorial: Ingest data into a SQL Server data pool with Spark jobs.
If I have a lot of data on a (Windows) network path, which contains lots of folders/sub-folders and text files, how do I upload them to HDFS on Big data cluster?
Give azdata bdc hdfs cp a try. If you provide the root directory the command will recursively copy the whole tree. You can copy in/out using this command just by changing what is the source/target paths.
There is no azdata
interface to perform this operation at this time. You have the option to resize desired PVCs manually. Resizing is a complex operation, see Persistent Volumes in Kubernetes Documentation.
See main differences and use cases here: PolyBase FAQ.
Big Data Cluster supports data virtualization from ODBC sources – SQL Server, Oracle, MongoDB, Teradata, etc. It also supports tiering of remote stores such as Azure Data Lake Store Gen2 and S3-compatible storage, as well as AWS S3A and the Azure Blob File System (ABFS).
Yes, you can use PolyBase in Big Data Cluster to access data in Azure SQL Database.
Why do the CREATE TABLE statements include the key word EXTERNAL? What does EXTERNAL do differently than the standard CREATE TABLE?
In general, the external keyword implies that the data is not in the SQL Server instance. For example, you can define a storage pool table on top of an HDFS directory. The data are stored in HDFS files, not in your database files, but external table provided you the interface to query HDFS files as a relational table as if it is in the database.
This concept of accessing external data is called data virtualization, for more information see Introducing data virtualization with PolyBase. For a tutorial on virtualizing data from CSV files in HDFS, see [Virtualize CSV data from storage pool Big Data Clusters.
What are the differences between data virtualization using SQL Server running within SQL Server Big Data Clusters vs SQL Server?
For an comparison, see PolyBase in Big Data Clusters vs. PolyBase in stand-alone instances.
You can determine type of external table by looking at the data source location prefix, for example, sqlserver://, oracle://, sqlhdfs://, sqldatapool://.
See Manage SQL Server Big Data Clusters with Azure Data Studio notebooks. Also see the troubleshooting topics in Troubleshoot Kubernetes.
All the customizations that can be done at deployment time are documented here in Configure deployment settings for cluster resources and services. For Spark, see Configure Apache Spark and Apache Hadoop in Big Data Clusters.
No. Specifically, SQL Server Analysis Services (SSAS) is not supported on SQL Server on Linux, so you will have to install a SQL Server instance on Windows server to run SSAS.
Big Data Cluster can run on any Kubernetes stack based on version 1.13 and higher. However, we have not performed specific validations of Big Data Cluster on EKS or GKS.
Spark is 2.4 and HDFS is 3.2.1. For complete details on the open-source software included in Big Data Cluster, see Open-source software reference.
You can add packages at job submission using the steps in the sample notebook for installing packages in Spark.
Machine Learning (ML) Services (R and Python) is available beginning in SQL Server 2017. ML Services is available in SQL Server Big Data Clusters as well. For more information, see What is SQL Server Machine Learning Services with Python and R?.
Please refer to the licensing guide which goes into much more detail, download the PDF.
For a summary, watch the video SQL Server Licensing: Big Data Clusters | Data Exposed.
Does Big Data Cluster support Microsoft Entra ID ([formerly Azure Active Directory](/entra/fundamentals/new-name))?
Not at this time.
Yes, you can connect to various Big Data Cluster services using integrated authentication (with Active Directory). For more information, see Deploy SQL Server Big Data Cluster in Active Directory mode. Also see Security concepts for Big Data Clusters.
In basic authentication mode (username/password), there is no support for adding multiple users for controller or Knox gateway/HDFS endpoints. The only user supported for these endpoints is root. For SQL Server, you can add users using Transact-SQL as you would for any other SQL Server instance. If you deploy Big Data Cluster with AD auth for its endpoints, multiple users are supported. See here for details on how to configure the AD groups at deployment time. For more information, see Deploy SQL Server Big Data Cluster in Active Directory mode.
For Big Data Cluster to pull the latest container images, is there an outbound IP range I can restrict?
You can review the IP addresses used by the various services in Azure IP Ranges and Service Tags – Public Cloud. Note that these IP addresses rotate periodically.
In order for the controller service to pull the container images from the Microsoft Container Registry (MCR) you'll need to grant access to the IP addresses specified in the MicrosoftContainerRegistry section. Another option is to set up a private Azure Container Registry and configure the Big Data Cluster to pull from there. In that case you'll need to expose the IP addresses specified in the AzureContainerRegistry section. Instructions on how to do this and a script are provided in Perform an offline deployment of a SQL Server big data cluster.
Yes, for more details see Perform an offline deployment of a SQL Server big data cluster.
This depends on the dynamic storage provisioner configurations in Azure Kubernetes Service (AKS). See here for more details: Best practices for storage and backups in Azure Kubernetes Service (AKS).
Yes. For more information, see Key versions in Big Data Cluster.
Yes, you can easily rotate the passwords of Autogenerated Active Directory objects with a new feature introduced in SQL Server Big Data Clusters CU13. For more information, see AD password rotation.
Yes, Microsoft supports all components shipped within Big Data Clusters.
SQL Server ML Services support policy is same as that of SQL Server, except that every major release comes with a new runtime version. SparkML library itself is open source software (OSS). We do package many OSS components in Big Data Cluster and this is supported by Microsoft.
Not at this time. See here for the tested configurations.
Yes, it's the same Jupyter kernel just surfaced in Azure Data Studio.
No, azdata
is not open sourced at this time.
Here are some resources you can use:
Workshop: Kubernetes - From Bare Metal to SQL Server Big Data Clusters
Workshop: SQL Server Big Data Clusters - Architecture
Tutorial: Prepare an application for Azure Kubernetes Service (AKS)
(PDF)Case Study: SQL Workloads running on Apache Spark in MS SQL Server 2019 Big Data Cluster