Apache Spark & Apache Hadoop (HDFS) configuration properties

Applies to: SQL Server 2019 (15.x)

Important

The Microsoft SQL Server 2019 Big Data Clusters add-on will be retired. Support for SQL Server 2019 Big Data Clusters will end on February 28, 2025. All existing users of SQL Server 2019 with Software Assurance will be fully supported on the platform and the software will continue to be maintained through SQL Server cumulative updates until that time. For more information, see the announcement blog post and Big data options on the Microsoft SQL Server platform.

Big Data Clusters supports deployment-time and post-deployment configuration of Apache Spark and Hadoop components at the service and resource scopes. Big Data Clusters uses the same default configuration values as the respective open source project for most settings. The settings that we do change are listed below, along with a description and their default value. Other than the gateway resource, the settings that are configurable at the service scope and at the resource scope are the same.

You can find all possible configurations, and the defaults for each, at the associated Apache documentation sites for Spark and Hadoop.

The settings we do not support configuring are also listed below.
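
For illustration, here is a minimal, abridged sketch of how scoped settings appear in a deployment profile (bdc.json). Service-scope settings sit under spec.services.&lt;service&gt;.settings, and resource-scope settings under spec.resources.&lt;resource&gt;.spec.settings; the required fields of a real profile are omitted here, and the exact layout is described in the configuration article linked in the note below.

```json
{
  "spec": {
    "services": {
      "spark": {
        "settings": {
          "spark-defaults-conf.spark.driver.memory": "2g"
        }
      }
    },
    "resources": {
      "storage-0": {
        "spec": {
          "settings": {
            "spark": {
              "spark-defaults-conf.spark.executor.memory": "2g"
            }
          }
        }
      }
    }
  }
}
```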

Note

To include Spark in the Storage pool, set the boolean value includeSpark in the bdc.json configuration file at spec.resources.storage-0.spec.settings.spark. See Configure Apache Spark and Apache Hadoop in Big Data Clusters for instructions.
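
As a sketch, the note above corresponds to a fragment like the following in bdc.json, abridged to the relevant path only:

```json
{
  "spec": {
    "resources": {
      "storage-0": {
        "spec": {
          "settings": {
            "spark": {
              "includeSpark": true
            }
          }
        }
      }
    }
  }
}
```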

Big Data Clusters-specific default Spark settings

The Spark settings below are those that have BDC-specific defaults but are user configurable. System-managed settings are not included.

| Setting Name | Description | Type | Default Value |
|---|---|---|---|
| capacity-scheduler.yarn.scheduler.capacity.maximum-applications | Maximum number of applications in the system that can be concurrently active, both running and pending. | int | 10000 |
| capacity-scheduler.yarn.scheduler.capacity.resource-calculator | The ResourceCalculator implementation used to compare resources in the scheduler. | string | org.apache.hadoop.yarn.util.resource.DominantResourceCalculator |
| capacity-scheduler.yarn.scheduler.capacity.root.queues | Child queues of the capacity scheduler's predefined root queue. | string | default |
| capacity-scheduler.yarn.scheduler.capacity.root.default.capacity | Queue capacity, as a percentage (%) or as an absolute resource minimum capacity for the root queue. | int | 100 |
| spark-defaults-conf.spark.driver.cores | Number of cores to use for the driver process, only in cluster mode. | int | 1 |
| spark-defaults-conf.spark.driver.memoryOverhead | The amount of off-heap memory, in MB, to be allocated per driver in cluster mode. | int | 384 |
| spark-defaults-conf.spark.executor.instances | The number of executors for static allocation. | int | 1 |
| spark-defaults-conf.spark.executor.cores | The number of cores to use on each executor. | int | 1 |
| spark-defaults-conf.spark.driver.memory | Amount of memory to use for the driver process. | string | 1g |
| spark-defaults-conf.spark.executor.memory | Amount of memory to use per executor process. | string | 1g |
| spark-defaults-conf.spark.executor.memoryOverhead | The amount of off-heap memory, in MB, to be allocated per executor. | int | 384 |
| yarn-site.yarn.nodemanager.resource.memory-mb | Amount of physical memory, in MB, that can be allocated for containers. | int | 8192 |
| yarn-site.yarn.scheduler.maximum-allocation-mb | The maximum allocation, in MB, for every container request at the resource manager. | int | 8192 |
| yarn-site.yarn.nodemanager.resource.cpu-vcores | Number of CPU cores that can be allocated for containers. | int | 32 |
| yarn-site.yarn.scheduler.maximum-allocation-vcores | The maximum allocation for every container request at the resource manager, in terms of virtual CPU cores. | int | 8 |
| yarn-site.yarn.nodemanager.linux-container-executor.secure-mode.pool-user-count | The number of pool users for the Linux container executor in secure mode. | int | 6 |
| yarn-site.yarn.scheduler.capacity.maximum-am-resource-percent | Maximum percentage of resources in the cluster that can be used to run application masters. | float | 0.1 |
| yarn-site.yarn.nodemanager.container-executor.class | The operating-system-specific container executor implementation to use. | string | org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor |
| capacity-scheduler.yarn.scheduler.capacity.root.default.user-limit-factor | The multiple of the queue capacity that can be configured to allow a single user to acquire more resources. | int | 1 |
| capacity-scheduler.yarn.scheduler.capacity.root.default.maximum-capacity | Maximum queue capacity in percentage (%) as a float, or as an absolute resource maximum capacity. Setting this value to -1 sets the maximum capacity to 100%. | int | 100 |
| capacity-scheduler.yarn.scheduler.capacity.root.default.state | State of the queue; one of RUNNING or STOPPED. | string | RUNNING |
| capacity-scheduler.yarn.scheduler.capacity.root.default.maximum-application-lifetime | Maximum lifetime, in seconds, of an application submitted to the queue. Any value less than or equal to zero is treated as disabled. | int | -1 |
| capacity-scheduler.yarn.scheduler.capacity.root.default.default-application-lifetime | Default lifetime, in seconds, of an application submitted to the queue. Any value less than or equal to zero is treated as disabled. | int | -1 |
| capacity-scheduler.yarn.scheduler.capacity.node-locality-delay | Number of missed scheduling opportunities after which the CapacityScheduler attempts to schedule rack-local containers. | int | 40 |
| capacity-scheduler.yarn.scheduler.capacity.rack-locality-additional-delay | Number of additional missed scheduling opportunities, over the node-locality-delay ones, after which the CapacityScheduler attempts to schedule off-switch containers. | int | -1 |
| hadoop-env.HADOOP_HEAPSIZE_MAX | Default maximum heap size of all Hadoop JVM processes. | int | 2048 |
| yarn-env.YARN_RESOURCEMANAGER_HEAPSIZE | Heap size of the YARN ResourceManager. | int | 2048 |
| yarn-env.YARN_NODEMANAGER_HEAPSIZE | Heap size of the YARN NodeManager. | int | 2048 |
| mapred-env.HADOOP_JOB_HISTORYSERVER_HEAPSIZE | Heap size of the Hadoop Job HistoryServer. | int | 2048 |
| hive-env.HADOOP_HEAPSIZE | Hadoop heap size for Hive. | int | 2048 |
| livy-conf.livy.server.session.timeout-check | Whether to check for Livy server session timeout. | bool | true |
| livy-conf.livy.server.session.timeout-check.skip-busy | Whether to skip busy sessions when checking for Livy server session timeout. | bool | true |
| livy-conf.livy.server.session.timeout | Timeout for a Livy server session, in (ms/s/m \| min/h/d/y). | string | 2h |
| livy-conf.livy.server.yarn.poll-interval | YARN polling interval for the Livy server, in (ms/s/m \| min/h/d/y). | string | 500ms |
| livy-conf.livy.rsc.jars | Livy RSC jars. | string | local:/opt/livy/rsc-jars/livy-api.jar,local:/opt/livy/rsc-jars/livy-rsc.jar,local:/opt/livy/rsc-jars/netty-all.jar |
| livy-conf.livy.repl.jars | Livy REPL jars. | string | local:/opt/livy/repl_2.11-jars/livy-core.jar,local:/opt/livy/repl_2.11-jars/livy-repl.jar,local:/opt/livy/repl_2.11-jars/commons-codec.jar |
| livy-conf.livy.rsc.sparkr.package | Livy RSC SparkR package. | string | hdfs:///system/livy/sparkr.zip |
| livy-env.LIVY_SERVER_JAVA_OPTS | Livy server Java options. | string | -Xmx2g |
| spark-defaults-conf.spark.r.backendConnectionTimeout | Connection timeout, in seconds, set by the R process on its connection to RBackend. | int | 86400 |
| spark-defaults-conf.spark.pyspark.python | Python binary executable used by PySpark. | string | /opt/bin/python3 |
| spark-defaults-conf.spark.yarn.jars | YARN jars. | string | local:/opt/spark/jars/* |
| spark-history-server-conf.spark.history.fs.cleaner.maxAge | Maximum age of job history files before they are deleted by the filesystem history cleaner, in (ms/s/m \| min/h/d/y). | string | 7d |
| spark-history-server-conf.spark.history.fs.cleaner.interval | Interval of the Spark history cleaner, in (ms/s/m \| min/h/d/y). | string | 12h |
| hadoop-env.HADOOP_CLASSPATH | Sets the additional Hadoop classpath. | string | |
| spark-env.SPARK_DAEMON_MEMORY | Spark daemon memory. | string | 2g |
| yarn-site.yarn.log-aggregation.retain-seconds | When log aggregation is enabled, this property determines the number of seconds to retain logs. | int | 604800 |
| yarn-site.yarn.nodemanager.log-aggregation.compression-type | Compression type for log aggregation by the YARN NodeManager. | string | gz |
| yarn-site.yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds | Interval, in seconds, for roll monitoring in NodeManager log aggregation. | int | 3600 |
| yarn-site.yarn.scheduler.minimum-allocation-mb | The minimum allocation, in MB, for every container request at the resource manager. | int | 512 |
| yarn-site.yarn.scheduler.minimum-allocation-vcores | The minimum allocation for every container request at the resource manager, in terms of virtual CPU cores. | int | 1 |
| yarn-site.yarn.nm.liveness-monitor.expiry-interval-ms | How long, in milliseconds, to wait until a node manager is considered dead. | int | 180000 |
| yarn-site.yarn.resourcemanager.zk-timeout-ms | ZooKeeper session timeout, in milliseconds. | int | 40000 |
| capacity-scheduler.yarn.scheduler.capacity.root.default.acl_application_max_priority | The ACL of who can submit applications with a configured priority. For example: [user={name} group={name} max_priority={priority} default_priority={priority}]. | string | * |
| includeSpark | Boolean that configures whether or not Spark jobs can run in the Storage pool. | bool | true |
| enableSparkOnK8s | Boolean that configures whether or not to enable Spark on Kubernetes, which adds containers for Kubernetes in the Spark head. | bool | false |
| sparkVersion | The version of Spark. | string | 2.4 |
| spark-env.PYSPARK_ARCHIVES_PATH | Path to the pyspark archive jars used in Spark jobs. | string | local:/opt/spark/python/lib/pyspark.zip,local:/opt/spark/python/lib/py4j-0.10.7-src.zip |
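
A note on how the memory defaults above interact: each Spark executor container requests spark-defaults-conf.spark.executor.memory plus spark-defaults-conf.spark.executor.memoryOverhead, which is 1g + 384 MB = 1408 MB with the defaults, and YARN rejects any container request larger than yarn-site.yarn.scheduler.maximum-allocation-mb (8192 MB). So raising executor memory to, say, 8g (8192 + 384 = 8576 MB) also requires raising the YARN maximums. Here is a hedged sketch of such an override at the Spark service scope in bdc.json (abridged; the values are illustrative, not recommendations):

```json
{
  "spec": {
    "services": {
      "spark": {
        "settings": {
          "spark-defaults-conf.spark.executor.memory": "8g",
          "yarn-site.yarn.scheduler.maximum-allocation-mb": "12288",
          "yarn-site.yarn.nodemanager.resource.memory-mb": "12288"
        }
      }
    }
  }
}
```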

Configurations that are not supported are listed in sections later in this article.

Big Data Clusters-specific default HDFS settings

The HDFS settings below are those that have BDC-specific defaults but are user configurable. System-managed settings are not included.

| Setting Name | Description | Type | Default Value |
|---|---|---|---|
| hdfs-site.dfs.replication | Default block replication. | int | 2 |
| hdfs-site.dfs.namenode.provided.enabled | Enables the namenode to handle provided storages. | bool | true |
| hdfs-site.dfs.namenode.mount.acls.enabled | Set to true to inherit ACLs (access control lists) from remote stores during mount. | bool | false |
| hdfs-site.dfs.datanode.provided.enabled | Enables the datanode to handle provided storages. | bool | true |
| hdfs-site.dfs.datanode.provided.volume.lazy.load | Enable lazy load in the datanode for provided storages. | bool | true |
| hdfs-site.dfs.provided.aliasmap.inmemory.enabled | Enable the in-memory alias map for provided storages. | bool | true |
| hdfs-site.dfs.provided.aliasmap.class | The class used to specify the input format of the blocks on provided storages. | string | org.apache.hadoop.hdfs.server.common.blockaliasmap.impl.InMemoryLevelDBAliasMapClient |
| hdfs-site.dfs.namenode.provided.aliasmap.class | The class used to specify the input format of the blocks on provided storages for the namenode. | string | org.apache.hadoop.hdfs.server.common.blockaliasmap.impl.NamenodeInMemoryAliasMapClient |
| hdfs-site.dfs.provided.aliasmap.load.retries | Number of retries on the datanode to load the provided alias map. | int | 0 |
| hdfs-site.dfs.provided.aliasmap.inmemory.batch-size | The batch size when iterating over the database backing the alias map. | int | 500 |
| hdfs-site.dfs.datanode.provided.volume.readthrough | Enable read-through for provided storages in the datanode. | bool | true |
| hdfs-site.dfs.provided.cache.capacity.mount | Enable cache capacity mount for provided storages. | bool | true |
| hdfs-site.dfs.provided.overreplication.factor | Overreplication factor for provided storages: the number of cache blocks on BDC created per remote HDFS block. | float | 1 |
| hdfs-site.dfs.provided.cache.capacity.fraction | Cache capacity fraction for provided storage: the fraction of the total capacity in the cluster that can be used to cache data from provided stores. | float | 0.01 |
| hdfs-site.dfs.provided.cache.capacity.bytes | Cluster capacity to use as cache space for provided blocks, in bytes. | int | -1 |
| hdfs-site.dfs.ls.limit | Limit the number of files printed by ls. | int | 500 |
| hdfs-env.HDFS_NAMENODE_OPTS | HDFS namenode options. | string | -Dhadoop.security.logger=INFO,RFAS -Xmx2g |
| hdfs-env.HDFS_DATANODE_OPTS | HDFS datanode options. | string | -Dhadoop.security.logger=ERROR,RFAS -Xmx2g |
| hdfs-env.HDFS_ZKFC_OPTS | HDFS ZKFC options. | string | -Xmx1g |
| hdfs-env.HDFS_JOURNALNODE_OPTS | HDFS JournalNode options. | string | -Xmx2g |
| hdfs-env.HDFS_AUDIT_LOGGER | HDFS audit logger options. | string | INFO,RFAAUDIT |
| core-site.hadoop.security.group.mapping.ldap.search.group.hierarchy.levels | Number of hierarchy levels for the Hadoop LDAP group search. | int | 10 |
| core-site.fs.permissions.umask-mode | Permission umask mode. | string | 077 |
| core-site.hadoop.security.kms.client.failover.max.retries | Maximum retries for client failover. | int | 20 |
| zoo-cfg.tickTime | Tick time for the ZooKeeper config. | int | 2000 |
| zoo-cfg.initLimit | Init limit for the ZooKeeper config. | int | 10 |
| zoo-cfg.syncLimit | Sync limit for the ZooKeeper config. | int | 5 |
| zoo-cfg.maxClientCnxns | Maximum client connections for the ZooKeeper config. | int | 60 |
| zoo-cfg.minSessionTimeout | Minimum session timeout for the ZooKeeper config. | int | 4000 |
| zoo-cfg.maxSessionTimeout | Maximum session timeout for the ZooKeeper config. | int | 40000 |
| zoo-cfg.autopurge.snapRetainCount | Snapshot retain count for ZooKeeper autopurge. | int | 3 |
| zoo-cfg.autopurge.purgeInterval | Purge interval for ZooKeeper autopurge. | int | 0 |
| zookeeper-java-env.JVMFLAGS | JVM flags for the ZooKeeper Java environment. | string | -Xmx1G -Xms1G |
| zookeeper-log4j-properties.zookeeper.console.threshold | Threshold for the log4j console in ZooKeeper. | string | INFO |
| zoo-cfg.zookeeper.request.timeout | Controls the ZooKeeper request timeout, in milliseconds. | int | 40000 |
| kms-site.hadoop.security.kms.encrypted.key.cache.size | Cache size for encrypted keys in Hadoop KMS. | int | 500 |
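
As with Spark, an HDFS default from this table can be overridden at the service scope in bdc.json. A minimal, abridged sketch (the value 3 is illustrative):

```json
{
  "spec": {
    "services": {
      "hdfs": {
        "settings": {
          "hdfs-site.dfs.replication": "3"
        }
      }
    }
  }
}
```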

Big Data Clusters-specific default Gateway settings

The Gateway settings below are those that have BDC-specific defaults but are user configurable. System-managed settings are not included. Gateway settings can only be configured at the resource scope.

| Setting Name | Description | Type | Default Value |
|---|---|---|---|
| gateway-site.gateway.httpclient.socketTimeout | Socket timeout for the HTTP client in the gateway, in (ms/s/m). | string | 90s |
| gateway-site.sun.security.krb5.debug | Debug flag for Kerberos security. | bool | true |
| knox-env.KNOX_GATEWAY_MEM_OPTS | Knox gateway memory options. | string | -Xmx2g |
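
Because gateway settings are resource-scoped only, an override goes under the gateway resource rather than under a service. A sketch, assuming the gateway category follows the same resource-scope pattern shown earlier (abridged; the 120s value is illustrative):

```json
{
  "spec": {
    "resources": {
      "gateway": {
        "spec": {
          "settings": {
            "gateway": {
              "gateway-site.gateway.httpclient.socketTimeout": "120s"
            }
          }
        }
      }
    }
  }
}
```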

Unsupported Spark configurations

The following Spark configurations are unsupported and cannot be changed in the context of the Big Data Cluster.

yarn-site (yarn-site.xml):

- yarn.log-aggregation-enable
- yarn.log.server.url
- yarn.nodemanager.pmem-check-enabled
- yarn.nodemanager.vmem-check-enabled
- yarn.nodemanager.aux-services
- yarn.resourcemanager.address
- yarn.nodemanager.address
- yarn.client.failover-no-ha-proxy-provider
- yarn.client.failover-proxy-provider
- yarn.http.policy
- yarn.nodemanager.linux-container-executor.secure-mode.use-pool-user
- yarn.nodemanager.linux-container-executor.secure-mode.pool-user-prefix
- yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user
- yarn.acl.enable
- yarn.admin.acl
- yarn.resourcemanager.hostname
- yarn.resourcemanager.principal
- yarn.resourcemanager.keytab
- yarn.resourcemanager.webapp.spnego-keytab-file
- yarn.resourcemanager.webapp.spnego-principal
- yarn.nodemanager.principal
- yarn.nodemanager.keytab
- yarn.nodemanager.webapp.spnego-keytab-file
- yarn.nodemanager.webapp.spnego-principal
- yarn.resourcemanager.ha.enabled
- yarn.resourcemanager.cluster-id
- yarn.resourcemanager.zk-address
- yarn.resourcemanager.ha.rm-ids
- yarn.resourcemanager.hostname.*

capacity-scheduler (capacity-scheduler.xml):

- yarn.scheduler.capacity.root.acl_submit_applications
- yarn.scheduler.capacity.root.acl_administer_queue
- yarn.scheduler.capacity.root.default.acl_application_max_priority

yarn-env (yarn-env.sh): none listed.

spark-defaults-conf (spark-defaults.conf):

- spark.yarn.archive
- spark.yarn.historyServer.address
- spark.eventLog.enabled
- spark.eventLog.dir
- spark.sql.warehouse.dir
- spark.sql.hive.metastore.version
- spark.sql.hive.metastore.jars
- spark.extraListeners
- spark.metrics.conf
- spark.ssl.enabled
- spark.authenticate
- spark.network.crypto.enabled
- spark.ssl.keyStore
- spark.ssl.keyStorePassword
- spark.ui.enabled

spark-env (spark-env.sh):

- SPARK_NO_DAEMONIZE
- SPARK_DIST_CLASSPATH

spark-history-server-conf (spark-history-server.conf):

- spark.history.fs.logDirectory
- spark.ui.proxyBase
- spark.history.fs.cleaner.enabled
- spark.ssl.enabled
- spark.authenticate
- spark.network.crypto.enabled
- spark.ssl.keyStore
- spark.ssl.keyStorePassword
- spark.history.kerberos.enabled
- spark.history.kerberos.principal
- spark.history.kerberos.keytab
- spark.ui.filters
- spark.acls.enable
- spark.history.ui.acls.enable
- spark.history.ui.admin.acls
- spark.history.ui.admin.acls.groups

livy-conf (livy.conf):

- livy.keystore
- livy.keystore.password
- livy.spark.master
- livy.spark.deploy-mode
- livy.rsc.jars
- livy.repl.jars
- livy.rsc.pyspark.archives
- livy.rsc.sparkr.package
- livy.repl.enable-hive-context
- livy.superusers
- livy.server.auth.type
- livy.server.launch.kerberos.keytab
- livy.server.launch.kerberos.principal
- livy.server.auth.kerberos.principal
- livy.server.auth.kerberos.keytab
- livy.impersonation.enabled
- livy.server.access-control.enabled
- livy.server.access-control.*

livy-env (livy-env.sh): none listed.

hive-site (hive-site.xml):

- javax.jdo.option.ConnectionURL
- javax.jdo.option.ConnectionDriverName
- javax.jdo.option.ConnectionUserName
- javax.jdo.option.ConnectionPassword
- hive.metastore.uris
- hive.metastore.pre.event.listeners
- hive.security.authorization.enabled
- hive.security.metastore.authenticator.manager
- hive.security.metastore.authorization.manager
- hive.metastore.use.SSL
- hive.metastore.keystore.path
- hive.metastore.keystore.password
- hive.metastore.truststore.path
- hive.metastore.truststore.password
- hive.metastore.kerberos.keytab.file
- hive.metastore.kerberos.principal
- hive.metastore.sasl.enabled
- hive.metastore.execute.setugi
- hive.cluster.delegation.token.store.class

hive-env (hive-env.sh): none listed.

Unsupported HDFS configurations

The following HDFS configurations are unsupported and cannot be changed in the context of the Big Data Cluster.

core-site (core-site.xml):

- fs.defaultFS
- ha.zookeeper.quorum
- hadoop.tmp.dir
- hadoop.rpc.protection
- hadoop.security.auth_to_local
- hadoop.security.authentication
- hadoop.security.authorization
- hadoop.http.authentication.simple.anonymous.allowed
- hadoop.http.authentication.type
- hadoop.http.authentication.kerberos.principal
- hadoop.http.authentication.kerberos.keytab
- hadoop.http.filter.initializers
- hadoop.security.group.mapping.*
- hadoop.security.key.provider.path

mapred-env (mapred-env.sh): none listed.

hdfs-site (hdfs-site.xml):

- dfs.namenode.name.dir
- dfs.datanode.data.dir
- dfs.namenode.acls.enabled
- dfs.namenode.datanode.registration.ip-hostname-check
- dfs.client.retry.policy.enabled
- dfs.permissions.enabled
- dfs.nameservices
- dfs.ha.namenodes.nmnode-0
- dfs.namenode.rpc-address.nmnode-0.*
- dfs.namenode.shared.edits.dir
- dfs.ha.automatic-failover.enabled
- dfs.ha.fencing.methods
- dfs.journalnode.edits.dir
- dfs.client.failover.proxy.provider.nmnode-0
- dfs.namenode.http-address
- dfs.namenode.https-address
- dfs.http.policy
- dfs.encrypt.data.transfer
- dfs.block.access.token.enable
- dfs.data.transfer.protection
- dfs.encrypt.data.transfer.cipher.suites
- dfs.https.port
- dfs.namenode.keytab.file
- dfs.namenode.kerberos.principal
- dfs.namenode.kerberos.internal.spnego.principal
- dfs.datanode.data.dir.perm
- dfs.datanode.address
- dfs.datanode.http.address
- dfs.datanode.ipc.address
- dfs.datanode.https.address
- dfs.datanode.keytab.file
- dfs.datanode.kerberos.principal
- dfs.journalnode.keytab.file
- dfs.journalnode.kerberos.principal
- dfs.journalnode.kerberos.internal.spnego.principal
- dfs.web.authentication.kerberos.keytab
- dfs.web.authentication.kerberos.principal
- dfs.webhdfs.enabled
- dfs.permissions.superusergroup

hdfs-env (hdfs-env.sh):

- HADOOP_HEAPSIZE_MAX

zoo-cfg (zoo.cfg):

- secureClientPort
- clientPort
- dataDir
- dataLogDir
- 4lw.commands.whitelist

zookeeper-java-env (java.env):

- ZK_LOG_DIR
- SERVER_JVMFLAGS

zookeeper-log4j-properties (log4j.properties for ZooKeeper):

- log4j.rootLogger
- log4j.appender.CONSOLE.*

Note

This article contains the term whitelist, a term Microsoft considers insensitive in this context. The term appears in this article because it currently appears in the software. When the term is removed from the software, we will remove it from the article.

Unsupported gateway configurations

The following gateway configurations are unsupported and cannot be changed in the context of the Big Data Cluster.

gateway-site (gateway-site.xml):

- gateway.port
- gateway.path
- gateway.gateway.conf.dir
- gateway.hadoop.kerberos.secured
- java.security.krb5.conf
- java.security.auth.login.config
- gateway.websocket.feature.enabled
- gateway.scope.cookies.feature.enabled
- ssl.exclude.protocols
- ssl.include.ciphers

Next steps

Configure SQL Server Big Data Clusters