Apache Spark & Apache Hadoop (HDFS) configuration properties
Applies to: SQL Server 2019 (15.x)
Important
The Microsoft SQL Server 2019 Big Data Clusters add-on will be retired. Support for SQL Server 2019 Big Data Clusters will end on February 28, 2025. All existing users of SQL Server 2019 with Software Assurance will be fully supported on the platform and the software will continue to be maintained through SQL Server cumulative updates until that time. For more information, see the announcement blog post and Big data options on the Microsoft SQL Server platform.
Big Data Clusters supports deployment time and post-deployment time configuration of Apache Spark and Hadoop components at the service and resource scopes. Big Data Clusters uses the same default configuration values as the respective open source project for most settings. The settings that we do change are listed below along with a description and their default value. Other than the gateway resource, there is no difference between settings that are configurable at the service scope and the resource scope.
You can find all possible configurations and the defaults for each at the associated Apache documentation site:
- Apache Spark: https://spark.apache.org/docs/latest/configuration.html
- Apache Hadoop:
- Hive: https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-MetaStore
- Livy: https://github.com/cloudera/livy/blob/master/conf/livy.conf.template
- Apache Knox Gateway: https://knox.apache.org/books/knox-0-14-0/user-guide.html#Gateway+Details
The settings we do not support configuring are also listed below.
Note
To include Spark in the Storage pool, set the boolean value includeSpark
in the bdc.json
configuration file at spec.resources.storage-0.spec.settings.spark
. See Configure Apache Spark and Apache Hadoop in Big Data Clusters for instructions.
Big Data Clusters-specific default Spark settings
The Spark settings below are those that have BDC-specific defaults but are user configurable. System-managed settings are not included.
Setting Name | Description | Type | Default Value |
---|---|---|---|
capacity-scheduler.yarn.scheduler.capacity.maximum-applications | Maximum number of applications in the system which can be concurrently active both running and pending. | int | 10000 |
capacity-scheduler.yarn.scheduler.capacity.resource-calculator | The ResourceCalculator implementation to be used to compare Resources in the scheduler. | string | org.apache.hadoop.yarn.util.resource.DominantResourceCalculator |
capacity-scheduler.yarn.scheduler.capacity.root.queues | The capacity scheduler with predefined queue called root. | string | default |
capacity-scheduler.yarn.scheduler.capacity.root.default.capacity | Queue capacity in percentage (%) as absolute resource queue minimum capacity for root queue. | int | 100 |
spark-defaults-conf.spark.driver.cores | Number of cores to use for the driver process, only in cluster mode. | int | 1 |
spark-defaults-conf.spark.driver.memoryOverhead | The amount of off-heap memory to be allocated per driver in cluster mode. | int | 384 |
spark-defaults-conf.spark.executor.instances | The number of executors for static allocation. | int | 1 |
spark-defaults-conf.spark.executor.cores | The number of cores to use on each executor. | int | 1 |
spark-defaults-conf.spark.driver.memory | Amount of memory to use for the driver process. | string | 1g |
spark-defaults-conf.spark.executor.memory | Amount of memory to use per executor process. | string | 1g |
spark-defaults-conf.spark.executor.memoryOverhead | The amount of off-heap memory to be allocated per executor. | int | 384 |
yarn-site.yarn.nodemanager.resource.memory-mb | Amount of physical memory, in MB, that can be allocated for containers. | int | 8192 |
yarn-site.yarn.scheduler.maximum-allocation-mb | The maximum allocation for every container request at the resource manager. | int | 8192 |
yarn-site.yarn.nodemanager.resource.cpu-vcores | Number of CPU cores that can be allocated for containers. | int | 32 |
yarn-site.yarn.scheduler.maximum-allocation-vcores | The maximum allocation for every container request at the resource manager, in terms of virtual CPU cores. | int | 8 |
yarn-site.yarn.nodemanager.linux-container-executor.secure-mode.pool-user-count | The number of pool users for the linux container executor in secure mode. | int | 6 |
yarn-site.yarn.scheduler.capacity.maximum-am-resource-percent | Maximum percent of resources in the cluster that can be used to run application masters. | float | 0.1 |
yarn-site.yarn.nodemanager.container-executor.class | Container executors for a specific operating system(s). | string | org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor |
capacity-scheduler.yarn.scheduler.capacity.root.default.user-limit-factor | The multiple of the queue capacity which can be configured to allow a single user to acquire more resources. | int | 1 |
capacity-scheduler.yarn.scheduler.capacity.root.default.maximum-capacity | Maximum queue capacity in percentage (%) as a float OR as absolute resource queue maximum capacity. Setting this value to -1 sets maximum capacity to 100%. | int | 100 |
capacity-scheduler.yarn.scheduler.capacity.root.default.state | State of queue can be one of Running or Stopped. | string | RUNNING |
capacity-scheduler.yarn.scheduler.capacity.root.default.maximum-application-lifetime | Maximum lifetime of an application which is submitted to a queue in seconds. Any value less than or equal to zero will be considered as disabled. | int | -1 |
capacity-scheduler.yarn.scheduler.capacity.root.default.default-application-lifetime | Default lifetime of an application which is submitted to a queue in seconds. Any value less than or equal to zero will be considered as disabled. | int | -1 |
capacity-scheduler.yarn.scheduler.capacity.node-locality-delay | Number of missed scheduling opportunities after which the CapacityScheduler attempts to schedule rack-local containers. | int | 40 |
capacity-scheduler.yarn.scheduler.capacity.rack-locality-additional-delay | Number of additional missed scheduling opportunities over the node-locality-delay ones, after which the CapacityScheduler attempts to schedule off-switch containers. | int | -1 |
hadoop-env.HADOOP_HEAPSIZE_MAX | Default maximum heap size of all Hadoop JVM processes. | int | 2048 |
yarn-env.YARN_RESOURCEMANAGER_HEAPSIZE | Heap size of Yarn ResourceManager. | int | 2048 |
yarn-env.YARN_NODEMANAGER_HEAPSIZE | Heap size of Yarn NodeManager. | int | 2048 |
mapred-env.HADOOP_JOB_HISTORYSERVER_HEAPSIZE | Heap size of Hadoop Job HistoryServer. | int | 2048 |
hive-env.HADOOP_HEAPSIZE | Heap size of Hadoop for Hive. | int | 2048 |
livy-conf.livy.server.session.timeout-check | Check for Livy server session timeout. | bool | true |
livy-conf.livy.server.session.timeout-check.skip-busy | Skip-busy for Check for Livy server session timeout. | bool | true |
livy-conf.livy.server.session.timeout | Timeout for livy server session in (ms/s/m | min/h/d/y). | string | 2h |
livy-conf.livy.server.yarn.poll-interval | Polling interval for yarn in Livy server in (ms/s/m | min/h/d/y). | string | 500ms |
livy-conf.livy.rsc.jars | Livy RSC jars. | string | local:/opt/livy/rsc-jars/livy-api.jar,local:/opt/livy/rsc-jars/livy-rsc.jar,local:/opt/livy/rsc-jars/netty-all.jar |
livy-conf.livy.repl.jars | Livy repl jars. | string | local:/opt/livy/repl_2.11-jars/livy-core.jar,local:/opt/livy/repl_2.11-jars/livy-repl.jar,local:/opt/livy/repl_2.11-jars/commons-codec.jar |
livy-conf.livy.rsc.sparkr.package | Livy RSC SparkR package. | string | hdfs:///system/livy/sparkr.zip |
livy-env.LIVY_SERVER_JAVA_OPTS | Livy Server Java Options. | string | -Xmx2g |
spark-defaults-conf.spark.r.backendConnectionTimeout | Connection timeout set by R process on its connection to RBackend in seconds. | int | 86400 |
spark-defaults-conf.spark.pyspark.python | Python option for spark. | string | /opt/bin/python3 |
spark-defaults-conf.spark.yarn.jars | Yarn jars. | string | local:/opt/spark/jars/* |
spark-history-server-conf.spark.history.fs.cleaner.maxAge | Maximum age of job history files before they are deleted by the filesystem history cleaner in (ms/s/m | min/h/d/y). | string | 7d |
spark-history-server-conf.spark.history.fs.cleaner.interval | Interval of cleaner for spark history in (ms/s/m | min/h/d/y). | string | 12h |
hadoop-env.HADOOP_CLASSPATH | Sets the additional Hadoop classpath. | string | |
spark-env.SPARK_DAEMON_MEMORY | Spark Daemon Memory. | string | 2g |
yarn-site.yarn.log-aggregation.retain-seconds | When log aggregation in enabled, this property determines the number of seconds to retain logs. | int | 604800 |
yarn-site.yarn.nodemanager.log-aggregation.compression-type | Compression type for log aggregation for Yarn NodeManager. | string | gz |
yarn-site.yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds | Interval seconds for Roll Monitoring in NodeManager Log Aggregation. | int | 3600 |
yarn-site.yarn.scheduler.minimum-allocation-mb | The minimum allocation for every container request at the Resource Manager, in MBs. | int | 512 |
yarn-site.yarn.scheduler.minimum-allocation-vcores | The minimum allocation for every container request at the Resource Manager in terms of virtual CPU cores. | int | 1 |
yarn-site.yarn.nm.liveness-monitor.expiry-interval-ms | How long to wait until a node manager is considered dead. | int | 180000 |
yarn-site.yarn.resourcemanager.zk-timeout-ms | 'ZooKeeper' session timeout in milliseconds. | int | 40000 |
capacity-scheduler.yarn.scheduler.capacity.root.default.acl_application_max_priority | The ACL of who can submit applications with configured priority. For e.g, [user={name} group={name} max_priority={priority} default_priority={priority}]. | string | * |
includeSpark | Boolean to configure whether or not Spark jobs can run in the Storage pool. | bool | true |
enableSparkOnK8s | Boolean to configure whether or not enable Spark on K8s, which adding containers for K8s in Spark head. | bool | false |
sparkVersion | The version of Spark | string | 2.4 |
spark-env.PYSPARK_ARCHIVES_PATH | Path to pyspark archive jars used in spark jobs. | string | local:/opt/spark/python/lib/pyspark.zip,local:/opt/spark/python/lib/py4j-0.10.7-src.zip |
The following sections list the unsupported configurations.
Big Data Clusters-specific default HDFS settings
The HDFS settings below are those that have BDC-specific defaults but are user configurable. System-managed settings are not included.
Setting Name | Description | Type | Default Value |
---|---|---|---|
hdfs-site.dfs.replication | Default Block Replication. | int | 2 |
hdfs-site.dfs.namenode.provided.enabled | Enables the name node to handle provided storages. | bool | true |
hdfs.site.dfs.namenode.mount.acls.enabled | Set to true to inherit ACLs (Access Control Lists) from remote stores during mount. | bool | false |
hdfs-site.dfs.datanode.provided.enabled | Enables the data node to handle provided storages. | bool | true |
hdfs-site.dfs.datanode.provided.volume.lazy.load | Enable lazy load in data node for provided storages. | bool | true |
hdfs-site.dfs.provided.aliasmap.inmemory.enabled | Enable in-memory alias map for provided storages. | bool | true |
hdfs-site.dfs.provided.aliasmap.class | The class that is used to specify the input format of the blocks on provided storages. | string | org.apache.hadoop.hdfs.server.common.blockaliasmap.impl.InMemoryLevelDBAliasMapClient |
hdfs-site.dfs.namenode.provided.aliasmap.class | The class that is used to specify the input format of the blocks on provided storages for namenode. | string | org.apache.hadoop.hdfs.server.common.blockaliasmap.impl.NamenodeInMemoryAliasMapClient |
hdfs-site.dfs.provided.aliasmap.load.retries | Number of retries on the datanode to load the provided aliasmap. | int | 0 |
hdfs-site.dfs.provided.aliasmap.inmemory.batch-size | The batch size when iterating over the database backing the aliasmap. | int | 500 |
hdfs-site.dfs.datanode.provided.volume.readthrough | Enable readthrough for provided storages in datanode. | bool | true |
hdfs-site.dfs.provided.cache.capacity.mount | Enable cache capacity mount for provided storages. | bool | true |
hdfs-site.dfs.provided.overreplication.factor | Overreplication factor for provided storages. Number of cache-blocks on BDC created per remote HDFS block. | float | 1 |
hdfs-site.dfs.provided.cache.capacity.fraction | Cache capacity fraction for provided storage. The fraction of the total capacity in the cluster that can be used to cache data from Provided stores. | float | 0.01 |
hdfs-site.dfs.provided.cache.capacity.bytes | Cluster capacity to use as cache space for Provided blocks, in bytes. | int | -1 |
hdfs-site.dfs.ls.limit | Limit the number of files printed by ls. | int | 500 |
hdfs-env.HDFS_NAMENODE_OPTS | HDFS Namenode options. | string | -Dhadoop.security.logger=INFO,RFAS -Xmx2g |
hdfs-env.HDFS_DATANODE_OPTS | HDFS Datanode Options. | string | -Dhadoop.security.logger=ERROR,RFAS -Xmx2g |
hdfs-env.HDFS_ZKFC_OPTS | HDFS ZKFC Options. | string | -Xmx1g |
hdfs-env.HDFS_JOURNALNODE_OPTS | HDFS JournalNode Options. | string | -Xmx2g |
hdfs-env.HDFS_AUDIT_LOGGER | HDFS Audit Logger Options. | string | INFO,RFAAUDIT |
core-site.hadoop.security.group.mapping.ldap.search.group.hierarchy.levels | Hierarchy levels for core site Hadoop LDAP Search group. | int | 10 |
core-site.fs.permissions.umask-mode | Permission umask mode. | string | 077 |
core-site.hadoop.security.kms.client.failover.max.retries | Max retries for client failover. | int | 20 |
zoo-cfg.tickTime | Tick Time for 'ZooKeeper' config. | int | 2000 |
zoo-cfg.initLimit | Init Time for 'ZooKeeper' config. | int | 10 |
zoo-cfg.syncLimit | Sync Time for 'ZooKeeper' config. | int | 5 |
zoo-cfg.maxClientCnxns | Max Client connections for 'ZooKeeper' config. | int | 60 |
zoo-cfg.minSessionTimeout | Minimum Session Timeout for 'ZooKeeper' config. | int | 4000 |
zoo-cfg.maxSessionTimeout | Maximum Session Timeout for 'ZooKeeper' config. | int | 40000 |
zoo-cfg.autopurge.snapRetainCount | Snap Retain count for Autopurge 'ZooKeeper' config. | int | 3 |
zoo-cfg.autopurge.purgeInterval | Purge interval for Autopurge 'ZooKeeper' config. | int | 0 |
zookeeper-java-env.JVMFLAGS | JVM Flags for Java environment in 'ZooKeeper'. | string | -Xmx1G -Xms1G |
zookeeper-log4j-properties.zookeeper.console.threshold | Threshold for log4j console in 'ZooKeeper'. | string | INFO |
zoo-cfg.zookeeper.request.timeout | Controls the 'ZooKeeper' request timeout in milliseconds. | int | 40000 |
kms-site.hadoop.security.kms.encrypted.key.cache.size | Cache size for encrypted key in hadoop kms. | int | 500 |
Big Data Clusters-specific default Gateway settings
The Gateway settings below are those that have BDC-specific defaults but are user configurable. System-managed settings are not included. Gateway settings can only be configured at the resource scope.
Setting Name | Description | Type | Default Value |
---|---|---|---|
gateway-site.gateway.httpclient.socketTimeout | Socket Timeout for HTTP client in gateway in (ms/s/m). | string | 90s |
gateway-site.sun.security.krb5.debug | Debug for Kerberos Security. | bool | true |
knox-env.KNOX_GATEWAY_MEM_OPTS | Knox Gateway Memory options. | string | -Xmx2g |
Unsupported Spark configurations
The following spark
configurations are unsupported and cannot be changed in the context of the Big Data Cluster.
Category | Sub-Category | File | Unsupported Configurations |
---|---|---|---|
yarn-site | yarn-site.xml | yarn.log-aggregation-enable | |
yarn.log.server.url | |||
yarn.nodemanager.pmem-check-enabled | |||
yarn.nodemanager.vmem-check-enabled | |||
yarn.nodemanager.aux-services | |||
yarn.resourcemanager.address | |||
yarn.nodemanager.address | |||
yarn.client.failover-no-ha-proxy-provider | |||
yarn.client.failover-proxy-provider | |||
yarn.http.policy | |||
yarn.nodemanager.linux-container-executor.secure-mode.use-pool-user | |||
yarn.nodemanager.linux-container-executor.secure-mode.pool-user-prefix | |||
yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user | |||
yarn.acl.enable | |||
yarn.admin.acl | |||
yarn.resourcemanager.hostname | |||
yarn.resourcemanager.principal | |||
yarn.resourcemanager.keytab | |||
yarn.resourcemanager.webapp.spnego-keytab-file | |||
yarn.resourcemanager.webapp.spnego-principal | |||
yarn.nodemanager.principal | |||
yarn.nodemanager.keytab | |||
yarn.nodemanager.webapp.spnego-keytab-file | |||
yarn.nodemanager.webapp.spnego-principal | |||
yarn.resourcemanager.ha.enabled | |||
yarn.resourcemanager.cluster-id | |||
yarn.resourcemanager.zk-address | |||
yarn.resourcemanager.ha.rm-ids | |||
yarn.resourcemanager.hostname.* | |||
capacity-scheduler | capacity-scheduler.xml | yarn.scheduler.capacity.root.acl_submit_applications | |
yarn.scheduler.capacity.root.acl_administer_queue | |||
yarn.scheduler.capacity.root.default.acl_application_max_priority | |||
yarn-env | yarn-env.sh | ||
spark-defaults-conf | spark-defaults.conf | spark.yarn.archive | |
spark.yarn.historyServer.address | |||
spark.eventLog.enabled | |||
spark.eventLog.dir | |||
spark.sql.warehouse.dir | |||
spark.sql.hive.metastore.version | |||
spark.sql.hive.metastore.jars | |||
spark.extraListeners | |||
spark.metrics.conf | |||
spark.ssl.enabled | |||
spark.authenticate | |||
spark.network.crypto.enabled | |||
spark.ssl.keyStore | |||
spark.ssl.keyStorePassword | |||
spark.ui.enabled | |||
spark-env | spark-env.sh | SPARK_NO_DAEMONIZE | |
SPARK_DIST_CLASSPATH | |||
spark-history-server-conf | spark-history-server.conf | spark.history.fs.logDirectory | |
spark.ui.proxyBase | |||
spark.history.fs.cleaner.enabled | |||
spark.ssl.enabled | |||
spark.authenticate | |||
spark.network.crypto.enabled | |||
spark.ssl.keyStore | |||
spark.ssl.keyStorePassword | |||
spark.history.kerberos.enabled | |||
spark.history.kerberos.principal | |||
spark.history.kerberos.keytab | |||
spark.ui.filters | |||
spark.acls.enable | |||
spark.history.ui.acls.enable | |||
spark.history.ui.admin.acls | |||
spark.history.ui.admin.acls.groups | |||
livy-conf | livy.conf | livy.keystore | |
livy.keystore.password | |||
livy.spark.master | |||
livy.spark.deploy-mode | |||
livy.rsc.jars | |||
livy.repl.jars | |||
livy.rsc.pyspark.archives | |||
livy.rsc.sparkr.package | |||
livy.repl.enable-hive-context | |||
livy.superusers | |||
livy.server.auth.type | |||
livy.server.launch.kerberos.keytab | |||
livy.server.launch.kerberos.principal | |||
livy.server.auth.kerberos.principal | |||
livy.server.auth.kerberos.keytab | |||
livy.impersonation.enabled | |||
livy.server.access-control.enabled | |||
livy.server.access-control.* | |||
livy-env | livy-env.sh | ||
hive-site | hive-site.xml | javax.jdo.option.ConnectionURL | |
javax.jdo.option.ConnectionDriverName | |||
javax.jdo.option.ConnectionUserName | |||
javax.jdo.option.ConnectionPassword | |||
hive.metastore.uris | |||
hive.metastore.pre.event.listeners | |||
hive.security.authorization.enabled | |||
hive.security.metastore.authenticator.manager | |||
hive.security.metastore.authorization.manager | |||
hive.metastore.use.SSL | |||
hive.metastore.keystore.path | |||
hive.metastore.keystore.password | |||
hive.metastore.truststore.path | |||
hive.metastore.truststore.password | |||
hive.metastore.kerberos.keytab.file | |||
hive.metastore.kerberos.principal | |||
hive.metastore.sasl.enabled | |||
hive.metastore.execute.setugi | |||
hive.cluster.delegation.token.store.class | |||
hive-env | hive-env.sh |
Unsupported HDFS configurations
The following hdfs
configurations are unsupported and cannot be changed in the context of the Big Data Cluster.
Category | Sub-Category | File | Unsupported Configurations |
---|---|---|---|
core-site | core-site.xml | fs.defaultFS | |
ha.zookeeper.quorum | |||
hadoop.tmp.dir | |||
hadoop.rpc.protection | |||
hadoop.security.auth_to_local | |||
hadoop.security.authentication | |||
hadoop.security.authorization | |||
hadoop.http.authentication.simple.anonymous.allowed | |||
hadoop.http.authentication.type | |||
hadoop.http.authentication.kerberos.principal | |||
hadoop.http.authentication.kerberos.keytab | |||
hadoop.http.filter.initializers | |||
hadoop.security.group.mapping.* | |||
hadoop.security.key.provider.path | |||
mapred-env | mapred-env.sh | ||
hdfs-site | hdfs-site.xml | dfs.namenode.name.dir | |
dfs.datanode.data.dir | |||
dfs.namenode.acls.enabled | |||
dfs.namenode.datanode.registration.ip-hostname-check | |||
dfs.client.retry.policy.enabled | |||
dfs.permissions.enabled | |||
dfs.nameservices | |||
dfs.ha.namenodes.nmnode-0 | |||
dfs.namenode.rpc-address.nmnode-0.* | |||
dfs.namenode.shared.edits.dir | |||
dfs.ha.automatic-failover.enabled | |||
dfs.ha.fencing.methods | |||
dfs.journalnode.edits.dir | |||
dfs.client.failover.proxy.provider.nmnode-0 | |||
dfs.namenode.http-address | |||
dfs.namenode.httpS-address | |||
dfs.http.policy | |||
dfs.encrypt.data.transfer | |||
dfs.block.access.token.enable | |||
dfs.data.transfer.protection | |||
dfs.encrypt.data.transfer.cipher.suites | |||
dfs.https.port | |||
dfs.namenode.keytab.file | |||
dfs.namenode.kerberos.principal | |||
dfs.namenode.kerberos.internal.spnego.principal | |||
dfs.datanode.data.dir.perm | |||
dfs.datanode.address | |||
dfs.datanode.http.address | |||
dfs.datanode.ipc.address | |||
dfs.datanode.https.address | |||
dfs.datanode.keytab.file | |||
dfs.datanode.kerberos.principal | |||
dfs.journalnode.keytab.file | |||
dfs.journalnode.kerberos.principal | |||
dfs.journalnode.kerberos.internal.spnego.principal | |||
dfs.web.authentication.kerberos.keytab | |||
dfs.web.authentication.kerberos.principal | |||
dfs.webhdfs.enabled | |||
dfs.permissions.superusergroup | |||
hdfs-env | hdfs-env.sh | HADOOP_HEAPSIZE_MAX | |
zoo-cfg | zoo.cfg | secureClientPort | |
clientPort | |||
dataDir | |||
dataLogDir | |||
4lw.commands.whitelist | |||
zookeeper-java-env | java.env | ZK_LOG_DIR | |
SERVER_JVMFLAGS | |||
zookeeper-log4j-properties | log4j.properties (zookeeper) | log4j.rootLogger | |
log4j.appender.CONSOLE.* |
Note
This article contains the term whitelist, a term Microsoft considers insensitive in this context. The term appears in this article because it currently appears in the software. When the term is removed from the software, we will remove it from the article.
Unsupported gateway
configurations
The following gateway
configurations for are unsupported and cannot be changed in the context of the Big Data Cluster.
Category | Sub-Category | File | Unsupported Configurations |
---|---|---|---|
gateway-site | gateway-site.xml | gateway.port | |
gateway.path | |||
gateway.gateway.conf.dir | |||
gateway.hadoop.kerberos.secured | |||
java.security.krb5.conf | |||
java.security.auth.login.config | |||
gateway.websocket.feature.enabled | |||
gateway.scope.cookies.feature.enabled | |||
ssl.exclude.protocols | |||
ssl.include.ciphers |