Script action development with HDInsight
Learn how to customize your HDInsight cluster using Bash scripts. Script actions are a way to customize HDInsight during or after cluster creation.
What are script actions
Script actions are Bash scripts that Azure runs on the cluster nodes to make configuration changes or install software. A script action is executed as root, and provides full access rights to the cluster nodes.
Script actions can be applied through the following methods:
Use this method to apply a script... | During cluster creation... | On a running cluster... |
---|---|---|
Azure portal | ✓ | ✓ |
Azure PowerShell | ✓ | ✓ |
Azure Classic CLI | ✓ | |
HDInsight .NET SDK | ✓ | ✓ |
Azure Resource Manager Template | ✓ | |
For more information on using these methods to apply script actions, see Customize HDInsight clusters using script actions.
Best practices for script development
When you develop a custom script for an HDInsight cluster, there are several best practices to keep in mind:
- Target the Apache Hadoop version
- Target the OS Version
- Provide stable links to script resources
- Use pre-compiled resources
- Ensure that the cluster customization script is idempotent
- Ensure high availability of the cluster architecture
- Configure the custom components to use Azure Blob storage
- Write information to STDOUT and STDERR
- Save files as ASCII with LF line endings
- Use retry logic to recover from transient errors
Important
Script actions must complete within 60 minutes or the process fails. During node provisioning, the script runs concurrently with other setup and configuration processes. Competition for resources such as CPU time or network bandwidth may cause the script to take longer to finish than it does in your development environment.
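One way to keep a single slow step from using up the whole time budget is to give each long-running step its own limit. The following is a minimal sketch, assuming the `timeout` utility from GNU coreutils is available on the node; the `run_step` helper and the limit value are hypothetical and not part of HDInsight.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run one step of a script with its own time limit (in seconds).
run_step() {
    local limit="$1"
    shift
    local rc=0
    # timeout(1) from GNU coreutils exits with status 124 when the limit is reached.
    timeout "$limit" "$@" || rc=$?
    if [ "$rc" -eq 124 ]; then
        >&2 echo "Step timed out after ${limit}s: $*"
    elif [ "$rc" -ne 0 ]; then
        >&2 echo "Step failed with exit code ${rc}: $*"
    fi
    return "$rc"
}

# Illustrative usage: the limit is an arbitrary placeholder, not a recommended value.
run_step 5 echo "Installing Foo"
```

Because the helper returns the exit code of the wrapped command, the caller can decide whether a timed-out step should abort the script or be retried.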
Target the Apache Hadoop version
Different versions of HDInsight have different versions of Hadoop services and components installed. If your script expects a specific version of a service or component, you should only use the script with the version of HDInsight that includes the required components. You can find information on component versions included with HDInsight using the HDInsight component versioning document.
Checking the operating system version
Different versions of HDInsight rely on specific versions of Ubuntu. There may be differences between OS versions that you must check for in your script. For example, you may need to install a binary that is tied to the version of Ubuntu.
To check the OS version, use lsb_release. For example, the following script demonstrates how to reference a specific tar file depending on the OS version:
OS_VERSION=$(lsb_release -sr)
if [[ $OS_VERSION == 14* ]]; then
    echo "OS version is $OS_VERSION. Using hue-binaries-14-04."
    HUE_TARFILE=hue-binaries-14-04.tgz
elif [[ $OS_VERSION == 16* ]]; then
    echo "OS version is $OS_VERSION. Using hue-binaries-16-04."
    HUE_TARFILE=hue-binaries-16-04.tgz
fi
Target the operating system version
HDInsight is based on the Ubuntu Linux distribution. Different versions of HDInsight rely on different versions of Ubuntu, which may change how your script behaves. For example, HDInsight 3.4 and earlier are based on Ubuntu versions that use Upstart. Versions 3.5 and greater are based on Ubuntu 16.04, which uses Systemd. Systemd and Upstart rely on different commands, so your script should be written to work with both.
Another important difference between HDInsight 3.4 and 3.5 is that JAVA_HOME now points to Java 8. The following code demonstrates how to determine if the script is running on Ubuntu 14 or 16:
OS_VERSION=$(lsb_release -sr)
if [[ $OS_VERSION == 14* ]]; then
    echo "OS version is $OS_VERSION. Using hue-binaries-14-04."
    HUE_TARFILE=hue-binaries-14-04.tgz
elif [[ $OS_VERSION == 16* ]]; then
    echo "OS version is $OS_VERSION. Using hue-binaries-16-04."
    HUE_TARFILE=hue-binaries-16-04.tgz
fi
...
if [[ $OS_VERSION == 16* ]]; then
    echo "Using systemd configuration"
    systemctl daemon-reload
    systemctl stop webwasb.service
    systemctl start webwasb.service
else
    echo "Using upstart configuration"
    initctl reload-configuration
    stop webwasb
    start webwasb
fi
...
if [[ $OS_VERSION == 14* ]]; then
    export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
elif [[ $OS_VERSION == 16* ]]; then
    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
fi
You can find the full script that contains these snippets at https://hdiconfigactions.blob.core.windows.net/linuxhueconfigactionv02/install-hue-uber-v02.sh.
For the version of Ubuntu that is used by HDInsight, see the HDInsight component version document.
To understand the differences between Systemd and Upstart, see Systemd for Upstart users.
Provide stable links to script resources
The script and associated resources must remain available throughout the lifetime of the cluster. These resources are required if new nodes are added to the cluster during scaling operations.
The best practice is to download and archive everything in an Azure Storage account on your subscription.
Important
The storage account used must be the default storage account for the cluster or a public, read-only container on any other storage account.
For example, the samples provided by Microsoft are stored in the https://hdiconfigactions.blob.core.windows.net/ storage account. This location is a public, read-only container maintained by the HDInsight team.
Use pre-compiled resources
To reduce the time it takes to run the script, avoid operations that compile resources from source code. For example, pre-compile resources and store them in an Azure Storage account blob in the same data center as HDInsight.
Ensure that the cluster customization script is idempotent
Scripts must be idempotent. If the script runs multiple times, it should return the cluster to the same state every time.
For example, a script that modifies configuration files shouldn't add duplicate entries when it runs more than once.
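The no-duplicate-entries requirement can be sketched as follows. This is only an illustration: the `add_line_once` helper, the `FOO_HOME` setting, and the scratch file standing in for a real configuration file are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append a line to a file only if that exact line is not already there.
add_line_once() {
    local line="$1"
    local file="$2"
    touch "$file"
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string.
    if ! grep -qxF -- "$line" "$file"; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Demo against a scratch file (a stand-in for a real configuration file).
CONF_FILE="$(mktemp)"
add_line_once "export FOO_HOME=/opt/foo" "$CONF_FILE"
add_line_once "export FOO_HOME=/opt/foo" "$CONF_FILE"   # second run changes nothing
cat "$CONF_FILE"
```

Checking before writing means that a second run of the script, for example on a node added during scaling, leaves the file unchanged.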
Ensure high availability of the cluster architecture
Linux-based HDInsight clusters provide two head nodes that are active within the cluster, and script actions run on both nodes. If the components you install expect only one head node, don't install the components on both head nodes.
Important
Services provided as part of HDInsight are designed to fail over between the two head nodes as needed. This functionality is not extended to custom components installed through script actions. If you need high availability for custom components, you must implement your own failover mechanism.
Configure the custom components to use Azure Blob storage
Components that you install on the cluster might have a default configuration that uses Apache Hadoop Distributed File System (HDFS) storage. HDInsight uses either Azure Storage or Data Lake Storage as the default storage. Both provide an HDFS compatible file system that persists data even if the cluster is deleted. You may need to configure components you install to use WASB or ADL instead of HDFS.
For most operations, you don't need to specify the file system. For example, the following copies the hadoop-common.jar file from the local file system to cluster storage:
hdfs dfs -put /usr/hdp/current/hadoop-client/hadoop-common.jar /example/jars/
In this example, the hdfs command transparently uses the default cluster storage. For some operations, you may need to specify the URI. For example, adl:///example/jars for Azure Data Lake Storage Gen1, abfs:///example/jars for Data Lake Storage Gen2, or wasb:///example/jars for Azure Storage.
Write information to STDOUT and STDERR
HDInsight logs script output that is written to STDOUT and STDERR. You can view this information using the Ambari web UI.
Note
Apache Ambari is only available if the cluster is successfully created. If you use a script action during cluster creation, and creation fails, see Troubleshoot script actions for other ways of accessing logged information.
Most utilities and installation packages already write information to STDOUT and STDERR; however, you may want to add additional logging. To send text to STDOUT, use echo. For example:
echo "Getting ready to install Foo"
By default, echo sends the string to STDOUT. To direct it to STDERR, add >&2 before echo. For example:
>&2 echo "An error occurred installing Foo"
This redirects information written to STDOUT to STDERR (2) instead. For more information on IO redirection, see https://www.tldp.org/LDP/abs/html/io-redirection.html.
For more information on viewing information logged by script actions, see Troubleshoot script actions.
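If a script writes many messages, small wrapper functions keep the two streams consistent. The following is a minimal sketch; the function names `log_info` and `log_error` and the timestamp format are illustrative assumptions, not part of HDInsight.

```shell
#!/usr/bin/env bash
# Hypothetical logging helpers that prefix each message with a UTC timestamp.
log_info() {
    # Informational messages go to STDOUT.
    echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] INFO: $*"
}

log_error() {
    # Error messages go to STDERR.
    >&2 echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] ERROR: $*"
}

log_info "Getting ready to install Foo"
log_error "An error occurred installing Foo"
```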
Save files as ASCII with LF line endings
Bash scripts should be stored in ASCII format, with lines terminated by LF. Files that are stored as UTF-8 with a byte order mark (BOM), or that use CRLF as the line ending, may fail with the following errors:
$'\r': command not found
line 1: #!/usr/bin/env: No such file or directory
Use retry logic to recover from transient errors
When you download files, install packages using apt-get, or perform other actions that transmit data over the internet, the action may fail because of transient networking errors. For example, the remote resource you're communicating with may be in the process of failing over to a backup node.
To make your script resilient to transient errors, you can implement retry logic. The following function demonstrates how to implement retry logic. It attempts the operation up to three times before failing.
# Retry a command up to MAXATTEMPTS times, pausing between attempts.
MAXATTEMPTS=3
retry() {
    local -i ATTEMPTNUM=1
    local -i RETRYINTERVAL=2
    # Run the command exactly as passed; "$@" preserves each argument's quoting.
    until "$@"
    do
        if (( ATTEMPTNUM == MAXATTEMPTS ))
        then
            echo "Attempt $ATTEMPTNUM failed. No more attempts left."
            return 1
        else
            echo "Attempt $ATTEMPTNUM failed! Retrying in $RETRYINTERVAL seconds..."
            sleep "$RETRYINTERVAL"
            ATTEMPTNUM=$(( ATTEMPTNUM + 1 ))
        fi
    done
}
The following examples demonstrate how to use this function.
retry ls -ltr foo
retry wget -O ./tmpfile.sh https://hdiconfigactions.blob.core.windows.net/linuxhueconfigactionv02/install-hue-uber-v02.sh
Helper methods for custom scripts
Script action helper methods are utilities that you can use while writing custom scripts. These methods are contained in the https://hdiconfigactions.blob.core.windows.net/linuxconfigactionmodulev01/HDInsightUtilities-v01.sh script. Use the following to download and use them as part of your script:
# Import the helper method module.
wget -O /tmp/HDInsightUtilities-v01.sh -q https://hdiconfigactions.blob.core.windows.net/linuxconfigactionmodulev01/HDInsightUtilities-v01.sh && source /tmp/HDInsightUtilities-v01.sh && rm -f /tmp/HDInsightUtilities-v01.sh
The following helpers are available for use in your script:
Helper usage | Description |
---|---|
download_file SOURCEURL DESTFILEPATH [OVERWRITE] | Downloads a file from the source URI to the specified file path. By default, it doesn't overwrite an existing file. |
untar_file TARFILE DESTDIR | Extracts a tar file (using -xf) to the destination directory. |
test_is_headnode | If the script ran on a cluster head node, return 1; otherwise, 0. |
test_is_datanode | If the current node is a data (worker) node, return a 1; otherwise, 0. |
test_is_first_datanode | If the current node is the first data (worker) node (named workernode0) return a 1; otherwise, 0. |
get_headnodes | Return the fully qualified domain name of the headnodes in the cluster. Names are comma delimited. An empty string is returned on error. |
get_primary_headnode | Gets the fully qualified domain name of the primary headnode. An empty string is returned on error. |
get_secondary_headnode | Gets the fully qualified domain name of the secondary headnode. An empty string is returned on error. |
get_primary_headnode_number | Gets the numeric suffix of the primary headnode. An empty string is returned on error. |
get_secondary_headnode_number | Gets the numeric suffix of the secondary headnode. An empty string is returned on error. |
Common usage patterns
This section provides guidance on implementing some of the common usage patterns that you might run into while writing your own custom script.
Passing parameters to a script
In some cases, your script may require parameters. For example, you may need the admin password for the cluster when using the Ambari REST API.
Parameters passed to the script are known as positional parameters, and are assigned to $1 for the first parameter, $2 for the second, and so on. $0 contains the name of the script itself.
Values passed to the script as parameters should be enclosed by single quotes ('). Doing so ensures that the passed value is treated as a literal.
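A script that depends on positional parameters can check that they were supplied before doing any work. The following is a minimal sketch; the `parse_args` function and the variable names `CLUSTER_USER` and `CLUSTER_PASSWORD` are hypothetical, chosen only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical parameter handling: expects a user name and a password.
parse_args() {
    if [ "$#" -lt 2 ]; then
        >&2 echo "Usage: $0 CLUSTER_USER CLUSTER_PASSWORD"
        return 1
    fi
    CLUSTER_USER="$1"
    CLUSTER_PASSWORD="$2"
}

# Single quotes keep characters such as $ and ! literal when the value is passed.
parse_args 'admin' 'P@ss$word!' && echo "Parameters received for user: $CLUSTER_USER"
```

In a real script, the function would typically be called as `parse_args "$@"` so that it receives the parameters passed to the script action.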
Setting environment variables
Setting an environment variable is performed by the following statement:
VARIABLENAME=value
In the preceding example, VARIABLENAME is the name of the variable. To access the variable, use $VARIABLENAME. For example, to assign a value provided by a positional parameter as an environment variable named PASSWORD, you would use the following statement:
PASSWORD=$1
Subsequent access to the information could then use $PASSWORD.
Environment variables set within the script only exist within the scope of the script. In some cases, you may need to add system-wide environment variables that will persist after the script has finished. To add system-wide environment variables, add the variable to /etc/environment. For example, the following statement adds HADOOP_CONF_DIR:
echo "HADOOP_CONF_DIR=/etc/hadoop/conf" | sudo tee -a /etc/environment
Access to locations where the custom scripts are stored
Scripts used to customize a cluster need to be stored in one of the following locations:
- An Azure Storage account that is associated with the cluster.
- An additional storage account associated with the cluster.
- A publicly readable URI. For example, a URL to data stored on OneDrive, Dropbox, or other file hosting service.
- An Azure Data Lake Storage account that is associated with the HDInsight cluster. For more information on using Azure Data Lake Storage with HDInsight, see Quickstart: Set up clusters in HDInsight.
Note
The service principal HDInsight uses to access Data Lake Storage must have read access to the script.
Resources used by the script must also be publicly available.
Storing the files in an Azure Storage account or Azure Data Lake Storage provides fast access, as both are within the Azure network.
Note
The URI format used to reference the script differs depending on the service being used. For storage accounts associated with the HDInsight cluster, use wasb:// or wasbs://. For publicly readable URIs, use http:// or https://. For Data Lake Storage, use adl://.
Checklist for deploying a script action
Here are the steps to take when preparing to deploy a script:
- Put the files that contain the custom scripts in a place that is accessible by the cluster nodes during deployment. For example, the default storage for the cluster. Files can also be stored in publicly readable hosting services.
- Verify that the script is idempotent. Doing so allows the script to be executed multiple times on the same node.
- Use a temporary directory such as /tmp to hold the files downloaded by the scripts, and clean them up after the scripts have executed.
- If OS-level settings or Hadoop service configuration files are changed, you may want to restart HDInsight services.
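The temporary-directory step in the checklist can be sketched with `mktemp` and a `trap`, so that cleanup happens even if the script exits early. The directory name pattern and the placeholder file are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Create a private working directory under /tmp (the name pattern is arbitrary).
WORK_DIR="$(mktemp -d /tmp/scriptaction.XXXXXX)"

# Remove the working directory and everything in it.
cleanup() {
    rm -rf "$WORK_DIR"
}

# Run cleanup whenever the script exits, whether it succeeds or fails.
trap cleanup EXIT

# Stand-in for files the script would download and use.
echo "placeholder" > "$WORK_DIR/placeholder.txt"
echo "Working in $WORK_DIR"
```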
How to run a script action
You can use script actions to customize HDInsight clusters using the following methods:
- Azure portal
- Azure PowerShell
- Azure Resource Manager templates
- The HDInsight .NET SDK.
For more information on using each method, see How to use script action.
Custom script samples
Microsoft provides sample scripts to install components on an HDInsight cluster. See Install and use Hue on HDInsight clusters as an example script action.
Troubleshooting
The following are errors you may come across when using scripts you've developed:
Error: $'\r': command not found. Sometimes followed by syntax error: unexpected end of file.
Cause: This error is caused when the lines in a script end with CRLF. Unix systems expect only LF as the line ending.
This problem most often occurs when the script is authored on a Windows environment, as CRLF is a common line ending for many text editors on Windows.
Resolution: If it's an option in your text editor, select Unix format or LF for the line ending. You may also use the following commands on a Unix system to change the CRLF to an LF:
Note
The following commands are roughly equivalent in that they should change the CRLF line endings to LF. Select one based on the utilities available on your system.
Command | Notes |
---|---|
dos2unix INFILE | Converts the file in place |
tr -d '\r' < INFILE > OUTFILE | OUTFILE contains a version with only LF endings |
perl -pi -e 's/\r\n/\n/g' INFILE | Modifies the file directly |
sed 's/\r$//' INFILE > OUTFILE | OUTFILE contains a version with only LF endings (requires GNU sed) |
Error: line 1: #!/usr/bin/env: No such file or directory.
Cause: This error occurs when the script was saved as UTF-8 with a Byte Order Mark (BOM).
Resolution: Save the file either as ASCII, or as UTF-8 without a BOM. You may also use the following command on a Linux or Unix system to create a file without the BOM:
awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}{print}' INFILE > OUTFILE
Replace INFILE with the file containing the BOM. OUTFILE should be a new file name, which contains the script without the BOM.
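Both problems described in this section can be caught before a script is uploaded. The following is a minimal sketch of such a check; the `check_script_encoding` function is a hypothetical helper, and it assumes the standard `head`, `od`, `tr`, and `grep` utilities are available.

```shell
#!/usr/bin/env bash
# Hypothetical pre-upload check: report a UTF-8 BOM or carriage returns in a script file.
check_script_encoding() {
    local file="$1"
    local status=0
    # A UTF-8 BOM is the three bytes EF BB BF at the start of the file.
    if [ "$(head -c 3 "$file" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]; then
        >&2 echo "$file: starts with a UTF-8 byte order mark"
        status=1
    fi
    # $'\r' is Bash ANSI-C quoting for a carriage return character.
    if grep -q $'\r' "$file"; then
        >&2 echo "$file: contains carriage returns (CRLF line endings?)"
        status=1
    fi
    return "$status"
}

# Demo against a scratch file that is saved correctly.
SAMPLE="$(mktemp)"
printf '#!/usr/bin/env bash\necho ok\n' > "$SAMPLE"
check_script_encoding "$SAMPLE" && echo "$SAMPLE looks fine"
```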
Next steps
- Learn how to Customize HDInsight clusters using script action
- Use the HDInsight .NET SDK reference to learn more about creating .NET applications that manage HDInsight
- Use the HDInsight REST API to learn how to use REST to perform management actions on HDInsight clusters.