HDInsight jobs troubleshooting

04/21/2016

WebHCat is a REST interface for remote execution of jobs (Hive, Pig, Sqoop, MapReduce). WebHCat translates job submission requests into YARN applications and reports status based on the YARN application status. Because WebHCat results come from YARN, troubleshooting some of them requires going to YARN.
HDInsight gates all communication through the Gateway, so the behavior differs from what is seen on non-HDInsight Hadoop clusters. This blog post covers some general WebHCat troubleshooting scenarios.

HTTP status code 502 - BadGateway

This is a very generic message from the Gateway nodes. We will cover some common cases and possible mitigations.

WebHCat service down

This happens when the WebHCat server on the active headnode is not available. It can be quickly verified with the cURL command below, which returns 502.

$ curl -u 'admin:{HTTP PASSWD}' 'https://{CLUSTER DNS NAME}.azurehdinsight.net/templeton/v1/status?user.name=admin'
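To script this check, it can help to capture only the HTTP status code and map it to the scenarios in this post. The following is a minimal sketch, not an official tool: the function names classify_webhcat_status and webhcat_status_code are hypothetical, and the cluster name and password are supplied by the caller.

```shell
# Hypothetical helper: map an HTTP status code to a short label
# matching the troubleshooting sections of this post.
classify_webhcat_status() {
  case "$1" in
    200) echo "ok" ;;
    502) echo "bad-gateway" ;;   # service down, out of sockets, or timeout
    500) echo "server-error" ;;  # read the error message and webhcat.log
    *)   echo "other" ;;
  esac
}

# Hypothetical wrapper around the status call shown above.
# $1 = cluster DNS name, $2 = HTTP password (assumed to belong to user admin).
# Prints only the HTTP status code returned through the Gateway.
webhcat_status_code() {
  curl -s -o /dev/null -w '%{http_code}' -u "admin:$2" \
    "https://$1.azurehdinsight.net/templeton/v1/status?user.name=admin"
}
```

For example, classify_webhcat_status "$(webhcat_status_code mycluster "$PASSWD")" prints bad-gateway when the Gateway answers with 502 (mycluster and PASSWD are placeholders).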

In service-down scenarios, Ambari shows an alert at the top of the page; clicking it shows the hosts on which WebHCat is not available.

A WebHCat outage can be mitigated by restarting the service on the host for which the alert was raised, as shown in the screenshot below.

If the WebHCat server does not come up, clicking through the operations shows the failures (in this specific case the server started successfully, hence none are shown). For more detailed information, refer to the stderr and stdout files referenced on the node.

WebHCat service is up but not accepting requests

The WebHCat service might be up but not accepting new requests. One possibility is that the service has run out of socket connections. A quick way to validate this is to check the connection status using the command below.

$ netstat | grep 30111

30111 is the port WebHCat listens on, and the above command lists network connections to and from WebHCat. The number of connections should be very low (in single digits).
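Counting the connections rather than reading them by eye makes the "single digits" check easier to repeat. This is a small sketch under the assumption that netstat prints addresses as host:port followed by whitespace; the function name count_webhcat_connections is hypothetical.

```shell
# Count lines of netstat-style output (read from stdin) that mention a port.
# $1 = port to look for; defaults to 30111, the WebHCat port named above.
# The trailing [[:space:]] avoids matching longer ports such as 301110.
count_webhcat_connections() {
  port="${1:-30111}"
  # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true"
  grep -c ":${port}[[:space:]]" || true
}
```

A typical call would be netstat -n | count_webhcat_connections, where -n keeps ports numeric.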

WebHCat times out

The HDInsight Gateway times out responses that take longer than 2 minutes, resulting in “502 BadGateway”. WebHCat queries YARN services for job status, and if they take longer than that, the request might time out.

Below are known common scenarios where a timeout might happen:

List all jobs: This is a very expensive call. It enumerates the applications from the YARN ResourceManager and, for each completed application, gets the status from the JobHistoryServer. With a large number of jobs, this call might time out, resulting in 502.
List jobs older than 7 days: The HDInsight YARN JobHistoryServer is configured (mapreduce.jobhistory.max-age-ms) to retain completed job information for 7 days. Trying to enumerate purged jobs results in a timeout and a 502.
WebHCat is under load: When WebHCat is under load, requests might time out.
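Since mapreduce.jobhistory.max-age-ms is expressed in milliseconds, it can be useful to work out what 7 days means in that unit and to check whether a given job falls outside the window. The sketch below only does the arithmetic; the function names are hypothetical and the finish time is assumed to be known as epoch milliseconds.

```shell
# Convert a retention period in days to milliseconds,
# the unit used by mapreduce.jobhistory.max-age-ms.
days_to_ms() {
  echo $(( $1 * 24 * 60 * 60 * 1000 ))
}

# Succeed (exit 0) if a job is older than the retention window.
# $1 = job finish time (epoch ms), $2 = current time (epoch ms),
# $3 = retention in days (defaults to the 7 days mentioned above).
is_past_retention() {
  max_age_ms=$(days_to_ms "${3:-7}")
  [ $(( $2 - $1 )) -gt "$max_age_ms" ]
}
```

Here days_to_ms 7 prints 604800000, so a job that finished more than 604800000 ms ago is likely to have been purged and should not be enumerated.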

The WebHCat server log file will contain failures about these. WebHCat log files are saved to /var/log/webhcat. Typical contents of the directory are:

webhcat.log is the log4j log to which the server writes logs
webhcat-console.log is the stdout of the server when it is started
webhcat-console-error.log is the stderr of the server process

NOTE: webhcat.log rolls over daily, so files like webhcat.log.YYYY-MM-DD will also be present. For logs from a specific time range, make sure that the appropriate file is selected.

A quick look through the logs goes like this:

Figure out the UTC time range to troubleshoot
Select the webhcat.log file based on the time range
Look for WARN/ERROR messages during that period of time
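The steps above can be sketched as two small helpers. This is an illustration only: the function names are hypothetical, the default directory assumes the logs live in /var/log/webhcat, and it assumes a rolled file named webhcat.log.YYYY-MM-DD holds the entries for that UTC date.

```shell
# Print the log file expected to cover a given UTC date.
# $1 = date to investigate (YYYY-MM-DD)
# $2 = today's UTC date (YYYY-MM-DD)
# $3 = log directory (assumed default: /var/log/webhcat)
webhcat_log_for_date() {
  dir="${3:-/var/log/webhcat}"
  if [ "$1" = "$2" ]; then
    echo "$dir/webhcat.log"        # current day: not rolled over yet
  else
    echo "$dir/webhcat.log.$1"     # earlier day: rolled-over file
  fi
}

# Print the WARN and ERROR lines of a log file ($1).
webhcat_warnings() {
  grep -E 'WARN|ERROR' "$1"
}
```

For example, webhcat_warnings "$(webhcat_log_for_date 2016-04-20 "$(date -u +%F)")" lists the warnings and errors logged on 2016-04-20; narrowing to a time range within the day is left to the reader.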

HTTP status code 500

In most cases where WebHCat returns 500, the error message contains details of the failure. Otherwise, looking through the WebHCat log for ERROR/WARN messages will reveal the issue.

Job failures

This section covers cases where interactions with WebHCat are successful but the jobs are failing.

Check stderr of the job

Templeton collects the job console output as stderr in ‘statusdir’, which is often useful for troubleshooting. Stderr contains the YARN application id of the actual query, which can be used for further troubleshooting.

NOTE: For Hive workloads, using the statement ‘set hive.root.logger=DEBUG,console’ at the start collects more verbose logs into stderr.
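Pulling the YARN application id out of the collected stderr can be done with a simple pattern match, since YARN application ids have the form application_<cluster timestamp>_<sequence number>. The helper below is a minimal sketch with a hypothetical name; how the stderr text is fetched from ‘statusdir’ is left to the caller.

```shell
# Print the unique YARN application ids found in text read from stdin.
extract_yarn_app_ids() {
  grep -oE 'application_[0-9]+_[0-9]+' | sort -u
}
```

One possible use, assuming the job's stderr has been saved to a local file named stderr, is extract_yarn_app_ids < stderr.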

Check YARN logs

In cases where stderr doesn’t help, check the YARN application logs. Browse to the URL below, https://{DNS NAME}.azurehdinsight.net/yarnui/jobhistory/job/{job id}/m/SUCCESSFUL, and click through the logs link to troubleshoot further.

A sample screenshot looks like this:
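When the starting point is a YARN application id taken from stderr, the job history URL can be assembled with a couple of helpers. This sketch uses hypothetical function names and relies on the MapReduce convention that a job id is the application id with the application_ prefix replaced by job_; the URL pattern is the one given above.

```shell
# Convert a YARN application id to the matching MapReduce job id
# by swapping the "application_" prefix for "job_".
app_id_to_job_id() {
  echo "$1" | sed 's/^application_/job_/'
}

# Build the job history URL for a cluster and job id.
# $1 = cluster DNS name, $2 = job id
jobhistory_url() {
  echo "https://$1.azurehdinsight.net/yarnui/jobhistory/job/$2/m/SUCCESSFUL"
}
```

For example, jobhistory_url mycluster "$(app_id_to_job_id application_1461200000000_0007)" prints the URL to open in a browser (mycluster is a placeholder).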
