HDInsight jobs troubleshooting
WebHCat is a REST interface for remote job execution (Hive, Pig, Sqoop, MapReduce). WebHCat translates job submission requests into YARN applications and reports status based on the YARN application status. Because WebHCat results come from YARN, troubleshooting some of them requires going to YARN.
HDInsight routes all communication through the Gateway, so the behavior will differ from what is seen on non-HDInsight Hadoop clusters. This blog post covers some general WebHCat troubleshooting scenarios.
HTTP status code 502 - BadGateway
This is a very generic message from the Gateway nodes. We will cover some common cases and possible mitigations.
WebHCat service down
This happens when the WebHCat server on the active headnode is not available. It can be quickly verified with the curl command below, which returns 502 in this case.
$ curl -u "admin:{HTTP PASSWD}" "https://{CLUSTER DNS NAME}.azurehdinsight.net/templeton/v1/status?user.name=admin"
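When the only question is whether the Gateway answers with 502, it can help to print just the HTTP status code instead of the response body. The following is a minimal sketch, assuming bash and curl are available; the function names webhcat_status_url and webhcat_http_code and the cluster name mycluster are illustrative and not part of HDInsight.

```shell
#!/usr/bin/env bash
# Build the WebHCat status URL for a cluster DNS name (hypothetical helper).
webhcat_status_url() {
  local cluster="$1" user="${2:-admin}"
  printf 'https://%s.azurehdinsight.net/templeton/v1/status?user.name=%s\n' "$cluster" "$user"
}

# Print only the HTTP status code returned by the status call.
# -s silences progress output, -o /dev/null discards the response body,
# -w '%{http_code}' prints the numeric status code.
webhcat_http_code() {
  local cluster="$1" password="$2"
  curl -s -o /dev/null -w '%{http_code}\n' \
    -u "admin:${password}" "$(webhcat_status_url "$cluster")"
}

# Usage (needs network access to a real cluster):
#   webhcat_http_code mycluster '{HTTP PASSWD}'
```

A printed 502 points to the scenarios in this section; any other code means the Gateway reached WebHCat and the response itself should be inspected.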
For service-down scenarios, Ambari shows an alert at the top; clicking it shows the hosts on which WebHCat is not available.
A WebHCat outage can be mitigated by restarting the service on the host for which the alert was raised, as shown in the screenshot below.
If the WebHCat server does not come up, clicking through the operations will show the failures (in this specific case the server started successfully, so none are shown). For more detailed information, refer to the stderr and stdout files referenced on the node.
WebHCat service is up but not accepting requests
The WebHCat service might be up but not accepting new requests. One possibility is that the service has run out of socket connections. A quick way to validate this is to check the connection status using the command below.
$ netstat | grep 30111
30111 is the port WebHCat listens on, and the command above lists network connections to and from WebHCat. The number of connections should be very low (in the single digits).
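To compare the result against the "single digits" expectation, the matching lines can be counted rather than read by eye. This is a small sketch, assuming bash and a netstat that prints addresses as host:port; the function name count_port_connections is made up for illustration.

```shell
#!/usr/bin/env bash
# Count lines of netstat-style output (read from stdin) that mention a port.
# The port defaults to 30111, the WebHCat port named above.
count_port_connections() {
  local port="${1:-30111}"
  # Require a non-digit or end of line after the port so that, for example,
  # :301110 is not counted. grep -c exits non-zero when the count is 0,
  # so "|| true" keeps the function's exit status clean.
  grep -Ec ":${port}([^0-9]|\$)" || true
}

# Usage on a cluster node:
#   netstat -n | count_port_connections
```

The -n flag only keeps addresses numeric; the count reflects the same connections the plain netstat command lists.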
WebHCat times out
HDInsight Gateway times out responses that take longer than 2 minutes, resulting in “502 BadGateway”. WebHCat queries YARN services for job status, and if they take longer than that, the request might time out.
Below are the known common scenarios where a timeout might happen:
- List all jobs: This is a very expensive call. It enumerates the applications from the YARN ResourceManager and, for each completed application, gets the status from the JobHistoryServer. With a large number of jobs, this call might time out, resulting in 502.
- List jobs older than 7 days: The HDInsight YARN JobHistoryServer is configured (mapreduce.jobhistory.max-age-ms) to retain completed job information for 7 days. Trying to enumerate purged jobs results in a timeout and therefore a 502.
- WebHCat is under load: When WebHCat is under load, requests might time out.
The WebHCat server log file will contain failures for these cases. WebHCat log files are saved to /var/log/webhcat. The typical contents of the directory are:
- webhcat.log is the log4j log to which the server writes its logs
- webhcat-console.log is the stdout of the server when it is started
- webhcat-console-error.log is the stderr of the server process
NOTE: webhcat.log rolls over daily, so files like webhcat.log.YYYY-MM-DD will also be present. For logs from a specific time range, make sure that the appropriate file is selected.
A quick look through the logs goes like this:
- Figure out the UTC time range to troubleshoot
- Select the webhcat.log file based on the time range
- Look for WARN/ERROR messages during that period of time
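The steps above can be sketched as a pair of shell helpers. This is only an illustration, assuming bash, a log directory laid out as described above (passed as an optional second argument), and that each rolled-over file is named after the UTC day whose entries it holds; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Pick the WebHCat log file for a UTC date given as YYYY-MM-DD.
# Today's entries are in webhcat.log; earlier days are in webhcat.log.YYYY-MM-DD.
webhcat_log_for_date() {
  local day="$1" log_dir="${2:-/var/log/webhcat}"
  if [ "$day" = "$(date -u +%Y-%m-%d)" ]; then
    printf '%s/webhcat.log\n' "$log_dir"
  else
    printf '%s/webhcat.log.%s\n' "$log_dir" "$day"
  fi
}

# Print the WARN and ERROR lines from the log file for that date.
webhcat_warnings() {
  local file
  file="$(webhcat_log_for_date "$@")"
  grep -E 'WARN|ERROR' "$file"
}

# Usage on the headnode ({YYYY-MM-DD} is the UTC day being investigated):
#   webhcat_warnings {YYYY-MM-DD}
```

Narrowing the output further to a time window within the day is left to the reader, since it depends on the timestamp format in the log lines.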
HTTP status code 500
In most cases where WebHCat returns 500, the error message contains details on the failure. Otherwise, looking through the WebHCat log for ERROR/WARN messages will reveal the issue.
Job failures
This section covers cases where interactions with WebHCat succeed but the jobs themselves fail.
Check stderr of the job
Templeton collects the job console output as stderr in ‘statusdir’, which is often useful for troubleshooting. The stderr contains the YARN application id of the actual query, which can be used for further troubleshooting.
NOTE: For Hive workloads, adding the statement ‘set hive.root.logger=DEBUG,console’ at the start of the query collects more verbose logs into stderr.
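Pulling the YARN application id out of a long stderr can be done with a pattern match. The sketch below assumes bash and that ids follow the usual application_<cluster timestamp>_<sequence> form; the function name extract_yarn_app_ids is illustrative, and reading the file with hdfs dfs -cat assumes the statusdir is reachable from the node where the command runs.

```shell
#!/usr/bin/env bash
# Print the distinct YARN application ids found in a job's stderr text (stdin).
extract_yarn_app_ids() {
  # -o prints only the matching part of each line; sort -u removes repeats.
  grep -oE 'application_[0-9]+_[0-9]+' | sort -u
}

# Usage ({statusdir} is the statusdir passed at job submission):
#   hdfs dfs -cat {statusdir}/stderr | extract_yarn_app_ids
```

If nothing is printed, the stderr does not mention an application id, which usually means the job failed before a YARN application was submitted.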
Check YARN logs
When stderr doesn’t help, check the YARN application logs. Browse to the URL https://{DNS NAME}.azurehdinsight.net/yarnui/jobhistory/job/{job id}/m/SUCCESSFUL and click through the logs link to troubleshoot further.
A sample screenshot looks like this: