"Here's a 7 node, Hortonworks Hadoop cluster, metrics is broken, fix it, go!"
The first indication that metrics were broken showed up in the Services tab for Ambari Metrics: it reported an error and showed the Metrics Collector as Stopped. The error itself, however, wasn't very informative:
Connection failed: [Errno 111] Connection refused...
That didn't tell me much at all, and neither did googling.
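Connection refused generally just means nothing answered on the collector's port, so one quick sanity check is to see whether anything on the Metrics Collector host is listening there at all. As far as I know the default collector port is 6188 (set by timeline.metrics.service.webapp.address in ams-site), so adjust if yours differs:
# On the Metrics Collector host: is anything listening on the collector port?
# 6188 is the default; check timeline.metrics.service.webapp.address in ams-site if unsure
ss -ltnp | grep 6188 || echo "nothing listening on 6188"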
(I hope the title of this blog helps someone else find this solution quicker.)
I was able to locate several log files on the host where the Metrics Collector is installed, in the following directory:
/var/log/ambari-metrics-collector/
Here's a list of the logs I started digging through:
hbase-ams-master-[Server].out
hbase-ams-master-[Server].log
ambari-metrics-collector-startup.out
ambari-metrics-collector.out
ambari-metrics-collector.log
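To cut through them quickly I mostly just grepped for warnings and errors; nothing fancy, something along these lines:
cd /var/log/ambari-metrics-collector/
# Pull out the most recent WARN/ERROR lines from the main collector log
grep -E "WARN|ERROR" ambari-metrics-collector.log | tail -n 50
# Or watch it live while restarting the service
tail -f ambari-metrics-collector.log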
The ambari-metrics-collector.log was the most informative; it contained errors like the following:
WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
WARN org.apache.hadoop.yarn.webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR
WARN org.apache.hadoop.hbase.ipc.CoprocessorRpcChannel: Call failed on IOException
ERROR org.apache.hadoop.yarn.server.applicationhistoryservice.webapp.TimelineWebServices: Error saving metrics.
As you can see, the errors referenced related components, including YARN, HBase, and ZooKeeper. This sent me down quite the rabbit hole, wondering which component was actually having the issue.
In the end it occurred to me that the collector might be getting hung up trying to process all the data the Metrics Monitors had been gathering while it was down, so I decided to try trashing the historical data and resetting Ambari Metrics.
I used this article to help me through it.
In short I did the following:
- First I stopped everything Metrics related (the collector, Grafana, and the monitors) via the Stop service action under Ambari Metrics in Ambari.
- I then backed up and removed the following directories by renaming them:
- /var/lib/ambari-metrics-collector
- /data/var/lib/ambari-metrics-collector/hbase
- /var/var/lib/ambari-metrics-collector/hbase-tmp
- Finally I restarted everything Metrics related via the Start service action under Ambari Metrics in Ambari.
Magic!!!
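For reference, the backup-and-rename step boils down to something like the following on the collector host. This is only a sketch: the hbase and hbase-tmp locations come from hbase.rootdir and hbase.tmp.dir in the AMS HBase config, so yours may well differ from mine.
# Run on the Metrics Collector host, with Ambari Metrics stopped
# Paths are from my cluster; check hbase.rootdir / hbase.tmp.dir in ams-hbase-site for yours
mv /var/lib/ambari-metrics-collector /var/lib/ambari-metrics-collector.bak
mv /data/var/lib/ambari-metrics-collector/hbase /data/var/lib/ambari-metrics-collector/hbase.bak
mv /var/var/lib/ambari-metrics-collector/hbase-tmp /var/var/lib/ambari-metrics-collector/hbase-tmp.bak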
While troubleshooting this issue I also came across this list of known issues with Ambari Metrics:
https://cwiki.apache.org/confluence/display/AMBARI/Known+Issues
That raised the question: which version am I actually running?
I ran this to find out:
rpm -qa | sort | grep ambari
I may have to look into the NORMALIZATION_ENABLED workaround as a proactive measure.
Along the way I found a couple neat little tricks:
- How to check if the Metrics Collector is running from the CLI:
ambari-metrics-collector status
- URL to get some more error details out of Ambari:
http://[Server:Port]/api/v1/clusters/[ClusterName]/alerts?fields=*&Alert/state.in(CRITICAL,WARNING)
- Default URL for HBase:
http://[Server]:61310/master-status
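If you'd rather pull that alerts list from the command line, a curl along these lines does the trick (assuming Ambari's usual basic auth; the admin credentials here are placeholders):
# List alerts currently in CRITICAL or WARNING state
# Quote the URL so the shell doesn't mangle the & and parentheses
curl -u admin:admin \
  'http://[Server:Port]/api/v1/clusters/[ClusterName]/alerts?fields=*&Alert/state.in(CRITICAL,WARNING)'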
Had a stuck metrics collector as well, followed your approach, worked like a charm.
Thanks!
Thank you so much! My issue resolved as well