8/26/2016

Ambari Metrics Collector Not Starting (Connection failed: [Errno 111] Connection refused)

Last week I had a bit of a trial by fire:
"Here's a 7 node, Hortonworks Hadoop cluster, metrics is broken, fix it, go!"

The first sign that metrics was broken showed up in the Services tab for Ambari Metrics, which reported an error and listed Metrics Collector as Stopped. The error, however, wasn't very informative:

Connection failed: [Errno 111] Connection refused...

That didn't tell me much at all, and neither did googling.
(I hope the title of this blog helps someone else find this solution quicker.)

On the host where Metrics Collector is installed, I located several log files in the following directory:
/var/log/ambari-metrics-collector/

Here's a list of the logs I started digging through:
 hbase-ams-master-[Server].out
 hbase-ams-master-[Server].log
 ambari-metrics-collector-startup.out
 ambari-metrics-collector.out
 ambari-metrics-collector.log

The ambari-metrics-collector.log was the most informative, and it contained errors like the following:

WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect

WARN org.apache.hadoop.yarn.webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR

WARN org.apache.hadoop.hbase.ipc.CoprocessorRpcChannel: Call failed on IOException

ERROR org.apache.hadoop.yarn.server.applicationhistoryservice.webapp.TimelineWebServices: Error saving metrics.

As you can see, the errors referenced related components including YARN, HBase, and ZooKeeper. This sent me down quite the rabbit hole, wondering which component was actually having the issue.
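
A quick tally of WARN and ERROR lines per logger class can take some of the guesswork out of that kind of triage. Here's a minimal sketch, not something I ran at the time. The function name summarize_log_errors is made up for illustration, and it assumes each log line has the level, then the logger class, then a colon, like the excerpts above.

```shell
# Hypothetical helper: count WARN and ERROR lines per logger class.
# Assumes lines look like "... WARN some.logger.Class: message".
summarize_log_errors() {
    # $1: path to a log file, e.g. ambari-metrics-collector.log
    awk '
        match($0, /(WARN|ERROR) +[^: ]+:/) {
            # Keep "LEVEL logger.Class" and drop the trailing colon
            entry = substr($0, RSTART, RLENGTH - 1)
            count[entry]++
        }
        END { for (e in count) print count[e], e }
    ' "$1" | sort -rn
}

# Example (path from this post):
# summarize_log_errors /var/log/ambari-metrics-collector/ambari-metrics-collector.log
```

The noisiest class isn't necessarily the culprit, but it shows at a glance which components are complaining the most.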

In the end it occurred to me that maybe the collector was getting hung up trying to handle all the data the Metrics Monitors had been gathering while the collector was down.  I then decided to try trashing the historical data and resetting the Metrics.

I used this article to help me through it.
In short I did the following:
  1. First I stopped everything Metrics related (the collector, Grafana, and the monitors) via Ambari > Service action Stop under Ambari Metrics.
  2. I then backed up and removed, via rename, the following directories:
    • /var/lib/ambari-metrics-collector
    • /data/var/lib/ambari-metrics-collector/hbase
    • /var/var/lib/ambari-metrics-collector/hbase-tmp
  3. Finally I restarted everything Metrics related via Ambari > Service action Start under Ambari Metrics.
Magic!!!
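
The "backup and remove via rename" part of step 2 can be sketched as a small shell function. This is an illustration under assumptions, not the exact commands I typed: backup_by_rename is a hypothetical name, and the ".bak-<timestamp>" suffix is just one possible convention.

```shell
# Hypothetical helper: move a directory aside by renaming it,
# so the collector starts with an empty data store but the old data is kept.
backup_by_rename() {
    # $1: directory to move aside
    dir=${1%/}
    if [ ! -d "$dir" ]; then
        echo "skip: $dir does not exist" >&2
        return 0
    fi
    stamp=$(date +%Y%m%d-%H%M%S)
    # Print the new location so it is easy to find (or restore) later
    mv -- "$dir" "$dir.bak-$stamp" && echo "$dir.bak-$stamp"
}

# Example, only after all Ambari Metrics components are stopped:
# backup_by_rename /var/lib/ambari-metrics-collector
```

Renaming rather than deleting means the old data can be put back if the reset doesn't help.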

While troubleshooting this issue I also came across this list of known issues with Ambari Metrics:
https://cwiki.apache.org/confluence/display/AMBARI/Known+Issues

Which raised the question: which version am I actually running?
I ran this to find out:
rpm -qa | sort | grep ambari

I may have to look into the NORMALIZATION_ENABLED workaround as a proactive measure.

Along the way I found a couple of neat little tricks:
  • How to check from the CLI whether Metrics Collector is running:
    ambari-metrics-collector status
  • URL to get some more error details out of Ambari:
    http://[Server:Port]/api/v1/clusters/[ClusterName]/alerts?fields=*&Alert/state.in(CRITICAL,WARNING)
  • Default URL for HBase:
    http://[Server]:61310/master-status
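
The alerts URL above also works from the command line. Here's a minimal sketch, assuming curl is available and that your Ambari user can log in with basic auth. The function name alerts_url, the host ambari.example.com:8080, and the cluster name MyCluster are placeholders I made up. Note that the URL has to be quoted, since the shell would otherwise treat the & and the parentheses specially.

```shell
# Hypothetical helper: build the alerts URL from the trick above.
alerts_url() {
    # $1: Ambari server and port, $2: cluster name
    printf 'http://%s/api/v1/clusters/%s/alerts?fields=*&Alert/state.in(CRITICAL,WARNING)\n' "$1" "$2"
}

# Example (placeholder host, cluster, and user; curl prompts for the password):
# curl -s -u admin "$(alerts_url ambari.example.com:8080 MyCluster)"
```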