- Telium

Participant

December 7, 2016 at 8:59 pm

Post count: 268

#6651

Dual active contention means that both peers in the cluster are active, and one or both peers discover that the other is active as well (they are contending). Upon discovering this situation the peers automatically negotiate which should remain active, and which should demote itself.

At a high level, the cause of this problem is that the (previously) standby peer concluded that the other peer was dead/unresponsive and that it needed to take over, so it promoted itself to active. The real question that needs answering is why that peer thought the other was dead/unresponsive.

The most common causes are:

HAAst Misconfiguration: If one of the peers continually cycles back and forth between active and standby, the most likely cause is a peerlink misconfiguration in haast.conf (peer A can talk to peer B, but peer B can’t talk to peer A). To resolve this, carefully check the settings in the peerlink stanza of haast.conf. If they look correct, run telnet connectivity tests to the HAAst port from each peer to the other, and run continual ping tests between the peers (using only management IPs), watching what happens before and after a contention.
Network Misconfiguration: If one of the peers occasionally cycles back and forth between active and standby, the most likely cause is a network misconfiguration. Again, peer A can talk to peer B, but peer B can’t talk to peer A. To resolve this, ensure that the network settings are correct both at the OS level and in haast.conf (voipnic stanza). This includes checking default routes (which may change when a shared IP is in use) and checking that no IP address has accidentally been reused while already in use elsewhere. If the settings look correct, run continual ping tests between the peers (using only management IPs) and watch what happens before and after a contention.
Peer Load/Responsiveness: If one of the peers suffers from periodic extreme load, HAAst will correctly assess its health as failing and allow the other peer to take over. To diagnose this, examine both hosts for high CPU load, runaway processes, high IO processes, etc. For example, the backup script included in FreePBX is poorly written and will cause very high CPU and/or IO load when it runs (causing the PBX to become unresponsive briefly). To resolve the problem, identify the process(es) or device(s) causing the high load and correct their behavior (e.g. switch to a real backup program).
LAN/WAN Latency: In cases where peers are separated by large geographic distances, the maximum latency setting in haast.conf may be set too low. On rare occasions, an overloaded or problematic LAN can cause the same problem. Although the symptom can be accommodated by increasing HAAst’s maximum latency setting, this is not always desirable. Be sure to understand the implications for detection and fail-over time (in legitimate peer failure situations). As well, if you are running the Commercial Unlimited edition of HAAst then latency is already being compensated for dynamically, so the maximum latency setting will reflect how severe the problem really is, and may warrant a general network diagnostic.
Network Interruption: This is actually not a problem. It means there was a network outage (one node could not reach the other), and the standby node correctly promoted itself. Once network connectivity was restored, it demoted itself. If this occurs rarely then there is nothing you need to do: HAAst is doing its job! If it occurs frequently, you should investigate the network for outages or intermittent connectivity.
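
If you would rather script the connectivity tests described in the list above than sit watching telnet and ping, something along these lines may help. It is only a rough sketch and not part of HAAst: the function names (probe_tcp, format_result, watch) are made up for illustration, and you have to supply the other peer’s management IP and the HAAst port from your own configuration. It tries a TCP connection once per interval and prints a timestamped line with the connect time, so you can line the output up against the time of a contention in the logs. Run it on both peers, each pointing at the other, since the failure is often in one direction only.

```python
import socket
import time
from datetime import datetime, timezone


def probe_tcp(host, port, timeout=2.0):
    """Attempt one TCP connection; return (reachable, elapsed_ms, error_text)."""
    start = time.monotonic()
    try:
        # Open and immediately close the connection; we only care whether
        # the handshake succeeds and how long it takes.
        with socket.create_connection((host, port), timeout=timeout):
            pass
        return True, (time.monotonic() - start) * 1000.0, ""
    except OSError as exc:  # refused, timed out, unreachable, DNS failure
        return False, (time.monotonic() - start) * 1000.0, str(exc)


def format_result(host, port, ok, ms, err, now=None):
    """Build one log line with a UTC timestamp for later correlation."""
    now = now or datetime.now(timezone.utc)
    status = "OK" if ok else "FAIL"
    line = f"{now.isoformat(timespec='seconds')} {host}:{port} {status} {ms:.1f}ms"
    return line + (f" {err}" if err else "")


def watch(host, port, interval=1.0, count=None):
    """Probe repeatedly; count=None means run until interrupted."""
    done = 0
    while count is None or done < count:
        ok, ms, err = probe_tcp(host, port)
        print(format_result(host, port, ok, ms, err), flush=True)
        done += 1
        time.sleep(interval)
```

To use it, call watch() with the other peer’s management IP and the HAAst port, redirect the output to a file, and leave it running until the next contention. A run of FAIL lines, or connect times that jump just before the contention, points to the network or latency causes above rather than to HAAst itself. Note that this measures TCP connect time only, which is not the same measurement HAAst uses for its latency setting.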

This type of problem can be one of the most challenging to resolve. You will need to enable full debugging in the HAAst logs, as well as system logs, to capture the data needed to diagnose it. You may need to involve your network admin, and possibly your WAN carrier. Telium will often work with clients through SSH to help identify the root cause and suggest a resolution.
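
For the system side of that data, a small load recorder left running on each peer can show whether a contention lines up with a load spike (such as a backup job). The following is a minimal sketch only, and not a HAAst feature. The names (load_per_cpu, describe, sample_forever) and the 1.5 per-CPU threshold are assumptions chosen for illustration, and it relies on os.getloadavg(), which is available on Linux/Unix hosts only. It records CPU load averages, not IO load, so treat it as a starting point.

```python
import os
import time
from datetime import datetime, timezone


def load_per_cpu(load1, cpus):
    """Normalise the 1-minute load average by the number of CPUs."""
    return load1 / max(cpus, 1)


def describe(load1, cpus, threshold=1.5, now=None):
    """Build one timestamped log line; flag it when load per CPU is high."""
    now = now or datetime.now(timezone.utc)
    ratio = load_per_cpu(load1, cpus)
    flag = "HIGH" if ratio >= threshold else "ok"
    return (f"{now.isoformat(timespec='seconds')} load1={load1:.2f} "
            f"cpus={cpus} per_cpu={ratio:.2f} {flag}")


def sample_forever(interval=5.0, threshold=1.5):
    """Print one sample per interval until interrupted (Unix only)."""
    cpus = os.cpu_count() or 1
    while True:
        load1, _, _ = os.getloadavg()  # 1-, 5- and 15-minute averages
        print(describe(load1, cpus, threshold), flush=True)
        time.sleep(interval)
```

Call sample_forever() on both hosts with the output redirected to a file, then compare the timestamps of any HIGH lines with the time of the contention in the HAAst log. The timestamps are in UTC, so allow for any offset from the log’s time zone.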

Reply To: Why do I see "dual active contention" in the HAAst log