MeidokonWiki:

Tracing unexpected behaviour in Corosync's address selection

We've been looking at some of Corosync's internals recently, spurred on by one of our new HA (highly-available) clusters spitting the dummy during testing. What we found isn't a "bug" per se (we're good at finding those), but a case where the correct behaviour isn't entirely clear.

We thought the findings were worth sharing, and we hope you find them interesting even if you don't run any clusters yourself.

Disclaimer: this is only what we've found so far. We have yet to formalise it and perform further testing so that we can present proper questions to the Corosync developers.

Observed behaviour

Before signing off on cluster deployments we put everything through its paces to ensure that it's behaving as expected. This means plenty of failovers and other stress-testing to verify that the cluster handles adverse situations properly.

Our standard clusters comprise two nodes with Corosync+Pacemaker, running a "stack" of managed resources. HA MySQL is a common example: DRBD, a mounted filesystem, the MySQL daemon and a floating IP address for MySQL.

During routine testing for a new customer we saw the cluster suddenly partition itself and go up in flames. One side was convinced there were three nodes in the cluster and called in vain for a STONITH response, while the other was convinced that its buddy had been nuked from orbit and attempted to snap up the resources. What was going on!?

It was time to start poring over the logs for evidence. To understand what happened you need to know how Corosync communicates between nodes in the cluster.

A crash-course in Corosync

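Linux HA is split into a number of parts that have changed significantly over time. At its simplest, you can consider two major components: Corosync, which handles cluster membership and messaging, and Pacemaker, which manages the resources.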

We're only interested in Corosync, specifically its communication layer.

There are two major types of communication in Corosync: the shared cluster data, and a "token" which is passed around the ring of cluster nodes (this is a conceptual ring, not a physical network ring). The token is used to manage connectivity and provide the synchronisation guarantees necessary for the cluster to operate [1].

The token is always transferred by unicast UDP. Cluster data can be sent either by multicast UDP or unicast UDP - we use multicast. In either case, the source address is always a normal unicast address.

Given this, a 4-node cluster looks something like this:

corosync_communication.png
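
The token mechanics described above (and in footnote 1) can be sketched as a toy model. This is only an illustration of the idea, not Corosync's implementation: the `Node` class, the `multicast` and `rotate_token` helpers and the message names are all made up for this example, and real Corosync deals with timeouts, retransmission and membership changes that are ignored here.

```python
from collections import deque


class Node:
    """A toy cluster node: queues incoming data until it holds the token."""

    def __init__(self, name):
        self.name = name
        self.inbox = deque()   # cluster data received but not yet processed
        self.processed = []    # data handled while holding the token

    def receive_data(self, message):
        self.inbox.append(message)

    def handle_token(self):
        # Drain everything that queued up since the token last visited.
        while self.inbox:
            self.processed.append(self.inbox.popleft())


def multicast(nodes, message):
    """Deliver one message to every node, as multicast would."""
    for node in nodes:
        node.receive_data(message)


def rotate_token(nodes):
    """Pass the token once around the conceptual ring, in list order."""
    order = []
    for node in nodes:
        node.handle_token()
        order.append(node.name)
    return order


# Hypothetical 4-node ring with two pieces of cluster data in flight.
ring = [Node(n) for n in ("A", "B", "C", "D")]
multicast(ring, "config-change-1")
multicast(ring, "config-change-2")
visit_order = rotate_token(ring)
```

After one full rotation every node has processed the same data in the same order, which is the kind of guarantee the token exists to provide.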

One convenient feature in Corosync is its automatic selection of source address. This is done by comparing the bindnetaddr config directive against all IP addresses on the system and finding a suitable match. The cool thing about this is that you can use the exact same config file for all nodes in your cluster, and everything should Just Work™.

Automatic source-address selection is always used for IPv4; it's not negotiable. It's never done for IPv6, where addresses are used exactly as supplied to bindnetaddr.

Interestingly, you only supply an address to bindnetaddr, such as 192.168.0.42. CIDR notation is not used, as might be commonly expected when referring to a subnet. Instead, Corosync compares each of the system's addresses (plus the associated netmask) against bindnetaddr, applying the same netmask. This diagram demonstrates a typical setup:

a diagram goes here

Caption: While somewhat contrived, we can see how the local address is determined as intended.
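
The comparison can be approximated in a few lines of Python. This is a minimal sketch of the matching rule as we've described it, not a translation of Corosync's C code: the function name `select_bind_address`, the interface list and all of the addresses are hypothetical, and returning the first match is an assumption made for the example.

```python
import ipaddress


def select_bind_address(bindnetaddr, local_interfaces):
    """Return the first local address whose network matches bindnetaddr.

    local_interfaces is a list of (address, netmask) string pairs.
    Picking the *first* match is an assumption made for this sketch.
    """
    for address, netmask in local_interfaces:
        iface = ipaddress.IPv4Interface(f"{address}/{netmask}")
        # Apply this interface's netmask to bindnetaddr as well, then
        # compare the two resulting network addresses.
        masked_target = ipaddress.IPv4Interface(f"{bindnetaddr}/{netmask}")
        if iface.network == masked_target.network:
            return str(iface.ip)
    return None


# Hypothetical host with loopback plus two real interfaces.
interfaces = [
    ("127.0.0.1", "255.0.0.0"),
    ("10.0.0.5", "255.255.255.0"),
    ("192.168.0.11", "255.255.255.0"),
]
chosen = select_bind_address("192.168.0.42", interfaces)
```

Here `chosen` is 192.168.0.11: it's the only address that lands on the same network as bindnetaddr once its netmask has been applied to both sides.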

The problem as we see it

The key here is in two parts. Firstly, it's possible for a floating Pacemaker-managed IP to match against your bindnetaddr specification. As an example:
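
A hypothetical illustration, using made-up addresses: suppose a node holds the fixed address 192.168.0.11, and Pacemaker brings up a floating MySQL address, 192.168.0.50, on the same subnet. The sketch below lists every local address that satisfies a bindnetaddr of 192.168.0.42. The helper `matching_addresses` is our own invention, and the addresses are written in CIDR form purely for brevity in the code (Corosync itself doesn't take CIDR notation).

```python
import ipaddress


def matching_addresses(bindnetaddr, local_addresses):
    """List every local address on the same network as bindnetaddr."""
    target = ipaddress.ip_address(bindnetaddr)
    matches = []
    for cidr in local_addresses:
        iface = ipaddress.ip_interface(cidr)
        # bindnetaddr matches if it falls inside this address's network.
        if target in iface.network:
            matches.append(str(iface.ip))
    return matches


# Hypothetical node: one fixed address plus a floating IP from Pacemaker.
node_addresses = [
    "192.168.0.11/24",  # fixed address, intended for cluster traffic
    "192.168.0.50/24",  # floating MySQL address managed by Pacemaker
]
candidates = matching_addresses("192.168.0.42", node_addresses)
ambiguous = len(candidates) > 1
```

Both addresses match, so there's more than one valid answer to "which address should cluster traffic come from?" on this node.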

Secondly, Corosync sometimes re-enumerates the network addresses on the host for automatic source address selection [2]. We're not 100% sure of the circumstances under which this occurs, but a classic example would be while performing a rolling upgrade of the cluster software. A normal process would be to unmanage all your resources, stop Pacemaker and Corosync, upgrade, then bring them back up again and remanage your resources.

Taken together, this can lead to the cluster getting very confused when one host's unicast address for cluster traffic suddenly changes. Consider a 2-node cluster comprising nodes A and B:

  1. A's address changes
  2. The token drops as a result of the ring breaking
  3. B thinks A is offline: A's new address isn't allowed through our outbound firewall rules, so B doesn't receive anything from A
  4. A also thinks B is offline, because B can't hear A
  5. In the meantime, A will "discover" a new cluster node: itself, operating on the new address!

This state is passed on to Pacemaker, which attempts to juggle the resources to satisfy its policy. For resources that were already running on A, it now sees duplicate copies, which isn't allowed. Original-A asks for new-A to be STONITH'd. Meanwhile, B is just trying to get all the resources started again.

This is decidedly not ideal. What really got us is that we hadn't seen this behaviour in any previously deployed cluster; something was different.

Normally we'll dedicate a separate subnet to cluster traffic, keeping it away from "production traffic" like load-balanced MySQL or HTTP. This time we didn't, opting to reuse the production subnet, because we don't like "polluting" a network segment with multiple IP subnets. We could have set up another VLAN to confine the subnet (avoiding pollution), but it would've meant putting that into our switches just for two physical machines, which seemed like overkill.

Footnotes

1. Cluster data is enqueued on each node when it's received. When a node receives the token, it processes the multicast data that has queued-up, does whatever it needs to, then passes the token to the next node.

2. The enumeration happens here, both for startup and "refreshing": https://github.com/corosync/corosync/blob/master/exec/totemip.c#L342


Stuff above this line is "refined" article material

Stuff below this line is bullet-point notes that I got MC to help verify


Why Corosync can select a different address

How the hack-patch avoids this

Why it's necessary

Other ways to dodge this

MeidokonWiki: CorosyncBindNetworkAddressSelection (last edited 2012-05-22 12:15:19 by furinkan)