MeidokonWiki:

Tracing unexpected behaviour in Corosync's address selection

We've been looking at some of Corosync's internals recently, spurred on by one of our new HA (highly available) clusters spitting the dummy during testing. What we found isn't a "bug" per se (we're good at finding those), but a case where the correct behaviour isn't entirely clear. We thought the findings were worth sharing, and we hope you find them interesting even if you don't run any clusters yourself.

Observed behaviour

Before signing off on cluster deployments we run everything through its paces to ensure that it's behaving as expected. This means plenty of failovers and other stress-testing to verify that the cluster handles adverse situations properly.

Our standard clusters comprise two nodes with Corosync+Pacemaker, running a "stack" of managed resources. HA MySQL is a common example: DRBD, a mounted filesystem, the MySQL daemon and a floating IP address for MySQL.
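To make the "stack" idea concrete, here's a rough sketch in Python of how such a stack moves between nodes during a failover. It is not how Pacemaker is implemented. It assumes the resources start in the order listed above and stop in reverse (which is how a Pacemaker resource group behaves), and it treats DRBD as a plain start/stop resource to keep things short. The resource and node names are made up for illustration.

```python
# Hypothetical model of the HA MySQL stack; all names are illustrative only.
# Assumption: resources start in listed order and stop in reverse order.
STACK = ["drbd", "filesystem", "mysqld", "floating_ip"]


def start_order(stack):
    """Resources start bottom-up: each one relies on those before it."""
    return list(stack)


def stop_order(stack):
    """Resources stop top-down, the reverse of the start order."""
    return list(reversed(stack))


def failover_plan(stack, from_node, to_node):
    """Stop the whole stack on the old node, then start it on the new one."""
    steps = [("stop", name, from_node) for name in stop_order(stack)]
    steps += [("start", name, to_node) for name in start_order(stack)]
    return steps


if __name__ == "__main__":
    for action, name, node in failover_plan(STACK, "node-a", "node-b"):
        print(f"{action:5} {name:12} on {node}")
```

The point of the sketch is that a clean failover is strictly ordered: nothing starts on the new node until everything has stopped on the old one. That is the guarantee a partitioned cluster can no longer make on its own.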

During routine testing for a new customer we saw the cluster suddenly partition itself and go up in flames. One side was convinced there were three nodes in the cluster and called in vain for a STONITH response, while the other was convinced that its buddy had been nuked from orbit and attempted to snap up the resources. What was going on!?

It was time to start poring over the logs for evidence. To understand what happened you need to know how Corosync communicates between nodes in the cluster.

A crash-course in Corosync


Stuff above this line is "refined" article material

Stuff below this line is bullet-point notes that I got MC to help verify


How communication works

The problem as we see it

Why Corosync can select a different address

How the hack-patch avoids this

Why it's necessary

Other ways to dodge this

MeidokonWiki: CorosyncBindNetworkAddressSelection (last edited 2012-05-21 08:08:03 by furinkan)