Meidokon Wiki
Revision 1 as of 2012-05-21 04:29:49
MeidokonWiki:
  • CorosyncBindNetworkAddressSelection

How communication works

  • Corosync can communicate using either UDP multicast or UDP unicast. We use multicast
  • Data is sent into the cluster with a regular unicast address as the source and a multicast group as the destination
  • Data sent in multicast packets is enqueued on each node when it's received
  • In addition, a token is passed around the cluster in a ring. The token is sent by plain UDP unicast, using the same unicast source address as the multicast traffic

  • When a node receives the token, it processes the multicast data that has queued up, updates the token to record that, then passes the token to the next node
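
The flow above can be sketched as a toy model. This is only an illustration of the queue-then-token idea: the struct and function names are made up for this page, and the real totem token carries far more state than a single counter.

```c
#include <stddef.h>

/* Toy model only; none of these names exist in Corosync. */
struct toy_node {
    int queued;     /* multicast messages received but not yet processed */
    int processed;  /* messages this node has processed so far */
};

struct toy_token {
    size_t holder;       /* index of the node currently holding the token */
    unsigned rotations;  /* hypothetical: completed trips around the ring */
    int noted;           /* hypothetical: messages noted on the token so far */
};

/* A multicast send lands in every node's queue. */
static void toy_multicast(struct toy_node *nodes, size_t n)
{
    for (size_t i = 0; i < n; i++)
        nodes[i].queued++;
}

/* The holder drains its queue, notes that on the token, then hands the
 * token to the next node in the ring (a unicast send in the real thing). */
static void toy_pass_token(struct toy_node *nodes, size_t n, struct toy_token *tok)
{
    struct toy_node *me = &nodes[tok->holder];

    me->processed += me->queued;
    tok->noted += me->queued;
    me->queued = 0;

    tok->holder = (tok->holder + 1) % n;
    if (tok->holder == 0)
        tok->rotations++;
}
```

With three nodes and two calls to toy_multicast(), each toy_pass_token() call drains one node's queue of two messages and moves the token on by one position.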

The problem as we see it

  • Corosync takes the bindnetaddr config parameter

  • For IPv4 it tries to automatically find a match against the system's configured IP addresses
    • This behaviour is not configurable
    • This is a feature - it means you can use the exact same config file on all nodes in the cluster
    • When using IPv6, no such automatic selection is made
  • To do this, Corosync enumerates the system's IPs and tries to find a match for the bindnetaddr specification

  • For each address it takes the address's netmask, applies it to both the address and the spec, and checks whether (IP & mask) == (spec & mask)
  • It's possible for a floating pacemaker-managed IP to match/overlap your bindnetaddr IP, eg:

    • 192.168.0.1/24 - static IP used for cluster traffic
    • 192.168.0.42/24 - a floating "service IP" used for an HA service
  • If Corosync re-enumerates the IPs sometime after startup (could happen any time, as far as we're concerned), it can find the "new" IP (the floating IP) and select that as the new local address for cluster communications
    • The enumeration happens here: https://github.com/corosync/corosync/blob/master/exec/totemip.c#L342

  • Suddenly Pacemaker sees a third node in the cluster
  • Corosync also thinks that the cluster has been partitioned, as the old address (192.168.0.1 in our example) has suddenly disappeared
  • Firewall rules will also be dropping any traffic from the now-in-use floating IP, which causes further trouble
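
A minimal sketch of the mask comparison described above, to show why the floating IP is a valid candidate. The function name is hypothetical, and the bindnetaddr value 192.168.0.0 used in the usage note is an assumption (the page doesn't state the actual setting). This is not Corosync's code, just the same arithmetic.

```c
#include <arpa/inet.h>
#include <netinet/in.h>

/* Hypothetical helper: returns 1 if addr is in the same network as
 * bindnetaddr under netmask, 0 if not, -1 if a string fails to parse.
 * All three arguments are dotted-quad IPv4 strings. */
static int toy_bindnet_match(const char *addr, const char *netmask,
                             const char *bindnetaddr)
{
    struct in_addr a, m, b;

    if (inet_pton(AF_INET, addr, &a) != 1 ||
        inet_pton(AF_INET, netmask, &m) != 1 ||
        inet_pton(AF_INET, bindnetaddr, &b) != 1)
        return -1;

    /* AND and equality are byte-order independent, so the values can
     * stay in network byte order. */
    return (a.s_addr & m.s_addr) == (b.s_addr & m.s_addr);
}
```

Under that assumption, both toy_bindnet_match("192.168.0.1", "255.255.255.0", "192.168.0.0") and toy_bindnet_match("192.168.0.42", "255.255.255.0", "192.168.0.0") return 1, so the static and the floating address are equally good matches as far as the mask test is concerned.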

Why Corosync can select a different address

  • totemip_getifaddrs() gets all the addresses from the kernel and puts them in a linked list. You can think of the entries as tuples of (name, IP)

  • It does so by prepending to the head of the list
  • As a result, "later" addresses appear at the head of the list
  • When Corosync goes to traverse the list, it hits them in the reverse order of what a human would tend to expect
    • NB: the listing from the kernel is probably in undefined (ie. arbitrary) order?

  • Corosync uses the first match it finds
  • Example of a possible linked list:

    NAME         eth1            eth0:mysql          eth0:nfs           eth0
    ADDRESS      10.1.1.1   ->   192.168.0.42   ->   192.168.0.7   ->   192.168.0.1
                 (backups)       (HA floating)       (static)           (static, should be used for cluster traffic)
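
The effect of prepending plus first-match can be reproduced with a small sketch. It assumes the kernel hands the addresses over in the order eth0, eth0:nfs, eth0:mysql, eth1 (which yields the list shown above) and that bindnetaddr is 192.168.0.0 with a /24 mask. All names here are made up for illustration, not taken from totemip.c.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>

/* Illustrative stand-in for an entry in Corosync's address list. */
struct toy_if_addr {
    const char *name;
    const char *ip;
    struct toy_if_addr *next;
};

/* Prepend: the newest entry becomes the head of the list. */
static struct toy_if_addr *toy_prepend(struct toy_if_addr *head,
                                       struct toy_if_addr *node)
{
    node->next = head;
    return node;
}

/* Walk from the head and return the name of the first entry whose
 * address matches the spec under the mask, or NULL if none does. */
static const char *toy_first_match(const struct toy_if_addr *head,
                                   const char *bindnetaddr,
                                   const char *netmask)
{
    struct in_addr spec, mask, ip;

    if (inet_pton(AF_INET, bindnetaddr, &spec) != 1 ||
        inet_pton(AF_INET, netmask, &mask) != 1)
        return NULL;

    for (; head != NULL; head = head->next) {
        if (inet_pton(AF_INET, head->ip, &ip) != 1)
            continue;
        if ((ip.s_addr & mask.s_addr) == (spec.s_addr & mask.s_addr))
            return head->name;
    }
    return NULL;
}

/* Build the example list in the assumed kernel order and select. */
static const char *toy_example_selection(void)
{
    static struct toy_if_addr addrs[] = {
        { "eth0",       "192.168.0.1",  NULL },
        { "eth0:nfs",   "192.168.0.7",  NULL },
        { "eth0:mysql", "192.168.0.42", NULL },
        { "eth1",       "10.1.1.1",     NULL },
    };
    struct toy_if_addr *head = NULL;

    for (size_t i = 0; i < sizeof(addrs) / sizeof(addrs[0]); i++)
        head = toy_prepend(head, &addrs[i]);

    return toy_first_match(head, "192.168.0.0", "255.255.255.0");
}
```

Here toy_example_selection() returns "eth0:mysql": eth1 is skipped because it's in a different network, and the floating address is the first entry in 192.168.0.0/24 that the traversal reaches.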

How the hack-patch avoids this

  • This is the hack fix: http://packages.engineroom.anchor.net.au/temp/corosync-2.0.0-ignore-ip-aliases.patch

    • It's a huge hack
    • Here it is inlined:

      diff -ruN corosync-2.0.0.orig/exec/totemip.c corosync-2.0.0/exec/totemip.c
      --- corosync-2.0.0.orig/exec/totemip.c  2012-04-10 21:09:12.000000000 +1000
      +++ corosync-2.0.0/exec/totemip.c       2012-05-09 15:03:51.272429481 +1000
      @@ -358,6 +358,9 @@
                          (ifa->ifa_netmask->sa_family != AF_INET && ifa->ifa_netmask->sa_family != AF_INET6))
                              continue ;
       
      +               if (ifa->ifa_name && strchr(ifa->ifa_name, ':'))
      +                       continue ;
      +
                      if_addr = malloc(sizeof(struct totem_ip_if_address));
                      if (if_addr == NULL) {
                              goto error_free_ifaddrs;
      @@ -384,7 +387,7 @@
                              goto error_free_addr_name;
                      }
       
      -               list_add(&if_addr->list, addrs);
      +               list_add_tail(&if_addr->list, addrs);
              }
       
              freeifaddrs(ifap);
      @@ -449,6 +452,9 @@
                              if (lifreq[i].lifr_addr.ss_family != AF_INET && lifreq[i].lifr_addr.ss_family != AF_INET6)
                                      continue ;
       
      +                       if (lifreq[i].lifr_name && strchr(lifreq[i].lifr_name, ':'))
      +                               continue ;
      +
                              if_addr = malloc(sizeof(struct totem_ip_if_address));
                              if (if_addr == NULL) {
                                      goto error_free_ifaddrs;
      @@ -484,7 +490,7 @@
                                      if_addr->interface_num = lifreq[i].lifr_index;
                              }
       
      -                       list_add(&if_addr->list, addrs);
      +                       list_add_tail(&if_addr->list, addrs);
                      }
       
                      free (lifconf.lifc_buf);

  • Skip any address whose interface name has a colon in it (ie. an alias such as eth0:mysql)

  • Append to the tail of the list instead of prepending to the head, hopefully matching an "expected" ordering
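
The two changes can be tried out on the example addresses with a standalone sketch. It mirrors the idea of the patch (colon filter plus tail append) but is not the patched Corosync code; the names, the assumed kernel order (eth0, eth0:nfs, eth0:mysql, eth1) and the bindnetaddr of 192.168.0.0/24 are all assumptions for illustration.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>

/* Illustrative stand-ins, not Corosync's own types. */
struct demo_if_addr {
    const char *name;
    const char *ip;
    struct demo_if_addr *next;
};

struct demo_list {
    struct demo_if_addr *head;
    struct demo_if_addr *tail;
};

/* Same two rules as the patch: ignore alias names, append to the tail. */
static void demo_add_patched(struct demo_list *l, struct demo_if_addr *node)
{
    if (strchr(node->name, ':') != NULL)
        return;                     /* eg. eth0:mysql is never listed */

    node->next = NULL;
    if (l->tail == NULL)
        l->head = node;
    else
        l->tail->next = node;
    l->tail = node;
}

/* First entry in the list that matches the spec under the mask. */
static const char *demo_first_match(const struct demo_list *l,
                                    const char *bindnetaddr,
                                    const char *netmask)
{
    struct in_addr spec, mask, ip;
    const struct demo_if_addr *cur;

    if (inet_pton(AF_INET, bindnetaddr, &spec) != 1 ||
        inet_pton(AF_INET, netmask, &mask) != 1)
        return NULL;

    for (cur = l->head; cur != NULL; cur = cur->next) {
        if (inet_pton(AF_INET, cur->ip, &ip) != 1)
            continue;
        if ((ip.s_addr & mask.s_addr) == (spec.s_addr & mask.s_addr))
            return cur->name;
    }
    return NULL;
}

/* Feed in the example addresses in the assumed kernel order and select. */
static const char *demo_patched_selection(void)
{
    static struct demo_if_addr addrs[] = {
        { "eth0",       "192.168.0.1",  NULL },
        { "eth0:nfs",   "192.168.0.7",  NULL },
        { "eth0:mysql", "192.168.0.42", NULL },
        { "eth1",       "10.1.1.1",     NULL },
    };
    struct demo_list list = { NULL, NULL };

    for (size_t i = 0; i < sizeof(addrs) / sizeof(addrs[0]); i++)
        demo_add_patched(&list, &addrs[i]);

    return demo_first_match(&list, "192.168.0.0", "255.255.255.0");
}
```

With the aliases filtered out the list is just eth0 -> eth1, so demo_patched_selection() returns "eth0". Note the filter relies on the floating address carrying an alias-style name; as far as we understand, an address added without a label reports the plain interface name and would not be skipped.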

Why it's necessary

  • Netlink is used to interrogate the kernel for addresses

  • The interface/protocol used is old, and doesn't know about primary/secondary/other addresses
  • This basically means there's no way to specify additional criteria for address selection, or to exclude particular addresses from selection

    • In theory Corosync could be patched to use a newer interface/protocol that can retrieve this information from the kernel
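
As a rough idea of what that newer interface looks like on Linux, the sketch below dumps IPv4 addresses over rtnetlink (RTM_GETADDR) and counts how many the kernel flags as secondary (IFA_F_SECONDARY). It is Linux-only, the function name is made up, and it's not a proposal for how Corosync should actually be patched, just a demonstration that the primary/secondary information is retrievable.

```c
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: counts primary and secondary IPv4 addresses.
 * Returns 0 on success, -1 on any socket or protocol error. */
static int sketch_count_ipv4(int *primary, int *secondary)
{
    struct {
        struct nlmsghdr nh;
        struct ifaddrmsg ifa;
    } req;
    struct nlmsghdr buf[8192 / sizeof(struct nlmsghdr)];
    int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

    if (fd < 0)
        return -1;

    *primary = 0;
    *secondary = 0;

    /* Ask the kernel for a dump of all IPv4 addresses. */
    memset(&req, 0, sizeof(req));
    req.nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifaddrmsg));
    req.nh.nlmsg_type = RTM_GETADDR;
    req.nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
    req.ifa.ifa_family = AF_INET;

    if (send(fd, &req, req.nh.nlmsg_len, 0) < 0) {
        close(fd);
        return -1;
    }

    /* The reply may span several datagrams; NLMSG_DONE ends the dump. */
    for (;;) {
        int len = (int)recv(fd, buf, sizeof(buf), 0);
        struct nlmsghdr *nh;

        if (len <= 0) {
            close(fd);
            return -1;
        }

        for (nh = buf; NLMSG_OK(nh, len); nh = NLMSG_NEXT(nh, len)) {
            struct ifaddrmsg *ifa;

            if (nh->nlmsg_type == NLMSG_DONE) {
                close(fd);
                return 0;
            }
            if (nh->nlmsg_type == NLMSG_ERROR) {
                close(fd);
                return -1;
            }
            if (nh->nlmsg_type != RTM_NEWADDR)
                continue;

            /* ifa_flags describes the address itself, not the interface. */
            ifa = NLMSG_DATA(nh);
            if (ifa->ifa_flags & IFA_F_SECONDARY)
                (*secondary)++;
            else
                (*primary)++;
        }
    }
}
```

For IPv4 the kernel marks an additional address in the same subnet on the same interface as secondary, so in the example above 192.168.0.42 would be counted as secondary while 192.168.0.1 stays primary. The address itself is carried in the IFA_LOCAL attribute of each message, which this sketch doesn't parse.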

Other ways to dodge this

  • Use IPv6
  • All previous clusters use a separate subnet and NIC for cluster traffic, so this doesn't happen

    • It's happened this time because cluster traffic is in the same subnet as internal service addresses
    • We didn't see a point in using a separate subnet in this case
      • Because we don't put two subnets on the same network segment, we would have had to configure another NIC on each machine, which means another VLAN between the two - it seemed like overkill