39. LVS: Realserver failure handled by Mon

39.1. Introduction

Don't even think about doing this till you've got your LVS working properly. If you want the LVS to survive a server or director failure, you can add software to do this after you have the LVS working. For production systems, failover may be useful or required.

An agent external to the ipvs code on the director is used to monitor the services. LVS itself can't monitor the services, as LVS is just a packet switcher: if a realserver fails, the director doesn't see the failure, the client does. For the standard LVS-Tun and LVS-DR setups (ie receiving packets by an ethernet device and not by transparent proxy), the reply packets from the realserver go to its default gw and don't go through the director, so the LVS couldn't detect failure even if it wanted to. For some of the mailings concerning why the LVS does not monitor itself and why an external agent (eg mon) is used instead, see the postings on external agents.

In a failure protected LVS, if the realserver at the end of a connection fails, the client will lose their connection to the LVS and will have to start a new connection, as would happen on a regular single server. With a failure protected LVS, the failed realserver will be switched out of the LVS transparently: the client will connect to one of the still-working servers, or possibly a new server if one is brought on-line.

If the service is http, losing the connection is not a problem for the client: they'll get a new connection next time they click on a link/reload. For services which maintain a connection, losing the connection will be a problem.

ratz ratz (at) tac (dot) ch 16 Nov 2000

This is very nasty for persistent setups in an e-commerce environment. Take for example a simple e-com site offering some goods to buy. You can browse and view all their goodies. At a certain point you want to buy something. It is common nowadays for people to buy over the Internet with a credit card, and obviously this is done, for example, with SSL. SSL needs persistence enabled in the lvs-configuration. Imagine having 1000 users (conn ESTABLISHED) entering their VISA information when the database server crashes and the healthcheck takes out the server; or, even simpler, when the server/service itself crashes. All already-established connections (they have a persistence template in the kernel space) are lost and these 1000 users have to reauthenticate. How does this look from the point of view of a client, who has no idea about the technology behind the site?

Here the functioning and setup of "mon" is described. In the Ultra Monkey version of LVS, ldirectord fills the same role. (I haven't compared ldirectord and mon. I used mon because it was available at the time, while ldirectord was either not available or I didn't know about it.) The configure script will set up mon to monitor the services on the realservers.

Get "mon" and "fping" from http://www.kernel.org/software/mon/ (I'm using mon-0.38.20)

(from ian.martins (at) mail (dot) tju (dot) edu - comment out line 222 in fping.c if compiling under glibc)

Get the perl package "Period" from CPAN (ftp://ftp.cpan.org)

To use the fping and telnet monitors, you'll need the tcp_scan binary which can be built from satan. The standard version of satan needs patches to compile on Linux. The patched version is at

ftp://sunsite.unc.edu/pub/Linux/system/network/admin

39.2. ethernet NIC failure, and channel bonding

There was a lengthy thread on using multiple NICs to handle NIC failure. Software/hardware to handle such failures is common for unices which run expensive servers (e.g. Solaris) but is less common in Linux.

Beowulfs can use multiple NICs to increase throughput by bonding them together (channel bonding), but redundancy/HA is not important for beowulfs - if a machine fails, it is fixed on the spot. There is no easy way to un-bond NICs - you have to reboot the computer :-(

Michael McConnell michaelm (at) eyeball (dot) com 06 Aug 2001

I want to take advantage of dual NICS in the realserver to provide redundancy. Unfortunately the default gw issue comes up.

Michael E Brown michael_e_brown (at) dell (dot) com 09 Aug 2001

Yes, this is a generally available feature with most modern NICs. It is called various things: channel bonding, trunking, link aggregation. There is a native linux driver that implements this feature in a generic way. It is called the bonding driver. It works with any NIC. Look in drivers/net/bond*. Each NIC vendor also has a proprietary version that works with only their NIC. I gave urls for intel's product, iANS. Broadcom and 3com also have this feature. I believe there is a standard for this: 802.1q. (Joe: the IEEE standard for link aggregation is 802.3ad; 802.1Q covers VLAN tagging.)

John Cronin

It would be nice if it could work across multiple switches, so if a single switch failed, you would not lose connectivity (I think the adaptive failover can do this, but that does not improve bandwidth).

Jake Garver garver (at) valkyrie (dot) net 08 Aug 2001

No it wouldn't be nice, because it would put a tremendous burden on the link connecting the switches. If you are lucky, this link is 1Gb/sec, much slower than backplanes, which are 10Gb/sec and up. In general, you don't want to "load balance" your switches. Keep as much as you can on the same backplane.

So, are there any Cisco Fast EtherChannel experts out there? Can FEC run across multiple switches, or at least across multiple Catalyst blades? I guess I can go look it up, but if somebody already knows, I don't mind saving myself the trouble.

Fast EtherChannel cannot run across multiple switches. A colleague spent weeks of our time proving that. In short, each switch will see a distinct link, for a total of two, but your server will think it has one big one. The switches will not talk to each other to bond the two links and you don't want them to for the reason I stated above. Over multiple blades, that depends on your switch. Do a "show port capabilities" to find out; it will list the ports that can be grouped into an FEC group.

Michael E Brown michael_e_brown (at) dell (dot) com

If you want HA, have one machine (machine A) with bonded channels connected to switch A, and have another machine (machine B) with bonded channels connected to switch B.

If you want to go super-paranoid, and have money to burn on links that won't be used during normal operations: have one machine (machine A) with bonded channels connected to switch A, and have backup bonded channels to switch B. Have software that detects failure of all bonded channels to switch A and fails over your IP to switch B (still on machine A). Have another machine (B), with two sets of bonded channels connected to switch C and switch D. Lather, rinse, repeat. On Solaris, IP failover to a backup link is called IP Multipathing, IIRC. It's a new feature of Solaris 8. Various HA software packages for Linux, notably Steeleye Lifekeeper and possibly LinuxFailsafe, support this as well.

John Cronin

For the scenario described above (two systems), in many cases machine A is active and machine B is a passive failover, in which case you have already burned some money on an entire system (with bonded channels, no less) that won't be used during normal operations.

Considering I can get four (two for each system) SMC EtherPower dual port cards for about $250 including shipping, four Zynx Netblaster quad cards for about $820 if I shop around carefully, $1000 for Intel Dual Port Server adapters, or $1600 for Adaptec/Cogent ANA-6944 quad cards if a name brand is important, the cost seems less significant when viewed in this light (not to mention the cost of two Cisco switches that can do FEC too).

Back to channel bonding (John Cronin)

I presume it's not doable.

I think "not doable" is an incorrect statement - "not done" would be more precise. For the most part, beowulf is about performance, not HA. I know that Intel NICs can use their own channel aggregation or Cisco Fast-EtherChannel to aggregate bandwidth AND provide redundancy. Unfortunately, these features are only available on the closed-source Microsoft and Novell platforms.

http://www.intel.com/network/connectivity/solutions/server_bottlenecks/config_1.htm

Having 2 NICs on a machine with one being spare is relatively new. No-one has implemented a protocol for redundancy AFAIK.

I assume that you mean both of these statements to apply to Linux and LVS only. Sun has had trunking for years, but IP multipathing is the way to go now as it is easier to set up. You do get some bandwidth improvements for OUTBOUND connections only, on a per connection basis, but the main feature is redundancy.

Look in http://docs.sun.com/ for IP, multipathing, trunking.

Sun also has had Network Adapter Fail-Over groups (NAFO groups) in Sun Cluster 2.X for years, and in Sun Cluster 3.0. Veritas Cluster Server has an IPmultiNIC resource that provides similar functionality. Both of these allow for a failed NIC to be more or less seamlessly replaced by another NIC. I would be surprised if IBM HACMP has not had a similar feature for quite some time. In most cases these solutions do not provide improved bandwidth.

The next question then is how often does a box fail in such a way that only 1 NIC fails and everything else keeps working? I would expect this to be an unusual failure mode and not worth protecting against. You might be better off channel bonding your 2 NICs and using the higher throughput (unless you're compute bound).

I would agree, with one exception. If you have the resources to implement redundant network paths farther out into your infrastructure, then having redundant NICs is much more likely to lead to improved availability. For example if you have two NICs, which are plugged into two different switches, which are in turn plugged into two different routers, then you start to get some real benefit. It is more complicated to set up (HA isn't easy most of the time), but with the dropping prices of switches and routers, and the increased need for HA in many environments, this is not as uncommon as it might sound, at least not in the ISP and hosting arena.

I am not trying to slam LVS and Linux HA products - to the contrary; I am trying to inspire some talented soul to write a multipathing NIC device driver we can all benefit from. ;) I make my living doing work on Sun boxes, but I use Linux on my Dell Inspiron 8000 laptop (my primary workstation, actually - it's a very capable system). I would recommend Linux solutions in many situations, but in most cases my employers won't bite, as they prefer vendor supported solutions in virtually every instance, while complaining about the official vendor support.

For channel bonding, both NICs on the host have the same IP and MAC address. You need to split the cabling for the two lots of NICs so you don't have address collisions - you'll need two switches.

John Cronin

You either need multiple switches, or switches that understand and are willing participants in the channel aggregation method being used. Cisco makes switches that do Fast EtherChannel, and Intel makes adapters that understand this protocol (but again, not currently using Linux). Intel adapters also have their own channel aggregation scheme, and I think the Intel switches could also facilitate this scheme, but Intel is getting out of the switch business. Unfortunately, none of the advanced Intel NIC features are available using Linux (it would be nice to have the hardware IPsec support on their newest adapters, for example).

Michael E Brown michael_e_brown (at) dell (dot) com

Depends on which kind of bonding you do. Fast Etherchannel depends on all of the nics being connected to the same switch. You have to configure the switch for trunking. Most of the standardized trunking methods I have seen require you to configure the switch and have all your nics connected to the same switch.

You either need multiple switches, or switches that understand and are willing participants in the channel aggregation method being used. Cisco makes switches that do Fast EtherChannel, and Intel makes adapters that understand this protocol (but again, not currently using Linux).

Michael E Brown michael_e_brown (at) dell (dot) com

Not true. You can download the iANS software from Intel. Not open source, but that is different from "not available".

Look in http://isearch.intel.com for ians+linux.

Also, if you want channel bonding without intel proprietary drivers, see

/usr/src/linux/drivers/net/bonding.c:
/*
 * originally based on the dummy device.
 *
 * Copyright 1999, Thomas Davis, tadavis (at) lbl (dot) gov
 * Licensed under the GPL. Based on dummy.c, and eql.c devices.
 *
 * bond.c: a bonding/etherchannel/sun trunking net driver
 *
 * This is useful to talk to a Cisco 5500, running Etherchannel, aka:
 *      Linux Channel Bonding
 *      Sun Trunking (Solaris)
 *
 * How it works:
 *    ifconfig bond0 ipaddress netmask up
 *      will setup a network device, with an ip address.  No mac address
 *      will be assigned at this time.  The hw mac address will come from
 *      the first slave bonded to the channel.  All slaves will then use
 *      this hw mac address.
 *
 *    ifconfig bond0 down
 *         will release all slaves, marking them as down.
 *
 *    ifenslave bond0 eth0
 *      will attach eth0 to bond0 as a slave.  eth0's hw mac address will
 *      either
 *      a: be used as initial mac address
 *      b: if a hw mac address already is there, eth0's hw mac address
 *         will then  be set from bond0.
 *
 * v0.1 - first working version.
 * v0.2 - changed stats to be calculated by summing slaves stats.
 *
 */

Michael McConnell

This definitely does it!

It creates this excellent kernel module; it contains it all. I just managed to get this running on a Tyan 2515 motherboard that has two onboard Intel NICs.

I've just tested failover mode - works *PERFECT*, not even a single packet dropped! I'm going to try out adaptive load balancing next, and I'll let you know how I make out.

ftp://download.intel.com/df-support/2895/eng/ians-1.3.34.tar.gz

Michael E Brown michael_e_brown (at) dell (dot) com

Broadcom also has proprietary channel bonding drivers for linux. The problem is getting access to this driver. I could not find any driver downloads from their website. It is possible that only OEMs have this driver. Dell factory installs this driver for RedHat 7.0 (and will be on 7.1, 7.2). You might want to e-mail Broadcom and ask.

Also

Broadcom also has an SSL offload card which is coming out and it has open source drivers for linux.

http://www.broadcom.com/products/5820.html

You need support in both the openssl library and the kernel.

The next release of Red Hat linux will have this support integrated in. The Broadcom folks are working closely with the OpenSSL team to get their userspace integrated directly into 0.9.7. Red Hat has backported this functionality into their 0.9.6 release.

If you look at Red Hat's latest public beta, all the support is there and is working.

Since there aren't docs yet, the "bcm5820" rpm is the one you want to install to enable everything. Install this RPM, and it contains an init script that enables and disables the OpenSSL "engine" support as appropriate. Engine is the new OpenSSL feature that enables hardware offload.

39.2.1. more on channel bonding

Paul wrote

Interface bond0 comes up fine with eth1 and eth2 no problem. Bond1 fails miserably every time. I'm going to take that issue up on the bonding mailing list.

Roberto Nibali ratz (at) drugphish (dot) ch 07 Mar 2002

Which patch did you try? Is it the following:

http://prdownloads.sourceforge.net/bonding/bonding-2.4.18-20020226

Did you pass max_bonds=2 when you loaded the bonding.o module? Without that you have no chance. Read the source (if you haven't already) to see what other fancy parameters you might want to pass.
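
For reference, here's a minimal sketch of bringing up two bonds (max_bonds is from the posting above; mode=1 (active-backup) and miimon=100 are assumptions - read the bonding driver source for the parameters your kernel supports, and substitute your own interfaces and addresses):

modprobe bonding max_bonds=2 mode=1 miimon=100
ifconfig bond0 192.168.1.10 netmask 255.255.255.0 up
ifenslave bond0 eth1 eth2
ifconfig bond1 192.168.2.10 netmask 255.255.255.0 up
ifenslave bond1 eth3 eth4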

This is driven in part by our desire to see how far we can push lvs. I know it does 100mb/s in and out. If it can keep 2 channels full, I'll add a third, fourth, fifth, etc. as necessary.

Read http://www.sfu.ca/acs/cluster/nic-test.html to get the impression of what happens if you try to bond too many NICs.

39.3. Service/realserver failout: mon, ldirectord

To activate realserver failover, you can install mon on the director. Several people have indicated that they have written/are using other schemes. RedHat's piranha has monitoring code, and handles director failover and is documented there.

ldirectord handles realserver failover and is part of the Linux High Availability project. The author of ldirectord is Jacob Rief jacob (dot) rief (at) tis (dot) at with most of the later add-ons and code cleanup by Horms, who is using it with Linux-HA/UltraMonkey.

Note

Jacob 12 Feb 2004

Since Jun 2003, I've handed over development to Horms. I'm now using keepalived. It's faster, cleaner in design and also much smarter, because you don't need any extra heartbeat.

ldirectord needs Net::SSLeay only if you are monitoring https (Emmanuel Pare emman (at) voxtel (dot) com, Ian S. McLeod ian (at) varesearch (dot) com)

To get ldirectord -

Jacob Rief jacob (dot) rief (at) tis (dot) at

the newest version available from
cvs.linux-ha.org:/home/cvs/
user guest,
passwd guest
module-name is: ha-linux
file: ha-linux/heartbeat/resource.d/ldirectord
documentation: ha-linux/doc/ldirectord

ldirectord is also available from http://reserve.tiscover.com/linux-ha/ (link dead May 2002)

Andreas Koenig andreas (dot) koenig (at) anima (dot) de 7 Jun 2001

cvs access is described in http://lists.community.tummy.com/pipermail/linux-ha-dev/1999-October/000212.html

Here's a possible alternative to mon -

Doug Bagley doug (at) deja (dot) com 17 Feb 2000

Looking at mon and ldirectord, I wonder what kind of work is planned for future service level monitoring?

mon is okay for general purposes, but it forks/execs each monitor process; if you have 100 real services and want to check every 10 seconds, you would fork 10 monitor processes per second. This is not entirely untenable, but why not make an effort to make the monitoring process as lightweight as possible (since it is running on the director, which is such an important host)?

ldirectord uses the perl LWP library, which is better than forking, but it is still slow. It also issues requests serially (because LWP doesn't really make parallel requests easy).

I wrote a very simple http monitor last night in perl that uses non-blocking I/O, and processes all requests in parallel using select(). It also doesn't require any CPAN libraries, so installation should be trivial. Once it is prototyped in perl, conversion to C should be straightforward. In fact, it is pretty similar to the Apache benchmark program (ab).

In order for the monitor (like ldirectord) to do management of the ipvs kernel information, it would be easier if the /proc interface to ipvs gave a more machine-readable format.

Michael Sparks zathras (at) epsilon3 (dot) mcc (dot) ac (dot) uk

Agreed :-)

It strikes me that rather than having:
type serviceIP:port mechanism
  -> realip:port tunnel weight active inactive
  -> realip:port tunnel weight active inactive
  -> realip:port tunnel weight active inactive
  -> realip:port tunnel weight active inactive

If the table was more like:

type serviceIP:port mechanism realip:port tunnel weight active inactive

Then this would make shell/awk/perl/etc scripts that do things with this table easier to cobble together.

That seems like a far-reaching precedent to me. On the other hand, if the ipvsadm command wished to have an option to represent that information in XML, I can see how that could be useful.

This reminds me I should really finish tweaking the prog I wrote to allow pretty printing of the ipvsadm table, and put it somewhere else for others to play with if they like - it allows you to specify a template file for formatting the output of ipvsadm, making displaying the stuff as XML, HTML, plain text, etc simpler/quicker. (It's got a few hardcoded settings at the mo which I want to ditch first :-)
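
In the meantime, here's a sketch of the shell/awk flattening described above (the field positions assume the usual two-line layout of ipvsadm -L -n output, which varies a little between ipvsadm versions):

ipvsadm -L -n | awk '
    /^(TCP|UDP|FWM)/ { svc = $1 " " $2 " " $3 }   # remember the current virtual service line
    /^  ->/ { print svc, $2, $3, $4, $5, $6 }     # emit one flat record per realserver
'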

39.4. Is ldirectord multithreaded? (ldirectord running high %CPU)

Joe: This came up after Eric Robinson found his misconfigured ldirectord running at 50-99% CPU.

Horms 5 Dec 2008

The basic answer is yes. By default ldirectord does not take advantage of multiple cores, though as discussed in the previous thread, its workload can be split up, and that would allow it to use multiple cores. LVS does take advantage of multiple cores by virtue of being in the kernel - assuming you are using an SMP kernel. The issue of how well it can use multiple cores is a complex topic, and the results would depend on the workload.

It's important to note that generally speaking ldirectord should not be using a lot of CPU resources - and if it is, then some refactoring of the code for the situation at hand would be time well spent. LVS (and the rest of the Linux network stack, which it uses) may on the other hand consume a reasonable amount of CPU resource if it is dealing with a lot of packets / connections.

Joe: Here's Eric's problem

top
.
.
Mem:    516304k total,   506348k used,     9956k free,    45448k buffers
Swap:  1048568k total,        4k used,  1048564k free,   369656k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2762 root      17   0 13708 9884 1744 S 50.4  1.9  13386:29 ldirectord
28957 root      15   0  7108 2320 1848 S  0.3  0.4   0:00.03 sshd
.
.

Fernanda G Weiden fernanda (at) softwarelivre (dot) org

Using multiple processes for different vips would help to isolate the problem, at least. Maybe there is a weird external check or something killing the machine.

Graeme

81 days uptime is 116640 minutes; that means ldirectord has consumed >10% of the CPU in the time the server's been up. What's the health check interval here?

With (say) 100 virtual servers, 2 realservers each, an interval of 10 seconds means 200 checks every ten seconds (nominally). Assuming a 0.1 second latency for each check, that's 20 seconds of serial work every 10 seconds - you're talking overlapping checks there, so a given check run is only half way through when it is due to start again.

Horms

I'm not entirely sure what overlapping means here, but a single ldirectord process runs checks in series. One runs until it finishes or times out, then the next one. There is no parallelisation.

Graeme

It would appear that ldirectord isn't being given a chance to draw breath. Ever.

Horms

Ldirectord isn't that smart. Each ldirectord process just sits in a loop that looks a bit like this

while (1) {
        run check 1 and wait for it to either succeed or time-out;
        run check 2 and wait for it to either succeed or time-out;
        ...
        run check n and wait for it to either succeed or time-out;

        if configuration file has changed
                if $AUTOCHECK is set
                        re-read configuration file;
        else
                sleep $CHECKINTERVAL;
}

So unless something odd is happening with the configuration file, it should always get a chance to take a breath for $CHECKINTERVAL seconds.

Horms

It seems odd to me that ldirectord would take up so much CPU; it's primarily either a) sending small amounts of data and waiting for a reply or b) sleeping. So if it is consuming lots of CPU I suspect a bug, probably in one of the checks (or more specifically one of the modules that is used for one of the checks). There have been problems with the HTTPS check leaking memory in the past, so I would start by seeing if that is the culprit.

In answer to the multi-threading question - no, ldirectord is not multi-threaded, though you can split your configuration up into multiple configuration files and run multiple instances of ldirectord. It can handle the forking for you, or you can do it manually.

Somewhere in the thread it was suggested that you could split your configuration up so that you have one ldirectord process per virtual service as a means of attempting to narrow down the problem. I think that this is a good idea.

A long time ago there was an effort to use non-blocking IO to allow ldirectord to run multiple checks in parallel in a single process. However the code (in the supporting modules) did not work well.

The primary motivation for parallelising ldirectord either within a single process or with multiple processes is usually to minimise the delays inherent in running checks serially. This would actually result in increased CPU usage - as it would be doing more work in a given space of time.

With regards to LVS, it is almost certainly not the cause of ldirectord taking up 50% of CPU. ldirectord only configures LVS. And this is done by forking an ipvsadm process. So if there was a problem with ldirectord configuring LVS, it should show up as ipvsadm processes consuming lots of resources. (Although I guess it is possible that ldirectord is having trouble forking ipvsadm.)

Graeme

To measure packet throughput

ipvsadm -L -n --stats
ipvsadm -L -n --rate

Those two commands will get you pretty far down the road of what you need in terms of packets/sec, conns/sec and so on. --rate will give you the instantaneous rate, where --stats will give you counters since this LVS was started. This is useful for post-processing to get overall averages.
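
To watch the instantaneous rates change while you test, something like this is handy (watch is a standard tool; the 5 second interval is arbitrary):

watch -n 5 'ipvsadm -L -n --rate'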

Given the large number of services and realservers you have, I think this is the key. Looking at the config file (if I read it correctly) you have:

60x tomcat services, 2 realservers each == 120 checks
60x MySQL services, 1 realserver each   ==  60 checks
5x  other services, 1 or 2 realservers each == 7 checks

That's a total of 187 checks to be run every two seconds. If we make it a round 200 (since the maths is then easier!) then, for them all to fit in the interval, each check has a budget of at most 0.01 seconds. It would appear that ldirectord isn't being given a chance to draw breath. Ever.

Just as a test, what happens if you move the checkinterval out to (say) 5, 10, 20 or 30 seconds? Can you tolerate that level of pause if something happens to a realserver?

Eric

I changed it to 5 seconds, but no significant change was apparent. Then I changed it to 10 seconds and there was a definite, observable drop in CPU utilization. A graph of the past 6 hours shows that usage has now flattened out and is now averaging less than 10%.

Graeme

We have now narrowed the cause of the problem down to health checks.

Eric

As for tolerating longer timeouts: I don't know. These are medical applications, and doctors are often grumpy about transient glitches in their applications while trying to document patient encounters. I'm thinking something like this might possibly work for the short term:

checkinterval=10
checktimeout=5
negotiatetimeout=8
checkcount=1

But it would still be a temporary solution and this raises a general question about load-balancer scaling. Right now I have 120 VS, but in a year or two it will be 240. In 4 years it could be 500+. I can't just keep increasing the checkinterval. Ultimately, I'm going to have to try multiple instances of ldirectord.

Which raises a question about LVS. Could it get confused with multiple ldirectord instances constantly forking ipvsadm?

Graeme

As long as they are managing discrete pools of virtual and real servers, then no I don't think it will *unless* you hit the problem someone else reported very recently where realservers seem to migrate between virtuals at random. Horms was going to try to work on that, but it might be tricky to isolate.

Horms

ldirectord (or any other code that manages LVS from user-space) may get confused, if one process is reading things and another is changing things for the same virtual server - though as Graeme says, if they are managing discrete pools this should not be a problem, with the caveat that there seems to be a bug in that code in ldirectord.

It is not possible to confuse LVS itself (unless there is a bug I don't know about). It just does what it is configured to do. And it uses locking to ensure that only one user-space process can change things at a time. So even if user-space is making multiple changes simultaneously (on multiple processors or cores) to the same real server in the same virtual service, the LVS kernel code will serialise these changes and something sensible should result - albeit perhaps not what the multiple user-space processes were expecting.

In other words, LVS serialises changes from user-space.

Graeme

For such a large number of realservers I think you may need to get creative with your healthchecking. You could use the "checkcommand" setting to ldirectord to read a value from a file which is kept updated by some other script which can check in parallel. Unfortunately I can't pull one of those out of a hat right now... :)
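
A sketch of what such a checkcommand script might look like. Assumptions: checktype=external is configured with checkcommand pointing at this script; ldirectord passes the realserver's IP and port as the 3rd and 4th arguments (check the manpage of your ldirectord version for the exact calling convention); and a separate cron job or daemon probes the realservers in parallel, touching/removing the status files:

#!/bin/sh
# report the status gathered by a separate parallel checker,
# instead of probing the realserver from within ldirectord's loop
RIP=$3
RPORT=$4
# the parallel checker touches this file while the service is healthy
if [ -f "/var/run/rs-status/$RIP:$RPORT.ok" ]; then
        exit 0          # healthy: ldirectord keeps the realserver in
fi
exit 1                  # failed: ldirectord takes the realserver out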

Horms

Yes, I agree that some sort of creativity is in order. I did some work on making ldirectord more scalable, but that was a long time ago, and for a somewhat different scenario. The main outcome of that work was fwmark support in both LVS and ldirectord, which allowed many virtual services with the same real servers to be aggregated.

Graeme

Thinking about it laterally, how does something like Nagios cope with a very large number of service checks? It does them in parallel, by running multiple threads. So does OpenNMS, and Zabbix, and in fact pretty much every one of the decent (fsvo "decent") NMS apps I've ever used.

Horms

As ldirectord is written in Perl, doing non-blocking IO to parallelise things is difficult - or more to the point, appeared to not work the last time it was tried. I believe that keepalived, which is written in C, has an easier time here.

On the other hand ldirectord does have a forking option, which parallelises things by forking a process for each virtual service. Though now that I think about it, it might be better if it used a pool of processes - if you have 50 virtual services it will try and fork 50 processes for each iteration of the main loop!

It also allows you to split up the configuration file manually and fork at that granularity at start-up.

Later: I misread the code; the processes should only be forked on startup, and then re-forked if they die - not forked for each iteration of the main loop. But still, a pool might be a good idea, albeit more complex than the current code.

Joe

In the early days of LVS, all failover was set to at least 30secs, as the TCP/IP stack is designed to tolerate somewhere between 30-90secs (depending on the OS) of lost packets (if routing goes down) before sending back icmp errors to the sending node. On that understanding, all applications can expect silence on that time scale before the routing underneath rearranges itself.

For the number of times you have a forced failover of a realserver (once every couple of months at most), I can't imagine that the doctors will notice. Planned downtime will involve setting the weight to 0 and then shutting down the node once all connections have dropped.

Graeme

Not being entirely au fait with ldirectord, I had a read of the source code this morning and found the following which might (or might not) help you out:

fork = yes/no

If yes, then ldirectord will spawn a child process for every virtual server, and run checks against the real servers from them. This will increase response times to changes in real server status in configurations with many virtual servers. This may also use less memory than running many separate instances of ldirectord. Child processes will be automatically restarted if they die.

Default: no

I was going to suggest modifying ldirectord to fork health check processes out for each VS, but Horms (or maybe Jacob) already did it :)

That should help. You'll end up with a lot of processes running but they should be able to deal with the shorter check interval far better than screaming through hundreds of checks every two seconds.
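
A sketch of where the directive goes - the global section of ldirectord.cf, above the virtual= blocks. The directive name is from the manpage text quoted above (it needs a version of ldirectord that has the feature - see the note below); the virtual service shown is one of Eric's from later in this section:

checktimeout=5
checkinterval=10
fork=yes

virtual=192.168.5.100:3001
        real=192.168.10.61:3001 masq
        real=192.168.10.62:3001 masq
        checktype=connect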

Note
Ryan Castellucci ryan (dot) castellucci (at) gmail (dot) com has submitted a patch for this (see: http://hg.linux-ha.org/dev/rev/3d3d903779b2)

39.5. overriding ldirectord health checks from the command line

Joe: Eric found that when he brought down/up the service running on the realserver, ldirectord's health checking would change the weight of the realserver in the ipvsadm table to 0/1 (as expected). However, for a healthy realserver, when Eric changed the weight to 0 via ipvsadm, ldirectord's health checking didn't reset the weight back to 1. This behaviour of ldirectord is an unintended part of the design. Since Eric (and other people) like this behaviour, it will now be a part of the specification of ldirectord.

Robinson, Eric eric (dot) robinson (at) psmnv (dot) com 4 Dec 2008

On my load balancer, you can see that I have two RS listening on port 3001, and both are up..

[root@lb1 ]# ipvsadm|grep 3001
TCP  extrovert.mycharts.md:3001 lblc persistent 360
  -> 192.168.10.62:3001           Masq    1      0          0
  -> 192.168.10.61:3001           Masq    1      0          0
[root@lb1 scripts]#

Now I'll stop the service listening on the server at 192.168.10.61...

[root@appftp1 ]# service tomcat5_001 stop
Stopping tomcat: Using CATALINA_BASE:   /alley/site001/tomcat5
Using CATALINA_HOME:   /alley/site001/tomcat5
Using CATALINA_TMPDIR: /alley/site001/tomcat5/temp
Using JAVA_HOME:       /usr/java/j2sdk1.4.2_09

Back on the load balancer, the down server is detected immediately...

[root@lb1 /]# ipvsadm|grep 3001
TCP  extrovert.mycharts.md:3001 lblc persistent 360
  -> 192.168.10.62:3001           Masq    1      0          0
  -> 192.168.10.61:3001           Masq    0      0          0
[root@lb1 /]#

Now back on the RS, I'll start the service back up...

[root@appftp1 /]# service tomcat5_001 start|grep "startup in"
INFO: Server startup in 4023 ms
[root@appftp1 /]# 

The LB instantly detects the RS is back up...

[root@lb1 /]# ipvsadm|grep 3001
TCP  extrovert.mycharts.md:3001 lblc persistent 360
  -> 192.168.10.62:3001           Masq    1      0          0
  -> 192.168.10.61:3001           Masq    1      0          0 
[root@lb1 /]#

So we know the healthchecks are working, right?

But now look at this. On the LB, I'll change the weight manually, check it, wait 60 seconds, and check it again...

[root@lb1 /]# ipvsadm -e -t 192.168.5.100:3001 -r 192.168.10.61 -w 0 -m
[root@lb1 /]# ipvsadm|grep 3001
TCP  extrovert.mycharts.md:3001 lblc persistent 360
  -> 192.168.10.62:3001           Masq    1      0          0
  -> 192.168.10.61:3001           Masq    0      0          0
[root@lb1 /]# sleep 60
[root@lb1 /]# ipvsadm|grep 3001
TCP  extrovert.mycharts.md:3001 lblc persistent 360
  -> 192.168.10.62:3001           Masq    1      0          0
  -> 192.168.10.61:3001           Masq    0      0          0
[root@lb1 /]#

Still down after a full minute! And it will *stay* down until I run ipvsadm with -w 1

This is what makes me wonder why I seem to get different behavior from the command line than from the healthchecks. I'm sure it must be something simple that I am overlooking.

Graeme 04 Dec 2008

I just played with this and observed the same behaviour, and I think I can summarise it as:

Manage your LVS manually, or automate it with keepalived/ldirectord. Don't do both.

I get the feeling that both of the above will *change* the weight (or remove a server from the pool) when there's a state change, but if the state remains the same, the in-memory structures will stay the same, so the actual weight assigned in the mix of user/kernel space that's in use at this point remains the same.

If you change weight using a tool like ipvsadm, then *that* weight will apply until something changes to make the automation system change something.

For the record, I've used a MISC_CHECK with keepalived before in order to "manually" quiesce a server by simply appending the IP address to a text file or creating an empty file with the same name as an IP/port pair in the pool. This can then be checked for existence every delay_loop, and if it exists make the script exit with a weight which sets the realserver's weight to zero. This has meant I haven't ever ended up in the situation you're seeing, so I did have to check :)
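
For illustration, a sketch of such a MISC_CHECK script. The drain-file name is arbitrary, and it assumes the realserver's IP is passed as an argument on the misc_path line in keepalived.conf; keepalived treats a non-zero exit from the script as a failed check:

#!/bin/sh
# fail the check if this realserver has been manually quiesced
RIP=$1
if grep -qxF "$RIP" /etc/keepalived/drained 2>/dev/null; then
        exit 1          # listed in the drain file: report failure
fi
exit 0                  # not listed: report success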

Jason Ledford jledford (at) biltmore (dot) com 4 Dec 2008

I inquired about this a while back. Ldirectord sets the weight differently than ipvsadm, so if you set it with ipvsadm then ldirectord will come along behind you and change it to what the config says. Based on help found here, I use a script like this to take a server offline. You should be able to tailor it to your needs to make a server come back also. The sed command is the important part:

echo "Config backed up"
/bin/cp /etc/ha.d/ldirectord.cf /etc/ha.d/ldirectord.cf.backup
/bin/sed 's/       real=10.37.2.9:25 masq 1/       real=10.37.2.9:25 masq 0/g' /etc/ha.d/ldirectord.cf > /etc/ha.d/ldirectord.cf.new
/bin/mv /etc/ha.d/ldirectord.cf.new /etc/ha.d/ldirectord.cf
echo "Now syncing with tbcsrv907"
/usr/bin/scp /etc/ha.d/ldirectord.cf tbcsrv907:/etc/ha.d/ldirectord.cf
echo "Spamone is Disabled"

I put that in a script and can run it when needed to drain my servers. I also have the reverse in a script so I can bring it back when I need to. I back it up first in case of a mistake. I take it a step further and run these scripts from Webmin so there is no need to even log in to the server.

Horms 5 Dec 2008

LVS runs inside the kernel and is configured from user-space. The ipvsadm command line is a tool to do this, and ldirectord also uses this tool. LVS does not (at this time) offer a way for user-space programmes to see updates that have been made by third parties (though now I think of this, the netlink interface that was added in 2.6.28 could likely be extended to do this). So if a user uses ipvsadm to change something, ldirectord doesn't know about it.

ldirectord could poll the LVS setup and reset things that don't match its view of the world, but it doesn't. Or more to the point it only does that at the following times:

  • when it starts
  • when it is restarted
  • when it rereads the configuration file

Other than the times listed above - which all run through the startup code - ldirectord only alters the LVS configuration when the state of a real-server changes, from up to down or from down to up.

So if you alter the state of a real server (or any other part of the LVS configuration) using ipvsadm then that change will remain until one of the three events listed above occurs or the state of the modified real-server changes.

Personally, I would be a lot more comfortable altering LVS's setup by modifying ldirectord.cf as needed, as you never know when an event that triggers ldirectord to do something might occur. But you may be comfortable with this behaviour - it's up to you.

Eric

For that very reason, the current behavior seems perfect to me. When I place a server in "administratively down" mode, I *want* LVS to ignore commands that may be triggered by ldirectord's health checks. That way I can maintain the services on the RS (stop and start tomcat, for instance) and not worry about the server suddenly becoming available to users.

I also like being able to manipulate this behavior from the command line. No need to edit a config file (or worry about remembering to un-edit it). I can even schedule drain/fill changes with the at command. It just feels right to me.
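
E.g. scheduling the drain command from the ipvsadm session above to run at 6pm with at(1):

echo "ipvsadm -e -t 192.168.5.100:3001 -r 192.168.10.61 -w 0 -m" | at 18:00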

As for if/when this behaviour might change: it's not really an intentional part of the design - more a side effect of LVS historically not providing notifications of configuration changes to user-space. And I really do think that managing the system through ldirectord.cf is a better way to go. But there was a proposal made to me once that ldirectord should actually take notice of third-party changes and incorporate them into its configuration file. So I guess that if/when ldirectord was to take notice of third-party changes, then the availability of three modes might make sense:

  • Ignore them (until a change event occurs) - the current behaviour
  • Reverse them - the behaviour you were originally expecting
  • Incorporate them into the configuration

Graeme Fowler also mentioned keepalived. I imagine that it behaves in a similar way to ldirectord, but I have not examined the code recently, so I am just guessing.

Robinson, Eric eric (dot) robinson (at) psmnv (dot) com 5 Dec 2008

This makes a great deal of sense to me. I can see where all three options could be useful, but I really don't want to be without option 1 (Ignore them). It is what makes what I call "drain mode" possible, which is a wonderfully graceful way to maintain servers. I set the weight to 0 then I go away. Existing sessions continue uninterrupted, but new ones to that RS are not possible. The user count gradually drops to zero, at which time I am free to power off the server or whatever. No advance notification to users is required. No staying up late to catch the server when nobody is on it. I just put it in drain mode during normal production hours. The next day when all the users are off of it, I can maintain it at my convenience. When I'm done, I issue an ipvsadm command to put the server in "fill mode" and it starts servicing clients. I've been using this approach for a year and it has saved enormous amounts of time and energy.

One other reason that I like manipulating the behavior from the command line is that I sometimes make mistakes in ldirectord.cf that I don't catch until much later. I worry about messing up the config and inadvertently breaking users' access to all 170 realservers. (That's why I don't take advantage of ldirectord's callback directive to copy the config file to the other load balancer. I'd rather do it manually. The last thing I need is to screw up my config on both load balancers at the same time!)

39.6. Mon for server/service failout

Here's the prototype LVS

                        ________
                       |        |
                       | client |
                       |________|
			   |
                           |
                        (router)
                           |
			   |
                           |       __________
                           |  DIP |          |
                           |------| director |
                           |  VIP |__________|
                           |
                           |
                           |
         ------------------------------------
         |                 |                |
         |                 |                |
     RIP1, VIP         RIP2, VIP        RIP3, VIP
   ______________    ______________    ______________
  |              |  |              |  |              |
  | realserver1  |  | realserver2  |  | realserver3  |
  |______________|  |______________|  |______________|

Mon has two parts:

  • monitors: these are programs (usually perl scripts) which are run periodically to detect the state of a service. E.g. the telnet monitor attempts to login to the machine of interest and checks whether a program looking like telnetd responds (in this case looking for the string "login:"). The program returns success/fail. Monitors have been written for many services and new monitors are easy to write.

  • Mon daemon: reads a config file specifying the hosts/services to monitor and how often to poll them (with the monitors). The conf file lists the actions for failure/success of each host/service. When a failure (or recovery) of a service is detected by a monitor, an "alert" (another perl script) is run. There are alerts which send email, page you or write to a log. LVS supplies a "virtualserver.alert" which executes ipvsadm commands to remove or add servers/services in response to hosts/services changing state (up/down) - see the sketch below.
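
The core of an alert like virtualserver.alert is just a pair of ipvsadm calls. A sketch, using the LVS-DR example addresses from the testing example later in this chapter (-d deletes a realserver from a virtual service, -a adds one back; -g selects direct routing):

#on failure: take the realserver out of the virtual service
ipvsadm -d -t 192.168.1.110:80 -r 192.168.1.8
#on recovery: add it back (LVS-DR here, hence -g, with weight 1)
ipvsadm -a -t 192.168.1.110:80 -r 192.168.1.8 -g -w 1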

39.7. Monitoring the service running on the VIP on the realserver from the director

*Trap for the unwary*

Mon runs on the director, but...

Remember that you cannot connect to any of the LVS controlled services from within the LVS (including from the director) (also see the "gotcha" section in the LVS-mini-HOWTO). You can only connect to the LVS'ed services from the outside (eg from the client). If you are on the director, the packets will not return to you and the connection will hang. If you are connecting from the outside (ie from a client) you cannot tell which server you have connected to. This means that mon (or any agent) running on the director (which is where it needs to be to execute ipvsadm commands) cannot tell whether an LVS controlled service is up or down.

With LVS-NAT, an agent on the director can access services on the RIP of the realservers (on the director you can connect to the httpd on the RIP of each realserver). Normal (i.e. non LVS'ed) IP communication is unaffected on the private director/realserver network of LVS-NAT. If ports are not re-mapped, then a monitor running on the director can watch the httpd on server-1 (at 10.1.1.2:80). If the ports are re-mapped (eg the httpd server is listening on 8080), then you will have to either modify the http.monitor (making an http_8080.monitor) or activate a duplicate http service on port 80 of the server.
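
E.g. a quick manual check of the re-mapped httpd from the director (wget is standard; the RIP and port are from the example above):

$ wget -q -O /dev/null http://10.1.1.2:8080/ && echo "httpd up" || echo "httpd down"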

For LVS-DR and LVS-Tun the service on the realserver is listening on the VIP and you cannot connect to this from the director. The solution to monitoring services under control of the LVS for LVS-DR and LVS-Tun is to monitor proxy services whose accessibility should closely track that of the LVS service. Thus to monitor an LVS http service on a particular server, the same webpage should also be made available on another IP (or on 0.0.0.0), not controlled by LVS, on the same machine.

Example:

LVS-Tun, LVS-DR
lvs IP (VIP): eth0 192.168.1.110
director:     eth0 192.168.1.1/24 (normal login IP)
              eth1 192.168.1.110/32 (VIP)
realserver:   eth0 192.168.1.2/24 (normal login IP)
              tunl0 (or lo:0) 192.168.1.110/32 (VIP)

On the realserver, the LVS service will be on the tunl (or lo:0) interface of 192.168.1.110:80 and not on 192.168.1.2:80. The IP 192.168.1.110 on the realserver 192.168.1.2 is a non-arp'ing device and cannot be accessed by mon. Mon running on the director at 192.168.1.1 can only detect services on 192.168.1.2 (this is the reason that the director cannot be a client as well). The best that can be done is to start a duplicate service on 192.168.1.2:80 and hope that its functionality goes up and down with the service on 192.168.1.110:80 (a reasonable hope).

LVS-NAT
lvs IP (VIP): eth0 192.168.1.110
director:     eth0 192.168.1.1/24 (outside IP)
              eth0:1 192.168.1.110/32 (VIP)
              eth1 10.1.1.1/24 (DIP, default gw for realservers)
realserver:   eth0 10.1.1.2/24

Some services listen on 0.0.0.0:port, i.e. they will listen on all IPs on the machine, and you will not have to start a duplicate service.
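
You can check which case you're in from the realserver: if netstat shows the service bound to 0.0.0.0, it is reachable on the RIP without a duplicate instance:

$ netstat -tln | grep ':80 '
#a local address of 0.0.0.0:80 means the httpd is listening on all IPs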

Joe

The director, which has the VIP, can't send a packet to the VIP and expect it to go to the realserver. Thus you need a parallel service running on the RIP (or the service on the realserver bound to 0.0.0.0). You can get around this by doing an rsh/ssh request to the RIP and running a command to check the service running on the VIP.
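
A sketch of the ssh approach, with the LVS-DR example addresses above (the realserver can fetch from the VIP locally, since the VIP is on its own lo:0/tunl0 device):

$ ssh 192.168.1.2 "wget -q -O /dev/null http://192.168.1.110/" \
    && echo "VIP httpd up" || echo "VIP httpd down"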

Graeme Fowler graeme (at) graemef (dot) net 19 May 2006

Alternatively setup an iptables rule on the realserver to snag the packets aimed at the RIP and DNAT them to the VIP instead:

iptables -t nat -A PREROUTING -i $RIP_INTERFACE \
         -p tcp -m tcp -s $DIP -d $RIP \
         --dport 80 -j DNAT --to-destination $VIP

Ah... but I see your realservers are w2k servers, so that won't work. Humbug. You may need to use IIS to do this, and have it run a passthrough script of some sort to attempt to fetch the app server index page from the local instance of the server application. If it fails, generate an appropriate error and pass that back to ldirectord. This is akin to Joe's recommendation to RSH/SSH, but using a webserver instead.

39.8. About Mon

Mon doesn't know anything about the LVS, it just detects the up/down state of services on remote machines and will execute the commands you tell it when the service changes state. We give Mon a script which runs ipvsadm commands to remove services from the ipvsadm table when a service goes down and another set of ipvsadm commands when the service comes back up.

Good things about Mon:

  • It's independent of LVS, i.e. you can set up and debug the LVS without mon running.

  • You can also test mon independently of LVS.

  • The monitors and the daemon are independent.

  • Most code is in perl (one of the "run anywhere" languages) and code fixes are easy.

Bad things about Mon:

  • I upgraded to 0.38.20 but it doesn't run properly. I downgraded back to v0.37l. (Mar 2001 - Well, I was running 0.37l. I upgraded to perl5.6 and some of the monitors/alerts don't work anymore. Mon-0.38.21 seems to work, with minor changes in the output and mon.cf file.)

  • the author doesn't reply to e-mail.

mon-0.37l keeps executing alerts every polling period following an up-down transition. Since you want your service polled reasonably often (eg every 15secs), this means you'll be getting a pager notice/email every 15secs once a service goes down. Tony Bassette kult (at) april (dot) org let me know that mon-0.38 has a numalert command limiting the number of notices you'll get.

Mon has no way of merging/layering/prioritising failure info. If a node fails you get an avalanche of notices that all the services on that node died too.

Ted Pavlic tpavlic (at) netwalk (dot) com 2 Dec 2000

If you are careful, you can avoid this. For example, it is easy to write a simple script (a super monitor) which runs every monitor you want. This way only one message is sent when one or more services go down on a system. Unless you specify the actual failure in the message, though, you will think the entire system has gone down when only one service fails.

There are other alternatives to make sure you only receive one notice when a service goes down, but are you sure you would want that? What if genuinely two services on a system go down but all the other services are up? I, personally, would want to receive notifications about both services.

Ideally, what you want is a dependency setup. Most other system monitors (like the popular "WhatsUp" by IpSwitch (or whatever they call themselves now)) will only send you one notification if, for example, the ICMP monitor reports a failure. This functionality can also be easily built into the monitors.

So the way I see it, mon is fine, but sometimes the monitors one uses with mon might need a little work. Mon is nice because it is so versatile and pluggable. It's modular and doesn't lock one into using some proprietary scripting language to build monitors. Of course, this also makes it very slow.

All of these things can be improved, of course, in a number of ways which can be addressed and fixed one by one. However, most people have gone with ldirectord, so mon, in this application, seems to really have been forgotten. :(

I use mon and have been using it for a long time now and have no real problems with it. My biggest problem was that both of my redundant linuxdirectors notified me when things went down. I just wrote a simple addition to the mailto script which figured out if the machine on which it was running was a master and if, and ONLY if, it was, it would send a message. That solved that problem.

Now you should note that I have made quite a few changes to mon to make it more LinuxDirector-friendly. Rather than configuring mon through its configuration scripts, I configure mon and ipvsadm all through some very simple configuration files I use that control both equally. This required quite a bit of hacking inside mon to get it to dynamically create configuration data, but all of this modification isn't needed for the average LVS-admin.

Abstract: mon ain't really that bad. If ldirectord was everything I'd want it to be, it'd be mon.... consequence: I use mon. :)

39.9. Mon Install

Note

This writeup was done with early (ca. 1999) versions of mon. Ken Brownfield (Mar 2006) has an updated version of the lvs.alert files lvs.alert.tar.gz

Mon is installed on the director. Most of mon is a set of perl scripts. There are few files to be compiled - it is mostly ready to go (rpc.monitor needs to be compiled, but you don't need it for LVS). You do the install by hand.

$ cd /usr/lib
$ tar -zxvof /your_dirpath/mon-x.xx.tar.gz

This will create the directory /usr/lib/mon-x.xx/ with mon and its files already installed. LVS comes with virtualserver.alert (goes in alert.d) and ssh.monitor (goes in mon.d). Make the directory mon-x.xx accessible as mon by linking it to mon or by renaming it

$ln -s mon-x.xx mon
or
$mv mon-x.xx mon

Copy the man files (mon.1 etc) into /usr/local/man/man1. Check that you have the perl packages required for mon to run

$perl -w mon

do the same for all the perl alerts and monitors that you'll be using (telnet.monitor, dns.monitor, http_t.monitor, ssh.monitor).
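
A quick way to check them all at once (perl's -c switch compiles a script without running it; any non-perl helper files will just produce complaints you can ignore):

$ cd /usr/lib/mon
$ for f in mon.d/*.monitor alert.d/*.alert; do perl -wc $f; done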

DNS in /etc/services is known as domain and nameserver, not as dns. To allow the use of the string dns in the lvs_xxx.conf files and to enable the configure script to autoinclude the dns.monitor, add the string dns to the port 53 services in /etc/services with an entry like

    domain		53/tcp		nameserver dns	# name-domain server
    domain		53/udp		nameserver dns

Mon expects executables to be in /bin, /usr/bin or /usr/sbin. The location of perl in the alerts is #!/usr/bin/perl (and not /usr/local/bin/perl) - make sure this is compatible with your setup. (Make sure you don't have one version of perl in /usr/bin and another in /usr/local/bin).

The configure script will generate the required mon.cf file for you (and, if you like, copy it to the canonical location of /etc/mon).

Add an auth.cf file to /etc/mon. I use

#auth.cf ----------------------------------
# authentication file
#
# entries look like this:
# command: {user|all}[,user...]
#
# THE DEFAULT IS TO DENY ACCESS TO ALL IF THIS FILE
# DOES NOT EXIST, OR IF A COMMAND IS NOT DEFINED HERE
#

list:		all
reset:		root
loadstate:	root
savestate:	root
term:		root
stop:		root
start:		root
set:		root
get:		root
dump:		root
disable:	root
enable:		root

#end auth.cf ----------------------------

39.10. Mon Configure

This involves editing /etc/mon/mon.cf, which contains information about

  • nodes monitored

  • how to detect if a node:service is up (does the node ping, does it serve http...?)

  • what to do when the node goes down and what to do later when it comes back up.

The mon.cf generated by the configure script

  • assigns each node to its own group (nodes are brought down one at a time rather than in groups - I did this because it was easier to code rather than for any good technical reason).

  • detects whether a node is serving some service (eg telnet/http), selecting, if possible, a monitor for that service, otherwise defaulting to fping.monitor, which detects whether the node is pingable.

  • on failure of a realserver, mon sends mail to root (using mail.alert) and removes the realserver from the ipvsadm table (using virtualserver.alert).

  • on recovery sends mail to root (using mail.alert) and adds the realserver back to the pool of working realservers in the ipvsadm table (using virtualserver.alert).

39.11. Testing mon without LVS

The instructions here show how to get mon working in two steps: first show that mon works independently of LVS, then bring in LVS.

The example here assumes a working LVS-DR with one realserver and the following IPs. LVS-DR is chosen for the example here as you can set up LVS-DR with all machines on the same network. This allows you to test the state of all machines from the client (ie using one kbd/monitor). (Presumably you could do it from the director too, but I didn't try it.)

lvs IP (VIP): eth0 192.168.1.110
director:     eth0 192.168.1.1/24 (admin IP)
              eth0:1 192.168.1.110/32 (VIP)
realserver:   eth0 192.168.1.8/24

On the director, test fping.monitor (in /usr/lib/mon/mon.d) with

$ ./fping.monitor 192.168.1.8

You should get the command prompt back quickly with no other output. As a control test for a machine that you know is not on the net

$ ./fping.monitor 192.168.1.250
192.168.1.250

fping.monitor will wait for a timeout (about 5secs) and then return the IP of the unpingable machine on exit.

Check test.alert (in /usr/lib/mon/alert.d) - it writes a file in /tmp

$ ./test.alert foo

you will get the date and "foo" in /tmp/test.alert.log

As part of generating the rc.lvs_dr script, you will also have produced the file mon_lvsdr.cf. To test mon, move mon_lvsdr.cf to /etc/mon/mon.cf

#------------------------------------------------------
#mon.cf
#
#mon config info, you probably don't need to change this very much
#

alertdir   = /usr/lib/mon/alert.d
mondir     = /usr/lib/mon/mon.d
#maxprocs    = 20
histlength = 100
#delay before starting
#randstart = 60s

#------
hostgroup LVS1 192.168.1.8

watch LVS1
#the string/text following service (to EOL) is put in the header of mail messages
#service "http on LVS1 192.168.1.8"
service fping
	interval 15s
	#monitor http.monitor
	#monitor telnet.monitor
	monitor fping.monitor
	allow_empty_group
	period wd {Sun-Sat}
	#alertevery 1h
		#alert mail.alert root
		#upalert mail.alert root
		alert test.alert
		upalert test.alert
		#-V is virtual service, -R is remote service, -P is protocol, -A is add/delete (t|u)
		#alert virtualserver.alert -A -d -P -t -V 192.168.1.9:21 -R 192.168.1.8
		#upalert virtualserver.alert -A -a -P -t -V 192.168.1.9:21 -R 192.168.1.8 -T -m -w 1

#the line above must be blank

#mon.cf---------------------------

Now we will test mon on the realserver 192.168.1.8 independently of LVS. Edit /etc/mon/mon.cf and make sure that all the monitors/alerts except fping.monitor and test.alert are commented out (there is an alert/upalert pair for each alert; leave both uncommented for test.alert).

Start mon with rc.mon (or S99mon). Here is my rc.mon (copied from the mon package):

# rc.mon -------------------------------
# You probably want to set the path to include
# nothing but local filesystems.
#

echo -n "rc.mon "

PATH=/bin:/usr/bin:/sbin:/usr/sbin
export PATH

M=/usr/lib/mon
PID=/var/run/mon.pid

if [ -x $M/mon ]
	then
	$M/mon -d -c /etc/mon/mon.cf -a $M/alert.d -s $M/mon.d -f 2>/dev/null
	#$M/mon -c /etc/mon/mon.cf -a $M/alert.d -s $M/mon.d -f
fi
#-end-rc.mon----------------------------

After starting mon, check that mon is in the ps table (ps -auxw | grep perl). When mon comes up it will read mon.cf and then check 192.168.1.8 with the fping.monitor. On finding that 192.168.1.8 is pingable, mon will run test.alert and will enter a string like

Sun Jun 13 15:08:30 GMT 1999 -s fping -g LVS3 -h 192.168.1.8 -t 929286507 -u -l 0

into /tmp/test.alert.log. This is the date, the service (fping), the hostgroup (LVS3 here), the host monitored (192.168.1.8), the unix time in secs, up (-u) and some other fields I didn't need to figure out to get everything working.

Check for the "-u" in this line, indicating that 192.168.1.8 is up.

If you don't see this file within 15-30secs of starting mon, then look in /var/adm/messages and syslog for hints as to what failed (both contain extensive logging of what's happening with mon). (Note: syslog appears to be buffered, it may take a few more secs for output to appear here).

If necessary, kill and restart mon

$ kill -HUP `cat /var/run/mon.pid`

Then pull the network cable on machine 192.168.1.8. In 15secs or so you should hear the whirring of disks and the following entry will appear in /tmp/test.alert.log

Sun Jun 13 15:11:47 GMT 1999 -s fping -g LVS3 -h 192.168.1.8 -t 929286703 -l 0

Note there is no "-u" near the end of the entry indicating that the node is down.

Watch for a few more entries to appear in the logfile, then reconnect the network cable. A line with -u should appear in the log, and after that no further entries.

If you've got this far, mon is working.

Kill mon and make sure root can send himself mail on the director. Make sure sendmail can be found in /usr/lib/sendmail (put in a link if necessary).
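
For example (adjust the source path to wherever your sendmail binary actually lives):

$ ls -l /usr/lib/sendmail
$ # if it's missing and your sendmail is in /usr/sbin
$ ln -s /usr/sbin/sendmail /usr/lib/sendmail
$ echo "hello" | /usr/lib/sendmail root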

Next activate mail.alert and telnet.monitor in /etc/mon/mon.cf and comment out test.alert. (Do not restart mon yet)

Test mail.alert by doing

$ ./mail.alert root
hello
^D

Here root is the address for the mail, "hello" is some arbitrary STDIN, and control-D exits the mail.alert. Root should get some mail with the string "ALERT" in the subject (indicating that a machine is down).

Repeat, this time you are sending mail saying the machine is up (the "-u")

$ ./mail.alert -u root
hello
^D

Check that root gets mail with the string "UPALERT" in the subject (indicating that a machine has come up).

Check the telnet.monitor on a machine on the net. You will need tcp_scan in a place where perl can find it. I moved it to /usr/bin. Putting it in /usr/local/bin (in my path) did not work.

$ ./telnet.monitor 192.168.1.8

The program should exit with no output. Test again on a machine not on the net

$ ./telnet.monitor 192.168.1.250
192.168.1.250

The program should exit, outputting the IP of the machine that is not on the net.
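
telnet.monitor is essentially a TCP connect test on port 23. If you want to see the logic without installing tcp_scan, here is a rough stand-in (a sketch only - it relies on bash's /dev/tcp and, unlike tcp_scan, has no timeout handling, so a hung connect will block):

#telnet.monitor sketch ------------------------
#!/bin/bash
#
# try a TCP connect to port 23 on each host given;
# like mon's monitors, print the failed hosts and exit 1
failed=""
for host in "$@"; do
	(exec 3<>"/dev/tcp/$host/23") 2>/dev/null || failed="$failed $host"
done
if [ -n "$failed" ]; then
	echo $failed
	exit 1
fi
exit 0
#end telnet.monitor sketch --------------------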

Start up mon again (eg with rc.mon or S99mon) and watch for one round of mail notifying you that telnet is up (an "UPALERT") (note: for mon-0.38.21 there is no initial UPALERT). There should be no further mail while the machine remains telnet-able. Then pull the network cable and watch for the first ALERT mail. Mail should continue arriving every mon time interval (set to 15secs in mon_lvs_test.cf). Then plug the network cable back in and watch for one UPALERT mail.

If you don't get mail, check that you re-edited mon.cf properly and that you did kill and restart mon (or you will still be getting test.alerts in /tmp). Sometimes it takes a few seconds for mail to arrive. If this happens you'll get an avalanche when it does start.

If you've got here you are really in good shape.

Kill mon (kill `cat /var/run/mon.pid`)

39.12. Can virtualserver.alert send commands to LVS?

(virtualserver.alert is a modified version of Wensong's original file, for use with 2.2 kernels. I haven't tested it back with a 2.0 kernel. If it doesn't work and the original file does, let me know.)

Run virtualserver.alert (in /usr/lib/mon/alert.d) from the command line and check that it detects your kernel correctly.

$ ./virtualserver.alert

You will get complaints about bad ports (which you can ignore, since you didn't give the correct arguments). If you have kernel 2.0.x or 2.2.x you will get no other output. If you get unknown kernel errors, send me the output of `uname -r`. Debugging print statements can be uncommented if you need to look for clues here.

Make sure you have a working LVS-DR LVS serving telnet on a realserver. If you don't have the telnet service on realserver 192.168.1.8 then run

$ ipvsadm -a -t 192.168.1.110:23 -r 192.168.1.8

Then run ipvsadm in one window.

$ ipvsadm

and leave the output on the screen. In another window run

$ ./virtualserver.alert -V 192.168.1.110:23 -R 192.168.1.8

This will send the down command to ipvsadm. The entry for telnet on realserver 192.168.1.8 will be removed (run ipvsadm again to check).

Then run

$ ./virtualserver.alert -u -V 192.168.1.110:23 -R 192.168.1.8

and the telnet service to 192.168.1.8 will be restored in the ipvsadm table on the director.
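
Under the hood, virtualserver.alert is wrapping ipvsadm: the two invocations above are roughly equivalent to running the following on the director (same LVS-DR setup as above):

$ ipvsadm -d -t 192.168.1.110:23 -r 192.168.1.8   #take the realserver out
$ ipvsadm -a -t 192.168.1.110:23 -r 192.168.1.8   #put it back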

39.13. Running mon with LVS

Connect up all the network cables for the LVS and install an LVS-DR LVS with INITIAL_STATE="off" and a single telnet realserver. Start with a file like lvs_dr.conf.single_telnet_off, adapt the IPs to your situation, and produce the mon_xxx.cf and rc.lvs_xxx files. Run rc.lvs_xxx on the director and then on the realserver.

The output of ipvsadm (on the director) should be

grumpy:/etc/mon# ipvsadm
IP Virtual Server (Version 0.9)
Protocol Local Address:Port Scheduler
      -> Remote Address:Port   Forward Weight ActiveConn FinConn
TCP 192.168.1.110:23 rr

showing that the scheduling (rr) is enabled, but with no entries in the ipvsadm routing table. You should NOT be able to telnet to the VIP (192.168.1.110) from a client.

Start mon (on the director). Since the realserver is already online, mon will detect a functional telnet on it and trigger an upalert for mail.alert and for virtualserver.alert. When the upalert mail arrives, run ipvsadm again. You should get

grumpy:/etc/mon# ipvsadm
IP Virtual Server (Version 0.9)
Protocol Local Address:Port Scheduler
      -> Remote Address:Port   Forward Weight ActiveConn FinConn
TCP 192.168.1.110:23 rr
      -> 192.168.1.8:23        Route   1      0          0

which shows that mon has run ipvsadm and added direct routing of telnet to realserver 192.168.1.8. You should now be able to telnet to 192.168.1.110 from a client and get the login prompt for machine 192.168.1.8.

Logout of this telnet session, and pull the network cable to the realserver. You will get a mail alert and the entry for 192.168.1.8 will be removed from the ipvsadm output.

Plug the network cable back in and watch for the upalert mail and the restoration of LVS to the realserver (run ipvsadm again).

If you want to, confirm that you can do this for http instead of telnet.

You're done. Congratulations. You can use the mon_xxx.cf files generated by the configure script from here.

39.14. Why is the LVS monitored for failures/load by an external agent rather than by the kernel?

Patrick Kormann pkormann (at) datacomm (dot) ch

Wouldn't it be nice to have a switch that would tell ipvsadm 'If one of the realservers is unreachable/connection refused, take it out of the list of realservers for x seconds' or even 'check the availability of that server every x seconds, if it's not available, take it out of the list, if it's available again, put it in the list'.

Lars

That does not belong in the kernel. This is definitely the job of a userlevel monitoring tool.

I admit it would be nice if the LVS patch could check if connections directed to the realserver were refused and would log that to userlevel though, so we could have even more input available for the monitoring process.

Patrick

...and quirks to make lvs a real high-availability system. The problem is that all those external checks are never as effective as a decision by the 'virtual server' could be.

That's wrong.

A userlevel tool can check reply times, request specific URLs from the servers to check if they reply with the expected data, gather load data from the real servers etc. This functionality is way beyond kernel level code.
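
As an illustration, here is a shell sketch of the sort of content check Lars describes (mon's real monitors are perl; this sketch uses bash's /dev/tcp, and the URL and expected string are made up for the example):

#content.monitor sketch -----------------------
#!/bin/bash
#
# the kind of check a userlevel monitor can do and kernel code can't:
# fetch a known URL from each host and verify the body contains an
# expected string; like mon's monitors, print the failed hosts, exit 1
failed=""
for host in "$@"; do
	reply=$( (exec 3<>"/dev/tcp/$host/80" &&
	          printf 'GET /test.html HTTP/1.0\r\n\r\n' >&3 &&
	          cat <&3) 2>/dev/null )
	case "$reply" in
		*"It works"*) ;;               #expected content, host is fine
		*) failed="$failed $host" ;;   #no connect or wrong content
	esac
done
if [ -n "$failed" ]; then
	echo $failed
	exit 1
fi
exit 0
#end content.monitor sketch -------------------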

Michael Sparks zathras (at) epsilon3 (dot) mcc (dot) ac (dot) uk

Personally I think monitoring of systems is probably one of the things the lvs system shouldn't really get into in its current form. My rationale for this is that LVS is a fancy packet forwarder, and at that job it excels.

For the LVS code to do more than this, it would need, for TCP services, the ability to attempt to connect to the *service* the kernel is load balancing - which would be a horrible thing for a kernel module to do. For UDP services it would need to do more than pinging... However, in neither case would you have a convincing method for determining whether the *services* on those machines were still running effectively, unless you put a large amount of protocol knowledge into the kernel. As a result, you would still need external monitoring systems to find out whether the services really are working or not.

For example, in the pathological case (of many that we've seen :-) of a SCSI subsystem failure resulting in indestructible inodes on a cache box, a cache box can reach total saturation in terms of CPU usage, but still respond correctly to pings and TCP connections. However nothing else (or not much) happens, due to the effective system lockup. The only way round such a problem is to have a monitoring system that knows about this sort of failure and can then take the service out.

There's no way this sort of failure could be anticipated by anyone, so putting this sort of monitoring into the kernel would create a false illusion of security - you'd still need an auxiliary monitoring system. E.g. it's not enough just for the kernel to mark the machine out of service - you need some useful way of telling people what's gone wrong (eg setting off people's pagers etc), and again, that's not a kernel thing.

Lars

Michael, I agree with you.

However, it would be good if LVS logged the failures it detects. I.e. I _think_ it can notice a client receiving a port unreachable in response to a forwarded request when running masquerading; however it cannot know when running DR or tunneling, because in those cases it doesn't see the reply to the client.

Wensong

Currently, the LVS can handle ICMP packets for virtual services and forward them to the right place. It is easy to set the weight of the destination to zero, or temporarily remove the dest entry directly, if a PORT_UNREACH ICMP from the server to the client passes through the LVS box.

If we want the kernel to notify the monitoring software that a realserver is down, so that the monitoring software can keep a consistent view of the virtual service table, we need to design an efficient way to notify it; more code is required. Anyway, there is a need to develop efficient communication between the LVS kernel and the monitoring software. For example, the monitoring software should get the connection numbers efficiently; it is time-consuming to parse the big IPVS table for them. And how do we efficiently support ipvsadm -L <protocol, ip, port>? That would be good for applications like Ted's 1024 virtual services. I checked the procfs code: it still requires one write_proc and one read_proc per virtual service printed, which is a little expensive. Any ideas?

Julian Anastasov uli (at) linux (dot) tu-varna (dot) acad (dot) bg

PORT_UNREACH can be returned when the packet is rejected by the realserver's firewall. In fact, only UDP returns PORT_UNREACH when the service is not running; TCP returns a RST packet. We must handle this carefully (I don't know how) and not stop the realserver for all clients just because we see that one client was rejected. And this works only if the LVS box is the default gw for the realservers, whatever the mode: for MASQ it's always the def gw; for DROUTE and TUNNEL, PORT_UNREACH can be one of the reasons not to select another router for the outgoing traffic. But LVS can't detect the outgoing traffic in DROUTE/TUNNEL mode, and for TUNNEL it can be impossible if the realservers are not on the LAN.

So, the monitoring software can solve more problems. The TCP stack can return PORT_UNREACH, but if the problem with the service on the realserver is more complex (realserver died, daemon blocked) we can't expect a PORT_UNREACH; it is sent only when the host is working but the daemon is stopped ("Please restart this daemon"). So, don't rely on the realserver: in most cases it can't tell you "Please remove me from the VS configuration, I'm down" :) This is a job for the monitoring software: to exclude the destinations and even to delete the service (if we switch to local delivery only, i.e. when we switch from LVS to WEB-only mode, for example). So, I vote for the monitoring software to handle this :)

Wensong

Yeah, I prefer that monitoring software handles this too, because it is a unified approach for LVS-NAT, LVS-Tun and LVS-DR, and monitoring software can detect more failures and handle more things according to the failures.

What we discussed last time was having the LVS kernel set the destination entry unavailable in the virtual server table if the LVS detects certain ICMP packets (only for LVS-NAT) or a RST packet etc. This approach might detect these kinds of problems a few seconds earlier than the monitoring software; however we would need more code in the kernel to notify the monitoring software that the kernel has changed the virtual server table, so that the monitoring software can keep a consistent view of the table. Here is a tradeoff. Personally, I prefer keeping the system simple (and effective): only one party (the monitoring software) makes decisions and keeps the consistent view of the VS table.

39.15. Running multiple directors (each with their own IP)

On a normal LVS (one director, multiple realservers being failed-over with mon), the single director is a SPOF (single point of failure). Director failure can be handled (in principle) with heartbeat. In the meantime, you can have two directors each with their own VIP known to the users and set them up to talk to the same set of realservers. (You can have two VIP's on one director box too).

Michael Sparks michael (dot) sparks (at) mcc (dot) ac (dot) uk

Also, has anyone tried this using 2 or more masters, each master with its own IP? From what I can see, theoretically all you should have to do is have one master on IP X, tunneled to clients who receive stuff via tunl0, and another master on IP Y, tunneled to clients on tunl1 - except that when I just tried doing that, I couldn't get the kernel to accept the concept of a tunl1... Is this a limitation of the IPIP module?

Stephen D. WIlliams sdw (at) lig (dot) net

Do aliasing. I don't see a need for tunl1. In fact, I just throw a dummy address on tunl0 and do everything with tunl0:0, etc.

We plan to run at least two LinuxDirector/DR systems with failover for moving the two (or more) public IP's between the systems. We also use aliased, movable IP's for the realserver addresses so that they can fail over too.

39.16. Mon scripts from Christopher DeMarco

Christopher DeMarco cdemarco (at) fastmail (dot) fm 27 Jul 2004

I'm not running heartbeat at the site in question, and wasn't thrilled about setting it up between the director and two realservers. More importantly, Mon is generalizable to an org-wide monitoring system (which heartbeat is not). Mon has a wider range of service check scripts and alerts than ldirectord has and is therefore more flexible. If somebody wanted to monitor ipvs but take action ONLY by alpha pager (i.e. *not* modifying ipvs) then Mon would be more appropriate than ldirectord.

  • ipvs.monitor: Checks whether the specified virtual service is defined, and, optionally, whether it has any realservers defined.
  • ipvs.alert : Brings virtual services and realservers up and down.
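
A mon.cf fragment wiring these into mon might look something like this (illustrative only - the option names here are guesses modeled on virtualserver.alert above; check the scripts themselves for their real arguments):

#fragment only --------------------------------
hostgroup director 127.0.0.1

watch director
	service ipvs
		interval 30s
		#is the virtual service defined, with realservers?
		monitor ipvs.monitor -V 192.168.1.110:80
		period wd {Sun-Sat}
			alert mail.alert root
#end fragment ---------------------------------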