25. LVS: Fwmarks (firewall marks)

25.1. Introduction

Note

fwmark nomenclature:

Karl Kopper (Apr 2004) said that he thinks the correct term for this is "netfilter mark". A google search finds references to "netfilter mark" back to 2001, and with "fwmark" current at least to 2003. Both terms seem to be in use. The various netfilter HOWTOs don't say anything about new terminology. Horms (who wrote the fwmark code) doesn't know anything about a change in terminology, but thinks it's possible that fwmark is the implementation of netfilter marks.

I asked Harald Welte about this at OLS_2004 and the explanation was as clear as day, except that I didn't write it down and now I've forgotten it (geez, sorry about this). It was a matter of nomenclature rather than logic: it was something like - the entity in the command line is called a mark while the method of marking packets is called fwmark. Whatever it is, you can use either term and people will know what you're talking about.

fwmark is a way of aggregating an arbitrary collection of VIP:port services into one virtual service (the entry made with ipvsadm -A). Thus a virtual service could be composed of multiple VIP:ports (e.g. VIP1:port1, VIP2:port2...VIPn:portn). This is useful if the client needs to connect to all of the VIP:port services together on one realserver.

Common uses for fwmark are

aggregate VIP:http and VIP:https, so that when a client fills their shopping cart on VIP:http and they move to VIP:https (to give their credit card information), they will stay on the same realserver.
with multi-port services like ftp (there are some wrinkles with ftp, since the 2nd port calls from the realserver rather than from the client - read the setup of ftp elsewhere in this section and in ftp).
when the realserver is a squid. All traffic to port 80 (for all IPs) is aggregated with a fwmark.

A minor advantage is that a realserver can be added, removed and re-weighted with one ipvsadm command. To enable fwmark, the packets coming into the director have to be labelled with a fwmark; this is done with iptables (or ipchains). The mark is stored in the kernel's socket buffer (skb) that holds the packet; the tcp packet itself is not altered.

	Note
	The fwmark is only a part of the packet while it stays in the skb of the machine which marked the packet (here the director). The fwmark is not on the packet when the packet is put out on the external network (i.e. the fwmark that is put on the packet when it is on the director is not on the packet when it arrives at the realserver).

Once the component services are fwmark'ed, filtering (with iptables/ipchains) can be done on the fwmark, rather than on the individual IP:ports.
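As a minimal sketch of the aggregation described in this introduction, the following dry-run script prints (rather than executes) the commands that would put one fwmark on VIP:http and VIP:https and then build a single persistent virtual service on that mark. The VIP 192.168.2.110, the realserver names RS1 and RS2, the mark value 1 and the 600 sec timeout are placeholders chosen for illustration, and LVS-DR forwarding (-g) is assumed.

```sh
#!/bin/sh
# Dry run: the commands are printed, not executed (running them needs root).
VIP=192.168.2.110      # placeholder VIP
MARK=1                 # arbitrary fwmark value
REALSERVERS="RS1 RS2"  # placeholder realserver names

fwmark_commands() {
    # one marking rule per VIP:port that is to be aggregated
    for port in http https; do
        echo "iptables -t mangle -A PREROUTING -p tcp -d $VIP/32 --dport $port -j MARK --set-mark $MARK"
    done
    # one persistent virtual service for the whole group
    echo "ipvsadm -A -f $MARK -s rr -p 600"
    # realserver port is 0: the fwmark service does not look at ports
    for rs in $REALSERVERS; do
        echo "ipvsadm -a -f $MARK -r $rs:0 -g -w 1"
    done
}

fwmark_commands
```

With the services aggregated like this, re-weighting a realserver for both ports should take a single command, e.g. ipvsadm -e -f 1 -r RS1:0 -g -w 2.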

The original method for setting up an LVS used the VIP as the target for ipvsadm commands. Using the VIP as the target, it is possible for LVS to forward multiple services on the same VIP and to forward packets for several different VIPs. However this method does not scale well to large numbers of services or IPs. As well, the connections to each service are independent, unless persistence is invoked.

The more flexible fwmark method was introduced by Horms in Apr 2000. Ted Pavlic then showed how to use fwmarks to group arbitrary services. In this way connections to two otherwise independent services, e.g. http and https, will be linked as one service as far as ipvs is concerned and the client will stay on the same realserver for both services. The fwmarks method is more flexible and simpler to administer for large numbers of services than is the VIP method.

fwmark is used

to group services together within a single LVS e.g.
- group 1 - port 80,443 for an e-commerce site
- group 2 - port 20,21 for an ftp server
to group large numbers of VIPs together

Setting up an LVS on fwmarks rather than the VIP is now the method of choice for setups with multiple VIPs or a group of ports that need to be aggregated.

fwmark can be used with all forwarding methods and should have no effect on performance (throughput, latency).

Fwmarks are numbers but can be translated into names using the fwmark name translation table patch.

Some history from Horms (this has also been described in "Wired" Magazine - see LVS in the News).

The story starts with a trip from Silicon Valley, where I was working for VA Linux Systems, to a VA Linux Systems Professional Services customer site in Fort Lauderdale. It was mid-February 2000. I was called onsite to help sway the customer towards using LVS. The customer was interested in using LVS for a very large number of customers. Part of their requirement called for a very large number of virtual services to be configured. I suggested that we could simplify this by collecting the virtual services into contiguous network blocks and modifying LVS to recognise all addresses in a block as belonging to a virtual service. The customer seemed to like this idea. My original proposal and implementation was to allow virtual services based on netmasks. Wensong rejected this because of some potential performance issues.
I distinctly remember working on the original implementation on a train trip from the Blue Mountains to Sydney's Central Station with my then girlfriend. By the time I had to change trains to go to Wynyard the code was working :)
When I got back home to Silicon Valley I finished off the changes and emailed them to Wensong. That was on the 20th March. He wasn't particularly happy with some aspects of the change, particularly some performance overhead that my implementation introduced. I made some changes and sent him a new version. He suggested making the new code optional, I made that so too. We exchanged email and code for about a week.
A few days later Julian came up with the idea of using a fwmark, a feature of the ip_masq code that had been around for a while, but wasn't heavily used. Wensong passed this on to me (30 Mar). Wensong clearly was not happy with my approach to the problem and suggested the implementation that he and Julian had hashed out. The change involved using netfilter (iptables) to handle deciding which packets belong to a virtual service, rather than putting that logic into LVS itself - it was this portion of the code that Wensong was worried about the performance of.
We talked this over a little bit over email and I implemented the idea. On the 6th of April I sent the new code to Wensong and Julian. On the 7th Wensong wrote back explaining a few changes he was going to make, mostly involving having the code always compiled in rather than making it an option as there didn't appear to be any performance overhead in the new code. The new option, which by then was known as firewall mark virtual services was included in IPVS 0.9.10 which was released on the 9th April. Minor fixes were made, mainly by Wensong over the following few months and made it into subsequent releases.
I wrote the kernel, ipvsadm and ldirectord changes and largely have maintained them ever since.
It is of note that as a part of the work that came out of this customer the -R and -S options to ipvsadm were suggested and implemented by myself. These were released just before the inclusion of the fwmark code.
This customer was also the impetus for putting together what is now known as Ultra Monkey. All in all quite an interesting outcome for a couple of days on site. Pleasingly I believe that the customer in question is using Ultra Monkey with the fwmark support in LVS.

A bio of Horms:

I am from Sydney Australia. I have been involved in Linux for, well, a long time. My main area of expertise is High Availability and Load Balancing. Though anything from email to routing is just fine by me. You can see a list of the projects I have worked on as well as the papers that I have presented at conferences on my web page (http://www.vergenet.net/linux/).
I used to work at VA Research which became VA Linux Systems until they changed their business model and became VA Software. During that time I was based in Silicon Valley, New York City and Sydney (though not all at the same time :). I currently work for VA Linux Systems Japan, in Tokyo - which I should point out is majority owned by the Sumitomo Corporation and is independent of VA Software (USA) these days. I primarily work on the Ultra Monkey Project in conjunction with NTT Commware.
http://www.ultramonkey.org/, http://www.vasoftware.com/, http://www.valinux.co.jp/, http://www.nttcom.co.jp/.

(Joe) I first saw Horms when he gave a talk at the 4th Annual Linux Expo at Duke University, Durham, NC in May 1998, on Creating Redundant Linux Servers (http://www.vergenet.net/linux/redundant_linux_paper/). Although I attended the talk and thought it pretty neat, it never occurred to me to introduce myself. Later when we both joined the LVS project, it took quite some time before I connected Horms on the LVS mailing list with the person who gave the presentation at the Linux Expo.

Sample configurations/topologies for fwmarks are at Ultramonkey.

25.2. ipvsadm syntax for fwmark

25.2.1. ipvsadm command ignores ports, fwmark can't translate ports

You can enter a port number with a fwmark command with ipvsadm but it is ignored.

Leonard Soetedjo

From the HOWTO, when using fwmark, I can set the port to be 0. Is this correct? Is it ok if I do that for a single port service such as telnet? for example
iptables -t mangle -A PREROUTING -i eth0 -p tcp -s 0/0 -d VIP --dport telnet -j MARK --set-mark 1
ipvsadm -a -f 1 -r RS1:0 -g -w 1
Is the use of "0" not important? i.e. I can set to whatever I want?

Horms 17 Dec 2002: The LVS kernel code that handles fwmarks really doesn't care about ports at all. If you want a service to match on specific ports, then you should set up the iptables rules to only mark packets to that port or ports.

nick garratt Mar 25, 2004

I'm experiencing issues with port translation using LVS-NAT and FWMARK:
iptables -t mangle -A PREROUTING -d VIP -p tcp -m tcp --syn --dport 1237:1239 -j MARK --set-mark 1238

ipvsadm -A -f 1238 -s wlc -p 900
ipvsadm -a -f 1238 -r 192.168.20.1:1237 -m -w 5 # daemon instance 1
ipvsadm -a -f 1238 -r 192.168.20.1:1238 -m -w 5 # daemon instance 2
What I am trying to achieve is the following: we have a custom written SMPP service that accepts two connections (transmitter and receiver) from a client. We have run into problems with maximum threads per process and large numbers of binds. As an interim measure we are considering running multiple instances of the daemon on the same server. It is imperative that a user's two binds are routed to the same daemon instance. The user may connect to a port range so as to allow them to specify different receiver and transmitter ports according to their whim or the peculiarities of their client software but the daemon instance will handle both connections on the same port.
The intention is to group the VIP port range using FWMARK as we do with many other services and load balance them across the RIP service ports ensuring that:
 userIP:56789 -> VIP:1237 -> RIP:n
 userIP:56790 -> VIP:1238 -> RIP:n
where n is the same port guaranteed by persistence. Problem: FWMARK and LVS-NAT port translation does not seem to work at all. what actually happens is:
 userIP:56789 -> VIP:1237 -> RIP:1237
 userIP:56790 -> VIP:1238 -> RIP:1238
which splits the binds across daemon instances.

Horms horms (at) verge (dot) net (dot) au 06 Apr 2004

Yes, port translation does not work with fwmarks, because there is no way for LVS to tell what the port translation should be. In a fwmark service the virtual service does not have a port (or address for that matter). So it can't know that it is accepting packets for, say port 1237, and then use the realserver entry to translate that to port 1237 (not much of a translation) or 1238 (or anything else). It has to just assume that the port will be unchanged.

It would be possible to modify LVS to allow this kind of translation to take place, but it isn't immediately obvious how this would be configured.

(nick) Another approach to the problem is to configure multiple virtual interfaces on my realserver, get the daemon instances to bind to specific IPs/same port ranges and handle as per normal i.e. no port translation:
iptables -t mangle -A PREROUTING -d VIP -p tcp -m tcp --syn --dport 1237:1239 -j MARK --set-mark 1238

ipvsadm -A -f 1238 -s wlc -p 900
ipvsadm -a -f 1238 -r 192.168.20.11:0 -m -w 5 # daemon instance 1 listening on 1237 - 1239
ipvsadm -a -f 1238 -r 192.168.20.12:0 -m -w 5 # daemon instance 2 listening on 1237 - 1239
However I would prefer to keep down the number of IPs I need to failover.

(Horms) I would suggest doing this. You shouldn't need to failover the IP addresses of your realservers anyway. Just use something like ldirectord to monitor their availability and manipulate the LVS table accordingly.

25.3. setting up routing and packet delivery to the director

If you are accepting packets by a fwmark rather than by the VIP, then (in principle) you don't need the VIP on the node with the fwmark rules (which could be either the director or realserver).

To get a working LVS without configuring the VIP on a machine, you need to

be able to deliver the packets to the machine concerned (arp now won't be able to find the machine with the VIP)
and you have to arrange for the machine with the fwmark rules to accept the packet locally. The node normally only accepts packets for an address on the machine. Without the VIP, the node will forward the packet to somewhere else. It should be possible to arrange for LVS to accept the packet, and Julian has said this is possible, but he's working on other things right now.

To do the examples below, you can either setup the fwmarks to mark only one IP (the VIP) and install the VIP on the director, or you can read the section on routing and delivery of packets and use one of the methods suggested there.

25.4. single-port service: telnet with fwmarks

Assuming you already have setup the networks and default gw for the machines in your LVS, here's how you'd setup telnet without fwmarks (i.e. the "normal" method, using the VIP as the target for ipvsadm commands) on a two realserver LVS-DR.

#make a table for connections to VIP:telnet, with round robin scheduling
#schedule realserver RS2 for connections to VIP:telnet, weight=1, forwarding method=DR
#schedule realserver RS1 for connections to VIP:telnet, weight=1, forwarding method=DR
director:# ipvsadm -A -t VIP:telnet -s rr
director:# ipvsadm -a -t VIP:telnet -r RS2:telnet -g -w 1
director:# ipvsadm -a -t VIP:telnet -r RS1:telnet -g -w 1

Here's how to do the same thing with fwmarks. You first mark the packets with ipchains or iptables.

25.4.1. ipchains for 2.2.x director

Here's the recipe for setting a fwmark with ipchains:

#flush ipchains tables
#mark with value=1, tcp packets from anywhere,
#arriving on eth1 (holds the VIP on my setup),
#with dst_addr=192.168.2.110 (the VIP) for port telnet
#show ipchains tables
director:# ipchains -F
director:# ipchains -A input -p tcp -i eth1 -s 0.0.0.0/0 -d 192.168.2.110/32 \
	--dport telnet -m 1
director:# ipchains -L input
Chain input (policy ACCEPT):
target     prot opt     source                destination           ports
-          tcp  ------  anywhere             lvs2.mack.net         any ->   telnet

25.4.2. iptables for 2.4.x director

Here's the recipe for setting a fwmark with iptables:

The iptables parameters are taken from an example by Paul Schulz (http://www.foursticks.com.au/~pschulz/qos/pfifo.sample, link dead Jan 2003), which I found through google.

First put a mark of value=1 on tcp packets which arrive from anywhere with dst_addr=VIP:telnet (the VIP is on eth1 in my setup).

#flush the mangle table
#in the skb, put mark=1 on all tcp packets arriving on eth1 from anywhere, with dest=VIP:telnet
#output the mangle table, just for a look
director:# iptables -F -t mangle
director:# iptables -t mangle -A PREROUTING -i eth1 -p tcp -s 0.0.0.0/0 -d 192.168.2.110/32 \
	--dport telnet -j MARK --set-mark 1
director:/etc/lvs# iptables -L -t mangle
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
MARK       tcp  --  anywhere             lvs2.mack.net       tcp dpt:telnet MARK set 0x1

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

The fwmark is only associated with the packet while it is in the director skb (socket buffer). The packet which emerges from the director and is forwarded to the realserver is a normal (unmarked) packet. (You can't use the director's fwmark information when the packet arrives on the realserver to decide on how to handle the packet.)

25.4.3. install the LVS with ipvsadm

#setup an ipvsadm table for packets with mark=1,
#schedule them with round robin.
#schedule realserver RS1 for connections with mark=1, forwarding method=DR, weight=1
#schedule realserver RS2 for connections with mark=1, forwarding method=DR, weight=1
director:# ipvsadm -A -f 1 -s rr
director:# ipvsadm -a -f 1 -r RS1.mack.net:telnet -g -w 1
director:# ipvsadm -a -f 1 -r RS2.mack.net:telnet -g -w 1

Here's the output of ipvsadm

director:/etc/lvs# ipvsadm
IP Virtual Server version 0.2.7 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port               Forward Weight ActiveConn InActConn
FWM  1 rr
  -> RS2.mack.net:23                  Route   1      0          0
  -> RS1.mack.net:23                  Route   1      0          0

You can now telnet to the VIP. You'll get the expected round robin scheduling of your connections to RS2 and RS1.

25.5. Grouping services: single group, active ftp(20,21)

The telnet example above could equally well be done using the VIP or a fwmark as the target for ipvsadm commands. The same is true for any one port service, where connections to services are made independently of each other. Sometimes we need to group services together, e.g. port 20,21 for an ftp server or port 80, 443 for an e-commerce site. With persistence, you can only make ports persistent singly (but you can make persistent as many or as few as you want, they will be persistent independently); or make all ports persistent at once (with the :0 option), in which case persistence of the ports will be linked. There is no way to make pairs (or groups) of ports persistent with the current persistence code. The current method for handling this, persistent connection, links all ports on the VIP, and the director will forward connections to all ports, not just the two we are interested in. For security purposes, if persistence is used to group services, then connection requests to the other ports will have to be blocked. Although workable, it's an ugly solution.

For background on how the specifications for fwmarks were set to allow services to be grouped, see Appendix 1 for the initial discussion between Ted and the LVS developers (Horms and Julian), Appendix 2 where Ted let me know that he'd had it working, and Appendix 3 for Ted's announcement to the mailing list.

25.5.1. port grouping using VIP and persistence

Here's an example grouping ports 20,21 for ftp. This uses persistence and the VIP as the target for ipvsadm commands (this is the original, VIP way of setting up ftp).

#make a table for connections to all ports on VIP
#with round robin scheduling, persistence timeout=360secs
#schedule realserver RS2 for connections to all ports on VIP, weight=1, forwarding method=DR
#schedule realserver RS1 for connections to all ports on VIP, weight=1, forwarding method=DR
director:# ipvsadm -A -t VIP:0 -s rr -p 360
director:# ipvsadm -a -t VIP:0 -r RS1.mack.net:0 -g -w 1
director:# ipvsadm -a -t VIP:0 -r RS2.mack.net:0 -g -w 1

Here's the output of ipvsadm

director:/etc/lvs# ipvsadm
IP Virtual Server version 0.2.7 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port               Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:0 rr persistent 360
  -> RS1.mack.net:0                   Route   1      0          0
  -> RS2.mack.net:0                   Route   1      0          0

After the client has made the initial connection on port 21, then any subsequent connection on port 20 (within the 360sec timeout period) will go to the same realserver.

The problem is that the director will forward to the same realserver, connection requests made to any port by the client. If we have listeners on port 80 and 443 on the realserver, then these services will be linked to each other (which we may want), and they will also be linked to the ftp service (which we may not want). If you telnet to the VIP, this request will be forwarded to the realservers too (in production you'll have to block this).
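The blocking mentioned above could look something like the following. This is only a sketch, not a tested firewall: it prints (rather than runs) a single iptables rule that would drop tcp packets to the VIP on every port other than 20 and 21. It assumes a 2.4 (or later) director with the VIP configured locally, so that packets to the VIP pass through the filter table's INPUT chain before ipvs sees them, and it assumes the multiport match is available. The VIP and the port list are placeholders.

```sh
#!/bin/sh
# Dry run: print the rule instead of installing it (installing needs root).
VIP=192.168.2.110       # placeholder VIP
ALLOWED_PORTS="20,21"   # ports that the VIP:0 service is meant to serve

block_other_ports() {
    # drop tcp to the VIP unless the destination port is in ALLOWED_PORTS
    echo "iptables -A INPUT -p tcp -d $VIP/32 -m multiport ! --dports $ALLOWED_PORTS -j DROP"
}

block_other_ports
```

Check the rule against your own packet flow before relying on it; a director that also routes or firewalls other traffic will need a more careful ruleset.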

25.5.2. grouping with fwmarks

Here's how to set up an ftp server with fwmarks. First mark the packets of interest with ipchains or iptables (i.e. mark all tcp packets destined for VIP:ftp and VIP:ftp-data arriving on eth1).

25.5.2.1. ipchains for 2.2 director

#flush ipchains tables
#mark ftp packets
#put the same mark on ftp-data packets
#show ipchains tables
director:# ipchains -F
director:# ipchains -A input -p tcp -i eth1 -s 0.0.0.0/0 -d 192.168.2.110/32 \
	--dport ftp -m 1
director:# ipchains -A input -p tcp -i eth1 -s 0.0.0.0/0 -d 192.168.2.110/32 \
	--dport ftp-data -m 1
director:# ipchains -L input
Chain input (policy ACCEPT):
target     prot opt     source                destination           ports
-          tcp  ------  anywhere             lvs2.mack.net         any ->   ftp
-          tcp  ------  anywhere             lvs2.mack.net         any ->   ftp-data

25.5.2.2. iptables for 2.4 director

#clear mangle table
#mark ftp packets
#put the same mark on ftp-data packets
#show mangle table
director:# iptables -F -t mangle
director:/etc/lvs# iptables -t mangle -A PREROUTING -i eth1 -p tcp -s 0.0.0.0/0 -d 192.168.2.110/32 \
	--dport ftp -j MARK --set-mark 1
director:/etc/lvs# iptables -t mangle -A PREROUTING -i eth1 -p tcp -s 0.0.0.0/0 -d 192.168.2.110/32 \
	--dport ftp-data -j MARK --set-mark 1
director:/etc/lvs# iptables -L -t mangle
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
MARK       tcp  --  anywhere             lvs2.mack.net      tcp dpt:ftp MARK set 0x1
MARK       tcp  --  anywhere             lvs2.mack.net      tcp dpt:ftp-data MARK set 0x1

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

25.5.2.3. install LVS with ipvsadm

Next setup ipvsadm to schedule packets marked with fwmark=1 to your realservers. You need persistence (here timeout set to 600secs).

director:# ipvsadm -A -f 1 -s rr -p 600
director:# ipvsadm -a -f 1 -r RS1.mack.net:0 -g -w 1
director:# ipvsadm -a -f 1 -r RS2.mack.net:0 -g -w 1

Here's the output of ipvsadm with two current connections to the LVS and 3 expiring ones. Note they are all to the same realserver, as expected for a persistent connection. Since forwarding is by LVS-NAT, the ip_vs_ftp module automatically loads.

director:/etc/lvs# ipvsadm
IP Virtual Server version 0.2.7 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port               Forward Weight ActiveConn InActConn
FWM  1 rr persistent 600
  -> RS2.mack.net:0                   Route   1      2          3
  -> RS1.mack.net:0                   Route   1      0          0

A netpipe test showed the same latency and throughput for a connection based on fwmark or based on VIP.

What happens now when you telnet from the client to the VIP? (pause to let you think.) The director is only forwarding packets with fwmark=1 to the LVS, so a telnet request to the VIP is accepted by the director and not forwarded to the realservers. If telnetd is running on the director, you'll get a login prompt from the director. In production you'll have to block this too (just like you had to when setting up on a VIP).

So what's the difference, you ask, between setting up an ftp server with persistence on the VIP on one hand (which requires you to block all other packets with iptables rules), and grouping 20,21 with fwmarks on the other (which requires exactly the same blocking of unwanted packets)? Not a lot. At the moment you're at least even.

Lars Marowsky-Brée lmb (at) suse (dot) de 2000-05-11
When using the LVS box as a firewall/router, the fwmark technique is a perfectly adequate solution, which doesn't cost anything.

But look at the next example.

25.6. Grouping services: two groups, active ftp(20,21) and e-commerce(80,443)

Setup 2 groups of services, group 1 - ftp(20,21), group 2 - ecommerce(80,443).

First mark packets in 2 groups.

25.6.1. ipchains for 2.2 director

director:# ipchains -F
director:# ipchains -A input -p tcp -i eth1 -s 0.0.0.0/0 -d 192.168.2.110/32 \
	--dport ftp -m 1
director:# ipchains -A input -p tcp -i eth1 -s 0.0.0.0/0 -d 192.168.2.110/32 \
	--dport ftp-data -m 1
director:# ipchains -A input -p tcp -i eth1 -s 0.0.0.0/0 -d 192.168.2.110/32 \
	--dport http -m 2
director:# ipchains -A input -p tcp -i eth1 -s 0.0.0.0/0 -d 192.168.2.110/32 \
	--dport https -m 2
director:# ipchains -L input
Chain input (policy ACCEPT):
target     prot opt     source                destination           ports
-          tcp  ------  anywhere             lvs2.mack.net         any ->   ftp
-          tcp  ------  anywhere             lvs2.mack.net         any ->   ftp-data
-          tcp  ------  anywhere             lvs2.mack.net         any ->   www
-          tcp  ------  anywhere             lvs2.mack.net         any ->   https

25.6.2. iptables for 2.4 director

director:# iptables -t mangle -A PREROUTING -i eth1 -p tcp -s 0.0.0.0/0 -d 192.168.2.110/32 \
	--dport ftp -j MARK --set-mark 1
director:# iptables -t mangle -A PREROUTING -i eth1 -p tcp -s 0.0.0.0/0 -d 192.168.2.110/32 \
	--dport ftp-data -j MARK --set-mark 1
director:# iptables -t mangle -A PREROUTING -i eth1 -p tcp -s 0.0.0.0/0 -d 192.168.2.110/32 \
	--dport http -j MARK --set-mark 2
director:# iptables -t mangle -A PREROUTING -i eth1 -p tcp -s 0.0.0.0/0 -d 192.168.2.110/32 \
	--dport https -j MARK --set-mark 2
director:/etc/lvs# iptables -L -t mangle
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
MARK       tcp  --  anywhere             lvs2.mack.net      tcp dpt:ftp MARK set 0x1
MARK       tcp  --  anywhere             lvs2.mack.net      tcp dpt:ftp-data MARK set 0x1
MARK       tcp  --  anywhere             lvs2.mack.net      tcp dpt:www MARK set 0x2
MARK       tcp  --  anywhere             lvs2.mack.net      tcp dpt:https MARK set 0x2

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

25.6.3. setup LVS to schedule (with persistence) 2 groups of packets

Note: The ipvs code in Apr 2001 needed a patch to get the expected behaviour. This section describes the function of LVS before and after this patch. As a result of these tests, the patch will be applied to future releases. ipvs-1.0.7-2.2.19 is already patched (Apr 2001). The 2.4.3 series are not patched yet. To see if the code has been patched look in ipvs/Changelog for something like this

Julian changed persistent connection template for fwmark-based service from <CIP,VIP,RIP> to <CIP,FWMARK,RIP>, so that different fwmark-based services that share the same VIP can work correctly.

If your ipvs code is pre-patched, then you can skip down to the part where the behaviour after applying the patch is described. If your code isn't patched, you should just go get the patch and skip to the part where the expected behaviour is described.

25.6.4. unexpected behaviour

Here's what happened with the original code.

director:# ipvsadm -A -f 1 -s rr -p 600
director:# ipvsadm -a -f 1 -r RS1.mack.net:0 -g -w 1
director:# ipvsadm -a -f 1 -r RS2.mack.net:0 -g -w 1
director:# ipvsadm -A -f 2 -s rr -p 600
director:# ipvsadm -a -f 2 -r RS1.mack.net:0 -g -w 1
director:# ipvsadm -a -f 2 -r RS2.mack.net:0 -g -w 1

IP Virtual Server version 0.2.7 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port               Forward Weight ActiveConn InActConn
FWM  1 rr persistent 600
  -> RS2.mack.net:0                   Route   1      0          0
  -> RS1.mack.net:0                   Route   1      0          0
FWM  2 rr persistent 600
  -> RS2.mack.net:0                   Route   1      0          0
  -> RS1.mack.net:0                   Route   1      0          0

If you ftp and http to the VIP, you'd expect the ftp connections to go to fwmark 1 (presumably to the first realserver RS2) and the http connections to go to fwmark 2 (again presumably to RS2).

With the director running 1.0.6-2.2.19 (ipvs/kernel version), all connections (ftp, http) go to group 1. With the director 0.2.7-2.4.2, all connections go to group 2. Here's the output from ipvsadm for the 2.2.19 example immediately after downloading a webpage. You would expect the http InActConn to be associated with FWM2.

IP Virtual Server version 1.0.6 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port             Forward Weight ActiveConn InActConn
FWM  1 rr persistent 30
  -> RS2.mack.net:0                 Route   1      0          2
  -> RS1.mack.net:0                 Route   1      0          0
FWM  2 rr persistent 30
  -> RS2.mack.net:0                 Route   1      0          0
  -> RS1.mack.net:0                 Route   1      0          0
director:/etc/lvs#

It appears (Apr 2001) that the ipvs code doesn't really follow the persistent fwmarks spec. When there is a collision between VIP space and fwmark space (e.g. in these examples, where all packets are going to the same VIP), then the VIP takes precedence and the two fwmark groups are not differentiated. The collision arises because there is only one set of templates for the connection tables.
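The collision can be illustrated with a toy model in bash. This is not the ipvs code: it only assumes that a persistence template is looked up by a key and created if missing, and compares a key of CIP,VIP with a key of CIP,FWMARK. The client address and the helper names (schedule, by_vip, by_fwmark) are made up for the illustration. The model shows the two groups collapsing into one; it says nothing about which group wins on a particular kernel.

```bash
#!/bin/bash
# Toy model only: this is not the ipvs kernel code.
declare -A by_vip      # templates keyed on CIP,VIP    (one shared set of templates)
declare -A by_fwmark   # templates keyed on CIP,FWMARK (separate templates per fwmark)

CIP=10.0.0.5           # hypothetical client
VIP=192.168.2.110      # both fwmark groups are reached through this one VIP

schedule() {           # $1 = fwmark carried by the incoming packet
    local mark=$1 old_key="$CIP,$VIP" new_key="$CIP,$1"
    # create a template only if none exists yet for the key
    [ -n "${by_vip[$old_key]}" ]    || by_vip[$old_key]="FWM$mark"
    [ -n "${by_fwmark[$new_key]}" ] || by_fwmark[$new_key]="FWM$mark"
    OLD=${by_vip[$old_key]}
    NEW=${by_fwmark[$new_key]}
    echo "mark $mark: CIP,VIP template -> $OLD; CIP,FWMARK template -> $NEW"
}

schedule 1   # ftp packet:  both keys give FWM1
schedule 2   # http packet: CIP,VIP still gives FWM1, CIP,FWMARK gives FWM2
```

In the model, the http packet reuses the template created by the ftp packet when the key contains the VIP, which is the kind of mixing seen in the ipvsadm output above.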

25.6.5. expected behaviour

Note: May 2001: the ipvs code now has the persistent-fwmark behaviour.

( The code to produce the expected behaviour requires a separate set of templates for fwmarks and VIP. The patch to do this is on Julian's patch page and has names like persistent-fwmark-0.2.8-2.4-1.diff, persistent-fwmark-1.0.5-2.2.18-1.diff. (Note: the 0.2.8 patch had DOS carriage control and wouldn't patch till I removed the ^M characters). (Note: as of ipvs-0.9.0, this patch has been applied to the source tree.)

After patching the ip_vs code to produce the new ip_vs.o module (rmmod the old one first), you get the expected fwmark behaviour. )

Here's the output of ipvsadm after ftp'ing and http'ing from a client. Note that the ftp connection is to fwmark=1. The InActConn is the expiring connection from the http client to fwmark=2.

director:/etc/lvs# ipvsadm
IP Virtual Server version 0.2.8 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port               Forward Weight ActiveConn InActConn
FWM  1 rr persistent 30
  -> RS2.mack.net:0                   Route   1      1          0
  -> RS1.mack.net:0                   Route   1      0          0
FWM  2 rr persistent 30
  -> RS2.mack.net:0                   Route   1      0          1
  -> RS1.mack.net:0                   Route   1      0          0

25.6.6. example

Here's an example of using persistence granularity (from Ratz 3 Jan 2001). The -M 255.255.255.255 sets up /32 granularity. Here port 80 and port 443 are being linked by fwmarks.

ipchains -A input -j ACCEPT -p tcp -d 192.168.1.100/32 80 -m 1 -l
ipchains -A input -j ACCEPT -p tcp -d 192.168.1.100/32 443 -m 1 -l
director:/etc/lvs# ipvsadm -A -f 1 -s wlc -p 333 -M 255.255.255.255
director:/etc/lvs# ipvsadm -a -f 1 -r 192.168.1.1 -g -w 1
director:/etc/lvs# ipvsadm -a -f 1 -r 192.168.1.2 -g -w 1

25.6.7. The original idea from Ted Pavlic

Ted Pavlic tpavlic (at) netwalk (dot) com 2000-10-08

Just another persistence option that you may or may not have thought of... LVS does support port-group sticky persistence. Before FWMARK support was added to LVS, the only types of persistence one could do were:

One port persistence (all queries to 80 return to the same realserver per CIP)
ALL port persistence (all queries to all ports return to the same RIP per CIP)

But now that FWMARK support exists in LVS, it is easy to create group-based sticky persistence. That is... It adds the option where:

Only these two ports (443 and 80) return to the same RIP per CIP
Meanwhile, another persistence table keeps track of 20, 21, and 1024:65535
Any other port is not persistent

Just have ipchains keep track of flagging the incoming packets with the correct port group identifier:

ipchains -A input -d VIPNET/VIPMASK PORT -p PROTOCOL -m FWMARK

And have IPVS stop looking at IPs and start looking at FWMARKs:

director:/etc/lvs# ipvsadm -A -f FWMARK
director:/etc/lvs# ipvsadm -a -f FWMARK -r RIP:0
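
As a rough illustration of the grouping Ted describes, here is a small Python sketch of which fwmark a packet to the VIP would get for a given destination port. The table `PORT_GROUPS` and the function `mark_for_port` are names made up for this example, and the mark numbers 1 and 2 are arbitrary; the real classification is done by the ipchains rules on the director, not by Python.

```python
# Hypothetical model of the port groups described above.
# Each fwmark owns a list of (low, high) destination port ranges.
PORT_GROUPS = {
    1: [(80, 80), (443, 443)],                # http and https share one group
    2: [(20, 20), (21, 21), (1024, 65535)],   # ftp, ftp-data and high ports
}

def mark_for_port(port):
    """Return the fwmark for a destination port, or None if it is unmarked."""
    for mark, ranges in PORT_GROUPS.items():
        for low, high in ranges:
            if low <= port <= high:
                return mark
    return None

for port in (80, 443, 21, 2049, 25):
    print(port, mark_for_port(port))
```

Ports that map to the same mark share one persistence table, so a client is sent to the same realserver for any of them. A port with no mark (25 here) is not handled by either fwmark service.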

25.6.8. ssl and cookies

Ted Pavlic tpavlic (at) netwalk (dot) com 2000-10-13

LVS DIRECTLY supports two types of persistence and INDIRECTLY supports another. If you are just asking how to make port 443 persistent so that those who receive a cookie on 443 will come back to the same realserver on 443, simply:

/sbin/ipvsadm -A -t 192.168.1.110:443 -p
/sbin/ipvsadm -a -t 192.168.1.110:443 -R 192.168.2.1
/sbin/ipvsadm -a -t 192.168.1.110:443 -R 192.168.2.2
/sbin/ipvsadm -a -t 192.168.1.110:443 -R 192.168.2.3
...

This sets up persistence just for port 443.

However, say someone gets a cookie on port 80 and gives it back on port 443 -- in that case you want to have persistence between multiple ports. Using port 0 accomplishes this:

/sbin/ipvsadm -A -t 192.168.1.110:0 -p
/sbin/ipvsadm -a -t 192.168.1.110:0 -R 192.168.2.1
/sbin/ipvsadm -a -t 192.168.1.110:0 -R 192.168.2.2
/sbin/ipvsadm -a -t 192.168.1.110:0 -R 192.168.2.3
...

In this setup, anyone who visits ANY service will continue to go back to the same realserver. So requests which come in on 80 or 443 will continue to come in to the same realserver regardless of port.

This is an OK solution, but it basically makes all services persistent which might mess up scheduling. That is, this is a decent solution but sometimes not extremely desirable.

If you want to simply group ports 80 and 443 together, you need to do something more intuitive. Use FWMARK...

ipchains -A input -d 192.168.1.110/32 80 -p tcp -m 1
ipchains -A input -d 192.168.1.110/32 443 -p tcp -m 1
/sbin/ipvsadm -A -f 1 -p
/sbin/ipvsadm -a -f 1 -R 192.168.2.1
/sbin/ipvsadm -a -f 1 -R 192.168.2.2
/sbin/ipvsadm -a -f 1 -R 192.168.2.3
...

Now only port 80 and 443 will be grouped together via persistence. Any other ipvsadm rules will be completely separate. This means that you can make 80 and 443 persistent in their own little "port group" and leave ports 25 and 110 (for example) not persistent. OR... You could group all the FTP ports together as well in a completely different persistence group... i.e.

ipchains -A input -d 192.168.1.110/32 80 -p tcp -m 1
ipchains -A input -d 192.168.1.110/32 443 -p tcp -m 1
/sbin/ipvsadm -A -f 1 -p
/sbin/ipvsadm -a -f 1 -R 192.168.2.1
/sbin/ipvsadm -a -f 1 -R 192.168.2.2
/sbin/ipvsadm -a -f 1 -R 192.168.2.3
# Really adding port 20 isn't needed
ipchains -A input -d 192.168.1.110/32 20 -p tcp -m 2
ipchains -A input -d 192.168.1.110/32 21 -p tcp -m 2
ipchains -A input -d 192.168.1.110/32 1024:65535 -p tcp -m 2
/sbin/ipvsadm -A -f 2 -p
/sbin/ipvsadm -a -f 2 -R 192.168.2.1
/sbin/ipvsadm -a -f 2 -R 192.168.2.2
/sbin/ipvsadm -a -f 2 -R 192.168.2.3
...

and again

Wayne wrote
Is there an easy way to relate servers on both port 80 and port 443 (with LVS-NAT)?

Say I have two farms, each with the same three servers. One farm load balances HTTP requests and the other load balances HTTPS requests. To make sure that a user in persistent mode who is connected to an HTTP server always goes to the same server for HTTPS service, we would like some way to relate the services between the two farms. Is there an easy way to do it?

ratz ratz (at) tac (dot) ch 2001-01-03

Two possibilities to solve this with LVS

Use port 0 in your setup. (advantage: easy to set up and easy to understand)
Use fwmark and group them together. (advantage: finer port granularity possible)

Example (1):

director:/etc/lvs# ipvsadm -A -t 192.168.1.100:0 -s wlc -p 333 -M 255.255.255.255
director:/etc/lvs# ipvsadm -a -t 192.168.1.100:0 -r 192.168.1.1 -g -w 1
director:/etc/lvs# ipvsadm -a -t 192.168.1.100:0 -r 192.168.1.2 -g -w 1

Example (2):

ipchains -A input -j ACCEPT -p tcp -d 192.168.1.100/32 80 -m 1 -l
ipchains -A input -j ACCEPT -p tcp -d 192.168.1.100/32 443 -m 1 -l
director:/etc/lvs# ipvsadm -A -f 1 -s wlc -p 333 -M 255.255.255.255
director:/etc/lvs# ipvsadm -a -f 1 -r 192.168.1.1 -g -w 1
director:/etc/lvs# ipvsadm -a -f 1 -r 192.168.1.2 -g -w 1

25.7. passive ftp

You can setup passive ftp with the VIP as the target using persistence. This is not a particularly satisfactory solution, as connect requests to all ports will be forwarded. As well, if another service on the realserver fails (eg http), then all services have to be failed out together.

Here's a solution to passive ftp from Ted Pavlic using fwmark. This allows setting up passive ftp independently of other services. Passive ftp listens on an unknown and unpredictable high port on the realserver. This is handled by forwarding requests to all high ports (it's still ugly, but at least this way, we can fail out ftp independently of other services).

25.7.1. test session with active ftp

Here's ftp setup in active mode, as a control.

director:# iptables -L -t mangle
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
MARK       tcp  --  anywhere             lvs2.mack.net      tcp dpt:ftp MARK set 0x1
MARK       tcp  --  anywhere             lvs2.mack.net      tcp dpt:ftp-data MARK set 0x1
#
#setup ipvsadm, making all packets with mark=1 persistent
director:# ipvsadm -A -f 1 -s rr -p 600
director:# ipvsadm -a -f 1 -r RS1:0 -g -w 1
director:# ipvsadm -a -f 1 -r RS2:0 -g -w 1
director:# ipvsadm
IP Virtual Server version 0.2.8 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port               Forward Weight ActiveConn InActConn
FWM  1 rr persistent 600
  -> RS2.mack.net:0                   Route   1      0          0
  -> RS1.mack.net:0                   Route   1      0          0

Here's netstat -an on the client and the realserver (RS2) immediately after an ftp file transfer (with the client still connected).

#client:
client:~# netstat -an | grep 110 #110 is part of the VIP
tcp        0      0 client:1176   VIP:21           ESTABLISHED
#realserver
RS2:/home/ftp/pub# netstat -an | grep 254 #254 is part of the client IP
tcp        0      0 VIP:20        client:1180      TIME_WAIT
tcp        0      0 VIP:20        client:1178      TIME_WAIT
tcp        0      0 VIP:20        client:1177      TIME_WAIT
tcp        0      0 VIP:21        client:1176      ESTABLISHED

Only ports 20 and 21 are involved here.

Here's the command line at the client during the active ftp transfer (all expected output).

ftp> get tulip.c
local: tulip.c remote: tulip.c
200 PORT command successful.
150 Opening BINARY mode data connection for tulip.c (104241 bytes).
226 Transfer complete.
104241 bytes received in 0.0232 secs (4.4e+03 Kbytes/sec)

The iptables rules on the director do not allow passive ftp connections. To test this, put the ftp client into passive mode.

ftp> pass
Passive mode on.
ftp> dir
227 Entering Passive Mode (192,168,2,110,4,72)
ftp: connect: Connection refused
ftp>

The connection is not allowed. To check that the system is still functioning, put the client back into active mode.

ftp> pass
Passive mode off.
ftp> dir
200 PORT command successful.
150 Opening ASCII mode data connection for /bin/ls.
total 155178
.
.
-rw-r--r--   1 root     root       104241 Nov 10  1999 tulip.c
226 Transfer complete.
ftp>
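
The 227 reply in the failed passive attempt above tells the client where to open the data connection: four numbers for the IP and two for the port (high byte, then low byte). The Python sketch below decodes such a reply; `parse_pasv` is a name invented for this example and is not part of any LVS or ftp tool.

```python
import re

def parse_pasv(reply):
    """Decode a '227 Entering Passive Mode (h1,h2,h3,h4,p1,p2)' reply."""
    match = re.search(r"\((\d+),(\d+),(\d+),(\d+),(\d+),(\d+)\)", reply)
    if match is None:
        raise ValueError("not a 227 passive-mode reply: %r" % reply)
    numbers = [int(n) for n in match.groups()]
    host = ".".join(str(n) for n in numbers[:4])
    port = numbers[4] * 256 + numbers[5]   # high byte, then low byte
    return host, port

print(parse_pasv("227 Entering Passive Mode (192,168,2,110,4,72)"))
```

For the reply shown in the session this gives 192.168.2.110 port 1096. That is a high port, which the active-mode rules (which mark only ftp and ftp-data) do not cover, so the data connection is refused.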

25.7.2. test session with passive ftp

Here's the setup for passive ftp (2.4.x director) (you can leave ipvsadm untouched).

director:# iptables -F -t mangle
#mark ftp packets
director:# iptables -t mangle -A PREROUTING -i eth1 -p tcp -s 0.0.0.0/0 -d 192.168.2.110/32 \
	--dport ftp -j MARK --set-mark 1
director:# iptables -t mangle -A PREROUTING -i eth1 -p tcp -s 0.0.0.0/0 -d 192.168.2.110/32 \
	--dport 1024: -j MARK --set-mark 1
director:# iptables -L -t mangle
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
MARK       tcp  --  anywhere             lvs2.mack.net      tcp dpt:ftp MARK set 0x1
MARK       tcp  --  anywhere             lvs2.mack.net      tcp dpts:1024:65535 MARK set 0x1

Here's the command line from the ftp client still in active mode

ftp>  dir
200 PORT command successful.

The session is hung, the server shows an established connection to port 21 and the client session has to be killed.

Here's the passive session.

client:~# ftp VIP
Connected to VIP.
220 RS2.mack.net FTP server (Version wu-2.4.2-academ[BETA-15](1) Wed May 20
 13:45:04 CDT 1998) ready.
Name (VIP:root): ftp
331 Guest login ok, send your complete e-mail address as password.
Password:
230 Guest login ok, access restrictions apply.
Remote system type is UNIX.
Using binary mode to transfer files.
ftp> pass
Passive mode on.
ftp> cd pub
250 CWD command successful.
ftp> dir *.c
227 Entering Passive Mode (192,168,2,110,4,75)
150 Opening ASCII mode data connection for /bin/ls.
-rw-r--r--   1 root     root       104241 Nov 10  1999 tulip.c
226 Transfer complete.
ftp> mget *.c
mget tulip.c? y
227 Entering Passive Mode (192,168,2,110,4,78)
150 Opening BINARY mode data connection for tulip.c (104241 bytes).
226 Transfer complete.
104241 bytes received in 0.0233 secs (4.4e+03 Kbytes/sec)
ftp>

Here's the connections at the realserver immediately after the file transfer. There is the regular connection at the ftp port (21) and a connection timing out to a high port on the realserver.

RS2:/home/ftp/pub# netstat -an | grep 254 #254 is part of the client IP
tcp        0      0 VIP:1104      client:1191      TIME_WAIT
tcp        0      0 VIP:21        client:1184      ESTABLISHED

Here's the output from ipvsadm after connecting to the URL ftp://vip/ using a web-browser

director:# ipvsadm
IP Virtual Server version 0.2.8 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port               Forward Weight ActiveConn InActConn
FWM  1 rr persistent 600
  -> RS2.mack.net:0                   Route   1      1          5
  -> RS1.mack.net:0                   Route   1      0          0

25.7.3. LVS with 2 groups: group 1 = ftp(active and passive), group 2 = http

#fwmark rules
director:# iptables -F -t mangle
#active and passive ftp in group 1
director:# iptables -t mangle -A PREROUTING -i eth1 -p tcp -s 0.0.0.0/0 -d 192.168.2.110/32 \
	--dport ftp -j MARK --set-mark 1
director:# iptables -t mangle -A PREROUTING -i eth1 -p tcp -s 0.0.0.0/0 -d 192.168.2.110/32 \
	--dport ftp-data -j MARK --set-mark 1
director:# iptables -t mangle -A PREROUTING -i eth1 -p tcp -s 0.0.0.0/0 -d 192.168.2.110/32 \
	--dport 1024: -j MARK --set-mark 1
#http as group 2
director:# iptables -t mangle -A PREROUTING -i eth1 -p tcp -s 0.0.0.0/0 -d 192.168.2.110/32 \
	--dport http -j MARK --set-mark 2
director:/etc/lvs# iptables -L -t mangle
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
MARK       tcp  --  anywhere             lvs2.mack.net      tcp dpt:ftp MARK set 0x1
MARK       tcp  --  anywhere             lvs2.mack.net      tcp dpt:ftp-data MARK set 0x1
MARK       tcp  --  anywhere             lvs2.mack.net      tcp dpts:1024:65535 MARK set 0x1
MARK       tcp  --  anywhere             lvs2.mack.net      tcp dpt:www MARK set 0x2
#
#setup LVS for 2 groups
director:# ipvsadm -C
#ftp (active and passive) are persistent as group 1
director:# ipvsadm -A -f 1 -s rr -p 600
director:# ipvsadm -a -f 1 -r RS1:0 -g -w 1
director:# ipvsadm -a -f 1 -r RS2:0 -g -w 1
#http as group 2 (not persistent)
director:# ipvsadm -A -f 2 -s rr
director:# ipvsadm -a -f 2 -r RS1:http -g -w 1
director:# ipvsadm -a -f 2 -r RS2:http -g -w 1
director:# ipvsadm
IP Virtual Server version 0.2.8 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port               Forward Weight ActiveConn InActConn
FWM  1 rr persistent 600
  -> RS2.mack.net:0                   Route   1      0          0
  -> RS1.mack.net:0                   Route   1      0          0
FWM  2 rr
  -> RS1.mack.net:80                  Route   1      0          0
  -> RS2.mack.net:80                  Route   1      0          0

The client connected (in order) to ftp://VIP/ (passive ftp), http://VIP/ and then by active (command line) ftp to VIP. Here's the ipvsadm output.

director:# ipvsadm
IP Virtual Server version 0.2.8 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port               Forward Weight ActiveConn InActConn
FWM  1 rr persistent 600
  -> RS2.mack.net:0                   Route   1      2          3
  -> RS1.mack.net:0                   Route   1      0          0
FWM  2 rr
  -> RS1.mack.net:80                  Route   1      2          2
  -> RS2.mack.net:80                  Route   1      4          0

Here's the connections showing on the realserver. The most recent ones are at the top of the list. The connection list shows (from the bottom, i.e. in the order of connection), passive ftp, http, and active ftp.

RS2:/home/ftp/pub# netstat -an | grep 254 #254 is part of the CIP
tcp        0      0 VIP:21        client:1207      ESTABLISHED
tcp        0      0 VIP:80        client:1206      FIN_WAIT2
tcp        0      0 VIP:80        client:1204      FIN_WAIT2
tcp        0      0 VIP:1108      client:1202      TIME_WAIT
tcp        0      0 VIP:21        client:1201      ESTABLISHED

The whole point of this setup is to make ftp and http, which belonged to one persistence group when set up on a VIP, into two groups. Now you can bring the httpd and the ftpd up and down independently (if you want to fail them out, to change the configuration or software).

25.8. fwmark with LVS-NAT

(based on a posting by Horms on 14 Jul 2000)

Here we setup a LVS-NAT LVS on a 2.4.x director. (Note: With 2.4 LVS, the masquerading is setup by the ipvs code, i.e. you don't have to masquerade the packets back from the realservers). These examples assume that the VIP is on eth1 and your network is already setup (i.e. the realservers are using the director as the default gw etc).

Mark packets for the VIP and setup the LVS for telnet.

	Warning
	this first example is not going to get you anything you want.

#
#mark packets
director:# iptables -F -t mangle
director:# iptables -t mangle -A PREROUTING -i eth1 -p tcp -s 0.0.0.0/0 -d 192.168.2.110/32 -j MARK --set-mark 1
director:# iptables -L -t mangle
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
MARK       tcp  --  anywhere             lvs2.mack.net      MARK set 0x1
#
#Setup ipvsadm
director:# ipvsadm -C
director:# ipvsadm -A -f 1 -s rr
director:# ipvsadm -a -f 1 -r RS1.mack.net:telnet -m -w 1
director:# ipvsadm -a -f 1 -r RS2.mack.net:telnet -m -w 1
director:# ipvsadm
IP Virtual Server version 0.2.8 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port               Forward Weight ActiveConn InActConn
FWM  1 rr
  -> RS2.mack.net:23                  Masq    1      0          0
  -> RS1.mack.net:23                  Masq    1      0          0

You can connect with telnet to the VIP and you'll be forwarded to both realservers in the expected way.

All packets from the client will be marked and processed by the ipvsadm rules. What happens if you attempt to connect to VIP:80 (pause to think)?

Here's the answer.

client:~# telnet VIP 80
Trying 192.168.2.110...
Connected to lvs2.mack.net.
Escape character is '^]'.

Welcome to Linux 2.2.19.


RS2 login: root
Linux 2.2.19.
Last login: Fri Apr 13 11:43:52 on ttyp1 from client2.mack.net.
No mail.

If you connect to VIP:80 using a browser for a client, it sits there showing the watch symbol for quite a while.

What happened? The explanation is that you told the director to mark all packets from the client to the VIP (i.e. to any port), rewrite them to have dest_addr=RIP:telnet and forward the rewritten packets to the realserver. So when you telnet'ed to VIP:80, the packets were forwarded to RIP:23.
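
The effect can be reproduced with a toy model. This is a minimal Python sketch, not the ipvs code: `realserver_port`, the rule list and the service table are all invented for illustration. It assumes that a later matching mark rule overwrites an earlier mark; the rules used here don't overlap, so that assumption doesn't change the result.

```python
def realserver_port(dport, mark_rules, services):
    """Return the realserver port that a packet for VIP:dport is rewritten to.

    mark_rules: list of (dport or None, fwmark); None matches every port.
    services:   dict of fwmark -> realserver port (the port given with -r).
    """
    mark = None
    for rule_port, rule_mark in mark_rules:
        if rule_port is None or rule_port == dport:
            mark = rule_mark      # assumption: a later match overwrites the mark
    return services.get(mark)     # None: no fwmark service takes the packet

# First setup: every packet for the VIP gets mark 1; service 1 points at telnet.
mark_everything = [(None, 1)]
telnet_only = {1: 23}
print(realserver_port(23, mark_everything, telnet_only))   # 23
print(realserver_port(80, mark_everything, telnet_only))   # 23, not 80

# One mark per service: telnet gets mark 1, http gets mark 2.
per_service = [(23, 1), (80, 2)]
telnet_and_http = {1: 23, 2: 80}
print(realserver_port(80, per_service, telnet_and_http))   # 80
```

With the catch-all rule a connection to VIP:80 lands on port 23 of the realserver. With one mark per service each port reaches its own service.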

Just to make sure that I'd interpreted this correctly, here's the first packets seen by tcpdump running on the client and the realserver during the connect attempts. (These are from different sessions, so the ports shown on the client are different.)

client: here the client is connecting to VIP:80 (lvs2.www)

12:09:44.449566 client2.1118 > lvs2.www: S 2887976275:2887976275(0) win 5840 <mss 1460,sackOK,timestamp 118456418[|tcp]> (DF) [tos 0x10]
12:09:44.450453 lvs2.www > client2.1118: S 1441372470:1441372470(0) ack 2887976276 win 32120 <mss 1460,sackOK,timestamp 117741798[|tcp]> (DF)
12:09:44.450579 client2.1118 > lvs2.www: . ack 1 win 5840 <nop,nop,timestamp 118456418 117741798> (DF) [tos 0x10]

realserver (RS2): here the realserver is receiving packets to the RIP:23 (RS2.telnet)

11:44:28.319675 client2.1116 > RS2.telnet: S 2722509719:2722509719(0) win 5840 <mss 1460,sackOK,timestamp 118440378[|tcp]> (DF) [tos 0x10]
11:44:28.319974 RS2.telnet > client2.1116: S 1283414485:1283414485(0) ack 2722509720 win 32120 <mss 1460,sackOK,timestamp 117725760[|tcp]> (DF)
11:44:28.320681 client2.1116 > RS2.telnet: . ack 1 win 5840 <nop,nop,timestamp 118440378 117725760> (DF) [tos 0x10]

If you want only telnet requests to be forwarded to the realservers, you should mark only packets for VIP:telnet. If you want both telnet and http forwarded then you should give them each their own mark. Here's how to setup LVS-NAT with fwmark for both telnet and http.

director:# iptables -F -t mangle
#telnet packets to the VIP get fwmark=1
director:# iptables -t mangle -A PREROUTING -i eth1 -p tcp -s 0.0.0.0/0 -d 192.168.2.110/32 \
	--dport telnet -j MARK --set-mark 1
#http packets to the VIP get fwmark=2
director:# iptables -t mangle -A PREROUTING -i eth1 -p tcp -s 0.0.0.0/0 -d 192.168.2.110/32 \
	--dport http -j MARK --set-mark 2
director:# iptables -L -t mangle
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
MARK       tcp  --  anywhere             lvs2.mack.net      tcp dpt:telnet MARK set 0x1
MARK       tcp  --  anywhere             lvs2.mack.net      tcp dpt:www MARK set 0x2
#
#setup ipvsadm
director:# ipvsadm -C
#forward packets with mark=1 to the telnet port
director:# ipvsadm -A -f 1 -s rr
director:# ipvsadm -a -f 1 -r RS1.mack.net:telnet -m -w 1
director:# ipvsadm -a -f 1 -r RS2.mack.net:telnet -m -w 1
#forward packets with mark=2 to the httpd port
director:# ipvsadm -A -f 2 -s rr
director:# ipvsadm -a -f 2 -r RS1.mack.net:http -m -w 1
director:# ipvsadm -a -f 2 -r RS2.mack.net:http -m -w 1
director:# ipvsadm
IP Virtual Server version 0.2.8 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port               Forward Weight ActiveConn InActConn
FWM  1 rr
  -> RS2.mack.net:23                  Masq    1      0          0
  -> RS1.mack.net:23                  Masq    1      0          0
FWM  2 rr
  -> RS2.mack.net:80                  Masq    1      0          0
  -> RS1.mack.net:80                  Masq    1      0          0

Here's the (expected) output of ipvsadm showing the client with 2 telnet sessions and having just downloaded a webpage from the LVS.

director:/etc/lvs# ipvsadm
IP Virtual Server version 0.2.8 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port               Forward Weight ActiveConn InActConn
FWM  1 rr
  -> RS2.mack.net:23                  Masq    1      1          0
  -> RS1.mack.net:23                  Masq    1      1          0
FWM  2 rr
  -> RS2.mack.net:80                  Masq    1      0          1
  -> RS1.mack.net:80                  Masq    1      0          0

25.9. collisions between fwmark and VIP rules

Since it's possible to write iptables rules that include many different types of packets, it's possible to write VIP and fwmark rules that would conflict by accepting the same packet. Here's a setup that would accept telnet by both VIP and fwmarks.

director:# iptables -t mangle -A PREROUTING -i eth1 -p tcp -s 0.0.0.0/0 -d 192.168.2.110/32 \
	--dport ftp -j MARK --set-mark 1
director:# ipvsadm -A -t lvs2.mack.net:telnet -s rr
director:# ipvsadm -a -t lvs2.mack.net:telnet -r RS1.mack.net:telnet -g -w 1
director:# ipvsadm -a -t lvs2.mack.net:telnet -r RS2.mack.net:telnet -g -w 1
director:# ipvsadm -A -f 1 -s rr
director:# ipvsadm -a -f 1 -r RS1.mack.net:telnet -g -w 1
director:# ipvsadm -a -f 1 -r RS2.mack.net:telnet -g -w 1
#
director:# iptables -L -t mangle
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
MARK       tcp  --  anywhere             lvs2.mack.net      tcp dpt:ftp MARK set 0x1
#
director:# ipvsadm
IP Virtual Server version 0.2.8 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port               Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:telnet rr
  -> RS2.mack.net:telnet              Route   1      0          0
  -> RS1.mack.net:telnet              Route   1      0          0
FWM  1 rr
  -> RS2.mack.net:telnet              Route   1      0          0
  -> RS1.mack.net:telnet              Route   1      0          0

Here's the ipvsadm output after 4 telnet connections from a client

director:/etc/lvs# ipvsadm
IP Virtual Server version 0.2.8 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port               Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:telnet rr
  -> RS2.mack.net:telnet              Route   1      2          0
  -> RS1.mack.net:telnet              Route   1      2          0
FWM  1 rr
  -> RS2.mack.net:telnet              Route   1      0          0
  -> RS1.mack.net:telnet              Route   1      0          0

All connections go to the first (here VIP) entries. The same ipvsadm table and connection pattern results if you feed the VIP and fwmarks rules into ipvsadm in the reverse order. This behaviour is not part of the spec (yet). You might want to check the behaviour, if you are doing this sort of setup.

25.10. persistence granularity with fwmark

25.10.1. Introduction

Persistence granularity was added to LVS by Lars lmb (at) suse (dot) de 1999-10-13

"This patch adds netmasks to persistent ports, so you can adjust the granularity of the templates. It should help solve the problems created with non-persistent cache clusters on the client side."

The problem being addressed is that some clients (eg AOL customers) connect to the internet via large proxy farms. The IP they present to the server will not necessarily be the same for different sessions (tcp connections), even though they remain connected to their proxy machine. Persistence granularity makes all clients from a network equivalent as far as persistence is concerned. Thus a client could appear as CIP=x.x.x.13 for their http connections, but CIP=x.x.x.14 for their https connections. With persistence granularity set to /24, all CIPs from the same class C network will be sent to the same realserver. The default behaviour (i.e. persistence granularity is /32) has the effect that all connections from the same CIP are sent to the one realserver, but other connections from the same network may be scheduled to other realservers.

Persistence granularity is applied to the CIP and works the same whether you are using fwmark or the VIP to setup the LVS.

You set the netmask (granularity) for persistence granularity with ipvsadm. If the LVS was setup with the following command, the persistence granularity is 255.255.255.0.

director:/etc/lvs# ipvsadm -A -t 192.168.1.100:0 -s wlc -p 333 -M 255.255.255.0

Let's say a client from a class C network (e.g. with IP=100.100.100.2) connects to the LVS. If any other client connects from 100.100.100.0/24 they will also connect to the same realserver as long as the original client's entry in the persistence table has not expired (i.e. the first client is still connected, or disconnected < 333 secs ago).
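
The grouping comes from masking the client IP with the netmask given to -M before the persistence entry is looked up. Here is a minimal Python sketch of that masking, using the addresses from the example above; the function name `cipnet` is an invention for this illustration.

```python
import ipaddress

def cipnet(cip, netmask):
    """Apply the persistence netmask to a client IP (CIP & netmask)."""
    masked = int(ipaddress.IPv4Address(cip)) & int(ipaddress.IPv4Address(netmask))
    return str(ipaddress.IPv4Address(masked))

# With -M 255.255.255.0 both clients fall into the same persistence group.
print(cipnet("100.100.100.2", "255.255.255.0"))
print(cipnet("100.100.100.77", "255.255.255.0"))
# With the default /32 granularity each client is its own group.
print(cipnet("100.100.100.2", "255.255.255.255"))
```

Two clients whose masked addresses are equal are treated as one client for persistence, so they are sent to the same realserver until the persistence entry times out.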

25.10.2. examples

Here's an example LVS-DR LVS set to mark packets for an IP on the outside of the director (this IP serves as the VIP in the usual LVS setup, but there's no such thing as a VIP with fwmarks) with --dport telnet. Persistence granularity is set to the default (-M 255.255.255.255).

director:# ipvsadm -C
director:# ipvsadm -A -f 1 -s rr -p 600
director:# ipvsadm -a -f 1 -r RS1:0 -g -w 1
director:# ipvsadm -a -f 1 -r RS2:0 -g -w 1

Two clients (192.168.2.254, 192.168.2.253) connect to the LVS. Each client connects to a different realserver, but multiple connects from each client go to the same realserver (i.e. client A always goes to realserver A; client B always goes to realserver B, at least till the persistence timeout clears). Here both clients have connected twice.

director:# ipvsadm
IP Virtual Server version 0.2.8 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port               Forward Weight ActiveConn InActConn
FWM  1 rr persistent 600
  -> RS2.mack.net:0                   Route   1      2          0
  -> RS1.mack.net:0                   Route   1      2          0

This is the connection pattern expected if the connections were based on the CIP/32 and fwmark (ie all clients are scheduled independently).

Here's the same setup with persistence granularity set to /24.

director:# ipvsadm -C
director:# ipvsadm -A -f 1 -s rr -p 600 -M 255.255.255.0
director:# ipvsadm -a -f 1 -r RS1:0 -g -w 1
director:# ipvsadm -a -f 1 -r RS2:0 -g -w 1
director:# ipvsadm
IP Virtual Server version 0.2.8 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port               Forward Weight ActiveConn InActConn
FWM  1 rr persistent 600 mask 255.255.255.0
  -> RS2.mack.net:0                   Route   1      0          0
  -> RS1.mack.net:0                   Route   1      0          0

Here's what happens when the 2 clients, both of whom belong to the same CIP/24 persistence group, connect twice - all connections go to the same realserver.

director:/etc/lvs# ipvsadm
IP Virtual Server version 0.2.8 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port               Forward Weight ActiveConn InActConn
FWM  1 rr persistent 600 mask 255.255.255.0
  -> RS2.mack.net:0                   Route   1      4          0
  -> RS1.mack.net:0                   Route   1      0          0

25.10.3. Discussion with Julian about persistence granularity

Joe

I expect if you were using persistence with fwmark, then any connection requests arriving with the same fwmark will be treated as belonging to that persistence group. Presumably any combination of client IPs and/or networks could have been used to make the rules which marks the packets.

Julian

Yes, it is for the same group but in one fwmark group there are many templates created. These templates are different for the client groups. The template looks like this:

CIPNET:0 -> SERVICE(FWMARK/VIP):0 -> RIP:0

All ports 0 for the fwmark-based services

So, for client 10.1.2.3/24 (24=persistent granularity) the template looks like this:

10.1.2.0:0 -> VIP:0 -> RIP:0

LVS patched with the persistent_fwmark patch:

10.1.2.0:0 -> FWMARK:0 -> RIP:0

So, the templates are created with CIP/GRAN in mind and the lookup uses CIPNET too. We use

CIPNET = CIP & CNETMASK

before creation and lookup.

Joe

So if I did

iptables -s 10.1.2.3 -m 1
director:/etc/lvs# ipvsadm -A -f 1 -s rr -p 600 -M 255.255.255.0

only packets from 10.1.2.3 will have a fwmark on them, but the director would forward all packets from 10.1.2.0/24, even those without fwmarks?

Julian

The patched LVS will accept only the marked packets for this fwmark service, from the same /24 client subnet. If only one client IP sends packets that are marked, then the real service will receive packets only from 10.1.2.3. The current LVS versions don't consider the service, and all packets CIPNET -> VIP will be forwarded using the first created template for CIPNET:0->VIP:0, i.e. these packets will randomly hit one of the many services that accept packets for the same VIP (just like in your setup) and may then reach the wrong realserver.

Joe

"The current LVS versions don't consider the service and all packets CIPNET -> VIP"
But there is no VIP here, I'm using fwmark only. What does the -M 255.255.255.0 do in this case?

Julian

The current LVS versions (i.e. without the persistent_fwmark patch) assume the VIP is the iphdr->daddr, i.e. the destination address in the datagram, and this address is used to lookup/create the template.

Joe

How about your persistent-patch, which I've been working with?

Julian

The patch ignores this daddr when creating or looking for templates. Instead, the service fwmark value is used when the service is fwmark-based: CIPNET:0 -> FWMARK:0 -> RIP:0

The normal services use daddr as VIP when looking for or creating templates: CIPNET:0 -> daddr:0 -> RIP:0

The persistence is associated with the client address (CIP). The sequence is this:

- packet comes from CIP to VIP1

- fw marking, optional

- lookup for existing connection CIP:CPORT->VIP1:VPORT, if yes => forward, if not found:

- lookup service => fwmark 1, persistent

- try to select real service in context of the virtual service

Apply the persistence granularity to the client address

CIPNET = CIP & svc->netmask

Now lookup for template

not patched: check for existing template CIPNET:0, VIP1:0
patched: check for existing template CIPNET:0, 1(fwmark):0

if there is template, bind the new connection to the template's destination

if there is no existing template, get one destination using the scheduler and bind it to the newly created template and the new connection. The created template is

not patched: CIPNET:0, VIP1:0, DEST_RIP:0
patched: CIPNET:0, 1(fwmark):0, DEST_RIP:0

- forward the packet
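
The template part of the sequence above can be sketched in Python. This is a toy model under stated assumptions, not the ipvs implementation: the class `Director`, its method names, the round-robin choice and the realserver names RS1..RS4 are all invented for illustration, and the first step (the lookup of an existing CIP:CPORT->VIP:VPORT connection) is left out. The two fwmark services are given different realservers only so that a wrong choice is visible.

```python
import ipaddress

def cipnet(cip, netmask):
    """CIPNET = CIP & netmask."""
    masked = int(ipaddress.IPv4Address(cip)) & int(ipaddress.IPv4Address(netmask))
    return str(ipaddress.IPv4Address(masked))

class Director:
    """Toy model of persistence template lookup for fwmark services."""

    def __init__(self, services, patched):
        self.services = services        # fwmark -> netmask and realservers
        self.patched = patched          # True: persistent-fwmark behaviour
        self.templates = {}             # (CIPNET, service id) -> realserver
        self.rr_index = {mark: 0 for mark in services}

    def connect(self, cip, vip, fwmark):
        svc = self.services[fwmark]
        net = cipnet(cip, svc["netmask"])
        # Not patched: the template is keyed on the destination address.
        # Patched: the template is keyed on the fwmark of the service.
        service_id = ("fwmark", fwmark) if self.patched else ("vip", vip)
        key = (net, service_id)
        if key not in self.templates:
            # No template yet: schedule (round robin here) and create one.
            servers = svc["realservers"]
            self.templates[key] = servers[self.rr_index[fwmark] % len(servers)]
            self.rr_index[fwmark] += 1
        return self.templates[key]

SERVICES = {
    1: {"netmask": "255.255.255.0", "realservers": ["RS1", "RS2"]},
    2: {"netmask": "255.255.255.0", "realservers": ["RS3", "RS4"]},
}

for patched in (False, True):
    director = Director(SERVICES, patched)
    first = director.connect("10.1.2.3", "VIP1", fwmark=1)
    second = director.connect("10.1.2.9", "VIP1", fwmark=2)
    print("patched" if patched else "not patched", first, second)
```

In the unpatched model the second connection reuses the template made for fwmark 1, because both share CIPNET and VIP, and so it goes to a realserver that belongs to the other service. In the patched model each fwmark has its own templates.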

Persistence granularity was designed for people coming in from large proxy servers (eg AOL). With fwmarks, this can be handled by iptables rules.

Yes, the fact that we group the clients using this netmask is not related to the virtual service type: normal or fwmark-based.

Yes, each different IP is treated as different client. When a netmask <32 is used, the group of addresses is treated as one client when applying the persistence rules. This is not related to the packet marking and virtual service type.

25.11. fwmark allows LVS-DR director to be default gw for realservers

If a LVS-DR director is accepting packets by fwmarks, then it does not have a VIP. The director can then be the default gw for the realservers (see LVS-DR director is default gw for realservers).

25.12. fwmark simplifies configuration for large numbers of addresses

If a fwmark rule accepts packets for a /24 network, then 254 IPs are configured in one instruction. The next sections are examples.
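
The count is easy to check with Python's standard ipaddress module. A small sketch follows; the helper name `usable_hosts` is made up, and the two networks are the ones used in the examples in the following sections.

```python
import ipaddress

def usable_hosts(cidr):
    """Number of addresses in a network, minus network and broadcast."""
    return ipaddress.ip_network(cidr).num_addresses - 2

for cidr in ("192.168.192.0/24", "192.168.0.0/23"):
    print(cidr, usable_hosts(cidr))
```

A /24 gives 254 usable addresses and a /23 gives 510, all covered by a single marking rule.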

25.13. Example: firewall farm

Horms horms (at) vergenet (dot) net 2000-12-06

Assume that packets from our local network (192.168.0.0/23) are outgoing traffic.

Mark all outgoing packets with fwmark 1

ipchains -A input  -s 192.168.0.0/23 -m 1
# Now, set up a virtual service to act on the marked packets
director:/etc/lvs# ipvsadm -A -f 1
director:/etc/lvs# ipvsadm -a -f 1 -r 192.168.1.7
director:/etc/lvs# ipvsadm -a -f 1 -r 192.168.1.8
director:/etc/lvs# ipvsadm -a -f 1 -r 192.168.1.9

Where 192.168.1.7, 192.168.1.8 and 192.168.1.9 are your firewall boxen.
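
On 2.4 and later kernels the marking is done with iptables rather than ipchains. Here is a minimal sketch of the same firewall farm, assuming the mangle/PREROUTING marking syntax used in the iptables examples elsewhere in this chapter. The function name and the dry-run convention (the commands are only printed) are illustrative, not part of Horms' setup.

```shell
#!/bin/sh
# Dry run: prints the commands instead of executing them.
# Review the output, then pipe it to sh as root on the director.
firewall_farm_rules() {
    mark=1
    localnet=192.168.0.0/23
    # mark all packets coming from the local network
    echo "iptables -t mangle -A PREROUTING -s $localnet -j MARK --set-mark $mark"
    # virtual service keyed on the mark, one realserver per firewall box
    echo "ipvsadm -A -f $mark"
    for fw in 192.168.1.7 192.168.1.8 192.168.1.9; do
        echo "ipvsadm -a -f $mark -r $fw"
    done
}

firewall_farm_rules
```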

25.14. Example: LVS'ing a CIDR block

Matthew S. Crocker wrote:

I would like to put a CIDR block of addresses (/25) through my LVS server. Is there a way I can set one entry for a VIP range so that the load balancing is handled over the entire range?

Horms horms (at) vergenet (dot) net 2001-01-13

Set up fwmark rules on the input chain to match incoming packets for the CIDR and mark them with a fwmark.

e.g.

ipchains -A input -d 192.168.192.0/24 -m 1

Use the fwmark (1 in this case) as the virtual service.

director:/etc/lvs# ipvsadm -A -f 1
director:/etc/lvs# ipvsadm -a -f 1 -r 10.0.0.1
director:/etc/lvs# ipvsadm -a -f 1 -r 10.0.0.2

Miri Groentman, 11 Jul 2001
Is it possible to configure a range of ports rather than a single port?

Joe

if you mean ports for services, yes, see fwmark in the HOWTO. You can also forward a range of IPs.

25.15. Example: forwarding based on client source IP

client A (from 192.x.x.x) should go to realserver 1..3, and client B (from 10.x.x.x) should go to realserver 4..6.

(Julian, 10-05-2000)

Write fwmark rules based on the source IP of the packets. Then create two virtual services, one for each fwmark.

25.16. Example: load balancing multiple class C networks

Ian Courtney wrote:

Basically here at our ISP, we tend to have 2-3 Class C's worth of hosting per server. We would like to move to LVS, but I'm not exactly sure how I should be setting it up.

Chris chris (at) isg (dot) de 2001-01-15

You can use the fwmark option for the loadbalancing

#mark the incoming packets with ipchains
ipchains -A input -s 0.0.0.0/0 -d 192.168.0.0/24 -m 1
#then you can setup your LVS like
director:/etc/lvs# ipvsadm -A -f 1 -s wlc
director:/etc/lvs# ipvsadm -a -f 1 -r 10.10.10.15 -g
director:/etc/lvs# ipvsadm -a -f 1 -r 10.10.10.16 -g

the router should point to the director.

Ian Courtney wrote back:

It didn't work until I aliased all 3 class C's to my director. Do I have to do this?

Julian Anastasov ja (at) ssi (dot) bg 2001-01-16

Yes, only the packets destined for local addresses/networks are accepted. The others are dropped or forwarded to another box.

The next project involves redoing our standard linux web space, which so far consists of about 8 webservers, each hosting at least 2 class C's worth of hosting. I somehow don't think Linux will take nicely to having 16 or more class C's aliased to it.

If possible use netmask <24. I assume you execute (replace with the right Class C nets):

ifconfig lo:1 207.228.79.0  netmask 255.255.254.0
ifconfig lo:2 207.148.155.0 netmask 255.255.255.0
ifconfig lo:3 207.148.151.0 netmask 255.255.255.0

on the director and on each realserver and solve the arp problem using:

echo 1 > /proc/sys/net/ipv4/conf/all/hidden
echo 1 > /proc/sys/net/ipv4/conf/lo/hidden

in the realservers. If you don't want to advertise these addresses using ARP to the Cisco LAN, you can execute the above two commands in the director too.

25.17. Example: proxy server

Thomas Proell, 16 Aug 2000

How do you use fwmark if you want the director to accept packets for a wide range of addresses for which it doesn't have IPs?

(Horms)

Here's a setup I used...

                                               Internet
                                                  |
                                                Router 192.168.128.1
"client"                  Linux Director          |
  va2-------------------------va3-----------------+--------- proxy (va4)
192.168.16.3      192.168.16.1   192.168.128.2        192.168.128.5

I have used 192.168/16, but these could be real addresses too. I have only put one proxy server in the diagram but I did test it with 2

Client: default gw va3 (192.168.16.1)

Linux Director:
eth0: 192.168.128.2             (internet/proxy side)
eth1:  192.168.16.1             (client side)
Default gw: Router, 192.168.128.1
IPV4 forwarding enabled.

Ipvsadm rules - these can be translated into ldirectord configuration.
director:/etc/lvs# ipvsadm -A -f 1 -s wlc
director:/etc/lvs# ipvsadm -a -f 1 -r 192.168.128.3:0 -g -w 1
... add additional proxy servers

Interestingly enough, if you add a proxy that just forwards traffic, then the traffic will end up going direct. This may be useful as a fallback server if the proxy servers fail.

ipchains -A input -s 0.0.0.0/0.0.0.0 -d 127.0.0.1/255.255.255.255 -j ACCEPT
ipchains -A input -s 0.0.0.0/0.0.0.0 -d 192.168.128.2/255.255.255.255 -j ACCEPT
ipchains -A input -s 0.0.0.0/0.0.0.0 -d 192.168.16.2/255.255.255.255 -j ACCEPT
ipchains -A input -s 0.0.0.0/0.0.0.0 -d 0.0.0.0/0.0.0.0 80 -p tcp -j REDIRECT 80 -m 1

The -m 1 means that IPVS will recognise packets matched by this filter as belonging to the virtual service, as long as it sees the packets as local. -j REDIRECT 80 makes the packets appear as local. Note that the port you redirect to is _ignored_ because of the way IPVS works: packets using fwmark are sent to the port they arrived on. This means that packets will be sent to the proxy servers as port 80 traffic.

Proxy:
eth0: 192.168.128.5
Default gw: 192.168.128.1 (router)
IPV4 forwarding enabled.

ipchains -A input -s 0.0.0.0/0.0.0.0 -d 127.0.0.1/255.255.255.255 -j ACCEPT
ipchains -A input -s 0.0.0.0/0.0.0.0 -d 192.168.128.5/255.255.255.255 -j ACCEPT
ipchains -A input -s 0.0.0.0/0.0.0.0 -d 0.0.0.0/0.0.0.0 80:80 -p 6 -j REDIRECT +8080

Note, this is where the redirection to port 8080 takes place.

25.18. Example: transparent web cache

Pongsit (at) yahoo (dot) samart (dot) co (dot) th May 08, 2000

If I would like to use LVS to balance 3 transparent proxies, is this how I do it?

                 Internet
                    |
                    |
 ------------------------------------------- hub 1
          |          |           |
          |eth0      |           |           proxy1 ,2 and 3 set as a
        proxy1     proxy2      proxy3        transparent proxy with firewall
          |eth1      |           |           where eth0 connect to internet
          |          |           |           and eth1 to the internal network
 ___________________________________________
             |          |     |     |    |    hub 2
             |          |     |     |    |
          LVS/DR       client machines   |
                                         |
                                         |
 ___________________________________________  hub 3 if i have more internel
                                                   users

Horms horms (at) vergenet (dot) net 2000-05-08

If you want to do transparent proxying then I would suggest a topology more along the lines of:

                 Internet
                    |
                    |
------------------------------------------------ hub 1
                    |
                    |
                 LVS/DR
                    |
                    |
________________________________________________
   |      |      |      |     |     |    |    hub 2
   |      |      |      |     |     |    |
 proxy1 proxy2 proxy3  client machines   |
                                         |
                                         |
_________________________________________________
                                               hub 3 if i have more internel users

Use ipchains to mark all outgoing port 80 traffic, other than that from the 3 proxy servers, with firewall mark 1 (ipchains -m 1...).

Set up an IPVS virtual service matching fwmark 1 (ipvsadm -A -f 1...).

The proxy servers will need to be set up to recognise all port 80 traffic forwarded to them as local.

This way all outgoing traffic hits the LVS box. If it is for port 80 and isn't from one of the proxy servers then it gets load balanced and forwarded to one of the proxy servers.

You may want to consider a hot standby LVS/DR host to eliminate a single point of failure on your network.

I haven't tested this but I think it should work.
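
The marking step can be sketched for a 2.4 kernel with iptables. Like the suggestion above, this sketch is untested. The proxy addresses are hypothetical, the function only prints the commands, and it does not show how the director is made to treat the marked packets as local, which IPVS still requires. Packets from the proxies are accepted unmarked first, so that only client traffic reaches the MARK rule.

```shell
#!/bin/sh
# Dry run: prints the commands for the proxy addresses given as arguments.
transparent_proxy_rules() {
    mark=1
    # traffic that already comes from a proxy must not be marked
    for proxy in "$@"; do
        echo "iptables -t mangle -A PREROUTING -s $proxy -p tcp --dport 80 -j ACCEPT"
    done
    # everything else heading for port 80 gets the fwmark
    echo "iptables -t mangle -A PREROUTING -p tcp --dport 80 -j MARK --set-mark $mark"
    # one fwmark virtual service, one realserver entry per proxy (LVS-DR)
    echo "ipvsadm -A -f $mark"
    for proxy in "$@"; do
        echo "ipvsadm -a -f $mark -r $proxy -g"
    done
}

# hypothetical proxy addresses
transparent_proxy_rules 192.168.1.11 192.168.1.12 192.168.1.13
```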

25.19. Example: Multiply-connected router

(Joe: my initial -apparently incorrect- reaction was that routing protocols would handle this better.)

Martin Skøtt martin (at) xenux (dot) dk 19 Jun 2001

I have several ADSL connections to the Internet (same ISP) and I want all the users on my network to be using them. I would like it to work in a way so that all the lines are utilised all the time and without assigning groups of users to specific gateways.

                                              Internet
                                            /
My users ---- Linux box with LVS - Internet
                                            \
                                              Internet

What I want to do is assign one default gateway, the LVS box.

Joe

To do this with LVS, you could set up a director as a router and configure it as if it were in front of 4 squid boxes (you'll need the IPs of the other end of the ADSL link).
There's an example proxy setup above.

Alexandre Cassen Alexandre (dot) Cassen (at) wanadoo (dot) fr

I tried that kind of setup some time ago. I have tested 4 different topologies:

1. Using a dynamic routing protocol like BGP. With BGP you can assign costs to your routing paths. To set up a multipath Internet connection using BGP, all the ISPs connected to your BGP setup must be asked to configure BGP on their side. This setup is recommended by ISPs for corporate Internet use. It is usually expensive because of the router reconfiguration needed on the ISP side.
2. Implementing a loadsharing topology as described in the "Linux 2.4 advanced routing HOWTO", section 9.5. Here you need to use the same ISP for all your Internet connections because your ISP must implement the symmetric config. This means that the ISP must support Linux 2.4 loadsharing over multiple interfaces. This is rarely implemented by ISPs because they find vendor integration, which is more expensive, much more attractive. This is my experience in France :/
3. Setting up a router with multiple default gateways. That way you will loadbalance by TCP conversation. I have only implemented this on Cisco; you are limited to the maximum number of default gateways implemented (3 or 4 for Cisco).
4. Implementing the solution described in the LVS HOWTO (above): loadbalancing a squid server pool, with each squid directly connected to one of your ADSL lines.

Personally, I prefer the LVS solution, which is much easier and which I recommend because it is independent of the ISP's configuration. I have tested it on an RTSP proxy pool.

25.20. httpd clients (browsers)

Initially when testing you should use a non-persistent (in the netscape sense) client, e.g. telnet VIP 80, or lynx VIP. Or else revert to these if you don't understand what you're seeing with netscape.

25.20.1. Opera

Peter Mastren Peter (dot) Mastren (at) chron (dot) com 18 Dec 2001

For the past several weeks, we have experienced almost daily denial of service attacks/events on our www servers. A remote client somewhere has opened a number of TCP connections to LVS that have absolutely no traffic whatsoever, save a single keepalive packet every two minutes. I have seen as few as 3 and as many as 120 connections in the various incidents over the weeks.

These open connections are counted in the algorithm LVS uses to schedule servers, so the server that has all these open connections receives proportionately fewer new connections, in most cases taking the target server completely out of rotation.

Yesterday, I noticed an event coming from 130.80.XXX.XXX, our firewall address. Three connections were being held open from a machine inside our network. The culprit was my own workstation. I killed my browser and the connections went away. I fired up my browser again and tried to retrace my steps to duplicate the situation.

To make a long story short, it appears that Opera version 6.0 beta will leave a connection open to a server even after the window that was used for that connection has been closed. The only time the connection is closed is when Opera exits.

I will submit a problem report to Opera, in the meantime, there could be hundreds if not thousands of beta Opera browsers out there that could lock up ports on our servers for hours or days or longer.

This morning I made a configuration change to LVS that seems to have solved the problem. The masquerading portion of the Linux kernel (using LVS_NAT) uses default times to keep connections open, one for TCP connections, one for closed TCP connections that have received a FIN, and one for UDP connections. These defaults are 15, 2, and 5 minutes respectively. I changed the TCP timeout from 15 minutes to 110 seconds, which is shorter than the two minute interval between keepalive packets, yet long enough for any imaginable connection to a web server.

The change I made was:

      ipchains -M -S 110 0 0

Wensong Aug 2002,
for 2.4 kernels, set the three timeouts (in seconds; a value of 0 leaves that timeout unchanged) with
director:/etc/lvs# ipvsadm --set tcp tcpfin udp

25.21. Example: dynamically generated images in webpages

One of the assumptions of setting up an LVS is that the content presented on the realservers is identical. This is required because the client can be sent to any of the realservers. This requirement is not handled if the client fills in a form which produces a gif on the realserver.

Alois Treindl alois (at) astro (dot) ch 30 Apr 2001

If a page is created by a CGI and contains dynamically created GIFs, the requests for these gifs will land on a different realserver than the one where the cgi runs. Will I need persistence?

I am running an astrology site; a typical request is to a CGI which creates an astrological drawing, based on some form data; this drawing is stored as a temporary GIF file on the server. A html page is output by the CGI which contains a reference to this GIF.
The browser receives the html, and then requests the GIF file from the server. It will mostly hit a different server than the one who created the GIF.
So either we make sure that the new client request for the GIF hits the same realserver which ran the CGI (i.e. have persistence) or we must create the GIF on a shared directory, so that each realserver sees it.
I have not tested it yet (not ported the CGIs yet to the new LVS box) but I think things are not so simple. In a 'rr' scheduling configuration, for example, the scheduler could play dirty, depending on the number of http requests for the given page, and the number of realservers. Both could be incommensurable in a way that the http request for the GIF never reaches the same realserver as the one which ran the CGI request.
I had already decided that I need shared directories between all realservers for our CGI environment which does computationally expensive things all the time. Some CGIs create also data files which are used by later CGIs. It is either shared directories for such files, or a shared database (which we also use).
These temp files will be sitting in the RAM cache of the NFS server, so that only network bandwidth between the realservers and the NFS server is the limiting factor. This is why I give the NFS server 2 gb of RAM, the max it will physically take, and this is why I chose 2.2.19 as the kernel because it contains NFS-3, which is said to be faster than NFS-2.

(Joe)

I tested it here on a page which generates a gif for the client. I found that I could never get the gif. Presumably after downloading the page containing the reference to the gif, the round robin scheduler sends the request for the gif to another realserver.

Presumably even page counters will have this problem. Writing to a shared directory should work.

Here's a solution with persistent fwmark using ip_tables to setup on a 2.4.x kernel. (Note: for page counters, this method will increment for each realserver, and not for the total page count over all the realservers as would happen with a shared directory.)

#put fwmark=1 on all tcp packets for VIP:http arriving on eth0
director:# iptables -t mangle -A PREROUTING -i eth0 -p tcp -s 0.0.0.0/0 -d 192.168.1.110/32 \
	--dport http -j MARK --set-mark 1
#setup a 2 realserver LVS to persistently forward packets with fwmark=1 using rr scheduling.
director:# ipvsadm -A -f 1 -s rr -p 600
director:# ipvsadm -a -f 1 -r RS1.mack.net:0 -g -w 1
director:# ipvsadm -a -f 1 -r RS2.mack.net:0 -g -w 1
#output setup
director:# iptables -L -t mangle
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
MARK       tcp  --  anywhere             lvs.mack.net       tcp dpt:http MARK set 0x1
director:# ipvsadm
IP Virtual Server version 0.2.11 (size=16384)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port             Forward Weight ActiveConn InActConn
FWM  1 rr persistent 600
  -> RS2.mack.net:0                 Route   1      0          0
  -> RS1.mack.net:0                 Route   1      0          0

Here's the output of ipvsadm after the successful generation and display of the dynamically generated gif. Note all connections went to one realserver.

director:/etc/lvs# ipvsadm
IP Virtual Server version 0.2.11 (size=16384)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port             Forward Weight ActiveConn InActConn
FWM  1 rr persistent 600
  -> RS2.mack.net:0                 Route   1      5          3
  -> RS1.mack.net:0                 Route   1      0          0

25.22. Example: Balancing many IPs/services as one block

The simplest LVS balances requests to a VIP:port amongst a group of realservers. If you are servicing many VIPs, then few requests may be present for any particular IP at any time and a disproportionate number of requests will be sent to the first realserver. In this case you should balance all the different IPs as one group.

Josh Marcus josh (at) serve (dot) com 02 Oct 2001

I'm using LVS to serve a few thousand domains, but I don't see how I can setup LVS to load balance all of the domains as if they were all a single ip. In my ideal world, I would have a single entry *:80 that would forward all of our ips at port 80 to our set of realservers, and load balance all requests coming in. The way LVS is working for us now, the vast majority of all of our requests are going to the server that is for some reason being listed first. Only sites with heavy traffic get pushed along to the other servers.

Michael E Brown michael_e_brown (at) dell (dot) com

fwmarks

25.23. Example: Source controlled LVS - services and realserver customised by Client IP

In an LVS, you may want requests from a certain IP/netmask to be forwarded to one set of realservers/services (which may be a subset of the total realservers, or may be other dedicated realservers), while the rest of the requests are forwarded normally to the whole LVS.

Or another way of putting it... You may want 2 (or more) LVSs set up on the one director, with one of the LVSs accepting only packets from an IP/netmask, while the rest of the requests go to the other LVS.

Peter Mueller pmueller (at) sidestep (dot) com 18 Apr 2002

Source-controlled routing gives us a few advantages:

- When clients inside our company launch the client for our product (Sidestep), we want that client to redirect automatically to staging. Going to staging directly means it is easier to test code, etc. This is a small advantage and is merely the "proving grounds" or first step.
- One of our customers has a proxy server java-code caching problem (their client doesn't work) and we want to steer them to a server that won't have the problem. Unfortunately the customer is not technically competent, and we'd like to avoid changing anything at their end.
- It'd be nice to redirect our competitors/delinquent customers to a machine that had incorrect or out of date information. Surely most companies would think this is a cool feature!
- It is advantageous to have more control in case of mishap.

Julian supplied the recipe.
#for each $client, mark their packets with fwmark 1
director:# ipchains -A input -p TCP -s $client -d VIP 80 -m 1 -j ACCEPT
.
.
#create an LVS for packets with fwmark 1
director:# ipvsadm -A -f 1 -s wlc
director:# ipvsadm -a -f 1 -r $real_server
#create LVS for other client IPs (or for everyone) i.e. normal LVS setup here
.
.

Here's the implementation for iptables.

Armin.Haken Armin (dot) Haken (at) Sun (dot) COM 10 Feb 2006

How to forward packets based on source address

Using the fwmarks in iptables you can create ipvs rules to forward packets to particular realservers or groups of realservers based on source address or source network. I got this information out of a December 2002 post to the lvs-users group by Ratz

Here is an example using LVS-NAT with 2 realservers with RIP 10.1.1.1 and 10.1.1.2. The first realserver serves clients on the 10.0.1.X network and the 10.0.5.X network, while the other realserver serves clients on 10.0.2.X. The VIP is 10.0.1.1.

Packets destined for server 1 get mark 1, packets destined for server 2 get mark 2

iptables -t mangle -A PREROUTING -s 10.0.1.0/24 -d 10.0.1.1 -j MARK --set-mark 1   
iptables -t mangle -A PREROUTING -s 10.0.5.0/24 -d 10.0.1.1 -j MARK --set-mark 1 
iptables -t mangle -A PREROUTING -s 10.0.2.0/24 -d 10.0.1.1 -j MARK --set-mark 2

The following command shows you counters of matched rules

iptables -t mangle -L PREROUTING -n -v

ipvs forwards based on the marks

ipvsadm -A -f 1
ipvsadm -A -f 2
ipvsadm -a -f 1 -r 10.1.1.1 -m
ipvsadm -a -f 2 -r 10.1.1.2 -m

The iptables rules also allow you to specify the protocol or interface of the packets you mark and you can use negations, specify port numbers, etc. If a packet matches several of the rules, the marks get overwritten so the last matching rule determines the mark.
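
The "last matching rule determines the mark" behaviour can be illustrated with a small simulation in shell. It does not touch iptables; it only walks an ordered rule list the way the marks would be applied. The overlapping 10.0.0.0/16 rule and all function names are hypothetical, added to make the overwrite visible.

```shell
#!/bin/sh
# Illustration only: simulates "the last matching rule determines the mark".

# ip2int A.B.C.D -> prints the address as a 32-bit integer
ip2int() {
    old_ifs=$IFS
    IFS=.
    set -- $1
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# in_cidr IP NET/PREFIX -> succeeds if IP lies inside the network
in_cidr() {
    net=${2%/*}
    bits=${2#*/}
    mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
    [ $(( $(ip2int "$1") & mask )) -eq $(( $(ip2int "$net") & mask )) ]
}

# Source network and mark, in rule order. The /16 rule is a hypothetical
# overlap added for this example.
RULES="10.0.0.0/16 1
10.0.2.0/24 2"

# final_mark SRC -> prints the mark left on a packet from SRC (0 = unmarked)
final_mark() {
    mark=0
    while read -r cidr m; do
        if in_cidr "$1" "$cidr"; then
            mark=$m
        fi
    done <<EOF
$RULES
EOF
    echo "$mark"
}

final_mark 10.0.1.9    # matches only the /16 rule: prints 1
final_mark 10.0.2.9    # matches both rules, the later one wins: prints 2
```

If the two rules were listed in the opposite order, a client in 10.0.2.0/24 would end up with mark 1, so the more specific rules belong after the general ones.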

For failover you could either configure multiple realservers per fwmark or put in a system that changes the marking rules or forwarding rules once a failed realserver is detected.

25.24. Appendix 1: Specifications for grouping of services with fwmarks

Here are the discussions that have resulted in the current specifications for handling of persistence with fwmarks in LVS.

Ted Pavlic Jul 14, 2000

What I was asking about would be something like this:
	virtual=192.168.6.2-192.168.6.30:80
	real=192.168.6.240:80 gate
	service=http
	request="index.html"
	receive="Test Page"
	scheduler=rr
I have 1029 virtual servers -- that is I have 1029 hosts which need to be load balanced.

Horms horms (at) vergenet (dot) net 2000-07-14

(fwmark) has the advantage of simplifying the amount of _kernel_ configuration that has to be done which is a big win, even if this is automated by a user space application. The basic idea is that this provides a means for LVS to have virtual services that have more than one host/port/protocol triplet. In your situation this means that you can have a single virtual service that handles many virtual IP addresses and all ports and protocols (UDP, TCP and

You should take a look at ultramonkey (note from Joe, April 2001, UM is now 1.0.2, look for examples there). My understanding is that this is quite similar to how your LVS topology will be set up, though I understand you will be having more than one of these configured.

Basically, you set up a fwmark virtual service just like any other LVS virtual service, except that no VIP is specified.

e.g.

director:/etc/lvs# ipvsadm -A -f 1 -s rr
director:/etc/lvs# ipvsadm -a -f 1 -r 192.168.6.3:80 -m
director:/etc/lvs# ipvsadm -a -f 1 -r 192.168.6.2:80 -m
director:/etc/lvs# ipvsadm -L -n
IP Virtual Server version 0.9.11 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port          Forward Weight ActiveConn InActConn
FWM  1 rr
  -> 192.168.6.3:80              Masq    1      0          0
  -> 192.168.6.2:80              Masq    1      0          0

The other half of the equation is that ipchains is used to match incoming traffic for virtual IP addresses and mark them with fwmark 1. Say you have 8 contiguous class C's of virtual addresses beginning at 192.168.0.0/24. The ipchains command to set up matching of these packets would be:

ipchains -A input -d 192.168.0.0/21 -m 1

You also need to set up a silent interface so that the LVS box sees traffic for the VIPs as local. To do this use:

ifconfig lo:0 192.168.0.0 netmask 255.255.248.0 mtu 1500
echo 1 > /proc/sys/net/ipv4/conf/all/hidden
echo 1 > /proc/sys/net/ipv4/conf/lo/hidden

Now, as long as 192.168.0.0/21 is routed to the LVS box, or more particularly the floating IP address of the LVS box brought up by heartbeat, traffic for the VIPs will be routed to the LVS box, the ipchains rules will mark it with fwmark 1 and LVS will see this fwmark and consider the traffic as destined for a virtual service.
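
The /21 prefix and the 255.255.248.0 netmask used above follow from the number of class C networks being aggregated. A small sketch of the arithmetic, with hypothetical function names; it assumes the number of networks is a power of two and that the first network is aligned on a block of that size, as 192.168.0.0 is here.

```shell
#!/bin/sh
# prefix_for_class_cs N -> prefix length covering N contiguous /24 networks.
# N must be a power of two and the first network must be aligned on that size.
prefix_for_class_cs() {
    n=$1
    prefix=24
    while [ "$n" -gt 1 ]; do
        n=$(( n / 2 ))              # each doubling of the block ...
        prefix=$(( prefix - 1 ))    # ... shortens the prefix by one bit
    done
    echo "$prefix"
}

# prefix_to_netmask PREFIX -> prints the dotted-quad netmask
prefix_to_netmask() {
    m=$(( (0xFFFFFFFF << (32 - $1)) & 0xFFFFFFFF ))
    echo "$(( (m >> 24) & 255 )).$(( (m >> 16) & 255 )).$(( (m >> 8) & 255 )).$(( m & 255 ))"
}

prefix=$(prefix_for_class_cs 8)
echo "192.168.0.0/$prefix netmask $(prefix_to_netmask "$prefix")"
# prints 192.168.0.0/21 netmask 255.255.248.0
```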

Ted Jul 14, 2000

for me to enable persistent connections to every port using direct routing, would this work?
director:/etc/lvs# ipvsadm -A -f 1 -s rr -p 1800
director:/etc/lvs# ipvsadm -a -f 1 -r 216.69.192.201:0 -g
director:/etc/lvs# ipvsadm -a -f 1 -r 216.69.192.202:0 -g

Horms

Yes, that would work. The port in the "ipvsadm -a" commands is ignored if the realservers are being added to a fwmark service. Connections are sent to the same port on the realserver as the port they were received on at the virtual server. So port 80 traffic will go to port 80, port 443 traffic will go to port 443 etc...

As a caveat you should really make sure that your ipchains statements catch all traffic for the given addresses including ICMP traffic so ICMP traffic is handled correctly by LVS.

(Julian on catching ICMP traffic)

IIRC, this is already not a requirement in the last LVS versions. If we look in skb->fwmark for ICMP packets it is impossible to use normal and fwmark virtual services to same VIP because we can't create such ipchains rules. The good news is that in 2.4 (0.0.3) the virtual service lookup (the fwmark field) is used only for the new connections. In 2.2 the service is looked up even for existing entries but we don't want to break the MASQ code entirely

Ted Pavlic tpavlic (at) netwalk (dot) com 19 Jul 2000

When using fwmark to assign realservers to virtual servers, how is scheduling and persistence handled?

In my particular example, I have: 216.69.196.0/22 (ie 4 class C networks) all marked with a fwmark of 1. ipvsadm setup is

Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port          Forward Weight ActiveConn InActConn
FWM  1 lc persistent 600
-> nw01:0                      Route   1      0          0
-> nw02:0                      Route   1      0          0

Say someone connects to 216.69.196.1 and the connection is assigned to nw01. At this point ipvsadm shows

Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port          Forward Weight ActiveConn InActConn
FWM  1 lc persistent 600
-> nw01:0                      Route   1      1          0
-> nw02:0                      Route   1      0          0

A new person connects to another IP in 216.69.196.0/22 (say 216.69.196.2). Will this new connection to 216.69.196.2 go to nw02 because it has the least number of TOTAL connections, or will it go to nw01 because for that PARTICULAR IP, both have 0 connections?

Now then say that the person who just connected to 216.69.196.1 makes a connection (within the 600 persistence seconds) to 216.69.196.3. Will this new connection go to nw01 because it's being persistent? Or will it go to either server depending on the number of connections?

Here's what I think would be the best way to do things...

If multiple IPs are marked with FWMARK 1, LVS should consider them all one entry in its active/inactive table. I don't believe that's how things are currently being handled.

(Julian)

The templates are not accounted in the active/inactive counters.

(Joe, almost a year later - Julian, what do you mean here?) (Julian 13 Apr 2001)

Ted here thinks that the templates are accounted in the inactive/active counters. And before the persistent-fwmark patch we can have many templates for one fwmark-based service:
CIPNET:0 -> VIP1:0 -> RIP_A:0
CIPNET:0 -> VIP2:0 -> RIP_B:0
where VIP1 and VIP2 are marked with same fwmark.
Ted recommends these two templates to be replaced with one, i.e. just like in the persistent-fwmark patch:
CIPNET:0 -> FWMARK:0 -> RIP1:0

We can't see the templates (which are normal connection entries with some reserved values in the connections structure fields) accounted in the inactive/active counters. The reason for this is that the inactive/active counters are used to represent the realserver load but our templates don't lead to any load in the realservers, we use them only to maintain the persistence.

When a service is marked persistent all connections from CIP to VIP go to same RIP for the specified period. Even for the fwmark based services. This works for many independent VIPs.

The other case is fwmark service covering a DNS name. I expect comments from users with SSL problems and persistent fwmark service. Is there a problem or may be not?

I agree, maybe both cases can be useful:

1. CIP->VIP
2. CIP->FWMARK

Any examples where/why (2) is needed?

But switching the LVS code always to use (2) for the persistent fwmark services is possible.

(Ted)

In my opinion, here are some pros and cons of case 2:

Pros:

Improves scheduling, I think, and true load balancing. If someone is using [W]RR or [W]LC, the LVS box will actually look at the realservers as a whole rather than separate realserver entries for EACH VIP. Does that make sense?

For example, in my particular configuration I have over one thousand VIPs which are load balanced onto four RIPs. When I configure the LVS server to use LC scheduling, I'd like it to look at how many TOTAL connections are being made to each RIP not how many connections are being made to each RIP PER VIP. I would like to load balance all one thousand VIPs as a WHOLE onto the four RIPs rather than load balance EACH VIP.

That is, in some of my less active sites, most of their traffic will probably hit one VIP just because not much traffic will need to be load balanced. However, more active sites will hit both servers. The load will then not be distributed equally among the servers as one server will probably get not only the active traffic but also the less active traffic and the other server will only get the more active traffic (in the case of having two RIPs).

Cons:

One person on the Internet will keep connecting to the same RIP for many different VIPs if persistence is turned on.

If this causes a problem, the LVS administrator can do one of two different things:

1) Rather than load balancing a fwmark template, go back to load balancing specific VIPs. The scheduling will then be unique for those particular VIPs.

2) Create multiple fwmark templates. The scheduling for each template will be unique.

In my opinion if you group a bunch of IPs together by marking them with an fwmark, that you say that you want to load balance all of those COLLECTIVELY -- almost like load balancing one site.

I'm just saying, are there any examples where CIP->FWMARK is not needed?

As far as the LVS is concerned, if someone connects to a VIP marked with fwmark 1, it should treat it just like every other VIP marked with fwmark 1 -- as if they were all one VIP.

But today on my LVS (where I have a ten minute persistence setup) I connected to one virtual server marked with fwmark 1 and got a certain real server. I then expected to connect to another virtual server also marked with fwmark 1 and get that same realserver. I did not, however. If what you're telling me is correct, the persistence should have connected me to the same realserver as long as I was connecting within that ten minute window.

Now in this particular example -- connecting to DIFFERENT virtual servers -- it isn't so necessary for persistence to be carried through PER virtual server. I'm just worried that least connection scheduling and round-robin scheduling aren't working at the fwmark level -- I'm worried that they are working at the VIP level as if I had setup hundreds of explicit VIP rules inside IPVSADM.

Julian
I hope this feature (2) will be implemented in the next LVS version (if Wensong doesn't see any problems). I.e. the templates can be changed to case (2) for the persistent fwmark services. For now we (Horms and I) don't see any problems with this change. Then connections from one client IP to different VIPs (from the same fwmark service) will go to the same realserver (only for the persistent fwmark services).

Do you see any reason why enabling CIP->FWMARK for all cases would be a bad thing?

That is, not only using case 2 for persistent fwmark, but whenever fwmark is used. Personally, I cannot foresee a scenario where a person would set up an fwmark for load balancing and want each VIP associated with that fwmark to act independently.

Web cluster for independent domains (VIPs). fwmark service is used only to reduce the amount of work for configuration.

I've always thought that the scheduling algorithms should look directly at the realservers rather than at the realserver stats for each particular virtual server. That is, least connection scheduling would look at the total number of connections on a realserver, not just the connections from that particular VIP. Round-robin would go round-robin from realserver to realserver based on the last connection from ANY VIP to the realservers... However, before fwmark I realized that this would probably be very difficult to do, especially in cases where an LVS administrator was load balancing to a number of different realserver clusters that may overlap.

This is a job for the user space tools: WRR scheduling method + weights derived from the realserver load. Yes, one real server can be loaded from:

many directors
many virtual services
other processes that are not part of the real service

In this case the director's opinion (for each virtual service) about the realserver load is wrong. The only way to handle such a case properly is to use the WRR method. In the other cases WLC, LC and RR can do their job.
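
A user space tool of the kind Julian describes could be sketched as below. This is only an illustration: the weight formula, the fwmark number and the realserver names are assumptions, and how the load average reaches the director (ssh, SNMP, a monitoring agent) is left out. The script prints the ipvsadm commands instead of running them.

```shell
#!/bin/sh
# Sketch only: map a realserver's load average to a WRR weight.
# The formula (weight = 100 / (1 + load), minimum 1) is an arbitrary
# illustration, not anything prescribed by LVS.
load_to_weight() {
    # $1 = load average, e.g. 0.75
    awk -v la="$1" 'BEGIN {
        w = int(100 / (1 + la))
        if (w < 1) w = 1
        print w
    }'
}

# Print (not run) the ipvsadm command that would apply the new weight.
# $1 = fwmark, $2 = realserver, $3 = load average
reweight_cmd() {
    echo "ipvsadm -e -f $1 -r $2 -g -w $(load_to_weight "$3")"
}

reweight_cmd 1 nw01 0.00
reweight_cmd 1 nw02 3.00
```

The weights only matter if the virtual service uses a weighted scheduler such as wrr, and the -g flag assumes LVS-DR forwarding.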

fwmark, to me, just by causing all VIPs marked with a particular fwmark to look like one big VIP, makes it possible to do basically what I just described. I don't see why anyone would not want such functionality with the fwmark services. If someone did not want it, he would probably partition the VIPs associated with his fwmark into separate fwmarks or even explicit VIP entries anyway.

Yes. IMO, this could be a problem only for the balancing, but I don't think it will be. The problems will come when one realserver dies and the client can't access any VIP that is part of the fwmark service for a period of time.

25.25. Appendix 2: Demonstration of grouping services with fwmarks

Here's the original e-mail between Ted tpavlic (at) netwalk (dot) com 3 Aug 2000 and Joe

One of the things fwmarks let me do is make ports sticky by groups.

Basically I set up ipchains rules that say all packets to ports 80/tcp and 443/tcp get marked with a 1. All packets to ports 20/tcp and 21/tcp as well as 1024:65535/tcp get marked with a 2. Voila... I just made ports stick by groups.

I then go into IPVS and set up my realservers under FWMARK1 and FWMARK2. Ports 80 and 443 are now persistent as a group, just as 20 and 21 and 1024:65535 are persistent as a group. If my HTTP goes down on one of my realservers, I do not have to take my FTP down as well. I only have to remove the realserver from the FWMARK1 group. It's great!
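
The marking step can be sketched as follows. Ted's setup used ipchains; this sketch substitutes iptables (mangle table, MARK target), and the VIP is a placeholder address, so treat it as an illustration rather than his actual rules. The script prints the rules instead of installing them.

```shell
#!/bin/sh
# Sketch only: print iptables rules that mark packets by port group.
# VIP is a placeholder address; the original setup used ipchains.
VIP=192.168.1.110

# $1 = fwmark, $2 = comma-separated ports/ranges for the multiport match
mark_rule() {
    echo "iptables -t mangle -A PREROUTING -d $VIP -p tcp -m multiport --dports $2 -j MARK --set-mark $1"
}

mark_rule 1 80,443            # HTTP and HTTPS stick together
mark_rule 2 20,21,1024:65535  # FTP control, data and passive ports
```

Each mark then becomes one virtual service on the director, so the two port groups can be made persistent and managed independently.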

Joe
most people don't program their own on-line transaction processing program and the point of an LVS is for the realservers to be running the same code as when they're stand alone.

My users run PHP scripts as well as ASPs that keep session information. That session information is unique per server and usually is stored in a local /tmp directory. Users are handed cookies which tie them to their session information. If they go to the wrong realserver, that session information won't exist and a number of things could go wrong.

Most of my realservers run a lot of services... HTTP, HTTPS, FTP, SMTP, POP3, IMAP, DNS. When one of them went down (with persistence set up), I would have to take the entire realserver down.

Several problems:

*) One little thing goes down... POP3, for example. Now the load increases a great deal on all my other realservers... Perhaps causing the load to become so high that sendmail starts rejecting connections... and then THAT realserver also is taken COMPLETELY down... domino effect. If I could have just taken POP3 down off of that server, it would have been perfect.

*) Say something horrible happens causing sendmail to go down on all the servers... or HTTP... or POP3... any one service -- just as long as it goes down on all servers. Rather than just causing that service to be affected, ALL of my services go down because every realserver was taken completely off-line until that ONE service is fixed. :(

But I figured that those two problems wouldn't be that big of a deal... I could probably put such a system in production.

Well -- I put such a system in production and those problems weren't that big of a deal... Except for a COUPLE of times when all services went down and caused a BIG hassle. So my superiors wanted something better -- needed in fact.

So at first I came up with the interim idea of separating persistent services and non-persistent services by IP. All of my persistent services were basically on one supernet and all of my non-persistent services were on another subnet. Consequently, I could tie the one supernet to one FWMARK and the other subnet to another FWMARK. Now if a persistent service went down, it would bring down only all of the persistent services. Also, if a non-persistent service went down, it would only bring down all of the non-persistent services.

This was definitely an interim solution because it required a lot more IPs than any one administrator should need, and it still was far from perfect.. BUT... I started to realize that just as I could mark different supernets and subnets with different FWMARKs, I could go farther down the TCP/IP layers and mark things at their protocol and port level. That's where I realized that we COULD do persistence by port group with just a little help from ipchains.

Joe
I asked Horms if there was any point in having multiple fwmarks. His only example was if you had duplicate sets of realservers. Eg the paying customers get the fast servers, while the people coming into the free site get the 486 with 16M.

Similar idea here... except rather than setting up your policies like:

Paying customers -> fast server
Free -> slow server

You have:

SMTP -> a realserver
POP3 -> another realserver
HTTP/HTTPS -> yet another realserver
FTP -> and another realserver

The key of it all is the fact that you can group by about any parameter that ipchains can see. If ipchains can segregate it, you can group it. Anything that ipchains can do IPVS can then add onto itself.

Joe
Have you solved passive ftp without using persistence?

I really don't think there's any way to get around it... In order to get passive FTP to work, you need to make TCP port 21 persistent with every TCP port above 1024. I mean -- how else could you do it without putting some big brother software inside of LVS which would keep an eye on FTP and see what port it tells the end-user to connect to?

Still, putting 21 and 1024:65535 together is a lot better than putting everything together. Personally I only plan on load balancing things in the <1024 range anyway, so I have no problem including that huge group above 1024.

This is my setup

FWMARK1 => HTTP/HTTPS (persistent)
FWMARK2 => FTP (persistent)
FWMARK3 => SMTP
FWMARK4 => POP3
FWMARK5 => DOMAIN
FWMARK6 => IMAP
FWMARK7 => ICMP (for kicks)

================
IP Virtual Server version 0.9.12 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port          Forward Weight ActiveConn InActConn
FWM  1 lc persistent 600
  -> nw04:0                      Route   1      58         121
  -> nw03:0                      Route   1      49         76
  -> nw02:0                      Route   1      60         98
  -> nw01:0                      Route   1      61         44
FWM  2 lc persistent 600
  -> nw04:0                      Route   1      0          2
  -> nw03:0                      Route   1      0          2
  -> nw02:0                      Route   1      1          13
  -> nw01:0                      Route   1      1          0
FWM  3 lc
  -> nw04:0                      Route   1      4          11
  -> nw03:0                      Route   1      4          12
  -> nw02:0                      Route   1      3          20
  -> nw01:0                      Route   1      3          16
FWM  4 lc
  -> nw04:0                      Route   1      3          54
  -> nw03:0                      Route   1      1          74
  -> nw02:0                      Route   1      3          51
  -> nw01:0                      Route   1      2          73
FWM  5 lc
  -> nw03:0                      Route   1      0          46
  -> nw01:0                      Route   1      0          44
  -> nw02:0                      Route   1      0          45
  -> nw04:0                      Route   1      0          45
FWM  6 lc
  -> nw04:0                      Route   1      0          0
  -> nw03:0                      Route   1      0          0
  -> nw02:0                      Route   1      1          0
  -> nw01:0                      Route   1      0          0
FWM  7 lc
  -> nw04:0                      Route   1      0          0
  -> nw03:0                      Route   1      0          0
  -> nw02:0                      Route   1      0          0
  -> nw01:0                      Route   1      0          0
==============
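
A table like this one could be built with commands along the following lines. This is a sketch, not Ted's actual script: the scheduler, persistence timeouts, weights and realserver names are read off the listing, the -g flag is assumed from the "Route" forwarding column, and the option spellings are those of current ipvsadm. The script prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch only: print ipvsadm commands that would build a table like the
# listing (7 fwmark services, 4 realservers each, direct routing).
REALSERVERS="nw01 nw02 nw03 nw04"

# $1 = fwmark, $2 = persistence timeout in seconds (empty for none)
fwmark_service() {
    if [ -n "$2" ]; then
        echo "ipvsadm -A -f $1 -s lc -p $2"
    else
        echo "ipvsadm -A -f $1 -s lc"
    fi
    for rs in $REALSERVERS; do
        echo "ipvsadm -a -f $1 -r $rs -g -w 1"
    done
}

fwmark_service 1 600   # HTTP/HTTPS, persistent
fwmark_service 2 600   # FTP, persistent
for mark in 3 4 5 6 7; do
    fwmark_service "$mark" ""   # SMTP, POP3, DOMAIN, IMAP, ICMP
done
```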

Is this anything new?

Joe
It's new to me and Horms didn't have any other ideas for multiple fwmarks 3 weeks ago, so I expect it will be new to him.

I've been thinking of ways of combining different programs which already exist out there to get L7 scheduling working. For example -- you have some program (sorta like policy routing but one more layer up) that filters packets at the application layer and does something to them... routes them to a particular IP... something like that... and then have ipchains mark each one of those packets with a particular mark... and have LVS work from there.

You see -- using multiple fwmarks makes me think that you can do a lot more with LVS.

We could probably borrow some of the ideas used for some of the dynamic routing protocols, like BGP or RIP. A master could advertise its IPVS hash table. If it didn't advertise within a given interval of time, other LVS's could take over.

During the failover, rather than trading an IP like we were talking about, all LVSs could know which one is the active one and ICMP redirect to that LVS or something like that.

Right now I'm routing every virtual server through the active LVS. This lets me do a lot of nifty things (for me at least):

* Very little has to happen on the LVS during failovers. They basically just trade an IP. In fact -- I COULD do the failover right at the router before the LVS's -- just have it route to another IP.

* I do not have to bring every IP up on my realservers -- I just have to bring the network that they're on up on a hidden loopback device. When you route an entire network to a loopback device, the loopback device answers every IP on that subnet automatically. So even with 1024+ IPs, I have to set up very few interfaces/aliases because a great many of them are on the same subnet.
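
On a current kernel the realserver side of this could be sketched as below. Ted's setup was a 2.2 kernel with a hidden loopback device; the local route and the ARP sysctls shown here are the usual present-day substitute and are an assumption, as is the placeholder network. The script prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch only: print commands that make a realserver accept every address
# in a network without configuring one alias per VIP.
# NET is a placeholder network, not a value from the original post.
NET=203.0.113.0/24

realserver_cmds() {
    # Treat the whole network as local, delivered via the loopback device.
    echo "ip route add local $NET dev lo"
    # Keep the realserver from answering or advertising ARP for the VIPs.
    echo "sysctl -w net.ipv4.conf.all.arp_ignore=1"
    echo "sysctl -w net.ipv4.conf.all.arp_announce=2"
}

realserver_cmds
```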

25.26. Appendix 3: Announcement of grouping services with fwmarks

Ted Pavlic tpavlic (at) netwalk (dot) com 4 Aug 2000

Periodically the issue comes up of wanting to do persistence by groups of ports. Until now, an LVS administrator could make a single port persistent or all ports persistent.

Single port persistence was nice for quite a few things. However, things like HTTP and HTTPS caused complications with it. Someone who connected to a webpage on HTTP and started a session tied to them with a cookie would want to return to that same realserver when they went to the HTTPS version of that site. FTP would also cause a problem with single-port persistence, as someone who wanted to use passive FTP wouldn't be guaranteed the same server when they returned on a random TCP port above 1024. There are other examples as well.

So the solution to these problems would be to make every port persistent. This works pretty well, but now anytime a user of a large network behind a firewall connects to a realserver on ANY service, everyone behind that firewall hits that same realserver. Plus, if an administrator wanted to stop scheduling a single service to a single realserver, he would have to take all services down on that single realserver. This causes many problems as well... especially if one small service dies on every realserver -- that brings down every service on every realserver.

So there has been the need for persistence by port GROUPS. Rather than saying all ports are persistent, it would be nice to tell LVS to tie just 80/tcp and 443/tcp together or just 21/tcp and 1024:65535/tcp together. Before the wonderful FWMARK additions to LVS, this was not possible.

But now that LVS listens to FWMARKs, it becomes possible to group ports together inside ipchains with different FWMARKs and then tell LVS to listen to those FWMARKs.

For example, one can set up rules inside ipchains to do this...

80/tcp, 443/tcp --> FWMARK1
21/tcp, 1024:65535/tcp --> FWMARK2
25/tcp --> FWMARK3
110/tcp --> FWMARK4

Then inside LVS (assume on this setup all of these services are served by the same realserver cluster), say:

FWMARK1 -> PERSISTENT -> real1, real2, real3, real4
FWMARK2 -> PERSISTENT -> real1, real2, real3, real4
FWMARK3 -> real1, real2, real3, real4
FWMARK4 -> real1, real2, real3, real4

Not only have you now set up persistence by port groups, but you've also split your services back up into autonomous services that will not bring EVERY server down for the sake of persistence. If FTP goes down on real1, only the scheduling of FTP to real1 needs to be stopped.
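
Stopping one service on one realserver then comes down to a single command against that fwmark's virtual service. A minimal sketch, assuming the fwmark numbers and realserver names from the example above (the script prints the command instead of running it):

```shell
#!/bin/sh
# Sketch only: print the command that stops scheduling one fwmark group
# to one realserver, leaving that realserver in every other group.
# $1 = fwmark, $2 = realserver
stop_scheduling() {
    echo "ipvsadm -d -f $1 -r $2"
}

stop_scheduling 2 real1   # FTP (FWMARK2) is down on real1
```

real1 stays in FWMARK1, FWMARK3 and FWMARK4, so HTTP/HTTPS, SMTP and POP3 keep being scheduled to it.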

25.26.1. another explanation

Ted Pavlic tpavlic (at) netwalk (dot) com 2000-09-15

Using fwmark, you can set up something which used to be a big desire in LVS: persistence by port groups.

For example... Say you were serving HTTP and HTTPS. In this case, you would probably want calls to one HTTP server to end up hitting the same HTTPS server. This way session information and such would be accessible no matter how the end-user was accessing the website.

Say you also wanted all forms of FTP to work... You would need persistence there, but not necessarily the same persistence as HTTP/HTTPS.

And other protocols do not need to be persistent.

Back in the olden days before fwmark, to do any of this you would have to make ALL ports persistent. You couldn't simply say "Group 80 and 443 together and make them persistent and then make 21, 20, AND 1024:65535 persistent." If one service went down, you would have to bring down ALL services. Some sort of persistence by port groups would allow you to only need to take down whatever went down and the affected server could still serve other services.

FWMARK allows you to do this by way of setting up multiple FWMARKs.

That is -- you can use ipchains to say that:

HTTP,HTTPS --> FWMARK1
FTP --> FWMARK2
SMTP --> FWMARK3
POP --> FWMARK4

Then in LVS, setup:

FWMARK1 --> WLC Persistent 600
FWMARK2 --> WLC Persistent 300
FWMARK3 --> WLC
FWMARK4 --> WLC

And if FTP went down, all you'd have to do is stop scheduling FTP rather than stop scheduling EVERYTHING.

Also note that FWMARK makes setting up MASS VIPs really easy (of course, because of recent ARIN policy changes, this probably won't be done much anymore). That is, if you wanted to load balance 1000 VIPs, a single rule in ipchains might cover them all, where it would otherwise take 1000 rules for EACH realserver in ipvsadm.
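
The arithmetic behind this can be sketched as follows. The figures (1000 VIPs, 4 realservers, one port) and the placeholder network are assumptions for illustration, and the marking rule is written for iptables rather than ipchains.

```shell
#!/bin/sh
# Sketch only: compare how many director rules are needed to balance one
# port on N VIPs across R realservers, with and without an fwmark.
# Per-VIP:  N "ipvsadm -A" lines plus N*R "ipvsadm -a" lines.
# Fwmark:   1 marking rule, 1 "ipvsadm -A" line, R "ipvsadm -a" lines.
# $1 = number of VIPs (N), $2 = number of realservers (R)
rules_per_vip() { echo $(( $1 + $1 * $2 )); }
rules_fwmark()  { echo $(( 1 + 1 + $2 )); }

echo "per-VIP: $(rules_per_vip 1000 4) rules"
echo "fwmark:  $(rules_fwmark 1000 4) rules"

# The single marking rule, assuming the VIPs all sit inside the
# placeholder network 10.0.0.0/22 (1024 addresses):
echo "iptables -t mangle -A PREROUTING -d 10.0.0.0/22 -p tcp --dport 80 -j MARK --set-mark 1"
```

The fwmark count does not grow with the number of VIPs as long as they can be covered by one network match.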

It makes me think that if there was a utility already out there that could sit on a director and figure out where name-based packets were going it might be able to mark each name-based host with a different FWMARK and pass that right back to LVS... Then LVS wouldn't have to worry about handling name-based stuff ITSELF. Of course the name-based challenge is even more challenging considering how much data needs to be looked at to figure out if a TCP stream is a name-based HTTP session going to specific name X.... but that's a completely other argument... Just food for thought.

25.27. fwmark examples from the mailing list

Simone Sestini, September 23, 2003

I would like to use the director and a backup director as realservers too. I would like to run http and https on the backup director/realserver. How can I configure more than one https domain for each server? Apache needs a unique IP for each https domain.

Matthew Crocker matthew (at) crocker (dot) com 24 Sep 2003

Search some of the archives a bit. I handle my HTTPS servers with LVS-DR going through my LVS director. The actual web servers are not on the Internet. Here is what I do.

1) Set up keepalived/VRRP to handle VIP failover between the two LVS boxes.
2) Set up a static route in the upstream router for a /24 to the VIP address.
3) Set up netfilter/iptables to mark packets with dest = the /24 and dport = 80 or 443 with fwmark 0x1.
4) Set up LVS to load balance FWM 1 using LVS-DR to the realservers' internal IPs (192.168.x.y).
5) Set up the LVS servers (presumably directors - Joe) to treat packets with FWM 1 as local.
6) Set up the realservers to listen on each IP in the /24.
7) Set up apache with SSL certs for each /24 IP address.
8) Point the DNS records for the https servers to unique IPs in the /24.

This works great for me. Only packets in the /24 that are marked with the firewall mark actually hit the LVS server and/or the realservers. All other packets are not treated as local by the LVS server and are routed to its default route, which creates a routing loop. If you ping/traceroute it will look broken, but if you telnet to port 80 on one of the IPs you will get an answer. This also eliminates any ARP issues, because the realservers are not on the same LAN segment as the LVS directors and the router doesn't ARP for the /24 IPs anyway because of the static route.

Most of the configs for steps 3,4,5 are in the archives from a couple months ago.
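
For readers without the archives to hand, steps 3, 4 and 5 might look something like the sketch below. These are not Matthew's configs: the network, the realserver addresses and the routing table number are placeholders, and the "treat as local" step uses the common fwmark policy routing recipe, which is assumed rather than confirmed by his post. The script prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch only: print director commands corresponding to steps 3, 4 and 5.
# NET, RIPS and routing table 100 are placeholders, not values from the post.
NET=203.0.113.0/24
RIPS="192.168.1.11 192.168.1.12"

director_cmds() {
    # Step 3: mark packets for the /24 on ports 80 and 443 with fwmark 1.
    echo "iptables -t mangle -A PREROUTING -d $NET -p tcp -m multiport --dports 80,443 -j MARK --set-mark 1"

    # Step 4: balance fwmark 1 to the realservers with LVS-DR (-g).
    echo "ipvsadm -A -f 1"
    for rip in $RIPS; do
        echo "ipvsadm -a -f 1 -r $rip -g"
    done

    # Step 5: have the director accept marked packets as local.
    echo "ip rule add fwmark 1 lookup 100"
    echo "ip route add local 0.0.0.0/0 dev lo table 100"
}

director_cmds
```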
