Peter Warasin peter (at) endian (dot) com 11 Sep 2007
I made some modifications to the LVS-specific kernel code, and they now lead to a kernel oops. Could someone give me some pointers on how to find the bug? I am not very familiar with the kernel code, so I may have missed some simple tricks that experienced people know and I don't. Basically I altered the LVS code to make it catch packets in the PREROUTING chain instead of the INPUT chain. My setup works, but sometimes I get a kernel oops. I think a spinlock of some sort is missing somewhere, but I don't really know where to begin looking for the place where it must be inserted.
My setup:
With standard LVS this setup is not possible, for two reasons:
I solved those problems this way:
--- linux-2.6.9/net/ipv4/ipvs/ip_vs_core.c.orig 2007-07-30 20:40:31.000000000 +0200
+++ linux-2.6.9/net/ipv4/ipvs/ip_vs_core.c 2007-07-30 20:40:37.000000000 +0200
@@ -1095,7 +1095,7 @@
 	.hook		= ip_vs_in,
 	.owner		= THIS_MODULE,
 	.pf		= PF_INET,
-	.hooknum	= NF_IP_LOCAL_IN,
+	.hooknum	= NF_IP_PRE_ROUTING,
 	.priority	= 100,
 };
--- linux-2.6.9/include/net/ip.h.orig 2007-08-01 20:22:35.000000000 +0200
+++ linux-2.6.9/include/net/ip.h 2007-08-01 20:22:50.000000000 +0200
@@ -87,6 +87,7 @@
 			  struct ip_options *opt);
 extern int		ip_rcv(struct sk_buff *skb, struct net_device *dev,
 			       struct packet_type *pt);
+extern int		ip_rercv(struct sk_buff *skb);
 extern int		ip_local_deliver(struct sk_buff *skb);
 extern int		ip_mr_input(struct sk_buff *skb);
 extern int		ip_output(struct sk_buff **pskb);
--- linux-2.6.9/include/net/ip_vs.h.orig 2007-08-01 22:12:52.000000000 +0200
+++ linux-2.6.9/include/net/ip_vs.h 2007-08-01 22:13:10.000000000 +0200
@@ -925,6 +925,8 @@
  */
 extern int ip_vs_null_xmit
 (struct sk_buff *skb, struct ip_vs_conn *cp, struct ip_vs_protocol *pp);
+extern int ip_vs_loop_xmit
+(struct sk_buff *skb, struct ip_vs_conn *cp, struct ip_vs_protocol *pp);
 extern int ip_vs_bypass_xmit
 (struct sk_buff *skb, struct ip_vs_conn *cp, struct ip_vs_protocol *pp);
 extern int ip_vs_nat_xmit
--- linux-2.6.9/net/ipv4/ipvs/ip_vs_conn.c.orig 2007-08-01 21:52:32.000000000 +0200
+++ linux-2.6.9/net/ipv4/ipvs/ip_vs_conn.c 2007-08-01 21:52:51.000000000 +0200
@@ -322,7 +322,7 @@
 		break;

 	case IP_VS_CONN_F_LOCALNODE:
-		cp->packet_xmit = ip_vs_null_xmit;
+		cp->packet_xmit = ip_vs_loop_xmit;
 		break;

 	case IP_VS_CONN_F_BYPASS:
--- linux-2.6.9/net/ipv4/ipvs/ip_vs_xmit.c.orig 2007-08-01 19:28:52.000000000 +0200
+++ linux-2.6.9/net/ipv4/ipvs/ip_vs_xmit.c 2007-08-03 16:47:16.000000000 +0200
@@ -24,6 +24,8 @@
 #include <net/route.h>                  /* for ip_route_output */
 #include <linux/netfilter.h>
 #include <linux/netfilter_ipv4.h>
+#include <linux/netfilter_ipv4/ip_nat.h>
+#include <linux/netfilter_ipv4/ip_conntrack.h>

 #include <net/ip_vs.h>

@@ -141,12 +143,47 @@
 ip_vs_null_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
 		struct ip_vs_protocol *pp)
 {
+	IP_VS_DBG(10, "NULL transmitter called\n");
 	/* we do not touch skb and do not need pskb ptr */
 	return NF_ACCEPT;
 }


 /*
+ *      LOOP transmitter (reinject on NF_IP_PRE_ROUTING)
+ */
+int
+ip_vs_loop_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
+		struct ip_vs_protocol *pp)
+{
+
+	struct ip_conntrack *ct;
+	enum ip_conntrack_info ctinfo;
+	struct ip_nat_info *info;
+
+	IP_VS_DBG(5, "LOOP transmitter called\n");
+	if (skb->nfcache & NFC_IPVS_PROPERTY) {
+		IP_VS_DBG(10, "Already passed LVS. Receive it normally\n");
+		return NF_ACCEPT;
+	}
+
+	IP_VS_DBG(10, "Retransmit to IP_PRE_ROUTING hook starting with priority NF_IP_PRI_MANGLE\n");
+	nf_reset_debug(skb);
+	skb->nfcache |= NFC_IPVS_PROPERTY;
+	skb->ip_summed = CHECKSUM_NONE;
+
+	ct = ip_conntrack_get(skb, &ctinfo);
+	if (ct && (ctinfo == IP_CT_NEW)) {
+		info = &ct->nat.info;
+		info->initialized = 0;
+	}
+	NF_HOOK_THRESH(PF_INET, NF_IP_PRE_ROUTING, skb, skb->dev,
+		       NULL, ip_rercv, NF_IP_PRI_MANGLE);
+	return NF_STOLEN;
+}
+
+
+/*
  *      Bypass transmitter
  *      Let packets bypass the destination when the destination is not
  *      available, it may be only used in transparent cache cluster.
--- linux-2.6.9/net/ipv4/ip_input.c.orig 2007-08-01 19:29:54.000000000 +0200
+++ linux-2.6.9/net/ipv4/ip_input.c 2007-08-01 19:32:42.000000000 +0200
@@ -355,6 +355,14 @@
 }

 /*
+ *	Retransmit packet
+ */
+int ip_rercv(struct sk_buff *skb)
+{
+	return ip_rcv_finish(skb);
+}
+
+/*
  *	Main IP Receive routine.
  */
 int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt)
@@ -429,4 +437,5 @@
 }

 EXPORT_SYMBOL(ip_rcv);
+EXPORT_SYMBOL(ip_rercv);
 EXPORT_SYMBOL(ip_statistics);
The patch causes incoming packets that should go to the local node to be re-injected through the netfilter hooks, starting at priority NF_IP_PRI_MANGLE, instead of being passed on directly by ip_vs_null_xmit. This way I can remove the mark within the mangle table in order to pass the packet through LVS twice and then simply DNAT it. (Please ask if you would like to have the detailed iptables/ipvsadm rules.)
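The re-injection described above can be pictured with a small userspace model. This is only a sketch, not kernel code: the names `run_prerouting`, `ipvs_loop` and `struct packet` are hypothetical, the hook list assumes conntrack, mangle and nat are all loaded on PRE_ROUTING, and the priority numbers are the ones from linux/netfilter_ipv4.h plus the 100 used for LVS in the first patch. It assumes, as in the 2.6 netfilter core, that a traversal started with a threshold only calls hooks whose priority is at or above that threshold.

```c
/* Userspace model of the NF_HOOK_THRESH re-injection; NOT kernel code. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum verdict { V_ACCEPT, V_STOLEN };

/* Hook priorities on PRE_ROUTING (LVS value taken from the first patch). */
enum {
	PRI_CONNTRACK = -200,
	PRI_MANGLE    = -150,
	PRI_NAT_DST   = -100,
	PRI_IPVS      =  100
};

#define MAX_TRACE 16

struct packet {
	int seen_by_ipvs;             /* stands in for NFC_IPVS_PROPERTY */
	const char *trace[MAX_TRACE]; /* names of the hooks visited, in order */
	size_t ntrace;
};

struct hook {
	const char *name;
	int priority;
	enum verdict (*fn)(struct packet *);
};

static enum verdict run_prerouting(struct packet *p, int thresh);

static void note(struct packet *p, const char *name)
{
	assert(p->ntrace < MAX_TRACE);
	p->trace[p->ntrace++] = name;
}

/* Placeholder for hooks whose behaviour is not modelled here. */
static enum verdict pass(struct packet *p)
{
	(void)p;
	return V_ACCEPT;
}

/* Models ip_vs_loop_xmit: first pass re-injects, second pass accepts. */
static enum verdict ipvs_loop(struct packet *p)
{
	if (p->seen_by_ipvs)
		return V_ACCEPT;           /* already passed LVS */
	p->seen_by_ipvs = 1;
	run_prerouting(p, PRI_MANGLE);     /* restart at mangle, skip conntrack */
	return V_STOLEN;                   /* the original traversal stops here */
}

/* Hooks sorted by ascending priority, as netfilter keeps them. */
static const struct hook prerouting[] = {
	{ "conntrack", PRI_CONNTRACK, pass },
	{ "mangle",    PRI_MANGLE,    pass },
	{ "nat",       PRI_NAT_DST,   pass },
	{ "ipvs",      PRI_IPVS,      ipvs_loop },
};

/* Call every hook with priority >= thresh until one steals the packet. */
static enum verdict run_prerouting(struct packet *p, int thresh)
{
	size_t i;

	for (i = 0; i < sizeof prerouting / sizeof prerouting[0]; i++) {
		if (prerouting[i].priority < thresh)
			continue;
		note(p, prerouting[i].name);
		if (prerouting[i].fn(p) == V_STOLEN)
			return V_STOLEN;
	}
	return V_ACCEPT; /* in the kernel, okfn (here ip_rercv) would run now */
}
```

Calling `run_prerouting(&p, PRI_CONNTRACK)` on a zeroed `struct packet` visits conntrack, mangle, nat and ipvs, then mangle, nat and ipvs a second time. The second visit to mangle is where the mark could be removed, and the second visit to ipvs is the one that returns accept.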
The setup works. But sometimes I get a kernel oops (fatal exception in interrupt; ip_rcv and ip_rcv_finish are involved). I tried to narrow down the problem by removing patch no. 2, but the problem still exists, so it must be in the 1st patch. But what could cause this? I simply let LVS catch packets in the PREROUTING chain instead of the INPUT chain. That does not seem too different to me.
Horms 13 Sep 2007
As Joe mentioned in a subsequent email, being able to move LVS from one chain to another is something that we are interested in. In particular I am of the belief that the FORWARD chain would be a much more logical home than LOCAL_IN, as in some ways it would allow LVS to act more like a router than a proxy (not that it is a proxy, but it kind of behaves like one in some ways because of its home on LOCAL_IN).
As I recall, I did try moving the code to the FORWARD chain a long time ago. I believe that the change was very similar to the LOCAL_IN to PRE_ROUTING snippet that you have below. I'm not sure that I ever posted the change, as I never tested it thoroughly. So perhaps it too broke occasionally. In any case, this was a long time ago, and the kernel has changed significantly since then, so any testing done at that time wouldn't really hold water now (incidentally, 2.6.9 is also pretty old, though I'm not sure what patches RHEL includes to modernise it).
As for debugging your problem, providing the oops message, if any, might help. Hopefully there is a stack trace in there, and that should start to point to where the problem is.
Some portions of the locking semantics of LVS are non-trivial, and I have a deep suspicion that there are some races in there anyway. By which I mean, don't be surprised if things get a bit hairy as you are tracing through what is going on.
If your kernel is compiled with IP_VS_DEBUG, then you can enable the debugging messages that LVS produces, and adjust their verbosity, by tweaking /proc/sys/net/ipv4/vs/debug_level as documented in Documentation/networking/ipvs-sysctl.txt in the kernel source tree.
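As a minimal sketch of tweaking that file from a program rather than from the shell, the helpers below write an integer to a sysctl-style file and read it back. The function names `write_sysctl_int` and `read_sysctl_int` are hypothetical; only the path /proc/sys/net/ipv4/vs/debug_level comes from the text above, and writing to it needs root and a kernel built with IPVS debugging.

```c
/* Helpers for integer sysctl files such as the IPVS debug_level entry. */
#include <assert.h>
#include <stdio.h>

/* Write value followed by a newline; returns 0 on success, -1 on error. */
static int write_sysctl_int(const char *path, int value)
{
	FILE *f;
	int ok;

	assert(path != NULL);
	f = fopen(path, "w");
	if (f == NULL)
		return -1;
	ok = fprintf(f, "%d\n", value) > 0;
	if (fclose(f) != 0)   /* write errors may only show up on close */
		ok = 0;
	return ok ? 0 : -1;
}

/* Read one integer from the file; returns 0 on success, -1 on error. */
static int read_sysctl_int(const char *path, int *value)
{
	FILE *f;
	int ok;

	assert(path != NULL && value != NULL);
	f = fopen(path, "r");
	if (f == NULL)
		return -1;
	ok = fscanf(f, "%d", value) == 1;
	fclose(f);
	return ok ? 0 : -1;
}
```

For example, `write_sysctl_int("/proc/sys/net/ipv4/vs/debug_level", 10)` should make the level 5 and level 10 IP_VS_DBG messages added by the patch above show up in the kernel log, since IP_VS_DBG prints a message when its level is at or below the configured debug level.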
Also, if you are doing development work, I do recommend considering a more up-to-date kernel, perhaps the latest rc kernel, currently 2.6.23-rc6. I'm not suggesting that you necessarily drop this into production. But for development work, it is much easier to work with the kernel guys if you are on the same page as them.