kernel-ark

Author	SHA1	Message	Date
Stephen Hemminger	317a76f9a4	[TCP]: Add pluggable congestion control algorithm infrastructure. Allow TCP to have multiple pluggable congestion control algorithms. Algorithms are defined by a set of operations and can be built in or modules. The legacy "new RENO" algorithm is used as a starting point and fallback. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2005-06-23 12:19:55 -07:00
Paulo Marques	543537bd92	[PATCH] create a kstrdup library function This patch creates a new kstrdup library function and changes the "local" implementations in several places to use this function. Most of the changes come from the sound and net subsystems. The sound part had already been acknowledged by Takashi Iwai and the net part by David S. Miller. I left UML alone for now because I would need more time to read the code carefully before making changes there. Signed-off-by: Paulo Marques <pmarques@grupopie.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-23 09:45:18 -07:00
Chuck Short	7abaa27c1c	[IPV4]: Fix route.c gcc4 warnings Signed-off by: Chuck Short <zulcss@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2005-06-22 22:10:23 -07:00
Harald Welte	5d927eb010	[NETFILTER]: Fix handling of ICMP packets (RELATED) in ipt_CLUSTERIP target. Signed-off-by: Harald Welte <laforge@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2005-06-22 12:37:50 -07:00
Kumar Gala	b535420739	[PATCH] Fix extra double quote in IPV4 Kconfig Kconfig option had an extra double quote at the end of the line which was causing in warning when building. Signed-off-by: Kumar Gala <kumar.gala@freescale.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-06-22 10:40:39 -07:00
David S. Miller	90f66914c8	[IPV4]: Fix fib_trie.c's args to fib_dump_info(). Signed-off-by: David S. Miller <davem@davemloft.net>	2005-06-21 14:43:28 -07:00
Patrick McHardy	2715bcf9ef	[NETFILTER]: Drop conntrack reference in ip_call_ra_chain()/ip_mr_input() Drop reference before handing the packets to raw_rcv() Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2005-06-21 14:06:24 -07:00
Patrick McHardy	6150bacfec	[NETFILTER]: Check TCP checksum in ipt_REJECT Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2005-06-21 14:03:46 -07:00
Keir Fraser	e3be8ba792	[NETFILTER]: Avoid unncessary checksum validation in UDP connection tracking Signed-off-by: Keir Fraser <Keir.Fraser@xl.cam.ac.uk> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2005-06-21 14:03:23 -07:00
Phil Oester	1d3cdb41f5	[NETFILTER]: expectation timeouts are compulsory Since expectation timeouts were made compulsory [1], there is no need to check for them in ip_conntrack_expect_insert. [1] https://lists.netfilter.org/pipermail/netfilter-devel/2005-January/018143.html Signed-off-by: Phil Oester <kernel@linuxace.com> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2005-06-21 14:02:42 -07:00
Patrick McHardy	18b8afc771	[NETFILTER]: Kill nf_debug Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2005-06-21 14:01:57 -07:00
Patrick McHardy	e45b1be8bc	[NETFILTER]: Kill lockhelp.h Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2005-06-21 14:01:30 -07:00
Robert Olsson	19baf839ff	[IPV4]: Add LC-Trie FIB lookup algorithm. Signed-off-by: Robert Olsson <Robert.Olsson@data.slu.se> Signed-off-by: David S. Miller <davem@davemloft.net>	2005-06-21 12:43:18 -07:00
Robert Olsson	246955fe4c	[NETLINK]: fib_lookup() via netlink Below is a more generic patch to do fib_lookup via netlink. For others we should say that we discussed this as a way to verify route selection. It's also possible there are others uses for this. In short the fist half of struct fib_result_nl is filled in by caller and netlink call fills in the other half and returns it. In case anyone is interested there is a corresponding user app to compare the full routing table this was used to test implementation of the LC-trie. Signed-off-by: David S. Miller <davem@davemloft.net>	2005-06-20 13:36:39 -07:00
Herbert Xu	dd87147eed	[IPSEC]: Add XFRM_STATE_NOPMTUDISC flag This patch adds the flag XFRM_STATE_NOPMTUDISC for xfrm states. It is similar to the nopmtudisc on IPIP/GRE tunnels. It only has an effect on IPv4 tunnel mode states. For these states, it will ensure that the DF flag is always cleared. This is primarily useful to work around ICMP blackholes. In future this flag could also allow a larger MTU to be set within the tunnel just like IPIP/GRE tunnels. This could be useful for short haul tunnels where temporary fragmentation outside the tunnel is desired over smaller fragments inside the tunnel. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: James Morris <jmorris@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2005-06-20 13:21:43 -07:00
Herbert Xu	72cb6962a9	[IPSEC]: Add xfrm_init_state This patch adds xfrm_init_state which is simply a wrapper that calls xfrm_get_type and subsequently x->type->init_state. It also gets rid of the unused args argument. Abstracting it out allows us to add common initialisation code, e.g., to set family-specific flags. The add_time setting in xfrm_user.c was deleted because it's already set by xfrm_state_alloc. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: James Morris <jmorris@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2005-06-20 13:18:08 -07:00
David S. Miller	7df551254a	[TCP]: Fix sysctl_tcp_low_latency When enabled, this should disable UCOPY prequeue'ing altogether, but it does not due to a missing test. Signed-off-by: David S. Miller <davem@davemloft.net>	2005-06-18 23:01:10 -07:00
Jesper Juhl	f7d7fc0322	[IPV4]: [4/4] signed vs unsigned cleanup in net/ipv4/raw.c This patch changes the type of the third parameter 'length' of the raw_send_hdrinc() function from 'int' to 'size_t'. This makes sense since this function is only ever called from one location, and the value passed as the third parameter in that location is itself of type size_t, so this makes the recieving functions parameter type match. Also, inside raw_send_hdrinc() the 'length' variable is used in comparisons with unsigned values and passed as parameter to functions expecting unsigned values (it's used in a single comparison with a signed value, but that one can never actually be negative so the patch also casts that one to size_t to stop gcc worrying, and it is passed in a single instance to memcpy_fromiovecend() which expects a signed int, but as far as I can see that's not a problem since the value of 'length' shouldn't ever exceed the value of a signed int). Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk> Signed-off-by: David S. Miller <davem@davemloft.net>	2005-06-18 23:00:34 -07:00
Jesper Juhl	93765d8a43	[IPV4]: [3/4] signed vs unsigned cleanup in net/ipv4/raw.c This patch changes the type of the local variable 'i' in raw_probe_proto_opt() from 'int' to 'unsigned int'. The only use of 'i' in this function is as a counter in a for() loop and subsequent index into the msg->msg_iov[] array. Since 'i' is compared in a loop to the unsigned variable msg->msg_iovlen gcc -W generates this warning : net/ipv4/raw.c:340: warning: comparison between signed and unsigned Changing 'i' to unsigned silences this warning and is safe since the array index can never be negative anyway, so unsigned int is the logical type to use for 'i' and also enables a larger msg_iov[] array (but I don't know if that will ever matter). Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk> Signed-off-by: David S. Miller <davem@davemloft.net>	2005-06-18 23:00:15 -07:00
Jesper Juhl	926d4b8122	[IPV4]: [2/4] signed vs unsigned cleanup in net/ipv4/raw.c This patch gets rid of the following gcc -W warning in net/ipv4/raw.c : net/ipv4/raw.c:387: warning: comparison of unsigned expression < 0 is always false Since 'len' is of type size_t it is unsigned and can thus never be <0, and since this is obvious from the function declaration just a few lines above I think it's ok to remove the pointless check for len<0. Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk> Signed-off-by: David S. Miller <davem@davemloft.net>	2005-06-18 23:00:00 -07:00
Jesper Juhl	5418c6926f	[IPV4]: [1/4] signed vs unsigned cleanup in net/ipv4/raw.c This patch silences these two gcc -W warnings in net/ipv4/raw.c : net/ipv4/raw.c:517: warning: signed and unsigned type in conditional expression net/ipv4/raw.c:613: warning: signed and unsigned type in conditional expression It doesn't change the behaviour of the code, simply writes the conditional expression with plain 'if()' syntax instead of '? :' , but since this breaks it into sepperate statements gcc no longer complains about having both a signed and unsigned value in the same conditional expression. Signed-off-by: David S. Miller <davem@davemloft.net>	2005-06-18 22:59:45 -07:00
Herbert Xu	e0f9f8586a	[IPV4/IPV6]: Replace spin_lock_irq with spin_lock_bh In light of my recent patch to net/ipv4/udp.c that replaced the spin_lock_irq calls on the receive queue lock with spin_lock_bh, here is a similar patch for all other occurences of spin_lock_irq on receive/error queue locks in IPv4 and IPv6. In these stacks, we know that they can only be entered from user or softirq context. Therefore it's safe to disable BH only. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2005-06-18 22:56:18 -07:00
Jamal Hadi Salim	9ed19f339e	[NETLINK]: Set correct pid for ioctl originating netlink events This patch ensures that netlink events created as a result of programns using ioctls (such as ifconfig, route etc) contains the correct PID of those events. Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca> Signed-off-by: David S. Miller <davem@davemloft.net>	2005-06-18 22:55:51 -07:00
Jamal Hadi Salim	b6544c0b4c	[NETLINK]: Correctly set NLM_F_MULTI without checking the pid This patch rectifies some rtnetlink message builders that derive the flags from the pid. It is now explicit like the other cases which get it right. Also fixes half a dozen dumpers which did not set NLM_F_MULTI at all. Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca> Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2005-06-18 22:54:12 -07:00
David S. Miller	e52c1f17e4	[NET]: Move sysctl_max_syn_backlog into request_sock.c This fixes the CONFIG_INET=n build failure noticed by Andrew Morton. Signed-off-by: David S. Miller <davem@davemloft.net>	2005-06-18 22:49:40 -07:00
Arnaldo Carvalho de Melo	2ad69c55a2	[NET] rename struct tcp_listen_opt to struct listen_sock Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2005-06-18 22:48:55 -07:00
Arnaldo Carvalho de Melo	0e87506fcc	[NET] Generalise tcp_listen_opt This chunks out the accept_queue and tcp_listen_opt code and moves them to net/core/request_sock.c and include/net/request_sock.h, to make it useful for other transport protocols, DCCP being the first one to use it. Next patches will rename tcp_listen_opt to accept_sock and remove the inline tcp functions that just call a reqsk_queue_ function. Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2005-06-18 22:47:59 -07:00
Arnaldo Carvalho de Melo	60236fdd08	[NET] Rename open_request to request_sock Ok, this one just renames some stuff to have a better namespace and to dissassociate it from TCP: struct open_request -> struct request_sock tcp_openreq_alloc -> reqsk_alloc tcp_openreq_free -> reqsk_free tcp_openreq_fastfree -> __reqsk_free With this most of the infrastructure closely resembles a struct sock methods subset. Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2005-06-18 22:47:21 -07:00
Arnaldo Carvalho de Melo	2e6599cb89	[NET] Generalise TCP's struct open_request minisock infrastructure Kept this first changeset minimal, without changing existing names to ease peer review. Basicaly tcp_openreq_alloc now receives the or_calltable, that in turn has two new members: ->slab, that replaces tcp_openreq_cachep ->obj_size, to inform the size of the openreq descendant for a specific protocol The protocol specific fields in struct open_request were moved to a class hierarchy, with the things that are common to all connection oriented PF_INET protocols in struct inet_request_sock, the TCP ones in tcp_request_sock, that is an inet_request_sock, that is an open_request. I.e. this uses the same approach used for the struct sock class hierarchy, with sk_prot indicating if the protocol wants to use the open_request infrastructure by filling in sk_prot->rsk_prot with an or_calltable. Results? Performance is improved and TCP v4 now uses only 64 bytes per open request minisock, down from 96 without this patch :-) Next changeset will rename some of the structs, fields and functions mentioned above, struct or_calltable is way unclear, better name it struct request_sock_ops, s/struct open_request/struct request_sock/g, etc. Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2005-06-18 22:46:52 -07:00
David S. Miller	bcfff0b471	[NETFILTER]: ipt_recent: last_pkts is an array of "unsigned long" not "u_int32_t" This fixes various crashes on 64-bit when using this module. Based upon a patch by Juergen Kreileder <jk@blackdown.de>. Signed-off-by: David S. Miller <davem@davemloft.net> ACKed-by: Patrick McHardy <kaber@trash.net>	2005-06-15 20:51:14 -07:00
Patrick McHardy	a96aca88ac	[NETFILTER]: Advance seq-file position in exp_next_seq() Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2005-06-13 18:27:13 -07:00
J. Simonetti	1c2fb7f93c	[IPV4]: Sysctl configurable icmp error source address. This patch alows you to change the source address of icmp error messages. It applies cleanly to 2.6.11.11 and retains the default behaviour. In the old (default) behaviour icmp error messages are sent with the ip of the exiting interface. The new behaviour (when the sysctl variable is toggled on), it will send the message with the ip of the interface that received the packet that caused the icmp error. This is the behaviour network administrators will expect from a router. It makes debugging complicated network layouts much easier. Also, all 'vendor routers' I know of have the later behaviour. Signed-off-by: David S. Miller <davem@davemloft.net>	2005-06-13 15:19:03 -07:00
Neil Horman	cdac4e0774	[SCTP] Add support for ip_nonlocal_bind sysctl & IP_FREEBIND socket option Signed-off-by: Neil Horman <nhorman@redhat.com> Signed-off-by: Sridhar Samudrala <sri@us.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2005-06-13 15:12:33 -07:00
Randy Dunlap	6efd8455cf	[IPV4]: Multipath modules need a license to prevent kernel tainting. Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2005-06-13 14:29:06 -07:00
Andi Kleen	e7626486c3	[TCP]: Adjust TCP mem order check to new alloc_large_system_hash Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2005-06-13 14:24:52 -07:00
Adrian Bunk	64a6c7aa38	[IPVS]: remove net/ipv4/ipvs/ip_vs_proto_icmp.c ip_vs_proto_icmp.c was never finished. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2005-06-02 13:02:25 -07:00
Edgar E Iglesias	36839836e8	[IPSEC]: Fix esp_decap_data size verification in esp4. Signed-off-by: Edgar E Iglesias <edgar@axis.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2005-05-31 17:08:05 -07:00
Herbert Xu	208d89843b	[IPV4]: Fix BUG() in 2.6.x, udp_poll(), fragments + CONFIG_HIGHMEM Steven Hand <Steven.Hand@cl.cam.ac.uk> wrote: > > Reconstructed forward trace: > > net/ipv4/udp.c:1334 spin_lock_irq() > net/ipv4/udp.c:1336 udp_checksum_complete() > net/core/skbuff.c:1069 skb_shinfo(skb)->nr_frags > 1 > net/core/skbuff.c:1086 kunmap_skb_frag() > net/core/skbuff.h:1087 local_bh_enable() > kernel/softirq.c:0140 WARN_ON(irqs_disabled()); The receive queue lock is never taken in IRQs (and should never be) so we can simply substitute bh for irq. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2005-05-30 15:50:15 -07:00
Harald Welte	9bb7bc942d	[NETFILTER]: Fix deadlock with ip_queue and tcp local input path. When we have ip_queue being used from LOCAL_IN, then we end up with a situation where the verdicts coming back from userspace traverse the TCP input path from syscall context. While this seems to work most of the time, there's an ugly deadlock: syscall context is interrupted by the timer interrupt. When the timer interrupt leaves, the timer softirq get's scheduled and calls tcp_delack_timer() and alike. They themselves do bh_lock_sock(sk), which is already held from somewhere else -> boom. I've now tested the suggested solution by Patrick McHardy and Herbert Xu to simply use local_bh_{en,dis}able(). Signed-off-by: Harald Welte <laforge@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2005-05-30 15:35:26 -07:00
Pravin B. Shelar	37e20a66db	[IPV4]: Kill MULTIPATHHOLDROUTE flag. It cannot work properly, so just ignore it in drr and rr multipath algorithms just like the random multipath algorithm does. Suggested by Herbert Xu. Signed-off by: Pravin B. Shelar <pravins@calsoftinc.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2005-05-29 20:26:44 -07:00
Harald Welte	8f937c6099	[IPV4]: Primary and secondary addresses Add an option to make secondary IP addresses get promoted when primary IP addresses are removed from the device. It defaults to off to preserve existing behavior. Signed-off-by: Harald Welte <laforge@gnumonks.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2005-05-29 20:23:46 -07:00
David S. Miller	314324121f	[TCP]: Fix stretch ACK performance killer when doing ucopy. When we are doing ucopy, we try to defer the ACK generation to cleanup_rbuf(). This works most of the time very well, but if the ucopy prequeue is large, this ACKing behavior kills performance. With TSO, it is possible to fill the prequeue so large that by the time the ACK is sent and gets back to the sender, most of the window has emptied of data and performance suffers significantly. This behavior does help in some cases, so we should think about re-enabling this trick in the future, using some kind of limit in order to avoid the bug case. Signed-off-by: David S. Miller <davem@davemloft.net>	2005-05-23 12:03:06 -07:00
David S. Miller	8be58932ca	[NETFILTER]: Do not be clever about SKB ownership in ip_ct_gather_frags(). Just do an skb_orphan() and be done with it. Based upon discussions with Herbert Xu on netdev. Signed-off-by: David S. Miller <davem@davemloft.net>	2005-05-19 12:36:33 -07:00
Julian Anastasov	d9fa0f392b	[IP_VS]: Remove extra __ip_vs_conn_put() for incoming ICMP. Remove extra __ip_vs_conn_put for incoming ICMP in direct routing mode. Mark de Vries reports that IPVS connections are not leaked anymore. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>	2005-05-19 12:29:59 -07:00
Herbert Xu	2fdba6b085	[IPV4/IPV6] Ensure all frag_list members have NULL sk Having frag_list members which holds wmem of an sk leads to nightmares with partially cloned frag skb's. The reason is that once you unleash a skb with a frag_list that has individual sk ownerships into the stack you can never undo those ownerships safely as they may have been cloned by things like netfilter. Since we have to undo them in order to make skb_linearize happy this approach leads to a dead-end. So let's go the other way and make this an invariant: For any skb on a frag_list, skb->sk must be NULL. That is, the socket ownership always belongs to the head skb. It turns out that the implementation is actually pretty simple. The above invariant is actually violated in the following patch for a short duration inside ip_fragment. This is OK because the offending frag_list member is either destroyed at the end of the slow path without being sent anywhere, or it is detached from the frag_list before being sent. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2005-05-18 22:52:33 -07:00
Jesper Juhl	02c30a84e6	[PATCH] update Ross Biro bouncing email address Ross moved. Remove the bad email address so people will find the correct one in ./CREDITS. Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-05-05 16:36:49 -07:00
Patrick McHardy	60d5306553	[IPV4]: multipath_wrandom.c GPF fixes multipath_wrandom needs to use GFP_ATOMIC. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2005-05-05 14:30:15 -07:00
Herbert Xu	aabc9761b6	[IPSEC]: Store idev entries I found a bug that stopped IPsec/IPv6 from working. About a month ago IPv6 started using rt6i_idev->dev on the cached socket dst entries. If the cached socket dst entry is IPsec, then rt6i_idev will be NULL. Since we want to look at the rt6i_idev of the original route in this case, the easiest fix is to store rt6i_idev in the IPsec dst entry just as we do for a number of other IPv6 route attributes. Unfortunately this means that we need some new code to handle the references to rt6i_idev. That's why this patch is bigger than it would otherwise be. I've also done the same thing for IPv4 since it is conceivable that once these idev attributes start getting used for accounting, we probably need to dereference them for IPv4 IPsec entries too. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2005-05-03 16:27:10 -07:00
Patrick McHardy	bd96535b81	[NETFILTER]: Drop conntrack reference in ip_dev_loopback_xmit() Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2005-05-03 16:21:37 -07:00
Herbert Xu	2a0a6ebee1	[NETLINK]: Synchronous message processing. Let's recap the problem. The current asynchronous netlink kernel message processing is vulnerable to these attacks: 1) Hit and run: Attacker sends one or more messages and then exits before they're processed. This may confuse/disable the next netlink user that gets the netlink address of the attacker since it may receive the responses to the attacker's messages. Proposed solutions: a) Synchronous processing. b) Stream mode socket. c) Restrict/prohibit binding. 2) Starvation: Because various netlink rcv functions were written to not return until all messages have been processed on a socket, it is possible for these functions to execute for an arbitrarily long period of time. If this is successfully exploited it could also be used to hold rtnl forever. Proposed solutions: a) Synchronous processing. b) Stream mode socket. Firstly let's cross off solution c). It only solves the first problem and it has user-visible impacts. In particular, it'll break user space applications that expect to bind or communicate with specific netlink addresses (pid's). So we're left with a choice of synchronous processing versus SOCK_STREAM for netlink. For the moment I'm sticking with the synchronous approach as suggested by Alexey since it's simpler and I'd rather spend my time working on other things. However, it does have a number of deficiencies compared to the stream mode solution: 1) User-space to user-space netlink communication is still vulnerable. 2) Inefficient use of resources. This is especially true for rtnetlink since the lock is shared with other users such as networking drivers. The latter could hold the rtnl while communicating with hardware which causes the rtnetlink user to wait when it could be doing other things. 3) It is still possible to DoS all netlink users by flooding the kernel netlink receive queue. The attacker simply fills the receive socket with a single netlink message that fills up the entire queue. The attacker then continues to call sendmsg with the same message in a loop. Point 3) can be countered by retransmissions in user-space code, however it is pretty messy. In light of these problems (in particular, point 3), we should implement stream mode netlink at some point. In the mean time, here is a patch that implements synchronous processing. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2005-05-03 14:55:09 -07:00

1 2

66 Commits