This patch vectorizes aio_read() and aio_write() methods to prepare for
collapsing all aio & vectored operations into one interface - which is
aio_read()/aio_write().
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Michael Holzheu <HOLZHEU@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Function sk_filter() is called from tcp_v{4,6}_rcv() functions with arg
needlock = 0, while socket is not locked at that moment. In order to avoid
this and similar issues in the future, use rcu for sk->sk_filter field read
protection.
Signed-off-by: Dmitry Mishin <dim@openvz.org>
Signed-off-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: Kirill Korotaev <dev@openvz.org>
This automatically labels the TCP, Unix stream, and dccp child sockets
as well as openreqs to be at the same MLS level as the peer. This will
result in the selection of appropriately labeled IPSec Security
Associations.
This also uses the sock's sid (as opposed to the isec sid) in SELinux
enforcement of secmark in rcv_skb and postroute_last hooks.
Signed-off-by: Venkat Yekkirala <vyekkirala@TrustedCS.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds security for IP sockets at the sock level. Security at the
sock level is needed to enforce the SELinux security policy for
security associations even when a sock is orphaned (such as in the TCP
LAST_ACK state).
This will also be used to enforce SELinux controls over data arriving
at or leaving a child socket while it's still waiting to be accepted.
Signed-off-by: Venkat Yekkirala <vyekkirala@TrustedCS.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Teach sk_lock semantics to the lock validator. In the softirq path the
slock has mutex_trylock()+mutex_unlock() semantics, in the process context
sock_lock() case it has mutex_lock()/mutex_unlock() semantics.
Thus we treat sock_owned_by_user() flagged areas as an exclusion area too,
not just those areas covered by a held sk_lock.slock.
Effect on non-lockdep kernels: minimal, sk_lock_sock_init() has been turned
into an inline function.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Teach special (recursive) locking code to the lock validator. Has no effect
on non-lockdep kernels.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Teach special (multi-initialized, per-address-family) locking code to the lock
validator. Has no effect on non-lockdep kernels.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch generalises the TSO-specific bits from sk_setup_caps by adding
the sk_gso_type member to struct sock. This makes sk_setup_caps generic
so that it can be used by TCPv6 or UFO.
The only catch is that whoever uses this must provide a GSO implementation
for their protocol which I think is a fair deal :) For now UFO continues to
live without a GSO implementation which is OK since it doesn't use the sock
caps field at the moment.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the current TSO implementation, NETIF_F_TSO and ECN cannot be
turned on together in a TCP connection. The problem is that most
hardware that supports TSO does not handle CWR correctly if it is set
in the TSO packet. Correct handling requires CWR to be set in the
first packet only if it is set in the TSO header.
This patch adds the ability to turn on NETIF_F_TSO and ECN using
GSO if necessary to handle TSO packets with CWR set. Hardware
that handles CWR correctly can turn on NETIF_F_TSO_ECN in the dev->
features flag.
All TSO packets with CWR set will have the SKB_GSO_TCPV4_ECN set. If
the output device does not have the NETIF_F_TSO_ECN feature set, GSO
will split the packet up correctly with CWR only set in the first
segment.
With help from Herbert Xu <herbert@gondor.apana.org.au>.
Since ECN can always be enabled with TSO, the SOCK_NO_LARGESEND sock
flag is completely removed.
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Warning(/var/linsrc/linux-2617-g4//include/linux/skbuff.h:304): No description found for parameter 'dma_cookie'
Warning(/var/linsrc/linux-2617-g4//include/net/sock.h:1274): No description found for parameter 'copied_early'
Warning(/var/linsrc/linux-2617-g4//net/core/dev.c:3309): No description found for parameter 'chan'
Warning(/var/linsrc/linux-2617-g4//net/core/dev.c:3309): No description found for parameter 'event'
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a generic segmentation offload toggle that can be turned
on/off for each net device. For now it only supports in TCPv4.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.infradead.org/hdrcleanup-2.6: (63 commits)
[S390] __FD_foo definitions.
Switch to __s32 types in joystick.h instead of C99 types for consistency.
Add <sys/types.h> to headers included for userspace in <linux/input.h>
Move inclusion of <linux/compat.h> out of user scope in asm-x86_64/mtrr.h
Remove struct fddi_statistics from user view in <linux/if_fddi.h>
Move user-visible parts of drivers/s390/crypto/z90crypt.h to include/asm-s390
Revert include/media changes: Mauro says those ioctls are only used in-kernel(!)
Include <linux/types.h> and use __uXX types in <linux/cramfs_fs.h>
Use __uXX types in <linux/i2o_dev.h>, include <linux/ioctl.h> too
Remove private struct dx_hash_info from public view in <linux/ext3_fs.h>
Include <linux/types.h> and use __uXX types in <linux/affs_hardblocks.h>
Use __uXX types in <linux/divert.h> for struct divert_blk et al.
Use __u32 for elf_addr_t in <asm-powerpc/elf.h>, not u32. It's user-visible.
Remove PPP_FCS from user view in <linux/ppp_defs.h>, remove __P mess entirely
Use __uXX types in user-visible structures in <linux/nbd.h>
Don't use 'u32' in user-visible struct ip_conntrack_old_tuple.
Use __uXX types for S390 DASD volume label definitions which are user-visible
S390 BIODASDREADCMB ioctl should use __u64 not u64 type.
Remove unneeded inclusion of <linux/time.h> from <linux/ufs_fs.h>
Fix private integer types used in V4L2 ioctls.
...
Manually resolve conflict in include/linux/mtd/physmap.h
I found a few more spots where pskb_trim_rcsum could be used but were not.
This patch changes them to use it.
Also, sk_filter can get paged skb data. Therefore we must use pskb_trim
instead of skb_trim.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add an extra argument to sk_eat_skb, and make it move early copied
packets to the async_wait_queue instead of freeing them.
Signed-off-by: Chris Leech <christopher.leech@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adds an async_wait_queue and some additional fields to tcp_sock, and a
dma_cookie_t to sk_buff.
Signed-off-by: Chris Leech <christopher.leech@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use hlist_unhashed() rather than accessing inside data structure.
Signed-off-by: Akinobu Mita <mita@miraclelinux.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add some sanity checking. truesize should be at least sizeof(struct
sk_buff) plus the current packet length. If not, then truesize is
seriously mangled and deserves a kernel log message.
Currently we'll do the check for release of stream socket buffers.
But we can add checks to more spots over time.
Incorporating ideas from Herbert Xu.
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch extends {get|set}sockopt compatibility layer in order to
move protocol specific parts to their place and avoid huge universal
net/compat.c file in the future.
Signed-off-by: Dmitry Mishin <dim@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is broken, the condition is checked out of socket lock. It is
wonderful the bug survived for so long time.
[ This fixes bugzilla #6233:
race condition in tcp_sendmsg when connection became established ]
Signed-off-by: David S. Miller <davem@davemloft.net>
If you are on a hostile network, or are running protocol tests, you can
easily get the logged swamped by messages about bad UDP and ICMP packets.
This turns those messages off unless a config option is enabled.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Acked-by: Dave Jones <davej@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It should return an unsigned value, and fix sk_filter() as well.
Signed-off-by: Kris Katterjohn <kjak@ispwest.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Its common enough to to justify that, TCP still can't use it as it has the
prequeueing stuff, still to be made generic in the not so distant future :-)
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
So that we can share several timewait sockets related functions and
make the timewait mini sockets infrastructure closer to the request
mini sockets one.
Next changesets will take advantage of this, moving more code out of
TCP and DCCP v4 and v6 to common infrastructure.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It also looks like there were 2 places where the test on sk_err was
missing from the event wait logic (in sk_stream_wait_connect and
sk_stream_wait_memory), while the rest of the sock_error() users look
to be doing the right thing. This version of the patch fixes those,
and cleans up a few places that were testing ->sk_err directly.
Signed-off-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use "hints" to speed up the SACK processing. Various forms
of this have been used by TCP developers (Web100, STCP, BIC)
to avoid the 2x linear search of outstanding segments.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no reason for sk_add_backlog to be a macro. It can
just be an inline function and get type checking.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
- added typedef unsigned int __nocast gfp_t;
- replaced __nocast uses for gfp flags with gfp_t - it gives exactly
the same warnings as far as sparse is concerned, doesn't change
generated code (from gcc point of view we replaced unsigned int with
typedef) and documents what's going on far better.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Arnaldo and I agreed it could be applied now, because I have other
pending patches depending on this one (Thank you Arnaldo)
(The other important patch moves skc_refcnt in a separate cache line,
so that the SMP/NUMA performance doesnt suffer from cache line ping pongs)
1) First some performance data :
--------------------------------
tcp_v4_rcv() wastes a *lot* of time in __inet_lookup_established()
The most time critical code is :
sk_for_each(sk, node, &head->chain) {
if (INET_MATCH(sk, acookie, saddr, daddr, ports, dif))
goto hit; /* You sunk my battleship! */
}
The sk_for_each() does use prefetch() hints but only the begining of
"struct sock" is prefetched.
As INET_MATCH first comparison uses inet_sk(__sk)->daddr, wich is far
away from the begining of "struct sock", it has to bring into CPU
cache cold cache line. Each iteration has to use at least 2 cache
lines.
This can be problematic if some chains are very long.
2) The goal
-----------
The idea I had is to change things so that INET_MATCH() may return
FALSE in 99% of cases only using the data already in the CPU cache,
using one cache line per iteration.
3) Description of the patch
---------------------------
Adds a new 'unsigned int skc_hash' field in 'struct sock_common',
filling a 32 bits hole on 64 bits platform.
struct sock_common {
unsigned short skc_family;
volatile unsigned char skc_state;
unsigned char skc_reuse;
int skc_bound_dev_if;
struct hlist_node skc_node;
struct hlist_node skc_bind_node;
atomic_t skc_refcnt;
+ unsigned int skc_hash;
struct proto *skc_prot;
};
Store in this 32 bits field the full hash, not masked by (ehash_size -
1) Using this full hash as the first comparison done in INET_MATCH
permits us immediatly skip the element without touching a second cache
line in case of a miss.
Suppress the sk_hashent/tw_hashent fields since skc_hash (aliased to
sk_hash and tw_hash) already contains the slot number if we mask with
(ehash_size - 1)
File include/net/inet_hashtables.h
64 bits platforms :
#define INET_MATCH(__sk, __hash, __cookie, __saddr, __daddr, __ports, __dif)\
(((__sk)->sk_hash == (__hash))
((*((__u64 *)&(inet_sk(__sk)->daddr)))== (__cookie)) && \
((*((__u32 *)&(inet_sk(__sk)->dport))) == (__ports)) && \
(!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))
32bits platforms:
#define TCP_IPV4_MATCH(__sk, __hash, __cookie, __saddr, __daddr, __ports, __dif)\
(((__sk)->sk_hash == (__hash)) && \
(inet_sk(__sk)->daddr == (__saddr)) && \
(inet_sk(__sk)->rcv_saddr == (__daddr)) && \
(!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))
- Adds a prefetch(head->chain.first) in
__inet_lookup_established()/__tcp_v4_check_established() and
__inet6_lookup_established()/__tcp_v6_check_established() and
__dccp_v4_check_established() to bring into cache the first element of the
list, before the {read|write}_lock(&head->lock);
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ipv4 and ipv6 protocols need to access it unconditionally.
SYSCTL=n build failure reported by Russell King.
Signed-off-by: David S. Miller <davem@davemloft.net>
I've finally found a potential cause of the sk_forward_alloc underflows
that people have been reporting sporadically.
When tcp_sendmsg tacks on extra bits to an existing TCP_PAGE we don't
check sk_forward_alloc even though a large amount of time may have
elapsed since we allocated the page. In the mean time someone could've
come along and liberated packets and reclaimed sk_forward_alloc memory.
This patch makes tcp_sendmsg check sk_forward_alloc every time as we
do in do_tcp_sendpages.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces sk_stream_wmem_schedule as a short-hand for
the sk_forward_alloc checking on egress.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Protocols that make extensive use of SKB cloning,
for example TCP, eat at least 2 allocations per
packet sent as a result.
To cut the kmalloc() count in half, we implement
a pre-allocation scheme wherein we allocate
2 sk_buff objects in advance, then use a simple
reference count to free up the memory at the
correct time.
Based upon an initial patch by Thomas Graf and
suggestions from Herbert Xu.
Signed-off-by: David S. Miller <davem@davemloft.net>
Of this type, mostly:
CHECK net/ipv6/netfilter.c
net/ipv6/netfilter.c:96:12: warning: symbol 'ipv6_netfilter_init' was not declared. Should it be static?
net/ipv6/netfilter.c:101:6: warning: symbol 'ipv6_netfilter_fini' was not declared. Should it be static?
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This also moved inet_iif from tcp to inet_hashtables.h, as it is
needed by the inet_lookup callers, perhaps this needs a bit of
polishing, but for now seems fine.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This creates struct inet_connection_sock, moving members out of struct
tcp_sock that are shareable with other INET connection oriented
protocols, such as DCCP, that in my private tree already uses most of
these members.
The functions that operate on these members were renamed, using a
inet_csk_ prefix while not being moved yet to a new file, so as to
ease the review of these changes.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Out of tcp_create_openreq_child, will be used in
dccp_create_openreq_child, and is a nice sock function anyway.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
And also some TIME_WAIT functions.
[acme@toy net-2.6.14]$ grep built-in /tmp/before.size /tmp/after.size
/tmp/before.size: 282955 13122 9312 305389 4a8ed net/ipv4/built-in.o
/tmp/after.size: 281566 13122 9312 304000 4a380 net/ipv4/built-in.o
[acme@toy net-2.6.14]$
I kept them still inlined, will uninline at some point to see what
would be the performance difference.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This paves the way to generalise the rest of the sock ID lookup
routines and saves some bytes in TCPv4 TIME_WAIT sockets on distro
kernels (where IPv6 is always built as a module):
[root@qemu ~]# grep tw_sock /proc/slabinfo
tw_sock_TCPv6 0 0 128 31 1
tw_sock_TCP 0 0 96 41 1
[root@qemu ~]#
Now if a protocol wants to use the TIME_WAIT generic infrastructure it
only has to set the sk_prot->twsk_obj_size field with the size of its
inet_timewait_sock derived sock and proto_register will create
sk_prot->twsk_slab, for now its only for INET sockets, but we can
introduce timewait_sock later if some non INET transport protocolo
wants to use this stuff.
Next changesets will take advantage of this new infrastructure to
generalise even more TCP code.
[acme@toy net-2.6.14]$ grep built-in /tmp/before.size /tmp/after.size
/tmp/before.size: 188646 11764 5068 205478 322a6 net/ipv4/built-in.o
/tmp/after.size: 188144 11764 5068 204976 320b0 net/ipv4/built-in.o
[acme@toy net-2.6.14]$
Tested with both IPv4 & IPv6 (::1 (localhost) & ::ffff:172.20.0.1
(qemu host)).
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Also expose all of the tcp_hashinfo members, i.e. killing those
tcp_ehash, etc macros, this will more clearly expose already generic
functions and some that need just a bit of work to become generic, as
we'll see in the upcoming changesets.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
From tcp_v4_setup_caps, that always is preceded by a call to
__sk_dst_set, so coalesce this sequence into sk_setup_caps, removing
one call to a TCP function in the IP layer.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This operation was already generic and DCCP will use it.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>