The UDP transmit path has been running under the socket lock
for a long time because of the corking feature. This means that
transmitting to the same socket in multiple threads does not
scale at all.
However, as most users don't actually use corking, the locking
can be removed in the common case.
This patch creates a lockless fast path where corking is not used.
Please note that this does create a slight inaccuracy in the
enforcement of socket send buffer limits. In particular, we
may exceed the socket limit by up to (number of CPUs) * (packet
size) because of the way the limit is computed.
As the primary purpose of socket buffers is to indicate congestion,
this should not be a great problem for now.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch converts UDP to use the new ip_finish_skb API. This
would then allows us to more easily use ip_make_skb which allows
UDP to run without a socket lock.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the helper ip_make_skb which is like ip_append_data
and ip_push_pending_frames all rolled into one, except that it does
not send the skb produced. The sending part is carried out by
ip_send_skb, which the transport protocol can call after it has
tweaked the skb.
It is meant to be called in cases where corking is not used should
have a one-to-one correspondence to sendmsg.
This patch also adds the helper ip_finish_skb which is meant to
be replace ip_push_pending_frames when corking is required.
Previously the protocol stack would peek at the socket write
queue and add its header to the first packet. With ip_finish_skb,
the protocol stack can directly operate on the final skb instead,
just like the non-corking case with ip_make_skb.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to allow simultaneous calls to ip_append_data on the same
socket, it must not modify any shared state in sk or inet (other
than those that are designed to allow that such as atomic counters).
This patch abstracts out write references to sk and inet_sk in
ip_append_data and its friends so that we may use the underlying
code in parallel.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
UFO doesn't really use the sk_sndmsg_* parameters so touching
them is pointless. It can't use them anyway since the whole
point of UFO is to use the original pages without copying.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_route_newports() is the only place in the entire kernel that
cares about the port members in the routing cache entry's lookup
flow key.
Therefore the only reason we store an entire flow inside of the
struct rtentry is for this one special case.
Rewrite ip_route_newports() such that:
1) The caller passes in the original port values, so we don't need
to use the rth->fl.fl_ip_{s,d}port values to remember them.
2) The lookup flow is constructed by hand instead of being copied
from the routing cache entry's flow.
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix a bug that undo_retrans is incorrectly decremented when undo_marker is
not set or undo_retrans is already 0. This happens when sender receives
more DSACK ACKs than packets retransmitted during the current
undo phase. This may also happen when sender receives DSACK after
the undo operation is completed or cancelled.
Fix another bug that undo_retrans is incorrectly incremented when
sender retransmits an skb and tcp_skb_pcount(skb) > 1 (TSO). This case
is rare but not impossible.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now, TCP_CHECK_TIMER is not used for debuging, it does nothing.
And, it has been there for several years, maybe 6 years.
Remove it to keep code clearer.
Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric W. Biederman reported a lockdep splat in inet_twsk_deschedule()
This is caused by inet_twsk_purge(), run from process context,
and commit 575f4cd5a5 (net: Use rcu lookups in inet_twsk_purge.)
removed the BH disabling that was necessary.
Add the BH disabling but fine grained, right before calling
inet_twsk_deschedule(), instead of whole function.
With help from Linus Torvalds and Eric W. Biederman
Reported-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Daniel Lezcano <daniel.lezcano@free.fr>
CC: Pavel Emelyanov <xemul@openvz.org>
CC: Arnaldo Carvalho de Melo <acme@redhat.com>
CC: stable <stable@kernel.org> (# 2.6.33+)
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 0dbaee3b37 (net: Abstract default ADVMSS behind an
accessor.) introduced a possible crash in tcp_connect_init(), when
dst->default_advmss() is called from dst_metric_advmss()
Reported-by: George Spelvin <linux@horizon.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The only troublesome bit here is __mkroute_output which wants
to override res->fi and res->type, compute those in local
variables instead.
Signed-off-by: David S. Miller <davem@davemloft.net>
GCC emits all kinds of crazy zero extensions when we go from signed
int, to unsigned short, etc. etc.
This transformation has to be legal because:
1) In tkey_extract_bits() in mask_pfx(), the values are used to
perform shifts, on which negative values are undefined by C.
2) In fib_table_lookup() we perform comparisons with unsigned
values, constants, and additions. None of which should
encounter negative values.
Signed-off-by: David S. Miller <davem@davemloft.net>
This allows avoiding multiple writes to the initial __refcnt.
The most simplest cases of wanting an initial reference of "1"
in ipv4 and ipv6 have been converted, the rest have been left
along and kept at the existing "0".
Signed-off-by: David S. Miller <davem@davemloft.net>
This also allows us to combine all the dst->flags settings and avoid
read/modify/write sequences to this struct member.
Signed-off-by: David S. Miller <davem@davemloft.net>
There's a lot of redundancy and unnecessary stack frames
in the output route creation path.
1) Make __mkroute_output() return error pointers.
2) Eliminate ip_mkroute_output() entirely, made possible by #1.
3) Call __mkroute_output() directly and handling the returning error
pointers in ip_route_output_slow().
Signed-off-by: David S. Miller <davem@davemloft.net>
Note that we do not generate the redirect netevent any longer,
because we don't create a new cached route.
Instead, once the new neighbour is bound to the cached route,
we emit a neigh update event instead.
Signed-off-by: David S. Miller <davem@davemloft.net>
The general idea is that if we learn new PMTU information, we
bump the peer genid.
This triggers the dst_ops->check() code to validate and if
necessary propagate the new PMTU value into the metrics.
Learned PMTU information self-expires.
This means that it is not necessary to kill a cached route
entry just because the PMTU information is too old.
As a consequence:
1) When the path appears unreachable (dst_ops->link_failure
or dst_ops->negative_advice) we unwind the PMTU state if
it is out of date, instead of killing the cached route.
A redirected route will still be invalidated in these
situations.
2) rt_check_expire(), rt_worker_func(), et al. are no longer
necessary at all.
Signed-off-by: David S. Miller <davem@davemloft.net>
NETDEV_NOTIFY_PEER is an explicit request by the driver to send a link
notification while NETDEV_UP/NETDEV_CHANGEADDR generate link
notifications as a sort of side effect.
In the later cases the sysctl option is present because link
notification events can have undesired effects e.g. if the link is
flapping. I don't think this applies in the case of an explicit
request from a driver.
This patch makes NETDEV_NOTIFY_PEER unconditional, if preferred we
could add a new sysctl for this case which defaults to on.
This change causes Xen post-migration ARP notifications (which cause
switches to relearn their MAC tables etc) to be sent by default.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 0c838ff1ad (ipv4: Consolidate all default route selection
implementations.) forgot to remove one rcu_read_unlock() from
fib_select_default().
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 5811662b15 ("net: use the macros
defined for the members of flowi") accidentally removed the setting of
IPPROTO_GRE from the struct flowi in ipgre_tunnel_xmit. This patch
restores it.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Acked-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If we didn't have a routing cache, we would not be able to properly
propagate certain kinds of dynamic path attributes, for example
PMTU information and redirects.
The reason is that if we didn't have a routing cache, then there would
be no way to lookup all of the active cached routes hanging off of
sockets, tunnels, IPSEC bundles, etc.
Consider the case where we created a cached route, but no inetpeer
entry existed and also we were not asked to pre-COW the route metrics
and therefore did not force the creation a new inetpeer entry.
If we later get a PMTU message, or a redirect, and store this
information in a new inetpeer entry, there is no way to teach that
cached route about the newly existing inetpeer entry.
The facilities implemented here handle this problem.
First we create a generation ID. When we create a cached route of any
kind, we remember the generation ID at the time of attachment. Any
time we force-create an inetpeer entry in response to new path
information, we bump that generation ID.
The dst_ops->check() callback is where the knowledge of this event
is propagated. If the global generation ID does not equal the one
stored in the cached route, and the cached route has not attached
to an inetpeer yet, we look it up and attach if one is found. Now
that we've updated the cached route's information, we update the
route's generation ID too.
This clears the way for implementing PMTU and redirects directly in
the inetpeer cache. There is absolutely no need to consult cached
route information in order to maintain this information.
At this point nothing bumps the inetpeer genids, that comes in the
later changes which handle PMTUs and redirects using inetpeers.
Signed-off-by: David S. Miller <davem@davemloft.net>
Validity of the cached PMTU information is indicated by it's
expiration value being non-zero, just as per dst->expires.
The scheme we will use is that we will remember the pre-ICMP value
held in the metrics or route entry, and then at expiration time
we will restore that value.
In this way PMTU expiration does not kill off the cached route as is
done currently.
Redirect information is permanent, or at least until another redirect
is received.
Signed-off-by: David S. Miller <davem@davemloft.net>
Future changes will add caching information, and some of
these new elements will be addresses.
Since the family is implicit via the ->daddr.family member,
replicating the family in ever address we store is entirely
redundant.
Signed-off-by: David S. Miller <davem@davemloft.net>
The Linux IPv4 AH stack aligns the AH header on a 64 bit boundary
(like in IPv6). This is not RFC compliant (see RFC4302, Section
3.3.3.2.1), it should be aligned on 32 bits.
For most of the authentication algorithms, the ICV size is 96 bits.
The AH header alignment on 32 or 64 bits gives the same results.
However for SHA-256-128 for instance, the wrong 64 bit alignment results
in adding useless padding in IPv4 AH, which is forbidden by the RFC.
To avoid breaking backward compatibility, we use a new flag
(XFRM_STATE_ALIGN4) do change original behavior.
Initial patch from Dang Hongwu <hongwu.dang@6wind.com> and
Christophe Gouault <christophe.gouault@6wind.com>.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Like metrics, the ICMP rate limiting bits are cached state about
a destination. So move it into the inet_peer entries.
If an inet_peer cannot be bound (the reason is memory allocation
failure or similar), the policy is to allow.
Signed-off-by: David S. Miller <davem@davemloft.net>
Always lookup to see if we have an existing inetpeer entry for
a route. Let FLOWI_FLAG_PRECOW_METRICS merely influence the
"create" argument to rt_bind_peer().
Also, call rt_bind_peer() unconditionally since it is not
possible for rt->peer to be non-NULL at this point.
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 709b46e8d9 ("net: Add compat
ioctl support for the ipv4 multicast ioctl SIOCGETSGCNT") added the
correct plumbing to handle SIOCGETSGCNT properly.
However, whilst definiting a proper "struct compat_sioc_sg_req" it
isn't actually used in ipmr_compat_ioctl().
Correct this oversight.
Signed-off-by: David S. Miller <davem@davemloft.net>
If we end up including include/linux/node.h (either explicitly
or implicitly) that header has a definition of "structt node"
too.
So rename the one we use in fib_trie to "rt_trie_node" to avoid
the conflict.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
To avoid confusion with the recently deleted fib_hash.c
code, use "fib_info_hash_*" instead of plain "fib_hash_*".
Signed-off-by: David S. Miller <davem@davemloft.net>
The time has finally come to remove the hash based routing table
implementation in ipv4.
FIB Trie is mature, well tested, and I've done an audit of it's code
to confirm that it implements insert, delete, and lookup with the same
identical semantics as fib_hash did.
If there are any semantic differences found in fib_trie, we should
simply fix them.
I've placed the trie statistic config option under advanced router
configuration.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
In 135367b "netfilter: xtables: change xt_target.checkentry return type",
the type returned by checkentry was changed from boolean to int, but the
return values where not adjusted.
arptables: Input/output error
This broke arptables with the mangle target since it returns true
under success, which is interpreted by xtables as >0, thus
returning EIO.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Both fib_trie and fib_hash have a local implementation of
fib_table_select_default(). This is completely unnecessary
code duplication.
Since we now remember the fib_table and the head of the fib
alias list of the default route, we can implement one single
generic version of this routine.
Looking at the fib_hash implementation you may get the impression
that it's possible for there to be multiple top-level routes in
the table for the default route. The truth is, it isn't, the
insert code will only allow one entry to exist in the zero
prefix hash table, because all keys evaluate to zero and all
keys in a hash table must be unique.
Signed-off-by: David S. Miller <davem@davemloft.net>
This will be used later to implement fib_select_default() in a
completely generic manner, instead of the current situation where the
default route is re-looked up in the TRIE/HASH table and then the
available aliases are analyzed.
Signed-off-by: David S. Miller <davem@davemloft.net>
When an IPSEC SA is still being set up, __xfrm_lookup() will return
-EREMOTE and so ip_route_output_flow() will return a blackhole route.
This can happen in a sndmsg call, and after d33e455337 ("net: Abstract
default MTU metric calculation behind an accessor.") this leads to a
crash in ip_append_data() because the blackhole dst_ops have no
default_mtu() method and so dst_mtu() calls a NULL pointer.
Fix this by adding default_mtu() methods (that simply return 0, matching
the old behavior) to the blackhole dst_ops.
The IPv4 part of this patch fixes a crash that I saw when using an IPSEC
VPN; the IPv6 part is untested because I don't have an IPv6 VPN, but it
looks to be needed as well.
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SIOCGETSGCNT is not a unique ioctl value as it it maps tio SIOCPROTOPRIVATE +1,
which unfortunately means the existing infrastructure for compat networking
ioctls is insufficient. A trivial compact ioctl implementation would conflict
with:
SIOCAX25ADDUID
SIOCAIPXPRISLT
SIOCGETSGCNT_IN6
SIOCGETSGCNT
SIOCRSSCAUSE
SIOCX25SSUBSCRIP
SIOCX25SDTEFACILITIES
To make this work I have updated the compat_ioctl decode path to mirror the
the normal ioctl decode path. I have added an ipv4 inet_compat_ioctl function
so that I can have ipv4 specific compat ioctls. I have added a compat_ioctl
function into struct proto so I can break out ioctls by which kind of ip socket
I am using. I have added a compat_raw_ioctl function because SIOCGETSGCNT only
works on raw sockets. I have added a ipmr_compat_ioctl that mirrors the normal
ipmr_ioctl.
This was necessary because unfortunately the struct layout for the SIOCGETSGCNT
has unsigned longs in it so changes between 32bit and 64bit kernels.
This change was sufficient to run a 32bit ip multicast routing daemon on a
64bit kernel.
Reported-by: Bill Fenner <fenner@aristanetworks.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The fib metric memory in this case is static in the kernel image,
so we don't need to reference count it since it's never going
to go away on us.
Signed-off-by: David S. Miller <davem@davemloft.net>
If there are no explicit metrics attached to a route, hook
fi->fib_info up to dst_default_metrics.
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the initial gateway towards super-sharing metrics
if they are all set to zero for a route.
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP is going to record metrics for the connection,
so pre-COW the route metrics at route cache entry
creation time.
This avoids several atomic operations that have to
occur if we COW the metrics after the entry reaches
global visibility.
Signed-off-by: David S. Miller <davem@davemloft.net>
Please note that the IPSEC dst entry metrics keep using
the generic metrics COW'ing mechanism using kmalloc/kfree.
This gives the IPSEC routes an opportunity to use metrics
which are unique to their encapsulated paths.
Signed-off-by: David S. Miller <davem@davemloft.net>
Set the RTAX_LOCKED metric to INETPEER_METRICS_NEW (basically,
all ones) on fresh inetpeer entries.
This way code can determine if default metrics have been loaded
in from a routing table entry already.
Signed-off-by: David S. Miller <davem@davemloft.net>
Routing metrics are now copy-on-write.
Initially a route entry points it's metrics at a read-only location.
If a routing table entry exists, it will point there. Else it will
point at the all zero metric place-holder called 'dst_default_metrics'.
The writeability state of the metrics is stored in the low bits of the
metrics pointer, we have two bits left to spare if we want to store
more states.
For the initial implementation, COW is implemented simply via kmalloc.
However future enhancements will change this to place the writable
metrics somewhere else, in order to increase sharing. Very likely
this "somewhere else" will be the inetpeer cache.
Note also that this means that metrics updates may transiently fail
if we cannot COW the metrics successfully.
But even by itself, this patch should decrease memory usage and
increase cache locality especially for routing workloads. In those
cases the read-only metric copies stay in place and never get written
to.
TCP workloads where metrics get updated, and those rare cases where
PMTU triggers occur, will take a very slight performance hit. But
that hit will be alleviated when the long-term writable metrics
move to a more sharable location.
Since the metrics storage went from a u32 array of RTAX_MAX entries to
what is essentially a pointer, some retooling of the dst_entry layout
was necessary.
Most importantly, we need to preserve the alignment of the reference
count so that it doesn't share cache lines with the read-mostly state,
as per Eric Dumazet's alignment assertion checks.
The only non-trivial bit here is the move of the 'flags' member into
the writeable cacheline. This is OK since we are always accessing the
flags around the same moment when we made a modification to the
reference count.
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes a bug that causes TCP RST packets to be generated
on otherwise correctly behaved applications, e.g., no unread data
on close,..., etc. To trigger the bug, at least two conditions must
be met:
1. The FIN flag is set on the last data packet, i.e., it's not on a
separate, FIN only packet.
2. The size of the last data chunk on the receive side matches
exactly with the size of buffer posted by the receiver, and the
receiver closes the socket without any further read attempt.
This bug was first noticed on our netperf based testbed for our IW10
proposal to IETF where a large number of RST packets were observed.
netperf's read side code meets the condition 2 above 100%.
Before the fix, tcp_data_queue() will queue the last skb that meets
condition 1 to sk_receive_queue even though it has fully copied out
(skb_copy_datagram_iovec()) the data. Then if condition 2 is also met,
tcp_recvmsg() often returns all the copied out data successfully
without actually consuming the skb, due to a check
"if ((chunk = len - tp->ucopy.len) != 0) {"
and
"len -= chunk;"
after tcp_prequeue_process() that causes "len" to become 0 and an
early exit from the big while loop.
I don't see any reason not to free the skb whose data have been fully
consumed in tcp_data_queue(), regardless of the FIN flag. We won't
get there if MSG_PEEK is on. Am I missing some arcane cases related
to urgent data?
Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Quoting Ben Hutchings: we presumably won't be defining features that
can only be enabled on 64-bit architectures.
Occurences found by `grep -r` on net/, drivers/net, include/
[ Move features and vlan_features next to each other in
struct netdev, as per Eric Dumazet's suggestion -DaveM ]
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit a8b690f98b (tcp: Fix slowness in read /proc/net/tcp)
introduced a bug in handling of SYN_RECV sockets.
st->offset represents number of sockets found since beginning of
listening_hash[st->bucket].
We should not reset st->offset when iterating through
syn_table[st->sbucket], or else if more than ~25 sockets (if
PAGE_SIZE=4096) are in SYN_RECV state, we exit from listening_get_next()
with a too small st->offset
Next time we enter tcp_seek_last_pos(), we are not able to seek past
already found sockets.
Reported-by: PK <runningdoglackey@yahoo.com>
CC: Tom Herbert <therbert@google.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Family was hard-coded to AF_INET but should be daddr->family.
This fixes crashes when unlinking ipv6 peer entries, since the
unlink code was looking up the base pointer properly.
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 941666c2e3 "net: RCU conversion of dev_getbyhwaddr() and
arp_ioctl()" introduced a regression, reported by Jamie Heilman.
"arp -Ds 192.168.2.41 eth0 pub" triggered the ASSERT_RTNL() assert
in pneigh_lookup()
Removing RTNL requirement from arp_ioctl() was a mistake, just revert
that part.
Reported-by: Jamie Heilman <jamie@audible.transient.net>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If SNAT isn't done, the wrong info maybe got by the other cts.
As the filter table is after DNAT table, the packets dropped in filter
table also bother bysource hash table.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
ret != NF_QUEUE only works in the "--queue-num 0" case; for
queues > 0 the test should be '(ret & NF_VERDICT_MASK) != NF_QUEUE'.
However, NF_QUEUE no longer DROPs the skb unconditionally if queueing
fails (due to NF_VERDICT_FLAG_QUEUE_BYPASS verdict flag), so the
re-route test should also be performed if this flag is set in the
verdict.
The full test would then look something like
&& ((ret & NF_VERDICT_MASK) == NF_QUEUE && (ret & NF_VERDICT_FLAG_QUEUE_BYPASS))
This is rather ugly, so just remove the NF_QUEUE test altogether.
The only effect is that we might perform an unnecessary route lookup
in the NF_QUEUE case.
ip6table_mangle did not have such a check.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (41 commits)
sctp: user perfect name for Delayed SACK Timer option
net: fix can_checksum_protocol() arguments swap
Revert "netlink: test for all flags of the NLM_F_DUMP composite"
gianfar: Fix misleading indentation in startup_gfar()
net/irda/sh_irda: return to RX mode when TX error
net offloading: Do not mask out NETIF_F_HW_VLAN_TX for vlan.
USB CDC NCM: tx_fixup() race condition fix
ns83820: Avoid bad pointer deref in ns83820_init_one().
ipv6: Silence privacy extensions initialization
bnx2x: Update bnx2x version to 1.62.00-4
bnx2x: Fix AER setting for BCM57712
bnx2x: Fix BCM84823 LED behavior
bnx2x: Mark full duplex on some external PHYs
bnx2x: Fix BCM8073/BCM8727 microcode loading
bnx2x: LED fix for BCM8727 over BCM57712
bnx2x: Common init will be executed only once after POR
bnx2x: Swap BCM8073 PHY polarity if required
iwlwifi: fix valid chain reading from EEPROM
ath5k: fix locking in tx_complete_poll_work
ath9k_hw: do PA offset calibration only on longcal interval
...
Adding support for SNMP broadcast connection tracking. The SNMP
broadcast requests are now paired with the SNMP responses.
Thus allowing using SNMP broadcasts with firewall enabled.
Please refer to the following conversation:
http://marc.info/?l=netfilter-devel&m=125992205006600&w=2
Patrick McHardy wrote:
> > The best solution would be to add generic broadcast tracking, the
> > use of expectations for this is a bit of abuse.
> > The second best choice I guess would be to move the help() function
> > to a shared module and generalize it so it can be used for both.
This patch implements the "second best choice".
Since the netbios-ns conntrack module uses the same helper
functionality as the snmp, only one helper function is added
for both snmp and netbios-ns modules into the new object -
nf_conntrack_broadcast.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
When a packet is meant to be handled by another node of the cluster,
silently drop it instead of flooding kernel log.
Note : INVALID packets are also dropped without notice.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
My previous patch (netfilter: nf_nat: don't use atomic bit operation)
made a mistake when converting atomic_set to a normal bit 'or'.
IPS_*_BIT should be replaced with IPS_*.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Cc: Tim Gardner <tim.gardner@canonical.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Use is_vmalloc_addr() in nf_ct_free_hashtable() and get rid of
the vmalloc flags to indicate that a hash table has been allocated
using vmalloc().
Signed-off-by: Patrick McHardy <kaber@trash.net>
Fix dependencies of netfilter realm match: it depends on NET_CLS_ROUTE,
which itself depends on NET_SCHED; this dependency is missing from netfilter.
Since matching on realms is also useful without having NET_SCHED enabled and
the option really only controls whether the tclassid member is included in
route and dst entries, rename the config option to IP_ROUTE_CLASSID and move
it outside of traffic scheduling context to get rid of the NET_SCHED dependeny.
Reported-by: Vladis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
One iptables invocation with 135000 rules takes 35 seconds of cpu time
on a recent server, using a 32bit distro and a 64bit kernel.
We eventually trigger NMI/RCU watchdog.
INFO: rcu_sched_state detected stall on CPU 3 (t=6000 jiffies)
COMPAT mode has quadratic behavior and consume 16 bytes of memory per
rule.
Switch the xt_compat algos to use an array instead of list, and use a
binary search to locate an offset in the sorted array.
This halves memory need (8 bytes per rule), and removes quadratic
behavior [ O(N*N) -> O(N*log2(N)) ]
Time of iptables goes from 35 s to 150 ms.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
skb_cow_data() may allocate a new data buffer, so pointers on
skb should be set after this function.
Bug was introduced by commit dff3bb06 ("ah4: convert to ahash")
and 8631e9bd ("ah6: convert to ahash").
Signed-off-by: Wang Xuefu <xuefu.wang@6wind.com>
Acked-by: Krzysztof Witek <krzysztof.witek@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
inet_csk_bind_conflict() logic currently disallows a bind() if
it finds a friend socket (a socket bound on same address/port)
satisfying a set of conditions :
1) Current (to be bound) socket doesnt have sk_reuse set
OR
2) other socket doesnt have sk_reuse set
OR
3) other socket is in LISTEN state
We should add the CLOSE state in the 3) condition, in order to avoid two
REUSEADDR sockets in CLOSE state with same local address/port, since
this can deny further operations.
Note : a prior patch tried to address the problem in a different (and
buggy) way. (commit fda48a0d7a tcp: bind() fix when many ports
are bound).
Reported-by: Gaspar Chilingarov <gasparch@gmail.com>
Reported-by: Daniel Baluta <daniel.baluta@gmail.com>
Tested-by: Daniel Baluta <daniel.baluta@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv4 over firewire needs to be able to remove ARP entries
from the ARP cache that belong to nodes that are removed, because
IPv4 over firewire uses ARP packets for private information
about nodes.
This information becomes invalid as soon as node drops
off the bus and when it reconnects, its only possible
to start talking to it after it responded to an ARP packet.
But ARP cache prevents such packets from being sent.
Signed-off-by: Maxim Levitsky <maximlevitsky@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using "iptables -L" with a lot of rules have a too big BH latency.
Jesper mentioned ~6 ms and worried of frame drops.
Switch to a per_cpu seqlock scheme, so that taking a snapshot of
counters doesnt need to block BH (for this cpu, but also other cpus).
This adds two increments on seqlock sequence per ipt_do_table() call,
its a reasonable cost for allowing "iptables -L" not block BH
processing.
Reported-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Patrick McHardy <kaber@trash.net>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Due to NLM_F_DUMP is composed of two bits, NLM_F_ROOT | NLM_F_MATCH,
when doing "if (x & NLM_F_DUMP)", it tests for _either_ of the bits
being set. Because NLM_F_MATCH's value overlaps with NLM_F_EXCL,
non-dump requests with NLM_F_EXCL set are mistaken as dump requests.
Substitute the condition to test for _all_ bits being set.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In 1ae4de0cdf, the secctx was exported
via the /proc/net/netfilter/nf_conntrack and ctnetlink interfaces
instead of the secmark.
That patch introduced the use of security_secid_to_secctx() which may
return a non-zero value on error.
In one of my setups, I have NF_CONNTRACK_SECMARK enabled but no
security modules. Thus, security_secid_to_secctx() returns a negative
value that results in the breakage of the /proc and `conntrack -L'
outputs. To fix this, we skip the inclusion of secctx if the
aforementioned function fails.
This patch also fixes the dynamic netlink message size calculation
if security_secid_to_secctx() returns an error, since its logic is
also wrong.
This problem exists in Linux kernel >= 2.6.37.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC3168 (The Addition of Explicit Congestion Notification to IP)
states :
5.3. Fragmentation
ECN-capable packets MAY have the DF (Don't Fragment) bit set.
Reassembly of a fragmented packet MUST NOT lose indications of
congestion. In other words, if any fragment of an IP packet to be
reassembled has the CE codepoint set, then one of two actions MUST be
taken:
* Set the CE codepoint on the reassembled packet. However, this
MUST NOT occur if any of the other fragments contributing to
this reassembly carries the Not-ECT codepoint.
* The packet is dropped, instead of being reassembled, for any
other reason.
This patch implements this requirement for IPv4, choosing the first
action :
If one fragment had NO-ECT codepoint
reassembled frame has NO-ECT
ElIf one fragment had CE codepoint
reassembled frame has CE
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The preferred source address is currently ignored for local routes,
which results in all local connections having a src address that is the
same as the local dst address. Fix this by respecting the preferred source
address when it is provided for local routes.
This bug can be demonstrated as follows:
# ifconfig dummy0 192.168.0.1
# ip route show table local | grep local.*dummy0
local 192.168.0.1 dev dummy0 proto kernel scope host src 192.168.0.1
# ip route change table local local 192.168.0.1 dev dummy0 \
proto kernel scope host src 127.0.0.1
# ip route show table local | grep local.*dummy0
local 192.168.0.1 dev dummy0 proto kernel scope host src 127.0.0.1
We now establish a local connection and verify the source IP
address selection:
# nc -l 192.168.0.1 3128 &
# nc 192.168.0.1 3128 &
# netstat -ant | grep 192.168.0.1:3128.*EST
tcp 0 0 192.168.0.1:3128 192.168.0.1:33228 ESTABLISHED
tcp 0 0 192.168.0.1:33228 192.168.0.1:3128 ESTABLISHED
Signed-off-by: Joel Sing <jsing@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In ip_route_output_slow(), instead of allowing a route to be created on
a not UPed device, report -ENETUNREACH immediately.
# ip tunnel add mode ipip remote 10.16.0.164 local
10.16.0.72 dev eth0
# (Note : tunl1 is down)
# ping -I tunl1 10.1.2.3
PING 10.1.2.3 (10.1.2.3) from 192.168.18.5 tunl1: 56(84) bytes of data.
(nothing)
# ./a.out tunl1
# ip tunnel del tunl1
Message from syslogd@shelby at Dec 22 10:12:08 ...
kernel: unregister_netdevice: waiting for tunl1 to become free.
Usage count = 3
After patch:
# ping -I tunl1 10.1.2.3
connect: Network is unreachable
Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Reviewed-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 4465b46900.
Conflicts:
net/ipv4/fib_frontend.c
As reported by Ben Greear, this causes regressions:
> Change 4465b46900 caused rules
> to stop matching the input device properly because the
> FLOWI_FLAG_MATCH_ANY_IIF is always defined in ip_dev_find().
>
> This breaks rules such as:
>
> ip rule add pref 512 lookup local
> ip rule del pref 0 lookup local
> ip link set eth2 up
> ip -4 addr add 172.16.0.102/24 broadcast 172.16.0.255 dev eth2
> ip rule add to 172.16.0.102 iif eth2 lookup local pref 10
> ip rule add iif eth2 lookup 10001 pref 20
> ip route add 172.16.0.0/24 dev eth2 table 10001
> ip route add unreachable 0/0 table 10001
>
> If you had a second interface 'eth0' that was on a different
> subnet, pinging a system on that interface would fail:
>
> [root@ct503-60 ~]# ping 192.168.100.1
> connect: Invalid argument
Reported-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 86bcebafc5 ("tcp: fix >2 iw selection") fixed a case
when congestion window initialization has been mistakenly omitted
by introducing cwnd label and putting backwards goto from the
end of the function.
This makes the code unnecessarily tricky to read and understand
on a first sight.
Shuffle the code around a little bit to make it more obvious.
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexey Vlasov found /proc/net/tcp could sometime loop and display
millions of sockets in LISTEN state.
In 2.6.29, when we converted TCP hash tables to RCU, we left two
sk_next() calls in listening_get_next().
We must instead use sk_nulls_next() to properly detect an end of chain.
Reported-by: Alexey Vlasov <renton@renton.name>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
MAINTAINERS
arch/arm/mach-omap2/pm24xx.c
drivers/scsi/bfa/bfa_fcpim.c
Needed to update to apply fixes for which the old branch was too
outdated.
This patch changes the default initial receive window to 10 mss
(defined constant). The default window is limited to the maximum
of 10*1460 and 2*mss (when mss > 1460).
draft-ietf-tcpm-initcwnd-00 is a proposal to the IETF that recommends
increasing TCP's initial congestion window to 10 mss or about 15KB.
Leading up to this proposal were several large-scale live Internet
experiments with an initial congestion window of 10 mss (IW10), where
we showed that the average latency of HTTP responses improved by
approximately 10%. This was accompanied by a slight increase in
retransmission rate (0.5%), most of which is coming from applications
opening multiple simultaneous connections. To understand the extreme
worst case scenarios, and fairness issues (IW10 versus IW3), we further
conducted controlled testbed experiments. We came away finding minimal
negative impact even under low link bandwidths (dial-ups) and small
buffers. These results are extremely encouraging to adopting IW10.
However, an initial congestion window of 10 mss is useless unless a TCP
receiver advertises an initial receive window of at least 10 mss.
Fortunately, in the large-scale Internet experiments we found that most
widely used operating systems advertised large initial receive windows
of 64KB, allowing us to experiment with a wide range of initial
congestion windows. Linux systems were among the few exceptions that
advertised a small receive window of 6KB. The purpose of this patch is
to fix this shortcoming.
References:
1. A comprehensive list of all IW10 references to date.
http://code.google.com/speed/protocols/tcpm-IW10.html
2. Paper describing results from large-scale Internet experiments with IW10.
http://ccr.sigcomm.org/drupal/?q=node/621
3. Controlled testbed experiments under worst case scenarios and a
fairness study.
http://www.ietf.org/proceedings/79/slides/tcpm-0.pdf
4. Raw test data from testbed experiments (Linux senders/receivers)
with initial congestion and receive windows of both 10 mss.
http://research.csc.ncsu.edu/netsrv/?q=content/iw10
5. Internet-Draft. Increasing TCP's Initial Window.
https://datatracker.ietf.org/doc/draft-ietf-tcpm-initcwnd/
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Flush the routing cache only of entries that match the
network namespace in which the purge event occurred.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>