These are the missing IPv6 bits for the infrastructure added in commit
41063e9dd1 (ipv4: Early TCP socket demux.)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ICMP messages generated in output path if frame length is bigger than
mtu are actually lost because socket is owned by user (doing the xmit)
One example is the ipgre_tunnel_xmit() calling
icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, htonl(mtu));
We had a similar case fixed in commit a34a101e1e (ipv6: disable GSO on
sockets hitting dst_allfrag).
The problem with that fix is that it relied on retransmit timers, so short
TCP sessions paid too high a latency price.
This patch uses the tcp_release_cb() infrastructure so that MTU
reduction messages (ICMP messages) are not lost, and no extra delay
is added in TCP transmits.
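The deferral idea described above can be sketched in plain userspace C.
This is only an illustration under stated assumptions, not the kernel
code: struct sock_sketch, MTU_REDUCED_DEFERRED, icmp_mtu_event() and
release_cb_sketch() are hypothetical names invented for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical deferred-work flag, invented for this sketch. */
#define MTU_REDUCED_DEFERRED 0x1u

struct sock_sketch {
    bool     owned_by_user;  /* true while a sender holds the socket */
    unsigned deferred;       /* work postponed until the socket is released */
    uint32_t pending_mtu;    /* MTU carried by the deferred ICMP message */
    uint32_t pmtu;           /* MTU currently used for transmits */
};

/* ICMP "fragmentation needed" event: apply now, or defer if the socket is busy. */
void icmp_mtu_event(struct sock_sketch *sk, uint32_t mtu)
{
    if (sk->owned_by_user) {
        sk->pending_mtu = mtu;
        sk->deferred |= MTU_REDUCED_DEFERRED;  /* postponed, not lost */
        return;
    }
    if (mtu < sk->pmtu)
        sk->pmtu = mtu;
}

/* Called when the owner releases the socket: run the postponed work. */
void release_cb_sketch(struct sock_sketch *sk)
{
    sk->owned_by_user = false;
    if (sk->deferred & MTU_REDUCED_DEFERRED) {
        sk->deferred &= ~MTU_REDUCED_DEFERRED;
        if (sk->pending_mtu < sk->pmtu)
            sk->pmtu = sk->pending_mtu;
    }
}
```

The point of the sketch is that the MTU update happens as soon as the
owner lets go of the socket, without waiting for any timer.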
Reported-by: Maciej Żenczykowski <maze@google.com>
Diagnosed-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Tore Anderson <tore@fud.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a big comment explaining how the field works, and use defines
instead of magic constants for the values assigned to it.
Suggested by Joe Perches.
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements the common code for both the client and server.
1. TCP Fast Open option processing. Since Fast Open does not have an
option number assigned by IANA yet, it shares the experiment option
code 254 by implementing draft-ietf-tcpm-experimental-options
with a 16 bits magic number 0xF989. This enables global experiments
without clashing with the scarce (2) experimental options available for TCP.
When the draft status becomes standard (maybe), the client should
switch to the new option number assigned while the server supports
both numbers for transition.
2. The new sysctl tcp_fastopen
3. A place holder init function
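The option layout from item 1 can be sketched as a small userspace
parser. This is an assumption-laden illustration, not the patch itself:
it assumes the usual kind/length/data TCP option encoding, with the
16-bit magic stored in network byte order right after the length byte
and the cookie following it. The function name find_fastopen_cookie()
and the macro names are chosen for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define OPT_EOL_SKETCH        0       /* end of option list */
#define OPT_NOP_SKETCH        1       /* one-byte padding */
#define OPT_EXP_SKETCH        254     /* shared experimental option kind */
#define FASTOPEN_MAGIC_SKETCH 0xF989  /* 16-bit magic number from the text */

/*
 * Scan a TCP option block for the experimental Fast Open option.
 * Returns a pointer to the cookie bytes and stores their count in
 * *cookie_len, or returns NULL if the option is absent or malformed.
 */
const uint8_t *find_fastopen_cookie(const uint8_t *opt, size_t len,
                                    size_t *cookie_len)
{
    size_t i = 0;

    while (i < len) {
        uint8_t kind = opt[i];
        uint8_t olen;

        if (kind == OPT_EOL_SKETCH)
            break;
        if (kind == OPT_NOP_SKETCH) {
            i++;
            continue;
        }
        if (i + 1 >= len)
            break;                   /* no room for the length byte */
        olen = opt[i + 1];
        if (olen < 2 || i + olen > len)
            break;                   /* malformed length */
        if (kind == OPT_EXP_SKETCH && olen >= 4 &&
            ((opt[i + 2] << 8) | opt[i + 3]) == FASTOPEN_MAGIC_SKETCH) {
            *cookie_len = (size_t)olen - 4;
            return opt + i + 4;
        }
        i += olen;                   /* skip any other option */
    }
    return NULL;
}
```

An experimental option that carries a different magic number is skipped
like any unknown option, which is what lets several experiments share
kind 254.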
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce ipv6_addr_hash() helper doing a XOR on all bits
of an IPv6 address, with an optimized x86_64 version.
Use it in flow dissector, as suggested by Andrew McGregor,
to reduce hash collision probabilities in fq_codel (and other
users of flow dissector)
Use it in ip6_tunnel.c and use more bit shuffling, as suggested
by David Laight, as existing hash was ignoring most of them.
Use it in sunrpc and use more bit shuffling, using hash_32().
Use it in net/ipv6/addrconf.c, using hash_32() as well.
As a cleanup, use it in net/ipv4/tcp_metrics.c
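A minimal userspace sketch of the XOR folding described above. The
struct in6_addr_sketch type and both function names are hypothetical
stand-ins, and the second function only illustrates why a 64-bit
variant can produce the same value; it is not the optimized x86_64
code from the patch.

```c
#include <stdint.h>

/* Hypothetical stand-in for an IPv6 address: four 32-bit words. */
struct in6_addr_sketch {
    uint32_t word[4];
};

/* Fold all 128 address bits into 32 by XOR, so every bit affects the hash. */
uint32_t ipv6_addr_hash_sketch(const struct in6_addr_sketch *a)
{
    return a->word[0] ^ a->word[1] ^ a->word[2] ^ a->word[3];
}

/* Same result computed on two 64-bit halves, then folded once. */
uint32_t ipv6_addr_hash64_sketch(const struct in6_addr_sketch *a)
{
    uint64_t lo = ((uint64_t)a->word[1] << 32) | a->word[0];
    uint64_t hi = ((uint64_t)a->word[3] << 32) | a->word[2];
    uint64_t x = lo ^ hi;

    return (uint32_t)(x ^ (x >> 32));
}
```

Callers that need a table index would still mix the result further,
for example with hash_32() as the text mentions for sunrpc and addrconf.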
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrew McGregor <andrewmcgr@gmail.com>
Cc: Dave Taht <dave.taht@gmail.com>
Cc: Tom Herbert <therbert@google.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We should provide to inet6_csk_route_socket a struct flowi6 pointer,
so that inet6_csk_xmit() works correctly instead of sending garbage.
Also add some consts.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These patches implement the final mechanism necessary to really allow
us to go without the route cache in ipv4.
We need a place to have long-term storage of PMTU/redirect information
which is independent of the routes themselves, yet does not get us
back into a situation where we have to write to metrics or anything
like that.
For this we use a "next-hop exception" table in the FIB nexthops.
The one thing I desperately want to avoid is having to create clone
routes in the FIB trie for this purpose, because that is very
expensive. However, I'm willing to entertain such an idea later
if this current scheme proves to have downsides that the FIB trie
variant would not have.
In order to accommodate any such scheme, we need to be able to
produce a full flow key at PMTU/redirect time. That required an
adjustment of the interface call-sites used to propagate these events.
For a PMTU/redirect with a fully specified socket, we pass that socket
and use it to produce the flow key.
Otherwise we use a passed in SKB to formulate the key. There are two
cases that need to be distinguished, ICMP message processing (in which
case the IP header is at skb->data) and output packet processing
(mostly tunnels, and in all such cases the IP header is at ip_hdr(skb)).
We also have to make the code able to handle the case where the dst
itself passed into the dst_ops->{update_pmtu,redirect} method is
invalidated. This matters for calls from sockets that have cached
that route. We provide an inet{,6} helper function for this purpose,
and edit SCTP specially since it caches routes at the transport rather
than socket level.
Signed-off-by: David S. Miller <davem@davemloft.net>
This will be used so that we can compose a full flow key.
Even though we have a route in this context, we need more. In the
future the routes will be without destination address, source address,
etc. keying. One ipv4 route will cover entire subnets, etc.
In this environment we have to have a way to possess persistent storage
for redirects and PMTU information. This persistent storage will exist
in the FIB tables, and that's why we'll need to be able to rebuild a
full lookup flow key here. Using that flow key will do a fib_lookup()
and create/update the persistent entry.
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to check the passed-in multicast address and return the
appropriate errno (EINVAL) if it is not valid. There is no need
to walk through the ipv6_mc_list in this situation.
Signed-off-by: Li Wei <lw@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Userspace implementations of network routing protocols sometimes need to
tell RA-originated IPv6 routes from other kernel routes to make proper
routing decisions. This makes most sense for RA routes with nexthops,
namely, default routes and Route Information routes.
The intended means of preserving RA route origin in a netlink message is
through indicating RTPROT_RA as protocol code. Function rt6_fill_node()
tried to do that for default routes, but its test condition was
wrong. This change is modeled after the original mailing list posting
by Jeff Haran. It fixes the test condition for default route case and
sets the same behaviour for Route Information case (both types use
nexthops). Handling of the 3rd RA route type, Prefix Information, is
left unchanged, as it stands for interface connected routes (without
nexthops).
Signed-off-by: Denis Ovsienko <infrastation@yandex.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
We start initializing the struct rt6_info at the first field
behind the struct dst_entry. This is error-prone because it
might leave a new field uninitialized. So start initializing
the struct rt6_info right behind the dst_entry.
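A userspace sketch of clearing everything that follows an embedded
first member. The two structs are hypothetical miniatures invented for
this example, and rt6_clear_after_dst() is not a kernel function; the
sketch only shows why starting right behind the embedded struct covers
fields that are added later.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical miniature versions of the structures named in the text. */
struct dst_entry_sketch {
    int   refcnt;
    void *ops;
};

struct rt6_info_sketch {
    struct dst_entry_sketch dst;  /* must stay the first member */
    void *neighbour;
    int   flags;
    int   new_field;              /* a later addition is covered automatically */
};

/* Zero every byte that follows the embedded dst, leaving dst itself intact. */
void rt6_clear_after_dst(struct rt6_info_sketch *rt)
{
    memset((char *)rt + sizeof(rt->dst), 0, sizeof(*rt) - sizeof(rt->dst));
}
```

Because the range is computed from sizeof() rather than from a named
field, nobody has to remember to update it when the struct grows.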
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This sets things up so that we can have the protocol error handlers
call down into the ipv6 route code for redirects just as ipv4 already
does.
Signed-off-by: David S. Miller <davem@davemloft.net>
This introduces TSQ (TCP Small Queues).
The goal of TSQ is to reduce the number of TCP packets in xmit queues
(qdisc & device queues), to reduce RTT and cwnd bias, part of the
bufferbloat problem.
sk->sk_wmem_alloc is not allowed to grow above a given limit,
allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a
given time.
TSO packets are sized/capped to half the limit, so that we have two
TSO packets in flight, allowing better bandwidth use.
As a side effect, setting the limit to 40000 automatically reduces the
standard gso max limit (65536) to 40000/2 : It can help to reduce
latencies of high prio packets, having smaller TSO packets.
This means we divert sock_wfree() to a tcp_wfree() handler, to
queue/send following frames when skb_orphan() [2] is called for the
already queued skbs.
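The limit arithmetic above can be sketched in a few lines of userspace
C. The function names and GSO_MAX_SKETCH are hypothetical, and the
value 131072 used in the note below is only an assumed reading of
"~128KB"; this is not the kernel implementation.

```c
#include <stdbool.h>
#include <stdint.h>

#define GSO_MAX_SKETCH 65536u  /* standard gso max limit quoted in the text */

/* Cap one TSO packet at half the per-socket limit, so two fit under it. */
uint32_t tsq_tso_cap(uint32_t limit_bytes)
{
    uint32_t half = limit_bytes / 2;

    return half < GSO_MAX_SKETCH ? half : GSO_MAX_SKETCH;
}

/* A socket may queue more data only while its queued bytes stay under the limit. */
bool tsq_may_xmit(uint32_t wmem_alloc, uint32_t limit_bytes)
{
    return wmem_alloc < limit_bytes;
}
```

With a limit of 40000 the cap works out to 20000 bytes per TSO packet,
and with an assumed limit of 131072 the cap stays at the standard 65536.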
Results on my dev machines (tg3/ixgbe nics) are really impressive,
using standard pfifo_fast, and with or without TSO/GSO.
Without reduction of nominal bandwidth, we have reduction of buffering
per bulk sender :
< 1ms on Gbit (instead of 50ms with TSO)
< 8ms on 100Mbit (instead of 132 ms)
I no longer have 4 MBytes backlogged in qdisc by a single netperf
session, and both side socket autotuning no longer use 4 Mbytes.
As skb destructor cannot restart xmit itself ( as qdisc lock might be
taken at this point ), we delegate the work to a tasklet. We use one
tasklet per cpu for performance reasons.
If tasklet finds a socket owned by the user, it sets TSQ_OWNED flag.
This flag is tested in a new protocol method called from release_sock(),
to eventually send new segments.
[1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
[2] skb_orphan() is usually called at TX completion time,
but some drivers call it in their start_xmit() handler.
These drivers should at least use BQL, or else a single TCP
session can still fill the whole NIC TX ring, since TSQ will
have no effect.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Dave Taht <dave.taht@bufferbloat.net>
Cc: Tom Herbert <therbert@google.com>
Cc: Matt Mathis <mattmathis@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix incorrect start markers, wrapped summary lines, missing section
breaks, incorrect separators, and some name mismatches.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No longer needed. TCP writes metrics, but now in its own special
cache that does not dirty the route metrics. Therefore there is no
longer any reason to pre-cow metrics in this way.
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix a bug in ip6_dst_lookup_tail(), where typeof(dst) is
"struct dst_entry **", not "struct dst_entry *"
Reported-by: Fengguang Wu <wfg@linux.intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
remove redundant declarations, they belong in include/net/tcp.h
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
git commit 97cac082 (ipv6: Store route neighbour in rt6_info struct)
added a neighbour pointer to rt6_info. Currently we don't initialize
this pointer at allocation time. We assume this pointer to be valid
if it is not a null pointer, so initialize it on allocation.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
opt always equals np->opts, so it is meaningless to define opt, and
check if opt does not equal np->opts and then try to free opt.
Signed-off-by: RongQing.Li <roy.qing.li@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This makes for a simplified conversion away from dst_get_neighbour*().
All code outside of ipv6 will use neigh lookups via dst_neigh_lookup*().
Signed-off-by: David S. Miller <davem@davemloft.net>
Causes the handler to use the daddr in the ipv4/ipv6 header when
the route gateway is unspecified (local subnet).
Signed-off-by: David S. Miller <davem@davemloft.net>
When a dst_confirm() happens, mark the confirmation as pending in the
dst. Then on the next packet out, when we have the neigh in-hand, do
the update.
This removes the dependency in dst_confirm() of dst's having an
attached neigh.
While we're here, remove the explicit 'dst' NULL check, all except 2
or 3 call sites ensure it's not NULL. So just fix those cases up.
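The pending-confirmation scheme can be sketched as follows. The struct
layouts, field names and function names are hypothetical and exist only
for this illustration; the sketch assumes the confirmation is a simple
timestamp update on the neighbour.

```c
#include <stdbool.h>

/* Hypothetical miniature structures; the field names are for illustration. */
struct neigh_sketch {
    unsigned long confirmed;  /* time of the last reachability confirmation */
};

struct dst_sketch {
    bool pending_confirm;     /* set by confirm, consumed on output */
};

/* Confirmation no longer needs a neighbour: it only records the request. */
void dst_confirm_sketch(struct dst_sketch *dst)
{
    dst->pending_confirm = true;
}

/* Output path: the neighbour is in hand here, so apply any pending confirm. */
void output_sketch(struct dst_sketch *dst, struct neigh_sketch *n,
                   unsigned long now)
{
    if (dst->pending_confirm) {
        dst->pending_confirm = false;
        n->confirmed = now;
    }
    /* ... the packet would be transmitted via n here ... */
}
```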
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch generalizes nf_ct_l4proto_net by splitting it into chunks and
moving the corresponding protocol part to where it really belongs to.
To clarify, note that we follow two different approaches to support per-net
depending on whether it's a built-in or run-time loadable protocol tracker.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Gao feng <gaofeng@cn.fujitsu.com>
At Facebook, we do Layer-3 DSR via IP-in-IP tunneling. Our load balancers wrap
an extra IP header on incoming packets so they can be routed to the backend.
In the v4 tunnel driver, when these packets fall on the default tunl0 device,
the behavior is to decapsulate them and drop them back on the stack. So our
setup is that tunl0 has the VIP and eth0 has (obviously) the backend's real
address.
In IPv6 we do the same thing, but the v6 tunnel driver didn't have this same
behavior - if you didn't have an explicit tunnel setup, it would drop the
packet.
This patch brings that v4 feature to the v6 driver.
The same IPv6 address checks are performed as with any normal tunnel,
but as the fallback tunnel endpoint addresses are unspecified, the checks
must be performed on a per-packet basis, rather than at tunnel
configuration time.
[Patch description modified by phil@ipom.com]
Signed-off-by: Ville Nuorvala <ville.nuorvala@gmail.com>
Tested-by: Phil Dibowitz <phil@ipom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The code in tcp_v6_conn_request() was implicitly assuming that
tcp_v6_send_synack() would take care of dst_release(), much as
tcp_v4_send_synack() already does. This resulted in
tcp_v6_conn_request() leaking a dst if sysctl_tw_recycle is enabled.
This commit restructures tcp_v6_send_synack() so that it accepts a dst
pointer and takes care of releasing the dst that is passed in, to plug
the leak and avoid future surprises by bringing the IPv6 behavior in
line with the IPv4 side.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With the recent change (earlier in this patch series) to set
flowi6_oif to treq->iif in inet6_csk_route_req(), the dst lookup in
these two functions is now identical, so tcp_v6_send_synack() can now
just call inet6_csk_route_req(), to reduce code duplication and keep
things closer to the IPv4 side, which is structured this way.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit changes inet6_csk_route_req() so that it uses a pointer to
a struct flowi6, rather than allocating its own on the stack. This
brings its behavior in line with its IPv4 cousin,
inet_csk_route_req(), and allows a follow-on patch to fix a dst leak.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix inet6_csk_route_req() to use as the flowi6_oif the treq->iif,
which is correctly fixed up in tcp_v6_conn_request() to handle the
case of link-local addresses. This brings it in line with the
tcp_v6_send_synack() code, which is already correctly using the
treq->iif in this way.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/caif/caif_hsi.c
drivers/net/usb/qmi_wwan.c
The qmi_wwan merge was trivial.
The caif_hsi.c, on the other hand, was not. It's a conflict between
1c385f1fdf ("caif-hsi: Replace platform
device with ops structure.") in the net-next tree and commit
39abbaef19 ("caif-hsi: Postpone init of
HSI until open()") in the net tree.
I did my best with that one and will ask Sjur to check it out.
Signed-off-by: David S. Miller <davem@davemloft.net>
dropwatch wrongly diagnoses all received UDP packets as drops.
This patch removes trace_kfree_skb() done in skb_free_datagram_locked().
Locations calling skb_free_datagram_locked() should do it on their own.
As a result, drops are accounted on the right function.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Split sysctl function into smaller chunks to clean up code and prepare
patches to reduce ifdef pollution.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
l4proto->init contains quite redundant code. We can simplify this
by adding a new parameter l3proto.
This patch prepares that code simplification.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fix to allow IPv6 packets originating locally to match rules with the "iif"
set to "lo". This allows IPv6 rule matching to work the same as it does for
IPv4. From the iproute2 man page:
iif NAME
select the incoming device to match. If the interface is loopback,
the rule only matches packets originating from this host.
This means that you may create separate routing tables for forwarded
and local packets and, hence, completely segregate them.
Signed-off-by: David McCullough <david_mccullough@mcafee.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If security_inet_conn_request() returns non-zero then TCP/IPv6 should
drop the request, just as in TCP/IPv4 and DCCP in both IPv4 and IPv6.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/usb/qmi_wwan.c
net/batman-adv/translation-table.c
net/ipv6/route.c
qmi_wwan.c resolution provided by Bjørn Mork.
batman-adv conflict is dealing merely with the changes
of global function names to have a proper subsystem
prefix.
ipv6's route.c conflict is merely two side-by-side additions
of network namespace methods.
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 2bec5a369e (ipv6: fib: fix crash when changing large fib
while dumping it) introduced ability to restart the dump at tree root,
but failed to correctly skip the already dumped entries. The code
did not match Patrick's intent.
We must skip exactly the number of already dumped entries.
Note that like other /proc/net files or netlink producers, we could
still dump some duplicate entries.
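The restart-and-skip behaviour can be sketched over a flat array. This
is a simplified userspace model under stated assumptions: the real
walker operates on a tree, and struct dump_state and dump_pass() are
hypothetical names invented here.

```c
#include <stddef.h>

/* Hypothetical walker state: how many entries earlier passes already emitted. */
struct dump_state {
    size_t dumped;
};

/*
 * Dump up to 'room' entries from 'tree' into 'out', restarting the walk
 * at the root each time and skipping exactly st->dumped entries first.
 * Returns the number of entries written in this pass.
 */
size_t dump_pass(struct dump_state *st, const int *tree, size_t n,
                 int *out, size_t room)
{
    size_t skip = st->dumped;  /* entries to pass over after the restart */
    size_t written = 0;

    for (size_t i = 0; i < n && written < room; i++) {
        if (skip > 0) {
            skip--;
            continue;
        }
        out[written++] = tree[i];
    }
    st->dumped += written;
    return written;
}
```

If the underlying structure changes between passes, skipping by count
can still repeat or miss an entry, which matches the caveat about
duplicates above.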
Reported-by: Debabrata Banerjee <dbavatar@gmail.com>
Reported-by: Josh Hunt <johunt@akamai.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Don't pretend that inet_protos[] and inet6_protos[] are hashes, they
are just straight arrays. Remove all unnecessary hash masking.
Document MAX_INET_PROTOS.
Use RAW_HTABLE_SIZE when appropriate.
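A sketch of a protocol table used as a plain array. All names here are
hypothetical, and the size of 256 is an assumption based on protocol
numbers being 8 bits wide; the sketch only shows that a direct index
needs no masking.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_PROTOS_SKETCH 256  /* assumed: one slot per 8-bit protocol number */

typedef int (*proto_handler_t)(const void *pkt);

static proto_handler_t protos_sketch[MAX_PROTOS_SKETCH];

/* Register a handler; fails if the slot is already taken. */
int proto_add(uint8_t protocol, proto_handler_t h)
{
    if (protos_sketch[protocol] != NULL)
        return -1;
    protos_sketch[protocol] = h;  /* direct index, no "& (SIZE - 1)" masking */
    return 0;
}

/* Look up the handler for a protocol number, or NULL if none is registered. */
proto_handler_t proto_lookup(uint8_t protocol)
{
    return protos_sketch[protocol];
}

/* Example handler used only to exercise the table. */
int tcp_handler_sketch(const void *pkt)
{
    (void)pkt;
    return 6;
}
```

Since a uint8_t index can never exceed 255, the array bound itself
guarantees the access is in range.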
Reported-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
/proc/net/ipv6_route reflects the contents of fib_table_hash. The proc
handler is installed in ip6_route_net_init() whereas fib_table_hash is
allocated in fib6_net_init() _after_ the proc handler has been installed.
This opens up a short time frame to access fib_table_hash with its pants
down.
Move the registration of the proc files to a later point in the init
order to avoid the race.
Tested :-)
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo says:
====================
This is the second batch of Netfilter updates for net-next. It contains the
kernel changes for the new user-space connection tracking helper
infrastructure.
More details on this infrastructure are provided here:
http://lwn.net/Articles/500196/
Still, I plan to provide some official documentation through the
conntrack-tools user manual on how to setup user-space utilities for this.
So far, it provides two helpers in user-space, one for NFSv3 and another for
Oracle/SQLnet/TNS. Yet in my TODO list.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
There are good reasons to support helpers in user-space instead:
* Rapid connection tracking helper development, as developing code
in user-space is usually faster.
* Reliability: A buggy helper does not crash the kernel. Moreover,
we can monitor the helper process and restart it in case of problems.
* Security: Avoid complex string matching and mangling in kernel-space
running in privileged mode. Going further, we can even think about
running user-space helpers as a non-root process.
* Extensibility: It allows the development of very specific helpers (most
likely non-standard proprietary protocols) that are very likely not to be
accepted for mainline inclusion in the form of kernel-space connection
tracking helpers.
This patch adds the infrastructure to allow the implementation of
user-space conntrack helpers by means of the new nfnetlink subsystem
`nfnetlink_cthelper' and the existing queueing infrastructure
(nfnetlink_queue).
I had to add the new hook NF_IP6_PRI_CONNTRACK_HELPER to register
ipv[4|6]_helper which results from splitting ipv[4|6]_confirm into
two pieces. This change is required not to break NAT sequence
adjustment and conntrack confirmation for traffic that is enqueued
to our user-space conntrack helpers.
Basic operation, in a few steps:
1) Register user-space helper by means of `nfct':
nfct helper add ftp inet tcp
[ It must be a valid existing helper supported by conntrack-tools ]
2) Add rules to enable the FTP user-space helper which is
used to track traffic going to TCP port 21.
For locally generated packets:
iptables -I OUTPUT -t raw -p tcp --dport 21 -j CT --helper ftp
For non-locally generated packets:
iptables -I PREROUTING -t raw -p tcp --dport 21 -j CT --helper ftp
3) Run the test conntrackd in helper mode (see example files under
doc/helper/conntrackd.conf):
conntrackd
4) Generate FTP traffic. If everything is OK, then conntrackd
should create expectations (you can check that with `conntrack'):
conntrack -E expect
[NEW] 301 proto=6 src=192.168.1.136 dst=130.89.148.12 sport=0 dport=54037 mask-src=255.255.255.255 mask-dst=255.255.255.255 sport=0 dport=65535 master-src=192.168.1.136 master-dst=130.89.148.12 sport=57127 dport=21 class=0 helper=ftp
[DESTROY] 301 proto=6 src=192.168.1.136 dst=130.89.148.12 sport=0 dport=54037 mask-src=255.255.255.255 mask-dst=255.255.255.255 sport=0 dport=65535 master-src=192.168.1.136 master-dst=130.89.148.12 sport=57127 dport=21 class=0 helper=ftp
This confirms that our test helper is receiving packets including the
conntrack information, and adding expectations in kernel-space.
The user-space helper can also store its private tracking information
in the conntrack structure in the kernel via the CTA_HELP_INFO. The
kernel will consider this a binary blob whose layout is unknown. This
information will be included in the information that is transferred
to user-space via glue code that integrates nfnetlink_queue and
ctnetlink.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Conflicts:
net/ipv6/route.c
Pull in 'net' again to get the revert of Thomas's change
which introduced regressions.
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 2a0c451ade.
It causes crashes, because now ip6_null_entry is used before
it is initialized.
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
net/ipv6/route.c
This deals with a merge conflict between the net-next addition of the
inetpeer network namespace ops, and Thomas Graf's bug fix in
2a0c451ade which makes sure we don't
register /proc/net/ipv6_route before it is actually safe to do so.
Signed-off-by: David S. Miller <davem@davemloft.net>
/proc/net/ipv6_route reflects the contents of fib_table_hash. The proc
handler is installed in ip6_route_net_init() whereas fib_table_hash is
allocated in fib6_net_init() _after_ the proc handler has been installed.
This opens up a short time frame to access fib_table_hash with its pants
down.
fib6_init() as a whole can't be moved to an earlier position as it also
registers the rtnetlink message handlers which should be registered at
the end. Therefore split it into fib6_init() which is run early and
fib6_init_late() to register the rtnetlink message handlers.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Reviewed-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
One tricky issue on the ipv6 side vs. ipv4 is that the ICMP callouts
to handle the error pass the 32-bit info cookie in network byte order
whereas ipv4 passes it around in host byte order.
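The byte-order difference can be made concrete with a short sketch.
The two function names are hypothetical, and the sketch assumes the
cookie carries an MTU value, as it does for packet-too-big events.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* IPv6 ICMP callouts hand over the info cookie in network byte order. */
uint32_t mtu_from_icmpv6_info(uint32_t info_be)
{
    return ntohl(info_be);
}

/* IPv4 callouts already pass the cookie in host byte order. */
uint32_t mtu_from_icmpv4_info(uint32_t info_host)
{
    return info_host;
}
```

Forgetting the ntohl() on the IPv6 side would go unnoticed on a
big-endian machine and produce a wildly wrong MTU on a little-endian one.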
Like the ipv4 side, we have two helper functions. One for when we
have a socket context and one for when we do not.
ip6ip6 tunnels are not handled here, because they handle PMTU events
by essentially relaying another ICMP packet-too-big message back to
the original sender.
This patch allows us to get rid of rt6_do_pmtu_disc(). It handles all
kinds of situations that simply cannot happen when we do the PMTU
update directly using a fully resolved route.
In fact, the "plen == 128" check in ip6_rt_update_pmtu() can very
likely be removed or changed into a BUG_ON() check. We should never
have a prefixed ipv6 route when we get there.
Another piece of strange history here is that TCP and DCCP, unlike in
ipv4, never invoke the update_pmtu() method from their ICMP error
handlers. This is incredibly astonishing since this is the context
where we have the most accurate context in which to make a PMTU
update, namely we have a fully connected socket and associated cached
socket route.
Signed-off-by: David S. Miller <davem@davemloft.net>
With ip_rt_frag_needed() removed, we have to explicitly update PMTU
information in every ICMP error handler.
Create two helper functions to facilitate this.
1) ipv4_sk_update_pmtu()
This updates the PMTU when we have a socket context to
work with.
2) ipv4_update_pmtu()
Raw version, used when no socket context is available. For this
interface, we essentially just pass in explicit arguments for
the flow identity information we would have extracted from the
socket.
And you'll notice that ipv4_sk_update_pmtu() is simply implemented
in terms of ipv4_update_pmtu()
Note that __ip_route_output_key() is used, rather than something like
ip_route_output_flow() or ip_route_output_key(). This is because we
absolutely do not want to end up with a route that does IPSEC
encapsulation and the like. Instead, we only want the route that
would get us to the node described by the outermost IP header.
Reported-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
MAINTAINERS
drivers/net/wireless/iwlwifi/pcie/trans.c
The iwlwifi conflict was resolved by keeping the code added
in 'net' that turns off the buggy chip feature.
The MAINTAINERS conflict was merely overlapping changes, one
change updated all the wireless web site URLs and the other
changed some GIT trees to be Johannes's instead of John's.
Signed-off-by: David S. Miller <davem@davemloft.net>
Add dev_loopback_xmit() in order to deduplicate functions
ip_dev_loopback_xmit() (in net/ipv4/ip_output.c) and
ip6_dev_loopback_xmit() (in net/ipv6/ip6_output.c).
I was about to reinvent the wheel when I noticed that
ip_dev_loopback_xmit() and ip6_dev_loopback_xmit() do exactly what I
need and are not IP-only functions, but they were not available to reuse
elsewhere.
ip6_dev_loopback_xmit() does not have line "skb_dst_force(skb);", but I
understand that this is harmless, and should be in dev_loopback_xmit().
Signed-off-by: Michel Machado <michel@digirati.com.br>
CC: "David S. Miller" <davem@davemloft.net>
CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
CC: James Morris <jmorris@namei.org>
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
CC: Patrick McHardy <kaber@trash.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jpirko@redhat.com>
CC: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
CC: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We handle NULL in rt{,6}_set_peer but then our caller will try to pass
that NULL pointer into inet_putpeer() which isn't ready for it.
Fix this by moving the NULL check one level up, and then remove the
now unnecessary NULL check from inetpeer_ptr_set_peer().
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We encode the pointer(s) into an unsigned long with one state bit.
The state bit is used so we can store the inetpeer tree root to use
when resolving the peer later.
Later the peer roots will be per-FIB table, and this change works to
facilitate that.
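A userspace sketch of packing a pointer and one state bit into a single
word. Everything here is hypothetical: the type and function names are
invented, uintptr_t is used in place of unsigned long for portability,
and the choice of bit 0 as the state bit is an assumption that relies
on both structs being at least 2-byte aligned.

```c
#include <stdbool.h>
#include <stdint.h>

#define BASE_BIT_SKETCH ((uintptr_t)1)  /* assumed state bit: word holds a tree root */

struct peer_sketch { int id; };
struct base_sketch { int table; };

/* One word that holds either a resolved peer or the root to resolve it from. */
typedef struct { uintptr_t v; } peer_ptr_sketch;

void ptr_set_base(peer_ptr_sketch *p, struct base_sketch *b)
{
    p->v = (uintptr_t)b | BASE_BIT_SKETCH;
}

void ptr_set_peer(peer_ptr_sketch *p, struct peer_sketch *peer)
{
    p->v = (uintptr_t)peer;
}

bool ptr_is_peer(const peer_ptr_sketch *p)
{
    return !(p->v & BASE_BIT_SKETCH);
}

struct peer_sketch *ptr_to_peer(const peer_ptr_sketch *p)
{
    return (struct peer_sketch *)p->v;
}

struct base_sketch *ptr_to_base(const peer_ptr_sketch *p)
{
    return (struct base_sketch *)(p->v & ~BASE_BIT_SKETCH);
}
```

Storing the root first and replacing it with the peer on first use
keeps the holder at one word while still recording where to look.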
Signed-off-by: David S. Miller <davem@davemloft.net>
We only need one interface for this operation, since we always know
which inetpeer root we want to flush.
Signed-off-by: David S. Miller <davem@davemloft.net>
Since it's guaranteed that we will access the inetpeer if we're trying
to do timewait recycling and TCP options were enabled on the
connection, just cache the peer in the timewait socket.
In the future, inetpeer lookups will be context dependent (per routing
realm), and this helps facilitate that as well.
Signed-off-by: David S. Miller <davem@davemloft.net>
The get_peer method TCP uses is full of special cases that make no
sense accommodating, and it also gets in the way of doing more
reasonable things here.
First of all, if the socket doesn't have a usable cached route, there
is no sense in trying to optimize timewait recycling.
Likewise for the case where we have IP options, such as SRR enabled,
that make the IP header destination address (and thus the destination
address of the route key) differ from that of the connection's
destination address.
Just return a NULL peer in these cases, and thus we're also able to
get rid of the clumsy inetpeer release logic.
Signed-off-by: David S. Miller <davem@davemloft.net>
There's a lot of places that open-code rt{,6}_get_peer() only because
they want to set 'create' to one. So add an rt{,6}_get_peer_create()
for their sake.
There were also a few spots open-coding plain rt{,6}_get_peer() and
those are transformed here as well.
Signed-off-by: David S. Miller <davem@davemloft.net>
Add struct net as a parameter of inet_getpeer_v[4,6] and
use net to replace &init_net.
Also modify the callers to provide net for inet_getpeer_v[4,6].
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC 4293 defines ipIfStatsOutOctets (similar definition for
ipSystemStatsOutOctets):
The total number of octets in IP datagrams delivered to the lower
layers for transmission. Octets from datagrams counted in
ipIfStatsOutTransmits MUST be counted here.
And ipIfStatsOutTransmits:
The total number of IP datagrams that this entity supplied to the
lower layers for transmission. This includes datagrams generated
locally and those forwarded by this entity.
Therefore, IPSTATS_MIB_OUTOCTETS must be incremented when incrementing
IPSTATS_MIB_OUTFORWDATAGRAMS.
IP_UPD_PO_STATS is not used since ipIfStatsOutRequests must not
include forwarded datagrams:
The total number of IP datagrams that local IP user-protocols
(including ICMP) supplied to IP in requests for transmission. Note
that this counter does not include any datagrams counted in
ipIfStatsOutForwDatagrams.
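The counting rules quoted above can be sketched with a small counter
block. The struct and function names are hypothetical, and the sketch
assumes that locally generated datagrams add to both the request and
octet counters.

```c
#include <stdint.h>

/* Hypothetical counter block mirroring the MIB objects named above. */
struct ip_stats_sketch {
    uint64_t out_requests;        /* ipIfStatsOutRequests */
    uint64_t out_forw_datagrams;  /* ipIfStatsOutForwDatagrams */
    uint64_t out_octets;          /* ipIfStatsOutOctets */
};

/* Locally generated datagram: counts as a request and adds its octets. */
void count_local_out(struct ip_stats_sketch *s, uint32_t len)
{
    s->out_requests++;
    s->out_octets += len;
}

/* Forwarded datagram: adds octets but must not count as a local request. */
void count_forwarded(struct ip_stats_sketch *s, uint32_t len)
{
    s->out_forw_datagrams++;
    s->out_octets += len;
}
```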
Signed-off-by: Vincent Bernat <bernat@luffy.cx>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 5339ab8b1d (ipv6: fib: Convert fib6_age() to
dst_neigh_lookup().) seems to have mistakenly inverted the
exception for cached NTF_ROUTER routes.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds namespace support for cttimeout.
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Since the sysctl data for l[3|4]proto now resides in pernet nf_proto_net,
we can now remove these unused fields from struct nf_conntrack_l[3,4]proto.
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds namespace support for IPv6 protocol tracker.
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds namespace support for ICMPv6 protocol tracker.
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch prepares the namespace support for layer 3 protocol trackers.
Basically, this modifies the following interfaces:
* nf_ct_l3proto_[un]register_sysctl.
* nf_conntrack_l3proto_[un]register.
We add a new nf_ct_l3proto_net that is used to get the pernet data of l3proto.
This adds the new struct nf_ip_net that is used to store the sysctl header
and l3proto_ipv4,l4proto_tcp(6),l4proto_udp(6),l4proto_icmp(v6). Because
protos such as tcp and tcp6 use the same data, making nf_ip_net a field
of netns_ct is the easiest way to manage it.
This patch also adds init_net to struct nf_conntrack_l3proto to initialize
the layer 3 protocol pernet data.
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch prepares the namespace support for layer 4 protocol trackers.
Basically, this modifies the following interfaces:
* nf_ct_[un]register_sysctl
* nf_conntrack_l4proto_[un]register
to include the namespace parameter. We still use init_net in this patch
to prepare the ground for follow-up patches for each layer 4 protocol
tracker.
We add a new net_id field to struct nf_conntrack_l4proto that is used
to store the pernet_operations id for each layer 4 protocol tracker.
Note that AF_INET6's protocols do not need to do sysctl compat. Thus,
we only register compat sysctl when l4proto.l3proto != AF_INET6.
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Adding casts of objects to the same type is unnecessary
and confusing for a human reader.
For example, this cast:
int y;
int *p = (int *)&y;
I used the coccinelle script below to find and remove these
unnecessary casts. I manually removed the conversions this
script produces of casts with __force and __user.
@@
type T;
T *p;
@@
- (T *)p
+ p
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_make_synack() clones the dst, and callers release it.
We can avoid two atomic operations per SYNACK if tcp_make_synack()
consumes dst instead of cloning it.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While testing how Linux behaves under a SYN flood attack on a multiqueue
device (ixgbe), I found that SYNACK messages were dropped at Qdisc level
because we send them all on a single queue.
The obvious choice is to reflect the incoming SYN packet's queue_mapping
onto the SYNACK packet.
Under stress, my machine could only send 25.000 SYNACK per second (for
200.000 incoming SYN per second). NIC : ixgbe with 16 rx/tx queues.
After patch, not a single SYNACK is dropped.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hans Schillstrom <hans.schillstrom@ericsson.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit ad0081e43a
"ipv6: Fragment locally generated tunnel-mode IPSec6 packets as needed"
the fragmentation of packets is incorrect,
because tunnel mode needs IPsec headers and trailer for all fragments,
while in transport mode it is sufficient to add the headers to the
first fragment and the trailer to the last.
So modify mtu and maxfraglen based on the IPsec mode and on whether the
fragment is the first or the last.
In my tests it works well (every fragment's size is the mtu)
and does not trigger the slow fragment path.
Changes from v1:
through optimization, mtu_prev and maxfraglen_prev can be deleted.
replace xfrm mode codes with dst_entry's new flag DST_XFRM_TUNNEL.
add function ip6_append_data_mtu to make the code clearer.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Corrects the function that determines the esp payload size. The calculations
done in esp{4,6}_get_mtu() lead to overlength frames in transport mode for
certain mtu values and suboptimal frames for others.
According to what is done, mainly in esp{,6}_output() and tcp_mtu_to_mss(),
net_header_len must be taken into account before doing the alignment
calculation.
Signed-off-by: Benjamin Poirier <bpoirier@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The following tightens the padding check from commit
c1412fce7e :
* Take into account combinations of consecutive Pad1 and PadN.
* Catch the corner case of when only padding is present in the
header, when the extension header length is 0 (i.e., 8 bytes).
In this case, the header would have exactly 6 bytes of padding:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
:  Next Header  : Hdr Ext Len=0 :                               :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
:                    Padding (Pad1 or PadN)                     :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
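The first point above, counting consecutive Pad1 and PadN options
together, can be sketched as follows. This is a simplified user-space
model and not the kernel parser: padding_runs_ok() is a hypothetical
name, and the rule that a non-padding option resets the running count is
an assumption made for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TLV_PAD1 0 /* one-byte padding option, no length field */
#define TLV_PADN 1 /* type, length, then `length` padding bytes */

/*
 * Walk the option area of an extension header (the bytes after
 * Next Header and Hdr Ext Len) and reject any run of consecutive
 * padding longer than 7 bytes. Non-padding options are skipped
 * using their length field and reset the run.
 */
static bool padding_runs_ok(const uint8_t *opts, size_t len)
{
	size_t off = 0;
	size_t padlen = 0;

	while (off < len) {
		size_t optlen;

		if (opts[off] == TLV_PAD1) {
			optlen = 1;
			padlen += optlen;
		} else {
			if (off + 2 > len)
				return false; /* truncated TLV header */
			optlen = (size_t)opts[off + 1] + 2;
			if (off + optlen > len)
				return false; /* option overruns the header */
			if (opts[off] == TLV_PADN)
				padlen += optlen;
			else
				padlen = 0;
		}
		if (padlen > 7)
			return false;
		off += optlen;
	}
	return true;
}
```

With this model a Pad1 followed by a 7-byte PadN is rejected, even
though each option on its own stays within the 7-byte limit.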
Signed-off-by: Eldad Zack <eldad@fogrefinery.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip6_frag_reasm() can use skb_try_coalesce() to build optimized skb,
reducing memory used by them (truesize), and reducing number of cache
line misses and overhead for the consumer.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixed space issues relating to operators found by
checkpatch.pl tool in net/ipv6/udp.c
Signed-off-by: Jeffrin Jose <ahiliation@yahoo.co.in>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixed a trailing white space issue found by
checkpatch.pl tool in net/ipv6/udp.c
Signed-off-by: Jeffrin Jose <ahiliation@yahoo.co.in>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the allfrag feature has been set on a host route (due to an ICMPv6
Packet Too Big received indicating an MTU of less than 1280), we hit a
very slow behavior in the TCP stack, because all big packets are dropped
and only a retransmit timer is able to push one MSS frame every 200 ms.
One way to handle this is to disable GSO on the socket the first time a
super packet is dropped. Adding a specific dst_allfrag() in the fast
path is probably overkill since the dst_allfrag() case almost never
happens.
Result on netperf TCP_STREAM, one flow :
Before : 60 kbit/sec
After : 1.6 Gbit/sec
Reported-by: Tore Anderson <tore@fud.no>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Tore Anderson <tore@fud.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mostly bool conversions, some inline removals and const additions.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Quoting Tore Anderson from :
If the allfrag feature has been set on a host route (due to an ICMPv6
Packet Too Big received indicating a MTU of less than 1280),
TCP SYN/ACK packets to that destination appears to get an incorrect
TCP checksum. This in turn means they are thrown away as invalid.
In the case of an IPv4 client behind a link with a MTU of less than
1260, accessing an IPv6 server through a stateless translator,
this means that the client can only download a single large file
from the server, because once it is in the server's routing cache
with the allfrag feature set, new TCP connections can no longer
be established.
</endquote>
It appears ip6_fragment() doesn't handle CHECKSUM_PARTIAL properly.
As network drivers are not prepared to fetch the correct transport header,
a safe fix is to call skb_checksum_help() before fragmenting the packet.
Reported-by: Tore Anderson <tore@fud.no>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Tore Anderson <tore@fud.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
csummode variable is always CHECKSUM_NONE in ip6_append_data()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6_opt_accepted() returns a bool, and can use const pointers
ipv6_addr_equal(), ipv6_addr_any(), ipv6_addr_loopback(),
ipv6_addr_orchid() return a bool.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- match() method returns a boolean
- return (A && B && C && D) -> return A && B && C && D
- fix indentation
Signed-off-by: Eric Dumazet <edumazet@google.com>
The padding destination or hop-by-hop option is called Pad1 and not Pad0.
See RFC2460 (4.2) or the IANA ipv6-parameters registry:
http://www.iana.org/assignments/ipv6-parameters/ipv6-parameters.xml
Signed-off-by: Eldad Zack <eldad@fogrefinery.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bool conversions where possible.
__inline__ -> inline
space cleanups
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Why use several macros when one will do?
Convert the multiple ND_PRINTKx macros to a single
ND_PRINTK macro. Use the new net_<level>_ratelimited
mechanism too.
Add pr_fmt with "ICMPv6: " as prefix.
Remove embedded ICMPv6 prefixes from messages.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the current debugging style and enable dynamic_debug.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add #define pr_fmt(fmt) as appropriate.
Add "IPv6: " to appropriate files.
Convert printk(KERN_<LEVEL> to pr_<level> (but not KERN_DEBUG).
Standardize on "%s: " not "%s(): " when emitting __func__.
Use "%s: ", __func__ instead of embedding function name.
Coalesce formats, align arguments.
ADDRCONF output is now prefixed with "IPv6: "
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We are going to delete the Token ring support. This removes any
special processing in the core networking for token ring, (aside
from net/tr.c itself), leaving the drivers and remaining tokenring
support present but inert.
The mass removal of the drivers and net/tr.c will be in a separate
commit, so that the history of these files that we still care
about won't have the giant deletion tied into their history.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Standardize the net core ratelimited logging functions.
Coalesce formats, align arguments.
Change a printk then vprintk sequence to use printf extension %pV.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
By making this a standalone config option (auto-selected as needed),
selecting CRYPTO from here rather than from XFRM (which is boolean)
allows the core crypto code to become a module again even when XFRM=y.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
af_inet6.c:80: ERROR: do not initialise statics to 0 or NULL
af_inet6.c:259: ERROR: spaces required around that '=' (ctx:VxV)
af_inet6.c:394: WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
af_inet6.c:412: WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
af_inet6.c:422: ERROR: do not use assignment in if condition
af_inet6.c:425: ERROR: do not use assignment in if condition
af_inet6.c:433: ERROR: do not use assignment in if condition
af_inet6.c:437: WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
af_inet6.c:446: ERROR: spaces required around that '=' (ctx:VxV)
af_inet6.c:478: WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
af_inet6.c:485: ERROR: that open brace { should be on the previous line
af_inet6.c:485: ERROR: space required before the open parenthesis '('
af_inet6.c:513: WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
af_inet6.c:629: WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
af_inet6.c:647: WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
af_inet6.c:687: WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
af_inet6.c:709: WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
af_inet6.c:1073: ERROR: space required before the open parenthesis '('
Signed-off-by: Eldad Zack <eldad@fogrefinery.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
According to the RFC4944 (Transmission of IPv6 Packets over
IEEE 802.15.4 Networks), chapter 7:
The IPv6 link-local address [RFC4291] for an IEEE 802.15.4 interface
is formed by appending the Interface Identifier, as defined above, to
the prefix FE80::/64.
 10 bits            54 bits                  64 bits
+----------+-----------------------+----------------------------+
|1111111010|         (zeros)       |    Interface Identifier    |
+----------+-----------------------+----------------------------+
This patch adds IPv6 address generation support for the 6lowpan
interfaces.
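The layout quoted above can be illustrated with a short sketch. It is
not the code added by this patch: make_link_local() is a hypothetical
helper, and the Interface Identifier is assumed to be already available
as 8 bytes.

```c
#include <stdint.h>
#include <string.h>

/*
 * Form a link-local address: the 10-bit prefix 1111111010 (fe80),
 * 54 zero bits, then the 64-bit Interface Identifier.
 */
static void make_link_local(uint8_t addr[16], const uint8_t iid[8])
{
	memset(addr, 0, 16);
	addr[0] = 0xfe;
	addr[1] = 0x80;
	memcpy(addr + 8, iid, 8);
}
```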
Signed-off-by: Alexander Smirnov <alex.bluesman.smirnov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the flags parameter to ipv6_find_hdr. These flags
allow us to:
* know if this is a fragment.
* stop at the AH header, so the information contained in that header
can be used for some specific packet handling.
This patch also adds the offset parameter for inspection of one
inner IPv6 header that is contained in error messages.
Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch removes ip_queue support, which was marked as obsolete
years ago. The nfnetlink_queue module provides a more advanced
user-space packet queueing mechanism.
This patch also removes capability code included in SELinux that
refers to ip_queue. Otherwise, we break compilation.
Several warnings have been sent regarding this to the mailing list
in the past month without anyone raising their hand to stop this
with some strong argument.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
It appears some networks play bad games with the two bits reserved for
ECN. This can trigger false congestion notifications and very slow
transfers.
Since RFC 3168 (6.1.1) forbids SYN packets to carry CT bits, we can
disable TCP ECN negotiation if it happens we receive mangled CT bits in
the SYN packet.
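The decision described above can be sketched as a small predicate. This
is a user-space illustration, not the kernel implementation: accept_ecn()
and ECN_MASK are hypothetical names, and the sketch assumes the SYN
requests ECN by setting both the ECE and CWR TCP flags.

```c
#include <stdbool.h>
#include <stdint.h>

#define ECN_MASK 0x03 /* low two bits of the IPv4 TOS / IPv6 traffic class */

/*
 * Decide whether to accept ECN for a connection request.
 * syn_ece/syn_cwr: the ECE and CWR TCP flags of the SYN (both set when
 * the peer asks for ECN). dsfield: the IP TOS or traffic class byte.
 */
static bool accept_ecn(bool syn_ece, bool syn_cwr, uint8_t dsfield)
{
	bool not_ect = (dsfield & ECN_MASK) == 0;

	return syn_ece && syn_cwr && not_ect;
}
```

A SYN that asks for ECN but arrives with either of the two ECN bits set
in its IP header is treated as mangled, and ECN is left off for that
connection.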
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Perry Lorier <perryl@google.com>
Cc: Matt Mathis <mattmathis@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Wilmer van der Gaast <wilmer@google.com>
Cc: Ankur Jain <jankur@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Dave Täht <dave.taht@bufferbloat.net>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For implementing other protocols on top of IPv6, such as L2TPv3's IP
encapsulation over ipv6, we'd like to call some IPv6 functions which
are not currently exported. This patch exports them.
Signed-off-by: Chris Elston <celston@katalix.com>
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that the semantics of udpv6_queue_rcv_skb() match IPv4's
udp_queue_rcv_skb(), introduce the UDP encap_rcv() hook for IPv6.
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to make sure that when the encap_rcv() hook is introduced it is
not called with the socket lock held, move socket locking from callers into
udpv6_queue_rcv_skb(), matching what happens in IPv4.
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the first step in reworking the IPv6 UDP code to be structured more
like the IPv4 UDP code. This patch creates __udpv6_queue_rcv_skb() with
the equivalent semantics to __udp_queue_rcv_skb(), and wires it up to the
backlog_rcv method.
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Quoting Tore Anderson from :
https://bugzilla.kernel.org/show_bug.cgi?id=42572
When RTAX_FEATURE_ALLFRAG is set on a route, the effective TCP segment
size does not take into account the size of the IPv6 Fragmentation
header that needs to be included in outbound packets, causing every
transmitted TCP segment to be fragmented across two IPv6 packets, the
latter of which will only contain 8 bytes of actual payload.
RTAX_FEATURE_ALLFRAG is typically set on a route in response to
receiving an ICMPv6 Packet Too Big message indicating a Path MTU of less
than 1280 bytes. 1280 bytes is the minimum IPv6 MTU, however ICMPv6
PTBs with MTU < 1280 are still valid, in particular when an IPv6
packet is sent to an IPv4 destination through a stateless translator.
Any ICMPv4 Need To Fragment packets originated from the IPv4 part of
the path will be translated to ICMPv6 PTB which may then indicate an
MTU of less than 1280.
The Linux kernel refuses to reduce the effective MTU to anything below
1280 bytes, instead it sets it to exactly 1280 bytes, and
RTAX_FEATURE_ALLFRAG is also set. However, the TCP segment size appears
to be set to 1240 bytes (1280 Path MTU - 40 bytes of IPv6 header),
instead of 1232 (additionally taking into account the 8 bytes required
by the IPv6 Fragmentation extension header).
This in turn results in rather inefficient transmission, as every
transmitted TCP segment now is split in two fragments containing
1232+8 bytes of payload.
After this patch, all outgoing packets that include a Fragmentation
header are "atomic" or "non-fragmented" fragments, i.e., they have both
Offset=0 and More Fragments=0.
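The arithmetic in the report can be checked with a small sketch. The
helper name tcp_segment_room() and the two constants are hypothetical
and only restate the numbers quoted above; this is not the kernel's MSS
computation.

```c
#include <stdbool.h>

#define IPV6_HDR_LEN      40 /* fixed IPv6 header */
#define IPV6_FRAG_HDR_LEN  8 /* IPv6 Fragment extension header */

/*
 * Bytes left in one packet after the IP-level headers, following the
 * arithmetic in the report: the Fragment header must be subtracted
 * as well when the allfrag feature is set on the route.
 */
static unsigned int tcp_segment_room(unsigned int mtu, bool allfrag)
{
	unsigned int room = mtu - IPV6_HDR_LEN;

	if (allfrag)
		room -= IPV6_FRAG_HDR_LEN;
	return room;
}
```

For a Path MTU of 1280 this gives 1240 without allfrag and 1232 with it,
matching the two figures in the report.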
With help from David S. Miller
Reported-by: Tore Anderson <tore@fud.no>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Tom Herbert <therbert@google.com>
Tested-by: Tore Anderson <tore@fud.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some kfree_skb() calls should be replaced by consume_skb() to avoid
drop_monitor/dropwatch false positives.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix merge between commit 3adadc08cc ("net ax25: Reorder ax25_exit to
remove races") and commit 0ca7a4c87d ("net ax25: Simplify and
cleanup the ax25 sysctl handling")
The former moved around the sysctl register/unregister calls, the
latter simply removed them.
With help from Stephen Rothwell.
Signed-off-by: David S. Miller <davem@davemloft.net>
While investigating TCP performance problems on 10Gb+ links, we found a
TCP sender was dropping a lot of incoming ACKs because of the sk_rcvbuf
limit in sk_add_backlog(), especially if the receiver doesn't use GRO/LRO
and sends one ACK every two MSS segments.
A sender usually tweaks sk_sndbuf, but sk_rcvbuf stays at its default
value (87380), allowing a too small backlog.
A TCP ACK, even being small, can consume nearly the same truesize space as
outgoing packets. Using sk_rcvbuf + sk_sndbuf as a limit makes sense and
is fast to compute.
Performance results on netperf, single flow, receiver with disabled
GRO/LRO : 7500 Mbits instead of 6050 Mbits, no more TCPBacklogDrop
increments at sender.
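The effect of the larger limit can be sketched with a simplified model.
struct sock_model and backlog_add() are hypothetical stand-ins, not the
kernel's struct sock or sk_add_backlog(), and the model tracks only the
backlog length.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the socket fields involved. */
struct sock_model {
	unsigned int rcvbuf;      /* like sk_rcvbuf */
	unsigned int sndbuf;      /* like sk_sndbuf */
	unsigned int backlog_len; /* truesize bytes already queued */
};

/* Queue `truesize` bytes unless the backlog already exceeds `limit`. */
static bool backlog_add(struct sock_model *sk, unsigned int truesize,
			unsigned int limit)
{
	if (sk->backlog_len > limit)
		return false; /* dropped */
	sk->backlog_len += truesize;
	return true;
}
```

A caller that passes sk->rcvbuf + sk->sndbuf as the limit keeps
accepting ACKs in situations where the rcvbuf-only limit would already
drop them.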
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Rick Jones <rick.jones2@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_add_backlog() & sk_rcvqueues_full() hard coded sk_rcvbuf as the
memory limit. We need to make this limit a parameter for TCP use.
No functional change expected in this patch, all callers still using the
old sk_rcvbuf limit.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Rick Jones <rick.jones2@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/ipv4/tcp_ipv4.c: In function 'tcp_v4_init_sock':
net/ipv4/tcp_ipv4.c:1891:19: warning: unused variable 'tp' [-Wunused-variable]
net/ipv6/tcp_ipv6.c: In function 'tcp_v6_init_sock':
net/ipv6/tcp_ipv6.c:1836:19: warning: unused variable 'tp' [-Wunused-variable]
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit f5fff5d forgot to fix TCP_MAXSEG behavior for IPv6 sockets, so IPv6
TCP server sockets that used TCP_MAXSEG would find that the advmss of
child sockets would be incorrect. This commit mirrors the advmss logic
from tcp_v4_syn_recv_sock in tcp_v6_syn_recv_sock. Eventually this
logic should probably be shared between IPv4 and IPv6, but this at
least fixes this issue.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit moves the (substantial) common code shared between
tcp_v4_init_sock() and tcp_v6_init_sock() to a new address-family
independent function, tcp_init_sock().
Centralizing this functionality should help avoid drift issues,
e.g. where the IPv4 side is updated without a corresponding update to
IPv6. There was already some drift: IPv4 initialized snd_cwnd to
TCP_INIT_CWND, while the IPv6 side was still initializing snd_cwnd to
2 (in this case it should not matter, since snd_cwnd is also
initialized in tcp_init_metrics(), but the general risks and
maintenance overhead remain).
When diffing the old and new code, note that new tcp_init_sock()
function uses the order of steps from the tcp_v4_init_sock()
implementation (the order is slightly different in
tcp_v6_init_sock()).
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Name them in a "backward compatible" manner, i.e. reuse or not
are still 1 and 0 respectively. The reuse value of 2 means that
the socket with it will forcibly reuse everyone else's port.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We don't use struct ctl_path anymore so delete the exported constants.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This results in code with less boiler plate that is a bit easier
to read.
Additionally, this stops us from using compatibility code in the sysctl
core, hastening the day when the compatibility code can be removed.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using an ascii path to register_net_sysctl as opposed to the slightly
awkward ctl_path allows for much simpler code.
We no longer need to malloc dev_name to keep it alive for the lifetime of
our sysctl registration; instead we can use a small temporary buffer on
the stack.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The sysctl core no longer natively understands sysctl tables
with .child entries.
Split the ipv6_table to remove the .child entries.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sysctl no longer requires explicit creation of directories. The neigh
directory is always populated with at least a default entry so this
should cause no user visible changes.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This makes it clearer which sysctls are relative to your current network
namespace.
This makes it a little less error prone by not exposing sysctls for the
initial network namespace in other namespaces.
This is the same way we handle all of our other network interfaces to
userspace and I can't honestly remember why we didn't do this for
sysctls right from the start.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
register_sysctl_rotable never caught on as an interesting way to
register sysctls. My take on the situation is that what we want are
sysctls that we can only see in the initial network namespace. What we
have implemented with register_sysctl_rotable are sysctls that we can
see in all of the network namespaces and can only change in the initial
network namespace.
That is a very silly way to go. Just register the network sysctls
in the initial network namespace and we don't have any weird special
cases to deal with.
The sysctls affected are:
/proc/sys/net/ipv4/ipfrag_secret_interval
/proc/sys/net/ipv4/ipfrag_max_dist
/proc/sys/net/ipv6/ip6frag_secret_interval
/proc/sys/net/ipv6/mld_max_msf
I really don't expect anyone will miss them if they can't read them in a
child user namespace.
CC: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we need to clone an skb, we don't drop a packet.
Call consume_skb() to not confuse dropwatch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we need to reallocate an skb, we don't drop a packet.
Call consume_skb() to not confuse dropwatch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/atheros/atlx/atl1.c
drivers/net/ethernet/atheros/atlx/atl1.h
Resolved a conflict between a DMA error bug fix and NAPI
support changes in the atl1 driver.
Signed-off-by: David S. Miller <davem@davemloft.net>
Use of "unsigned int" is preferred to bare "unsigned" in net tree.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We must try harder to get unique (addr, port) pairs when
doing port autoselection for sockets with SO_REUSEADDR
option set.
We achieve this by adding a relaxation parameter to
inet_csk_bind_conflict. When 'relax' parameter is off
we return a conflict whenever the current searched
pair (addr, port) is not unique.
This tries to address the problems reported in patch:
8d238b25b1
Revert "tcp: bind() fix when many ports are bound"
Tests were run for creating and binding(0) many sockets
on 100 IPs. The results are, on average:
* 60000 sockets, 600 ports / IP:
* 0.210 s, 620 (IP, port) duplicates without patch
* 0.219 s, no duplicates with patch
* 100000 sockets, 1000 ports / IP:
* 0.371 s, 1720 duplicates without patch
* 0.373 s, no duplicates with patch
* 200000 sockets, 2000 ports / IP:
* 0.766 s, 6900 duplicates without patch
* 0.768 s, no duplicates with patch
* 500000 sockets, 5000 ports / IP:
* 2.227 s, 41500 duplicates without patch
* 2.284 s, no duplicates with patch
Signed-off-by: Alex Copot <alex.mihai.c@gmail.com>
Signed-off-by: Daniel Baluta <dbaluta@ixiacom.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert the per-cpu statistics kept for GRE, IPIP, and SIT tunnels
to use 64 bit statistics.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If an IPv6 dst cache entry is copied from a dst generated by an ICMPv6 RA
packet, the copy is never checked for expiry because it has no RTF_EXPIRES
flag. So this dst cache entry will be used until the dst gc runs.
Change struct dst_entry: add a union containing the new pointer from and
expires.
When rt6_info.rt6i_flags has no RTF_EXPIRES flag, dst.expires is unused,
so we can use this field to point to the entry the dst cache was copied from.
dst.from is only used in IPv6.
rt6_check_expired checks whether rt6_info.dst.from is expired.
ip6_rt_copy only sets dst.from when the ort has the flags RTF_ADDRCONF
and RTF_DEFAULT, and then holds the ort.
ip6_dst_destroy releases the ort.
Add some functions to operate on the RTF_EXPIRES flag and expires (from)
together, and change the code to use these newly added functions.
Changes from v5:
modify ip6_route_add and ndisc_router_discovery to use the newly added
functions.
Only set dst.from when the ort has the flags RTF_ADDRCONF
and RTF_DEFAULT, and then hold the ort.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Added strict checking of PadN, as PadN can be used to increase header
size and thus push the protocol header into the 2nd fragment.
PadN is used to align the options within the Hop-by-Hop or
Destination Options header to 64-bit boundaries. The maximum valid
size is thus 7 bytes.
RFC 4942 recommends actively checking the "payload" itself to
ensure that it contains only zeroes.
See also RFC 4942 section 2.1.9.5.
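The two checks on a single PadN option can be sketched as below. This is
a user-space illustration rather than the kernel code: padn_ok() is a
hypothetical name, and the sketch assumes the 7-byte limit covers the
whole option (type, length and data bytes).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Check one PadN option. `opt` points at the option type byte and
 * `avail` is the number of bytes left in the header. The whole option
 * (type + length + data) must not exceed 7 bytes, and every data byte
 * must be zero.
 */
static bool padn_ok(const uint8_t *opt, size_t avail)
{
	size_t optlen, i;

	if (avail < 2 || opt[0] != 1) /* 1 is the PadN option type */
		return false;
	optlen = (size_t)opt[1] + 2;
	if (optlen > avail || optlen > 7)
		return false;
	for (i = 2; i < optlen; i++)
		if (opt[i] != 0)
			return false;
	return true;
}
```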
Signed-off-by: Eldad Zack <eldad@fogrefinery.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix bluetooth userland regression reported by Keith Packard, from
Gustavo Padovan.
2) Revert ath9k PS idle change, from Sujith Manoharan.
3) Correct default TCP memory limits (again), from Eric Dumazet.
4) Fix tcp_rcv_rtt_update() accidental use of unscaled RTT, from Neal
Cardwell.
5) We made a facility for layers like wireless to say how much tailroom
they need in the SKB for link layer stuff such as wireless
encryption etc., but TCP works hard to fill every SKB out to the end
defeating this specification.
This leads to every TCP packet getting reallocated by the wireless
code in order to have the right amount of tailroom available.
Fix TCP to only fill SKBs out to the real amount of data area it
asked for during the allocation, this way it won't eat into the
slack added for the device's tailroom needs.
Reported by Marc Merlin and fixed by Eric Dumazet.
6) Leaks, endian bugs, and new device IDs in bluetooth from Santosh
Nayak, João Paulo Rechi Vita, Cho, Yu-Chen, Andrei Emeltchenko,
AceLan Kao, and Andrei Emeltchenko.
7) OOPS on tty_close fix in bluetooth's hci_ldisc from Johan Hovold.
8) netfilter erroneously scales TCP window twice, fix from Changli Gao.
9) Memleak fix in wext-core from Julia Lawall.
10) Consistently handle invalid TCP packets in ipv4 vs. ipv6 conntrack,
from Jozsef Kadlecsik.
11) Validate IP header length properly in netfilter conntrack's
ipv4_get_l4proto().
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (39 commits)
NFC: Fix the LLCP Tx fragmentation loop
rtlwifi: Add missing DMA buffer unmapping for PCI drivers
rtlwifi: Preallocate USB read buffers and eliminate kalloc in read routine
tcp: avoid order-1 allocations on wifi and tx path
net: allow pskb_expand_head() to get maximum tailroom
bridge: Do not send queries on multicast group leaves
MAINTAINERS: Mark NATSEMI driver as orphan'd.
tcp: fix tcp_rcv_rtt_update() use of an unscaled RTT sample
tcp: restore correct limit
Revert "ath9k: fix going to full-sleep on PS idle"
rt2x00: Fix rfkill_polling register function.
bcma: fix build error on MIPS; implicit pcibios_enable_device
netfilter: nf_conntrack: fix incorrect logic in nf_conntrack_init_net
netfilter: nf_ct_ipv4: packets with wrong ihl are invalid
netfilter: nf_ct_ipv4: handle invalid IPv4 and IPv6 packets consistently
net/wireless/wext-core.c: add missing kfree
rtlwifi: Fix oops on rate-control failure
mac80211: Convert WARN_ON to WARN_ON_ONCE
rtlwifi: rtl8192de: Fix firmware initialization
nl80211: ensure interface is up in various APIs
...
extern int sysctl_mld_max_msf is already defined in linux/ipv6.h.
Signed-off-by: Eldad Zack <eldad@fogrefinery.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As specified in RFC6106, DNSSL option contains one or more domain names
of DNS suffixes. 8-bit identifier of the DNSSL option type as assigned
by the IANA is 31. This option should also be treated as userland.
Signed-off-by: Alexey I. Froloff <raorn@raorn.name>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge tag 'dmaengine-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine
Pull dmaengine fixes from Dan Williams:
1/ regression fix for Xen as it now trips over a broken assumption
about the dma address size on 32-bit builds
2/ new quirk for netdma to ignore dma channels that cannot meet
netdma alignment requirements
3/ fixes for two long standing issues in ioatdma (ring size overflow)
and iop-adma (potential stack corruption)
* tag 'dmaengine-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine:
netdma: adding alignment check for NETDMA ops
ioatdma: DMA copy alignment needed to address IOAT DMA silicon errata
ioat: ring size variables need to be 32bit to avoid overflow
iop-adma: Corrected array overflow in RAID6 Xscale(R) test.
ioat: fix size of 'completion' for Xen
We may hit this in xt_LOG:
net/built-in.o:xt_LOG.c:function dump_ipv6_packet:
error: undefined reference to 'ip6t_ext_hdr'
happens with these config options:
CONFIG_NETFILTER_XT_TARGET_LOG=y
CONFIG_IP6_NF_IPTABLES=m
ip6t_ext_hdr is fairly small and it is called in the packet path.
Make it static inline.
Reported-by: Simon Kirby <sim@netnation.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
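As a rough illustration of why a tiny predicate like ip6t_ext_hdr is a
good fit for static inline, here is a userspace sketch. The sk_ names
and the exact set of next-header values checked are assumptions of this
sketch, not a copy of the kernel helper. Because the body lives in the
header, a built-in caller no longer needs a symbol exported by a module.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* IANA next-header values; the SK_ names are local to this sketch. */
enum {
	SK_NEXTHDR_HOP      = 0,	/* hop-by-hop options */
	SK_NEXTHDR_ROUTING  = 43,
	SK_NEXTHDR_FRAGMENT = 44,
	SK_NEXTHDR_ESP      = 50,
	SK_NEXTHDR_AUTH     = 51,
	SK_NEXTHDR_NONE     = 59,	/* no next header */
	SK_NEXTHDR_DEST     = 60,	/* destination options */
};

/* Small enough to inline, so every caller gets its own copy of the test. */
static inline bool sk_is_ext_hdr(uint8_t nexthdr)
{
	return nexthdr == SK_NEXTHDR_HOP ||
	       nexthdr == SK_NEXTHDR_ROUTING ||
	       nexthdr == SK_NEXTHDR_FRAGMENT ||
	       nexthdr == SK_NEXTHDR_ESP ||
	       nexthdr == SK_NEXTHDR_AUTH ||
	       nexthdr == SK_NEXTHDR_NONE ||
	       nexthdr == SK_NEXTHDR_DEST;
}
```

For example, sk_is_ext_hdr(44) is true (fragment header) while
sk_is_ext_hdr(6) is false (TCP).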
This is the fallout from adding the memcpy alignment workaround for certain
IOATDMA hardware. NetDMA will only use a DMA engine that can handle
byte-aligned ops.
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The conditional which decides to skip inactive filters does not
change with the loop index, so it is unnecessary to
check it on every iteration.
Signed-off-by: RongQing.Li <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert the array index from the loop bound to the loop index.
Also remove the void cast on the return value of ip6_mc_del1_src().
The cast is unnecessary: ip6_mc_del1_src() is not marked with
__must_check or a similar attribute, so no compiler will report a
warning once it is removed.
v2: enrich the commit header
Signed-off-by: RongQing.Li <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In 72331bc [ipv6: Fix RTM_GETROUTE's interpretation of RTA_IIF to be
consistent with ipv4] the code of 'inet6_rtm_getroute()' was re-ordered
such that the reference to 'rt->dst' is incremented prior to skb
allocation.
Hence, if 'alloc_skb()' fails, we must drop a reference from 'rt->dst'.
Add the missing 'dst_release()' call.
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
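The error path described above (reference taken first, allocation
second) can be sketched in userspace as follows. struct sk_dst, the
sk_ helpers and the simulate_alloc_failure switch are hypothetical
stand-ins for this sketch only; the point is just that the failure
branch has to undo the hold.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a refcounted route entry. */
struct sk_dst {
	int refcnt;
};

static void sk_dst_hold(struct sk_dst *dst)
{
	dst->refcnt++;
}

static void sk_dst_release(struct sk_dst *dst)
{
	assert(dst->refcnt > 0);
	dst->refcnt--;
}

/*
 * Take a reference, then allocate a reply buffer.
 * Returns 0 on success (the caller then owns *buf and the reference),
 * or -1 on allocation failure, with the reference already dropped.
 */
int sk_getroute(struct sk_dst *dst, size_t buf_len,
		int simulate_alloc_failure, void **buf)
{
	sk_dst_hold(dst);

	*buf = simulate_alloc_failure ? NULL : malloc(buf_len);
	if (!*buf) {
		sk_dst_release(dst);	/* the release that was missing */
		return -1;
	}
	return 0;
}
```

Without the sk_dst_release() call in the failure branch, every failed
allocation would leave the count one too high.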
Pull networking fixes from David Miller:
1) Provide device string properly for USB i2400m wimax devices, also
don't OOPS when providing firmware string. From Phil Sutter.
2) Add support for sh_eth SH7734 chips, from Nobuhiro Iwamatsu.
3) Add another device ID to USB zaurus driver, from Guan Xin.
4) Loop index start in pool vector iterator is wrong causing MAC to not
get configured in bnx2x driver, fix from Dmitry Kravkov.
5) EQL driver assumes HZ=100, fix from Eric Dumazet.
6) Now that skb_add_rx_frag() can specify the truesize increment
separately, do so in f_phonet and cdc_phonet, also from Eric
Dumazet.
7) virtio_net accidentally uses net_ratelimit() not only on the kernel
warning but also the statistic bump, fix from Rick Jones.
8) ip_route_input_mc() uses fixed init_net namespace, oops, use
dev_net(dev) instead. Fix from Benjamin LaHaise.
9) dev_forward_skb() needs to clear the incoming interface index of the
SKB so that it looks like a new incoming packet, also from Benjamin
LaHaise.
10) iwlwifi mistakenly initializes a channel entry as 2GHZ instead of
5GHZ, fix from Stanislav Yakovlev.
11) Missing kmalloc() return value checks in orinoco, from Santosh
Nayak.
12) ath9k doesn't check for HT capabilities in the right way, it is
checking ht_supported instead of the ATH9K_HW_CAP_HT flag. Fix from
Sujith Manoharan.
13) Fix x86 BPF JIT emission of 16-bit immediate field of AND
instructions, from Feiran Zhuang.
14) Avoid infinite loop in GARP code when registering sysfs entries.
From David Ward.
15) rose protocol uses memcpy instead of memcmp in a device address
comparison, oops. Fix from Daniel Borkmann.
16) Fix build of lpc_eth due to dev_hw_addr_random() interface being
renamed to eth_hw_addr_random(). From Roland Stigge.
17) Make ipv6 RTM_GETROUTE interpret RTA_IIF attribute the same way
that ipv4 does. Fix from Shmulik Ladkani.
18) via-rhine has an inverted bit test, causing suspend/resume
regressions. Fix from Andreas Mohr.
19) RIONET assumes 4K page size, fix from Akinobu Mita.
20) Initialization of imask register in sky2 is buggy, because bits are
"or'd" into an uninitialized local variable. Fix from Lino
Sanfilippo.
21) Fix FCOE checksum offload handling, from Yi Zou.
22) Fix VLAN processing regression in e1000, from Jiri Pirko.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (52 commits)
sky2: dont overwrite settings for PHY Quick link
tg3: Fix 5717 serdes powerdown problem
net: usb: cdc_eem: fix mtu
net: sh_eth: fix endian check for architecture independent
usb/rtl8150 : Remove duplicated definitions
rionet: fix page allocation order of rionet_active
via-rhine: fix wait-bit inversion.
ipv6: Fix RTM_GETROUTE's interpretation of RTA_IIF to be consistent with ipv4
net: lpc_eth: Fix rename of dev_hw_addr_random
net/netfilter/nfnetlink_acct.c: use linux/atomic.h
rose_dev: fix memcpy-bug in rose_set_mac_address
Fix non TBI PHY access; a bad merge undid bug fix in a previous commit.
net/garp: avoid infinite loop if attribute already exists
x86 bpf_jit: fix a bug in emitting the 16-bit immediate operand of AND
bonding: emit event when bonding changes MAC
mac80211: fix oper channel timestamp updation
ath9k: Use HW HT capabilites properly
MAINTAINERS: adding maintainer for ipw2x00
net: orinoco: add error handling for failed kmalloc().
net/wireless: ipw2x00: fix a typo in wiphy struct initilization
...
net/ipv6/addrconf.c:340: WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
net/ipv6/addrconf.c:342: ERROR: "foo * bar" should be "foo *bar"
net/ipv6/addrconf.c:444: ERROR: "foo * bar" should be "foo *bar"
net/ipv6/addrconf.c:1337: WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
net/ipv6/addrconf.c:1526: ERROR: "(foo*)" should be "(foo *)"
net/ipv6/addrconf.c:1671: ERROR: open brace '{' following function declarations go on the next line
net/ipv6/addrconf.c:1914: ERROR: "foo * bar" should be "foo *bar"
net/ipv6/addrconf.c:2368: ERROR: "foo * bar" should be "foo *bar"
net/ipv6/addrconf.c:2370: ERROR: "foo * bar" should be "foo *bar"
net/ipv6/addrconf.c:2416: ERROR: "foo * bar" should be "foo *bar"
net/ipv6/addrconf.c:2437: ERROR: "foo * bar" should be "foo *bar"
net/ipv6/addrconf.c:2573: ERROR: "foo * bar" should be "foo *bar"
net/ipv6/addrconf.c:3797: ERROR: "foo* bar" should be "foo *bar"
Signed-off-by: Eldad Zack <eldad@fogrefinery.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
icmp.c:501: ERROR: "(foo*)" should be "(foo *)"
icmp.c:582: ERROR: "(foo*)" should be "(foo *)"
icmp.c:954: WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
Signed-off-by: Eldad Zack <eldad@fogrefinery.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fib6_rules.c:26: ERROR: open brace '{' following struct go on the same line
Signed-off-by: Eldad Zack <eldad@fogrefinery.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
exthdrs_core.c:113: WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
exthdrs_core.c:114: WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
Signed-off-by: Eldad Zack <eldad@fogrefinery.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
exthdrs.c:726: WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
exthdrs.c:741: ERROR: "(foo*)" should be "(foo *)"
exthdrs.c:741: ERROR: "(foo*)" should be "(foo *)"
exthdrs.c:744: ERROR: "(foo**)" should be "(foo **)"
exthdrs.c:746: ERROR: "(foo**)" should be "(foo **)"
exthdrs.c:748: ERROR: "(foo**)" should be "(foo **)"
exthdrs.c:750: ERROR: "(foo**)" should be "(foo **)"
exthdrs.c:755: WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
exthdrs.c:896: WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
Signed-off-by: Eldad Zack <eldad@fogrefinery.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
datagram.c:101: ERROR: "(foo*)" should be "(foo *)"
datagram.c:521: ERROR: space required before the open parenthesis '('
datagram.c:830: WARNING: braces {} are not necessary for single statement blocks
datagram.c:849: WARNING: braces {} are not necessary for single statement blocks
Signed-off-by: Eldad Zack <eldad@fogrefinery.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
addrconf_core.c:13: ERROR: space required before the open parenthesis '('
Signed-off-by: Eldad Zack <eldad@fogrefinery.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sit.c:118: ERROR: "foo * bar" should be "foo *bar"
sit.c:694: ERROR: "(foo*)" should be "(foo *)"
sit.c:724: ERROR: "(foo*)" should be "(foo *)"
Signed-off-by: Eldad Zack <eldad@fogrefinery.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These macros contain a hidden goto, and are thus extremely error
prone and make code hard to audit.
Signed-off-by: David S. Miller <davem@davemloft.net>
These macros contain a hidden goto, and are thus extremely error
prone and make code hard to audit.
Signed-off-by: David S. Miller <davem@davemloft.net>
In IPv4, if an RTA_IIF attribute is specified within an RTM_GETROUTE
message, then a route is searched as if a packet was received on the
specified 'iif' interface.
However in IPv6, RTA_IIF is not interpreted in the same way:
'inet6_rtm_getroute()' always calls 'ip6_route_output()', regardless of the
RTA_IIF attribute.
As a result, in IPv6 there's no way to use RTM_GETROUTE in order to look
for a route as if a packet was received on a specific interface.
Fix 'inet6_rtm_getroute()' so that RTA_IIF is interpreted as "lookup a
route as if a packet was received on the specified interface", similar
to IPv4's 'inet_rtm_getroute()' interpretation.
Reported-by: Ami Koren <amikoren@yahoo.com>
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge tag 'split-asm_system_h-for-linus-20120328' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-asm_system
Pull "Disintegrate and delete asm/system.h" from David Howells:
"Here are a bunch of patches to disintegrate asm/system.h into a set of
separate bits to relieve the problem of circular inclusion
dependencies.
I've built all the working defconfigs from all the arches that I can
and made sure that they don't break.
The reason for these patches is that I recently encountered a circular
dependency problem that came about when I produced some patches to
optimise get_order() by rewriting it to use ilog2().
This uses bitops - and on the SH arch asm/bitops.h drags in
asm-generic/get_order.h by a circuitous route involving asm/system.h.
The main difficulty seems to be asm/system.h. It holds a number of
low level bits with no/few dependencies that are commonly used (eg.
memory barriers) and a number of bits with more dependencies that
aren't used in many places (eg. switch_to()).
These patches break asm/system.h up into the following core pieces:
(1) asm/barrier.h
Move memory barriers here. This is already done for MIPS and Alpha.
(2) asm/switch_to.h
Move switch_to() and related stuff here.
(3) asm/exec.h
Move arch_align_stack() here. Other process execution related bits
could perhaps go here from asm/processor.h.
(4) asm/cmpxchg.h
Move xchg() and cmpxchg() here as they're full word atomic ops and
frequently used by atomic_xchg() and atomic_cmpxchg().
(5) asm/bug.h
Move die() and related bits.
(6) asm/auxvec.h
Move AT_VECTOR_SIZE_ARCH here.
Other arch headers are created as needed on a per-arch basis."
Fixed up some conflicts from other header file cleanups and moving code
around that has happened in the meantime, so David's testing is somewhat
weakened by that. We'll find out anything that got broken and fix it.
* tag 'split-asm_system_h-for-linus-20120328' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-asm_system: (38 commits)
Delete all instances of asm/system.h
Remove all #inclusions of asm/system.h
Add #includes needed to permit the removal of asm/system.h
Move all declarations of free_initmem() to linux/mm.h
Disintegrate asm/system.h for OpenRISC
Split arch_align_stack() out from asm-generic/system.h
Split the switch_to() wrapper out of asm-generic/system.h
Move the asm-generic/system.h xchg() implementation to asm-generic/cmpxchg.h
Create asm-generic/barrier.h
Make asm-generic/cmpxchg.h #include asm-generic/cmpxchg-local.h
Disintegrate asm/system.h for Xtensa
Disintegrate asm/system.h for Unicore32 [based on ver #3, changed by gxt]
Disintegrate asm/system.h for Tile
Disintegrate asm/system.h for Sparc
Disintegrate asm/system.h for SH
Disintegrate asm/system.h for Score
Disintegrate asm/system.h for S390
Disintegrate asm/system.h for PowerPC
Disintegrate asm/system.h for PA-RISC
Disintegrate asm/system.h for MN10300
...
Remove all #inclusions of asm/system.h preparatory to splitting and killing
it. Performed with the following command:
perl -p -i -e 's!^#\s*include\s*<asm/system[.]h>.*\n!!' `grep -Irl '^#\s*include\s*<asm/system[.]h>' *`
Signed-off-by: David Howells <dhowells@redhat.com>
Commit f2c31e32b3 (net: fix NULL dereferences in check_peer_redir() )
added a regression in rt6_fill_node(), leading to rcu_read_lock()
imbalance.
That's because NLA_PUT() can jump to the nla_put_failure label.
Fix this by using nla_put().
Many thanks to Ben Greear for his help
Reported-by: Ben Greear <greearb@candelatech.com>
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
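To make the hidden-goto hazard concrete, here is a userspace sketch. It
is not the rt6_fill_node() code: sk_lock_depth models rcu_read_lock()
nesting, and SK_PUT/sk_put are hypothetical stand-ins for NLA_PUT() and
nla_put(). The buggy variant leaves the lock count unbalanced on
failure because the macro jumps over the unlock.

```c
#include <assert.h>

static int sk_lock_depth;	/* models rcu_read_lock() nesting */

static void sk_read_lock(void)
{
	sk_lock_depth++;
}

static void sk_read_unlock(void)
{
	assert(sk_lock_depth > 0);
	sk_lock_depth--;
}

/* Hypothetical put helper: fails when the message has no room left. */
static int sk_put(int room)
{
	return room > 0 ? 0 : -1;
}

/* Macro in the style of NLA_PUT(): the goto is invisible at the call site. */
#define SK_PUT(room) \
	do { if (sk_put(room) < 0) goto put_failure; } while (0)

/* Buggy: on failure the macro jumps past sk_read_unlock(). */
int sk_fill_buggy(int room)
{
	sk_read_lock();
	SK_PUT(room);
	sk_read_unlock();
	return 0;

put_failure:
	return -1;
}

/* Fixed: call the function directly and unlock on every path. */
int sk_fill_fixed(int room)
{
	int err;

	sk_read_lock();
	err = sk_put(room);
	sk_read_unlock();
	return err;
}
```

After sk_fill_buggy(0) the depth counter stays at 1, while
sk_fill_fixed(0) returns the same error with the counter back at 0.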
It used to be an int, and it got changed to a bool parameter at least
7 years ago. It happens that NF_ACCEPT and NF_DROP are 0 and 1, so
this works, but it's unclear, and the check that it's in range is not
required.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit 299b0767 (ipv6: Fix IPsec slowpath fragmentation problem),
in ip6_append_data(), after the call to skb_put(skb, fraglen + dst_exthdrlen),
skb->len includes dst_exthdrlen, and we never subtract dst_exthdrlen again.
This makes fraggap > 0 in the next iteration of the while loop and makes
the size of the skb incorrect.
Fix this by reserving headroom for dst_exthdrlen.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip6_mc_find_dev_rcu() is called with rcu_read_lock() held, so there is
no need to call dev_hold().
Calling dev_hold() without a corresponding dev_put() leads to a leak.
[ bug introduced in 96b52e61be (ipv6: mcast: RCU conversions) ]
Signed-off-by: RongQing.Li <roy.qing.li@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 87a115783 ( ipv6: Move xfrm_lookup() call down into
icmp6_dst_alloc().) forgot to convert one error path, leading
to crashes in mld_sendpack()
Many thanks to Dave Jones for providing a very complete bug report.
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With commit d6ddef9e641d (IPv6: Fix not join all-router mcast group
when forwarding set.) I check 'dev' after its dereference, which
leads to a Smatch complaint:
net/ipv6/addrconf.c:438 ipv6_add_dev()
warn: variable dereferenced before check 'dev' (see line 432)
net/ipv6/addrconf.c
431 /* protected by rtnl_lock */
432 rcu_assign_pointer(dev->ip6_ptr, ndev);
                       ^^^^^^^^^^^^
Old dereference.
433
434 /* Join all-node multicast group */
435 ipv6_dev_mc_inc(dev, &in6addr_linklocal_allnodes);
436
437 /* Join all-router multicast group if forwarding is set */
438 if (ndev->cnf.forwarding && dev && (dev->flags & IFF_MULTICAST))
                                ^^^
Remove the check to avoid the complaint as 'dev' can't be NULL.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Li Wei <lw@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the infrastructure to add fine timeout tuning
over nfnetlink. Now you can use the NFNL_SUBSYS_CTNETLINK_TIMEOUT
subsystem to create/delete/dump timeout objects that contain some
specific timeout policy for one flow.
The follow-up patches will allow you to attach a timeout policy object
to a conntrack via the CT target and the conntrack extension
infrastructure.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch defines a new interface for l4 protocol trackers:
unsigned int *(*get_timeouts)(struct net *net);
that is used to return the array of unsigned int that contains
the timeouts that will be applied for this flow. This is passed
to the l4proto->new(...) and l4proto->packet(...) functions to
specify the timeout policy.
This interface allows per-net global timeout configuration
(although only DCCP supports this for now) and it will allow
custom timeout configuration by means of follow-up
patches.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
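The shape of the get_timeouts interface described above can be sketched
in userspace like this. Only the get_timeouts signature comes from the
commit text; struct sk_net, the two-state enum, sk_packet() and
sk_track() are simplified assumptions made for the sketch.

```c
#include <assert.h>
#include <stddef.h>

enum { SK_STATE_NEW, SK_STATE_ESTABLISHED, SK_STATE_MAX };

/* Hypothetical per-namespace state holding one timeout per state. */
struct sk_net {
	unsigned int timeouts[SK_STATE_MAX];	/* seconds */
};

struct sk_l4proto {
	/* Returns the timeout array to apply to this flow. */
	unsigned int *(*get_timeouts)(struct sk_net *net);
	/* Per-packet handler: gets the policy passed in, reads no global. */
	unsigned int (*packet)(int state, const unsigned int *timeouts);
};

static unsigned int *sk_get_timeouts(struct sk_net *net)
{
	return net->timeouts;
}

static unsigned int sk_packet(int state, const unsigned int *timeouts)
{
	assert(state >= 0 && state < SK_STATE_MAX);
	return timeouts[state];
}

const struct sk_l4proto sk_proto = {
	.get_timeouts = sk_get_timeouts,
	.packet       = sk_packet,
};

/* Core path: fetch the policy once, then hand it to the tracker. */
unsigned int sk_track(const struct sk_l4proto *p, struct sk_net *net,
		      int state)
{
	unsigned int *timeouts = p->get_timeouts(net);

	return p->packet(state, timeouts);
}
```

Because the array is fetched through a callback, a different policy
can later be substituted without changing the packet handler.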
ipt_LOG and ip6_LOG have a lot of common code, merge them
to reduce duplicate code.
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When forwarding is set and a new net device is registered,
we need to add this device to the all-router mcast group.
Signed-off-by: Li Wei <lw@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/sfc/rx.c
Overlapping changes in drivers/net/ethernet/sfc/rx.c, one to change
the rx_buf->is_page boolean into a set of u16 flags, and another to
adjust how ->ip_summed is initialized.
Signed-off-by: David S. Miller <davem@davemloft.net>
Niccolo Belli reported ipsec crashes in case we handle a frame without
a mac header (ATM in his case).
Before copying mac header, better make sure it is present.
Bugzilla reference: https://bugzilla.kernel.org/show_bug.cgi?id=42809
Reported-by: Niccolò Belli <darkbasic@linuxsystems.it>
Tested-by: Niccolò Belli <darkbasic@linuxsystems.it>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip6_route_output() never returns NULL, so it is wrong to
check if the return value is NULL.
Signed-off-by: RongQing.Li <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This one is only considered for MSG_PEEK flag and the value pointed by
it specifies where to start peeking bytes from. If the offset happens to
point into the middle of the returned skb, the offset within this skb is
put back to this very argument.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
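The in/out behaviour of the peek offset argument described above can be
sketched as follows. This is a userspace model under assumptions, not
the kernel's datagram code: struct sk_buf and sk_peek_at() are
hypothetical, and the queue is a plain array instead of an skb list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical queued buffer; only its length matters here. */
struct sk_buf {
	size_t len;
};

/*
 * Find the buffer that contains byte *off of the queue.
 * On success, returns the buffer index and rewrites *off to the offset
 * inside that buffer. Returns -1 and leaves *off alone if the offset
 * lies at or past the end of the queued data.
 */
int sk_peek_at(const struct sk_buf *queue, size_t n, size_t *off)
{
	size_t rem = *off;

	for (size_t i = 0; i < n; i++) {
		if (rem < queue[i].len) {
			*off = rem;	/* offset within this buffer */
			return (int)i;
		}
		rem -= queue[i].len;	/* skip this buffer entirely */
	}
	return -1;
}
```

With buffers of 100, 50 and 200 bytes, a peek offset of 120 selects the
second buffer and is written back as 20.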
Currently, it is not easily possible to get TOS/DSCP value of packets from
an incoming TCP stream. The mechanism is there, IP_PKTOPTIONS getsockopt
with IP_RECVTOS set, the same way as incoming TTL can be queried. This is
not actually implemented for TOS, though.
This patch adds this functionality, both for IPv4 (IP_PKTOPTIONS) and IPv6
(IPV6_2292PKTOPTIONS). For IPv4, like in the IP_RECVTTL case, the value of
the TOS field is stored from the other party's ACK.
This is needed for proxies which require DSCP transparency. One such example
is at http://zph.bratcheda.org/.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement helper inline function to get traffic class from IPv6 header.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
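For reference, extracting the traffic class amounts to a shift and a
mask over the first bytes of the header: 4 bits of version, 8 bits of
traffic class, then 20 bits of flow label. The sketch below works on
raw bytes in network order; sk_ipv6_tclass() is a hypothetical name and
not the helper this commit adds.

```c
#include <assert.h>
#include <stdint.h>

/*
 * hdr points at the first 4 bytes of an IPv6 header as they appear on
 * the wire. The traffic class straddles the first two bytes.
 */
uint8_t sk_ipv6_tclass(const uint8_t hdr[4])
{
	assert((hdr[0] >> 4) == 6);	/* version nibble must be 6 */
	return (uint8_t)(((hdr[0] & 0x0f) << 4) | (hdr[1] >> 4));
}
```

For a header starting with the bytes 0x6b 0x80, this yields 0xb8, which
is DSCP 46 once the two low ECN bits are shifted out.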
The IPV6_UNICAST_IF feature is the IPv6 complement to IP_UNICAST_IF.
Signed-off-by: Erich E. Hoover <ehoover@mines.edu>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It went from unused, to commented out, and never changing after
that.
Just get rid of it, if someone wants it they can unearth it from
the history.
Signed-off-by: David S. Miller <davem@davemloft.net>
The TCP RST mechanism is broken with TCP MD5 (RFC 2385). When a
connection is gone, its md5 key is lost, and a RST sent
without an md5 hash will be ignored by the peer. This can
be a problem, since RST helps protocols like BGP recover
quickly from a peer crash.
In most cases, users of TCP MD5, such as BGP and LDP,
have a listener on both sides to accept connections from the peer.
The md5 keys for peers are saved in the listening socket.
There are two cases for finding the md5 key when the connection is
lost:
1. Passive receive RST: The message is sent to a well-known port,
so TCP associates it with the listener. The md5 key is taken from
the listener.
2. Active receive RST (no sock): The message is sent to the active
side, and there is no socket associated with the message. In this
case, find the listener from the source port, then find the md5 key
from the listener.
We are not losing security here:
the packet is checked against the md5 hash. No RST is generated
if the md5 hash doesn't match or no md5 key can be found.
Signed-off-by: Shawn Lu <shawn.lu@ericsson.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We don't check for NULL consistently in __xfrm6_output(). If "x" were
NULL here it would lead to an OOPs later. I asked Steffen Klassert
about this and he suggested that we remove the NULL check.
On 10/29/11, Steffen Klassert <steffen.klassert@secunet.com> wrote:
>> net/ipv6/xfrm6_output.c
>> 148
>> 149 if ((x && x->props.mode == XFRM_MODE_TUNNEL) &&
>> ^
>
> x can't be null here. It would be a bug if __xfrm6_output() is called
> without a xfrm_state attached to the skb. I think we can just remove
> this null check.
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes sure we use appropriate memory barriers before
publishing tp->md5sig_info, allowing tcp_md5_do_lookup() to be used from
tcp_v4_send_reset() without holding the socket lock (upcoming patch from
Shawn Lu).
Note we also need to respect an RCU grace period before freeing it, since
we can free the socket without this grace period thanks to
SLAB_DESTROY_BY_RCU.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Shawn Lu <shawn.lu@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to be able to support proper RST messages for TCP MD5 flows, we
need to allow access to MD5 keys without locking listener socket.
This conversion is a nice cleanup, and shrinks size of timewait sockets
by 80 bytes.
IPv6 code reuses generic code found in IPv4 instead of duplicating it.
Control path uses GFP_KERNEL allocations instead of GFP_ATOMIC.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Shawn Lu <shawn.lu@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We no longer use md5_add() method from struct tcp_sock_af_ops
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC5722 Section 4 was amended by Errata 3089
Our implementation did the right thing anyway...
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's only used to get at neigh->primary_key, which in this context is
always going to be the same as rt->rt6i_gateway.
Signed-off-by: David S. Miller <davem@davemloft.net>
In this specific situation we know we are dealing with a gatewayed route
and therefore rt6i_gateway is not going to be in6addr_any even in future
interpretations.
Signed-off-by: David S. Miller <davem@davemloft.net>
Now all code paths grab a local reference to the neigh, so if neigh
is not NULL we unconditionally release it at the end. The old logic
would only release if we didn't have a non-NULL 'rt'.
Signed-off-by: David S. Miller <davem@davemloft.net>
The only semantic difference is that we now hold a reference to the
neighbour and thus have to release it.
Signed-off-by: David S. Miller <davem@davemloft.net>
In the future the ipv4/ipv6 route gateway will take on two types
of values:
1) INADDR_ANY/IN6ADDR_ANY, for local network routes, and in this case
the neighbour must be obtained using the destination address in
ipv4/ipv6 header as the lookup key.
2) Everything else, the actual nexthop route address.
So if the gateway is not inaddr-any we use it, otherwise we must use
the packet's destination address.
Signed-off-by: David S. Miller <davem@davemloft.net>
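The selection rule above fits in a few lines. The following userspace
sketch uses a hypothetical struct sk_addr6 in place of the kernel's
address types and only shows the choice of lookup key, not the
neighbour lookup itself.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 128-bit address type for this sketch. */
struct sk_addr6 {
	uint8_t b[16];
};

static int sk_addr_any(const struct sk_addr6 *a)
{
	static const struct sk_addr6 any = { { 0 } };	/* the :: address */

	return memcmp(a, &any, sizeof(any)) == 0;
}

/*
 * Pick the neighbour lookup key: the gateway if the route has one,
 * otherwise the destination address taken from the packet header.
 */
const struct sk_addr6 *sk_nexthop(const struct sk_addr6 *gateway,
				  const struct sk_addr6 *daddr)
{
	assert(gateway && daddr);
	return sk_addr_any(gateway) ? daddr : gateway;
}
```

A route whose gateway is :: therefore resolves the destination
directly, while any other gateway value is used as the key.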
md5 key is added in socket through remote address.
remote address should be used in finding md5 key when
sending out reset packet.
Signed-off-by: shawnlu <shawn.lu@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a race condition in addrconf_sysctl_forward() and
addrconf_sysctl_disable().
These functions change idev->cnf.forwarding (resp. idev->cnf.disable_ipv6)
and then try to grab the rtnl lock before performing any actions.
If that fails they restore the original value and restart the syscall.
This creates race conditions if ipv6 code tries to access
these parameters, or if multiple instances try to do the same operation.
As an example of the former, if __ipv6_ifa_notify() finds a 0 in
idev->cnf.forwarding when invoked by addrconf_ifdown() it may not free
anycast addresses, ultimately resulting in the net_device not being freed.
This patch reads the user parameters into a temporary location and only
writes the actual parameters when the rtnl lock is acquired.
Tested in 2.6.38.8.
Signed-off-by: Francesco Ruggeri <fruggeri@aristanetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
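The fix described above boils down to "parse into a temporary, commit
only under the lock". Here is a single-threaded userspace sketch of
that shape. struct sk_state, the lock_busy flag standing in for the
rtnl lock, and the SK_RESTART return code are all assumptions of the
sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical shared state; lock_busy models the rtnl lock being held. */
struct sk_state {
	bool lock_busy;
	int forwarding;
};

enum { SK_OK = 0, SK_RESTART = -1 };

static bool sk_trylock(struct sk_state *s)
{
	if (s->lock_busy)
		return false;
	s->lock_busy = true;
	return true;
}

static void sk_unlock(struct sk_state *s)
{
	s->lock_busy = false;
}

/*
 * The user's value is kept in a local until the lock is held, so a
 * failed trylock never exposes a value that did not take effect.
 */
int sk_set_forwarding(struct sk_state *s, int user_val)
{
	int val = user_val;	/* temporary copy of the user's input */

	if (!sk_trylock(s))
		return SK_RESTART;	/* shared state untouched */

	s->forwarding = val;
	sk_unlock(s);
	return SK_OK;
}
```

The racy variant would instead write s->forwarding first and restore
the old value when the trylock fails, leaving a window in which other
code can observe the transient value.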
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (47 commits)
tg3: Fix single-vector MSI-X code
openvswitch: Fix multipart datapath dumps.
ipv6: fix per device IP snmp counters
inetpeer: initialize ->redirect_genid in inet_getpeer()
net: fix NULL-deref in WARN() in skb_gso_segment()
net: WARN if skb_checksum_help() is called on skb requiring segmentation
caif: Remove bad WARN_ON in caif_dev
caif: Fix typo in Vendor/Product-ID for CAIF modems
bnx2x: Disable AN KR work-around for BCM57810
bnx2x: Remove AutoGrEEEn for BCM84833
bnx2x: Remove 100Mb force speed for BCM84833
bnx2x: Fix PFC setting on BCM57840
bnx2x: Fix Super-Isolate mode for BCM84833
net: fix some sparse errors
net: kill duplicate included header
net: sh-eth: Fix build error by the value which is not defined
net: Use device model to get driver name in skb_gso_segment()
bridge: BH already disabled in br_fdb_cleanup()
net: move sock_update_memcg outside of CONFIG_INET
mwl8k: Fixing Sparse ENDIAN CHECK warning
...
In commit 4ce3c183fc (snmp: 64bit ipstats_mib for all arches), I forgot
to change the /proc/net/dev_snmp6/xxx output for IP counters.
percpu array is 64bit per counter but the folding still used the 'long'
variant, and output garbage on 32bit arches.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
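The garbage comes from walking an array of 64-bit counters with a
32-bit stride. The userspace sketch below models that with uint32_t
standing in for a 32-bit 'unsigned long'; struct sk_mib and both fold
functions are hypothetical names, not the kernel's snmp helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SK_NCOUNTERS 2

/* One CPU's slice of the MIB: every counter is 64 bits wide. */
struct sk_mib {
	uint64_t c[SK_NCOUNTERS];
};

/*
 * Buggy fold: steps through the counters with a 32-bit stride, the way
 * an 'unsigned long' based helper does on a 32-bit architecture.
 */
uint32_t sk_fold_long(const struct sk_mib *percpu, size_t ncpu, size_t idx)
{
	uint32_t sum = 0;

	assert(idx < SK_NCOUNTERS);
	for (size_t cpu = 0; cpu < ncpu; cpu++) {
		uint32_t v;

		memcpy(&v, (const unsigned char *)percpu[cpu].c +
			   idx * sizeof(v), sizeof(v));
		sum += v;
	}
	return sum;
}

/* Correct fold: index and accumulate in 64 bits. */
uint64_t sk_fold_u64(const struct sk_mib *percpu, size_t ncpu, size_t idx)
{
	uint64_t sum = 0;

	assert(idx < SK_NCOUNTERS);
	for (size_t cpu = 0; cpu < ncpu; cpu++)
		sum += percpu[cpu].c[idx];
	return sum;
}
```

For counter index 1 the buggy fold reads half of counter 0 instead, so
its result depends on byte order and never matches the true sum here.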
make C=2 CF="-D__CHECK_ENDIAN__" M=net
And fix flowi4_init_output() prototype for sport
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'for-linus' of git://selinuxproject.org/~jmorris/linux-security:
capabilities: remove __cap_full_set definition
security: remove the security_netlink_recv hook as it is equivalent to capable()
ptrace: do not audit capability check when outputing /proc/pid/stat
capabilities: remove task_ns_* functions
capabitlies: ns_capable can use the cap helpers rather than lsm call
capabilities: style only - move capable below ns_capable
capabilites: introduce new has_ns_capabilities_noaudit
capabilities: call has_ns_capability from has_capability
capabilities: remove all _real_ interfaces
capabilities: introduce security_capable_noaudit
capabilities: reverse arguments to security_capable
capabilities: remove the task from capable LSM hook entirely
selinux: sparse fix: fix several warnings in the security server cod
selinux: sparse fix: fix warnings in netlink code
selinux: sparse fix: eliminate warnings for selinuxfs
selinux: sparse fix: declare selinux_disable() in security.h
selinux: sparse fix: move selinux_complete_init
selinux: sparse fix: make selinux_secmark_refcount static
SELinux: Fix RCU deref check warning in sel_netport_insert()
Manually fix up a semantic mis-merge wrt security_netlink_recv():
- the interface was removed in commit fd77846152 ("security: remove
the security_netlink_recv hook as it is equivalent to capable()")
- a new user of it appeared in commit a38f7907b9 ("crypto: Add
userspace configuration API")
causing no automatic merge conflict, but Eric Paris pointed out the
issue.
release idev when ip6_neigh_lookup failed in icmp6_dst_alloc
Signed-off-by: RongQing.Li <roy.qing.li@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit a9b3cd7f32 (rcu: convert uses of rcu_assign_pointer(x, NULL) to
RCU_INIT_POINTER) did a lot of incorrect changes, since it did a
complete conversion of rcu_assign_pointer(x, y) to RCU_INIT_POINTER(x,
y).
We miss needed barriers, even on x86, when y is not NULL.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Once upon a time netlink was not synchronous and we had to get the effective
capabilities from the skb that was being received. Today we instead get
the capabilities from the current task. This has rendered the entire
purpose of the hook moot as it is now functionally equivalent to the
capable() call.
Signed-off-by: Eric Paris <eparis@redhat.com>
This ensures a linear behaviour when filling /proc/net/if_inet6 thus making
ifconfig run really fast on IPv6 only addresses. In fact, with this patch and
the IPv4 one sent a while ago, ifconfig will run in linear time regardless of
address type.
IPv4 related patch: f04565ddf5
dev: use name hash for dev_seq_ops
...
Some statistics (running ifconfig > /dev/null on a different setup):
iface count | IPv6 no-patch time | IPv6 patched time | IPv4 time
------------+--------------------+-------------------+----------
       6250 |             0.23 s |            0.13 s |    0.11 s
      12500 |             0.62 s |            0.28 s |    0.22 s
      25000 |             2.91 s |            0.57 s |    0.46 s
      50000 |            11.37 s |            1.21 s |    0.94 s
     128000 |            86.78 s |            3.05 s |    2.54 s
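
The scaling difference can be read off these figures directly. The short Python sketch below only recomputes ratios from the timings quoted above; the function and variable names are illustrative and are not part of the patch.

```python
# Timings copied from the table above:
# (interface count, IPv6 unpatched seconds, IPv6 patched seconds)
timings = [
    (6250, 0.23, 0.13),
    (12500, 0.62, 0.28),
    (25000, 2.91, 0.57),
    (50000, 11.37, 1.21),
    (128000, 86.78, 3.05),
]


def speedups(rows):
    """Return {interface_count: unpatched_time / patched_time}."""
    return {count: before / after for count, before, after in rows}


def growth(rows, column):
    """Ratio of the last timing to the first one in the given column (1 or 2)."""
    return rows[-1][column] / rows[0][column]


if __name__ == "__main__":
    for count, ratio in speedups(timings).items():
        print(f"{count:>7} interfaces: {ratio:.1f}x faster with the patch")
    print(f"unpatched time grew {growth(timings, 1):.0f}x")
    print(f"patched time grew   {growth(timings, 2):.0f}x")
```

The interface count grows by roughly 20x between the first and last rows. The patched time grows by a similar factor, while the unpatched time grows far faster, which is consistent with the linear versus worse-than-linear behaviour described above.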
Signed-off-by: Mihai Maruseac <mmaruseac@ixiacom.com>
Cc: Daniel Baluta <dbaluta@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Recently Dave noticed that a test we did in ipv6_add_addr to see if we have a
next hop route for the interface we're adding an address to was wrong (see commit
7ffbcecbee). For one, it never triggers, and two,
it was completely wrong to begin with. This test was meant to cover this
section of RFC 4429:
3.3 Modifications to RFC 2462 Stateless Address Autoconfiguration
* (modifies section 5.5) A host MAY choose to configure a new address
as an Optimistic Address. A host that does not know the SLLAO
of its router SHOULD NOT configure a new address as Optimistic.
A router SHOULD NOT configure an Optimistic Address.
This patch should bring us into proper compliance with the above clause. Since
we only add a SLAAC address after we've received a RA which may or may not
contain a source link layer address option, we can pass a pointer to that option
to addrconf_prefix_rcv (which may be null if the option is not present), and
only set the optimistic flag if the option was found in the RA.
Change notes:
(v2) modified the new parameter to addrconf_prefix_rcv to be a bool rather than
a pointer to make its use more clear as per request from davem.
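
The rule described above can be sketched as a small decision function. This is a toy Python model, not the kernel code: `RouterAdvert`, `should_be_optimistic` and the `is_router` flag are hypothetical names used only to illustrate the quoted RFC 4429 clause.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RouterAdvert:
    """Hypothetical stand-in for a parsed RA; not a kernel structure."""
    prefix: str
    source_lladdr: Optional[bytes] = None  # SLLAO payload, None if absent


def should_be_optimistic(ra, optimistic_dad_enabled, is_router):
    """Decide whether a new SLAAC address may be marked Optimistic.

    Mirrors the quoted clause: a router should not use Optimistic
    Addresses, and a host should not unless it knows the router's SLLAO.
    """
    sllao_present = ra.source_lladdr is not None  # the bool from the v2 note
    return optimistic_dad_enabled and not is_router and sllao_present
```

For example, an RA carrying a source link layer address option allows the flag on a host, while the same RA without the option does not.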
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
During some debugging I needed to look into how /proc/net/ipv6_route
operated and in my digging I found it's calling fib6_clean_all() which uses
"write_lock_bh(&table->tb6_lock)" before doing the walk of the table. I
found this on 2.6.32, but reading the code I believe the same basic idea
exists currently. The rtnetlink code, by contrast, only calls
"read_lock_bh(&table->tb6_lock);" via fib6_dump_table(). While I realize
reading from proc isn't the recommended way of fetching the ipv6 route
table, taking a write lock seems unnecessary and would probably cause
network performance issues.
To verify this I loaded up the ipv6 route table and then ran iperf in 3
cases:
* doing nothing
* reading ipv6 route table via proc
(while :; do cat /proc/net/ipv6_route > /dev/null; done)
* reading ipv6 route table via rtnetlink
(while :; do ip -6 route show table all > /dev/null; done)
* Load the ipv6 route table up with:
* for ((i = 0;i < 4000;i++)); do ip route add unreachable 2000::$i; done
* iperf commands:
* client: iperf -i 1 -V -c <ipv6 addr>
* server: iperf -V -s
* iperf results - 3 runs each (in Mbits/sec)
* nothing: client: 927,927,927 server: 927,927,927
* proc: client: 179,97,96,113 server: 142,112,133
* iproute: client: 928,927,928 server: 927,927,927
lock_stat shows taking the write lock is causing the slowdown. Using this
info I decided to write a version of fib6_clean_all() which replaces
write_lock_bh(&table->tb6_lock) with read_lock_bh(&table->tb6_lock). With
this new function I see the same results as with my rtnetlink iperf test.
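
To put the slowdown in numbers, the following Python sketch averages the iperf figures quoted above and expresses each case relative to the idle baseline. The data is copied verbatim from the results list; the helper names are illustrative.

```python
from statistics import mean

# iperf results in Mbits/sec, copied from the list above.
results = {
    "nothing": {"client": [927, 927, 927], "server": [927, 927, 927]},
    "proc": {"client": [179, 97, 96, 113], "server": [142, 112, 133]},
    "iproute": {"client": [928, 927, 928], "server": [927, 927, 927]},
}


def mean_throughput(case, side):
    """Average throughput for one test case and one side of the connection."""
    return mean(results[case][side])


def relative_to_baseline(case, side):
    """Throughput of `case` as a fraction of the 'nothing' baseline."""
    return mean_throughput(case, side) / mean_throughput("nothing", side)


if __name__ == "__main__":
    for case in results:
        for side in ("client", "server"):
            pct = 100 * relative_to_baseline(case, side)
            print(f"{case:>8} {side}: {pct:.0f}% of baseline")
```

With these figures, reading through proc leaves well under a fifth of the baseline throughput, while the rtnetlink dump is indistinguishable from doing nothing.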
Signed-off-by: Josh Hunt <joshhunt00@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In some of the rt6_bind_neighbour() call sites, it hasn't hooked
up the rt->dst.dev pointer yet, so we'd deref a NULL pointer when
obtaining dev->ifindex for the neighbour hash function computation.
Just pass the netdevice explicitly in to fix this problem.
Reported-by: Bjarke Istrup Pedersen <gurligebis@gentoo.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
It just obscures that the netdevice pointer and the expires value are
implemented in the dst_entry sub-object of the ipv6 route.
And it makes grepping for dst_entry member uses much harder too.
Signed-off-by: David S. Miller <davem@davemloft.net>
Also, create and use an rt6_bind_neighbour() in net/ipv6/route.c to
consolidate some common logic.
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to perform a proper universal hash on a vector of integers,
we have to use different universal hashes on each vector element.
Which means we need 4 different hash randoms for ipv6.
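
A minimal Python sketch of the idea, assuming a simple multiply-and-add scheme with one random key per 32-bit word. It is not the kernel's hash function; `make_hash_randoms`, `ipv6_words` and `vector_hash` are hypothetical names.

```python
import secrets

MASK32 = 0xFFFFFFFF


def make_hash_randoms(n=4):
    """One independent random 32-bit key per vector element (forced odd)."""
    return [secrets.randbits(32) | 1 for _ in range(n)]


def ipv6_words(addr_bytes):
    """Split a 16-byte IPv6 address into four 32-bit big-endian words."""
    if len(addr_bytes) != 16:
        raise ValueError("expected a 16-byte address")
    return [int.from_bytes(addr_bytes[i:i + 4], "big") for i in range(0, 16, 4)]


def vector_hash(words, randoms, bits=8):
    """Multiply each word by its own key, sum modulo 2**32, keep the top bits."""
    if len(words) != len(randoms):
        raise ValueError("need exactly one key per word")
    acc = 0
    for word, key in zip(words, randoms):
        acc = (acc + word * key) & MASK32
    return acc >> (32 - bits)
```

Reusing a single key for every element would make the result insensitive to the order of the words, since the sum then factors into key * (w0 + w1 + w2 + w3). Distinct keys avoid that, which is why an IPv6 address needs four of them.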
Signed-off-by: David S. Miller <davem@davemloft.net>
The route we have here is for the address being added to the interface,
ie. for input packet processing.
Therefore using that route to determine whether an output nexthop gateway
is known and resolved doesn't make any sense.
So, simply remove this test, it never triggered anyways.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-By: Neil Horman <nhorman@tuxdriver.com>
RDBG() wasn't even used, and the messages printed by RT6_DEBUG() were
far from useful. Just get rid of all this stuff, we can replace it
with something more suitable if we want.
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
net/bluetooth/l2cap_core.c
Just two overlapping changes, one added an initialization of
a local variable, and another change added a new local variable.
Signed-off-by: David S. Miller <davem@davemloft.net>
module_param(bool) used to counter-intuitively take an int. In
fddd5201 (mid-2009) we allowed bool or int/unsigned int using a messy
trick.
It's time to remove the int/unsigned int option. For this version
it'll simply give a warning, but it'll break next kernel version.
(Thanks to Joe Perches for suggesting coccinelle for 0/1 -> true/false).
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit 8e2ec63917 ("ipv6: don't
use inetpeer to store metrics for routes.") the test in rt6_alloc_cow()
for setting the ANYCAST flag is now wrong.
'rt' will always now have a plen of 128, because it is set explicitly
to 128 by ip6_rt_copy.
So to restore the semantics of the test, check the destination prefix
length of 'ort'.
Signed-off-by: David S. Miller <davem@davemloft.net>
Don't just succeed with a route that has a NULL neighbour attached.
This follows the behavior of addrconf_dst_alloc().
Allowing this kind of route to end up with a NULL neigh attached will
result in packet drops on output until the route is somehow
invalidated, since nothing will meanwhile try to lookup the neigh
again.
A statistic is bumped for the case where we see a neigh-less route on
output, but the resulting packet drop is otherwise silent in nature,
and frankly it's a hard error for this to happen and ipv6 should do
what ipv4 does which is say something in the kernel logs.
Signed-off-by: David S. Miller <davem@davemloft.net>
This is not merged with the ipv4 match into xt_rpfilter.c
to avoid ipv6 module dependency issues.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch allows each namespace to independently set up
its levels for tcp memory pressure thresholds. This patch
alone does not buy much: we need to make these values
per group of processes somehow. This is achieved in the
patches that follow in this patchset.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: David S. Miller <davem@davemloft.net>
CC: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces memory pressure controls for the tcp
protocol. It uses the generic socket memory pressure code
introduced in earlier patches, and fills in the
necessary data in cg_proto struct.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch replaces all uses of struct sock fields' memory_pressure,
memory_allocated, sockets_allocated, and sysctl_mem with accessor
macros. Those macros can either receive a socket argument, or a mem_cgroup
argument, depending on the context they live in.
Since we're only doing a macro wrapping here, no performance impact at all is
expected in the case where we don't have cgroups disabled.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: David S. Miller <davem@davemloft.net>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Same fix as 731abb9cb2 for ipip and sit tunnel.
Commit 1c5cae815d removed an explicit call to dev_alloc_name in
ipip_tunnel_locate and ipip6_tunnel_locate, because register_netdevice
will now create a valid name, however the tunnel keeps a copy of the
name in the private parms structure. Fix this by copying the name back
after register_netdevice has successfully returned.
This shows up if you do a simple tunnel add, followed by a tunnel show:
$ sudo ip tunnel add mode ipip remote 10.2.20.211
$ ip tunnel
tunl0: ip/ip remote any local any ttl inherit nopmtudisc
tunl%d: ip/ip remote 10.2.20.211 local any ttl inherit
$ sudo ip tunnel add mode sit remote 10.2.20.212
$ ip tunnel
sit0: ipv6/ip remote any local any ttl 64 nopmtudisc 6rd-prefix 2002::/16
sit%d: ioctl 89f8 failed: No such device
sit%d: ipv6/ip remote 10.2.20.212 local any ttl inherit
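
The stale "%d" name can be reproduced with a toy Python model. Everything here is hypothetical (`Device`, `toy_register`, `tunnel_create`); it only mimics the name-template expansion and the copy back into the private parms that the fix adds.

```python
class Device:
    """Toy network device that only carries a name."""

    def __init__(self, name):
        self.name = name


def toy_register(dev, existing_names):
    """Expand a '%d' name template to the first free index, then register."""
    if "%d" in dev.name:
        index = 0
        while dev.name % index in existing_names:
            index += 1
        dev.name = dev.name % index
    existing_names.add(dev.name)


def tunnel_create(parms, existing_names, copy_back=True):
    """Create a tunnel device from parms; optionally copy the final name back."""
    dev = Device(parms["name"])
    toy_register(dev, existing_names)
    if copy_back:
        # The fix: keep the private copy in sync with the registered name.
        parms["name"] = dev.name
    return dev
```

With `copy_back=False` the device gets a valid name but `parms["name"]` still reads "tunl%d", which matches the broken `ip tunnel` output shown above.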
Cc: stable@vger.kernel.org
Signed-off-by: Ted Feng <artisdom@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no obvious reason to add a default multicast route for loopback
devices. Doing so creates a route entry whose dst.error is set to
-ENETUNREACH, which would block all multicast packets.
====================
[ more detailed explanation ]
The problem is that the resulting routing table depends on the order in
which interfaces are initialized, and in some situations that would block all
multicast packets. Suppose there are two interfaces on my computer
(lo and eth0). If we initialize 'lo' before 'eth0', the resulting routing
table (for multicast) would be
# ip -6 route show | grep ff00::
unreachable ff00::/8 dev lo metric 256 error -101
ff00::/8 dev eth0 metric 256
When sending multicast packets, the routing subsystem will return the first
route entry, which has its error set to -101 (ENETUNREACH).
I know the kernel will set the default ipv6 address for 'lo' when it is up
and won't set the default multicast route for it, but there is no reason to
stop 'init' program from setting address for 'lo', and that is exactly what
systemd did.
I am sure there is something wrong with either the kernel or systemd; currently
I lean toward the kernel being the cause of this problem.
====================
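
The order dependence described above can be illustrated with a toy first-match lookup in Python. This is a deliberately simplified model and not how the kernel's FIB lookup is implemented; the route dictionaries and `lookup` are hypothetical.

```python
import ipaddress

ENETUNREACH = 101


def lookup(routes, dst):
    """Return the first route whose prefix contains dst, or None."""
    addr = ipaddress.ip_address(dst)
    for route in routes:
        if addr in ipaddress.ip_network(route["prefix"]):
            return route
    return None


# The two multicast routes from the 'ip -6 route show' output above.
lo_route = {"prefix": "ff00::/8", "dev": "lo", "error": -ENETUNREACH}
eth0_route = {"prefix": "ff00::/8", "dev": "eth0", "error": 0}

if __name__ == "__main__":
    # 'lo' initialized first: the unreachable entry shadows eth0.
    print(lookup([lo_route, eth0_route], "ff02::1"))
    # Without the loopback multicast route, eth0 is used.
    print(lookup([eth0_route], "ff02::1"))
```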
Signed-off-by: Li Wei <lw@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The UDP diag get_exact handler will require them to find a
socket by provided net, [sd]addr-s, [sd]ports and device.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To reflect the fact that a reference is not obtained to the
resulting neighbour entry.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Roland Dreier <roland@purestorage.com>
Like rt6_lookup, but allows caller to pass in flowi6 structure.
Will be used by the upcoming ipv6 netfilter reverse path filter
match.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
It's only used in net/ipv6/route.c and the NULL device check is
superfluous for all of the existing call sites.
Just expand the __ndisc_lookup_errno() call at each location.
Signed-off-by: David S. Miller <davem@davemloft.net>
1) x == NULL --> !x
2) x != NULL --> x
3) (x&BIT) --> (x & BIT)
4) (BIT1|BIT2) --> (BIT1 | BIT2)
5) proper argument and struct member alignment
Signed-off-by: David S. Miller <davem@davemloft.net>
While parsing through IPv6 extension headers, fragment headers are
skipped making them invisible to the caller. This reports the
fragment offset of the last header in order to make it possible to
determine whether the packet is fragmented and, if so, whether it is
a first or last fragment.
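
A minimal Python sketch of such a header walk, assuming the standard IPv6 extension header layout (next header byte, then length byte in 8-octet units, and an 8-byte fragment header whose 16-bit field holds a 13-bit offset and the M flag). `skip_exthdrs` and `fragment_info` are hypothetical names, not the kernel functions.

```python
import struct

NEXTHDR_HOP, NEXTHDR_ROUTING, NEXTHDR_FRAGMENT, NEXTHDR_DEST = 0, 43, 44, 60
EXT_HEADERS = {NEXTHDR_HOP, NEXTHDR_ROUTING, NEXTHDR_DEST}


def skip_exthdrs(payload, nexthdr):
    """Walk extension headers in the bytes following the fixed IPv6 header.

    Returns (nexthdr, offset, frag_field). frag_field is the raw 16-bit
    offset/flags field of the last fragment header seen, or None.
    """
    offset = 0
    frag_field = None
    while nexthdr in EXT_HEADERS or nexthdr == NEXTHDR_FRAGMENT:
        if offset + 8 > len(payload):
            raise ValueError("truncated extension header")
        if nexthdr == NEXTHDR_FRAGMENT:
            following, _rsvd, frag_field, _ident = struct.unpack_from(
                "!BBHI", payload, offset)
            if frag_field & 0xFFF8:
                break  # non-first fragment: no further headers to parse
            hdrlen = 8
        else:
            following, ext_len = struct.unpack_from("!BB", payload, offset)
            hdrlen = (ext_len + 1) * 8
        nexthdr = following
        offset += hdrlen
    return nexthdr, offset, frag_field


def fragment_info(frag_field):
    """Interpret the value reported by skip_exthdrs()."""
    if frag_field is None:
        return {"fragmented": False, "first": False, "last": False}
    return {
        "fragmented": True,
        "first": (frag_field & 0xFFF8) == 0,   # fragment offset is zero
        "last": (frag_field & 0x0001) == 0,    # M (more fragments) is clear
    }
```

A caller that only received the final next-header value and offset could not tell a first fragment from an unfragmented packet; returning the fragment field alongside them makes that distinction possible.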
Signed-off-by: Jesse Gross <jesse@nicira.com>
This reverts commit 81d54ec847.
If we take the "try_again" goto, due to a checksum error,
the 'len' has already been truncated. So we won't compute
the same values as the original code did.
Reported-by: paul bilke <fsmail@conspiracy.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no need to use the 'delta' flag when adding a single source to
the interface filter source list.
Signed-off-by: Jun Zhao <mypopydev@gmail.com>
Signed-off-by: David S. Miller <davem@drr.davemloft.net>
Igor Maravic reported an error caused by jump_label_dec() being called
from IRQ context:
BUG: sleeping function called from invalid context at kernel/mutex.c:271
in_atomic(): 1, irqs_disabled(): 0, pid: 0, name: swapper
1 lock held by swapper/0:
#0: (&n->timer){+.-...}, at: [<ffffffff8107ce90>] call_timer_fn+0x0/0x340
Pid: 0, comm: swapper Not tainted 3.2.0-rc2-net-next-mpls+ #1
Call Trace:
<IRQ> [<ffffffff8104f417>] __might_sleep+0x137/0x1f0
[<ffffffff816b9a2f>] mutex_lock_nested+0x2f/0x370
[<ffffffff810a89fd>] ? trace_hardirqs_off+0xd/0x10
[<ffffffff8109a37f>] ? local_clock+0x6f/0x80
[<ffffffff810a90a5>] ? lock_release_holdtime.part.22+0x15/0x1a0
[<ffffffff81557929>] ? sock_def_write_space+0x59/0x160
[<ffffffff815e936e>] ? arp_error_report+0x3e/0x90
[<ffffffff810969cd>] atomic_dec_and_mutex_lock+0x5d/0x80
[<ffffffff8112fc1d>] jump_label_dec+0x1d/0x50
[<ffffffff81566525>] net_disable_timestamp+0x15/0x20
[<ffffffff81557a75>] sock_disable_timestamp+0x45/0x50
[<ffffffff81557b00>] __sk_free+0x80/0x200
[<ffffffff815578d0>] ? sk_send_sigurg+0x70/0x70
[<ffffffff815e936e>] ? arp_error_report+0x3e/0x90
[<ffffffff81557cba>] sock_wfree+0x3a/0x70
[<ffffffff8155c2b0>] skb_release_head_state+0x70/0x120
[<ffffffff8155c0b6>] __kfree_skb+0x16/0x30
[<ffffffff8155c119>] kfree_skb+0x49/0x170
[<ffffffff815e936e>] arp_error_report+0x3e/0x90
[<ffffffff81575bd9>] neigh_invalidate+0x89/0xc0
[<ffffffff81578dbe>] neigh_timer_handler+0x9e/0x2a0
[<ffffffff81578d20>] ? neigh_update+0x640/0x640
[<ffffffff81073558>] __do_softirq+0xc8/0x3a0
Since jump_label_{inc|dec} must be called from process context only,
we must defer jump_label_dec() if net_disable_timestamp() is called
from interrupt context.
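
The deferral pattern can be sketched in Python as follows. This is a toy model under stated assumptions, not the kernel implementation: `DeferredKey` stands in for the reference-counted jump label, and the `in_interrupt` argument replaces the kernel's own context check.

```python
class DeferredKey:
    """Toy reference count whose decrement may not run in interrupt context."""

    def __init__(self):
        self.count = 0      # models the jump label reference count
        self.deferred = 0   # decrements requested from interrupt context

    def flush(self):
        """Apply pending decrements; only call this from process context."""
        self.count -= self.deferred
        self.deferred = 0

    def inc(self):
        """Enable path, assumed to run in process context."""
        self.flush()
        self.count += 1

    def dec(self, in_interrupt):
        """Disable path: record the request if we are not allowed to sleep."""
        if in_interrupt:
            self.deferred += 1
            return
        self.flush()
        self.count -= 1
```

A decrement requested from the timer softirq in the trace above would only bump `deferred`; the real count catches up the next time the key is touched from process context.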
Reported-by: Igor Maravic <igorm@etf.rs>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to set np->mcast_hops to its default value at this moment,
otherwise when we use it and find its value is -1, the logic to
get the default hop limit doesn't take multicast into account and will
return the wrong hop limit (IPV6_DEFAULT_HOPLIMIT), which is for unicast.
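
A small Python sketch of the failure mode, assuming a fallback that ignores the destination type. `hop_limit` is a hypothetical helper, and the constant values (64 for unicast, 1 for multicast) are assumptions based on the usual kernel defaults rather than something stated in this message.

```python
import ipaddress

IPV6_DEFAULT_HOPLIMIT = 64   # assumed unicast default
IPV6_DEFAULT_MCASTHOPS = 1   # assumed multicast default


def hop_limit(dst, mcast_hops, ucast_hops=-1):
    """Pick the hop limit for dst; -1 means 'not set, use the default'."""
    if ipaddress.IPv6Address(dst).is_multicast:
        hops = mcast_hops
    else:
        hops = ucast_hops
    if hops < 0:
        # The fallback is not multicast-aware: it always yields the
        # unicast default, which is the bug described above.
        hops = IPV6_DEFAULT_HOPLIMIT
    return hops
```

Initializing `mcast_hops` to `IPV6_DEFAULT_MCASTHOPS` instead of leaving it at -1 means the fallback is never reached for multicast destinations.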
Signed-off-by: Li Wei <lw@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>