For mangling IPv6 packets the protocol header offset needs to be known
by the NAT packet mangling functions. Add a so far unused protoff argument
and convert the conntrack and NAT helpers to use it in preparation of
IPv6 NAT.
Signed-off-by: Patrick McHardy <kaber@trash.net>
IPv4 conntrack defragments incoming packet at the PRE_ROUTING hook and
(in case of forwarded packets) refragments them at POST_ROUTING
independent of the IP_DF flag. Refragmentation uses the dst_mtu() of
the local route without caring about the original fragment sizes,
thereby breaking PMTUD.
This patch fixes this by keeping track of the largest received fragment
with IP_DF set and generates an ICMP fragmentation required error during
refragmentation if that size exceeds the MTU.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: David S. Miller <davem@davemloft.net>
Change since v1:
* Fixed inuse counters access spotted by Eric
In patch eea68e2f (packet: Report socket mclist info via diag module) I've
introduced a "scheduling in atomic" problem in packet diag module -- the
socket list is traversed under rcu_read_lock() while performed under it sk
mclist access requires rtnl lock (i.e. -- mutex) to be taken.
[152363.820563] BUG: scheduling while atomic: crtools/12517/0x10000002
[152363.820573] 4 locks held by crtools/12517:
[152363.820581] #0: (sock_diag_mutex){+.+.+.}, at: [<ffffffff81a2dcb5>] sock_diag_rcv+0x1f/0x3e
[152363.820613] #1: (sock_diag_table_mutex){+.+.+.}, at: [<ffffffff81a2de70>] sock_diag_rcv_msg+0xdb/0x11a
[152363.820644] #2: (nlk->cb_mutex){+.+.+.}, at: [<ffffffff81a67d01>] netlink_dump+0x23/0x1ab
[152363.820693] #3: (rcu_read_lock){.+.+..}, at: [<ffffffff81b6a049>] packet_diag_dump+0x0/0x1af
Similar thing was then re-introduced by further packet diag patches (fanount
mutex and pgvec mutex for rings) :(
Apart from being terribly sorry for the above, I propose to change the packet
sk list protection from spinlock to mutex. This lock currently protects two
modifications:
* sklist
* prot inuse counters
The sklist modifications can be just reprotected with mutex since they already
occur in a sleeping context. The inuse counters modifications are trickier -- the
__this_cpu_-s are used inside, thus requiring the caller to handle the potential
issues with contexts himself. Since packet sockets' counters are modified in two
places only (packet_create and packet_release) we only need to protect the context
from being preempted. BH disabling is not required in this case.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
This is the first batch of Netfilter and IPVS updates for your
net-next tree. Mostly cleanups for the Netfilter side. They are:
* Remove unnecessary RTNL locking now that we have support
for namespace in nf_conntrack, from Patrick McHardy.
* Cleanup to eliminate unnecessary goto in the initialization
path of several Netfilter tables, from Jean Sacren.
* Another cleanup from Wu Fengguang, this time to PTR_RET instead
of if IS_ERR then return PTR_ERR.
* Use list_for_each_entry_continue_rcu in nf_iterate, from
Michael Wang.
* Add pmtu_disc sysctl option to disable PMTU in their tunneling
transmitter, from Julian Anastasov.
* Generalize application protocol registration in IPVS and modify
IPVS FTP helper to use it, from Julian Anastasov.
* update Kconfig. The IPVS FTP helper depends on the Netfilter FTP
helper for NAT support, from Julian Anastasov.
* Add logic to update PMTU for IPIP packets in IPVS, again
from Julian Anastasov.
* A couple of sparse warning fixes for IPVS and Netfilter from
Claudiu Ghioc and Patrick McHardy respectively.
Patrick's IPv6 NAT changes will follow after this batch, I need
to flush this batch first before refreshing my tree.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso discovered that avahi and
potentially NetworkManager accept spoofed Netlink messages because of a
kernel bug. The kernel passes all-zero SCM_CREDENTIALS ancillary data
to the receiver if the sender did not provide such data, instead of not
including any such data at all or including the correct data from the
peer (as it is the case with AF_UNIX).
This bug was introduced in commit 16e5726269
(af_unix: dont send SCM_CREDENTIALS by default)
This patch forces passing credentials for netlink, as
before the regression.
Another fix would be to not add SCM_CREDENTIALS in
netlink messages if not provided by the sender, but it
might break some programs.
With help from Florian Weimer & Petr Matousek
This issue is designated as CVE-2012-3520
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Petr Matousek <pmatouse@redhat.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
John W. Linville says:
====================
This is a batch of updates intended for 3.7. The ath9k, mwifiex,
and b43 drivers get the bulk of the commits this time, with a handful
of other driver bits thrown-in. It is mostly just minor fixes and
cleanups, etc.
Also included is a Bluetooth pull, with a lot of refactoring.
Gustavo says:
"These are the changes I queued for 3.7. There are a many
small fixes/improvements by Andre Guedes. A l2cap channel
refcounting refactor by Jaganath. Bluetooth sockets now
appears in /proc/net, by Masatake Yamato and Sachin Kamat
changes ours drivers to use devm_kzalloc()."
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Sematically speaking, xfrm_mgr.acquire is called when kernel intends to ask
user space IKE daemon to negotiate SAs with peers. IOW the direction will
*always* be XFRM_POLICY_OUT, so remove int dir for clarity.
Signed-off-by: Fan Du <fan.du@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I got the following compile error:
In file included from include/net/sctp/checksum.h:46:0,
from net/ipv4/netfilter/nf_nat_proto_sctp.c:14:
include/net/sctp/sctp.h: In function ‘sctp_dbg_objcnt_init’:
include/net/sctp/sctp.h:370:88: error: parameter name omitted
include/net/sctp/sctp.h: In function ‘sctp_dbg_objcnt_exit’:
include/net/sctp/sctp.h:371:88: error: parameter name omitted
which is caused by
commit 13d782f6b4
Author: Eric W. Biederman <ebiederm@xmission.com>
Date: Mon Aug 6 08:45:15 2012 +0000
sctp: Make the proc files per network namespace.
This patch could fix it.
Cc: David S. Miller <davem@davemloft.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add struct net as a parameter to sctp_verify_param so it can be passed
to sctp_verify_ext_param where struct net will be needed when the sctp
tunables become per net tunables.
Add struct net as a parameter to sctp_verify_init so struct net can be
passed to sctp_verify_param.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are a handle of state machine functions primarily those dealing
with processing INIT packets where there is neither a valid endpoint nor
a valid assoication from which to derive a struct net. Therefore add
struct net * to the parameter list of sctp_state_fn_t and update all of
the state machine functions.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
struct net will be needed shortly when the tunables are made per network
namespace.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This trickles up through sctp_sm_lookup_event up to sctp_do_sm
and up further into sctp_primitiv_NAME before the code reaches
places where struct net can be reliably found.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Start with an empty sctp_net_table that will be populated as the various
tunable sysctls are made per net.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Convert all of the files under /proc/net/sctp to be per
network namespace.
- Don't print anything for /proc/net/sctp/snmp except in
the initial network namespaces as the snmp counters still
have to be converted to be per network namespace.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Kill sctp_get_ctl_sock, it is useless now.
- Pass struct net where needed so net->sctp.ctl_sock is accessible.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Move the address lists into struct net
- Add per network namespace initialization and cleanup
- Pass around struct net so it is everywhere I need it.
- Rename all of the global variable references into references
to the variables moved into struct net
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Use struct net in the hash calculation
- Use sock_net(association.base.sk) in the association lookups.
- On receive calculate the network namespace from skb->dev.
- Pass struct net from receive down to the functions that actually
do the association lookup.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Use struct net in the hash calculation
- Use sock_net(endpoint.base.sk) in the endpoint lookups.
- On receive calculate the network namespace from skb->dev.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Add struct net into the port hash table hash calculation
- Add struct net inot the struct sctp_bind_bucket so there
is a memory of which network namespace a port is allocated in.
No need for a ref count because sctp_bind_bucket only exists
when there are sockets in the hash table and sockets can not
change their network namspace, and sockets already ref count
their network namespace.
- Add struct net into the key comparison when we are testing
to see if we have found the port hash table entry we are
looking for.
With these changes lookups in the port hash table becomes
safe to use in multiple network namespaces.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
llc_station_init() creates and processes an event skb with no effect
other than to change the state from DOWN to UP. Allocation failure is
reported, but then ignored by its caller, llc2_init(). Remove this
possibility by simply initialising the state as UP.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
One condition before codel_Newton_step() was not good if
we never left the dropping state for a flow. As a result
rec_inv_sqrt was 0, instead of the ~0 initial value.
codel control law was then set to a very aggressive mode, dropping
many packets before reaching 'target' and recovering from this problem.
To keep codel_vars_init() as efficient as possible, refine
the condition to make sure rec_inv_sqrt initial value is correct
Many thanks to Anton Mich for discovering the issue and suggesting
a fix.
Reported-by: Anton Mich <lp2s1h@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_send_skb() can send orphaned skb, so we must pass the net pointer to
avoid possible NULL dereference in error path.
Bug added by commit 3a7c384ffd (ipv4: tcp: unicast_sock should not
land outside of TCP stack)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Disabling PMTU discovery can increase the output packet
rate but some users have enough resources and prefer to fragment
than to drop traffic. By default, we copy the DF bit but if
pmtu_disc is disabled we do not send FRAG_NEEDED messages anymore.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Get rid of the ftp_app pointer and allow applications
to be registered without adding fields in the netns_ipvs structure.
v2: fix coding style as suggested by Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
As pointed out, there are places, that access net->loopback_dev->ifindex
and after ifindex generation is made per-net this value becomes constant
equals 1. So go ahead and introduce the LOOPBACK_IFINDEX constant and use
it where appropriate.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Strictly speaking this is only _really_ required for checkpoint-restore to
make loopback device always have the same index.
This change appears to be safe wrt "ifindex should be unique per-system"
concept, as all the ifindex usage is either already made per net namespace
of is explicitly limited with init_net only.
There are two cool side effects of this. The first one -- ifindices of
devices in container are always small, regardless of how many containers
we've started (and re-started) so far. The second one is -- we can speed
up the loopback ifidex access as shown in the next patch.
v2: Place ifindex right after dev_base_seq : avoid two holes and use the
same cache line, dirtied in list_netdevice()/unlist_netdevice()
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric noticed, that when there will be devices with equal indices, some
hash functions that use them will become less effective as they could.
Fix this in advance by mixing the net_device address into the hash value
instead of the device index.
This is true for arp and ndisc hash fns. The netlabel, can and llc ones
are also ifindex-based, but that three are init_net-only, thus will not
be affected.
Many thanks to David and Eric for the hash32_ptr implementation!
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While investigating on network performance problems, I found this little
gem :
$ nm -v vmlinux | grep -1 dst_default_metrics
ffffffff82736540 b busy.46605
ffffffff82736560 B dst_default_metrics
ffffffff82736598 b dst_busy_list
Apparently, declaring a const array without initializer put it in
(writeable) bss section, in middle of possibly often dirtied cache
lines.
Since we really want dst_default_metrics be const to avoid any possible
false sharing and catch any buggy writes, I force a null initializer.
ffffffff818a4c20 R dst_default_metrics
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1) Avoid dirtying neighbour's confirmed field.
TCP workloads hits this cache line for each incoming ACK.
Lets write n->confirmed only if there is a jiffie change.
2) Optimize neigh_hh_output() for the common Ethernet case, were
hh_len is less than 16 bytes. Replace the memcpy() call
by two inlined 64bit load/stores on x86_64.
Bench results using udpflood test, with -C option (MSG_CONFIRM flag
added to sendto(), to reproduce the n->confirmed dirtying on UDP)
24 threads doing 1.000.000 UDP sendto() on dummy device, 4 runs.
before : 2.247s, 2.235s, 2.247s, 2.318s
after : 1.884s, 1.905s, 1.891s, 1.895s
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Avoid two instructions to reload dev->nd_net->mib.ip_statistics pointer,
unsing a temp variable, in ip_rcv(), ip_output() paths for example.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv6 needs a cookie in dst_check() call.
We need to add rx_dst_cookie and provide a family independent
sk_rx_dst_set(sk, skb) method to properly support IPv6 TCP early demux.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to check the 'Role' parameter from the LE Connection
Complete Event in order to properly set 'out' and 'link_mode'
fields from hci_conn structure.
Signed-off-by: Andre Guedes <andre.guedes@openbossa.org>
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
lsof command can tell the type of socket processes are using.
Internal lsof uses inode numbers on socket fs to resolve the type of
sockets. Files under /proc/net/, such as tcp, udp, unix, etc provides
such inode information.
Unfortunately bluetooth related protocols don't provide such inode
information. This patch series introduces /proc/net files for the protocols.
This patch against af_bluetooth.c provides facility to the implementation
of protocols. This patch extends bt_sock_list and introduces two exported
function bt_procfs_init, bt_procfs_cleanup.
The type bt_sock_list is already used in some of implementation of
protocols. bt_procfs_init prepare seq_operations which converts
protocol own bt_sock_list data to protocol own proc entry when the
entry is accessed.
What I, lsof user, need is just inode number of bluetooth
socket. However, people may want more information. The bt_procfs_init
takes a function pointer for customizing the show handler of
seq_operations.
In v4 patch, __acquires and __releases attributes are added to suppress
sparse warning. Suggested by Andrei Emeltchenko.
In v5 patch, linux/proc_fs.h is included to use PDE. Build error is
reported by Fengguang Wu.
Signed-off-by: Masatake YAMATO <yamato@redhat.com>
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
Move the l2cap channel list chan->global_l under the refcnt
protection and free it based on the refcnt.
Signed-off-by: Jaganath Kanakkassery <jaganath.k@samsung.com>
Signed-off-by: Syam Sidhardhan <s.syam@samsung.com>
Reviewed-by: Andrei Emeltchenko <andrei.emeltchenko@intel.com>
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
Refactor the code in order to use the l2cap_chan_destroy()
from l2cap_chan_put() under the refcnt protection.
Signed-off-by: Jaganath Kanakkassery <jaganath.k@samsung.com>
Signed-off-by: Syam Sidhardhan <s.syam@samsung.com>
Reviewed-by: Andrei Emeltchenko <andrei.emeltchenko@intel.com>
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
Return values are never used because callers hci_proto_connect_cfm
and hci_proto_disconn_cfm return void.
Signed-off-by: Andrei Emeltchenko <andrei.emeltchenko@intel.com>
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
AMP status codes copied from Bluez patch sent by Peter Krystad
<pkrystad@codeaurora.org>.
Signed-off-by: Andrei Emeltchenko <andrei.emeltchenko@intel.com>
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
Change spaces to tabs in smp code
Signed-off-by: Andrei Emeltchenko <andrei.emeltchenko@intel.com>
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
Use the same style for refcnt printing through all Bluetooth code
taking the reference the l2cap_chan refcnt printing.
Signed-off-by: Andrei Emeltchenko <andrei.emeltchenko@intel.com>
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
Add check that HCI controller is BR/EDR. AMP controller shall not be
managed by mgmt interface and consequently user space.
Signed-off-by: Andrei Emeltchenko <andrei.emeltchenko@intel.com>
Acked-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
This patch removes the struct adv_entry since it is not used anymore.
This struct should have been removed in commit 479453d (Bluetooth:
Remove advertising cache).
Signed-off-by: Andre Guedes <andre.guedes@openbossa.org>
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>