Commit Graph

665806 Commits

Author SHA1 Message Date
David S. Miller
a01aa920b8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter updates for your net-next
tree. A large bunch of code cleanups, simplify the conntrack extension
codebase, get rid of the fake conntrack object, speed up netns by
selective synchronize_net() calls. More specifically, they are:

1) Check for ct->status bit instead of using nfct_nat() from IPVS and
   Netfilter codebase, patch from Florian Westphal.

2) Use kcalloc() wherever possible in the IPVS code, from Varsha Rao.

3) Simplify FTP IPVS helper module registration path, from Arushi Singhal.

4) Introduce nft_is_base_chain() helper function.

5) Enforce expectation limit from userspace conntrack helper,
   from Gao Feng.

6) Add nf_ct_remove_expect() helper function, from Gao Feng.

7) NAT mangle helper function return boolean, from Gao Feng.

8) ctnetlink_alloc_expect() should only work for conntrack with
   helpers, from Gao Feng.

9) Add nfnl_msg_type() helper function to nfnetlink to build the
   netlink message type.

10) Get rid of unnecessary cast on void, from simran singhal.

11) Use seq_puts()/seq_putc() instead of seq_printf() where possible,
    also from simran singhal.

12) Use list_prev_entry() from nf_tables, from simran signhal.

13) Remove unnecessary & on pointer function in the Netfilter and IPVS
    code.

14) Remove obsolete comment on set of rules per CPU in ip6_tables,
    no longer true. From Arushi Singhal.

15) Remove duplicated nf_conntrack_l4proto_udplite4, from Gao Feng.

16) Remove unnecessary nested rcu_read_lock() in
    __nf_nat_decode_session(). Code running from hooks are already
    guaranteed to run under RCU read side.

17) Remove deadcode in nf_tables_getobj(), from Aaron Conole.

18) Remove double assignment in nf_ct_l4proto_pernet_unregister_one(),
    also from Aaron.

19) Get rid of unsed __ip_set_get_netlink(), from Aaron Conole.

20) Don't propagate NF_DROP error to userspace via ctnetlink in
    __nf_nat_alloc_null_binding() function, from Gao Feng.

21) Revisit nf_ct_deliver_cached_events() to remove unnecessary checks,
    from Gao Feng.

22) Kill the fake untracked conntrack objects, use ctinfo instead to
    annotate a conntrack object is untracked, from Florian Westphal.

23) Remove nf_ct_is_untracked(), now obsolete since we have no
    conntrack template anymore, from Florian.

24) Add event mask support to nft_ct, also from Florian.

25) Move nf_conn_help structure to
    include/net/netfilter/nf_conntrack_helper.h.

26) Add a fixed 32 bytes scratchpad area for conntrack helpers.
    Thus, we don't deal with variable conntrack extensions anymore.
    Make sure userspace conntrack helper doesn't go over that size.
    Remove variable size ct extension infrastructure now this code
    got no more clients. From Florian Westphal.

27) Restore offset and length of nf_ct_ext structure to 8 bytes now
    that wraparound is not possible any longer, also from Florian.

28) Allow to get rid of unassured flows under stress in conntrack,
    this applies to DCCP, SCTP and TCP protocols, from Florian.

29) Shrink size of nf_conntrack_ecache structure, from Florian.

30) Use TCP_MAX_WSCALE instead of hardcoded 14 in TCP tracker,
    from Gao Feng.

31) Register SYNPROXY hooks on demand, from Florian Westphal.

32) Use pernet hook whenever possible, instead of global hook
    registration, from Florian Westphal.

33) Pass hook structure to ebt_register_table() to consolidate some
    infrastructure code, from Florian Westphal.

34) Use consume_skb() and return NF_STOLEN, instead of NF_DROP in the
    SYNPROXY code, to make sure device stats are not fooled, patch
    from Gao Feng.

35) Remove NF_CT_EXT_F_PREALLOC this kills quite some code that we
    don't need anymore if we just select a fixed size instead of
    expensive runtime time calculation of this. From Florian.

36) Constify nf_ct_extend_register() and nf_ct_extend_unregister(),
    from Florian.

37) Simplify nf_ct_ext_add(), this kills nf_ct_ext_create(), from
    Florian.

38) Attach NAT extension on-demand from masquerade and pptp helper
    path, from Florian.

39) Get rid of useless ip_vs_set_state_timeout(), from Aaron Conole.

40) Speed up netns by selective calls of synchronize_net(), from
    Florian Westphal.

41) Silence stack size warning gcc in 32-bit arch in snmp helper,
    from Florian.

42) Inconditionally call nf_ct_ext_destroy(), even if we have no
    extensions, to deal with the NF_NAT_MANIP_SRC case. Patch from
    Liping Zhang.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-01 10:47:53 -04:00
David S. Miller
edd7f4efa8 Merge branch 'bpf-samples-skb_mode-bug-fixes'
Jesper Dangaard Brouer says:

====================
samples/bpf: two bug fixes to XDP_FLAGS_SKB_MODE attaching

Two small bugfixes for:
 commit 3993f2cb98 ("samples/bpf: Add support for SKB_MODE to xdp1 and xdp_tx_iptunnel")
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-01 10:42:38 -04:00
Jesper Dangaard Brouer
f76254a845 samples/bpf: fix XDP_FLAGS_SKB_MODE detach for xdp_tx_iptunnel
The xdp_tx_iptunnel program can be terminated in two ways, after
N-seconds or via Ctrl-C SIGINT.  The SIGINT code path does not
handle detatching the correct XDP program, in-case the program
was attached with XDP_FLAGS_SKB_MODE.

Fix this by storing the XDP flags as a global variable, which is
available for the SIGINT handler function.

Fixes: 3993f2cb98 ("samples/bpf: Add support for SKB_MODE to xdp1 and xdp_tx_iptunnel")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-01 10:42:37 -04:00
Jesper Dangaard Brouer
6387d0111c samples/bpf: fix SKB_MODE flag to be a 32-bit unsigned int
The kernel side of XDP_FLAGS_SKB_MODE is unsigned, and the rtnetlink
IFLA_XDP_FLAGS is defined as NLA_U32. Thus, userspace programs under
samples/bpf/ should use the correct type.

Fixes: 3993f2cb98 ("samples/bpf: Add support for SKB_MODE to xdp1 and xdp_tx_iptunnel")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-01 10:42:37 -04:00
David S. Miller
d74a32acd5 Merge branch 'xdp-netlink-ext-ack'
Jakub Kicinski says:

====================
xdp: use netlink extended ACK reporting

This series is an attempt to make XDP more user friendly by
enabling exploiting the recently added netlink extended ACK
reporting to carry messages to user space.

David Ahern's iproute2 ext ack patches for ip link are sufficient
to show the errors like this:

Error: nfp: MTU too large w/ XDP enabled

Where the message is coming directly from the driver.  There could
still be a bit of a leap for a complete novice from the message
above to the right settings, but it's a big improvement over the
standard "Invalid argument" message.

v1/non-rfc:
 - add a separate macro in patch 1;
 - add KBUILD_MODNAME as part of the message (Daniel);
 - don't print the error to logs in patch 1.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-01 10:35:48 -04:00
Jakub Kicinski
9861ce039c virtio_net: make use of extended ack message reporting
Try to carry error messages to the user via the netlink extended
ack message attribute.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-01 10:35:48 -04:00
Jakub Kicinski
d957c0f711 nfp: make use of extended ack message reporting
Try to carry error messages to the user via the netlink extended
ack message attribute.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-01 10:35:47 -04:00
Jakub Kicinski
ddf9f97076 xdp: propagate extended ack to XDP setup
Drivers usually have a number of restrictions for running XDP
- most common being buffer sizes, LRO and number of rings.
Even though some drivers try to be helpful and print error
messages experience shows that users don't often consult
kernel logs on netlink errors.  Try to use the new extended
ack mechanism to carry the message back to user space.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-01 10:35:47 -04:00
Jakub Kicinski
45d9b378e8 netlink: add NULL-friendly helper for setting extended ACK message
As we propagate extended ack reporting throughout various paths in
the kernel it may be that the same function is called with the
extended ack parameter passed as NULL.  One place where that happens
is in drivers which have a centralized reconfiguration function
called both from ndos and from ethtool_ops.  Add a new helper for
setting the error message in such conditions.

Existing helper is left as is to encourage propagating the ext act
fully wherever possible.  It also makes it clear in the code which
messages may be lost due to ext ack being NULL.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-01 10:35:47 -04:00
Liping Zhang
8eeef23504 netfilter: nf_ct_ext: invoke destroy even when ext is not attached
For NF_NAT_MANIP_SRC, we will insert the ct to the nat_bysource_table,
then remove it from the nat_bysource_table via nat_extend->destroy.

But now, the nat extension is attached on demand, so if the nat extension
is not attached, we will not be notified when the ct is destroyed, i.e.
we may fail to remove ct from the nat_bysource_table.

So just keep it simple, even if the extension is not attached, we will
still invoke the related ext->destroy. And this will also preserve the
flexibility for the future extension.

Fixes: 9a08ecfe74 ("netfilter: don't attach a nat extension by default")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-01 11:48:49 +02:00
Pablo Neira Ayuso
d1908ca8dc Merge tag 'ipvs3-for-v4.12' of http://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next
Simon Horman says:

====================
Third Round of IPVS Updates for v4.12

please consider these enhancements to IPVS for v4.12.
If it is too late for v4.12 then please consider them for v4.13.

* Remove unused function
* Correct comparison of unsigned value
====================

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-01 11:46:50 +02:00
Florian Westphal
0e72f55f35 netfilter: snmp: avoid stack size warning
net/ipv4/netfilter/nf_nat_snmp_basic.c:1158:1: warning: the frame size
of 1160 bytes is larger than 1024 bytes

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-01 11:43:58 +02:00
Florian Westphal
039b40ee58 netfilter: nf_queue: only call synchronize_net twice if nf_queue is active
nf_unregister_net_hook(s) can avoid a second call to synchronize_net,
provided there is no nfqueue active in that net namespace (which is
the common case).

This also gets rid of the extra arg to nf_queue_nf_hook_drop(), normally
this gets called during netns cleanup so no packets should be queued.

For the rare case of base chain being unregistered or module removal
while nfqueue is in use the extra hiccup due to the packet drops isn't
a big deal.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-01 11:19:12 +02:00
Florian Westphal
c83fa19603 netfilter: nf_log: don't call synchronize_rcu in nf_log_unset
nf_log_unregister() (which is what gets called in the logger backends
module exit paths) does a (required, module is removed) synchronize_rcu().

But nf_log_unset() is only called from pernet exit handlers. It doesn't
free any memory so there appears to be no need to call synchronize_rcu.

v2: Liping Zhang points out that nf_log_unregister() needs to be called
after pernet unregister, else rmmod would become unsafe.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-01 11:19:07 +02:00
Florian Westphal
933bd83ed6 netfilter: batch synchronize_net calls during hook unregister
synchronize_net is expensive and slows down netns cleanup a lot.

We have two APIs to unregister a hook:
nf_unregister_net_hook (which calls synchronize_net())
and
nf_unregister_net_hooks (calls nf_unregister_net_hook in a loop)

Make nf_unregister_net_hook a wapper around new helper
__nf_unregister_net_hook, which unlinks the hook but does not free it.

Then, we can call that helper in nf_unregister_net_hooks and then
call synchronize_net() only once.

Andrey Konovalov reports this change improves syzkaller fuzzing speed at
least twice.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-01 11:18:54 +02:00
Abhishek Shah
733336262b net: phy: Allow BCM5481x PHYs to setup internal TX/RX clock delay
This patch allows users to enable/disable internal TX and/or RX
clock delay for BCM5481x series PHYs so as to satisfy RGMII timing
specifications.

On a particular platform, whether TX and/or RX clock delay is required
depends on how PHY connected to the MAC IP. This requirement can be
specified through "phy-mode" property in the platform device tree.

Signed-off-by: Abhishek Shah <abhishek.shah@broadcom.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:56:19 -04:00
Colin Ian King
d832565032 net: sunhme: fix spelling mistakes: "ParityErro" -> "ParityError"
trivial fix to spelling mistakes in printk message.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:55:47 -04:00
Scott Wood
9b70de6d02 bnx2x: Align RX buffers
The bnx2x driver is not providing proper alignment on the receive buffers it
passes to build_skb(), causing skb_shared_info to be misaligned.
skb_shared_info contains an atomic, and while PPC normally supports
unaligned accesses, it does not support unaligned atomics.

Aligning the size of rx buffers will ensure that page_frag_alloc() returns
aligned addresses.

This can be reproduced on PPC by setting the network MTU to 1450 (or other
non-multiple-of-4) and then generating sufficient inbound network traffic
(one or two large "wget"s usually does it), producing the following oops:

Unable to handle kernel paging request for unaligned access at address 0xc00000ffc43af656
Faulting instruction address: 0xc00000000080ef8c
Oops: Kernel access of bad area, sig: 7 [#1]
SMP NR_CPUS=2048
NUMA
PowerNV
Modules linked in: vmx_crypto powernv_rng rng_core powernv_op_panel leds_powernv led_class nfsd ip_tables x_tables autofs4 xfs lpfc bnx2x mdio libcrc32c crc_t10dif crct10dif_generic crct10dif_common
CPU: 104 PID: 0 Comm: swapper/104 Not tainted 4.11.0-rc8-00088-g4c761da #2
task: c00000ffd4892400 task.stack: c00000ffd4920000
NIP: c00000000080ef8c LR: c00000000080eee8 CTR: c0000000001f8320
REGS: c00000ffffc33710 TRAP: 0600   Not tainted  (4.11.0-rc8-00088-g4c761da)
MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>
  CR: 24082042  XER: 00000000
CFAR: c00000000080eea0 DAR: c00000ffc43af656 DSISR: 00000000 SOFTE: 1
GPR00: c000000000907f64 c00000ffffc33990 c000000000dd3b00 c00000ffcaf22100
GPR04: c00000ffcaf22e00 0000000000000000 0000000000000000 0000000000000000
GPR08: 0000000000b80008 c00000ffc43af636 c00000ffc43af656 0000000000000000
GPR12: c0000000001f6f00 c00000000fe1a000 000000000000049f 000000000000c51f
GPR16: 00000000ffffef33 0000000000000000 0000000000008a43 0000000000000001
GPR20: c00000ffc58a90c0 0000000000000000 000000000000dd86 0000000000000000
GPR24: c000007fd0ed10c0 00000000ffffffff 0000000000000158 000000000000014a
GPR28: c00000ffc43af010 c00000ffc9144000 c00000ffcaf22e00 c00000ffcaf22100
NIP [c00000000080ef8c] __skb_clone+0xdc/0x140
LR [c00000000080eee8] __skb_clone+0x38/0x140
Call Trace:
[c00000ffffc33990] [c00000000080fb74] skb_clone+0x74/0x110 (unreliable)
[c00000ffffc339c0] [c000000000907f64] packet_rcv+0x144/0x510
[c00000ffffc33a40] [c000000000827b64] __netif_receive_skb_core+0x5b4/0xd80
[c00000ffffc33b00] [c00000000082b2bc] netif_receive_skb_internal+0x2c/0xc0
[c00000ffffc33b40] [c00000000082c49c] napi_gro_receive+0x11c/0x260
[c00000ffffc33b80] [d000000066483d68] bnx2x_poll+0xcf8/0x17b0 [bnx2x]
[c00000ffffc33d00] [c00000000082babc] net_rx_action+0x31c/0x480
[c00000ffffc33e10] [c0000000000d5a44] __do_softirq+0x164/0x3d0
[c00000ffffc33f00] [c0000000000d60a8] irq_exit+0x108/0x120
[c00000ffffc33f20] [c000000000015b98] __do_irq+0x98/0x200
[c00000ffffc33f90] [c000000000027f14] call_do_irq+0x14/0x24
[c00000ffd4923a90] [c000000000015d94] do_IRQ+0x94/0x110
[c00000ffd4923ae0] [c000000000008d90] hardware_interrupt_common+0x150/0x160

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:50:19 -04:00
Arkadi Sharshevsky
58073b32b0 net: bridge: Fix improper taking over HW learned FDB
Commit 7e26bf45e4 ("net: bridge: allow SW learn to take over HW fdb
entries") added the ability to "take over an entry which was previously
learned via HW when it shows up from a SW port".

However, if an entry was learned via HW and then a control packet
(e.g., ARP request) was trapped to the CPU, the bridge driver will
update the entry and remove the externally learned flag, although the
entry is still present in HW. Instead, only clear the externally learned
flag in case of roaming.

Fixes: 7e26bf45e4 ("net: bridge: allow SW learn to take over HW fdb entries")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Arkadi Sharashevsky <arkadis@mellanox.com>
Cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:46:32 -04:00
WANG Cong
ba3f571d5d ipv4: get rid of ip_ra_lock
After commit 1215e51eda ("ipv4: fix a deadlock in ip_ra_control")
we always take RTNL lock for ip_ra_control() which is the only place
we update the list ip_ra_chain, so the ip_ra_lock is no longer needed.

As Eric points out, BH does not need to disable either, RCU readers
don't care.

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:44:04 -04:00
Jesper Dangaard Brouer
5010e94842 samples/bpf: bpf_load.c detect and abort if ELF maps section size is wrong
The struct bpf_map_def was extended in commit fb30d4b712 ("bpf: Add tests
for map-in-map") with member unsigned int inner_map_idx.  This changed the size
of the maps section in the generated ELF _kern.o files.

Unfortunately the loader in bpf_load.c does not detect or handle this.  Thus,
older _kern.o files became incompatible, and caused hard-to-debug errors
where the syscall validation rejected BPF_MAP_CREATE request.

This patch only detect the situation and aborts load_bpf_file(). It also
add code comments warning people that read this loader for inspiration
for these pitfalls.

Fixes: fb30d4b712 ("bpf: Add tests for map-in-map")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:41:59 -04:00
Dan Carpenter
39f3709599 lwtunnel: fix error path in lwtunnel_fill_encap()
We recently added a check to see if nla_nest_start() fails.  There are
two issues with that.  First, if it fails then I don't think we should
call nla_nest_cancel().  Second, it's slightly convoluted but the
current code returns success but we should return -EMSGSIZE instead.

Fixes: a50fe0ffd7 ("lwtunnel: check return value of nla_nest_start")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:41:29 -04:00
Dan Carpenter
77041e89ce liquidio: silence a locking static checker warning
Presumably we never hit this return, but static checkers complain that
we need to unlock so we may as well fix that.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:41:07 -04:00
Dan Carpenter
66117a9d9a qed: Unlock on error in qed_vf_pf_acquire()
My static checker complains that we're holding a mutex on this error
path.  Let's goto exit instead of returning directly.

Fixes: b0bccb69eb ("qed: Change locking scheme for VF channel")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:40:45 -04:00
David S. Miller
1c942c941d Merge branch 'hns-deferred-probe'
lipeng says:

====================
net: hns: bug fix for HNS driver

This patchset add support defered dsaf probe when mdio and
mbigen module is not insmod.

For more details, please refer to individual patch.

change log:
V4 - > V5:
1. Float on net-next;
2. Delete patch "net: hns: fixed bug that skb used after kfree"
   from this patchset;

V3 -> V4:
1. Delete redundant commit message;
2. Add Reviewed-by: Matthias Brugger <mbrugger@suse.com>;

V2 -> V3:
1. Check return value when  platform_get_irq in hns_rcb_get_cfg;

V1 -> V2:
1. Return appropriate errno in hns_mac_register_phy;
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:39:25 -04:00
lipeng
804ffe5c61 net: hns: support deferred probe when no mdio
In the hip06 and hip07 SoCs, phy connect to mdio bus.The mdio
module is probed with module_init, and, as such,
is not guaranteed to probe before the HNS driver. So we need
to support deferred probe.

We check for probe deferral in the mac init, so we not init DSAF
when there is no mdio, and free all resource, to later learn that
we need to defer the probe.

Signed-off-by: lipeng <lipeng321@huawei.com>
Reviewed-by: Yisen Zhuang <yisen.zhuang@huawei.com>
Reviewed-by: Matthias Brugger <mbrugger@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:39:24 -04:00
lipeng
2fdd6bafe3 net: hns: support deferred probe when can not obtain irq
In the hip06 and hip07 SoCs, the interrupt lines from the
DSAF controllers are connected to mbigen hw module.
The mbigen module is probed with module_init, and, as such,
is not guaranteed to probe before the HNS driver. So we need
to support deferred probe.

Signed-off-by: lipeng <lipeng321@huawei.com>
Reviewed-by: Yisen Zhuang <yisen.zhuang@huawei.com>
Reviewed-by: Matthias Brugger <mbrugger@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:39:24 -04:00
David S. Miller
ba1d82e681 Merge branch 'nfp-XDP_TX-optimizations'
Jakub Kicinski says:

====================
nfp: optimize XDP TX and small fixes

This series optimizes the nfp XDP TX performance a little bit.
I run quick tests on an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz.
Single core/queue performance for both touch and drop and touch and
forward is above 20Mpps @64B packets, drop being 2Mpps faster.
I think this is max for a single queue on the low power NFPs.

There are also a few minor fixes included for code in net-next.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:37:01 -04:00
Jakub Kicinski
dbf637ff39 nfp: provide 256 bytes of XDP headroom in all configurations
For legacy reasons NFP FW may be compiled to DMA packets to a constant
offset into the buffer and use the space before it for metadata.  This
ensures that packets data always start at a certain offset regardless of
the amount of preceding metadata.

If rx offset is set to 0 there may still be up to 64 bytes of metadata
but metadata will start at the beginning of the buffer, instead of:

    data_start_offset = rx_offset - meta_len

Even though we make the buffers larger to accommodate up to 64 bytes of
metadata, if there is only N bytes of metadata, we will end up with
N bytes of headroom and 64 - N bytes of tailroom.  Therefore we can't
rely on that space for XDP headroom.  Make sure we always allocate
full 256 bytes.  This, unfortunately, means we can't fit the headroom
on an u8 any more.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:37:00 -04:00
Jakub Kicinski
85cb207ee3 nfp: don't completely refuse to work with old flashes
Right now the required Service Process ABI version is still tied
to max ID of known commands.  For new NSP commands we are adding
we are checking if NSP version is recent enough on command-by-command
basis.  The driver doesn't have to force the device to have the
very latest flash, anything newer than 0.8 should do.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:37:00 -04:00
Jakub Kicinski
d38df0d364 nfp: avoid reading TX queue indexes from the device
Reading TX queue indexes from the device memory on each interrupt
is expensive.  It's doubly expensive with XDP running since we have
two TX rings to check there.  If the software indexes indicate that
the TX queue is completely empty, however, we don't need to look at
the device completion index at all.

The queuing CPU is doing a wmb() before kicking the device TX so
we should be safe to assume on the CPU handling the completions will
never see old value of the software copy of the index.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:37:00 -04:00
Jakub Kicinski
92e68195eb nfp: do simple XDP TX buffer recycling
On the RX path we follow the "drop if allocation of replacement
buffer fails" rule.  With XDP we extended that to the TX action,
so if XDP prog returned TX but allocation of replacement RX buffer
failed, we will drop the packet.

To improve our XDP TX performance extend the idea of rings being
always full to XDP TX rings.  Pre-fill the XDP TX rings with RX
buffers, and when XDP prog returns TX action swap the RX buffer
with the next buffer from the TX ring.

XDP TX complete will no longer free the buffers but let them
sit on the TX ring and wait for swap with RX buffer, instead.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:37:00 -04:00
Jakub Kicinski
d78005a50f nfp: drop rx_ring param from buffer allocation
We will soon allocate RX buffers for caching on XDP TX rings.
The rx_ring parameter passed to nfp_net_rx_alloc_one() is not
actually used, remove it.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:37:00 -04:00
Jakub Kicinski
46c505188b nfp: replace -ENOTSUPP with -EOPNOTSUPP
As Or points out in commit 423b3aecf2 ("net/mlx4: Change ENOTSUPP
to EOPNOTSUPP"), ENOTSUPP is NFS specific error.  Replace it with
EOPNOTSUPP.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:37:00 -04:00
Willem de Bruijn
1d11e732e7 virtio-net: use netif_tx_napi_add for tx napi
Avoid hashing the tx napi struct into napi_hash[], which is used for
busy polling receive queues.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:33:04 -04:00
David Howells
b5082df801 net: Initialise init_net.count to 1
Initialise init_net.count to 1 for its pointer from init_nsproxy lest
someone tries to do a get_net() and a put_net() in a process in which
current->ns_proxy->net_ns points to the initial network namespace.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:32:16 -04:00
Girish Moodalbail
5e0740c445 geneve: fix incorrect setting of UDP checksum flag
Creating a geneve link with 'udpcsum' set results in a creation of link
for which UDP checksum will NOT be computed on outbound packets, as can
be seen below.

11: gen0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN
    link/ether c2:85:27:b6:b4:15 brd ff:ff:ff:ff:ff:ff promiscuity 0
    geneve id 200 remote 192.168.13.1 dstport 6081 noudpcsum

Similarly, creating a link with 'noudpcsum' set results in a creation
of link for which UDP checksum will be computed on outbound packets.

Fixes: 9b4437a5b8 ("geneve: Unify LWT and netdev handling.")
Signed-off-by: Girish Moodalbail <girish.moodalbail@oracle.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Acked-by: Lance Richardson <lrichard@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:31:16 -04:00
David S. Miller
c4879789b1 Merge branch 'vxlan-disabled-ipv6'
Jiri Benc says:

====================
vxlan: do not error out on disabled IPv6

This patchset fixes a bug with metadata based tunnels when booted with
ipv6.disable=1.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:30:13 -04:00
Jiri Benc
baf4d78607 vxlan: do not output confusing error message
The message "Cannot bind port X, err=Y" creates only confusion. In metadata
based mode, failure of IPv6 socket creation is okay if IPv6 is disabled and
no error message should be printed. But when IPv6 tunnel was requested, such
failure is fatal. The vxlan_socket_create does not know when the error is
harmless and when it's not.

Instead of passing such information down to vxlan_socket_create, remove the
message completely. It's not useful. We propagate the error code up to the
user space and the port number comes from the user space. There's nothing in
the message that the process creating vxlan interface does not know.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:30:13 -04:00
Jiri Benc
d074bf9600 vxlan: correctly handle ipv6.disable module parameter
When IPv6 is compiled but disabled at runtime, __vxlan_sock_add returns
-EAFNOSUPPORT. For metadata based tunnels, this causes failure of the whole
operation of bringing up the tunnel.

Ignore failure of IPv6 socket creation for metadata based tunnels caused by
IPv6 not being available.

Fixes: b1be00a6c3 ("vxlan: support both IPv4 and IPv6 sockets in a single vxlan device")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:30:13 -04:00
Andy Shevchenko
90a1bb9816 bnx2x: Get rid of useless temporary variable
Replace pattern

 int status;
 ...
 status = func(...);
 return status;

by

 return func(...);

No functional change intented.

Signed-off-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:28:00 -04:00
Andy Shevchenko
b77f016726 bnx2x: Reuse bnx2x_null_format_ver()
Reuse bnx2x_null_format_ver() in functions where it's appropriated
instead of open coded variant.

Signed-off-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:28:00 -04:00
Andy Shevchenko
55b218c12c bnx2x: Replace custom scnprintf()
Use scnprintf() when printing version instead of custom open coded variants.

Signed-off-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Acked-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:28:00 -04:00
David S. Miller
4750c7be5b linux-can-next-for-4.12-20170427
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCgAxFiEE4bay/IylYqM/npjQHv7KIOw4HPYFAlkBofATHG1rbEBwZW5n
 dXRyb25peC5kZQAKCRAe/sog7Dgc9qY7CACnNQLrwm8fIV72DD9+pZ1QsNwvLHCr
 8ZiHSd7m/VI/FESje4Uc3D9vKi7K6MGX9/lWi70QY+2IPrTfRTrYEO5I4U3+1czc
 zQ/NnQGsKBoAJdPLoXp0LkQVAXRSXUoKiaiU0VdktaAzGewixz9YxGHAcCqcu6oM
 1+MftSR0iCP4mkcOyVMcnipcI9L8TzEOz2f3XB/RtMiRGcEsUekAHfdpH6NSHbAq
 mG5pOwENqNn4i1ZwGg16u3p+U4YfWAF9Y+/wuev6Ob4Kla2tYnuZNEAqf9ZubY93
 NBsT3pGjcbzbbihlKvu64NUlGiloQ8AGAYvttinrfx1tnsDgvP2m6nQP
 =mHEY
 -----END PGP SIGNATURE-----

Merge tag 'linux-can-next-for-4.12-20170427' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next

Marc Kleine-Budde says:

====================
pull-request: can-next 2017-04-25

this is a pull request of 1 patch for net-next/master.

This patch by Oliver Hartkopp fixes the build of the broad cast manager
with CONFIG_PROC_FS disabled.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:25:59 -04:00
Chenbo Feng
5d4e344328 bpf: Fix inaccurate helper function description
The description inside uapi/linux/bpf.h about bpf_get_socket_uid
helper function is no longer valid. It returns overflowuid rather
than 0 when failed.

Signed-off-by: Chenbo Feng <fengc@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:24:58 -04:00
Davide Caratti
d68be71ea1 tcp: fix access to sk->sk_state in tcp_poll()
avoid direct access to sk->sk_state when tcp_poll() is called on a socket
using active TCP fastopen with deferred connect. Use local variable
'state', which stores the result of sk_state_load(), like it was done in
commit 00fd38d938 ("tcp: ensure proper barriers in lockless contexts").

Fixes: 19f6d3f3c8 ("net/tcp-fastopen: Add new API support")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:24:16 -04:00
Eric Dumazet
d1f496fd8f bpf: restore skb->sk before pskb_trim() call
While testing a fix [1] in ___pskb_trim(), addressing the WARN_ON_ONCE()
in skb_try_coalesce() reported by Andrey, I found that we had an skb
with skb->sk set but no skb->destructor.

This invalidated heuristic found in commit 158f323b98 ("net: adjust
skb->truesize in pskb_expand_head()") and in cited patch.

Considering the BUG_ON(skb->sk) we have in skb_orphan(), we should
restrain the temporary setting to a minimal section.

[1] https://patchwork.ozlabs.org/patch/755570/
    net: adjust skb->truesize in ___pskb_trim()

Fixes: 8f917bba00 ("bpf: pass sk to helper functions")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:23:16 -04:00
Alexandre Belloni
ae3696c167 net: macb: fix phy interrupt parsing
Since 83a77e9ec4, the phydev irq is explicitly set to PHY_POLL when
there is no pdata. It doesn't work on DT enabled platforms because the
phydev irq is already set by libphy before.

Fixes: 83a77e9ec4 ("net: macb: Added PCI wrapper for Platform Driver.")
Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:21:49 -04:00
David S. Miller
8dd5b698c2 Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:

====================
40GbE Intel Wired LAN Driver Updates 2017-04-30

This series contains updates to i40e and i40evf only.

Jake provides majority of the changes in this series, starting with the
renaming of a flag to avoid confusion.  Then renamed a variable to a
more meaningful name to clarify what is actually being done and to
reduce confusion.  Amortizes the wait time when initializing or disabling
lots of VFs by using i40e_reset_all_vfs() and
i40e_vsi_stop_rings_no_wait().  Cleaned up a unnecessary delay since
pci_disable_sriov() already has its own delay, so need to add a additional
delay when removing VFs.  Avoid using the same name flags for both
vsi->state and pf->state, to make code review easier and assist future
work to use the correct state field when checking bits.  Use
DECLARE_BITMAP() to ensure that we always allocate enough space for flags.
Replace hw_disabled_flags with the new _AUTO_DISABLED flags, which are
more readable because we are not setting an *_ENABLED flag to
disable the feature.

Alex corrects a oversight where we were not reprogramming the ports
after a reset, which was causing us to lose all of the receive tunnel
offloads.

Arnd Bergmann moves the declaration of a local variable to avoid a
warning seen on architectures with larger pages about an unused variable.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 11:33:37 -04:00
David S. Miller
5b36d8f5e5 Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:

====================
1GbE Intel Wired LAN Driver Updates 2017-04-30

This series contains updates to e1000e only.

Jarod Wilson fixes an issue where the workaround for 82574 & 82583
is needed for i218 as well, so set the appropriate flags.

Sasha adds support for the upcoming new i219 devices for the client
platform (CannonLake), which includes the support for 38.4MHz frequency
to support PTP on CannonLake.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 11:30:13 -04:00