kernel-ark

Author	SHA1	Message	Date
Marcel Holtmann	71653eb64b	Bluetooth: Add selftest for ECDH key generation Since the ECDH key generation takes a different path, it needs to be tested as well. For this generate the public debug key from the private debug key and compare both. This also moves the seeding of the private key into the SMP calling code to allow for easier re-use of the ECDH key generation helper. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2017-04-30 16:52:43 +03:00
Marcel Holtmann	f958315358	Bluetooth: zero kpp input for key generation When generating new ECDH keys with kpp, the shared secret input needs to be set to NULL. Fix this by including kpp_request_set_input call. Fixes: `58771c1c` ("Bluetooth: convert smp and selftest to crypto kpp API") Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2017-04-30 16:52:39 +03:00
Szymon Janc	ab89f0bdd6	Bluetooth: Fix user channel for 32bit userspace on 64bit kernel Running 32bit userspace on 64bit kernel results in MSG_CMSG_COMPAT being defined as 0x80000000. This results in sendmsg failure if used from 32bit userspace running on 64bit kernel. Fix this by accounting for MSG_CMSG_COMPAT in flags check in hci_sock_sendmsg. Signed-off-by: Szymon Janc <szymon.janc@codecoup.pl> Signed-off-by: Marko Kiiskila <marko@runtime.io> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Cc: stable@vger.kernel.org	2017-04-30 12:22:14 +02:00
Salvatore Benedetto	763d9a302a	Bluetooth: allocate data for kpp on heap Bluetooth would crash when computing ECDH keys with kpp if VMAP_STACK is enabled. Fix by allocating data passed to kpp on heap. Fixes: `58771c1c` ("Bluetooth: convert smp and selftest to crypto kpp API") Signed-off-by: Salvatore Benedetto <salvatore.benedetto@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2017-04-30 12:22:05 +02:00
Linus Torvalds	0e9117882d	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking fixes from David Miller: "Just a couple more stragglers, I really hope this is it. 1) Don't let frags slip down into the GRO segmentation handlers, from Steffen Klassert. 2) Truesize under-estimation triggers warnings in TCP over loopback with socket filters, 2 part fix from Eric Dumazet. 3) Fix undesirable reset of bonding MTU to ETH_HLEN on slave removal, from Paolo Abeni. 4) If we flush the XFRM policy after garbage collection, it doesn't work because stray entries can be created afterwards. Fix from Xin Long. 5) Hung socket connection fixes in TIPC from Parthasarathy Bhuvaragan. 6) Fix GRO regression with IPSEC when netfilter is disabled, from Sabrina Dubroca. 7) Fix cpsw driver Kconfig dependency regression, from Arnd Bergmann" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: net: hso: register netdev later to avoid a race condition net: adjust skb->truesize in ___pskb_trim() tcp: do not underestimate skb->truesize in tcp_trim_head() bonding: avoid defaulting hard_header_len to ETH_HLEN on slave removal ipv4: Don't pass IP fragments to upper layer GRO handlers. cpsw/netcp: refine cpts dependency tipc: close the connection if protocol messages contain errors tipc: improve error validations for sockets in CONNECTING state tipc: Fix missing connection request handling xfrm: fix GRO for !CONFIG_NETFILTER xfrm: do the garbage collection after flushing policy	2017-04-28 14:13:16 -07:00
Eric Dumazet	c21b48cc1b	net: adjust skb->truesize in ___pskb_trim() Andrey found a way to trigger the WARN_ON_ONCE(delta < len) in skb_try_coalesce() using syzkaller and a filter attached to a TCP socket. As we did recently in commit `158f323b98` ("net: adjust skb->truesize in pskb_expand_head()") we can adjust skb->truesize from ___pskb_trim(), via a call to skb_condense(). If all frags were freed, then skb->truesize can be recomputed. This call can be done if skb is not yet owned, or destructor is sock_edemux(). Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-28 16:06:47 -04:00
Eric Dumazet	7162fb242c	tcp: do not underestimate skb->truesize in tcp_trim_head() Andrey found a way to trigger the WARN_ON_ONCE(delta < len) in skb_try_coalesce() using syzkaller and a filter attached to a TCP socket over loopback interface. I believe one issue with looped skbs is that tcp_trim_head() can end up producing skb with under estimated truesize. It hardly matters for normal conditions, since packets sent over loopback are never truncated. Bytes trimmed from skb->head should not change skb truesize, since skb->head is not reallocated. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-28 16:05:22 -04:00
Steffen Klassert	9b83e03198	ipv4: Don't pass IP fragments to upper layer GRO handlers. Upper layer GRO handlers can not handle IP fragments, so exit GRO processing in this case. This fixes ESP GRO because the packet must be reassembled before we can decapsulate, otherwise we get authentication failures. It also aligns IPv4 to IPv6 where packets with fragmentation headers are not passed to upper layer GRO handlers. Fixes: `7785bba299` ("esp: Add a software GRO codepath") Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-28 16:00:38 -04:00
David S. Miller	cd5487fb94	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2017-04-28 Just one patch to fix a misplaced spin_unlock_bh in an error path. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-28 15:43:24 -04:00
David S. Miller	5577e67956	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2017-04-28 1) Do garbage collecting after a policy flush to remove old bundles immediately. From Xin Long. 2) Fix GRO if netfilter is not defined. From Sabrina Dubroca. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-28 15:42:11 -04:00
David S. Miller	cec3819198	Another set of patches for -next: * API support for concurrent scheduled scan requests * API changes for roaming reporting * BSS max idle support in mac80211 * API changes for TX status reporting in mac80211 * API changes for RX rate reporting in mac80211 * rewrite monitor logic to prepare for BPF filters * bugfix for rare devices without 2.4 GHz support * a bugfix for recent DFS changes * some further cleanups The API changes are actually at a nice time, since it's typically quiet just before the merge window, and trees can be synchronized easily during it. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEExu3sM/nZ1eRSfR9Ha3t4Rpy0AB0FAlkDggoACgkQa3t4Rpy0 AB2kpQ/+PUTNPKklYvV2TVRyT71oGoE4WNzPe4v5cgpbBkk48qCOD1ncAHCo6q2s L90gh9yOXnjht3pobvjd0PlYHy5faq4VyDohBdyId3YlR78FYyk41KSMkp4BAKn0 aeTI3QeA0FOsXECiagqG6pwYpMJM1nQFhFbL1AvVIf1MxBGYNgh+iVLzk2tyTGXB OYdJBHTjgKW3nuIRYgtUoLQaWNUlUhK+5wpypb3wkgn41DEq4sL4ay5VgNKSd0AE 5AkCrhLbiZlR4xrxivqOuS3nNPPIDOq5imuRvMMbQDChZ/4p60l0f1VIo6MR6UAQ N5Cn2ReD0bV+GFVaWmpDnmdQJIoLLYHWdlX362XdVbQCFOWOfsaD9zM2j5wXMeNV YBCMYa7Lt52ewjz4BABsAtH4/ZFReKCDmkFOMyakA2LnWzUxxqjWourcAM7XV2Kc RZIcM36Xx6yhsSReOzzd/HV9CUjqF8xuKrMKd/vWxBwWiaWdywtWqhEksxi0aOvq LUnCqgvcIxCh9dd8ygcfNdGbpZVqPVxmPdixRQbNL50M7gXtqUwZgnHHUdExgs8E 8sP+ua8H9RlXVGItuBFURShaV4hToJrxKw0xVjtVkpVXkqibgOIjxRv2mh+nkKXr tq8VWxnrKOH4nAVIyZJonoXp0Hi0vYLyt7bAwAj01CoGyZZdqUw= =dYTw -----END PGP SIGNATURE----- Merge tag 'mac80211-next-for-davem-2017-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next Johannes Berg says: ==================== Another set of patches for -next: * API support for concurrent scheduled scan requests * API changes for roaming reporting * BSS max idle support in mac80211 * API changes for TX status reporting in mac80211 * API changes for RX rate reporting in mac80211 * rewrite monitor logic to prepare for BPF filters * bugfix for rare devices without 2.4 GHz support * a bugfix for recent DFS changes * some further cleanups The API changes are actually at a nice time, since it's typically quiet just before the merge window, and trees can be synchronized easily during it. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-28 14:41:15 -04:00
Parthasarathy Bhuvaragan	c1be775628	tipc: close the connection if protocol messages contain errors When a socket is shutting down, we notify the peer node about the connection termination by reusing an incoming message if possible. If the last received message was a connection acknowledgment message, we reverse this message and set the error code to TIPC_ERR_NO_PORT and send it to peer. In tipc_sk_proto_rcv(), we never check for message errors while processing the connection acknowledgment or probe messages. Thus this message performs the usual flow control accounting and leaves the session hanging. In this commit, we terminate the connection when we receive such error messages. Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-28 12:20:42 -04:00
Parthasarathy Bhuvaragan	4e0df4951e	tipc: improve error validations for sockets in CONNECTING state Until now, the checks for sockets in CONNECTING state was based on the assumption that the incoming message was always from the peer's accepted data socket. However an application using a non-blocking socket sends an implicit connect, this socket which is in CONNECTING state can receive error messages from the peer's listening socket. As we discard these messages, the application socket hangs as there due to inactivity. In addition to this, there are other places where we process errors but do not notify the user. In this commit, we process such incoming error messages and notify our users about them using sk_state_change(). Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-28 12:20:42 -04:00
Parthasarathy Bhuvaragan	42b531de17	tipc: Fix missing connection request handling In filter_connect, we use waitqueue_active() to check for any connections to wakeup. But waitqueue_active() is missing memory barriers while accessing the critical sections, leading to inconsistent results. In this commit, we replace this with an SMP safe wq_has_sleeper() using the generic socket callback sk_data_ready(). Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-28 12:20:42 -04:00
Pablo Neira Ayuso	1a41dbce0d	Merge tag 'ipvs-fixes-for-v4.11' of http://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs Simon Horman says: ==================== IPVS Fixes for v4.11 I would also like it considered for stable. * Explicitly forbid ipv6 service/dest creation if ipv6 mod is disabled to avoid oops caused by IPVS accesing IPv6 routing code in such circumstances. ==================== Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-28 16:22:11 +02:00
Dan Carpenter	7dde07e9c5	netfilter: x_tables: unlock on error in xt_find_table_lock() According to my static checker we should unlock here before the return. That seems reasonable to me as well. Fixes" `b9e69e1273` ("netfilter: xtables: don't hook tables by default") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-28 15:49:48 +02:00
Arend Van Spriel	b34939b983	cfg80211: add request id to cfg80211_sched_scan_*() api Have proper request id filled in the SCHED_SCAN_RESULTS and SCHED_SCAN_STOPPED notifications toward user-space by having the driver provide it through the api. Reviewed-by: Hante Meuleman <hante.meuleman@broadcom.com> Reviewed-by: Pieter-Paul Giesberts <pieter-paul.giesberts@broadcom.com> Reviewed-by: Franky Lin <franky.lin@broadcom.com> Signed-off-by: Arend van Spriel <arend.vanspriel@broadcom.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2017-04-28 14:51:43 +02:00
Avraham Stern	e38a017bf0	mac80211: Add support for BSS max idle period element Parse the BSS max idle period element and set the BSS configuration accordingly so the driver can use this information to configure the max idle period and to use protected management frames for keep alive when required. The BSS max idle period element is defined in IEEE802.11-2016, section 9.4.2.79 Signed-off-by: Avraham Stern <avraham.stern@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2017-04-28 12:28:45 +02:00
Avraham Stern	29ce6ecbb8	cfg80211: unify cfg80211_roamed() and cfg80211_roamed_bss() cfg80211_roamed() and cfg80211_roamed_bss() take the same arguments except that cfg80211_roamed() requires the BSSID and cfg80211_roamed_bss() requires the bss entry. Unify the two functions by using a struct for driver initiated roaming information so that either the BSSID or the bss entry can be passed as an argument to the unified function. Signed-off-by: Avraham Stern <avraham.stern@intel.com> [modified the ath6k, brcm80211, rndis and wlan-ng drivers accordingly] Signed-off-by: Luca Coelho <luciano.coelho@intel.com> [modify brcmfmac to remove the useless cast, spotted by Arend] Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2017-04-28 12:28:44 +02:00
Mohammed Shafi Shajakhan	21a8e9dd52	mac80211: Fix possible sband related NULL pointer de-reference Existing API 'ieee80211_get_sdata_band' returns default 2 GHz band even if the channel context configuration is NULL. This crashes for chipsets which support 5 Ghz alone when it tries to access members of 'sband'. Channel context configuration can be NULL in multivif case and when channel switch is in progress (or) when it fails. Fix this by replacing the API 'ieee80211_get_sdata_band' with 'ieee80211_get_sband' which returns a NULL pointer for sband when the channel configuration is NULL. An example scenario is as below: In multivif mode (AP + STA) with drivers like ath10k, when we do a channel switch in the AP vif (which has a number of clients connected) and a STA vif which is connected to some other AP, when the channel switch in AP vif fails, while the STA vifs tries to connect to the other AP, there is a window where the channel context is NULL/invalid and this results in a crash while the clients connected to the AP vif tries to reconnect and this race is very similar to the one investigated by Michal in https://patchwork.kernel.org/patch/3788161/ and this does happens with hardware that supports 5Ghz alone after long hours of testing with continuous channel switch on the AP vif ieee80211 phy0: channel context reservation cannot be finalized because some interfaces aren't switching wlan0: failed to finalize CSA, disconnecting wlan0-1: deauthenticating from 8c:fd:f0:01:54:9c by local choice (Reason: 3=DEAUTH_LEAVING) WARNING: CPU: 1 PID: 19032 at net/mac80211/ieee80211_i.h:1013 sta_info_alloc+0x374/0x3fc [mac80211] [<bf77272c>] (sta_info_alloc [mac80211]) [<bf78776c>] (ieee80211_add_station [mac80211])) [<bf73cc50>] (nl80211_new_station [cfg80211]) Unable to handle kernel NULL pointer dereference at virtual address 00000014 pgd = d5f4c000 Internal error: Oops: 17 [#1] PREEMPT SMP ARM PC is at sta_info_alloc+0x380/0x3fc [mac80211] LR is at sta_info_alloc+0x37c/0x3fc [mac80211] [<bf772738>] (sta_info_alloc [mac80211]) [<bf78776c>] (ieee80211_add_station [mac80211]) [<bf73cc50>] (nl80211_new_station [cfg80211])) Cc: Michal Kazior <michal.kazior@tieto.com> Signed-off-by: Mohammed Shafi Shajakhan <mohammed@qti.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2017-04-28 12:28:44 +02:00
Paolo Abeni	1442f6f7c1	ipvs: explicitly forbid ipv6 service/dest creation if ipv6 mod is disabled When creating a new ipvs service, ipv6 addresses are always accepted if CONFIG_IP_VS_IPV6 is enabled. On dest creation the address family is not explicitly checked. This allows the user-space to configure ipvs services even if the system is booted with ipv6.disable=1. On specific configuration, ipvs can try to call ipv6 routing code at setup time, causing the kernel to oops due to fib6_rules_ops being NULL. This change addresses the issue adding a check for the ipv6 module being enabled while validating ipv6 service operations and adding the same validation for dest operations. According to git history, this issue is apparently present since the introduction of ipv6 support, and the oops can be triggered since commit `09571c7ae3` ("IPVS: Add function to determine if IPv6 address is local") Fixes: `09571c7ae3` ("IPVS: Add function to determine if IPv6 address is local") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2017-04-28 12:04:35 +02:00
Aaron Conole	fb90e8dedb	ipvs: change comparison on sync_refresh_period The sync_refresh_period variable is unsigned, so it can never be < 0. Signed-off-by: Aaron Conole <aconole@bytheb.org> Signed-off-by: Simon Horman <horms@verge.net.au>	2017-04-28 12:00:10 +02:00
Aaron Conole	65ba101ebc	ipvs: remove unused function ip_vs_set_state_timeout There are no in-tree callers of this function and it isn't exported. Signed-off-by: Aaron Conole <aconole@bytheb.org> Signed-off-by: Simon Horman <horms@verge.net.au>	2017-04-28 12:00:10 +02:00
Felix Fietkau	5fe49a9d11	mac80211: add ieee80211_tx_status_ext This allows the driver to pass in struct ieee80211_tx_status directly. Make ieee80211_tx_status_noskb a wrapper around it. As with ieee80211_tx_status_noskb, there is no _ni variant of this call, because it probably won't be needed. Even if the driver won't provide any extra status info other than what's in struct ieee80211_tx_info already, it can optimize status reporting this way by passing in the station pointer. Signed-off-by: Felix Fietkau <nbd@nbd.name> [use C99 initializers] Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2017-04-28 11:08:21 +02:00
Felix Fietkau	eefebd3164	mac80211: move ieee80211_tx_status_noskb below ieee80211_tx_status Makes further cleanups more readable Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2017-04-28 10:57:44 +02:00
Felix Fietkau	18fb84d986	mac80211: make rate control tx status API more extensible Rename .tx_status_noskb to .tx_status_ext and pass a new on-stack struct ieee80211_tx_status instead of struct ieee80211_tx_info. This struct can be used to pass extra information, e.g. for dynamic tx power control Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2017-04-28 10:57:33 +02:00
Johannes Berg	dcba665b1f	mac80211: use bitfield macros for encoded rate Instead of hand-coding the bit manipulations, use the bitfield macros to generate the code for the encoded bitrate. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2017-04-28 10:41:58 +02:00
Johannes Berg	8613c94815	mac80211: rename ieee80211_rx_status::vht_nss to just nss This field will need to be used again for HE, so rename it now. Again, mostly done with this spatch: @@ expression status; @@ -status->vht_nss +status->nss @@ expression status; @@ -status.vht_nss +status.nss Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2017-04-28 10:41:53 +02:00
Johannes Berg	da6a4352e7	mac80211: separate encoding/bandwidth from flags We currently use a lot of flags that are mutually incompatible, separate this out into actual encoding and bandwidth enum values. Much of this again done with spatch, with manual post-editing, mostly to add the switch statements and get rid of the conversions. @@ expression status; @@ -status->enc_flags \|= RX_ENC_FLAG_80MHZ +status->bw = RATE_INFO_BW_80 @@ expression status; @@ -status->enc_flags \|= RX_ENC_FLAG_40MHZ +status->bw = RATE_INFO_BW_40 @@ expression status; @@ -status->enc_flags \|= RX_ENC_FLAG_20MHZ +status->bw = RATE_INFO_BW_20 @@ expression status; @@ -status->enc_flags \|= RX_ENC_FLAG_160MHZ +status->bw = RATE_INFO_BW_160 @@ expression status; @@ -status->enc_flags \|= RX_ENC_FLAG_5MHZ +status->bw = RATE_INFO_BW_5 @@ expression status; @@ -status->enc_flags \|= RX_ENC_FLAG_10MHZ +status->bw = RATE_INFO_BW_10 @@ expression status; @@ -status->enc_flags \|= RX_ENC_FLAG_VHT +status->encoding = RX_ENC_VHT @@ expression status; @@ -status->enc_flags \|= RX_ENC_FLAG_HT +status->encoding = RX_ENC_HT @@ expression status; @@ -status.enc_flags \|= RX_ENC_FLAG_VHT +status.encoding = RX_ENC_VHT @@ expression status; @@ -status.enc_flags \|= RX_ENC_FLAG_HT +status.encoding = RX_ENC_HT @@ expression status; @@ -(status->enc_flags & RX_ENC_FLAG_HT) +(status->encoding == RX_ENC_HT) @@ expression status; @@ -(status->enc_flags & RX_ENC_FLAG_VHT) +(status->encoding == RX_ENC_VHT) @@ expression status; @@ -(status->enc_flags & RX_ENC_FLAG_5MHZ) +(status->bw == RATE_INFO_BW_5) @@ expression status; @@ -(status->enc_flags & RX_ENC_FLAG_10MHZ) +(status->bw == RATE_INFO_BW_10) @@ expression status; @@ -(status->enc_flags & RX_ENC_FLAG_40MHZ) +(status->bw == RATE_INFO_BW_40) @@ expression status; @@ -(status->enc_flags & RX_ENC_FLAG_80MHZ) +(status->bw == RATE_INFO_BW_80) @@ expression status; @@ -(status->enc_flags & RX_ENC_FLAG_160MHZ) +(status->bw == RATE_INFO_BW_160) Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2017-04-28 10:41:45 +02:00
Johannes Berg	7fdd69c5af	mac80211: clean up rate encoding bits in RX status In preparation for adding support for HE rates, clean up the driver report encoding for rate/bandwidth reporting on RX frames. Much of this patch was done with the following spatch: @@ expression status; @@ -status->flag & (RX_FLAG_HT \| RX_FLAG_VHT) +status->enc_flags & (RX_ENC_FLAG_HT \| RX_ENC_FLAG_VHT) @@ assignment operator op; expression status; @@ -status->flag op RX_FLAG_SHORTPRE +status->enc_flags op RX_ENC_FLAG_SHORTPRE @@ expression status; @@ -status->flag & RX_FLAG_SHORTPRE +status->enc_flags & RX_ENC_FLAG_SHORTPRE @@ assignment operator op; expression status; @@ -status->flag op RX_FLAG_HT +status->enc_flags op RX_ENC_FLAG_HT @@ expression status; @@ -status->flag & RX_FLAG_HT +status->enc_flags & RX_ENC_FLAG_HT @@ assignment operator op; expression status; @@ -status->flag op RX_FLAG_40MHZ +status->enc_flags op RX_ENC_FLAG_40MHZ @@ expression status; @@ -status->flag & RX_FLAG_40MHZ +status->enc_flags & RX_ENC_FLAG_40MHZ @@ assignment operator op; expression status; @@ -status->flag op RX_FLAG_SHORT_GI +status->enc_flags op RX_ENC_FLAG_SHORT_GI @@ expression status; @@ -status->flag & RX_FLAG_SHORT_GI +status->enc_flags & RX_ENC_FLAG_SHORT_GI @@ assignment operator op; expression status; @@ -status->flag op RX_FLAG_HT_GF +status->enc_flags op RX_ENC_FLAG_HT_GF @@ expression status; @@ -status->flag & RX_FLAG_HT_GF +status->enc_flags & RX_ENC_FLAG_HT_GF @@ assignment operator op; expression status; @@ -status->flag op RX_FLAG_VHT +status->enc_flags op RX_ENC_FLAG_VHT @@ expression status; @@ -status->flag & RX_FLAG_VHT +status->enc_flags & RX_ENC_FLAG_VHT @@ assignment operator op; expression status; @@ -status->flag op RX_FLAG_STBC_MASK +status->enc_flags op RX_ENC_FLAG_STBC_MASK @@ expression status; @@ -status->flag & RX_FLAG_STBC_MASK +status->enc_flags & RX_ENC_FLAG_STBC_MASK @@ assignment operator op; expression status; @@ -status->flag op RX_FLAG_LDPC +status->enc_flags op RX_ENC_FLAG_LDPC @@ expression status; @@ -status->flag & RX_FLAG_LDPC +status->enc_flags & RX_ENC_FLAG_LDPC @@ assignment operator op; expression status; @@ -status->flag op RX_FLAG_10MHZ +status->enc_flags op RX_ENC_FLAG_10MHZ @@ expression status; @@ -status->flag & RX_FLAG_10MHZ +status->enc_flags & RX_ENC_FLAG_10MHZ @@ assignment operator op; expression status; @@ -status->flag op RX_FLAG_5MHZ +status->enc_flags op RX_ENC_FLAG_5MHZ @@ expression status; @@ -status->flag & RX_FLAG_5MHZ +status->enc_flags & RX_ENC_FLAG_5MHZ @@ assignment operator op; expression status; @@ -status->vht_flag op RX_VHT_FLAG_80MHZ +status->enc_flags op RX_ENC_FLAG_80MHZ @@ expression status; @@ -status->vht_flag & RX_VHT_FLAG_80MHZ +status->enc_flags & RX_ENC_FLAG_80MHZ @@ assignment operator op; expression status; @@ -status->vht_flag op RX_VHT_FLAG_160MHZ +status->enc_flags op RX_ENC_FLAG_160MHZ @@ expression status; @@ -status->vht_flag & RX_VHT_FLAG_160MHZ +status->enc_flags & RX_ENC_FLAG_160MHZ @@ assignment operator op; expression status; @@ -status->vht_flag op RX_VHT_FLAG_BF +status->enc_flags op RX_ENC_FLAG_BF @@ expression status; @@ -status->vht_flag & RX_VHT_FLAG_BF +status->enc_flags & RX_ENC_FLAG_BF @@ assignment operator op; expression status, STBC; @@ -status->flag op STBC << RX_FLAG_STBC_SHIFT +status->enc_flags op STBC << RX_ENC_FLAG_STBC_SHIFT @@ assignment operator op; expression status; @@ -status.flag op RX_FLAG_SHORTPRE +status.enc_flags op RX_ENC_FLAG_SHORTPRE @@ expression status; @@ -status.flag & RX_FLAG_SHORTPRE +status.enc_flags & RX_ENC_FLAG_SHORTPRE @@ assignment operator op; expression status; @@ -status.flag op RX_FLAG_HT +status.enc_flags op RX_ENC_FLAG_HT @@ expression status; @@ -status.flag & RX_FLAG_HT +status.enc_flags & RX_ENC_FLAG_HT @@ assignment operator op; expression status; @@ -status.flag op RX_FLAG_40MHZ +status.enc_flags op RX_ENC_FLAG_40MHZ @@ expression status; @@ -status.flag & RX_FLAG_40MHZ +status.enc_flags & RX_ENC_FLAG_40MHZ @@ assignment operator op; expression status; @@ -status.flag op RX_FLAG_SHORT_GI +status.enc_flags op RX_ENC_FLAG_SHORT_GI @@ expression status; @@ -status.flag & RX_FLAG_SHORT_GI +status.enc_flags & RX_ENC_FLAG_SHORT_GI @@ assignment operator op; expression status; @@ -status.flag op RX_FLAG_HT_GF +status.enc_flags op RX_ENC_FLAG_HT_GF @@ expression status; @@ -status.flag & RX_FLAG_HT_GF +status.enc_flags & RX_ENC_FLAG_HT_GF @@ assignment operator op; expression status; @@ -status.flag op RX_FLAG_VHT +status.enc_flags op RX_ENC_FLAG_VHT @@ expression status; @@ -status.flag & RX_FLAG_VHT +status.enc_flags & RX_ENC_FLAG_VHT @@ assignment operator op; expression status; @@ -status.flag op RX_FLAG_STBC_MASK +status.enc_flags op RX_ENC_FLAG_STBC_MASK @@ expression status; @@ -status.flag & RX_FLAG_STBC_MASK +status.enc_flags & RX_ENC_FLAG_STBC_MASK @@ assignment operator op; expression status; @@ -status.flag op RX_FLAG_LDPC +status.enc_flags op RX_ENC_FLAG_LDPC @@ expression status; @@ -status.flag & RX_FLAG_LDPC +status.enc_flags & RX_ENC_FLAG_LDPC @@ assignment operator op; expression status; @@ -status.flag op RX_FLAG_10MHZ +status.enc_flags op RX_ENC_FLAG_10MHZ @@ expression status; @@ -status.flag & RX_FLAG_10MHZ +status.enc_flags & RX_ENC_FLAG_10MHZ @@ assignment operator op; expression status; @@ -status.flag op RX_FLAG_5MHZ +status.enc_flags op RX_ENC_FLAG_5MHZ @@ expression status; @@ -status.flag & RX_FLAG_5MHZ +status.enc_flags & RX_ENC_FLAG_5MHZ @@ assignment operator op; expression status; @@ -status.vht_flag op RX_VHT_FLAG_80MHZ +status.enc_flags op RX_ENC_FLAG_80MHZ @@ expression status; @@ -status.vht_flag & RX_VHT_FLAG_80MHZ +status.enc_flags & RX_ENC_FLAG_80MHZ @@ assignment operator op; expression status; @@ -status.vht_flag op RX_VHT_FLAG_160MHZ +status.enc_flags op RX_ENC_FLAG_160MHZ @@ expression status; @@ -status.vht_flag & RX_VHT_FLAG_160MHZ +status.enc_flags & RX_ENC_FLAG_160MHZ @@ assignment operator op; expression status; @@ -status.vht_flag op RX_VHT_FLAG_BF +status.enc_flags op RX_ENC_FLAG_BF @@ expression status; @@ -status.vht_flag & RX_VHT_FLAG_BF +status.enc_flags & RX_ENC_FLAG_BF @@ assignment operator op; expression status, STBC; @@ -status.flag op STBC << RX_FLAG_STBC_SHIFT +status.enc_flags op STBC << RX_ENC_FLAG_STBC_SHIFT @@ @@ -RX_FLAG_STBC_SHIFT +RX_ENC_FLAG_STBC_SHIFT Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2017-04-28 10:41:38 +02:00
Trond Myklebust	ed6473ddc7	NFSv4: Fix callback server shutdown We want to use kthread_stop() in order to ensure the threads are shut down before we tear down the nfs_callback_info in nfs_callback_down. Tested-and-reviewed-by: Kinglong Mee <kinglongmee@gmail.com> Reported-by: Kinglong Mee <kinglongmee@gmail.com> Fixes: `bb6aeba736` ("NFSv4.x: Switch to using svc_set_num_threads()...") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2017-04-27 18:00:16 -04:00
Trond Myklebust	9e0d87680d	SUNRPC: Refactor svc_set_num_threads() Refactor to separate out the functions of starting and stopping threads so that they can be used in other helpers. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Tested-and-reviewed-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2017-04-27 17:59:58 -04:00
Wei Yongjun	adeb45cbb5	fib_rules: fix error return code Fix to return error code -EINVAL from the error handling case instead of 0, as done elsewhere in this function. Fixes: `622ec2c9d5` ("net: core: add UID to flows, rules, and routes") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-27 16:35:57 -04:00
Mike Manning	99f906e9ad	bridge: add per-port broadcast flood flag Support for l2 multicast flood control was added in commit `b6cb5ac833` ("net: bridge: add per-port multicast flood flag"). It allows broadcast as it was introduced specifically for unknown multicast flood control. But as broadcast is a special case of multicast, this may also need to be disabled. For this purpose, introduce a flag to disable the flooding of received l2 broadcasts. This approach is backwards compatible and provides flexibility in filtering for the desired packet types. Cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Mike Manning <mmanning@brocade.com> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-27 16:34:29 -04:00
Gao Feng	06b4fc520d	net: fib: Decrease one unnecessary rt cache flush in fib_disable_ip The func fib_flush already flushes the rt cache if necessary, so it is not necessary to invoke rt_cache_flush again in fib_disable_ip. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-27 16:33:30 -04:00
Guillaume Nault	1514dc857f	l2tp: remove useless device duplication test in l2tp_eth_create() There's no need to verify that cfg->ifname is unique at this point. register_netdev() will return -EEXIST if asked to create a device with a name that's alrealy in use. Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-27 16:32:13 -04:00
Zhang Shengju	0575c86b5d	net: remove unnecessary carrier status check Since netif_carrier_on() will do nothing if device's carrier is already on, so it's unnecessary to do carrier status check. It's the same for netif_carrier_off(). Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-27 16:31:05 -04:00
Linus Torvalds	f56fc7bdaa	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: - fix orangefs handling of faults on write() - I'd missed that one back when orangefs was going through review. - readdir counterpart of "9p: cope with bogus responses from server in p9_client_{read,write}" - server might be lying or broken, and we'd better not overrun the kmalloc'ed buffer we are copying the results into. - NFS O_DIRECT read/write can leave iov_iter advanced by too much; that's what had been causing iov_iter_pipe() warnings davej had been seeing. - statx_timestamp.tv_nsec type fix (s32 -> u32). That one really should go in before 4.11. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: uapi: change the type of struct statx_timestamp.tv_nsec to unsigned fix nfs O_DIRECT advancing iov_iter too much p9_client_readdir() fix orangefs_bufmap_copy_from_iovec(): fix EFAULT handling	2017-04-27 11:09:37 -07:00
Eric Dumazet	4b726e81da	tcp: tcp_rack_reo_timeout() must update tp->tcp_mstamp I wrongly assumed tp->tcp_mstamp was up to date at the time tcp_rack_reo_timeout() was called. It is not true, since we only update tcp->tcp_mstamp when receiving a packet (as initially done in commit `69e996c58a` ("tcp: add tp->tcp_mstamp field") tcp_rack_reo_timeout() being called by a timer and not an incoming packet, we need to refresh tp->tcp_mstamp Fixes: `7c1c730859` ("tcp: do not pass timestamp to tcp_rack_detect_loss()") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-27 11:46:15 -04:00
Sabrina Dubroca	cfcf99f987	xfrm: fix GRO for !CONFIG_NETFILTER In xfrm_input() when called from GRO, async == 0, and we end up skipping the processing in xfrm4_transport_finish(). GRO path will always skip the NF_HOOK, so we don't need the special-case for !NETFILTER during GRO processing. Fixes: `7785bba299` ("esp: Add a software GRO codepath") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2017-04-27 12:20:19 +02:00
Oliver Hartkopp	c2701b370e	can: fix CAN BCM build with CONFIG_PROC_FS disabled The introduced namespace support moved the BCM variables for procfs into a per-net data structure. This leads to a build failure with disabled procfs: on x86_64: when CONFIG_PROC_FS is not enabled: ../net/can/bcm.c:1541:14: error: 'struct netns_can' has no member named 'bcmproc_dir' ../net/can/bcm.c:1601:14: error: 'struct netns_can' has no member named 'bcmproc_dir' ../net/can/bcm.c:1696:11: error: 'struct netns_can' has no member named 'bcmproc_dir' ../net/can/bcm.c:1707:15: error: 'struct netns_can' has no member named 'bcmproc_dir' http://marc.info/?l=linux-can&m=149321842526524&w=2 Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>	2017-04-27 09:34:13 +02:00
David S. Miller	b1513c3531	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-26 22:39:08 -04:00
Luca Coelho	e5f2e0671e	mac80211: make multicast variable a bool in ieee80211_accept_frame() The multicast variable in the ieee80211_accept_frame() function is treated as a boolean, but defined as int. Fix that. Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2017-04-26 23:17:44 +02:00
Johannes Berg	4a19906823	mac80211: disentangle iflist_mtx and chanctx_mtx At least on iwlwifi, sometimes lockdep complains that we can lock chanctx_mtx -> mvm.mutex -> iflist_mtx (due to iterate_interfaces) and iflist_mtx -> chanctx_mtx Remove the latter dependency in mac80211 by using the RTNL that we already hold in one case, and can relatively easily achieve in the other case. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2017-04-26 23:17:44 +02:00
Emmanuel Grumbach	cf147085fd	mac80211: don't parse encrypted management frames in ieee80211_frame_acked ieee80211_frame_acked is called when a frame is acked by the peer. In case this is a management frame, we check if this an SMPS frame, in which case we can update our antenna configuration. When we parse the management frame we look at the category in case it is an action frame. That byte sits after the IV in case the frame was encrypted. This means that if the frame was encrypted, we basically look at the IV instead of looking at the category. It is then theorically possible that we think that an SMPS action frame was acked where really we had another frame that was encrypted. Since the only management frame whose ack needs to be tracked is the SMPS action frame, and that frame is not a robust management frame, it will never be encrypted. The easiest way to fix this problem is then to not look at frames that were encrypted. Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2017-04-26 23:17:43 +02:00
Arend Van Spriel	3a3ecf1d59	cfg80211: add request id parameter to .sched_scan_stop() signature For multiple scheduled scan support the driver needs to know which scheduled scan request is being stopped. Pass the request id in the .sched_scan_stop() callback. Reviewed-by: Hante Meuleman <hante.meuleman@broadcom.com> Reviewed-by: Pieter-Paul Giesberts <pieter-paul.giesberts@broadcom.com> Reviewed-by: Franky Lin <franky.lin@broadcom.com> Signed-off-by: Arend van Spriel <arend.vanspriel@broadcom.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2017-04-26 23:17:40 +02:00
Arend Van Spriel	3007e3529c	nl80211: add support for BSSIDs in scheduled scan matchsets This patch allows for the scheduled scan request to specify matchsets for specific BSSIDs. Reviewed-by: Hante Meuleman <hante.meuleman@broadcom.com> Reviewed-by: Pieter-Paul Giesberts <pieter-paul.giesberts@broadcom.com> Reviewed-by: Franky Lin <franky.lin@broadcom.com> Signed-off-by: Arend van Spriel <arend.vanspriel@broadcom.com> [docs, netlink policy fix] Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2017-04-26 23:17:39 +02:00
Arend Van Spriel	ca986ad9bc	nl80211: allow multiple active scheduled scan requests This patch implements the idea to have multiple scheduled scan requests running concurrently. It mainly illustrates how to deal with the incoming request from user-space in terms of backward compatibility. In order to use multiple scheduled scans user-space needs to provide a flag attribute NL80211_ATTR_SCHED_SCAN_MULTI to indicate support. If not the request is treated as a legacy scan. Drivers currently supporting scheduled scan are now indicating they support a single scheduled scan request. This obsoletes WIPHY_FLAG_SUPPORTS_SCHED_SCAN. Reviewed-by: Hante Meuleman <hante.meuleman@broadcom.com> Reviewed-by: Pieter-Paul Giesberts <pieter-paul.giesberts@broadcom.com> Reviewed-by: Franky Lin <franky.lin@broadcom.com> Signed-off-by: Arend van Spriel <arend.vanspriel@broadcom.com> [clean up netlink destroy path to avoid allocations, code cleanups] Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2017-04-26 23:17:38 +02:00
Johannes Berg	ab81007a7b	cfg80211: simplify netlink socket owner interface deletion There's no need to allocate a portid structure and then, for each of those, walk the interfaces - we can just add a flag to each interface and walk those directly. Due to padding in the struct, we can even do it without any memory cost, and it even simplifies the code. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2017-04-26 23:17:35 +02:00
Jamie Bainbridge	105f5528b9	ipv6: check raw payload size correctly in ioctl In situations where an skb is paged, the transport header pointer and tail pointer can be the same because the skb contents are in frags. This results in ioctl(SIOCINQ/FIONREAD) incorrectly returning a length of 0 when the length to receive is actually greater than zero. skb->len is already correctly set in ip6_input_finish() with pskb_pull(), so use skb->len as it always returns the correct result for both linear and paged data. Signed-off-by: Jamie Bainbridge <jbainbri@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-26 14:59:35 -04:00
Wei Wang	c120144407	tcp: memset ca_priv data to 0 properly Always zero out ca_priv data in tcp_assign_congestion_control() so that ca_priv data is cleared out during socket creation. Also always zero out ca_priv data in tcp_reinit_congestion_control() so that when cc algorithm is changed, ca_priv data is cleared out as well. We should still zero out ca_priv data even in TCP_CLOSE state because user could call connect() on AF_UNSPEC to disconnect the socket and leave it in TCP_CLOSE state and later call setsockopt() to switch cc algorithm on this socket. Fixes: `2b0a8c9ee` ("tcp: add CDG congestion control") Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-26 14:58:32 -04:00
WANG Cong	199ab00f3c	ipv6: check skb->protocol before lookup for nexthop Andrey reported a out-of-bound access in ip6_tnl_xmit(), this is because we use an ipv4 dst in ip6_tnl_xmit() and cast an IPv4 neigh key as an IPv6 address: neigh = dst_neigh_lookup(skb_dst(skb), &ipv6_hdr(skb)->daddr); if (!neigh) goto tx_err_link_failure; addr6 = (struct in6_addr *)&neigh->primary_key; // <=== HERE addr_type = ipv6_addr_type(addr6); if (addr_type == IPV6_ADDR_ANY) addr6 = &ipv6_hdr(skb)->daddr; memcpy(&fl6->daddr, addr6, sizeof(fl6->daddr)); Also the network header of the skb at this point should be still IPv4 for 4in6 tunnels, we shold not just use it as IPv6 header. This patch fixes it by checking if skb->protocol is ETH_P_IPV6: if it is, we are safe to do the nexthop lookup using skb_dst() and ipv6_hdr(skb)->daddr; if not (aka IPv4), we have no clue about which dest address we can pick here, we have to rely on callers to fill it from tunnel config, so just fall to ip6_route_output() to make the decision. Fixes: `ea3dc9601b` ("ip6_tunnel: Add support for wildcard tunnel endpoints.") Reported-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-26 14:51:26 -04:00
Myungho Jung	9899886d5e	net: core: Prevent from dereferencing null pointer when releasing SKB Added NULL check to make __dev_kfree_skb_irq consistent with kfree family of functions. Link: https://bugzilla.kernel.org/show_bug.cgi?id=195289 Signed-off-by: Myungho Jung <mhjungk@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-26 14:47:14 -04:00
Eric Dumazet	645f4c6f2e	tcp: switch rcv_rtt_est and rcvq_space to high resolution timestamps Some devices or distributions use HZ=100 or HZ=250 TCP receive buffer autotuning has poor behavior caused by this choice. Since autotuning happens after 4 ms or 10 ms, short distance flows get their receive buffer tuned to a very high value, but after an initial period where it was frozen to (too small) initial value. With tp->tcp_mstamp introduction, we can switch to high resolution timestamps almost for free (at the expense of 8 additional bytes per TCP structure) Note that some TCP stacks use usec TCP timestamps where this patch makes even more sense : Many TCP flows have < 500 usec RTT. Hopefully this finer TS option can be standardized soon. Tested: HZ=100 kernel ./netperf -H lpaa24 -t TCP_RR -l 1000 -- -r 10000,10000 & Peer without patch : lpaa24:~# ss -tmi dst lpaa23 ... skmem:(r0,rb8388608,...) rcv_rtt:10 rcv_space:3210000 minrtt:0.017 Peer with the patch : lpaa23:~# ss -tmi dst lpaa24 ... skmem:(r0,rb428800,...) rcv_rtt:0.069 rcv_space:30000 minrtt:0.017 We can see saner RCVBUF, and more precise rcv_rtt information. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-26 14:44:39 -04:00
Eric Dumazet	a6db50b81e	tcp: remove ack_time from struct tcp_sacktag_state It is no longer needed, everything uses tp->tcp_mstamp instead. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-26 14:44:38 -04:00
Eric Dumazet	7e0ca8a4c1	tcp: use tp->tcp_mstamp in tcp_clean_rtx_queue() Following patch will remove ack_time from struct tcp_sacktag_state Same info is now found in tp->tcp_mstamp Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-26 14:44:38 -04:00
Eric Dumazet	d2329f102d	tcp: do not pass timestamp to tcp_rack_advance() No longer needed, since tp->tcp_mstamp holds the information. This is needed to remove sack_state.ack_time in a following patch. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-26 14:44:38 -04:00
Eric Dumazet	88d5c65098	tcp: do not pass timestamp to tcp_rate_gen() No longer needed, since tp->tcp_mstamp holds the information. This is needed to remove sack_state.ack_time in a following patch. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-26 14:44:38 -04:00
Eric Dumazet	1317a9d69f	tcp: do not pass timestamp to tcp_fastretrans_alert() Not used anymore now tp->tcp_mstamp holds the information. This is needed to remove sack_state.ack_time in a following patch. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-26 14:44:38 -04:00
Eric Dumazet	efab8f8582	tcp: do not pass timestamp to tcp_rack_identify_loss() Not used anymore now tp->tcp_mstamp holds the information. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-26 14:44:37 -04:00
Eric Dumazet	128eda86be	tcp: do not pass timestamp to tcp_rack_mark_lost() This is no longer used, since tcp_rack_detect_loss() takes the timestamp from tp->tcp_mstamp Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-26 14:44:37 -04:00
Eric Dumazet	7c1c730859	tcp: do not pass timestamp to tcp_rack_detect_loss() We can use tp->tcp_mstamp as it contains a recent timestamp. This removes a call to skb_mstamp_get() from tcp_rack_reo_timeout() Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-26 14:44:37 -04:00
Eric Dumazet	69e996c58a	tcp: add tp->tcp_mstamp field We want to use precise timestamps in TCP stack, but we do not want to call possibly expensive kernel time services too often. tp->tcp_mstamp is guaranteed to be updated once per incoming packet. We will use it in the following patches, removing specific skb_mstamp_get() calls, and removing ack_time from struct tcp_sacktag_state. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-26 14:44:36 -04:00
Al Viro	eea86b637a	Merge branches 'uaccess.alpha', 'uaccess.arc', 'uaccess.arm', 'uaccess.arm64', 'uaccess.avr32', 'uaccess.bfin', 'uaccess.c6x', 'uaccess.cris', 'uaccess.frv', 'uaccess.h8300', 'uaccess.hexagon', 'uaccess.ia64', 'uaccess.m32r', 'uaccess.m68k', 'uaccess.metag', 'uaccess.microblaze', 'uaccess.mips', 'uaccess.mn10300', 'uaccess.nios2', 'uaccess.openrisc', 'uaccess.parisc', 'uaccess.powerpc', 'uaccess.s390', 'uaccess.score', 'uaccess.sh', 'uaccess.sparc', 'uaccess.tile', 'uaccess.um', 'uaccess.unicore32', 'uaccess.x86' and 'uaccess.xtensa' into work.uaccess	2017-04-26 12:06:59 -04:00
Xin Long	35db069121	xfrm: do the garbage collection after flushing policy Now xfrm garbage collection can be triggered by 'ip xfrm policy del'. These is no reason not to do it after flushing policies, especially considering that 'garbage collection deferred' is only triggered when it reaches gc_thresh. It's no good that the policy is gone but the xdst still hold there. The worse thing is that xdst->route/orig_dst is also hold and can not be released even if the orig_dst is already expired. This patch is to do the garbage collection if there is any policy removed in xfrm_policy_flush. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2017-04-26 10:34:32 +02:00
Florian Westphal	9a08ecfe74	netfilter: don't attach a nat extension by default nowadays the NAT extension only stores the interface index (used to purge connections that got masqueraded when interface goes down) and pptp nat information. Previous patches moved nf_ct_nat_ext_add to those places that need it. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-26 09:30:22 +02:00
Florian Westphal	2fe7c321ab	netfilter: pptp: attach nat extension when needed make sure nat extension gets added if the master conntrack is subject to NAT. This will be required once the nat core stops adding it by default. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-26 09:30:22 +02:00
Florian Westphal	ff459018d7	netfilter: masquerade: attach nat extension if not present Currently the nat extension is always attached as soon as nat module is loaded. However, most NAT uses do not need the nat extension anymore. Prepare to remove the add-nat-by-default by making those places that need it attach it if its not present yet. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-26 09:30:22 +02:00
Florian Westphal	22d4536d2c	netfilter: conntrack: handle initial extension alloc via krealloc krealloc(NULL, ..) is same as kmalloc(), so we can avoid special-casing the initial allocation after the prealloc removal (we had to use ->alloc_len as the initial allocation size). This also means we do not zero the preallocated memory anymore; only offsets[]. Existing code makes sure the new (used) extension space gets zeroed out. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-26 09:30:22 +02:00
Florian Westphal	23f671a1b5	netfilter: conntrack: mark extension structs as const Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-26 09:30:22 +02:00
Florian Westphal	54044b1f02	netfilter: conntrack: remove prealloc support It was used by the nat extension, but since commit `7c96643519` ("netfilter: move nat hlist_head to nf_conn") its only needed for connections that use MASQUERADE target or a nat helper. Also it seems a lot easier to preallocate a fixed size instead. With default settings, conntrack first adds ecache extension (sysctl defaults to 1), so we get 40(ct extension header) + 24 (ecache) == 64 byte on x86_64 for initial allocation. Followup patches can constify the extension structs and avoid the initial zeroing of the entire extension area. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-26 09:30:22 +02:00
Gao Feng	495dcb56d0	netfilter: SYNPROXY: Return NF_STOLEN instead of NF_DROP during handshaking Current SYNPROXY codes return NF_DROP during normal TCP handshaking, it is not friendly to caller. Because the nf_hook_slow would treat the NF_DROP as an error, and return -EPERM. As a result, it may cause the top caller think it meets one error. For example, the following codes are from cfv_rx_poll() err = netif_receive_skb(skb); if (unlikely(err)) { ++cfv->ndev->stats.rx_dropped; } else { ++cfv->ndev->stats.rx_packets; cfv->ndev->stats.rx_bytes += skb_len; } When SYNPROXY returns NF_DROP, then netif_receive_skb returns -EPERM. As a result, the cfv driver would treat it as an error, and increase the rx_dropped counter. So use NF_STOLEN instead of NF_DROP now because there is no error happened indeed, and free the skb directly. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-26 09:30:22 +02:00
Florian Westphal	aee12a0a37	ebtables: remove nf_hook_register usage Similar to ip_register_table, pass nf_hook_ops to ebt_register_table(). This allows to handle hook registration also via pernet_ops and allows us to avoid use of legacy register_hook api. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-26 09:30:21 +02:00
Florian Westphal	1a0ed0ad48	netfilter: decnet: only register hooks in init namespace looks like decnet isn't namespacified in first place, so restrict hook registration to the initial namespace. Prepares for eventual removal of legacy nf_register_hook() api. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-26 09:30:21 +02:00
Florian Westphal	efe4160618	ipvs: convert to use pernet nf_hook api nf_(un)register_hooks has to maintain an internal hook list to add/remove those hooks from net namespaces as they are added/deleted. ipvs already uses pernet_ops, so we can switch to the (more recent) pernet hook api instead. Compile tested only. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-26 09:30:21 +02:00
Florian Westphal	1fefe14725	netfilter: synproxy: only register hooks when needed Defer registration of the synproxy hooks until the first SYNPROXY rule is added. Also means we only register hooks in namespaces that need it. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-26 09:30:21 +02:00
Trond Myklebust	35a2442189	NFS: NFS over RDMA Client Side Changes New Features: - Break RDMA connections after a connection timeout - Support for unloading the underlying device driver Bugfixes and cleanups: - Mark the receive workqueue as "read-mostly" - Silence warnings caused by ENOBUFS - Update a comment in xdr_init_decode_pages() - Remove rpcrdma_buffer->rb_pool. -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlj/tSAACgkQ18tUv7Cl QOtC/Q/9ElH3UEK4uF7yc14B6LhwcX0n9Ka47CqPnil/+4aN1HK8Oa/cHd3NxeJP B8MOoRLVI9VrY3bLIpbzl49z+RVssgnuMDAQFqCPHja5YxzxVKnLApFUw5WyuDyl FXp/qZUkHyZymp3VIuY7c/iRK3KIJm7J3Ca5JxRc+x4Vu2jYBRWUt3+8cC+pPTM+ MSGxSsjsfE0isIOKt7Z5nucx5sdlbCZXoZzrcOL2tZ4IP+5rgYk+3W+H6yg0hzCv P+Bv6Ce0Ye0ebcLpi+rMJlYjy4e5YbWuMqhnVrhR4NEtAJMH4NSvg5rT3iv/CKDf vJnSvoOSERowlDmvK3h8BAQ9u3V81u3C21xRDdiCgrIfNBvmzthNyYV9fuGZkR8Q BCNDKji7r+uWxhFwX+X4D1izBRTEv7PoHLQCF8WDPqU2M8dLjF+mpM7yzSRzt7pF 8u9WGEtIr0l+YtOYRYS8UcQuDv5GVl5z5hoS/MZeifuWoJOAcfxeCtSENTK1Ftt9 4ysM298umF7rMHRUrSnI3d3OKIwTkdMwVmZscAvNLHR2VvuwqrwGM4B0NmADK1Za y3/tzAgL3jKl3dTI1Ny9djBgpxSnnAwLa+I92LbiessP9woGgOYKmfruDHFzS/yp +4JRbjqaXB6iFtpjDre3H3CCEwrRvBZJ1TOQwc84z6xqMs8kmL0= =F9zg -----END PGP SIGNATURE----- Merge tag 'nfs-rdma-4.12-1' of git://git.linux-nfs.org/projects/anna/nfs-rdma NFS: NFS over RDMA Client Side Changes New Features: - Break RDMA connections after a connection timeout - Support for unloading the underlying device driver Bugfixes and cleanups: - Mark the receive workqueue as "read-mostly" - Silence warnings caused by ENOBUFS - Update a comment in xdr_init_decode_pages() - Remove rpcrdma_buffer->rb_pool.	2017-04-25 18:42:48 -04:00
Chuck Lever	dadf3e435d	svcrdma: Clean out old XDR encoders Clean up: These have been replaced and are no longer used. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2017-04-25 17:25:56 -04:00
Chuck Lever	2cf32924c6	svcrdma: Remove the req_map cache req_maps are no longer used by the send path and can thus be removed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2017-04-25 17:25:55 -04:00
Chuck Lever	68cc4636bb	svcrdma: Remove unused RDMA Write completion handler Clean up. All RDMA Write completions are now handled by svc_rdma_wc_write_ctx. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2017-04-25 17:25:55 -04:00
Chuck Lever	ded8d19641	svcrdma: Reduce size of sge array in struct svc_rdma_op_ctxt The sge array in struct svc_rdma_op_ctxt is no longer used for sending RDMA Write WRs. It need only accommodate the construction of Send and Receive WRs. The maximum inline size is the largest payload it needs to handle now. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2017-04-25 17:25:55 -04:00
Chuck Lever	f5821c76b2	svcrdma: Clean up RPC-over-RDMA backchannel reply processing Replace C structure-based XDR decoding with pointer arithmetic. Pointer arithmetic is considered more portable. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2017-04-25 17:25:55 -04:00
Chuck Lever	4757d90b15	svcrdma: Report Write/Reply chunk overruns Observed at Connectathon 2017. If a client has underestimated the size of a Write or Reply chunk, the Linux server writes as much payload data as it can, then it recognizes there was a problem and closes the connection without sending the transport header. This creates a couple of problems: <> The client never receives indication of the server-side failure, so it continues to retransmit the bad RPC. Forward progress on the transport is blocked. <> The reply payload pages are not moved out of the svc_rqst, thus they can be released by the RPC server before the RDMA Writes have completed. The new rdma_rw-ized helpers return a distinct error code when a Write/Reply chunk overrun occurs, so it's now easy for the caller (svc_rdma_sendto) to recognize this case. Instead of dropping the connection, post an RDMA_ERROR message. The client now sees an RDMA_ERROR and can properly terminate the RPC transaction. As part of the new logic, set up the same delayed release for these payload pages as would have occurred in the normal case. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2017-04-25 17:25:55 -04:00
Chuck Lever	6b19cc5ca2	svcrdma: Clean up RDMA_ERROR path Now that svc_rdma_sendto has been renovated, svc_rdma_send_error can be refactored to reduce code duplication and remove C structure- based XDR encoding. It is also relocated to the source file that contains its only caller. This is a refactoring change only. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2017-04-25 17:25:55 -04:00
Chuck Lever	9a6a180b78	svcrdma: Use rdma_rw API in RPC reply path The current svcrdma sendto code path posts one RDMA Write WR at a time. Each of these Writes typically carries a small number of pages (for instance, up to 30 pages for mlx4 devices). That means a 1MB NFS READ reply requires 9 ib_post_send() calls for the Write WRs, and one for the Send WR carrying the actual RPC Reply message. Instead, use the new rdma_rw API. The details of Write WR chain construction and memory registration are taken care of in the RDMA core. svcrdma can focus on the details of the RPC-over-RDMA protocol. This gives three main benefits: 1. All Write WRs for one RDMA segment are posted in a single chain. As few as one ib_post_send() for each Write chunk. 2. The Write path can now use FRWR to register the Write buffers. If the device's maximum page list depth is large, this means a single Write WR is needed for each RPC's Write chunk data. 3. The new code introduces support for RPCs that carry both a Write list and a Reply chunk. This combination can be used for an NFSv4 READ where the data payload is large, and thus is removed from the Payload Stream, but the Payload Stream is still larger than the inline threshold. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2017-04-25 17:25:55 -04:00
Chuck Lever	f13193f50b	svcrdma: Introduce local rdma_rw API helpers The plan is to replace the local bespoke code that constructs and posts RDMA Read and Write Work Requests with calls to the rdma_rw API. This shares code with other RDMA-enabled ULPs that manages the gory details of buffer registration and posting Work Requests. Some design notes: o The structure of RPC-over-RDMA transport headers is flexible, allowing multiple segments per Reply with arbitrary alignment, each with a unique R_key. Write and Send WRs continue to be built and posted in separate code paths. However, one whole chunk (with one or more RDMA segments apiece) gets exactly one ib_post_send and one work completion. o svc_xprt reference counting is modified, since a chain of rdma_rw_ctx structs generates one completion, no matter how many Write WRs are posted. o The current code builds the transport header as it is construct- ing Write WRs. I've replaced that with marshaling of transport header data items in a separate step. This is because the exact structure of client-provided segments may not align with the components of the server's reply xdr_buf, or the pages in the page list. Thus parts of each client-provided segment may be written at different points in the send path. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2017-04-25 17:25:55 -04:00
Chuck Lever	c238c4c034	svcrdma: Clean up svc_rdma_get_inv_rkey() Replace C structure-based XDR decoding with more portable code that instead uses pointer arithmetic. This is a refactoring change only. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2017-04-25 17:25:54 -04:00
Chuck Lever	c55ab0707b	svcrdma: Add helper to save pages under I/O Clean up: extract the logic to save pages under I/O into a helper to add a big documenting comment without adding clutter in the send path. This is a refactoring change only. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2017-04-25 17:25:54 -04:00
Chuck Lever	b623589dba	svcrdma: Eliminate RPCRDMA_SQ_DEPTH_MULT The Send Queue depth is temporarily reduced to 1 SQE per credit. The new rdma_rw API does an internal computation, during QP creation, to increase the depth of the Send Queue to handle RDMA Read and Write operations. This change has to come before the NFSD code paths are updated to use the rdma_rw API. Without this patch, rdma_rw_init_qp() increases the size of the SQ too much, resulting in memory allocation failures during QP creation. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2017-04-25 17:25:54 -04:00
Chuck Lever	6e6092ca30	svcrdma: Add svc_rdma_map_reply_hdr() Introduce a helper to DMA-map a reply's transport header before sending it. This will in part replace the map vector cache. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2017-04-25 17:25:54 -04:00
Chuck Lever	17f5f7f506	svcrdma: Move send_wr to svc_rdma_op_ctxt Clean up: Move the ib_send_wr off the stack, and move common code to post a Send Work Request into a helper. This is a refactoring change only. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2017-04-25 17:25:54 -04:00
Chuck Lever	2be1fce95e	xprtrdma: Remove rpcrdma_buffer::rb_pool Since commit `1e465fd4ff` ("xprtrdma: Replace send and receive arrays"), this field is no longer used. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-04-25 16:12:35 -04:00
Chuck Lever	7ecce75fc3	sunrpc: Fix xdr_init_decode_pages() documenting comment Clean up. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-04-25 16:12:34 -04:00
Chuck Lever	0031e47c76	xprtrdma: Squelch ENOBUFS warnings When ro_map is out of buffers, that's not a permanent error, so don't report a problem. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-04-25 16:12:33 -04:00
Chuck Lever	7d7fa9b550	xprtrdma: Annotate receive workqueue Micro-optimize the receive workqueue by marking it's anchor "read- mostly." Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-04-25 16:12:31 -04:00
Chuck Lever	56a6bd154d	xprtrdma: Revert commit `d0f36c46de` Device removal is now adequately supported. Pinning the underlying device driver to prevent removal while an NFS mount is active is no longer necessary. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-04-25 16:12:30 -04:00
Chuck Lever	a9b0e381ca	xprtrdma: Restore transport after device removal After a device removal, enable the transport connect worker to restore normal operation if there is another device with connectivity to the server. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-04-25 16:12:28 -04:00
Chuck Lever	1890896b4e	xprtrdma: Refactor rpcrdma_ep_connect I'm about to add another arm to if (ep->rep_connected != 0) It will be cleaner to use a switch statement here. We'll be looking for a couple of specific errnos, or "anything else," basically to sort out the difference between a normal reconnect and recovery from device removal. This is a refactoring change only. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-04-25 16:12:27 -04:00
Chuck Lever	bebd031866	xprtrdma: Support unplugging an HCA from under an NFS mount The device driver for the underlying physical device associated with an RPC-over-RDMA transport can be removed while RPC-over-RDMA transports are still in use (ie, while NFS filesystems are still mounted and active). The IB core performs a connection event upcall to request that consumers free all RDMA resources associated with a transport. There may be pending RPCs when this occurs. Care must be taken to release associated resources without leaving references that can trigger a subsequent crash if a signal or soft timeout occurs. We rely on the caller of the transport's ->close method to ensure that the previous RPC task has invoked xprt_release but the transport remains write-locked. A DEVICE_REMOVE upcall forces a disconnect then sleeps. When ->close is invoked, it destroys the transport's H/W resources, then wakes the upcall, which completes and allows the core driver unload to continue. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=266 Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-04-25 16:12:24 -04:00
Chuck Lever	91a10c5297	xprtrdma: Use same device when mapping or syncing DMA buffers When the underlying device driver is reloaded, ia->ri_device will be replaced. All cached copies of that device pointer have to be updated as well. Commit `54cbd6b0c6` ("xprtrdma: Delay DMA mapping Send and Receive buffers") added the rg_device field to each regbuf. As part of handling a device removal, rpcrdma_dma_unmap_regbuf is invoked on all regbufs for a transport. Simply calling rpcrdma_dma_map_regbuf for each Receive buffer after the driver has been reloaded should reinitialize rg_device correctly for every case except rpcrdma_wc_receive, which still uses rpcrdma_rep::rr_device. Ensure the same device that was used to map a Receive buffer is also used to sync it in rpcrdma_wc_receive by using rg_device there instead of rr_device. This is the only use of rr_device, so it can be removed. The use of regbufs in the send path is also updated, for completeness. Fixes: `54cbd6b0c6` ("xprtrdma: Delay DMA mapping Send and ... ") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-04-25 16:12:22 -04:00
Chuck Lever	fff09594ed	xprtrdma: Refactor rpcrdma_ia_open() In order to unload a device driver and reload it, xprtrdma will need to close a transport's interface adapter, and then call rpcrdma_ia_open again, possibly finding a different interface adapter. Make rpcrdma_ia_open safe to call on the same transport multiple times. This is a refactoring change only. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-04-25 16:12:20 -04:00
Chuck Lever	33849792cb	xprtrdma: Detect unreachable NFS/RDMA servers more reliably Current NFS clients rely on connection loss to determine when to retransmit. In particular, for protocols like NFSv4, clients no longer rely on RPC timeouts to drive retransmission: NFSv4 servers are required to terminate a connection when they need a client to retransmit pending RPCs. When a server is no longer reachable, either because it has crashed or because the network path has broken, the server cannot actively terminate a connection. Thus NFS clients depend on transport-level keepalive to determine when a connection must be replaced and pending RPCs retransmitted. However, RDMA RC connections do not have a native keepalive mechanism. If an NFS/RDMA server crashes after a client has sent RPCs successfully (an RC ACK has been received for all OTW RDMA requests), there is no way for the client to know the connection is moribund. In addition, new RDMA requests are subject to the RPC-over-RDMA credit limit. If the client has consumed all granted credits with NFS traffic, it is not allowed to send another RDMA request until the server replies. Thus it has no way to send a true keepalive when the workload has already consumed all credits with pending RPCs. To address this, forcibly disconnect a transport when an RPC times out. This prevents moribund connections from stopping the detection of failover or other configuration changes on the server. Note that even if the connection is still good, retransmitting any RPC will trigger a disconnect thanks to this logic in xprt_rdma_send_request: /* Must suppress retransmit to maintain credits */ if (req->rl_connect_cookie == xprt->connect_cookie) goto drop_connection; req->rl_connect_cookie = xprt->connect_cookie; Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-04-25 16:12:19 -04:00
Chuck Lever	e2a4f4fbef	sunrpc: Export xprt_force_disconnect() xprt_force_disconnect() is already invoked from the socket transport. I want to invoke xprt_force_disconnect() from the RPC-over-RDMA transport, which is a separate module from sunrpc.ko. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-04-25 16:12:16 -04:00
Chuck Lever	9378b274e1	xprtrdma: Cancel refresh worker during buffer shutdown Trying to create MRs while the transport is being torn down can cause a crash. Fixes: `e2ac236c0b` ("xprtrdma: Allocate MRs on demand") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-04-25 16:12:14 -04:00
Johannes Berg	127f60bfa9	mac80211: rewrite monitor mode delivery logic The monitor mode delivery logic makes it hard to add any kind of filtering in an efficient way, because the monitor SKB is created first and then passed to all interfaces. Rewrite the logic to create the monitor SKB the first time it's actually needed, and then keep delivering it. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2017-04-25 21:43:37 +02:00
Vasanthakumar Thiagarajan	cd50ac0f31	cfg80211: Fix dfs state propagation for non-DFS center channel When part of a bigger bandwidth (160 MHz) channel falls in DFS channel range it is possible that the center frequency may not necessarily be a radar channel. Remove the sanity check on channel flag for IEEE80211_CHAN_RADAR in regulatory_propagate_dfs_state(), this should fix the dfs state propagation for non-DFS center freq which has DFS channels in it's bandwidth, should also fix unnecessary WARN_ON() spam in regulatory_propagate_dfs_state(). Fixes: `8976672736` ("cfg80211: Share Channel DFS state across wiphys of same DFS domain") Reported-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Vasanthakumar Thiagarajan <vthiagar@qti.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2017-04-25 21:42:52 +02:00
Alexander Potapenko	fd2c83b357	net/packet: check length in getsockopt() called with PACKET_HDRLEN In the case getsockopt() is called with PACKET_HDRLEN and optlen < 4 \|val\| remains uninitialized and the syscall may behave differently depending on its value, and even copy garbage to userspace on certain architectures. To fix this we now return -EINVAL if optlen is too small. This bug has been detected with KMSAN. Signed-off-by: Alexander Potapenko <glider@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-25 14:05:52 -04:00
David Ahern	8048ced9be	net: ipv6: regenerate host route if moved to gc list Taking down the loopback device wreaks havoc on IPv6 routing. By extension, taking down a VRF device wreaks havoc on its table. Dmitry and Andrey both reported heap out-of-bounds reports in the IPv6 FIB code while running syzkaller fuzzer. The root cause is a dead dst that is on the garbage list gets reinserted into the IPv6 FIB. While on the gc (or perhaps when it gets added to the gc list) the dst->next is set to an IPv4 dst. A subsequent walk of the ipv6 tables causes the out-of-bounds access. Andrey's reproducer was the key to getting to the bottom of this. With IPv6, host routes for an address have the dst->dev set to the loopback device. When the 'lo' device is taken down, rt6_ifdown initiates a walk of the fib evicting routes with the 'lo' device which means all host routes are removed. That process moves the dst which is attached to an inet6_ifaddr to the gc list and marks it as dead. The recent change to keep global IPv6 addresses added a new function, fixup_permanent_addr, that is called on admin up. That function restarts dad for an inet6_ifaddr and when it completes the host route attached to it is inserted into the fib. Since the route was marked dead and moved to the gc list, re-inserting the route causes the reported out-of-bounds accesses. If the device with the address is taken down or the address is removed, the WARN_ON in fib6_del is triggered. All of those faults are fixed by regenerating the host route if the existing one has been moved to the gc list, something that can be determined by checking if the rt6i_ref counter is 0. Fixes: `f1705ec197` ("net: ipv6: Make address flushing on ifdown optional") Reported-by: Dmitry Vyukov <dvyukov@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-25 14:04:44 -04:00
Xin Long	b1b9d36602	bridge: move bridge multicast cleanup to ndo_uninit During removing a bridge device, if the bridge is still up, a new mdb entry still can be added in br_multicast_add_group() after all mdb entries are removed in br_multicast_dev_del(). Like the path: mld_ifc_timer_expire -> mld_sendpack -> ... br_multicast_rcv -> br_multicast_add_group The new mp's timer will be set up. If the timer expires after the bridge is freed, it may cause use-after-free panic in br_multicast_group_expired. BUG: unable to handle kernel NULL pointer dereference at 0000000000000048 IP: [<ffffffffa07ed2c8>] br_multicast_group_expired+0x28/0xb0 [bridge] Call Trace: <IRQ> [<ffffffff81094536>] call_timer_fn+0x36/0x110 [<ffffffffa07ed2a0>] ? br_mdb_free+0x30/0x30 [bridge] [<ffffffff81096967>] run_timer_softirq+0x237/0x340 [<ffffffff8108dcbf>] __do_softirq+0xef/0x280 [<ffffffff8169889c>] call_softirq+0x1c/0x30 [<ffffffff8102c275>] do_softirq+0x65/0xa0 [<ffffffff8108e055>] irq_exit+0x115/0x120 [<ffffffff81699515>] smp_apic_timer_interrupt+0x45/0x60 [<ffffffff81697a5d>] apic_timer_interrupt+0x6d/0x80 Nikolay also found it would cause a memory leak - the mdb hash is reallocated and not freed due to the mdb rehash. unreferenced object 0xffff8800540ba800 (size 2048): backtrace: [<ffffffff816e2287>] kmemleak_alloc+0x67/0xc0 [<ffffffff81260bea>] __kmalloc+0x1ba/0x3e0 [<ffffffffa05c60ee>] br_mdb_rehash+0x5e/0x340 [bridge] [<ffffffffa05c74af>] br_multicast_new_group+0x43f/0x6e0 [bridge] [<ffffffffa05c7aa3>] br_multicast_add_group+0x203/0x260 [bridge] [<ffffffffa05ca4b5>] br_multicast_rcv+0x945/0x11d0 [bridge] [<ffffffffa05b6b10>] br_dev_xmit+0x180/0x470 [bridge] [<ffffffff815c781b>] dev_hard_start_xmit+0xbb/0x3d0 [<ffffffff815c8743>] __dev_queue_xmit+0xb13/0xc10 [<ffffffff815c8850>] dev_queue_xmit+0x10/0x20 [<ffffffffa02f8d7a>] ip6_finish_output2+0x5ca/0xac0 [ipv6] [<ffffffffa02fbfc6>] ip6_finish_output+0x126/0x2c0 [ipv6] [<ffffffffa02fc245>] ip6_output+0xe5/0x390 [ipv6] [<ffffffffa032b92c>] NF_HOOK.constprop.44+0x6c/0x240 [ipv6] [<ffffffffa032bd16>] mld_sendpack+0x216/0x3e0 [ipv6] [<ffffffffa032d5eb>] mld_ifc_timer_expire+0x18b/0x2b0 [ipv6] This could happen when ip link remove a bridge or destroy a netns with a bridge device inside. With Nikolay's suggestion, this patch is to clean up bridge multicast in ndo_uninit after bridge dev is shutdown, instead of br_dev_delete, so that netif_running check in br_multicast_add_group can avoid this issue. v1->v2: - fix this issue by moving br_multicast_dev_del to ndo_uninit, instead of calling dev_close in br_dev_delete. (NOTE: Depends upon `b6fe0440c6` ("bridge: implement missing ndo_uninit()")) Fixes: `e10177abf8` ("bridge: multicast: fix handling of temp and perm entries") Reported-by: Jianwen Ji <jiji@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-25 14:02:39 -04:00
Sabrina Dubroca	ec9c4215fe	ipv6: fix source routing Commit `a149e7c7ce` ("ipv6: sr: add support for SRH injection through setsockopt") introduced handling of IPV6_SRCRT_TYPE_4, but at the same time restricted it to only IPV6_SRCRT_TYPE_0 and IPV6_SRCRT_TYPE_4. Previously, ipv6_push_exthdr() and fl6_update_dst() would also handle other values (ie STRICT and TYPE_2). Restore previous source routing behavior, by handling IPV6_SRCRT_STRICT and IPV6_SRCRT_TYPE_2 the same way as IPV6_SRCRT_TYPE_0 in ipv6_push_exthdr() and fl6_update_dst(). Fixes: `a149e7c7ce` ("ipv6: sr: add support for SRH injection through setsockopt") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-25 13:59:24 -04:00
David S. Miller	b5cdae3291	net: Generic XDP This provides a generic SKB based non-optimized XDP path which is used if either the driver lacks a specific XDP implementation, or the user requests it via a new IFLA_XDP_FLAGS value named XDP_FLAGS_SKB_MODE. It is arguable that perhaps I should have required something like this as part of the initial XDP feature merge. I believe this is critical for two reasons: 1) Accessibility. More people can play with XDP with less dependencies. Yes I know we have XDP support in virtio_net, but that just creates another depedency for learning how to use this facility. I wrote this to make life easier for the XDP newbies. 2) As a model for what the expected semantics are. If there is a pure generic core implementation, it serves as a semantic example for driver folks adding XDP support. One thing I have not tried to address here is the issue of XDP_PACKET_HEADROOM, thanks to Daniel for spotting that. It seems incredibly expensive to do a skb_cow(skb, XDP_PACKET_HEADROOM) or whatever even if the XDP program doesn't try to push headers at all. I think we really need the verifier to somehow propagate whether certain XDP helpers are used or not. v5: - Handle both negative and positive offset after running prog - Fix mac length in XDP_TX case (Alexei) - Use rcu_dereference_protected() in free_netdev (kbuild test robot) v4: - Fix MAC header adjustmnet before calling prog (David Ahern) - Disable LRO when generic XDP is installed (Michael Chan) - Bypass qdisc et al. on XDP_TX and record the event (Alexei) - Do not perform generic XDP on reinjected packets (DaveM) v3: - Make sure XDP program sees packet at MAC header, push back MAC header if we do XDP_TX. (Alexei) - Elide GRO when generic XDP is in use. (Alexei) - Add XDP_FLAG_SKB_MODE flag which the user can use to request generic XDP even if the driver has an XDP implementation. (Alexei) - Report whether SKB mode is in use in rtnl_xdp_fill() via XDP_FLAGS attribute. (Daniel) v2: - Add some "fall through" comments in switch statements based upon feedback from Andrew Lunn - Use RCU for generic xdp_prog, thanks to Johannes Berg. Tested-by: Andy Gospodarek <andy@greyhouse.net> Tested-by: Jesper Dangaard Brouer <brouer@redhat.com> Tested-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-25 13:33:49 -04:00
Parthasarathy Bhuvaragan	05ff837897	tipc: fix socket flow control accounting error at tipc_recv_stream Until now in tipc_recv_stream(), we update the received unacknowledged bytes based on a stack variable and not based on the actual message size. If the user buffer passed at tipc_recv_stream() is smaller than the received skb, the size variable in stack differs from the actual message size in the skb. This leads to a flow control accounting error causing permanent congestion. In this commit, we fix this accounting error by always using the size of the incoming message. Fixes: `10724cc7bb` ("tipc: redesign connection-level flow control") Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-25 11:45:38 -04:00
Parthasarathy Bhuvaragan	3364d61c92	tipc: fix socket flow control accounting error at tipc_send_stream Until now in tipc_send_stream(), we return -1 when the socket encounters link congestion even if the socket had successfully sent partial data. This is incorrect as the application resends the same the partial data leading to data corruption at receiver's end. In this commit, we return the partially sent bytes as the return value at link congestion. Fixes: `10724cc7bb` ("tipc: redesign connection-level flow control") Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-25 11:45:37 -04:00
Paolo Abeni	b7d6df5751	ipv6: move stub initialization after ipv6 setup completion The ipv6 stub pointer is currently initialized before the ipv6 routing subsystem: a 3rd party can access and use such stub before the routing data is ready. Moreover, such pointer is not cleared in case of initialization error, possibly leading to dangling pointers usage. This change addresses the above moving the stub initialization at the end of ipv6 init code. Fixes: `5f81bd2e5d` ("ipv6: export a stub for IPv6 symbols used by vxlan") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-25 11:43:16 -04:00
Guillaume Nault	a485c2b877	l2tp: define "l2tpeth" device type Export type of l2tpeth interfaces to userspace (/sys/class/net/<iface>/uevent). Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Acked-by: James Chapman <jchapman@katalix.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-25 11:41:56 -04:00
Guillaume Nault	c39855febc	l2tp: set name_assign_type for devices created by l2tp_eth.c Export naming scheme used when creating l2tpeth interfaces (/sys/class/net/<iface>/name_assign_type). This let userspace know if the device's name has been generated automatically or defined manually. Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Acked-by: James Chapman <jchapman@katalix.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-25 11:41:56 -04:00
Jamal Hadi Salim	e0ee84ded7	net sched actions: Complete the JUMPX opcode per discussion at netconf/netdev: When we have an action that is capable of branching (example a policer), we can achieve a continuation of the action graph by programming a "continue" where we find an exact replica of the same filter rule with a lower priority and the remainder of the action graph. When you have 100s of thousands of filters which require such a feature it gets very inefficient to do two lookups. This patch completes a leftover feature of action codes. Its time has come. Example below where a user labels packets with a different skbmark on ingress of a port depending on whether they have/not exceeded the configured rate. This mark is then used to make further decisions on some egress port. #rate control, very low so we can easily see the effect sudo $TC actions add action police rate 1kbit burst 90k \ conform-exceed pipe/jump 2 index 10 # skbedit index 11 will be used if the user conforms sudo $TC actions add action skbedit mark 11 ok index 11 # skbedit index 12 will be used if the user does not conform sudo $TC actions add action skbedit mark 12 ok index 12 #lets bind the user .. sudo $TC filter add dev $ETH parent ffff: protocol ip prio 8 u32 \ match ip dst 127.0.0.8/32 flowid 1:10 \ action police index 10 \ action skbedit index 11 \ action skbedit index 12 #run a ping -f and see what happens.. # jhs@foobar:~$ sudo $TC -s filter ls dev $ETH parent ffff: protocol ip filter pref 8 u32 filter pref 8 u32 fh 800: ht divisor 1 filter pref 8 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:10 (rule hit 2800 success 1005) match 7f000008/ffffffff at 16 (success 1005 ) action order 1: police 0xa rate 1Kbit burst 23440b mtu 2Kb action pipe/jump 2 overhead 0b ref 2 bind 1 installed 207 sec used 122 sec Action statistics: Sent 84420 bytes 1005 pkt (dropped 0, overlimits 721 requeues 0) backlog 0b 0p requeues 0 action order 2: skbedit mark 11 pass index 11 ref 2 bind 1 installed 204 sec used 122 sec Action statistics: Sent 60564 bytes 721 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 action order 3: skbedit mark 12 pass index 12 ref 2 bind 1 installed 201 sec used 122 sec Action statistics: Sent 23856 bytes 284 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 Not bad, about 28% non-conforming packets.. Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-25 11:30:06 -04:00
Dave Johnson	9dd2ab609e	netfilter: Wrong icmp6 checksum for ICMPV6_TIME_EXCEED in reverse SNATv6 path When recalculating the outer ICMPv6 checksum for a reverse path NATv6 such as ICMPV6_TIME_EXCEED nf_nat_icmpv6_reply_translation() was accessing data beyond the headlen of the skb for non-linear skb. This resulted in incorrect ICMPv6 checksum as garbage data was used. Patch replaces csum_partial() with skb_checksum() which supports non-linear skbs similar to nf_nat_icmp_reply_translation() from ipv4. Signed-off-by: Dave Johnson <dave-kernel@centerclick.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-25 11:10:38 +02:00
Liping Zhang	277a292835	netfilter: nft_dynset: continue to next expr if _OP_ADD succeeded Currently, after adding the following nft rules: # nft add set x target1 { type ipv4_addr \; flags timeout \;} # nft add rule x y set add ip daddr timeout 1d @target1 counter the counters will always be zero despite of the elements are added to the dynamic set "target1" or not, as we will break the nft expr traversal unconditionally: # nft list ruleset ... set target1 { ... elements = { 8.8.8.8 expires 23h59m53s} } chain output { ... set add ip daddr timeout 1d @target1 counter packets 0 bytes 0 ^ ^ ... } Since we add the elements to the set successfully, we should continue to the next expression. Additionally, if elements are added to "flow table" successfully, we will _always_ continue to the next expr, even if the operation is _OP_ADD. So it's better to keep them to be consistent. Fixes: `22fe54d5fe` ("netfilter: nf_tables: add support for dynamic set updates") Reported-by: Robert White <rwhite@pobox.com> Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-25 11:10:37 +02:00
Linus Lüssing	cf3cb246e2	bridge: ebtables: fix reception of frames DNAT-ed to bridge device/port When trying to redirect bridged frames to the bridge device itself or a bridge port (brouting) via the dnat target then this currently fails: The ethernet destination of the frame is dnat'ed to the MAC address of the bridge device or port just fine. However, the IP code drops it in the beginning of ip_input.c/ip_rcv() as the dnat target left the skb->pkt_type as PACKET_OTHERHOST. Fixing this by resetting skb->pkt_type to an appropriate type after dnat'ing. Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-25 11:08:31 +02:00
Oliver Hartkopp	1ef83310b8	can: network namespace support for CAN gateway The CAN gateway was not implemented as per-net in the initial network namespace support by Mario Kicherer (`8e8cda6d73`). This patch enables the CAN gateway to be used in different namespaces. Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>	2017-04-25 09:04:30 +02:00
Oliver Hartkopp	384317ef41	can: network namespace support for CAN_BCM protocol The CAN_BCM protocol and its procfs entries were not implemented as per-net in the initial network namespace support by Mario Kicherer (`8e8cda6d73`). This patch adds the missing per-net functionality for the CAN BCM. Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>	2017-04-25 09:04:29 +02:00
Oliver Hartkopp	cb5635a367	can: complete initial namespace support The statistics and its proc output was not implemented as per-net in the initial network namespace support by Mario Kicherer (`8e8cda6d73`). This patch adds the missing per-net statistics for the CAN subsystem. Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>	2017-04-25 09:04:29 +02:00
Oliver Hartkopp	f2e72f43e7	can: remove obsolete definitions can_rx_alldev_list is a per-net data structure now. Remove it's definition here and can_rx_dev_list too. Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>	2017-04-25 09:04:28 +02:00
Oliver Hartkopp	48452c169d	can: remove obsolete pernet_operations definitions The namespace support for the CAN subsystem does not need any additional memory. So when ".size = 0" there's no extra memory allocated by the system. And therefore ".id" is obsolete too. Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>	2017-04-25 09:04:28 +02:00
Oliver Hartkopp	a7bbd28f04	can: fix memory leak in initial namespace support The can_rx_alldev_list is a per-net data structure now and allocated in can_pernet_init(). Make sure the memory is free'd in can_pernet_exit() too. Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>	2017-04-25 09:04:27 +02:00
Salvatore Benedetto	58771c1cb0	Bluetooth: convert smp and selftest to crypto kpp API * Convert both smp and selftest to crypto kpp API * Remove module ecc as no more required * Add ecdh_helper functions for wrapping kpp async calls This patch has been tested only with selftest, which is called on module loading. Signed-off-by: Salvatore Benedetto <salvatore.benedetto@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2017-04-25 04:53:42 +02:00
Pan Bian	78302fd405	tipc: check return value of nlmsg_new Function nlmsg_new() will return a NULL pointer if there is no enough memory, and its return value should be checked before it is used. However, in function tipc_nl_node_get_monitor(), the validation of the return value of function nlmsg_new() is missed. This patch fixes the bug. Signed-off-by: Pan Bian <bianpan2016@163.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-24 15:51:30 -04:00
Pan Bian	a50fe0ffd7	lwtunnel: check return value of nla_nest_start Function nla_nest_start() may return a NULL pointer on error. However, in function lwtunnel_fill_encap(), the return value of nla_nest_start() is not validated before it is used. This patch checks the return value of nla_nest_start() against NULL. Signed-off-by: Pan Bian <bianpan2016@163.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-24 15:51:30 -04:00
Benjamin LaHaise	a577d8f793	cls_flower: add support for matching MPLS fields (v2) Add support to the tc flower classifier to match based on fields in MPLS labels (TTL, Bottom of Stack, TC field, Label). Signed-off-by: Benjamin LaHaise <benjamin.lahaise@netronome.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Simon Horman <simon.horman@netronome.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jiri Pirko <jiri@mellanox.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Hadar Hen Zion <hadarh@mellanox.com> Cc: Gao Feng <fgao@ikuai8.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-24 14:30:46 -04:00
Benjamin LaHaise	029c1ecbb2	flow_dissector: add mpls support (v2) Add support for parsing MPLS flows to the flow dissector in preparation for adding MPLS match support to cls_flower. Signed-off-by: Benjamin LaHaise <benjamin.lahaise@netronome.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Simon Horman <simon.horman@netronome.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jiri Pirko <jiri@mellanox.com> Cc: Eric Dumazet <jhs@mojatatu.com> Cc: Hadar Hen Zion <hadarh@mellanox.com> Cc: Gao Feng <fgao@ikuai8.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-24 14:30:46 -04:00
Wei Wang	59450f8d83	net/tcp_fastopen: Remove mss check in tcp_write_timeout() Christoph Paasch from Apple found another firewall issue for TFO: After successful 3WHS using TFO, server and client starts to exchange data. Afterwards, a 10s idle time occurs on this connection. After that, firewall starts to drop every packet on this connection. The fix for this issue is to extend existing firewall blackhole detection logic in tcp_write_timeout() by removing the mss check. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-24 14:27:17 -04:00
Wei Wang	46c2fa3987	net/tcp_fastopen: Add snmp counter for blackhole detection This counter records the number of times the firewall blackhole issue is detected and active TFO is disabled. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-24 14:27:17 -04:00
Wei Wang	cf1ef3f071	net/tcp_fastopen: Disable active side TFO in certain scenarios Middlebox firewall issues can potentially cause server's data being blackholed after a successful 3WHS using TFO. Following are the related reports from Apple: https://www.nanog.org/sites/default/files/Paasch_Network_Support.pdf Slide 31 identifies an issue where the client ACK to the server's data sent during a TFO'd handshake is dropped. C ---> syn-data ---> S C <--- syn/ack ----- S C (accept & write) C <---- data ------- S C ----- ACK -> X S [retry and timeout] https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf Slide 5 shows a similar situation that the server's data gets dropped after 3WHS. C ---- syn-data ---> S C <--- syn/ack ----- S C ---- ack --------> S S (accept & write) C? X <- data ------ S [retry and timeout] This is the worst failure b/c the client can not detect such behavior to mitigate the situation (such as disabling TFO). Failing to proceed, the application (e.g., SSL library) may simply timeout and retry with TFO again, and the process repeats indefinitely. The proposed solution is to disable active TFO globally under the following circumstances: 1. client side TFO socket detects out of order FIN 2. client side TFO socket receives out of order RST We disable active side TFO globally for 1hr at first. Then if it happens again, we disable it for 2h, then 4h, 8h, ... And we reset the timeout to 1hr if a client side TFO sockets not opened on loopback has successfully received data segs from server. And we examine this condition during close(). The rational behind it is that when such firewall issue happens, application running on the client should eventually close the socket as it is not able to get the data it is expecting. Or application running on the server should close the socket as it is not able to receive any response from client. In both cases, out of order FIN or RST will get received on the client given that the firewall will not block them as no data are in those frames. And we want to disable active TFO globally as it helps if the middle box is very close to the client and most of the connections are likely to fail. Also, add a debug sysctl: tcp_fastopen_blackhole_detect_timeout_sec: the initial timeout to use when firewall blackhole issue happens. This can be set and read. When setting it to 0, it means to disable the active disable logic. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-24 14:27:17 -04:00
David S. Miller	bc95cd8e8b	mlx5-updates-2017-04-22 Sparse and compiler warnings fixes from Stephen Hemminger. From Roi Dayan and Or Gerlitz, Add devlink and mlx5 support for controlling E-Switch encapsulation mode, this knob will enable HW support for applying encapsulation/decapsulation to VF traffic as part of SRIOV e-switch offloading. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJY+5cRAAoJEEg/ir3gV/o+5c8H/1/khPzy26B2lWyjPC8CRCQF eSd0tiHLgIqbZTbnIHTR+NbZ/SUFaukoJi8OKn1fGFHCCajWvPP4xkENVKrUdi3q kOgNZb/R1V0j6SdELyoMalFPjAscTgdmwYMnry+vcjOxJ+H2uUTnMKXwFf8IsBjz EINy8oZ5jZcejmft0c2O5HN4Bt/7U5ttM3CroAdcvPT9lq2DFJL2uCABhTO/1DdY b7uVa47FnkqxX19Ebn7fjp5r3diGYOmCPMjdC89C//rbkLB8FN61EkcSLpGY3YNm djmCPQ+xaa3ielmBpOk3AMayFEtYW0nDMj9eWECVByadRQZ2qz9wTVXBp5CX9zg= =E3Jt -----END PGP SIGNATURE----- Merge tag 'mlx5-updates-2017-04-22' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-updates-2017-04-22 Sparse and compiler warnings fixes from Stephen Hemminger. From Roi Dayan and Or Gerlitz, Add devlink and mlx5 support for controlling E-Switch encapsulation mode, this knob will enable HW support for applying encapsulation/decapsulation to VF traffic as part of SRIOV e-switch offloading. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-24 14:11:10 -04:00
David Ahern	58c4c6a3f7	net: add rcu locking when changing early demux systemd-sysctl is triggering a suspicious RCU usage message when net.ipv4.tcp_early_demux or net.ipv4.udp_early_demux is changed via a sysctl config file: [ 33.896184] =============================== [ 33.899558] [ ERR: suspicious RCU usage. ] [ 33.900624] 4.11.0-rc7+ #104 Not tainted [ 33.901698] ------------------------------- [ 33.903059] /home/dsa/kernel-2.git/net/ipv4/sysctl_net_ipv4.c:305 suspicious rcu_dereference_check() usage! [ 33.905724] other info that might help us debug this: [ 33.907656] rcu_scheduler_active = 2, debug_locks = 0 [ 33.909288] 1 lock held by systemd-sysctl/143: [ 33.910373] #0: (sb_writers#5){.+.+.+}, at: [<ffffffff8123a370>] file_start_write+0x45/0x48 [ 33.912407] stack backtrace: [ 33.914018] CPU: 0 PID: 143 Comm: systemd-sysctl Not tainted 4.11.0-rc7+ #104 [ 33.915631] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014 [ 33.917870] Call Trace: [ 33.918431] dump_stack+0x81/0xb6 [ 33.919241] lockdep_rcu_suspicious+0x10f/0x118 [ 33.920263] proc_configure_early_demux+0x65/0x10a [ 33.921391] proc_udp_early_demux+0x3a/0x41 add rcu locking to proc_configure_early_demux. Fixes: `dddb64bcb3` ("net: Add sysctl to toggle early demux for tcp and udp") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-24 14:08:19 -04:00
David Ahern	fc1f8f4f31	net: ipv6: send unsolicited NA if enabled for all interfaces When arp_notify is set to 1 for either a specific interface or for 'all' interfaces, gratuitous arp requests are sent. Since ndisc_notify is the ipv6 equivalent to arp_notify, it should follow the same semantics. Commit `4a6e3c5def` ("net: ipv6: send unsolicited NA on admin up") sends the NA on admin up. The final piece is checking devconf_all->ndisc_notify in addition to the per device setting. Add it. Fixes: `5cb04436ee` ("ipv6: add knob to send unsolicited ND on link-layer address change") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-24 14:07:18 -04:00
Peter Tirsek	6bd3d19292	netfilter: xt_socket: Fix broken IPv6 handling Commit `834184b1f3` ("netfilter: defrag: only register defrag functionality if needed") used the outdated XT_SOCKET_HAVE_IPV6 macro which was removed earlier in commit `8db4c5be88` ("netfilter: move socket lookup infrastructure to nf_socket_ipv{4,6}.c"). With that macro never being defined, the xt_socket match emits an "Unknown family 10" warning when used with IPv6: WARNING: CPU: 0 PID: 1377 at net/netfilter/xt_socket.c:160 socket_mt_enable_defrag+0x47/0x50 [xt_socket] Unknown family 10 Modules linked in: xt_socket nf_socket_ipv4 nf_socket_ipv6 nf_defrag_ipv4 [...] CPU: 0 PID: 1377 Comm: ip6tables-resto Not tainted 4.10.10 #1 Hardware name: [...] Call Trace: ? __warn+0xe7/0x100 ? socket_mt_enable_defrag+0x47/0x50 [xt_socket] ? socket_mt_enable_defrag+0x47/0x50 [xt_socket] ? warn_slowpath_fmt+0x39/0x40 ? socket_mt_enable_defrag+0x47/0x50 [xt_socket] ? socket_mt_v2_check+0x12/0x40 [xt_socket] ? xt_check_match+0x6b/0x1a0 [x_tables] ? xt_find_match+0x93/0xd0 [x_tables] ? xt_request_find_match+0x20/0x80 [x_tables] ? translate_table+0x48e/0x870 [ip6_tables] ? translate_table+0x577/0x870 [ip6_tables] ? walk_component+0x3a/0x200 ? kmalloc_order+0x1d/0x50 ? do_ip6t_set_ctl+0x181/0x490 [ip6_tables] ? filename_lookup+0xa5/0x120 ? nf_setsockopt+0x3a/0x60 ? ipv6_setsockopt+0xb0/0xc0 ? sock_common_setsockopt+0x23/0x30 ? SyS_socketcall+0x41d/0x630 ? vfs_read+0xfa/0x120 ? do_fast_syscall_32+0x7a/0x110 ? entry_SYSENTER_32+0x47/0x71 This patch brings the conditional back in line with how the rest of the file handles IPv6. Fixes: `834184b1f3` ("netfilter: defrag: only register defrag functionality if needed") Signed-off-by: Peter Tirsek <peter@tirsek.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-24 20:06:29 +02:00
Liping Zhang	64f3967c7a	netfilter: ctnetlink: acquire ct->lock before operating nf_ct_seqadj We should acquire the ct->lock before accessing or modifying the nf_ct_seqadj, as another CPU may modify the nf_ct_seqadj at the same time during its packet proccessing. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-24 20:06:29 +02:00
Liping Zhang	53b56da83d	netfilter: ctnetlink: make it safer when updating ct->status After converting to use rcu for conntrack hash, one CPU may update the ct->status via ctnetlink, while another CPU may process the packets and update the ct->status. So the non-atomic operation "ct->status \|= status;" via ctnetlink becomes unsafe, and this may clear the IPS_DYING_BIT bit set by another CPU unexpectedly. For example: CPU0 CPU1 ctnetlink_change_status __nf_conntrack_find_get old = ct->status nf_ct_gc_expired - nf_ct_kill - test_and_set_bit(IPS_DYING_BIT new = old \| status; - ct->status = new; <-- oops, _DYING_ is cleared! Now using a series of atomic bit operation to solve the above issue. Also note, user shouldn't set IPS_TEMPLATE, IPS_SEQ_ADJUST directly, so make these two bits be unchangable too. If we set the IPS_TEMPLATE_BIT, ct will be freed by nf_ct_tmpl_free, but actually it is alloced by nf_conntrack_alloc. If we set the IPS_SEQ_ADJUST_BIT, this may cause the NULL pointer deference, as the nfct_seqadj(ct) maybe NULL. Last, add some comments to describe the logic change due to the commit `a963d710f3` ("netfilter: ctnetlink: Fix regression in CTA_STATUS processing"), which makes me feel a little confusing. Fixes: `76507f69c4` ("[NETFILTER]: nf_conntrack: use RCU for conntrack hash") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-24 20:06:28 +02:00
Liping Zhang	88be4c09d9	netfilter: ctnetlink: fix deadlock due to acquire _expect_lock twice Currently, ctnetlink_change_conntrack is always protected by _expect_lock, but this will cause a deadlock when deleting the helper from a conntrack, as the _expect_lock will be acquired again by nf_ct_remove_expectations: CPU0 ---- lock(nf_conntrack_expect_lock); lock(nf_conntrack_expect_lock); * DEADLOCK * May be due to missing lock nesting notation 2 locks held by lt-conntrack_gr/12853: #0: (&table[i].mutex){+.+.+.}, at: [<ffffffffa05e2009>] nfnetlink_rcv_msg+0x399/0x6a9 [nfnetlink] #1: (nf_conntrack_expect_lock){+.....}, at: [<ffffffffa05f2c1f>] ctnetlink_new_conntrack+0x17f/0x408 [nf_conntrack_netlink] Call Trace: dump_stack+0x85/0xc2 __lock_acquire+0x1608/0x1680 ? ctnetlink_parse_tuple_proto+0x10f/0x1c0 [nf_conntrack_netlink] lock_acquire+0x100/0x1f0 ? nf_ct_remove_expectations+0x32/0x90 [nf_conntrack] _raw_spin_lock_bh+0x3f/0x50 ? nf_ct_remove_expectations+0x32/0x90 [nf_conntrack] nf_ct_remove_expectations+0x32/0x90 [nf_conntrack] ctnetlink_change_helper+0xc6/0x190 [nf_conntrack_netlink] ctnetlink_new_conntrack+0x1b2/0x408 [nf_conntrack_netlink] nfnetlink_rcv_msg+0x60a/0x6a9 [nfnetlink] ? nfnetlink_rcv_msg+0x1b9/0x6a9 [nfnetlink] ? nfnetlink_bind+0x1a0/0x1a0 [nfnetlink] netlink_rcv_skb+0xa4/0xc0 nfnetlink_rcv+0x87/0x770 [nfnetlink] Since the operations are unrelated to nf_ct_expect, so we can drop the _expect_lock. Also note, after removing the _expect_lock protection, another CPU may invoke nf_conntrack_helper_unregister, so we should use rcu_read_lock to protect __nf_conntrack_helper_find invoked by ctnetlink_change_helper. Fixes: `ca7433df3a` ("netfilter: conntrack: seperate expect locking from nf_conntrack_lock") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-24 20:06:28 +02:00
Liping Zhang	14e5676156	netfilter: ctnetlink: drop the incorrect cthelper module request First, when creating a new ct, we will invoke request_module to try to load the related inkernel cthelper. So there's no need to call the request_module again when updating the ct helpinfo. Second, ctnetlink_change_helper may be called with rcu_read_lock held, i.e. rcu_read_lock -> nfqnl_recv_verdict -> nfqnl_ct_parse -> ctnetlink_glue_parse -> ctnetlink_glue_parse_ct -> ctnetlink_change_helper. But the request_module invocation may sleep, so we can't call it with the rcu_read_lock held. Remove it now. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-24 20:06:28 +02:00
Liping Zhang	54a5f9d9ab	netfilter: nft_set_bitmap: free dummy elements when destroy the set We forget to free dummy elements when deleting the set. So when I was running nft-test.py, I saw many kmemleak warnings: kmemleak: 1344 new suspected memory leaks ... # cat /sys/kernel/debug/kmemleak unreferenced object 0xffff8800631345c8 (size 32): comm "nft", pid 9075, jiffies 4295743309 (age 1354.815s) hex dump (first 32 bytes): f8 63 13 63 00 88 ff ff 88 79 13 63 00 88 ff ff .c.c.....y.c.... 04 0c 00 00 00 00 00 00 00 00 00 00 08 03 00 00 ................ backtrace: [<ffffffff819059da>] kmemleak_alloc+0x4a/0xa0 [<ffffffff81288174>] __kmalloc+0x164/0x310 [<ffffffffa061269d>] nft_set_elem_init+0x3d/0x1b0 [nf_tables] [<ffffffffa06130da>] nft_add_set_elem+0x45a/0x8c0 [nf_tables] [<ffffffffa0613645>] nf_tables_newsetelem+0x105/0x1d0 [nf_tables] [<ffffffffa05fe6d4>] nfnetlink_rcv+0x414/0x770 [nfnetlink] [<ffffffff817f0ca6>] netlink_unicast+0x1f6/0x310 [<ffffffff817f10c6>] netlink_sendmsg+0x306/0x3b0 ... Fixes: `e920dde516` ("netfilter: nft_set_bitmap: keep a list of dummy elements") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-24 20:05:25 +02:00
Liping Zhang	66e5a6b18b	netfilter: nf_ct_helper: permit cthelpers with different names via nfnetlink cthelpers added via nfnetlink may have the same tuple, i.e. except for the l3proto and l4proto, other fields are all zero. So even with the different names, we will also fail to add them: # nfct helper add ssdp inet udp # nfct helper add tftp inet udp nfct v1.4.3: netlink error: File exists So in order to avoid unpredictable behaviour, we should: 1. cthelpers can be selected by nft ct helper obj or xt_CT target, so report error if duplicated { name, l3proto, l4proto } tuple exist. 2. cthelpers can be selected by nf_ct_tuple_src_mask_cmp when nf_ct_auto_assign_helper is enabled, so also report error if duplicated { l3proto, l4proto, src-port } tuple exist. Also note, if the cthelper is added from userspace, then the src-port will always be zero, it's invalid for nf_ct_auto_assign_helper, so there's no need to check the second point listed above. Fixes: `893e093c78` ("netfilter: nf_ct_helper: bail out on duplicated helpers") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-24 20:05:05 +02:00
Jarno Rajahalme	cf5d709188	openvswitch: Delete conntrack entry clashing with an expectation. Conntrack helpers do not check for a potentially clashing conntrack entry when creating a new expectation. Also, nf_conntrack_in() will check expectations (via init_conntrack()) only if a conntrack entry can not be found. The expectation for a packet which also matches an existing conntrack entry will not be removed by conntrack, and is currently handled inconsistently by OVS, as OVS expects the expectation to be removed when the connection tracking entry matching that expectation is confirmed. It should be noted that normally an IP stack would not allow reuse of a 5-tuple of an old (possibly lingering) connection for a new data connection, so this is somewhat unlikely corner case. However, it is possible that a misbehaving source could cause conntrack entries be created that could then interfere with new related connections. Fix this in the OVS module by deleting the clashing conntrack entry after an expectation has been matched. This causes the following nf_conntrack_in() call also find the expectation and remove it when creating the new conntrack entry, as well as the forthcoming reply direction packets to match the new related connection instead of the old clashing conntrack entry. Fixes: `7f8a436eaa` ("openvswitch: Add conntrack action") Reported-by: Yang Song <yangsong@vmware.com> Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Joe Stringer <joe@ovn.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-24 20:04:41 +02:00
Gao Feng	470acf55a0	netfilter: xt_CT: fix refcnt leak on error path There are two cases which causes refcnt leak. 1. When nf_ct_timeout_ext_add failed in xt_ct_set_timeout, it should free the timeout refcnt. Now goto the err_put_timeout error handler instead of going ahead. 2. When the time policy is not found, we should call module_put. Otherwise, the related cthelper module cannot be removed anymore. It is easy to reproduce by typing the following command: # iptables -t raw -A OUTPUT -p tcp -j CT --helper ftp --timeout xxx Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-24 20:03:01 +02:00
Jarno Rajahalme	120645513f	openvswitch: Add eventmask support to CT action. Add a new optional conntrack action attribute OVS_CT_ATTR_EVENTMASK, which can be used in conjunction with the commit flag (OVS_CT_ATTR_COMMIT) to set the mask of bits specifying which conntrack events (IPCT_*) should be delivered via the Netfilter netlink multicast groups. Default behavior depends on the system configuration, but typically a lot of events are delivered. This can be very chatty for the NFNLGRP_CONNTRACK_UPDATE group, even if only some types of events are of interest. Netfilter core init_conntrack() adds the event cache extension, so we only need to set the ctmask value. However, if the system is configured without support for events, the setting will be skipped due to extension not being found. Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Reviewed-by: Greg Rose <gvrose8192@gmail.com> Acked-by: Joe Stringer <joe@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-24 13:53:25 -04:00
Jarno Rajahalme	abd0a4f2b4	openvswitch: Typo fix. Fix typo in a comment. Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-24 13:53:24 -04:00
Ansis Atteka	b40c5f4fde	udp: disable inner UDP checksum offloads in IPsec case Otherwise, UDP checksum offloads could corrupt ESP packets by attempting to calculate UDP checksum when this inner UDP packet is already protected by IPsec. One way to reproduce this bug is to have a VM with virtio_net driver (UFO set to ON in the guest VM); and then encapsulate all guest's Ethernet frames in Geneve; and then further encrypt Geneve with IPsec. In this case following symptoms are observed: 1. If using ixgbe NIC, then it will complain with following error message: ixgbe 0000:01:00.1: partial checksum but l4 proto=32! 2. Receiving IPsec stack will drop all the corrupted ESP packets and increase XfrmInStateProtoError counter in /proc/net/xfrm_stat. 3. iperf UDP test from the VM with packet sizes above MTU will not work at all. 4. iperf TCP test from the VM will get ridiculously low performance because. Signed-off-by: Ansis Atteka <aatteka@ovn.org> Co-authored-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-24 13:48:54 -04:00
Robert Shearman	b7c8487cb3	ipv4: Avoid caching l3mdev dst on mismatched local route David reported that doing the following: ip li add red type vrf table 10 ip link set dev eth1 vrf red ip addr add 127.0.0.1/8 dev red ip link set dev eth1 up ip li set red up ping -c1 -w1 -I red 127.0.0.1 ip li del red when either policy routing IP rules are present or the local table lookup ip rule is before the l3mdev lookup results in a hang with these messages: unregister_netdevice: waiting for red to become free. Usage count = 1 The problem is caused by caching the dst used for sending the packet out of the specified interface on a local route with a different nexthop interface. Thus the dst could stay around until the route in the table the lookup was done is deleted which may be never. Address the problem by not forcing output device to be the l3mdev in the flow's output interface if the lookup didn't use the l3mdev. This then results in the dst using the right device according to the route. Changes in v2: - make the dev_out passed in by __ip_route_output_key_hash correct instead of checking the nh dev if FLOWI_FLAG_SKIP_NH_OIF is set as suggested by David. Fixes: `5f02ce24c2` ("net: l3mdev: Allow the l3mdev to be a loopback") Reported-by: David Ahern <dsa@cumulusnetworks.com> Suggested-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: Robert Shearman <rshearma@brocade.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Tested-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-24 12:50:29 -04:00

1 2 3 4 5 ...

46307 Commits