PACKET_RX_RING can cause multiple writers to access the same slot if a
fast writer wraps the ring while a slow writer is still copying. This
is particularly likely with few, large, slots (e.g., GSO packets).
Synchronize kernel thread ownership of rx ring slots with a bitmap.
Writers acquire a slot race-free by testing tp_status TP_STATUS_KERNEL
while holding the sk receive queue lock. They release this lock before
copying and set tp_status to TP_STATUS_USER to release to userspace
when done. During copying, another writer may take the lock, also see
TP_STATUS_KERNEL, and start writing to the same slot.
Introduce a new rx_owner_map bitmap with a bit per slot. To acquire a
slot, test and set with the lock held. To release race-free, update
tp_status and owner bit as a transaction, so take the lock again.
This is the one of a variety of discussed options (see Link below):
* instead of a shadow ring, embed the data in the slot itself, such as
in tp_padding. But any test for this field may match a value left by
userspace, causing deadlock.
* avoid the lock on release. This leaves a small race if releasing the
shadow slot before setting TP_STATUS_USER. The below reproducer showed
that this race is not academic. If releasing the slot after tp_status,
the race is more subtle. See the first link for details.
* add a new tp_status TP_KERNEL_OWNED to avoid the transactional store
of two fields. But, legacy applications may interpret all non-zero
tp_status as owned by the user. As libpcap does. So this is possible
only opt-in by newer processes. It can be added as an optional mode.
* embed the struct at the tail of pg_vec to avoid extra allocation.
The implementation proved no less complex than a separate field.
The additional locking cost on release adds contention, no different
than scaling on multicore or multiqueue h/w. In practice, below
reproducer nor small packet tcpdump showed a noticeable change in
perf report in cycles spent in spinlock. Where contention is
problematic, packet sockets support mitigation through PACKET_FANOUT.
And we can consider adding opt-in state TP_KERNEL_OWNED.
Easy to reproduce by running multiple netperf or similar TCP_STREAM
flows concurrently with `tcpdump -B 129 -n greater 60000`.
Based on an earlier patchset by Jon Rosen. See links below.
I believe this issue goes back to the introduction of tpacket_rcv,
which predates git history.
Link: https://www.mail-archive.com/netdev@vger.kernel.org/msg237222.html
Suggested-by: Jon Rosen <jrosen@cisco.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jon Rosen <jrosen@cisco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In one error case, tpacket_rcv drops packets after incrementing the
ring producer index.
If this happens, it does not update tp_status to TP_STATUS_USER and
thus the reader is stalled for an iteration of the ring, causing out
of order arrival.
The only such error path is when virtio_net_hdr_from_skb fails due
to encountering an unknown GSO type.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These are updates to device drivers and file systems that for some reason
or another were not included in the kernel in the previous y2038 series.
I've gone through all users of time_t again to make sure the kernel is
in a long-term maintainable state, replacing all remaining references
to time_t with safe alternatives.
Some related parts of the series were picked up into the nfsd, xfs,
alsa and v4l2 trees. A final set of patches in linux-mm removes the now
unused time_t/timeval/timespec types and helper functions after all five
branches are merged for linux-5.6, ensuring that no new users get merged.
As a result, linux-5.6, or my backport of the patches to 5.4 [1], should
be the first release that can serve as a base for a 32-bit system designed
to run beyond year 2038, with a few remaining caveats:
- All user space must be compiled with a 64-bit time_t, which will be
supported in the coming musl-1.2 and glibc-2.32 releases, along with
installed kernel headers from linux-5.6 or higher.
- Applications that use the system call interfaces directly need to be
ported to use the time64 syscalls added in linux-5.1 in place of the
existing system calls. This impacts most users of futex() and seccomp()
as well as programming languages that have their own runtime environment
not based on libc.
- Applications that use a private copy of kernel uapi header files or
their contents may need to update to the linux-5.6 version, in
particular for sound/asound.h, xfs/xfs_fs.h, linux/input.h,
linux/elfcore.h, linux/sockios.h, linux/timex.h and linux/can/bcm.h.
- A few remaining interfaces cannot be changed to pass a 64-bit time_t
in a compatible way, so they must be configured to use CLOCK_MONOTONIC
times or (with a y2106 problem) unsigned 32-bit timestamps. Most
importantly this impacts all users of 'struct input_event'.
- All y2038 problems that are present on 64-bit machines also apply to
32-bit machines. In particular this affects file systems with on-disk
timestamps using signed 32-bit seconds: ext4 with ext3-style small
inodes, ext2, xfs (to be fixed soon) and ufs.
Changes since v1 [2]:
- Add Acks I received
- Rebase to v5.5-rc1, dropping patches that got merged already
- Add NFS, XFS and the final three patches from another series
- Rewrite etnaviv patches
- Add one late revert to avoid an etnaviv regression
[1] https://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground.git/log/?h=y2038-endgame
[2] https://lore.kernel.org/lkml/20191108213257.3097633-1-arnd@arndb.de/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJeMYy3AAoJEGCrR//JCVInEGwP/0R+S+ok7vw9OdLVT0lFl07D
IcVabgOWf24imN7m7L7Mlt3nDfxIT4tMpiAXq7eMO3spcyViG18O2LXdSQ4/7QBp
+BlhoMjOP9w34Jyd7mnkFr4vqQALvfIqkS8rFObDtDub2Rfj9PC36MRMIu8BPXlv
RK8bigwJeH/DV38yc5/JeUcD+WuewYLsK9XPWN+4yB4vgGsNU3ZQQ6nnzbR3hMsN
DN8WZ68Y7IBs0Kyxkf+s2zmRXtCa2RiFg/2TUsk5olVAJVaenvte69hq5RSbg1vW
vLi6K8cBoPWL59nqCzcNE+TUhSUg3LOj/a/KWyl76yovz7AlJaNjssOf8ZjHw6sL
MhQqz3hXTxiJDS2Jvbf1yojiYGlzrq/gqcRFGe9jPcZdieMc4/yZCx60G/Exa5Pu
YdMcqMyDWPFyUAFQNWEF59HPheOdj6tb1KpJ6bwgCo3P7QqhLrU4z9w3Py4/ZfBO
4sWcWteSsD6MN/ADJ2WQ56nNxzM2AvkeVJKcF6FCkdngXX9T0GExmZz7SqB5Du99
9lNjIiD5E+LBa/Swo/7n49aYa8x06V1pmHYTZVh9Wkl+CZiO21umezQFrWsfaMTp
xt3c6pFdMG5xNMGpreTAXOmf2R+T6O8IO2qQq/TYjzqOLH7QC830P7avkmml+cK1
LjOBE2TfSeO8Ru1dXV4t
=wx0A
-----END PGP SIGNATURE-----
Merge tag 'y2038-drivers-for-v5.6-signed' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground
Pull y2038 updates from Arnd Bergmann:
"Core, driver and file system changes
These are updates to device drivers and file systems that for some
reason or another were not included in the kernel in the previous
y2038 series.
I've gone through all users of time_t again to make sure the kernel is
in a long-term maintainable state, replacing all remaining references
to time_t with safe alternatives.
Some related parts of the series were picked up into the nfsd, xfs,
alsa and v4l2 trees. A final set of patches in linux-mm removes the
now unused time_t/timeval/timespec types and helper functions after
all five branches are merged for linux-5.6, ensuring that no new users
get merged.
As a result, linux-5.6, or my backport of the patches to 5.4 [1],
should be the first release that can serve as a base for a 32-bit
system designed to run beyond year 2038, with a few remaining caveats:
- All user space must be compiled with a 64-bit time_t, which will be
supported in the coming musl-1.2 and glibc-2.32 releases, along
with installed kernel headers from linux-5.6 or higher.
- Applications that use the system call interfaces directly need to
be ported to use the time64 syscalls added in linux-5.1 in place of
the existing system calls. This impacts most users of futex() and
seccomp() as well as programming languages that have their own
runtime environment not based on libc.
- Applications that use a private copy of kernel uapi header files or
their contents may need to update to the linux-5.6 version, in
particular for sound/asound.h, xfs/xfs_fs.h, linux/input.h,
linux/elfcore.h, linux/sockios.h, linux/timex.h and
linux/can/bcm.h.
- A few remaining interfaces cannot be changed to pass a 64-bit
time_t in a compatible way, so they must be configured to use
CLOCK_MONOTONIC times or (with a y2106 problem) unsigned 32-bit
timestamps. Most importantly this impacts all users of 'struct
input_event'.
- All y2038 problems that are present on 64-bit machines also apply
to 32-bit machines. In particular this affects file systems with
on-disk timestamps using signed 32-bit seconds: ext4 with
ext3-style small inodes, ext2, xfs (to be fixed soon) and ufs"
[1] https://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground.git/log/?h=y2038-endgame
* tag 'y2038-drivers-for-v5.6-signed' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (21 commits)
Revert "drm/etnaviv: reject timeouts with tv_nsec >= NSEC_PER_SEC"
y2038: sh: remove timeval/timespec usage from headers
y2038: sparc: remove use of struct timex
y2038: rename itimerval to __kernel_old_itimerval
y2038: remove obsolete jiffies conversion functions
nfs: fscache: use timespec64 in inode auxdata
nfs: fix timstamp debug prints
nfs: use time64_t internally
sunrpc: convert to time64_t for expiry
drm/etnaviv: avoid deprecated timespec
drm/etnaviv: reject timeouts with tv_nsec >= NSEC_PER_SEC
drm/msm: avoid using 'timespec'
hfs/hfsplus: use 64-bit inode timestamps
hostfs: pass 64-bit timestamps to/from user space
packet: clarify timestamp overflow
tsacct: add 64-bit btime field
acct: stop using get_seconds()
um: ubd: use 64-bit time_t where possible
xtensa: ISS: avoid struct timeval
dlm: use SO_SNDTIMEO_NEW instead of SO_SNDTIMEO_OLD
...
If __ethtool_get_link_ksettings() is failed and with
non-zero value, prb_calc_retire_blk_tmo() should return
DEFAULT_PRB_RETIRE_TOV firstly.
This patch is to refactory code and make it more readable.
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The memory mapped packet socket data structure in version 1 through 3
all contain 32-bit second values for the packet time stamps, which makes
them suffer from the overflow of time_t in y2038 or y2106 (depending
on whether user space interprets the value as signed or unsigned).
The implementation uses the deprecated getnstimeofday() function.
In order to get rid of that, this changes the code to use
ktime_get_real_ts64() as a replacement, documenting the nature of the
overflow. As long as the user applications treat the timestamps as
unsigned, or only use the difference between timestamps, they are
fine, and changing the timestamps to 64-bit wouldn't require a more
invasive user space API change.
Note: a lot of other APIs suffer from incompatible structures when
time_t gets redefined to 64-bit in 32-bit user space, but this one
does not.
Acked-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/lkml/CAF=yD-Jomr-gWSR-EBNKnSpFL46UeG564FLfqTCMNEm-prEaXA@mail.gmail.com/T/#u
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
There is softlockup when using TPACKET_V3:
...
NMI watchdog: BUG: soft lockup - CPU#2 stuck for 60010ms!
(__irq_svc) from [<c0558a0c>] (_raw_spin_unlock_irqrestore+0x44/0x54)
(_raw_spin_unlock_irqrestore) from [<c027b7e8>] (mod_timer+0x210/0x25c)
(mod_timer) from [<c0549c30>]
(prb_retire_rx_blk_timer_expired+0x68/0x11c)
(prb_retire_rx_blk_timer_expired) from [<c027a7ac>]
(call_timer_fn+0x90/0x17c)
(call_timer_fn) from [<c027ab6c>] (run_timer_softirq+0x2d4/0x2fc)
(run_timer_softirq) from [<c021eaf4>] (__do_softirq+0x218/0x318)
(__do_softirq) from [<c021eea0>] (irq_exit+0x88/0xac)
(irq_exit) from [<c0240130>] (msa_irq_exit+0x11c/0x1d4)
(msa_irq_exit) from [<c0209cf0>] (handle_IPI+0x650/0x7f4)
(handle_IPI) from [<c02015bc>] (gic_handle_irq+0x108/0x118)
(gic_handle_irq) from [<c0558ee4>] (__irq_usr+0x44/0x5c)
...
If __ethtool_get_link_ksettings() is failed in
prb_calc_retire_blk_tmo(), msec and tmo will be zero, so tov_in_jiffies
is zero and the timer expire for retire_blk_timer is turn to
mod_timer(&pkc->retire_blk_timer, jiffies + 0),
which will trigger cpu usage of softirq is 100%.
Fixes: f6fb8f100b ("af-packet: TPACKET_V3 flexible buffer implementation.")
Tested-by: Xiao Jiangfeng <xiaojiangfeng@huawei.com>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 174e23810c
("sk_buff: drop all skb extensions on free and skb scrubbing") made napi
recycle always drop skb extensions. The additional skb_ext_del() that is
performed via nf_reset on napi skb recycle is not needed anymore.
Most nf_reset() calls in the stack are there so queued skb won't block
'rmmod nf_conntrack' indefinitely.
This removes the skb_ext_del from nf_reset, and renames it to a more
fitting nf_reset_ct().
In a few selected places, add a call to skb_ext_reset to make sure that
no active extensions remain.
I am submitting this for "net", because we're still early in the release
cycle. The patch applies to net-next too, but I think the rename causes
needless divergence between those trees.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The new route handling in ip_mc_finish_output() from 'net' overlapped
with the new support for returning congestion notifications from BPF
programs.
In order to handle this I had to take the dev_loopback_xmit() calls
out of the switch statement.
The aquantia driver conflicts were simple overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
When an application is run that:
a) Sets its scheduler to be SCHED_FIFO
and
b) Opens a memory mapped AF_PACKET socket, and sends frames with the
MSG_DONTWAIT flag cleared, its possible for the application to hang
forever in the kernel. This occurs because when waiting, the code in
tpacket_snd calls schedule, which under normal circumstances allows
other tasks to run, including ksoftirqd, which in some cases is
responsible for freeing the transmitted skb (which in AF_PACKET calls a
destructor that flips the status bit of the transmitted frame back to
available, allowing the transmitting task to complete).
However, when the calling application is SCHED_FIFO, its priority is
such that the schedule call immediately places the task back on the cpu,
preventing ksoftirqd from freeing the skb, which in turn prevents the
transmitting task from detecting that the transmission is complete.
We can fix this by converting the schedule call to a completion
mechanism. By using a completion queue, we force the calling task, when
it detects there are no more frames to send, to schedule itself off the
cpu until such time as the last transmitted skb is freed, allowing
forward progress to be made.
Tested by myself and the reporter, with good results
Change Notes:
V1->V2:
Enhance the sleep logic to support being interruptible and
allowing for honoring to SK_SNDTIMEO (Willem de Bruijn)
V2->V3:
Rearrage the point at which we wait for the completion queue, to
avoid needing to check for ph/skb being null at the end of the loop.
Also move the complete call to the skb destructor to avoid needing to
modify __packet_set_status. Also gate calling complete on
packet_read_pending returning zero to avoid multiple calls to complete.
(Willem de Bruijn)
Move timeo computation within loop, to re-fetch the socket
timeout since we also use the timeo variable to record the return code
from the wait_for_complete call (Neil Horman)
V3->V4:
Willem has requested that the control flow be restored to the
previous state. Doing so lets us eliminate the need for the
po->wait_on_complete flag variable, and lets us get rid of the
packet_next_frame function, but introduces another complexity.
Specifically, but using the packet pending count, we can, if an
applications calls sendmsg multiple times with MSG_DONTWAIT set, each
set of transmitted frames, when complete, will cause
tpacket_destruct_skb to issue a complete call, for which there will
never be a wait_on_completion call. This imbalance will lead to any
future call to wait_for_completion here to return early, when the frames
they sent may not have completed. To correct this, we need to re-init
the completion queue on every call to tpacket_snd before we enter the
loop so as to ensure we wait properly for the frames we send in this
iteration.
Change the timeout and interrupted gotos to out_put rather than
out_status so that we don't try to free a non-existant skb
Clean up some extra newlines (Willem de Bruijn)
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
syzbot found we can leak memory in packet_set_ring(), if user application
provides buggy parameters.
Fixes: 7f953ab2ba ("af_packet: TX_RING support for TPACKET_V3")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are two places where we want to clear the pressure
if possible, add a helper to make it more obvious.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Willem de Bruijn <willemb@google.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__packet_rcv_has_room() can now be run without lock being held.
po->pressure is only a non persistent hint, we can mark
all read/write accesses with READ_ONCE()/WRITE_ONCE()
to document the fact that the field could be written
without any synchronization.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tpacket_rcv() can be hit under DDOS quite hard, since
it will always grab a socket spinlock, to eventually find
there is no room for an additional packet.
Using tcpdump [1] on a busy host can lead to catastrophic consequences,
because of all cpus spinning on a contended spinlock.
This replicates a similar strategy used in packet_rcv()
[1] Also some applications mistakenly use af_packet socket
bound to ETH_P_ALL only to send packets.
Receive queue is never drained and immediately full.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Under DDOS, we want to be able to increment tp_drops without
touching the spinlock. This will help readers to drain
the receive queue slightly faster :/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Goal is to be able to use __tpacket_v3_has_room() without holding
a lock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Goal is to be able to use __tpacket_has_room() without holding a lock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The variable 'status' in __packet_lookup_frame_in_block() is never used since
introduction in commit f6fb8f100b ("af-packet: TPACKET_V3 flexible buffer
implementation."), we can remove it.
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Free AF_PACKET po->rollover properly, from Willem de Bruijn.
2) Read SFP eeprom in max 16 byte increments to avoid problems with
some SFP modules, from Russell King.
3) Fix UDP socket lookup wrt. VRF, from Tim Beale.
4) Handle route invalidation properly in s390 qeth driver, from Julian
Wiedmann.
5) Memory leak on unload in RDS, from Zhu Yanjun.
6) sctp_process_init leak, from Neil HOrman.
7) Fix fib_rules rule insertion semantic change that broke Android,
from Hangbin Liu.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (33 commits)
pktgen: do not sleep with the thread lock held.
net: mvpp2: Use strscpy to handle stat strings
net: rds: fix memory leak in rds_ib_flush_mr_pool
ipv6: fix EFAULT on sendto with icmpv6 and hdrincl
ipv6: use READ_ONCE() for inet->hdrincl as in ipv4
Revert "fib_rules: return 0 directly if an exactly same rule exists when NLM_F_EXCL not supplied"
net: aquantia: fix wol configuration not applied sometimes
ethtool: fix potential userspace buffer overflow
Fix memory leak in sctp_process_init
net: rds: fix memory leak when unload rds_rdma
ipv6: fix the check before getting the cookie in rt6_get_cookie
ipv4: not do cache for local delivery if bc_forwarding is enabled
s390/qeth: handle error when updating TX queue count
s390/qeth: fix VLAN attribute in bridge_hostnotify udev event
s390/qeth: check dst entry before use
s390/qeth: handle limited IPv4 broadcast in L3 TX path
net: fix indirect calls helpers for ptype list hooks.
net: ipvlan: Fix ipvlan device tso disabled while NETIF_F_IP_CSUM is set
udp: only choose unbound UDP socket for multicast when not in a VRF
net/tls: replace the sleeping lock around RX resync with a bit lock
...
Rollover used to use a complex RCU mechanism for assignment, which had
a race condition. The below patch fixed the bug and greatly simplified
the logic.
The feature depends on fanout, but the state is private to the socket.
Fanout_release returns f only when the last member leaves and the
fanout struct is to be freed.
Destroy rollover unconditionally, regardless of fanout state.
Fixes: 57f015f5ec ("packet: fix crash in fanout_demux_rollover()")
Reported-by: syzbot <syzkaller@googlegroups.com>
Diagnosed-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-or-later
has been chosen to replace the boilerplate/reference in 3029 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Packet sockets in datagram mode take a destination address. Verify its
length before passing to dev_hard_header.
Prior to 2.6.14-rc3, the send code ignored sll_halen. This is
established behavior. Directly compare msg_namelen to dev->addr_len.
Change v1->v2: initialize addr in all paths
Fixes: 6b8d95f179 ("packet: validate address length if non-zero")
Suggested-by: David Laight <David.Laight@aculab.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Packet send checks that msg_name is at least sizeof sockaddr_ll.
Packet recv must return at least this length, so that its output
can be passed unmodified to packet send.
This ceased to be true since adding support for lladdr longer than
sll_addr. Since, the return value uses true address length.
Always return at least sizeof sockaddr_ll, even if address length
is shorter. Zero the padding bytes.
Change v1->v2: do not overwrite zeroed padding again. use copy_len.
Fixes: 0fb375fb9b ("[AF_PACKET]: Allow for > 8 byte hardware addresses.")
Suggested-by: David Laight <David.Laight@aculab.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The SIOCGSTAMP/SIOCGSTAMPNS ioctl commands are implemented by many
socket protocol handlers, and all of those end up calling the same
sock_get_timestamp()/sock_get_timestampns() helper functions, which
results in a lot of duplicate code.
With the introduction of 64-bit time_t on 32-bit architectures, this
gets worse, as we then need four different ioctl commands in each
socket protocol implementation.
To simplify that, let's add a new .gettstamp() operation in
struct proto_ops, and move ioctl implementation into the common
sock_ioctl()/compat_sock_ioctl_trans() functions that these all go
through.
We can reuse the sock_get_timestamp() implementation, but generalize
it so it can deal with both native and compat mode, as well as
timeval and timespec structures.
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marc Kleine-Budde <mkl@pengutronix.de>
Link: https://lore.kernel.org/lkml/CAK8P3a038aDQQotzua_QtKGhq8O9n+rdiz2=WDCp82ys8eUT+A@mail.gmail.com/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After the previous patch, all the callers of ndo_select_queue()
provide as a 'fallback' argument netdev_pick_tx.
The only exceptions are nested calls to ndo_select_queue(),
which pass down the 'fallback' available in the current scope
- still netdev_pick_tx.
We can drop such argument and replace fallback() invocation with
netdev_pick_tx(). This avoids an indirect call per xmit packet
in some scenarios (TCP syn, UDP unconnected, XDP generic, pktgen)
with device drivers implementing such ndo. It also clean the code
a bit.
Tested with ixgbe and CONFIG_FCOE=m
With pktgen using queue xmit:
threads vanilla patched
(kpps) (kpps)
1 2334 2428
2 4166 4278
4 7895 8100
v1 -> v2:
- rebased after helper's name change
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently packet_pick_tx_queue() is the only caller of
ndo_select_queue() using a fallback argument other than
netdev_pick_tx.
Leveraging rx queue, we can obtain a similar queue selection
behavior using core helpers. After this change, ndo_select_queue()
is always invoked with netdev_pick_tx() as fallback.
We can change ndo_select_queue() signature in a followup patch,
dropping an indirect call per transmitted packet in some scenarios
(e.g. TCP syn and XDP generic xmit)
This changes slightly how af packet queue selection happens when
PACKET_QDISC_BYPASS is set. It's now more similar to plan dev_queue_xmit()
tacking in account both XPS and TC mapping.
v1 -> v2:
- rebased after helper name change
RFC -> v1:
- initialize sender_cpu to the expected value
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I am using "protocol ip" filters in TC to manipulate TC flower
classifiers, which are only available with "protocol ip". However,
I faced an issue that packets sent via raw sockets with ETH_P_ALL
did not match the ip filters even if they did satisfy the condition
(e.g., DHCP offer from dhcpd).
I have determined that the behavior was caused by an unexpected
value stored in skb->protocol, namely, ETH_P_ALL instead of ETH_P_IP,
when packets were sent via raw sockets with ETH_P_ALL set.
IMHO, storing ETH_P_ALL in skb->protocol is not appropriate for
packets sent via raw sockets because ETH_P_ALL is not a real ether
type used on wire, but a virtual one.
This patch fixes the tx protocol selection in cases of transmission
via raw sockets created with ETH_P_ALL so that it asks the driver to
extract protocol from the Ethernet header.
Fixes: 75c65772c3 ("net/packet: Ask driver for protocol if not provided by user")
Signed-off-by: Yoshiki Komachi <komachi.yoshiki@lab.ntt.co.jp>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When using fanouts with AF_PACKET, the demux functions such as
fanout_demux_cpu will return an index in the fanout socket array, which
corresponds to the selected socket.
The ordering of this array depends on the order the sockets were added
to a given fanout group, so for FANOUT_CPU this means sockets are bound
to cpus in the order they are configured, which is OK.
However, when stopping then restarting the interface these sockets are
bound to, the sockets are reassigned to the fanout group in the reverse
order, due to the fact that they were inserted at the head of the
interface's AF_PACKET socket list.
This means that traffic that was directed to the first socket in the
fanout group is now directed to the last one after an interface restart.
In the case of FANOUT_CPU, traffic from CPU0 will be directed to the
socket that used to receive traffic from the last CPU after an interface
restart.
This commit introduces a helper to add a socket at the tail of a list,
then uses it to register AF_PACKET sockets.
Note that this changes the order in which sockets are listed in /proc and
with sock_diag.
Fixes: dc99f60069 ("packet: Add fanout support")
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
c72219b75f introduced tpacket_set_protocol that parses the Ethernet L2
header and sets skb->protocol if it's unset. It is no longer needed
since the introduction of packet_parse_headers. In case of SOCK_RAW and
unset skb->protocol, packet_parse_headers asks the driver to tell the
protocol number, and it's implemented for all Ethernet devices. As the
old function supported only Ethernet, no functionality is lost.
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a socket was created with socket(AF_PACKET, SOCK_RAW, 0), the
protocol number is unavailable. Try to ask the driver to extract it from
the L2 header in order for skb_try_probe_transport_header to succeed.
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the socket was created with socket(AF_PACKET, SOCK_RAW, 0),
skb->protocol will be unset, __skb_flow_dissect() will fail, and
skb_probe_transport_header() will fall back to the offset_hint, making
the resulting skb_transport_offset incorrect.
If, however, there is no transport header in the packet,
transport_header shouldn't be set to an arbitrary value.
Fix it by leaving the transport offset unset if it couldn't be found, to
be explicit rather than to fill it with some wrong value. It changes the
behavior, but if some code relied on the old behavior, it would be
broken anyway, as the old one is incorrect.
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When calculating rb->frames_per_block * req->tp_block_nr the result
can overflow. Check it for overflow without limiting the total buffer
size to UINT_MAX.
This change fixes support for packet ring buffers >= UINT_MAX.
Fixes: 8f8d28e4d6 ("net/packet: fix overflow in check for tp_frame_nr")
Signed-off-by: Kal Conley <kal.conley@dectris.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit cb9f1b7838, scapy (which uses an AF_PACKET socket in
SOCK_RAW mode) is unable to send a basic icmp packet over a sit tunnel:
Here is a example of the setup:
$ ip link set ntfp2 up
$ ip addr add 10.125.0.1/24 dev ntfp2
$ ip tunnel add tun1 mode sit ttl 64 local 10.125.0.1 remote 10.125.0.2 dev ntfp2
$ ip addr add fd00:cafe:cafe::1/128 dev tun1
$ ip link set dev tun1 up
$ ip route add fd00:200::/64 dev tun1
$ scapy
>>> p = []
>>> p += IPv6(src='fd00💯:1', dst='fd00:200::1')/ICMPv6EchoRequest()
>>> send(p, count=1, inter=0.1)
>>> quit()
$ ip -s link ls dev tun1 | grep -A1 "TX.*errors"
TX: bytes packets errors dropped carrier collsns
0 0 1 0 0 0
The problem is that the network offset is set to the hard_header_len of the
output device (tun1, ie 14 + 20) and in our case, because the packet is
small (48 bytes) the pskb_inet_may_pull() fails (it tries to pull 40 bytes
(ipv6 header) starting from the network offset).
This problem is more generally related to device with variable hard header
length. To avoid a too intrusive patch in the current release, a (ugly)
workaround is proposed in this patch. It has to be cleaned up in net-next.
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=993675a3100b1
Link: http://patchwork.ozlabs.org/patch/1024489/
Fixes: cb9f1b7838 ("ip: validate header length on virtual device xmit")
CC: Willem de Bruijn <willemb@google.com>
CC: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
'dev' is non NULL when the addr_len check triggers so it must goto a label
that does the dev_put otherwise dev will have a leaked refcount.
This bug causes the ib_ipoib module to become unloadable when using
systemd-network as it triggers this check on InfiniBand links.
Fixes: 99137b7888 ("packet: validate address length")
Reported-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Validate packet socket address length if a length is given. Zero
length is equivalent to not setting an address.
Fixes: 99137b7888 ("packet: validate address length")
Reported-by: Ido Schimmel <idosch@idosch.org>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Packet sockets with SOCK_DGRAM may pass an address for use in
dev_hard_header. Ensure that it is of sufficient length.
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SOF_TIMESTAMPING_OPT_ID is supported on TCP, UDP and RAW sockets.
But it was missing on RAW with IPPROTO_IP, PF_PACKET and CAN.
Add skb_setup_tx_timestamp that configures both tx_flags and tskey
for these paths that do not need corking or use bytestream keys.
Fixes: 09c2d251b7 ("net-timestamp: add key to disambiguate concurrent datagrams")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tpacket_snd sends packets with user pages linked into skb frags. It
notifies that pages can be reused when the skb is released by setting
skb->destructor to tpacket_destruct_skb.
This can cause data corruption if the skb is orphaned (e.g., on
transmit through veth) or cloned (e.g., on mirror to another psock).
Create a kernel-private copy of data in these cases, same as tun/tap
zerocopy transmission. Reuse that infrastructure: mark the skb as
SKBTX_ZEROCOPY_FRAG, which will trigger copy in skb_orphan_frags(_rx).
Unlike other zerocopy packets, do not set shinfo destructor_arg to
struct ubuf_info. tpacket_destruct_skb already uses that ptr to notify
when the original skb is released and a timestamp is recorded. Do not
change this timestamp behavior. The ubuf_info->callback is not needed
anyway, as no zerocopy notification is expected.
Mark destructor_arg as not-a-uarg by setting the lower bit to 1. The
resulting value is not a valid ubuf_info pointer, nor a valid
tpacket_snd frame address. Add skb_zcopy_.._nouarg helpers for this.
The fix relies on features introduced in commit 52267790ef ("sock:
add MSG_ZEROCOPY"), so can be backported as is only to 4.14.
Tested with from `./in_netns.sh ./txring_overwrite` from
http://github.com/wdebruij/kerneltools/tests
Fixes: 69e3c75f4d ("net: TX_RING and packet mmap")
Reported-by: Anand H. Krishnan <anandhkrishnan@gmail.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we use raw socket as the vhost backend, a packet from virito with
gso offloading information, cannot be sent out in later validaton at
xmit path, as we did not set correct skb->protocol which is further used
for looking up the gso function.
To fix this, we set this field according to virito hdr information.
Fixes: e858fae2b0 ("virtio_net: use common code for virtio_net_hdr and skb GSO conversion")
Signed-off-by: Jianfeng Tan <jianfeng.tan@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the only way to ignore outgoing packets on a packet socket is
via the BPF filter. With MSG_ZEROCOPY, packets that are looped into
AF_PACKET are copied in dev_queue_xmit_nit(), and this copy happens even
if the filter run from packet_rcv() would reject them. So the presence
of a packet socket on the interface takes away the benefits of
MSG_ZEROCOPY, even if the packet socket is not interested in outgoing
packets. (Even when MSG_ZEROCOPY is not used, the skb is unnecessarily
cloned, but the cost for that is much lower.)
Add a socket option to allow AF_PACKET sockets to ignore outgoing
packets to solve this. Note that the *BSDs already have something
similar: BIOCSSEESENT/BIOCSDIRECTION and BIOCSDIRFILT.
The first intended user is lldpd.
Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 71e4128620.
mmap()/munmap() can not be backed by kmalloced pages :
We fault in :
VM_BUG_ON_PAGE(PageSlab(page), page);
unmap_single_vma+0x8a/0x110
unmap_vmas+0x4b/0x90
unmap_region+0xc9/0x140
do_munmap+0x274/0x360
vm_munmap+0x81/0xc0
SyS_munmap+0x2b/0x40
do_syscall_64+0x13e/0x1c0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
Fixes: 71e4128620 ("packet: switch kvzalloc to allocate memory")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: John Sperbeck <jsperbeck@google.com>
Bisected-by: John Sperbeck <jsperbeck@google.com>
Cc: Zhang Yu <zhangyu31@baidu.com>
Cc: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The patches includes following change:
*Use modern kvzalloc()/kvfree() instead of custom allocations.
*Remove order argument for alloc_pg_vec, it can get from req.
*Remove order argument for free_pg_vec, free_pg_vec now uses
kvfree which does not need order argument.
*Remove pg_vec_order from struct packet_ring_buffer, no longer
need to save/restore 'order'
*Remove variable 'order' for packet_set_ring, it is now unused
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>