Go to file
Christian Brauner 94f82008ce Revert "vfs: Allow userns root to call mknod on owned filesystems."
This reverts commit 55956b59df.

commit 55956b59df ("vfs: Allow userns root to call mknod on owned filesystems.")
enabled mknod() in user namespaces for userns root if CAP_MKNOD is
available. However, these device nodes are useless since any filesystem
mounted from a non-initial user namespace will set the SB_I_NODEV flag on
the filesystem. Now, when a device node s created in a non-initial user
namespace a call to open() on said device node will fail due to:

bool may_open_dev(const struct path *path)
{
        return !(path->mnt->mnt_flags & MNT_NODEV) &&
                !(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
}

The problem with this is that as of the aforementioned commit mknod()
creates partially functional device nodes in non-initial user namespaces.
In particular, it has the consequence that as of the aforementioned commit
open() will be more privileged with respect to device nodes than mknod().
Before it was the other way around. Specifically, if mknod() succeeded
then it was transparent for any userspace application that a fatal error
must have occured when open() failed.

All of this breaks multiple userspace workloads and a widespread assumption
about how to handle mknod(). Basically, all container runtimes and systemd
live by the slogan "ask for forgiveness not permission" when running user
namespace workloads. For mknod() the assumption is that if the syscall
succeeds the device nodes are useable irrespective of whether it succeeds
in a non-initial user namespace or not. This logic was chosen explicitly
to allow for the glorious day when mknod() will actually be able to create
fully functional device nodes in user namespaces.
A specific problem people are already running into when running 4.18 rc
kernels are failing systemd services. For any distro that is run in a
container systemd services started with the PrivateDevices= property set
will fail to start since the device nodes in question cannot be
opened (cf. the arguments in [1]).

Full disclosure, Seth made the very sound argument that it is already
possible to end up with partially functional device nodes. Any filesystem
mounted with MS_NODEV set will allow mknod() to succeed but will not allow
open() to succeed. The difference to the case here is that the MS_NODEV
case is transparent to userspace since it is an explicitly set mount option
while the SB_I_NODEV case is an implicit property enforced by the kernel
and hence opaque to userspace.

[1]: https://github.com/systemd/systemd/pull/9483

Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Seth Forshee <seth.forshee@canonical.com>
Cc: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-22 14:18:34 -08:00
arch Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc 2018-12-21 14:23:57 -08:00
block block: Fix null_blk_zoned creation failure with small number of zones 2018-12-11 16:19:38 -07:00
certs export.h: remove VMLINUX_SYMBOL() and VMLINUX_SYMBOL_STR() 2018-08-22 23:21:44 +09:00
crypto crypto: user - Disable statistics interface 2018-12-07 13:56:08 +08:00
Documentation XArray updates for 4.20-rc7 2018-12-13 16:35:58 -08:00
drivers Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc 2018-12-21 14:23:57 -08:00
firmware kbuild: remove all dummy assignments to obj- 2017-11-18 11:46:06 +09:00
fs Revert "vfs: Allow userns root to call mknod on owned filesystems." 2018-12-22 14:18:34 -08:00
include dma-mapping: fix flags in dma_alloc_wc 2018-12-22 08:46:27 -08:00
init psi: fix reference to kernel commandline enable 2018-12-14 15:05:45 -08:00
ipc ipc: IPCMNI limit check for semmni 2018-10-31 08:54:14 -07:00
kernel fork,memcg: fix crash in free_thread_stack on memcg charge fail 2018-12-21 14:51:18 -08:00
lib XArray: Fix xa_alloc when id exceeds max 2018-12-13 14:07:33 -05:00
LICENSES This is a fairly typical cycle for documentation. There's some welcome 2018-10-24 18:01:11 +01:00
mm mm, page_alloc: fix has_unmovable_pages for HugePages 2018-12-21 14:51:18 -08:00
net tls: Do not call sk_memcopy_from_iter with zero length 2018-12-21 10:26:54 -08:00
samples VFIO updates for v4.20 2018-10-31 11:01:38 -07:00
scripts Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2018-12-21 09:22:24 -08:00
security ima: cleanup the match_token policy code 2018-12-17 16:31:28 -08:00
sound ALSA: hda/realtek: Enable audio jacks of ASUS UX433FN/UX333FA with ALC294 2018-12-10 11:25:22 +01:00
tools Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-12-19 23:34:33 -08:00
usr initramfs: move gen_initramfs_list.sh from scripts/ to usr/ 2018-08-22 23:21:44 +09:00
virt KVM: fix unregistering coalesced mmio zone from wrong bus 2018-12-18 22:07:25 +01:00
.clang-format page cache: Convert find_get_pages_contig to XArray 2018-10-21 10:46:34 -04:00
.cocciconfig
.get_maintainer.ignore
.gitattributes .gitattributes: set git diff driver for C source code files 2016-10-07 18:46:30 -07:00
.gitignore Kbuild updates for v4.17 (2nd) 2018-04-15 17:21:30 -07:00
.mailmap mailmap: Update email for Punit Agrawal 2018-11-05 10:02:11 +00:00
COPYING COPYING: use the new text with points to the license files 2018-03-23 12:41:45 -06:00
CREDITS MAINTAINERS: update entry for MMP platform 2018-12-03 12:39:57 -08:00
Kbuild Kbuild updates for v4.15 2017-11-17 17:45:29 -08:00
Kconfig kconfig: move the "Executable file formats" menu to fs/Kconfig.binfmt 2018-08-02 08:06:55 +09:00
MAINTAINERS Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-12-21 14:21:17 -08:00
Makefile Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2018-12-21 09:22:24 -08:00
README Drop all 00-INDEX files from Documentation/ 2018-09-09 15:08:58 -06:00

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.