Rather than playing silly buggers with vfsmount refcounts, just have
acct_on() ask fs/namespace.c for internal clone of file->f_path.mnt
and replace it with said clone. Then attach the pin to original
vfsmount. Voila - the clone will be alive until the file gets closed,
making sure that underlying superblock remains active, etc., and
we can drop the original vfsmount, so that it's not kept busy.
If the file lives until the final mntput of the original vfsmount,
we'll notice that there's an fs_pin (one in bsd_acct_struct that
holds that file) and mnt_pin_kill() will take it out. Since
->kill() is synchronous, we won't proceed past that point until
these files are closed (and private clones of our vfsmount are
gone), so we get the same ordering warranties we used to get.
mnt_pin()/mnt_unpin()/->mnt_pinned is gone now, and good riddance -
it never became usable outside of kernel/acct.c (and racy wrt
umount even there).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
These externs belong in fs/internal.h. Rename (they are not acct-specific
anymore) and move them over there.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Put these suckers on per-vfsmount and per-superblock lists instead.
Note: right now it's still acct_lock for everything, but that's
going to change.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
All other add functions for lists have the new item as first argument
and the position where it is added as second argument. This was changed
for no good reason in this function and makes using it unnecessary
confusing.
The name was changed to hlist_add_behind() to cause unconverted code to
generate a compile error instead of using the wrong parameter order.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Ken Helias <kenhelias@firemail.de>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> [intel driver bits]
Cc: Hugh Dickins <hughd@google.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since March 2009 the kernel has treated the state that if no
MS_..ATIME flags are passed then the kernel defaults to relatime.
Defaulting to relatime instead of the existing atime state during a
remount is silly, and causes problems in practice for people who don't
specify any MS_...ATIME flags and to get the default filesystem atime
setting. Those users may encounter a permission error because the
default atime setting does not work.
A default that does not work and causes permission problems is
ridiculous, so preserve the existing value to have a default
atime setting that is always guaranteed to work.
Using the default atime setting in this way is particularly
interesting for applications built to run in restricted userspace
environments without /proc mounted, as the existing atime mount
options of a filesystem can not be read from /proc/mounts.
In practice this fixes user space that uses the default atime
setting on remount that are broken by the permission checks
keeping less privileged users from changing more privileged users
atime settings.
Cc: stable@vger.kernel.org
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
While invesgiating the issue where in "mount --bind -oremount,ro ..."
would result in later "mount --bind -oremount,rw" succeeding even if
the mount started off locked I realized that there are several
additional mount flags that should be locked and are not.
In particular MNT_NOSUID, MNT_NODEV, MNT_NOEXEC, and the atime
flags in addition to MNT_READONLY should all be locked. These
flags are all per superblock, can all be changed with MS_BIND,
and should not be changable if set by a more privileged user.
The following additions to the current logic are added in this patch.
- nosuid may not be clearable by a less privileged user.
- nodev may not be clearable by a less privielged user.
- noexec may not be clearable by a less privileged user.
- atime flags may not be changeable by a less privileged user.
The logic with atime is that always setting atime on access is a
global policy and backup software and auditing software could break if
atime bits are not updated (when they are configured to be updated),
and serious performance degradation could result (DOS attack) if atime
updates happen when they have been explicitly disabled. Therefore an
unprivileged user should not be able to mess with the atime bits set
by a more privileged user.
The additional restrictions are implemented with the addition of
MNT_LOCK_NOSUID, MNT_LOCK_NODEV, MNT_LOCK_NOEXEC, and MNT_LOCK_ATIME
mnt flags.
Taken together these changes and the fixes for MNT_LOCK_READONLY
should make it safe for an unprivileged user to create a user
namespace and to call "mount --bind -o remount,... ..." without
the danger of mount flags being changed maliciously.
Cc: stable@vger.kernel.org
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
There are no races as locked mount flags are guaranteed to never change.
Moving the test into do_remount makes it more visible, and ensures all
filesystem remounts pass the MNT_LOCK_READONLY permission check. This
second case is not an issue today as filesystem remounts are guarded
by capable(CAP_DAC_ADMIN) and thus will always fail in less privileged
mount namespaces, but it could become an issue in the future.
Cc: stable@vger.kernel.org
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Kenton Varda <kenton@sandstorm.io> discovered that by remounting a
read-only bind mount read-only in a user namespace the
MNT_LOCK_READONLY bit would be cleared, allowing an unprivileged user
to the remount a read-only mount read-write.
Correct this by replacing the mask of mount flags to preserve
with a mask of mount flags that may be changed, and preserve
all others. This ensures that any future bugs with this mask and
remount will fail in an easy to detect way where new mount flags
simply won't change.
Cc: stable@vger.kernel.org
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The synchronous syncrhonize_rcu in switch_task_namespaces makes setns
a sufficiently expensive system call that people have complained.
Upon inspect nsproxy no longer needs rcu protection for remote reads.
remote reads are rare. So optimize for same process reads and write
by switching using rask_lock instead.
This yields a simpler to understand lock, and a faster setns system call.
In particular this fixes a performance regression observed
by Rafael David Tinoco <rafael.tinoco@canonical.com>.
This is effectively a revert of Pavel Emelyanov's commit
cf7b708c8d Make access to task's nsproxy lighter
from 2007. The race this originialy fixed no longer exists as
do_notify_parent uses task_active_pid_ns(parent) instead of
parent->nsproxy.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Make delayed_free() call free_vfsmnt() so that we don't have two functions
doing the same job. This requires the calls to mnt_free_id() in free_vfsmnt()
to be moved into the callers of that function.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
new flag in ->f_mode - FMODE_WRITER. Set by do_dentry_open() in case
when it has grabbed write access, checked by __fput() to decide whether
it wants to drop the sucker. Allows to stop bothering with mnt_clone_write()
in alloc_file(), along with fewer special_file() checks.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The current mainline has copies propagated to *all* nodes, then
tears down the copies we made for nodes that do not contain
counterparts of the desired mountpoint. That sets the right
propagation graph for the copies (at teardown time we move
the slaves of removed node to a surviving peer or directly
to master), but we end up paying a fairly steep price in
useless allocations. It's fairly easy to create a situation
where N calls of mount(2) create exactly N bindings, with
O(N^2) vfsmounts allocated and freed in process.
Fortunately, it is possible to avoid those allocations/freeings.
The trick is to create copies in the right order and find which
one would've eventually become a master with the current algorithm.
It turns out to be possible in O(nodes getting propagation) time
and with no extra allocations at all.
One part is that we need to make sure that eventual master will be
created before its slaves, so we need to walk the propagation
tree in a different order - by peer groups. And iterate through
the peers before dealing with the next group.
Another thing is finding the (earlier) copy that will be a master
of one we are about to create; to do that we are (temporary) marking
the masters of mountpoints we are attaching the copies to.
Either we are in a peer of the last mountpoint we'd dealt with,
or we have the following situation: we are attaching to mountpoint M,
the last copy S_0 had been attached to M_0 and there are sequences
S_0...S_n, M_0...M_n such that S_{i+1} is a master of S_{i},
S_{i} mounted on M{i} and we need to create a slave of the first S_{k}
such that M is getting propagation from M_{k}. It means that the master
of M_{k} will be among the sequence of masters of M. On the
other hand, the nearest marked node in that sequence will either
be the master of M_{k} or the master of M_{k-1} (the latter -
in the case if M_{k-1} is a slave of something M gets propagation
from, but in a wrong peer group).
So we go through the sequence of masters of M until we find
a marked one (P). Let N be the one before it. Then we go through
the sequence of masters of S_0 until we find one (say, S) mounted
on a node D that has P as master and check if D is a peer of N.
If it is, S will be the master of new copy, if not - the master of S
will be.
That's it for the hard part; the rest is fairly simple. Iterator
is in next_group(), handling of one prospective mountpoint is
propagate_one().
It seems to survive all tests and gives a noticably better performance
than the current mainline for setups that are seriously using shared
subtrees.
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
fixes RCU bug - walking through hlist is safe in face of element moves,
since it's self-terminating. Cyclic lists are not - if we end up jumping
to another hash chain, we'll loop infinitely without ever hitting the
original list head.
[fix for dumb braino folded]
Spotted by: Max Kellermann <mk@cm4all.com>
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If the dest_mnt is not shared, propagate_mnt() does nothing -
there's no mounts to propagate to and thus no copies to create.
Might as well don't bother calling it in that case.
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* switch allocation to alloc_large_system_hash()
* make sizes overridable by boot parameters (mhash_entries=, mphash_entries=)
* switch mountpoint_hashtable from list_head to hlist_head
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Here's the big driver core and sysfs patch set for 3.14-rc1.
There's a lot of work here moving sysfs logic out into a "kernfs" to
allow other subsystems to also have a virtual filesystem with the same
attributes of sysfs (handle device disconnect, dynamic creation /
removal as needed / unneeded, etc. This is primarily being done for
the cgroups filesystem, but the goal is to also move debugfs to it when
it is ready, solving all of the known issues in that filesystem as well.
The code isn't completed yet, but all should be stable now (there is a
big section that was reverted due to problems found when testing.)
There's also some other smaller fixes, and a driver core addition that
allows for a "collection" of objects, that the DRM people will be using
soon (it's in this tree to make merges after -rc1 easier.)
All of this has been in linux-next with no reported issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iEYEABECAAYFAlLdh0cACgkQMUfUDdst+ylv4QCfeDKDgLo4LsaBIIrFSxLoH/c7
UUsAoMPRwA0h8wy+BQcJAg4H4J4maKj3
=0pc0
-----END PGP SIGNATURE-----
Merge tag 'driver-core-3.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core / sysfs patches from Greg KH:
"Here's the big driver core and sysfs patch set for 3.14-rc1.
There's a lot of work here moving sysfs logic out into a "kernfs" to
allow other subsystems to also have a virtual filesystem with the same
attributes of sysfs (handle device disconnect, dynamic creation /
removal as needed / unneeded, etc)
This is primarily being done for the cgroups filesystem, but the goal
is to also move debugfs to it when it is ready, solving all of the
known issues in that filesystem as well. The code isn't completed
yet, but all should be stable now (there is a big section that was
reverted due to problems found when testing)
There's also some other smaller fixes, and a driver core addition that
allows for a "collection" of objects, that the DRM people will be
using soon (it's in this tree to make merges after -rc1 easier)
All of this has been in linux-next with no reported issues"
* tag 'driver-core-3.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (113 commits)
kernfs: associate a new kernfs_node with its parent on creation
kernfs: add struct dentry declaration in kernfs.h
kernfs: fix get_active failure handling in kernfs_seq_*()
Revert "kernfs: fix get_active failure handling in kernfs_seq_*()"
Revert "kernfs: replace kernfs_node->u.completion with kernfs_root->deactivate_waitq"
Revert "kernfs: remove KERNFS_ACTIVE_REF and add kernfs_lockdep()"
Revert "kernfs: remove KERNFS_REMOVED"
Revert "kernfs: restructure removal path to fix possible premature return"
Revert "kernfs: invoke kernfs_unmap_bin_file() directly from __kernfs_remove()"
Revert "kernfs: remove kernfs_addrm_cxt"
Revert "kernfs: make kernfs_get_active() block if the node is deactivated but not removed"
Revert "kernfs: implement kernfs_{de|re}activate[_self]()"
Revert "kernfs, sysfs, driver-core: implement kernfs_remove_self() and its wrappers"
Revert "pci: use device_remove_file_self() instead of device_schedule_callback()"
Revert "scsi: use device_remove_file_self() instead of device_schedule_callback()"
Revert "s390: use device_remove_file_self() instead of device_schedule_callback()"
Revert "sysfs, driver-core: remove unused {sysfs|device}_schedule_callback_owner()"
Revert "kernfs: remove unnecessary NULL check in __kernfs_remove()"
kernfs: remove unnecessary NULL check in __kernfs_remove()
drivers/base: provide an infrastructure for componentised subsystems
...
We're in the process of separating out core sysfs functionality into
kernfs which will deal with sysfs_dirents directly. This patch
rearranges mount path so that the kernfs and sysfs parts are separate.
* As sysfs_super_info won't be visible outside kernfs proper,
kernfs_super_ns() is added to allow kernfs users to access a
super_block's namespace tag.
* Generic mount operation is separated out into kernfs_mount_ns().
sysfs_mount() now just performs sysfs-specific permission check,
acquires namespace tag, and invokes kernfs_mount_ns().
* Generic superblock release is separated out into kernfs_kill_sb()
which can be used directly as file_system_type->kill_sb(). As sysfs
needs to put the namespace tag, sysfs_kill_sb() wraps
kernfs_kill_sb() with ns tag put.
* sysfs_dir_cachep init and sysfs_inode_init() are separated out into
kernfs_init(). kernfs_init() uses only small amount of memory and
trying to handle and propagate kernfs_init() failure doesn't make
much sense. Use SLAB_PANIC for sysfs_dir_cachep and make
sysfs_inode_init() panic on failure.
After this change, kernfs_init() should be called before
sysfs_init(), fs/namespace.c::mnt_init() modified accordingly.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Gao feng <gaofeng@cn.fujitsu.com> reported that commit
e51db73532
userns: Better restrictions on when proc and sysfs can be mounted
caused a regression on mounting a new instance of proc in a mount
namespace created with user namespace privileges, when binfmt_misc
is mounted on /proc/sys/fs/binfmt_misc.
This is an unintended regression caused by the absolutely bogus empty
directory check in fs_fully_visible. The check fs_fully_visible replaced
didn't even bother to attempt to verify proc was fully visible and
hiding proc files with any kind of mount is rare. So for now fix
the userspace regression by allowing directory with nlink == 1
as /proc/sys/fs/binfmt_misc has.
I will have a better patch but it is not stable material, or
last minute kernel material. So it will have to wait.
Cc: stable@vger.kernel.org
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Gao feng <gaofeng@cn.fujitsu.com>
Tested-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
* RCU-delayed freeing of vfsmounts
* vfsmount_lock replaced with a seqlock (mount_lock)
* sequence number from mount_lock is stored in nameidata->m_seq and
used when we exit RCU mode
* new vfsmount flag - MNT_SYNC_UMOUNT. Set by umount_tree() when its
caller knows that vfsmount will have no surviving references.
* synchronize_rcu() done between unlocking namespace_sem in namespace_unlock()
and doing pending mntput().
* new helper: legitimize_mnt(mnt, seq). Checks the mount_lock sequence
number against seq, then grabs reference to mnt. Then it rechecks mount_lock
again to close the race and either returns success or drops the reference it
has acquired. The subtle point is that in case of MNT_SYNC_UMOUNT we can
simply decrement the refcount and sod off - aforementioned synchronize_rcu()
makes sure that final mntput() won't come until we leave RCU mode. We need
that, since we don't want to end up with some lazy pathwalk racing with
umount() and stealing the final mntput() from it - caller of umount() may
expect it to return only once the fs is shut down and we don't want to break
that. In other cases (i.e. with MNT_SYNC_UMOUNT absent) we have to do
full-blown mntput() in case of mount_lock sequence number mismatch happening
just as we'd grabbed the reference, but in those cases we won't be stealing
the final mntput() from anything that would care.
* mntput_no_expire() doesn't lock anything on the fast path now. Incidentally,
SMP and UP cases are handled the same way - no ifdefs there.
* normal pathname resolution does *not* do any writes to mount_lock. It does,
of course, bump the refcounts of vfsmount and dentry in the very end, but that's
it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Instead of passing the direction as argument (and checking it on every
step through the hash chain), just have separate __lookup_mnt() and
__lookup_mnt_last(). And use the standard iterators...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
aka br_write_{lock,unlock} of vfsmount_lock. Inlines in fs/mount.h,
vfsmount_lock extern moved over there as well.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When the rootfs code was a wrapper around ramfs, having them in the same
file made sense. Now that it can wrap another filesystem type, move it in
with the init code instead.
This also allows a subsequent patch to access rootfstype= command line
arg.
Signed-off-by: Rob Landley <rob@landley.net>
Cc: Jeff Layton <jlayton@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Stephen Warren <swarren@nvidia.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jim Cromie <jim.cromie@gmail.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull vfs pile 2 (of many) from Al Viro:
"Mostly Miklos' series this time"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
constify dcache.c inlined helpers where possible
fuse: drop dentry on failed revalidate
fuse: clean up return in fuse_dentry_revalidate()
fuse: use d_materialise_unique()
sysfs: use check_submounts_and_drop()
nfs: use check_submounts_and_drop()
gfs2: use check_submounts_and_drop()
afs: use check_submounts_and_drop()
vfs: check unlinked ancestors before mount
vfs: check submounts and drop atomically
vfs: add d_walk()
vfs: restructure d_genocide()
Pull namespace changes from Eric Biederman:
"This is an assorted mishmash of small cleanups, enhancements and bug
fixes.
The major theme is user namespace mount restrictions. nsown_capable
is killed as it encourages not thinking about details that need to be
considered. A very hard to hit pid namespace exiting bug was finally
tracked and fixed. A couple of cleanups to the basic namespace
infrastructure.
Finally there is an enhancement that makes per user namespace
capabilities usable as capabilities, and an enhancement that allows
the per userns root to nice other processes in the user namespace"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
userns: Kill nsown_capable it makes the wrong thing easy
capabilities: allow nice if we are privileged
pidns: Don't have unshare(CLONE_NEWPID) imply CLONE_THREAD
userns: Allow PR_CAPBSET_DROP in a user namespace.
namespaces: Simplify copy_namespaces so it is clear what is going on.
pidns: Fix hang in zap_pid_ns_processes by sending a potentially extra wakeup
sysfs: Restrict mounting sysfs
userns: Better restrictions on when proc and sysfs can be mounted
vfs: Don't copy mount bind mounts of /proc/<pid>/ns/mnt between namespaces
kernel/nsproxy.c: Improving a snippet of code.
proc: Restrict mounting the proc filesystem
vfs: Lock in place mounts from more privileged users
We check submounts before doing d_drop() on a non-empty directory dentry in
NFS (have_submounts()), but we do not exclude a racing mount. Nor do we
prevent mounts to be added to the disconnected subtree using relative paths
after the d_drop().
This patch fixes these issues by checking for unlinked (unhashed, non-root)
ancestors before proceeding with the mount. This is done with rename
seqlock taken for write and with ->d_lock grabbed on each ancestor in turn,
including our dentry itself. This ensures that the only one of
check_submounts_and_drop() or has_unlinked_ancestor() can succeed.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Christopher reported a regression where he was unable to unmount a NFS
filesystem where the root had gone stale. The problem is that
d_revalidate handles the root of the filesystem differently from other
dentries, but d_weak_revalidate does not. We could simply fix this by
making d_weak_revalidate return success on IS_ROOT dentries, but there
are cases where we do want to revalidate the root of the fs.
A umount is really a special case. We generally aren't interested in
anything but the dentry and vfsmount that's attached at that point. If
the inode turns out to be stale we just don't care since the intent is
to stop using it anyway.
Try to handle this situation better by treating umount as a special
case in the lookup code. Have it resolve the parent using normal
means, and then do a lookup of the final dentry without revalidating
it. In most cases, the final lookup will come out of the dcache, but
the case where there's a trailing symlink or !LAST_NORM entry on the
end complicates things a bit.
Cc: Neil Brown <neilb@suse.de>
Reported-by: Christopher T Vogan <cvogan@us.ibm.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
nsown_capable is a special case of ns_capable essentially for just CAP_SETUID and
CAP_SETGID. For the existing users it doesn't noticably simplify things and
from the suggested patches I have seen it encourages people to do the wrong
thing. So remove nsown_capable.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Rely on the fact that another flavor of the filesystem is already
mounted and do not rely on state in the user namespace.
Verify that the mounted filesystem is not covered in any significant
way. I would love to verify that the previously mounted filesystem
has no mounts on top but there are at least the directories
/proc/sys/fs/binfmt_misc and /sys/fs/cgroup/ that exist explicitly
for other filesystems to mount on top of.
Refactor the test into a function named fs_fully_visible and call that
function from the mount routines of proc and sysfs. This makes this
test local to the filesystems involved and the results current of when
the mounts take place, removing a weird threading of the user
namespace, the mount namespace and the filesystems themselves.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Don't copy bind mounts of /proc/<pid>/ns/mnt between namespaces.
These files hold references to a mount namespace and copying them
between namespaces could result in a reference counting loop.
The current mnt_ns_loop test prevents loops on the assumption that
mounts don't cross between namespaces. Unfortunately unsharing a
mount namespace and shared substrees can both cause mounts to
propogate between mount namespaces.
Add two flags CL_COPY_UNBINDABLE and CL_COPY_MNT_NS_FILE are added to
control this behavior, and CL_COPY_ALL is redefined as both of them.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This should actually be returning an ERR_PTR on error instead of NULL.
That was how it was designed and all the callers expect it.
[AV: actually, that's what "VFS: Make clone_mnt()/copy_tree()/collect_mounts()
return errors" missed - originally collect_mounts() was expected to return
NULL on failure]
Cc: <stable@vger.kernel.org> # 3.10+
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When creating a less privileged mount namespace or propogating mounts
from a more privileged to a less privileged mount namespace lock the
submounts so they may not be unmounted individually in the child mount
namespace revealing what is under them.
This enforces the reasonable expectation that it is not possible to
see under a mount point. Most of the time mounts are on empty
directories and revealing that does not matter, however I have seen an
occassionaly sloppy configuration where there were interesting things
concealed under a mount point that probably should not be revealed.
Expirable submounts are not locked because they will eventually
unmount automatically so whatever is under them already needs
to be safe for unprivileged users to access.
From a practical standpoint these restrictions do not appear to be
significant for unprivileged users of the mount namespace. Recursive
bind mounts and pivot_root continues to work, and mounts that are
created in a mount namespace may be unmounted there. All of which
means that the common idiom of keeping a directory of interesting
files and using pivot_root to throw everything else away continues to
work just fine.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
while list_add(A, B) and list_add(B, A) are equivalent when both A and B
are guaranteed to be empty, the usual idiom is list_add(what, where),
not the other way round... Not a bug per se, but only by accident and
it makes RTFS harder for no good reason.
Spotted-by: Rajat Sharma <fs.rajat@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull VFS updates from Al Viro,
Misc cleanups all over the place, mainly wrt /proc interfaces (switch
create_proc_entry to proc_create(), get rid of the deprecated
create_proc_read_entry() in favor of using proc_create_data() and
seq_file etc).
7kloc removed.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
don't bother with deferred freeing of fdtables
proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
proc: Make the PROC_I() and PDE() macros internal to procfs
proc: Supply a function to remove a proc entry by PDE
take cgroup_open() and cpuset_open() to fs/proc/base.c
ppc: Clean up scanlog
ppc: Clean up rtas_flash driver somewhat
hostap: proc: Use remove_proc_subtree()
drm: proc: Use remove_proc_subtree()
drm: proc: Use minor->index to label things, not PDE->name
drm: Constify drm_proc_list[]
zoran: Don't print proc_dir_entry data in debug
reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
proc: Supply an accessor for getting the data from a PDE's parent
airo: Use remove_proc_subtree()
rtl8192u: Don't need to save device proc dir PDE
rtl8187se: Use a dir under /proc/net/r8180/
proc: Add proc_mkdir_data()
proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
proc: Move PDE_NET() to fs/proc/proc_net.c
...
Split the proc namespace stuff out into linux/proc_ns.h.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: netdev@vger.kernel.org
cc: Serge E. Hallyn <serge.hallyn@ubuntu.com>
cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull vfs fixes from Al Viro:
"A nasty bug in fs/namespace.c caught by Andrey + a couple of less
serious unpleasantness - ecryptfs misc device playing hopeless games
with try_module_get() and palinfo procfs support being... not quite
correctly done, to be polite."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
mnt: release locks on error path in do_loopback
palinfo fixes
procfs: add proc_remove_subtree()
ecryptfs: close rmmod race
... and provide namespace_lock() as a trivial wrapper;
switch to those two consistently.
Result is patterned after rtnl_lock/rtnl_unlock pair.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
global list of release_mounts() fodder, protected by namespace_sem;
eventually, all umount_tree() callers will use it as kill list.
Helper picking the contents of that list, releasing namespace_sem
and doing release_mounts() on what it got.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
do_loopback calls lock_mount(path) and forget to unlock_mount
if clone_mnt or copy_mnt fails.
[ 77.661566] ================================================
[ 77.662939] [ BUG: lock held when returning to user space! ]
[ 77.664104] 3.9.0-rc5+ #17 Not tainted
[ 77.664982] ------------------------------------------------
[ 77.666488] mount/514 is leaving the kernel with locks still held!
[ 77.668027] 2 locks held by mount/514:
[ 77.668817] #0: (&sb->s_type->i_mutex_key#7){+.+.+.}, at: [<ffffffff811cca22>] lock_mount+0x32/0xe0
[ 77.671755] #1: (&namespace_sem){+++++.}, at: [<ffffffff811cca3a>] lock_mount+0x4a/0xe0
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Only allow unprivileged mounts of proc and sysfs if they are already
mounted when the user namespace is created.
proc and sysfs are interesting because they have content that is
per namespace, and so fresh mounts are needed when new namespaces
are created while at the same time proc and sysfs have content that
is shared between every instance.
Respect the policy of who may see the shared content of proc and sysfs
by only allowing new mounts if there was an existing mount at the time
the user namespace was created.
In practice there are only two interesting cases: proc and sysfs are
mounted at their usual places, proc and sysfs are not mounted at all
(some form of mount namespace jail).
Cc: stable@vger.kernel.org
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
As a matter of policy MNT_READONLY should not be changable if the
original mounter had more privileges than creator of the mount
namespace.
Add the flag CL_UNPRIVILEGED to note when we are copying a mount from
a mount namespace that requires more privileges to a mount namespace
that requires fewer privileges.
When the CL_UNPRIVILEGED flag is set cause clone_mnt to set MNT_NO_REMOUNT
if any of the mnt flags that should never be changed are set.
This protects both mount propagation and the initial creation of a less
privileged mount namespace.
Cc: stable@vger.kernel.org
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
When a read-only bind mount is copied from mount namespace in a higher
privileged user namespace to a mount namespace in a lesser privileged
user namespace, it should not be possible to remove the the read-only
restriction.
Add a MNT_LOCK_READONLY mount flag to indicate that a mount must
remain read-only.
CC: stable@vger.kernel.org
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Guarantee that the policy of which files may be access that is
established by setting the root directory will not be violated
by user namespaces by verifying that the root directory points
to the root of the mount namespace at the time of user namespace
creation.
Changing the root is a privileged operation, and as a matter of policy
it serves to limit unprivileged processes to files below the current
root directory.
For reasons of simplicity and comprehensibility the privilege to
change the root directory is gated solely on the CAP_SYS_CHROOT
capability in the user namespace. Therefore when creating a user
namespace we must ensure that the policy of which files may be access
can not be violated by changing the root directory.
Anyone who runs a processes in a chroot and would like to use user
namespace can setup the same view of filesystems with a mount
namespace instead. With this result that this is not a practical
limitation for using user namespaces.
Cc: stable@vger.kernel.org
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
It's safe only under namespace_sem or vfsmount_lock; all places
in fs/namespace.c that want mnt->mnt_ns->user_ns actually want to use
current->nsproxy->mnt_ns->user_ns (note the calls of check_mnt() in
there).
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The compiler may optimize the while loop and make the check just be done once,
so we should use ACCESS_ONCE() to guard access to ->mnt_flags
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Andy Lutomirski <luto@amacapital.net> found a nasty little bug in
the permissions of setns. With unprivileged user namespaces it
became possible to create new namespaces without privilege.
However the setns calls were relaxed to only require CAP_SYS_ADMIN in
the user nameapce of the targed namespace.
Which made the following nasty sequence possible.
pid = clone(CLONE_NEWUSER | CLONE_NEWNS);
if (pid == 0) { /* child */
system("mount --bind /home/me/passwd /etc/passwd");
}
else if (pid != 0) { /* parent */
char path[PATH_MAX];
snprintf(path, sizeof(path), "/proc/%u/ns/mnt");
fd = open(path, O_RDONLY);
setns(fd, 0);
system("su -");
}
Prevent this possibility by requiring CAP_SYS_ADMIN
in the current user namespace when joing all but the user namespace.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Assign a unique proc inode to each namespace, and use that
inode number to ensure we only allocate at most one proc
inode for every namespace in proc.
A single proc inode per namespace allows userspace to test
to see if two processes are in the same namespace.
This has been a long requested feature and only blocked because
a naive implementation would put the id in a global space and
would ultimately require having a namespace for the names of
namespaces, making migration and certain virtualization tricks
impossible.
We still don't have per superblock inode numbers for proc, which
appears necessary for application unaware checkpoint/restart and
migrations (if the application is using namespace file descriptors)
but that is now allowd by the design if it becomes important.
I have preallocated the ipc and uts initial proc inode numbers so
their structures can be statically initialized.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Change return value from -EINVAL to -EPERM when the permission check fails.
Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
- Add a filesystem flag to mark filesystems that are safe to mount as
an unprivileged user.
- Add a filesystem flag to mark filesystems that don't need MNT_NODEV
when mounted by an unprivileged user.
- Relax the permission checks to allow unprivileged users that have
CAP_SYS_ADMIN permissions in the user namespace referred to by the
current mount namespace to be allowed to mount, unmount, and move
filesystems.
Acked-by: "Serge E. Hallyn" <serge@hallyn.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Sharing mount subtress with mount namespaces created by unprivileged
users allows unprivileged mounts created by unprivileged users to
propagate to mount namespaces controlled by privileged users.
Prevent nasty consequences by changing shared subtrees to slave
subtress when an unprivileged users creates a new mount namespace.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This will allow for support for unprivileged mounts in a new user namespace.
Acked-by: "Serge E. Hallyn" <serge@hallyn.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
setns support for the mount namespace is a little tricky as an
arbitrary decision must be made about what to set fs->root and
fs->pwd to, as there is no expectation of a relationship between
the two mount namespaces. Therefore I arbitrarily find the root
mount point, and follow every mount on top of it to find the top
of the mount stack. Then I set fs->root and fs->pwd to that
location. The topmost root of the mount stack seems like a
reasonable place to be.
Bind mount support for the mount namespace inodes has the
possibility of creating circular dependencies between mount
namespaces. Circular dependencies can result in loops that
prevent mount namespaces from every being freed. I avoid
creating those circular dependencies by adding a sequence number
to the mount namespace and require all bind mounts be of a
younger mount namespace into an older mount namespace.
Add a helper function proc_ns_inode so it is possible to
detect when we are attempting to bind mound a namespace inode.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
getname() is intended to copy pathname strings from userspace into a
kernel buffer. The result is just a string in kernel space. It would
however be quite helpful to be able to attach some ancillary info to
the string.
For instance, we could attach some audit-related info to reduce the
amount of audit-related processing needed. When auditing is enabled,
we could also call getname() on the string more than once and not
need to recopy it from userspace.
This patchset converts the getname()/putname() interfaces to return
a struct instead of a string. For now, the struct just tracks the
string in kernel space and the original userland pointer for it.
Later, we'll add other information to the struct as it becomes
convenient.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
normally we deal with lock_mount()/umount races by checking that
mountpoint to be is still in our namespace after lock_mount() has
been done. However, do_add_mount() skips that check when called
with MNT_SHRINKABLE in flags (i.e. from finish_automount()). The
reason is that ->mnt_ns may be a temporary namespace created exactly
to contain automounts a-la NFS4 referral handling. It's not the
namespace of the caller, though, so check_mnt() would fail here.
We still need to check that ->mnt_ns is non-NULL in that case,
though.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Most of places where we want freeze protection coincides with the places where
we also have remount-ro protection. So make mnt_want_write() and
mnt_drop_write() (and their _file alternative) prevent freezing as well.
For the few cases that are really interested only in remount-ro protection
provide new function variants.
BugLink: https://bugs.launchpad.net/bugs/897421
Tested-by: Kamal Mostafa <kamal@canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Massimo Morana <massimo.morana@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add comments describing what the directions "up" and "down" mean and ref count
handling to the VFS mount following family of functions.
Signed-off-by: Valerie Aurora <vaurora@redhat.com> (Original author)
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
copy_tree() can theoretically fail in a case other than ENOMEM, but always
returns NULL which is interpreted by callers as -ENOMEM. Change it to return
an explicit error.
Also change clone_mnt() for consistency and because union mounts will add new
error cases.
Thanks to Andreas Gruenbacher <agruen@suse.de> for a bug fix.
[AV: folded braino fix by Dan Carpenter]
Original-author: Valerie Aurora <vaurora@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Valerie Aurora <valerie.aurora@gmail.com>
Cc: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
don't rely on proc_mounts->m being the first field; container_of()
is there for purpose. No need to bother with ->private, while
we are at it - the same container_of will do nicely.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
it's enough to set ->mnt_ns of internal vfsmounts to something
distinct from all struct mnt_namespace out there; then we can
just use the check for ->mnt_ns != NULL in the fast path of
mntput_no_expire()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
__mnt_make_shortterm() in there undoes the effect of __mnt_make_longterm()
we'd done back when we set ->mnt_ns non-NULL; it should not be done to
vfsmounts that had never gone through commit_tree() and friends. Kudos to
lczerner for catching that one...
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
lglocks and brlocks are currently generated with some complicated macros
in lglock.h. But there's no reason to not just use common utility
functions and put all the data into a common data structure.
In preparation, this patch changes the API to look more like normal
function calls with pointers, not magic macros.
The patch is rather large because I move over all users in one go to keep
it bisectable. This impacts the VFS somewhat in terms of lines changed.
But no actual behaviour change.
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (53 commits)
Kconfig: acpi: Fix typo in comment.
misc latin1 to utf8 conversions
devres: Fix a typo in devm_kfree comment
btrfs: free-space-cache.c: remove extra semicolon.
fat: Spelling s/obsolate/obsolete/g
SCSI, pmcraid: Fix spelling error in a pmcraid_err() call
tools/power turbostat: update fields in manpage
mac80211: drop spelling fix
types.h: fix comment spelling for 'architectures'
typo fixes: aera -> area, exntension -> extension
devices.txt: Fix typo of 'VMware'.
sis900: Fix enum typo 'sis900_rx_bufer_status'
decompress_bunzip2: remove invalid vi modeline
treewide: Fix comment and string typo 'bufer'
hyper-v: Update MAINTAINERS
treewide: Fix typos in various parts of the kernel, and fix some comments.
clockevents: drop unknown Kconfig symbol GENERIC_CLOCKEVENTS_MIGR
gpio: Kconfig: drop unknown symbol 'CS5535_GPIO'
leds: Kconfig: Fix typo 'D2NET_V2'
sound: Kconfig: drop unknown symbol ARCH_CLPS7500
...
Fix up trivial conflicts in arch/powerpc/platforms/40x/Kconfig (some new
kconfig additions, close to removed commented-out old ones)
If there are any inodes on the super block that have been unlinked
(i_nlink == 0) but have not yet been deleted then prevent the
remounting the super block read-only.
Reported-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Currently remouting superblock read-only is racy in a major way.
With the per mount read-only infrastructure it is now possible to
prevent most races, which this patch attempts.
Before starting the remount read-only, iterate through all mounts
belonging to the superblock and if none of them have any pending
writes, set sb->s_readonly_remount. This indicates that remount is in
progress and no further write requests are allowed. If the remount
succeeds set MS_RDONLY and reset s_readonly_remount.
If the remounting is unsuccessful just reset s_readonly_remount.
This can result in transient EROFS errors, despite the fact the
remount failed. Unfortunately hodling off writes is difficult as
remount itself may touch the filesystem (e.g. through load_nls())
which would deadlock.
A later patch deals with delayed writes due to nlink going to zero.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Keep track of vfsmounts belonging to a superblock. List is protected
by vfsmount_lock.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>