Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the expiring_list is empty, we can avoid a costly spinlock in the
rcu-walk path through autofs4_d_manage (once the rest of the path
becomes rcu-walk friendly).
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The variable 'ino' already exists and already has the correct value.
The d_fsdata of a dentry is never changed after the d_fsdata is
instantiated, so this new assignment cannot be necessary.
It was introduced in commit b5b801779d ("autofs4: Add d_manage()
dentry operation").
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On strict build environments we can see:
fs/autofs4/inode.c: In function 'autofs4_fill_super':
fs/autofs4/inode.c:312: error: 'pgrp' may be used uninitialized in this function
make[2]: *** [fs/autofs4/inode.o] Error 1
make[1]: *** [fs/autofs4] Error 2
make: *** [fs] Error 2
make: *** Waiting for unfinished jobs....
This is due to the use of pgrp_set being used to indicate pgrp has has
been set rather than initializing pgrp itself.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
autofs_dev_ioctl_init is only called by __init init_autofs4_fs
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
autofs needs to be able to see private data dentry flags for its dentrys
that are being created but not yet hashed and for its dentrys that have
been rmdir()ed but not yet freed. It needs to do this so it can block
processes in these states until a status has been returned to indicate
the given operation is complete.
It does this by keeping two lists, active and expring, of dentrys in
this state and uses ->d_release() to keep them stable while it checks
the reference count to determine if they should be used.
But with the recent lockref changes dentrys being freed sometimes don't
transition to a reference count of 0 before being freed so autofs can
occassionally use a dentry that is invalid which can lead to a panic.
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There wasn't any check of the size passed from userspace before trying
to allocate the memory required.
This meant that userspace might request more space than allowed,
triggering an OOM.
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The autofs4 module doesn't consider symlinks for expire as it did in the
older autofs v3 module (so it's actually a long standing regression).
The user space daemon has focused on the use of bind mounts instead of
symlinks for a long time now and that's why this has not been noticed.
But with the future addition of amd map parsing to automount(8), not to
mention amd itself (of am-utils), symlink expiry will be needed.
The direct and offset mount types can't be symlinks and the tree mounts of
version 4 were always real mounts so only indirect mounts need expire
symlinks.
Since the current users of the autofs4 module haven't reported this as a
problem to date this patch probably isn't a candidate for backport to
stable.
Signed-off-by: Ian Kent <ikent@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the helper macro !IS_ROOT to replace parent != dentry->d_parent. Just
clean up.
Signed-off-by: Rui Xiang <rui.xiang@huawei.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While kzallocing sbi/ino fails, it should return -ENOMEM.
And it should return the err value from autofs_prepare_pipe.
Signed-off-by: Rui Xiang <rui.xiang@huawei.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The PID and the TGID of the process triggering the mount are sent to the
daemon. Currently the global pid values are sent (ones valid in the
initial pid namespace) but this is wrong if the autofs daemon itself is
not running in the initial pid namespace.
So send the pid values that are valid in the namespace of the autofs
daemon.
The namespace to use is taken from the oz_pgrp pid pointer, which was
set at mount time to the mounting process' pid namespace.
If the pid translation fails (the triggering process is in an unrelated
pid namespace) then the automount fails with ENOENT.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Acked-by: Ian Kent <raven@themaw.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Enable autofs4 to work in a "container". oz_pgrp is converted from
pid_t to struct pid and this is stored at mount time based on the
"pgrp=" option or if the option is missing then the current pgrp.
The "pgrp=" option is interpreted in the PID namespace of the current
process. This option is flawed in that it doesn't carry the namespace
information, so it should be deprecated. AFAICS the autofs daemon
always sends the current pgrp, which is the default anyway.
The oz_pgrp is also set from the AUTOFS_DEV_IOCTL_SETPIPEFD_CMD ioctl.
This ioctl sets oz_pgrp to the current pgrp. It is not allowed to
change the pid namespace.
oz_pgrp is used mainly to determine whether the process traversing the
autofs mount tree is the autofs daemon itself or not. This function now
compares the pid pointers instead of the pid_t values.
One other use of oz_pgrp is in autofs4_show_options. There is shows the
virtual pid number (i.e. the one that is valid inside the PID namespace
of the calling process)
For debugging printk convert oz_pgrp to the value in the initial pid
namespace.
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Acked-by: Ian Kent <raven@themaw.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Don't drop ->wq_mutex before calling autofs4_notify_daemon() only to regain it
there. Besides being pointless, that opens a race window where autofs4_wait_release()
could've come and freed wq->name.name. And do the debugging printk in the "reused an
existing wq" case before dropping ->wq_mutex - the same reason...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Ian Kent <raven@themaw.net>
When reconnecting to automounts at startup an autofs ioctl is used
to find the device and inode of existing mounts so they can be used
to open a file descriptor of possibly covered mounts.
At this time the the caller might not yet "own" the mount so it can
trigger calling ->d_automount(). This causes automount to hang when
trying to reconnect to direct or offset mount types.
Consequently kern_path() can't be used but kern_path_mountpoint() can be.
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Jeff Layton <jlayton@redhat.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When checking if an autofs mount point is busy it isn't sufficient to
only check if it's a mount point.
For example, if the mount of an offset mountpoint in a tree is denied
for this host by its export and the dentry becomes a process working
directory the check incorrectly returns the mount as not in use at
expire.
This can happen since the default when mounting within a tree is
nostrict, which means ingnore mount fails on mounts within the tree and
continue. The nostrict option is meant to allow mounting in this case.
Signed-off-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixed the sparse warning:
fs/autofs4/root.c:411:5: warning: symbol 'autofs4_d_manage' was not declared. Should it be static?"
[ Clearly it should be static as the function is declared static at the
top of root.c. - imk ]
Signed-off-by: Claudiu Ghioc <claudiu.ghioc@gmail.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Modify the request_module to prefix the file system type with "fs-"
and add aliases to all of the filesystems that can be built as modules
to match.
A common practice is to build all of the kernel code and leave code
that is not commonly needed as modules, with the result that many
users are exposed to any bug anywhere in the kernel.
Looking for filesystems with a fs- prefix limits the pool of possible
modules that can be loaded by mount to just filesystems trivially
making things safer with no real cost.
Using aliases means user space can control the policy of which
filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf
with blacklist and alias directives. Allowing simple, safe,
well understood work-arounds to known problematic software.
This also addresses a rare but unfortunate problem where the filesystem
name is not the same as it's module name and module auto-loading
would not work. While writing this patch I saw a handful of such
cases. The most significant being autofs that lives in the module
autofs4.
This is relevant to user namespaces because we can reach the request
module in get_fs_type() without having any special permissions, and
people get uncomfortable when a user specified string (in this case
the filesystem type) goes all of the way to request_module.
After having looked at this issue I don't think there is any
particular reason to perform any filtering or permission checks beyond
making it clear in the module request that we want a filesystem
module. The common pattern in the kernel is to call request_module()
without regards to the users permissions. In general all a filesystem
module does once loaded is call register_filesystem() and go to sleep.
Which means there is not much attack surface exposed by loading a
filesytem module unless the filesystem is mounted. In a user
namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT,
which most filesystems do not set today.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Kees Cook <keescook@chromium.org>
Reported-by: Kees Cook <keescook@google.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Sparse complains:
fs/autofs4/root.c:409:9: sparse: context imbalance in 'autofs4_d_automount' - different lock contexts for basic block
This was introduced by commit f55fb0c243 ("autofs4 - dont clear
DCACHE_NEED_AUTOMOUNT on rootless mount")
The function autofs4_d_automount can be left with the (&sbi->fs_lock)
held if sbi->version <= 4 and simple_empty(dentry) == false so the
warning seems valid.
--> Add an spin_unlock in this case before we jump to done
Unfortunately compile tested only.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Peter Huewe <peterhuewe@gmx.de>
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
According to SUSv3:
[EACCES] Permission denied. An attempt was made to access a file in a way
forbidden by its file access permissions.
[EPERM] Operation not permitted. An attempt was made to perform an operation
limited to processes with appropriate privileges or to the owner of a file
or other resource.
So -EPERM should be returned if capability checks fails.
Strictly speaking this is an API change since the error code user sees is
altered.
Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull user namespace changes from Eric Biederman:
"While small this set of changes is very significant with respect to
containers in general and user namespaces in particular. The user
space interface is now complete.
This set of changes adds support for unprivileged users to create user
namespaces and as a user namespace root to create other namespaces.
The tyranny of supporting suid root preventing unprivileged users from
using cool new kernel features is broken.
This set of changes completes the work on setns, adding support for
the pid, user, mount namespaces.
This set of changes includes a bunch of basic pid namespace
cleanups/simplifications. Of particular significance is the rework of
the pid namespace cleanup so it no longer requires sending out
tendrils into all kinds of unexpected cleanup paths for operation. At
least one case of broken error handling is fixed by this cleanup.
The files under /proc/<pid>/ns/ have been converted from regular files
to magic symlinks which prevents incorrect caching by the VFS,
ensuring the files always refer to the namespace the process is
currently using and ensuring that the ptrace_mayaccess permission
checks are always applied.
The files under /proc/<pid>/ns/ have been given stable inode numbers
so it is now possible to see if different processes share the same
namespaces.
Through the David Miller's net tree are changes to relax many of the
permission checks in the networking stack to allowing the user
namespace root to usefully use the networking stack. Similar changes
for the mount namespace and the pid namespace are coming through my
tree.
Two small changes to add user namespace support were commited here adn
in David Miller's -net tree so that I could complete the work on the
/proc/<pid>/ns/ files in this tree.
Work remains to make it safe to build user namespaces and 9p, afs,
ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so the
Kconfig guard remains in place preventing that user namespaces from
being built when any of those filesystems are enabled.
Future design work remains to allow root users outside of the initial
user namespace to mount more than just /proc and /sys."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (38 commits)
proc: Usable inode numbers for the namespace file descriptors.
proc: Fix the namespace inode permission checks.
proc: Generalize proc inode allocation
userns: Allow unprivilged mounts of proc and sysfs
userns: For /proc/self/{uid,gid}_map derive the lower userns from the struct file
procfs: Print task uids and gids in the userns that opened the proc file
userns: Implement unshare of the user namespace
userns: Implent proc namespace operations
userns: Kill task_user_ns
userns: Make create_new_namespaces take a user_ns parameter
userns: Allow unprivileged use of setns.
userns: Allow unprivileged users to create new namespaces
userns: Allow setting a userns mapping to your current uid.
userns: Allow chown and setgid preservation
userns: Allow unprivileged users to create user namespaces.
userns: Ignore suid and sgid on binaries if the uid or gid can not be mapped
userns: fix return value on mntns_install() failure
vfs: Allow unprivileged manipulation of the mount namespace.
vfs: Only support slave subtrees across different user namespaces
vfs: Add a user namespace reference from struct mnt_namespace
...
For direct (and offset) mounts, if an automounted mount is manually
umounted the trigger mount dentry can appear non-empty causing it to
not trigger mounts. This can also happen if there is a file handle
leak in a user space automounting application.
This happens because, when a ioctl control file handle is opened
on the mount, a cursor dentry is created which causes list_empty()
to see the dentry as non-empty. Since there is a case where listing
the directory of these dentrys is needed, the use of dcache_dir_*()
functions for .open() and .release() is needed.
Consequently simple_empty() must be used instead of list_empty()
when checking for an empty directory.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The DCACHE_NEED_AUTOMOUNT flag is cleared on mount and set on expire
for autofs rootless multi-mount dentrys to prevent unnecessary calls
to ->d_automount().
Since DCACHE_MANAGE_TRANSIT is always set on autofs dentrys ->d_managed()
is always called so the check can be done in ->d_manage() without the
need to change the flag. This still avoids unnecessary calls to
->d_automount(), adds negligible overhead and eliminates a seriously
ugly check in the expire code.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use kuid_t and kgid_t in struct autofs_info and struct autofs_wait_queue.
When creating directories and symlinks default the uid and gid of
the mount requester to the global root uid and gid. autofs4_wait
will update these fields when a mount is requested.
When generating autofsv5 packets report the uid and gid of the mount
requestor in user namespace of the process that opened the pipe,
reporting unmapped uids and gids as overflowuid and overflowgid.
In autofs_dev_ioctl_requester return the uid and gid of the last mount
requester converted into the calling processes user namespace. When the
uid or gid don't map return overflowuid and overflowgid as appropriate,
allowing failure to find a mount requester to be distinguished from
failure to map a mount requester.
The uid and gid mount options specifying the user and group of the
root autofs inode are converted into kuid and kgid as they are parsed
defaulting to the current uid and current gid of the process that
mounts autofs.
Mounting of autofs for the present remains confined to processes in
the initial user namespace.
Cc: Ian Kent <raven@themaw.net>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
In autofs4_d_automount(), if a mount fail occurs the AUTOFS_INF_PENDING
mount pending flag is not cleared.
One effect of this is when using the "browse" option, directory entry
attributes show up with all "?"s due to the incorrect callback and
subsequent failure return (when in fact no callback should be made).
Signed-off-by: Ian Kent <ikent@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The only difference between autofs_dev_ioctl_fd_install() and
fd_install() is __set_close_on_exec() done by the latter. Just
use get_unused_fd_flags(O_CLOEXEC) to allocate the descriptor
and be done with that...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In some cases when an autofs indirect mount is contained in a file
system that is marked as shared (such as when systemd does the
equivalent of "mount --make-rshared /" early in the boot), mounts
stop expiring.
When this happens the first expiry check on a mountpoint dentry in
autofs_expire_indirect() sees a mountpoint dentry with a higher
than minimal reference count. Consequently the dentry is condidered
busy and the actual expiry check is never done.
This particular check was originally meant as an optimisation to
detect a path walk in progress but with the addition of rcu-walk
it can be ineffective anyway.
Removing the test allows automounts to expire again since the
actual expire check doesn't rely on the dentry reference count.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Following a report of a crash during an automount expire I found that
the locking in fs/autofs4/expire.c:get_next_positive_subdir() was wrong.
Not only is the locking wrong but the function is more complex than it
needs to be.
The function is meant to calculate (and dget) the next entry in the list
of directories contained in the root of an autofs mount point (an autofs
indirect mount to be precise). The main problem was that the d_lock of
the owner of the list was not being taken when walking the list, which
lead to list corruption under load. The only other lock that needs to
be taken is against the next dentry candidate so it can be checked for
usability.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Just the flags; only NFS cares even about that, but there are
legitimate uses for such argument. And getting rid of that
completely would require splitting ->lookup() into a couple
of methods (at least), so let's leave that alone for now...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJPw2J/AAoJECvKgwp+S8Ja5jkP/3uMxkhf8XQpXCI3O1QVfaQr
uZFfM8sINqIPDVm1dtFjFj7f8Bw9mhE2KAnnJ1rKT8tQwqq9yAse1QPlhCG1ZqoP
+AnMDDXHtx7WmQZXhBvS9b+unpZ7Jr6r6pO5XrmTL2kRL3YJPUhZ2+xbTT5belTB
KoAu4WqORZRxfXoC76S7U8K+D4NcAGhAOxCClsIjmY+oocCiCag4FZOyzYIFViqc
ghUN/+rLQ3fqGGv2yO7Ylx1gUM7sxIwkZQ/h962jFAtxz9czImr2NmRoMliOaOkS
tvcnIf+E3u0n/zIjzFvzhxKgHJPP8PkcPMk60d3jKmFngBkqFTzNUeVTP8md7HrV
4DlXisWr+z7YVyWUCFaNcJLmjiWSwQ8DV/clRLobeBf9EJKan5F1PjFgl6PLJM5F
Qr1+LHMNaetdulBwMRTyveZTzYqw9RmDnD9dWMo4mX/kTpvtC4jTPVV7hkRD+Qlv
5vTRR+VXL3Q50yClLf0AQMSKTnH2gBuepM/b+7cShLGfsMln8DtUjmbigv+niL63
BibcCIbIlP2uWGnl37VhsC34AT+RKt3lggrBOpn/7XJMq/wKR7IRP/7V9TfYgaUN
NBa+wtnLDa1pZEn/X7izdcQP62PzDtmB+ObvYT0Yb40A4+2ud3qF/lB53c1A1ewF
/9c4zxxekjHZnn2oooEa
=oLXf
-----END PGP SIGNATURE-----
Merge tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux
Pull writeback tree from Wu Fengguang:
"Mainly from Jan Kara to avoid iput() in the flusher threads."
* tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
writeback: Avoid iput() from flusher thread
vfs: Rename end_writeback() to clear_inode()
vfs: Move waiting for inode writeback from end_writeback() to evict_inode()
writeback: Refactor writeback_single_inode()
writeback: Remove wb->list_lock from writeback_single_inode()
writeback: Separate inode requeueing after writeback
writeback: Move I_DIRTY_PAGES handling
writeback: Move requeueing when I_SYNC set to writeback_sb_inodes()
writeback: Move clearing of I_SYNC into inode_sync_complete()
writeback: initialize global_dirty_limit
fs: remove 8 bytes of padding from struct writeback_control on 64 bit builds
mm: page-writeback.c: local functions should not be exposed globally
After we moved inode_sync_wait() from end_writeback() it doesn't make sense
to call the function end_writeback() anymore. Rename it to clear_inode()
which well says what the function really does - set I_CLEAR flag.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
The autofs packet size has had a very unfortunate size problem on x86:
because the alignment of 'u64' differs in 32-bit and 64-bit modes, and
because the packet data was not 8-byte aligned, the size of the autofsv5
packet structure differed between 32-bit and 64-bit modes despite
looking otherwise identical (300 vs 304 bytes respectively).
We first fixed that up by making the 64-bit compat mode know about this
problem in commit a32744d4ab ("autofs: work around unhappy compat
problem on x86-64"), and that made a 32-bit 'systemd' work happily on a
64-bit kernel because everything then worked the same way as on a 32-bit
kernel.
But it turned out that 'automount' had actually known and worked around
this problem in user space, so fixing the kernel to do the proper 32-bit
compatibility handling actually *broke* 32-bit automount on a 64-bit
kernel, because it knew that the packet sizes were wrong and expected
those incorrect sizes.
As a result, we ended up reverting that compatibility mode fix, and
thus breaking systemd again, in commit fcbf94b9de.
With both automount and systemd doing a single read() system call, and
verifying that they get *exactly* the size they expect but using
different sizes, it seemed that fixing one of them inevitably seemed to
break the other. At one point, a patch I seriously considered applying
from Michael Tokarev did a "strcmp()" to see if it was automount that
was doing the operation. Ugly, ugly.
However, a prettier solution exists now thanks to the packetized pipe
mode. By marking the communication pipe as being packetized (by simply
setting the O_DIRECT flag), we can always just write the bigger packet
size, and if user-space does a smaller read, it will just get that
partial end result and the extra alignment padding will simply be thrown
away.
This makes both automount and systemd happy, since they now get the size
they asked for, and the kernel side of autofs simply no longer needs to
care - it could pad out the packet arbitrarily.
Of course, if there is some *other* user of autofs (please, please,
please tell me it ain't so - and we haven't heard of any) that tries to
read the packets with multiple writes, that other user will now be
broken - the whole point of the packetized mode is that one system call
gets exactly one packet, and you cannot read a packet in pieces.
Tested-by: Michael Tokarev <mjt@tls.msk.ru>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: David Miller <davem@davemloft.net>
Cc: Ian Kent <raven@themaw.net>
Cc: Thomas Meyer <thomas@m3y3r.de>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit a32744d4ab.
While that commit was technically the right thing to do, and made the
x86-64 compat mode work identically to native 32-bit mode (and thus
fixing the problem with a 32-bit systemd install on a 64-bit kernel), it
turns out that the automount binaries had workarounds for this compat
problem.
Now, the workarounds are disgusting: doing an "uname()" to find out the
architecture of the kernel, and then comparing it for the 64-bit cases
and fixing up the size of the read() in automount for those. And they
were confused: it's not actually a generic 64-bit issue at all, it's
very much tied to just x86-64, which has different alignment for an
'u64' in 64-bit mode than in 32-bit mode.
But the end result is that fixing the compat layer actually breaks the
case of a 32-bit automount on a x86-64 kernel.
There are various approaches to fix this (including just doing a
"strcmp()" on current->comm and comparing it to "automount"), but I
think that I will do the one that teaches pipes about a special "packet
mode", which will allow user space to not have to care too deeply about
the padding at the end of the autofs packet.
That change will make the compat workaround unnecessary, so let's revert
it first, and get automount working again in compat mode. The
packetized pipes will then fix autofs for systemd.
Reported-and-requested-by: Michael Tokarev <mjt@tls.msk.ru>
Cc: Ian Kent <raven@themaw.net>
Cc: stable@kernel.org # for 3.3
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull x32 support for x86-64 from Ingo Molnar:
"This tree introduces the X32 binary format and execution mode for x86:
32-bit data space binaries using 64-bit instructions and 64-bit kernel
syscalls.
This allows applications whose working set fits into a 32 bits address
space to make use of 64-bit instructions while using a 32-bit address
space with shorter pointers, more compressed data structures, etc."
Fix up trivial context conflicts in arch/x86/{Kconfig,vdso/vma.c}
* 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
x32: Fix alignment fail in struct compat_siginfo
x32: Fix stupid ia32/x32 inversion in the siginfo format
x32: Add ptrace for x32
x32: Switch to a 64-bit clock_t
x32: Provide separate is_ia32_task() and is_x32_task() predicates
x86, mtrr: Use explicit sizing and padding for the 64-bit ioctls
x86/x32: Fix the binutils auto-detect
x32: Warn and disable rather than error if binutils too old
x32: Only clear TIF_X32 flag once
x32: Make sure TS_COMPAT is cleared for x32 tasks
fs: Remove missed ->fds_bits from cessation use of fd_set structs internally
fs: Fix close_on_exec pointer in alloc_fdtable
x32: Drop non-__vdso weak symbols from the x32 VDSO
x32: Fix coding style violations in the x32 VDSO code
x32: Add x32 VDSO support
x32: Allow x32 to be configured
x32: If configured, add x32 system calls to system call tables
x32: Handle process creation
x32: Signal-related system calls
x86: Add #ifdef CONFIG_COMPAT to <asm/sys_ia32.h>
...
When the autofs protocol version 5 packet type was added in commit
5c0a32fc2c ("autofs4: add new packet type for v5 communications"), it
obvously tried quite hard to be word-size agnostic, and uses explicitly
sized fields that are all correctly aligned.
However, with the final "char name[NAME_MAX+1]" array at the end, the
actual size of the structure ends up being not very well defined:
because the struct isn't marked 'packed', doing a "sizeof()" on it will
align the size of the struct up to the biggest alignment of the members
it has.
And despite all the members being the same, the alignment of them is
different: a "__u64" has 4-byte alignment on x86-32, but native 8-byte
alignment on x86-64. And while 'NAME_MAX+1' ends up being a nice round
number (256), the name[] array starts out a 4-byte aligned.
End result: the "packed" size of the structure is 300 bytes: 4-byte, but
not 8-byte aligned.
As a result, despite all the fields being in the same place on all
architectures, sizeof() will round up that size to 304 bytes on
architectures that have 8-byte alignment for u64.
Note that this is *not* a problem for 32-bit compat mode on POWER, since
there __u64 is 8-byte aligned even in 32-bit mode. But on x86, 32-bit
and 64-bit alignment is different for 64-bit entities, and as a result
the structure that has exactly the same layout has different sizes.
So on x86-64, but no other architecture, we will just subtract 4 from
the size of the structure when running in a compat task. That way we
will write the properly sized packet that user mode expects.
Not pretty. Sadly, this very subtle, and unnecessary, size difference
has been encoded in user space that wants to read packets of *exactly*
the right size, and will refuse to touch anything else.
Reported-and-tested-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wrap accesses to the fd_sets in struct fdtable (for recording open files and
close-on-exec flags) so that we can move away from using fd_sets since we
abuse the fd_set structs by not allocating the full-sized structure under
normal circumstances and by non-core code looking at the internals of the
fd_sets.
The first abuse means that use of FD_ZERO() on these fd_sets is not permitted,
since that cannot be told about their abnormal lengths.
This introduces six wrapper functions for setting, clearing and testing
close-on-exec flags and fd-is-open flags:
void __set_close_on_exec(int fd, struct fdtable *fdt);
void __clear_close_on_exec(int fd, struct fdtable *fdt);
bool close_on_exec(int fd, const struct fdtable *fdt);
void __set_open_fd(int fd, struct fdtable *fdt);
void __clear_open_fd(int fd, struct fdtable *fdt);
bool fd_is_open(int fd, const struct fdtable *fdt);
Note that I've prepended '__' to the names of the set/clear functions because
they require the caller to hold a lock to use them.
Note also that I haven't added wrappers for looking behind the scenes at the
the array. Possibly that should exist too.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: http://lkml.kernel.org/r/20120216174942.23314.1364.stgit@warthog.procyon.org.uk
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
When recursing down the locks when traversing a tree/list in
get_next_positive_dentry() or get_next_positive_subdir() a lock can
change from being nested to being a parent which breaks lockdep. This
patch tells lockdep about what we did.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
I don't know how I missed this obvious mistake when I
reviewed Als' patches, sorry.
[ Quoting Al:
Grr... Note to self: do git status *and* git stash show -p
before git push. Nothing like "WTF? I'd fixed that braino"
feeling ;-/
Al sent the same patch - it got broken in commit d668dc5663:
"autofs4: deal with autofs4_write/autofs4_write races". ]
Reported-and-tested-by: Dave Airlie <airlied@redhat.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Just serialize the actual writing of packets into pipe on
a new mutex, independent from everything else in the locking
hierarchy. As soon as something has started feeding a piece
of packet into the pipe to daemon, we *want* everything else
about to try the same to wait until we are done.
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>