kernel-ark

Author	SHA1	Message	Date
Andi Kleen	962830df36	brlocks/lglocks: API cleanups lglocks and brlocks are currently generated with some complicated macros in lglock.h. But there's no reason to not just use common utility functions and put all the data into a common data structure. In preparation, this patch changes the API to look more like normal function calls with pointers, not magic macros. The patch is rather large because I move over all users in one go to keep it bisectable. This impacts the VFS somewhat in terms of lines changed. But no actual behaviour change. [akpm@linux-foundation.org: checkpatch fixes] Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-05-29 23:28:41 -04:00
Linus Torvalds	36126f8f2e	word-at-a-time: make the interfaces truly generic This changes the interfaces in <asm/word-at-a-time.h> to be a bit more complicated, but a lot more generic. In particular, it allows us to really do the operations efficiently on both little-endian and big-endian machines, pretty much regardless of machine details. For example, if you can rely on a fast population count instruction on your architecture, this will allow you to make your optimized <asm/word-at-a-time.h> file with that. NOTE! The "generic" version in include/asm-generic/word-at-a-time.h is not truly generic, it actually only works on big-endian. Why? Because on little-endian the generic algorithms are wasteful, since you can inevitably do better. The x86 implementation is an example of that. (The only truly non-generic part of the asm-generic implementation is the "find_zero()" function, and you could make a little-endian version of it. And if the Kbuild infrastructure allowed us to pick a particular header file, that would be lovely) The <asm/word-at-a-time.h> functions are as follows: - WORD_AT_A_TIME_CONSTANTS: specific constants that the algorithm uses. - has_zero(): take a word, and determine if it has a zero byte in it. It gets the word, the pointer to the constant pool, and a pointer to an intermediate "data" field it can set. This is the "quick-and-dirty" zero tester: it's what is run inside the hot loops. - "prep_zero_mask()": take the word, the data that has_zero() produced, and the constant pool, and generate an exact mask of which byte had the first zero. This is run directly outside the loop, and allows the "has_zero()" function to answer the "is there a zero byte" question without necessarily getting exactly which byte is the first one to contain a zero. If you do multiple byte lookups concurrently (eg "hash_name()", which looks for both NUL and '/' bytes), after you've done the prep_zero_mask() phase, the result of those can be or'ed together to get the "either or" case. - The result from "prep_zero_mask()" can then be fed into "find_zero()" (to find the byte offset of the first byte that was zero) or into "zero_bytemask()" (to find the bytemask of the bytes preceding the zero byte). The existence of zero_bytemask() is optional, and is not necessary for the normal string routines. But dentry name hashing needs it, so if you enable DENTRY_WORD_AT_A_TIME you need to expose it. This changes the generic strncpy_from_user() function and the dentry hashing functions to use these modified word-at-a-time interfaces. This gets us back to the optimized state of the x86 strncpy that we lost in the previous commit when moving over to the generic version. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-05-26 11:33:40 -07:00
Linus Torvalds	ce004178be	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc Pull sparc changes from David S. Miller: "This has the generic strncpy_from_user() implementation architectures can now use, which we've been developing on linux-arch over the past few days. For good measure I ran both a 32-bit and a 64-bit glibc testsuite run, and the latter of which pointed out an adjustment I needed to make to sparc's user_addr_max() definition. Linus, you were right, STACK_TOP was not the right thing to use, even on sparc itself :-) From Sam Ravnborg, we have a conversion of sparc32 over to the common alloc_thread_info_node(), since the aspect which originally blocked our doing so (sun4c) has been removed." Fix up trivial arch/sparc/Kconfig and lib/Makefile conflicts. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc: sparc: Fix user_addr_max() definition. lib: Sparc's strncpy_from_user is generic enough, move under lib/ kernel: Move REPEAT_BYTE definition into linux/kernel.h sparc: Increase portability of strncpy_from_user() implementation. sparc: Optimize strncpy_from_user() zero byte search. sparc: Add full proper error handling to strncpy_from_user(). sparc32: use the common implementation of alloc_thread_info_node()	2012-05-24 15:10:28 -07:00
David S. Miller	446969084d	kernel: Move REPEAT_BYTE definition into linux/kernel.h And make sure that everything using it explicitly includes that header file. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-05-24 13:10:05 -07:00
Linus Torvalds	644473e9c6	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull user namespace enhancements from Eric Biederman: "This is a course correction for the user namespace, so that we can reach an inexpensive, maintainable, and reasonably complete implementation. Highlights: - Config guards make it impossible to enable the user namespace and code that has not been converted to be user namespace safe. - Use of the new kuid_t type ensures the if you somehow get past the config guards the kernel will encounter type errors if you enable user namespaces and attempt to compile in code whose permission checks have not been updated to be user namespace safe. - All uids from child user namespaces are mapped into the initial user namespace before they are processed. Removing the need to add an additional check to see if the user namespace of the compared uids remains the same. - With the user namespaces compiled out the performance is as good or better than it is today. - For most operations absolutely nothing changes performance or operationally with the user namespace enabled. - The worst case performance I could come up with was timing 1 billion cache cold stat operations with the user namespace code enabled. This went from 156s to 164s on my laptop (or 156ns to 164ns per stat operation). - (uid_t)-1 and (gid_t)-1 are reserved as an internal error value. Most uid/gid setting system calls treat these value specially anyway so attempting to use -1 as a uid would likely cause entertaining failures in userspace. - If setuid is called with a uid that can not be mapped setuid fails. I have looked at sendmail, login, ssh and every other program I could think of that would call setuid and they all check for and handle the case where setuid fails. - If stat or a similar system call is called from a context in which we can not map a uid we lie and return overflowuid. The LFS experience suggests not lying and returning an error code might be better, but the historical precedent with uids is different and I can not think of anything that would break by lying about a uid we can't map. - Capabilities are localized to the current user namespace making it safe to give the initial user in a user namespace all capabilities. My git tree covers all of the modifications needed to convert the core kernel and enough changes to make a system bootable to runlevel 1." Fix up trivial conflicts due to nearby independent changes in fs/stat.c * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits) userns: Silence silly gcc warning. cred: use correct cred accessor with regards to rcu read lock userns: Convert the move_pages, and migrate_pages permission checks to use uid_eq userns: Convert cgroup permission checks to use uid_eq userns: Convert tmpfs to use kuid and kgid where appropriate userns: Convert sysfs to use kgid/kuid where appropriate userns: Convert sysctl permission checks to use kuid and kgids. userns: Convert proc to use kuid/kgid where appropriate userns: Convert ext4 to user kuid/kgid where appropriate userns: Convert ext3 to use kuid/kgid where appropriate userns: Convert ext2 to use kuid/kgid where appropriate. userns: Convert devpts to use kuid/kgid where appropriate userns: Convert binary formats to use kuid/kgid where appropriate userns: Add negative depends on entries to avoid building code that is userns unsafe userns: signal remove unnecessary map_cred_ns userns: Teach inode_capable to understand inodes whose uids map to other namespaces. userns: Fail exec for suid and sgid binaries with ids outside our user namespace. userns: Convert stat to return values mapped from kuids and kgids userns: Convert user specfied uids and gids in chown into kuids and kgid userns: Use uid_eq gid_eq helpers when comparing kuids and kgids in the vfs ...	2012-05-23 17:42:39 -07:00
Linus Torvalds	31ed8e6f93	Merge branch 'dentry-cleanups' (dcache access cleanups and optimizations) This branch simplifies and clarifies the dcache lookup, and allows us to do certain nice optimizations when comparing dentries. It also cleans up the interface to __d_lookup_rcu(), especially around passing the inode information around. * dentry-cleanups: vfs: make it possible to access the dentry hash/len as one 64-bit entry vfs: move dentry name length comparison from dentry_cmp() into callers vfs: do the careful dentry name access for all dentry_cmp cases vfs: remove unnecessary d_unhashed() check from __d_lookup_rcu vfs: clean up __d_lookup_rcu() and dentry_cmp() interfaces	2012-05-21 08:50:57 -07:00
Linus Torvalds	7e5cb5e151	Merge branch 'vfs-cleanups' (random vfs cleanups) This teaches vfs_fstat() to use the appropriate f[get\|put]_light functions, allowing it to avoid some unnecessary locking for the common case. More noticeably, it also cleans up and simplifies the "getname_flags()" function, which now relies on the architecture strncpy_from_user() doing all the user access checks properly, instead of hacking around the fact that on x86 it didn't use to do it right (see commit `92ae03f2ef`: "x86: merge 32/64-bit versions of 'strncpy_from_user()' and speed it up"). * vfs-cleanups: VFS: make vfs_fstat() use f[get\|put]_light() VFS: clean up and simplify getname_flags() x86: make word-at-a-time strncpy_from_user clear bytes at the end	2012-05-21 08:46:08 -07:00
Linus Torvalds	12f8ad4b05	vfs: clean up __d_lookup_rcu() and dentry_cmp() interfaces The calling conventions for __d_lookup_rcu() and dentry_cmp() are annoying in different ways, and there is actually one single underlying reason for both of the annoyances. The fundamental reason is that we do the returned dentry sequence number check inside __d_lookup_rcu() instead of doing it in the caller. This results in two annoyances: - __d_lookup_rcu() now not only needs to return the dentry and the sequence number that goes along with the lookup, it also needs to return the inode pointer that was validated by that sequence number check. - and because we did the sequence number check early (to validate the name pointer and length) we also couldn't just pass the dentry itself to dentry_cmp(), we had to pass the counted string that contained the name. So that sequence number decision caused two separate ugly calling conventions. Both of these problems would be solved if we just did the sequence number check in the caller instead. There's only one caller, and that caller already has to do the sequence number check for the parent anyway, so just do that. That allows us to stop returning the dentry->d_inode in that in-out argument (pointer-to-pointer-to-inode), so we can make the inode argument just a regular input inode pointer. The caller can just load the inode from dentry->d_inode, and then do the sequence number check after that to make sure that it's synchronized with the name we looked up. And it allows us to just pass in the dentry to dentry_cmp(), which is what all the callers really wanted. Sure, dentry_cmp() has to be a bit careful about the dentry (which is not stable during RCU lookup), but that's actually very simple. And now that dentry_cmp() can clearly see that the first string argument is a dentry, we can use the direct word access for that, instead of the careful unaligned zero-padding. The dentry name is always properly aligned, since it is a single path component that is either embedded into the dentry itself, or was allocated with kmalloc() (see __d_alloc). Finally, this also uninlines the nasty slow-case for dentry comparisons: that one does need to do a sequence number check, since it will call in to the low-level filesystems, and we want to give those a stable inode pointer and path component length/start arguments. Doing an extra sequence check for that slow case is not a problem, though. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-05-04 18:21:14 -07:00
Linus Torvalds	e419b4cc58	vfs: make word-at-a-time accesses handle a non-existing page It turns out that there are more cases than CONFIG_DEBUG_PAGEALLOC that can have holes in the kernel address space: it seems to happen easily with Xen, and it looks like the AMD gart64 code will also punch holes dynamically. Actually hitting that case is still very unlikely, so just do the access, and take an exception and fix it up for the very unlikely case of it being a page-crosser with no next page. And hey, this abstraction might even help other architectures that have other issues with unaligned word accesses than the possible missing next page. IOW, this could do the byte order magic too. Peter Anvin fixed a thinko in the shifting for the exception case. Reported-and-tested-by: Jana Saout <jana@saout.de> Cc: Peter Anvin <hpa@zytor.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-05-03 14:01:40 -07:00
Eric W. Biederman	8e96e3b7b8	userns: Use uid_eq gid_eq helpers when comparing kuids and kgids in the vfs Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2012-05-03 03:29:34 -07:00
Linus Torvalds	3f9f0aa687	VFS: clean up and simplify getname_flags() This removes a number of silly games around strncpy_from_user() in do_getname(), and removes that helper function entirely. We instead make getname_flags() just use strncpy_from_user() properly directly. Removing the wrapper function simplifies things noticeably, mostly because we no longer play the unnecessary games with segments (x86 strncpy_from_user() no longer needs the hack), but also because the empty path handling is just much more obvious. The return value of "strncpy_to_user()" is much more obvious than checking an odd error return case from do_getname(). [ non-x86 architectures were notified of this change several weeks ago, since it is possible that they have copied the old broken x86 strncpy_from_user. But nobody reacted, so .. See http://www.spinics.net/lists/linux-arch/msg17313.html for details ] Cc: linux-arch@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-04-28 14:38:32 -07:00
Eric W. Biederman	1a48e2ac03	userns: Replace the hard to write inode_userns with inode_capable. This represents a change in strategy of how to handle user namespaces. Instead of tagging everything explicitly with a user namespace and bulking up all of the comparisons of uids and gids in the kernel, all uids and gids in use will have a mapping to a flat kuid and kgid spaces respectively. This allows much more of the existing logic to be preserved and in general allows for faster code. In this new and improved world we allow someone to utiliize capabilities over an inode if the inodes owner mapps into the capabilities holders user namespace and the user has capabilities in their user namespace. Which is simple and efficient. Moving the fs uid comparisons to be comparisons in a flat kuid space follows in later patches, something that is only significant if you are using user namespaces. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2012-04-07 17:02:46 -07:00
Linus Torvalds	f68e556e23	Make the "word-at-a-time" helper functions more commonly usable I have a new optimized x86 "strncpy_from_user()" that will use these same helper functions for all the same reasons the name lookup code uses them. This is preparation for that. This moves them into an architecture-specific header file. It's architecture-specific for two reasons: - some of the functions are likely to want architecture-specific implementations. Even if the current code happens to be "generic" in the sense that it should work on any little-endian machine, it's likely that the "multiply by a big constant and shift" implementation is less than optimal for an architecture that has a guaranteed fast bit count instruction, for example. - I expect that if architectures like sparc want to start playing around with this, we'll need to abstract out a few more details (in particular the actual unaligned accesses). So we're likely to have more architecture-specific stuff if non-x86 architectures start using this. (and if it turns out that non-x86 architectures don't start using this, then having it in an architecture-specific header is still the right thing to do, of course) Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-04-06 13:54:56 -07:00
Eric W. Biederman	975d6b3932	vfs: Don't allow a user namespace root to make device nodes Safely making device nodes in a container is solvable but simply having the capability in a user namespace is not sufficient to make this work. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2012-04-03 04:28:51 -07:00
J. Bruce Fields	c0d0259481	vfs: fix out-of-date dentry_unhash() comment `64252c75a2` "vfs: remove dget() from dentry_unhash()" changed the implementation but not the comment. Cc: Sage Weil <sage@newdream.net> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-03-31 16:03:17 -04:00
Miklos Szeredi	bad6118978	vfs: split __lookup_hash Split __lookup_hash into two component functions: lookup_dcache - tries cached lookup, returns whether real lookup is needed lookup_real - calls i_op->lookup This eliminates code duplication between d_alloc_and_lookup() and d_inode_lookup(). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-03-31 16:03:17 -04:00
Al Viro	81e6f52089	untangling do_lookup() - take __lookup_hash()-calling case out of line. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-03-31 16:03:17 -04:00
Al Viro	a32555466c	untangling do_lookup() - switch to calling __lookup_hash() now we have __lookup_hash() open-coded if !dentry case; just call the damn thing instead... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-03-31 16:03:16 -04:00
Al Viro	a6ecdfcfba	untangling do_lookup() - merge d_alloc_and_lookup() callers Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-03-31 16:03:16 -04:00
Al Viro	ec335e91a4	untangling do_lookup() - merge failure exits in !dentry case Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-03-31 16:03:16 -04:00
Al Viro	d774a058d9	untangling do_lookup() - massage !dentry case towards __lookup_hash() Reorder if-else cases for starters... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-03-31 16:03:16 -04:00
Al Viro	08b0ab7c20	untangling do_lookup() - get rid of need_reval in !dentry case Everything arriving into if (!dentry) will have need_reval = 1. Indeed, the only way to get there with need_reval reset to 0 would be via if (unlikely(d_need_lookup(dentry))) goto unlazy; if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE)) { status = d_revalidate(dentry, nd); if (unlikely(status <= 0)) { if (status != -ECHILD) need_reval = 0; goto unlazy; ... unlazy: /* no assignments to dentry */ if (dentry && unlikely(d_need_lookup(dentry))) { dput(dentry); dentry = NULL; } and if d_need_lookup() had already been false the first time around, it will remain false on the second call as well. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-03-31 16:03:16 -04:00
Al Viro	acc9cb3cd4	untangling do_lookup() - eliminate a loop. d_lookup() will fail after successful d_invalidate(), if we are holding i_mutex all along. IOW, we don't need to jump back to l: - we know what path will be taken there and can do that (i.e. d_alloc_and_lookup()) directly. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-03-31 16:03:16 -04:00
Al Viro	37c17e1f37	untangling do_lookup() - expand the area under ->i_mutex keep holding ->i_mutex over revalidation parts Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-03-31 16:03:16 -04:00
Al Viro	3f6c7c71a2	untangling do_lookup() - isolate !dentry stuff from the rest of it. Duplicate the revalidation-related parts into if (!dentry) branch. Next step will be to pull them under i_mutex. This and the next 8 commits are more or less a splitup of patch by Miklos; folks, when you are working with something that convoluted, carve your patches up into easily reviewed steps, especially when a lot of codepaths involved are rarely hit... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-03-31 16:03:16 -04:00
Miklos Szeredi	cda309de25	vfs: move MAY_EXEC check from __lookup_hash() The only caller of __lookup_hash() that needs the exec permission check on parent is lookup_one_len(). All lookup_hash() callers already checked permission in LOOKUP_PARENT walk. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-03-31 16:03:16 -04:00
Miklos Szeredi	3637c05d88	vfs: don't revalidate just looked up dentry __lookup_hash() calls ->lookup() if the dentry needs lookup and on success revalidates the dentry (all under dir->i_mutex). While this is harmless it doesn't make a lot of sense. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-03-31 16:03:16 -04:00
Miklos Szeredi	fa4ee15951	vfs: fix d_need_lookup/d_revalidate order in do_lookup Doing revalidate on a dentry which has not yet been looked up makes no sense. Move the d_need_lookup() check before d_revalidate(). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-03-31 16:03:16 -04:00
Linus Torvalds	11bcb32848	The following text was taken from the original review request: "[PATCH 0/3] RFC - module.h usage cleanups in fs/ and lib/" https://lkml.org/lkml/2012/2/29/589 -- Fix up files in fs/ and lib/ dirs to only use module.h if they really need it. These are trivial in scope vs. the work done previously. We now have things where any few remaining cleanups can be farmed out to arch or subsystem maintainers, and I have done so when possible. What is remaining here represents the bits that don't clearly lie within a single arch/subsystem boundary, like the fs dir and the lib dir. Some duplicate includes arising from overlapping fixes from independent subsystem maintainer submissions are also quashed. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) iQIcBAABAgAGBQJPbNw3AAoJEOvOhAQsB9HWA7wQALrsQ6V6Z+B3KsvSoD5kFnpZ Y+4uggs+GdUdWmtRrZnTBp896gGuUgBxc3syA2XWd7Oqi49+c5c1m0cFxKyVdIHm fB+jmxS69soADtHR3cXmxcQshrUzUf2rTn8frcw4O/BmJuplv4xT9uPQzwGaRSZT gomQsQ1bGnkwjO2jfS8f/N5Mjr8u/z0WF7TTOTUSq+Cv3BervPaSPF1Ea6J8oo+N 4+/n8RlU1HWiI4inrgrFPN6UHmE45BAL2xGbB47LgooHJW8P5kAnU+vxGScaoy1Q JKX9WKT3VCiwR3VOPa86iLKP3Y8a3VlhyGn+yzzcYkGX/n0tbT7aoRhQm21sGIv0 DoeXWe7aiiY8cEW69G6GIfRPFl+Zh81m1Whbu7IZT/sV3asx6jWmEXE8CgCfeDt5 mNQk9D4Irf6+rmCSbeSVC4L0eFfLxNFouNyh2aus/q+gIjKNKYwZQryHrodK4wpv UgMKSTZfPrTAWay2gCNWNqo3Zs8e1LDqkftetxeU3jx2kTuaNzBl4Y7mhsX7sLYe MsFX3JUJ2pn6XWbgqcY+bdr/mzgsCrjzqdf15MTUzEc5SIfVF+XpNNZN1ITwl6UA /ZH9keBu1mEdCoPU5W74kYwx4p35hIeWJGfc0MRp07ruf941F+SBgMD11B0+06f0 pN0DcITTkD16+sS4x1cB =Z4w0 -----END PGP SIGNATURE----- Merge tag 'module-for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux Pull cleanup of fs/ and lib/ users of module.h from Paul Gortmaker: "Fix up files in fs/ and lib/ dirs to only use module.h if they really need it. These are trivial in scope vs the work done previously. We now have things where any few remaining cleanups can be farmed out to arch or subsystem maintainers, and I have done so when possible. What is remaining here represents the bits that don't clearly lie within a single arch/subsystem boundary, like the fs dir and the lib dir. Some duplicate includes arising from overlapping fixes from independent subsystem maintainer submissions are also quashed." Fix up trivial conflicts due to clashes with other include file cleanups (including some due to the previous bug.h cleanup pull). * tag 'module-for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: lib: reduce the use of module.h wherever possible fs: reduce the use of module.h wherever possible includecheck: delete any duplicate instances of module.h	2012-03-24 10:24:31 -07:00
Linus Torvalds	f7493e5d9c	vfs: tidy up sparse warnings in fs/namei.c While doing the fs/namei.c cleanups, I ran sparse on it, and it pointed out other large integers and a couple of cases of us using '0' instead of the proper 'NULL'. Sparse still doesn't understand some of the conditional locking going on, but that's no excuse for not fixing up the trivial stuff. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-03-22 16:10:40 -07:00
Linus Torvalds	989412bbd2	vfs: tidy up fs/namei.c byte-repeat word constants In commit commit `1de5b41cd3` ("fs/namei.c: fix warnings on 32-bit") Andrew said that there must be a tidier way of doing this. This is that tidier way. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-03-22 15:58:27 -07:00
Al Viro	f132c5be05	Fix full_name_hash() behaviour when length is a multiple of 8 We want it to match what hash_name() is doing, which means extra multiply by 9 in this case... Reported-and-Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-03-22 15:10:43 -07:00
Linus Torvalds	95211279c5	Merge branch 'akpm' (Andrew's patch-bomb) Merge first batch of patches from Andrew Morton: "A few misc things and all the MM queue" * emailed from Andrew Morton <akpm@linux-foundation.org>: (92 commits) memcg: avoid THP split in task migration thp: add HPAGE_PMD_* definitions for !CONFIG_TRANSPARENT_HUGEPAGE memcg: clean up existing move charge code mm/memcontrol.c: remove unnecessary 'break' in mem_cgroup_read() mm/memcontrol.c: remove redundant BUG_ON() in mem_cgroup_usage_unregister_event() mm/memcontrol.c: s/stealed/stolen/ memcg: fix performance of mem_cgroup_begin_update_page_stat() memcg: remove PCG_FILE_MAPPED memcg: use new logic for page stat accounting memcg: remove PCG_MOVE_LOCK flag from page_cgroup memcg: simplify move_account() check memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat) memcg: kill dead prev_priority stubs memcg: remove PCG_CACHE page_cgroup flag memcg: let css_get_next() rely upon rcu_read_lock() cgroup: revert ss_id_lock to spinlock idr: make idr_get_next() good for rcu_read_lock() memcg: remove unnecessary thp check in page stat accounting memcg: remove redundant returns memcg: enum lru_list lru ...	2012-03-22 09:04:48 -07:00
Andrew Morton	1de5b41cd3	fs/namei.c: fix warnings on 32-bit i386 allnoconfig: fs/namei.c: In function 'has_zero': fs/namei.c:1617: warning: integer constant is too large for 'unsigned long' type fs/namei.c:1617: warning: integer constant is too large for 'unsigned long' type fs/namei.c: In function 'hash_name': fs/namei.c:1635: warning: integer constant is too large for 'unsigned long' type There must be a tidier way of doing this. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-03-21 17:54:54 -07:00
Linus Torvalds	e2a0883e40	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile 1 from Al Viro: "This is _not_ all; in particular, Miklos' and Jan's stuff is not there yet." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (64 commits) ext4: initialization of ext4_li_mtx needs to be done earlier debugfs-related mode_t whack-a-mole hfsplus: add an ioctl to bless files hfsplus: change finder_info to u32 hfsplus: initialise userflags qnx4: new helper - try_extent() qnx4: get rid of qnx4_bread/qnx4_getblk take removal of PF_FORKNOEXEC to flush_old_exec() trim includes in inode.c um: uml_dup_mmap() relies on ->mmap_sem being held, but activate_mm() doesn't hold it um: embed ->stub_pages[] into mmu_context gadgetfs: list_for_each_safe() misuse ocfs2: fix leaks on failure exits in module_init ecryptfs: make register_filesystem() the last potential failure exit ntfs: forgets to unregister sysctls on register_filesystem() failure logfs: missing cleanup on register_filesystem() failure jfs: mising cleanup on register_filesystem() failure make configfs_pin_fs() return root dentry on success configfs: configfs_create_dir() has parent dentry in dentry->d_parent configfs: sanitize configfs_create() ...	2012-03-21 13:36:41 -07:00
Linus Torvalds	9f3938346a	Merge branch 'kmap_atomic' of git://github.com/congwang/linux Pull kmap_atomic cleanup from Cong Wang. It's been in -next for a long time, and it gets rid of the (no longer used) second argument to k[un]map_atomic(). Fix up a few trivial conflicts in various drivers, and do an "evil merge" to catch some new uses that have come in since Cong's tree. * 'kmap_atomic' of git://github.com/congwang/linux: (59 commits) feature-removal-schedule.txt: schedule the deprecated form of kmap_atomic() for removal highmem: kill all __kmap_atomic() [swarren@nvidia.com: highmem: Fix ARM build break due to __kmap_atomic rename] drbd: remove the second argument of k[un]map_atomic() zcache: remove the second argument of k[un]map_atomic() gma500: remove the second argument of k[un]map_atomic() dm: remove the second argument of k[un]map_atomic() tomoyo: remove the second argument of k[un]map_atomic() sunrpc: remove the second argument of k[un]map_atomic() rds: remove the second argument of k[un]map_atomic() net: remove the second argument of k[un]map_atomic() mm: remove the second argument of k[un]map_atomic() lib: remove the second argument of k[un]map_atomic() power: remove the second argument of k[un]map_atomic() kdb: remove the second argument of k[un]map_atomic() udf: remove the second argument of k[un]map_atomic() ubifs: remove the second argument of k[un]map_atomic() squashfs: remove the second argument of k[un]map_atomic() reiserfs: remove the second argument of k[un]map_atomic() ocfs2: remove the second argument of k[un]map_atomic() ntfs: remove the second argument of k[un]map_atomic() ...	2012-03-21 09:40:26 -07:00
Al Viro	68ac1234fb	switch touch_atime to struct path Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-03-20 21:29:41 -04:00
Al Viro	8de5277879	vfs: check i_nlink limits in vfs_{mkdir,rename_dir,link} New field of struct super_block - ->s_max_links. Maximal allowed value of ->i_nlink or 0; in the latter case all checks still need to be done in ->link/->mkdir/->rename instances. Note that this limit applies both to directoris and to non-directories. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-03-20 21:29:32 -04:00
Cong Wang	e8e3c3d66f	fs: remove the second argument of k[un]map_atomic() Acked-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Cong Wang <amwang@redhat.com>	2012-03-20 21:48:21 +08:00
Linus Torvalds	b0e37d7ac6	Merge branch 'dcache-word-accesses' * branch 'dcache-word-accesses': vfs: use 'unsigned long' accesses for dcache name comparison and hashing This does the name hashing and lookup using word-sized accesses when that is efficient, namely on x86 (although any little-endian machine with good unaligned accesses would do). It does very much depend on little-endian logic, but it's a very hot couple of functions under some real loads, and this patch improves the performance of __d_lookup_rcu() and link_path_walk() by up to about 30%. Giving a 10% improvement on some very pathname-heavy benchmarks. Because we do make unaligned accesses past the filename, the optimization is disabled when CONFIG_DEBUG_PAGEALLOC is active, and we effectively depend on the fact that on x86 we don't really ever have the last page of usable RAM followed immediately by any IO memory (due to ACPI tables, BIOS buffer areas etc). Some of the bit operations we do are a bit "subtle". It's commented, but you do need to really think about the code. Or just consider it black magic. Thanks to people on G+ for some of the optimized bit tricks.	2012-03-19 16:37:28 -07:00
Miklos Szeredi	7f6c7e62fc	vfs: fix return value from do_last() complete_walk() returns either ECHILD or ESTALE. do_last() turns this into ECHILD unconditionally. If not in RCU mode, this error will reach userspace which is complete nonsense. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-03-10 17:05:30 -05:00
Miklos Szeredi	097b180ca0	vfs: fix double put after complete_walk() complete_walk() already puts nd->path, no need to do it again at cleanup time. This would result in Oopses if triggered, apparently the codepath is not too well exercised. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-03-10 17:05:30 -05:00
Linus Torvalds	bfcfaa77bd	vfs: use 'unsigned long' accesses for dcache name comparison and hashing Ok, this is hacky, and only works on little-endian machines with goo unaligned handling. And even then only with CONFIG_DEBUG_PAGEALLOC disabled, since it can access up to 7 bytes after the pathname. But it runs like a bat out of hell. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-03-08 18:08:44 -08:00
Linus Torvalds	ae942ae719	vfs: export full_name_hash() function to modules Commit `5707c87f` "vfs: uninline full_name_hash()" broke the modular build, because it needs exporting now that it isn't inlined any more. Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-03-02 19:40:57 -08:00
Linus Torvalds	200e9ef7ab	vfs: split up name hashing in link_path_walk() into helper function The code in link_path_walk() that finds out the length and the hash of the next path component is some of the hottest code in the kernel. And I have a version of it that does things at the full width of the CPU wordsize at a time, but that means that we really want to split it up into a separate helper function. So this re-organizes the code a bit and splits the hashing part into a helper function called "hash_name()". It returns the length of the pathname component, while at the same time computing and writing the hash to the appropriate location. The code generation is slightly changed by this patch, but generally for the better - and the added abstraction actually makes the code easier to read too. And the new interface is well suited for replacing just the "hash_name()" function with alternative implementations. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-03-02 14:49:24 -08:00
Linus Torvalds	0145acc202	vfs: uninline full_name_hash() .. and also use it in lookup_one_len() rather than open-coding it. There aren't any performance-critical users, so inlining it is silly. But it wouldn't matter if it wasn't for the fact that the word-at-a-time dentry name patches want to conditionally replace the function, and uninlining it sets the stage for that. So again, this is a preparatory patch that doesn't change any semantics, and only prepares for a much cleaner and testable word-at-a-time dentry name accessor patch. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-03-02 14:32:59 -08:00
Paul Gortmaker	630d9c4727	fs: reduce the use of module.h wherever possible For files only using THIS_MODULE and/or EXPORT_SYMBOL, map them onto including export.h -- or if the file isn't even using those, then just delete the include. Fix up any implicit include dependencies that were being masked by module.h along the way. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>	2012-02-28 19:31:58 -05:00
Miklos Szeredi	e188dc02d3	vfs: fix d_inode_lookup() dentry ref leak d_inode_lookup() leaks a dentry reference on IS_DEADDIR(). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-02-13 20:45:37 -05:00
Eric Paris	4043cde8ec	audit: do not call audit_getname on error Just a code cleanup really. We don't need to make a function call just for it to return on error. This also makes the VFS function even easier to follow and removes a conditional on a hot path. Signed-off-by: Eric Paris <eparis@redhat.com>	2012-01-17 16:17:01 -05:00
Al Viro	ece2ccb668	Merge branches 'vfsmount-guts', 'umode_t' and 'partitions' into Z	2012-01-06 23:15:54 -05:00
Al Viro	a73324da7a	vfs: move mnt_mountpoint to struct mount Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-03 22:57:05 -05:00
Al Viro	0714a53380	vfs: now it can be done - make mnt_parent point to struct mount Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-03 22:57:05 -05:00
Al Viro	3376f34fff	vfs: mnt_parent moved to struct mount the second victim... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-03 22:57:04 -05:00
Al Viro	c71053659e	vfs: spread struct mount - __lookup_mnt() result switch __lookup_mnt() to returning struct mount *; callers adjusted. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-03 22:56:58 -05:00
Al Viro	a218d0fdc5	switch open and mkdir syscalls to umode_t Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-03 22:55:19 -05:00
Al Viro	f69aac0006	switch may_mknod() to umode_t Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-03 22:55:14 -05:00
Al Viro	1a67aafb5f	switch ->mknod() to umode_t Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-03 22:54:54 -05:00
Al Viro	4acdaf27eb	switch ->create() to umode_t vfs_create() ignores everything outside of 16bit subset of its mode argument; switching it to umode_t is obviously equivalent and it's the only caller of the method Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-03 22:54:53 -05:00
Al Viro	18bb1db3e7	switch vfs_mkdir() and ->mkdir() to umode_t vfs_mkdir() gets int, but immediately drops everything that might not fit into umode_t and that's the only caller of ->mkdir()... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-03 22:54:53 -05:00
Al Viro	8208a22bb8	switch sys_mknodat(2) to umode_t Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-03 22:54:52 -05:00
Al Viro	a3fbbde70a	VFS: we need to set LOOKUP_JUMPED on mountpoint crossing Mountpoint crossing is similar to following procfs symlinks - we do not get ->d_revalidate() called for dentry we have arrived at, with unpleasant consequences for NFS4. Simple way to reproduce the problem in mainline: cat >/tmp/a.c <<'EOF' #include <unistd.h> #include <fcntl.h> #include <stdio.h> main() { struct flock fl = {.l_type = F_RDLCK, .l_whence = SEEK_SET, .l_len = 1}; if (fcntl(0, F_SETLK, &fl)) perror("setlk"); } EOF cc /tmp/a.c -o /tmp/test then on nfs4: mount --bind file1 file2 /tmp/test < file1 # ok /tmp/test < file2 # spews "setlk: No locks available"... What happens is the missing call of ->d_revalidate() after mountpoint crossing and that's where NFS4 would issue OPEN request to server. The fix is simple - treat mountpoint crossing the same way we deal with following procfs-style symlinks. I.e. set LOOKUP_JUMPED... Cc: stable@kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-11-07 14:58:06 -08:00
Andy Whitcroft	1fa1e7f615	readlinkat: ensure we return ENOENT for the empty pathname for normal lookups Since the commit below which added O_PATH support to the *at() calls, the error return for readlink/readlinkat for the empty pathname has switched from ENOENT to EINVAL: commit `65cfc67223` Author: Al Viro <viro@zeniv.linux.org.uk> Date: Sun Mar 13 15:56:26 2011 -0400 readlinkat(), fchownat() and fstatat() with empty relative pathnames This is both unexpected for userspace and makes readlink/readlinkat inconsistant with all other interfaces; and inconsistant with our stated return for these pathnames. As the readlinkat call does not have a flags parameter we cannot use the AT_EMPTY_PATH approach used in the other calls. Therefore expose whether the original path is infact entry via a new user_path_at_empty() path lookup function. Use this to determine whether to default to EINVAL or ENOENT for failures. Addresses http://bugs.launchpad.net/bugs/817187 [akpm@linux-foundation.org: remove unused getname_flags()] Signed-off-by: Andy Whitcroft <apw@canonical.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-11-02 12:53:42 +01:00
J. Bruce Fields	f3c7691e8d	leases: fix write-open/read-lease race In setlease, we use i_writecount to decide whether we can give out a read lease. In open, we break leases before incrementing i_writecount. There is therefore a window between the break lease and the i_writecount increment when setlease could add a new read lease. This would leave us with a simultaneous write open and read lease, which shouldn't happen. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:59:00 +02:00
Andreas Gruenbacher	948409c74d	vfs: add a comment to inode_permission() Acked-by: J. Bruce Fields <bfields@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andreas Gruenbacher <agruen@kernel.org> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:58:55 +02:00
Andreas Gruenbacher	d124b60a83	vfs: pass all mask flags check_acl and posix_acl_permission Acked-by: J. Bruce Fields <bfields@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andreas Gruenbacher <agruen@kernel.org> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:58:54 +02:00
Andreas Gruenbacher	8fd90c8d1d	vfs: indicate that the permission functions take all the MAY_* flags Acked-by: J. Bruce Fields <bfields@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andreas Gruenbacher <agruen@kernel.org> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:58:54 +02:00
Linus Torvalds	b6c8069d35	vfs: remove LOOKUP_NO_AUTOMOUNT flag That flag no longer makes sense, since we don't look up automount points as eagerly any more. Additionally, it turns out that the NO_AUTOMOUNT handling was buggy to begin with: it would avoid automounting even for cases where we really needed to do the automount handling, and could return ENOENT for autofs entries that hadn't been instantiated yet. With our new non-eager automount semantics, one discussion has been about adding a AT_AUTOMOUNT flag to vfs_fstatat (and thus the newfstatat() and fstatat64() system calls), but it's probably not worth it: you can always force at least directory automounting by simply adding the final '/' to the filename, which works for all of the stat family system calls, old and new. So AT_NO_AUTOMOUNT (and thus LOOKUP_NO_AUTOMOUNT) really were just a result of our bad default behavior. Acked-by: Ian Kent <raven@themaw.net> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-09-27 08:12:33 -07:00
Linus Torvalds	d94c177bee	vfs pathname lookup: Add LOOKUP_AUTOMOUNT flag Since we've now turned around and made LOOKUP_FOLLOW not force an automount, we want to add the ability to force an automount event on lookup even if we don't happen to have one of the other flags that force it implicitly (LOOKUP_OPEN, LOOKUP_DIRECTORY, LOOKUP_PARENT..) Most cases will never want to use this, since you'd normally want to delay automounting as long as possible, which usually implies LOOKUP_OPEN (when we open a file or directory, we really cannot avoid the automount any more). But Trond argued sufficiently forcefully that at a minimum bind mounting a file and quotactl will want to force the automount lookup. Some other cases (like nfs_follow_remote_path()) could use it too, although LOOKUP_DIRECTORY would work there as well. This commit just adds the flag and logic, no users yet, though. It also doesn't actually touch the LOOKUP_NO_AUTOMOUNT flag that is related, and was made irrelevant by the same change that made us not follow on LOOKUP_FOLLOW. Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Ian Kent <raven@themaw.net> Cc: Jeff Layton <jlayton@redhat.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Greg KH <gregkh@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-09-26 17:44:55 -07:00
Al Viro	1d2ef59014	restore pinning the victim dentry in vfs_rmdir()/vfs_rename_dir() We used to get the victim pinned by dentry_unhash() prior to commit `64252c75a2` ("vfs: remove dget() from dentry_unhash()") and ->rmdir() and ->rename() instances relied on that; most of them don't care, but ones that used d_delete() themselves do. As the result, we are getting rmdir() oopses on NFS now. Just grab the reference before locking the victim and drop it explicitly after unlocking, same as vfs_rename_other() does. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Simon Kirby <sim@hostway.ca> Cc: stable@kernel.org (3.0.x) Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-09-14 11:31:55 -07:00
Miklos Szeredi	0ec26fd069	vfs: automount should ignore LOOKUP_FOLLOW Prior to 2.6.38 automount would not trigger on either stat(2) or lstat(2) on the automount point. After 2.6.38, with the introduction of the ->d_automount() infrastructure, stat(2) and others would start triggering automount while lstat(2), etc. still would not. This is a regression and a userspace ABI change. Problem originally reported here: http://thread.gmane.org/gmane.linux.kernel.autofs/6098 It appears that there was an attempt at fixing various userspace tools to not trigger the automount. But since the stat system call is rather common it is impossible to "fix" all userspace. This patch reverts the original behavior, which is to not trigger on stat(2) and other symlink following syscalls. [ It's not really clear what the right behavior is. Apparently Solaris does the "automount on stat, leave alone on lstat". And some programs can get unhappy when "stat+open+fstat" ends up giving a different result from the fstat than from the initial stat. But the change in 2.6.38 resulted in problems for some people, so we're going back to old behavior. Maybe we can re-visit this discussion at some future date - Linus ] Reported-by: Leonardo Chiquitto <leonardo.lists@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Acked-by: Ian Kent <raven@themaw.net> Cc: David Howells <dhowells@redhat.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-09-09 15:42:34 -07:00
Linus Torvalds	7813b94a54	vfs: rename 'do_follow_link' to 'should_follow_link' Al points out that the do_follow_link() helper function really is misnamed - it's about whether we should try to follow a symlink or not, not about actually doing the following. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-08-07 13:42:25 -07:00
Ari Savolainen	206b1d09a5	Fix POSIX ACL permission check After commit `3567866bf2`: "RCUify freeing acls, let check_acl() go ahead in RCU mode if acl is cached" posix_acl_permission is being called with an unsupported flag and the permission check fails. This patch fixes the issue. Signed-off-by: Ari Savolainen <ari.m.savolainen@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-08-07 04:52:23 -04:00
Linus Torvalds	3ddcd0569c	vfs: optimize inode cache access patterns The inode structure layout is largely random, and some of the vfs paths really do care. The path lookup in particular is already quite D$ intensive, and profiles show that accessing the 'inode->i_op->xyz' fields is quite costly. We already optimized the dcache to not unnecessarily load the d_op structure for members that are often NULL using the DCACHE_OP_xyz bits in dentry->d_flags, and this does something very similar for the inode ops that are used during pathname lookup. It also re-orders the fields so that the fields accessed by 'stat' are together at the beginning of the inode structure, and roughly in the order accessed. The effect of this seems to be in the 1-2% range for an empty kernel "make -j" run (which is fairly kernel-intensive, mostly in filename lookup), so it's visible. The numbers are fairly noisy, though, and likely depend a lot on exact microarchitecture. So there's more tuning to be done. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-08-06 22:53:23 -07:00
Al Viro	3567866bf2	RCUify freeing acls, let check_acl() go ahead in RCU mode if acl is cached Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-08-03 00:58:42 -04:00
David Howells	5a30d8a2b8	VFS: Fix automount for negative autofs dentries Autofs may set the DCACHE_NEED_AUTOMOUNT flag on negative dentries. These need attention from the automounter daemon regardless of the LOOKUP_FOLLOW flag. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Ian Kent <raven@themaw.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-08-01 01:38:01 -04:00
Linus Torvalds	84635d68be	vfs: fix check_acl compile error when CONFIG_FS_POSIX_ACL is not set Commit `e77819e57f` ("vfs: move ACL cache lookup into generic code") didn't take the FS_POSIX_ACL config variable into account - when that is not set, ACL's go away, and the cache helper functions do not exist, causing compile errors like fs/namei.c: In function 'check_acl': fs/namei.c:191:10: error: implicit declaration of function 'negative_cached_acl' fs/namei.c:196:2: error: implicit declaration of function 'get_cached_acl' fs/namei.c:196:6: warning: assignment makes pointer from integer without a cast fs/namei.c:212:11: error: implicit declaration of function 'set_cached_acl' Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de> Acked-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-25 22:47:03 -07:00
Linus Torvalds	14067ff536	vfs: make gcc generate more obvious code for acl permission checking The "fsuid is the inode owner" case is not necessarily always the likely case, but it's the case that doesn't do anything odd and that we want in straight-line code. Make gcc not generate random "jump around for the fun of it" code. This just helps me read profiles. That thing is one of the hottest parts of the whole pathname lookup. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-25 19:55:52 -07:00
Christoph Hellwig	4e34e719e4	fs: take the ACL checks to common code Replace the ->check_acl method with a ->get_acl method that simply reads an ACL from disk after having a cache miss. This means we can replace the ACL checking boilerplate code with a single implementation in namei.c. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-25 14:30:23 -04:00
Linus Torvalds	e77819e57f	vfs: move ACL cache lookup into generic code This moves logic for checking the cached ACL values from low-level filesystems into generic code. The end result is a streamlined ACL check that doesn't need to load the inode->i_op->check_acl pointer at all for the common cached case. The filesystems also don't need to check for a non-blocking RCU walk case in their acl_check() functions, because that is all handled at a VFS layer. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-25 14:23:39 -04:00
Tobias Klauser	8c5dc70aae	VFS: Fixup kerneldoc for generic_permission() The flags parameter went away in d749519b444db985e40b897f73ce1898b11f997e Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:43 -04:00
Al Viro	e3c3d9c838	unexport kern_path_parent() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:44:16 -04:00
Al Viro	e0a0124936	switch vfs_path_lookup() to struct path Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:44:14 -04:00
Al Viro	ed75e95de5	kill lookup_create() folded into the only caller (kern_path_create()) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:44:12 -04:00
Al Viro	dae6ad8f37	new helpers: kern_path_create/user_path_create combination of kern_path_parent() and lookup_create(). Does not expose struct nameidata to caller. Syscalls converted to that... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:44:05 -04:00
Al Viro	49084c3bb2	kill LOOKUP_CONTINUE LOOKUP_PARENT is equivalent to it now Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:44:03 -04:00
Al Viro	8a5e929dd2	don't transliterate lower bits of ->intent.open.flags to FMODE_... ->create() instances are much happier that way... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:52 -04:00
Al Viro	554a8b9f54	Don't pass nameidata when calling vfs_create() from mknod() All instances can cope with that now (and ceph one actually starts working properly). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:49 -04:00
Al Viro	d2d9e9fbc2	merge do_revalidate() into its only caller Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:34 -04:00
Al Viro	4ad5abb3d0	no reason to keep exec_permission() separate now cache footprint alone makes it a bad idea... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:32 -04:00
Al Viro	d594e7ec4d	massage generic_permission() to treat directories on a separate path Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:30 -04:00
Al Viro	eecdd358b4	->permission() sanitizing: don't pass flags to exec_permission() pass mask instead; kill security_inode_exec_permission() since we can use security_inode_permission() instead. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:29 -04:00
Al Viro	10556cb21a	->permission() sanitizing: don't pass flags to ->permission() not used by the instances anymore. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:24 -04:00
Al Viro	2830ba7f34	->permission() sanitizing: don't pass flags to generic_permission() redundant; all callers get it duplicated in mask & MAY_NOT_BLOCK and none of them removes that bit. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:22 -04:00
Al Viro	7e40145eb1	->permission() sanitizing: don't pass flags to ->check_acl() not used in the instances anymore. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:21 -04:00
Al Viro	9c2c703929	->permission() sanitizing: pass MAY_NOT_BLOCK to ->check_acl() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:19 -04:00
Al Viro	1fc0f78ca9	->permission() sanitizing: MAY_NOT_BLOCK Duplicate the flags argument into mask bitmap. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:18 -04:00
Al Viro	178ea73521	kill check_acl callback of generic_permission() its value depends only on inode and does not change; we might as well store it in ->i_op->check_acl and be done with that. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:16 -04:00
Al Viro	07b8ce1ee8	lockless get_write_access/deny_write_access new helpers: atomic_inc_unless_negative()/atomic_dec_unless_positive() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:14 -04:00
Al Viro	f4d6ff89d8	move exec_permission() up to the rest of permission-related functions ... and convert the comment before it into linuxdoc form. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:13 -04:00
Al Viro	3bfa784a65	kill file_permission() completely convert the last remaining caller to inode_permission() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:11 -04:00
Al Viro	78f32a9b47	switch path_init() to exec_permission() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:08 -04:00
Al Viro	4cf27141cb	make exec_permission(dir) really equivalent to inode_permission(dir, MAY_EXEC) capability overrides apply only to the default case; if fs has ->permission() that does _not_ call generic_permission(), we have no business doing them. Moreover, if it has ->permission() that does call generic_permission(), we have no need to recheck capabilities. Besides, the capability overrides should apply only if we got EACCES from acl_permission_check(); any other value (-EIO, etc.) should be returned to caller, capabilities or not capabilities. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:05 -04:00
Josef Bacik	44396f4b5c	fs: add a DCACHE_NEED_LOOKUP flag for d_flags Btrfs (and I'd venture most other fs's) stores its indexes in nice disk order for readdir, but unfortunately in the case of anything that stats the files in order that readdir spits back (like oh say ls) that means we still have to do the normal lookup of the file, which means looking up our other index and then looking up the inode. What I want is a way to create dummy dentries when we find them in readdir so that when ls or anything else subsequently does a stat(), we already have the location information in the dentry and can go straight to the inode itself. The lookup stuff just assumes that if it finds a dentry it is done, it doesn't perform a lookup. So add a DCACHE_NEED_LOOKUP flag so that the lookup code knows it still needs to run i_op->lookup() on the parent to get the inode for the dentry. I have tested this with btrfs and I went from something that looks like this http://people.redhat.com/jwhiter/ls-noreada.png To this http://people.redhat.com/jwhiter/ls-good.png Thats a savings of 1300 seconds, or 22 minutes. That is a significant savings. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:03 -04:00
Linus Torvalds	5943026240	vfs: fix race in rcu lookup of pruned dentry Don't update *inode in __follow_mount_rcu() until we'd verified that there is mountpoint there. Kudos to Hugh Dickins for catching that one in the first place and eventually figuring out the solution (and catching a braino in the earlier version of patch). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-19 21:49:01 -07:00
Al Viro	94c0d4ecbe	Fix ->d_lock locking order in unlazy_walk() Make sure that child is still a child of parent before nested locking of child->d_lock in unlazy_walk(); otherwise we are risking a violation of locking order and deadlocks. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-12 21:40:23 -04:00
Al Viro	8e833fd2e1	fix comment in generic_permission() CAP_DAC_OVERRIDE is enough for MAY_EXEC on directory, even if no exec bits are set. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-20 10:45:56 -04:00
Al Viro	6291176bcd	kill obsolete comment for follow_down() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-20 10:45:49 -04:00
Al Viro	8aef188452	VFS: Fix vfsmount overput on simultaneous automount [Kudos to dhowells for tracking that crap down] If two processes attempt to cause automounting on the same mountpoint at the same time, the vfsmount holding the mountpoint will be left with one too few references on it, causing a BUG when the kernel tries to clean up. The problem is that lock_mount() drops the caller's reference to the mountpoint's vfsmount in the case where it finds something already mounted on the mountpoint as it transits to the mounted filesystem and replaces path->mnt with the new mountpoint vfsmount. During a pathwalk, however, we don't take a reference on the vfsmount if it is the same as the one in the nameidata struct, but do_add_mount() doesn't know this. The fix is to make sure we have a ref on the vfsmount of the mountpoint before calling do_add_mount(). However, if lock_mount() doesn't transit, we're then left with an extra ref on the mountpoint vfsmount which needs releasing. We can handle that in follow_managed() by not making assumptions about what we can and what we cannot get from lookup_mnt() as the current code does. The callers of follow_managed() expect that reference to path->mnt will be grabbed iff path->mnt has been changed. follow_managed() and follow_automount() keep track of whether such reference has been grabbed and assume that it'll happen in those and only those cases that'll have us return with changed path->mnt. That assumption is almost correct - it breaks in case of racing automounts and in even harder to hit race between following a mountpoint and a couple of mount --move. The thing is, we don't need to make that assumption at all - after the end of loop in follow_manage() we can check if path->mnt has ended up unchanged and do mntput() if needed. The BUG can be reproduced with the following test program: #include <stdio.h> #include <sys/types.h> #include <sys/stat.h> #include <unistd.h> #include <sys/wait.h> int main(int argc, char **argv) { int pid, ws; struct stat buf; pid = fork(); stat(argv[1], &buf); if (pid > 0) wait(&ws); return 0; } and the following procedure: (1) Mount an NFS volume that on the server has something else mounted on a subdirectory. For instance, I can mount / from my server: mount warthog:/ /mnt -t nfs4 -r On the server /data has another filesystem mounted on it, so NFS will see a change in FSID as it walks down the path, and will mark /mnt/data as being a mountpoint. This will cause the automount code to be triggered. !!! Do not look inside the mounted fs at this point !!! (2) Run the above program on a file within the submount to generate two simultaneous automount requests: /tmp/forkstat /mnt/data/testfile (3) Unmount the automounted submount: umount /mnt/data (4) Unmount the original mount: umount /mnt At this point the kernel should throw a BUG with something like the following: BUG: Dentry ffff880032e3c5c0{i=2,n=} still in use (1) [unmount of nfs4 0:12] Note that the bug appears on the root dentry of the original mount, not the mountpoint and not the submount because sys_umount() hasn't got to its final mntput_no_expire() yet, but this isn't so obvious from the call trace: [<ffffffff8117cd82>] shrink_dcache_for_umount+0x69/0x82 [<ffffffff8116160e>] generic_shutdown_super+0x37/0x15b [<ffffffffa00fae56>] ? nfs_super_return_all_delegations+0x2e/0x1b1 [nfs] [<ffffffff811617f3>] kill_anon_super+0x1d/0x7e [<ffffffffa00d0be1>] nfs4_kill_super+0x60/0xb6 [nfs] [<ffffffff81161c17>] deactivate_locked_super+0x34/0x83 [<ffffffff811629ff>] deactivate_super+0x6f/0x7b [<ffffffff81186261>] mntput_no_expire+0x18d/0x199 [<ffffffff811862a8>] mntput+0x3b/0x44 [<ffffffff81186d87>] release_mounts+0xa2/0xbf [<ffffffff811876af>] sys_umount+0x47a/0x4ba [<ffffffff8109e1ca>] ? trace_hardirqs_on_caller+0x1fd/0x22f [<ffffffff816ea86b>] system_call_fastpath+0x16/0x1b as do_umount() is inlined. However, you can see release_mounts() in there. Note also that it may be necessary to have multiple CPU cores to be able to trigger this bug. Tested-by: Jeff Layton <jlayton@redhat.com> Tested-by: Ian Kent <raven@themaw.net> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-16 11:28:16 -04:00
Török Edwin	50338b889d	fix wrong iput on d_inode introduced by `e6bc45d65d` Git bisection shows that commit `e6bc45d65d` causes BUG_ONs under high I/O load: kernel BUG at fs/inode.c:1368! [ 2862.501007] Call Trace: [ 2862.501007] [<ffffffff811691d8>] d_kill+0xf8/0x140 [ 2862.501007] [<ffffffff81169c19>] dput+0xc9/0x190 [ 2862.501007] [<ffffffff8115577f>] fput+0x15f/0x210 [ 2862.501007] [<ffffffff81152171>] filp_close+0x61/0x90 [ 2862.501007] [<ffffffff81152251>] sys_close+0xb1/0x110 [ 2862.501007] [<ffffffff814c14fb>] system_call_fastpath+0x16/0x1b A reliable way to reproduce this bug is: Login to KDE, run 'rsnapshot sync', and apt-get install openjdk-6-jdk, and apt-get remove openjdk-6-jdk. The buggy part of the patch is this: struct inode inode = NULL; ..... - if (nd.last.name[nd.last.len]) - goto slashes; inode = dentry->d_inode; - if (inode) - ihold(inode); + if (nd.last.name[nd.last.len] \|\| !inode) + goto slashes; + ihold(inode) ... if (inode) iput(inode); / truncate the inode here */ If nd.last.name[nd.last.len] is nonzero (and thus goto slashes branch is taken), and dentry->d_inode is non-NULL, then this code now does an additional iput on the inode, which is wrong. Fix this by only setting the inode variable if nd.last.name[nd.last.len] is 0. Reference: https://lkml.org/lkml/2011/6/15/50 Reported-by: Norbert Preining <preining@logic.at> Reported-by: Török Edwin <edwintorok@gmail.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Török Edwin <edwintorok@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-16 11:27:39 -04:00
Theodore Ts'o	e6bc45d65d	vfs: make unlink() and rmdir() return ENOENT in preference to EROFS If user space attempts to remove a non-existent file or directory, and the file system is mounted read-only, return ENOENT instead of EROFS. Either error code is arguably valid/correct, but ENOENT is a more specific error message. Reported-by: Michael Tokarev <mjt@tls.msk.ru> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-06-07 08:51:14 -04:00
Sage Weil	3cebde2413	vfs: shrink_dcache_parent before rmdir, dir rename The dentry_unhash push-down series missed that shink_dcache_parent needs to be called prior to rmdir or dir rename to clear DCACHE_REFERENCED and allow efficient dentry reclaim. Reported-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-05-30 01:48:27 -04:00
Al Viro	d6e9bd256c	Lift the check for automount points into do_lookup() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-05-27 07:03:15 -04:00
Al Viro	dea3937619	Trim excessive arguments of follow_mount_rcu() ... and kill a useless local variable in follow_dotdot_rcu(), while we are at it - follow_mount_rcu(nd, path, inode) always assigned value to *inode, and always it had been path->dentry->d_inode (aka nd->path.dentry->d_inode, since it always got &nd->path as the second argument). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-05-27 07:01:49 -04:00
Al Viro	287548e46a	split __follow_mount_rcu() into normal and .. cases Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-05-27 06:51:56 -04:00
Linus Torvalds	32e51f141f	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (25 commits) cifs: remove unnecessary dentry_unhash on rmdir/rename_dir ocfs2: remove unnecessary dentry_unhash on rmdir/rename_dir exofs: remove unnecessary dentry_unhash on rmdir/rename_dir nfs: remove unnecessary dentry_unhash on rmdir/rename_dir ext2: remove unnecessary dentry_unhash on rmdir/rename_dir ext3: remove unnecessary dentry_unhash on rmdir/rename_dir ext4: remove unnecessary dentry_unhash on rmdir/rename_dir btrfs: remove unnecessary dentry_unhash in rmdir/rename_dir ceph: remove unnecessary dentry_unhash calls vfs: clean up vfs_rename_other vfs: clean up vfs_rename_dir vfs: clean up vfs_rmdir vfs: fix vfs_rename_dir for FS_RENAME_DOES_D_MOVE filesystems libfs: drop unneeded dentry_unhash vfs: update dentry_unhash() comment vfs: push dentry_unhash on rename_dir into file systems vfs: push dentry_unhash on rmdir into file systems vfs: remove dget() from dentry_unhash() vfs: dentry_unhash immediately prior to rmdir vfs: Block mmapped writes while the fs is frozen ...	2011-05-26 09:52:14 -07:00
Sage Weil	51892bbb57	vfs: clean up vfs_rename_other Simplify control flow to match vfs_rename_dir. Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-05-26 07:26:53 -04:00
Sage Weil	9055cba711	vfs: clean up vfs_rename_dir Simplify control flow through vfs_rename_dir. Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-05-26 07:26:52 -04:00
Sage Weil	912dbc15d9	vfs: clean up vfs_rmdir Simplify the control flow with an out label. Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-05-26 07:26:51 -04:00
Miklos Szeredi	b5afd2c406	vfs: fix vfs_rename_dir for FS_RENAME_DOES_D_MOVE filesystems vfs_rename_dir() doesn't properly account for filesystems with FS_RENAME_DOES_D_MOVE. If new_dentry has a target inode attached, it unhashes the new_dentry prior to the rename() iop and rehashes it after, but doesn't account for the possibility that rename() may have swapped {old,new}_dentry. For FS_RENAME_DOES_D_MOVE filesystems, it rehashes new_dentry (now the old renamed-from name, which d_move() expected to go away), such that a subsequent lookup will find it. Currently all FS_RENAME_DOES_D_MOVE filesystems compensate for this by failing in d_revalidate. The bug was introduced by: commit `349457ccf2` "[PATCH] Allow file systems to manually d_move() inside of ->rename()" Fix by not rehashing the new dentry. Rehashing used to be needed by d_move() but isn't anymore. Reported-by: Sage Weil <sage@newdream.net> Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-05-26 07:26:50 -04:00
Sage Weil	a71905f0db	vfs: update dentry_unhash() comment The helper is now only called by file systems, not the VFS. Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-05-26 07:26:49 -04:00
Sage Weil	e4eaac06bc	vfs: push dentry_unhash on rename_dir into file systems Only a few file systems need this. Start by pushing it down into each rename method (except gfs2 and xfs) so that it can be dealt with on a per-fs basis. Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-05-26 07:26:48 -04:00
Sage Weil	79bf7c732b	vfs: push dentry_unhash on rmdir into file systems Only a few file systems need this. Start by pushing it down into each fs rmdir method (except gfs2 and xfs) so it can be dealt with on a per-fs basis. This does not change behavior for any in-tree file systems. Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-05-26 07:26:47 -04:00
Sage Weil	64252c75a2	vfs: remove dget() from dentry_unhash() This serves no useful purpose that I can discern. All callers (rename, rmdir) hold their own reference to the dentry. A quick audit of all file systems showed no relevant checks on the value of d_count in vfs_rmdir/vfs_rename_dir paths. Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-05-26 07:26:46 -04:00
Sage Weil	48293699a0	vfs: dentry_unhash immediately prior to rmdir This presumes that there is no reason to unhash a dentry if we fail because it is a mountpoint or the LSM check fails, and that the LSM checks do not depend on the dentry being unhashed. Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-05-26 07:26:46 -04:00
Al Viro	9f1fafee9e	merge handle_reval_dot and nameidata_drop_rcu_last new helper: complete_walk(). Done on successful completion of walk, drops out of RCU mode, does d_revalidate of final result if that hadn't been done already. handle_reval_dot() and nameidata_drop_rcu_last() subsumed into that one; callers converted to use of complete_walk(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-05-26 07:26:32 -04:00
Al Viro	19660af736	consolidate nameidata_..._drop_rcu() Merge these into a single function (unlazy_walk(nd, dentry)), kill ..._maybe variants Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-05-26 07:26:02 -04:00
Erez Zadok	1a4022f88d	VFS: move BUG_ON test for symlink nd->depth after current->link_count test This solves a serious VFS-level bug in nested_symlink (which was rewritten from do_follow_link), and follows the order of depth tests that existed before. The bug triggers a BUG_ON in fs/namei.c:1381, when running racer with symlink and rename ops. Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu> Acked-by: Miklos Szeredi <mszeredi@suse.cz> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-05-21 00:12:16 -07:00
Linus Torvalds	26cf46be95	vfs: micro-optimize acl_permission_check() It's a hot function, and we're better off not mixing types in the mask calculations. The compiler just ends up mixing 16-bit and 32-bit operations, for no good reason. So do everything in 'unsigned int' rather than mixing 'unsigned int' masking with a 'umode_t' (16-bit) mode variable. This, together with the parent commit (`47a150edc2`: "Cache user_ns in struct cred") makes acl_permission_check() much nicer. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-05-13 11:51:01 -07:00
Tim Chen	c1530019e3	vfs: Fix absolute RCU path walk failures due to uninitialized seq number During RCU walk in path_lookupat and path_openat, the rcu lookup frequently failed if looking up an absolute path, because when root directory was looked up, seq number was not properly set in nameidata. We dropped out of RCU walk in nameidata_drop_rcu due to mismatch in directory entry's seq number. We reverted to slow path walk that need to take references. With the following patch, I saw a 50% increase in an exim mail server benchmark throughput on a 4-socket Nehalem-EX system. Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Cc: stable@kernel.org (v2.6.38) Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-04-15 15:28:12 -07:00
Lucas De Marchi	25985edced	Fix common misspellings Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>	2011-03-31 11:26:23 -03:00
Ian Kent	62a7375e5d	vfs - check non-mountpoint dentry might block in __follow_mount_rcu() When following a mount in rcu-walk mode we must check if the incoming dentry is telling us it may need to block, even if it isn't actually a mountpoint. Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-24 14:24:32 -04:00
Linus Torvalds	b81a618dcd	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: deal with races in /proc//{syscall,stack,personality} proc: enable writing to /proc/pid/mem proc: make check_mem_permission() return an mm_struct on success proc: hold cred_guard_mutex in check_mem_permission() proc: disable mem_write after exec mm: implement access_remote_vm mm: factor out main logic of access_process_vm mm: use mm_struct to resolve gate vma's in __get_user_pages mm: arch: rename in_gate_area_no_task to in_gate_area_no_mm mm: arch: make in_gate_area take an mm_struct instead of a task_struct mm: arch: make get_gate_vma take an mm_struct instead of a task_struct x86: mark associated mm when running a task in 32 bit compatibility mode x86: add context tag to mark mm when running a task in 32-bit compatibility mode auxv: require the target to be tracable (or yourself) close race in /proc//environ report errors in /proc//map* sanely pagemap: close races with suid execve make sessionid permissions in /proc//task/ match those in /proc/* fix leaks in path_lookupat() Fix up trivial conflicts in fs/proc/base.c	2011-03-23 20:51:42 -07:00
Serge E. Hallyn	2e14967075	userns: rename is_owner_or_cap to inode_owner_or_capable And give it a kernel-doc comment. [akpm@linux-foundation.org: btrfs changed in linux-next] Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Daniel Lezcano <daniel.lezcano@free.fr> Acked-by: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:47:13 -07:00
Serge E. Hallyn	e795b71799	userns: userns: check user namespace for task->file uid equivalence checks Cheat for now and say all files belong to init_user_ns. Next step will be to let superblocks belong to a user_ns, and derive inode_userns(inode) from inode->i_sb->s_user_ns. Finally we'll introduce more flexible arrangements. Changelog: Feb 15: make is_owner_or_cap take const struct inode Feb 23: make is_owner_or_cap bool [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Daniel Lezcano <daniel.lezcano@free.fr> Acked-by: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:47:08 -07:00
Al Viro	bd23a539d0	fix leaks in path_lookupat() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-23 09:56:55 -04:00
Al Viro	1aed3e4204	lose 'mounting_here' argument in ->d_manage() it's always false... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-18 10:01:59 -04:00
Al Viro	7cc90cc3ff	don't pass 'mounting_here' flag to follow_down() it's always false now Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-18 09:04:20 -04:00
Al Viro	0e794589e5	fix follow_link() breakage commit `574197e0de` had a missing piece, breaking the loop detection ;-/ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-16 04:57:03 -04:00
Al Viro	574197e0de	tidy the trailing symlinks traversal up * pull the handling of current->total_link_count into __do_follow_link() * put the common "do ->put_link() if needed and path_put() the link" stuff into a helper (put_link(nd, link, cookie)) * rename __do_follow_link() to follow_link(), while we are at it Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-15 17:16:25 -04:00
Al Viro	b356379a02	Turn resolution of trailing symlinks iterative everywhere The last remaining place (resolution of nested symlink) converted to the loop of the same kind we have in path_lookupat() and path_openat(). Note that we still do have a recursion in pathname resolution; can't avoid it, really. However, it's strictly for nested symlinks now - i.e. ones in the middle of a pathname. link_path_walk() has lost the tail now - it always walks everything except the last component. do_follow_link() renamed to nested_symlink() and moved down. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-15 17:16:25 -04:00
Al Viro	ce0525449d	simplify link_path_walk() tail Now that link_path_walk() is called without LOOKUP_PARENT only from do_follow_link(), we can simplify the checks in last component handling. First of all, checking if we'd arrived to a directory is not needed - the caller will check it anyway. And LOOKUP_FOLLOW is guaranteed to be there, since we only get to that place with nd->depth > 0. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-15 17:16:25 -04:00
Al Viro	bd92d7fed8	Make trailing symlink resolution in path_lookupat() iterative Now the only caller of link_path_walk() that does not pass LOOKUP_PARENT is do_follow_link() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-15 17:16:25 -04:00
Al Viro	b21041d0f7	update nd->inode in __do_follow_link() instead of after do_follow_link() ... and note that we only need to do it for LAST_BIND symlinks Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-15 17:16:25 -04:00
Al Viro	ce57dfc179	pull handling of one pathname component into a helper new helper: walk_component(). Handles everything except symlinks; returns negative on error, 0 on success and 1 on symlinks we decided to follow. Drops out of RCU mode on such symlinks. link_path_walk() and do_last() switched to using that. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-15 17:16:20 -04:00
Aneesh Kumar K.V	11a7b371b6	fs: allow AT_EMPTY_PATH in linkat(), limit that to CAP_DAC_READ_SEARCH We don't want to allow creation of private hardlinks by different application using the fd passed to them via SCM_RIGHTS. So limit the null relative name usage in linkat syscall to CAP_DAC_READ_SEARCH Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>	2011-03-15 17:16:05 -04:00
Al Viro	bcda76524c	Allow O_PATH for symlinks At that point we can't do almost nothing with them. They can be opened with O_PATH, we can manipulate such descriptors with dup(), etc. and we can see them in /proc//{fd,fdinfo}/. We can't (and won't be able to) follow /proc//fd/ symlinks for those; there's simply not enough information for pathname resolution to go on from such point - to resolve a symlink we need to know which directory does it live in. We will be able to do useful things with them after the next commit, though - readlinkat() and fchownat() will be possible to use with dfd being an O_PATH-opened symlink and empty relative pathname. Combined with open_by_handle() it'll give us a way to do realink-by-handle and lchown-by-handle without messing with more redundant syscalls. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-15 02:21:45 -04:00
Al Viro	1abf0c718f	New kind of open files - "location only". New flag for open(2) - O_PATH. Semantics: * pathname is resolved, but the file itself is _NOT_ opened as far as filesystem is concerned. * almost all operations on the resulting descriptors shall fail with -EBADF. Exceptions are: 1) operations on descriptors themselves (i.e. close(), dup(), dup2(), dup3(), fcntl(fd, F_DUPFD), fcntl(fd, F_DUPFD_CLOEXEC, ...), fcntl(fd, F_GETFD), fcntl(fd, F_SETFD, ...)) 2) fcntl(fd, F_GETFL), for a common non-destructive way to check if descriptor is open 3) "dfd" arguments of ...at(2) syscalls, i.e. the starting points of pathname resolution * closing such descriptor does NOT affect dnotify or posix locks. * permissions are checked as usual along the way to file; no permission checks are applied to the file itself. Of course, giving such thing to syscall will result in permission checks (at the moment it means checking that starting point of ....at() is a directory and caller has exec permissions on it). fget() and fget_light() return NULL on such descriptors; use of fget_raw() and fget_raw_light() is needed to get them. That protects existing code from dealing with those things. There are two things still missing (they come in the next commits): one is handling of symlinks (right now we refuse to open them that way; see the next commit for semantics related to those) and another is descriptor passing via SCM_RIGHTS datagrams. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-15 02:21:45 -04:00
Aneesh Kumar K.V	aae8a97d3e	fs: Don't allow to create hardlink for deleted file Add inode->i_nlink == 0 check in VFS. Some of the file systems do this internally. A followup patch will remove those instance. This is needed to ensure that with link by handle we don't allow to create hardlink of an unlinked file. The check also prevent a race between unlink and link Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-15 02:21:44 -04:00
Al Viro	f52e0c1130	New AT_... flag: AT_EMPTY_PATH For name_to_handle_at(2) we'll want both ...at()-style syscall that would be usable for non-directory descriptors (with empty relative pathname). Introduce new flag (AT_EMPTY_PATH) to deal with that and corresponding LOOKUP_EMPTY; teach user_path_at() and path_init() to deal with the latter. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-14 19:12:20 -04:00
Al Viro	73d049a40f	open-style analog of vfs_path_lookup() new function: file_open_root(dentry, mnt, name, flags) opens the file vfs_path_lookup would arrive to. Note that name can be empty; in that case the usual requirement that dentry should be a directory is lifted. open-coded equivalents switched to it, may_open() got down exactly one caller and became static. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-14 09:15:28 -04:00

1 2 3 4 5 ...

596 Commits