This is half of a patch that Qu Fuping submitted in April. The first part
was applied to fs/mpage.c in 2.6.12-rc4.
jfs_fsync should return error, but it doesn't wait for the metadata page to
be uptodate, e.g.:
jfs_fsync->jfs_commit_inode->txCommit->diWrite->read_metapage->
__get_metapage->read_cache_page reads a page from disk. Because read is
async, when read_cache_page: err = filler(data, page), filler will not
return error, it just submits I/O request and returns. So, page is not
uptodate. Checking only if(IS_ERROR(mp->page)) is not enough, we should
add "|| !PageUptodate(mp->page)"
Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
In the rare case of failing to write the cleanmarker
the allocated node was not freed.
Pointed out by Forrest Zhao
Initial cleanup by Joern Engel
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Minimal patch removing uses of ROOT_DEV; next patch unexports it. I've
opposed this, but I've planned to reintroduce the functionality without using
ROOT_DEV.
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix the error message to refer to the error code, i.e. err, not count, plus
add some cosmetical fixes.
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Allow the setting of NLM timeouts and grace periods through the proc and
sysclt interfaces on x86_64 architectures
Signed-off-by: Steve Dickson <steved@redhat.com>
Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Something has changed in the core kernel such that we now get concurrent
inode write outs, one e.g via pdflush and one via sys_sync or whatever.
This causes a nasty deadlock in ntfs. The only clean solution
unfortunately requires a minor vfs api extension.
First the deadlock analysis:
Prerequisive knowledge: NTFS has a file $MFT (inode 0) loaded at mount
time. The NTFS driver uses the page cache for storing the file contents as
usual. More interestingly this file contains the table of on-disk inodes
as a sequence of MFT_RECORDs. Thus NTFS driver accesses the on-disk inodes
by accessing the MFT_RECORDs in the page cache pages of the loaded inode
$MFT.
The situation: VFS inode X on a mounted ntfs volume is dirty. For same
inode X, the ntfs_inode is dirty and thus corresponding on-disk inode,
which is as explained above in a dirty PAGE_CACHE_PAGE belonging to the
table of inodes ($MFT, inode 0).
What happens:
Process 1: sys_sync()/umount()/whatever... calls __sync_single_inode() for
$MFT -> do_writepages() -> write_page for the dirty page containing the
on-disk inode X, the page is now locked -> ntfs_write_mst_block() which
clears PageUptodate() on the page to prevent anyone else getting hold of it
whilst it does the write out (this is necessary as the on-disk inode needs
"fixups" applied before the write to disk which are removed again after the
write and PageUptodate is then set again). It then analyses the page
looking for dirty on-disk inodes and when it finds one it calls
ntfs_may_write_mft_record() to see if it is safe to write this on-disk
inode. This then calls ilookup5() to check if the corresponding VFS inode
is in icache(). This in turn calls ifind() which waits on the inode lock
via wait_on_inode whilst holding the global inode_lock.
Process 2: pdflush results in a call to __sync_single_inode for the same
VFS inode X on the ntfs volume. This locks the inode (I_LOCK) then calls
write-inode -> ntfs_write_inode -> map_mft_record() -> read_cache_page() of
the page (in page cache of table of inodes $MFT, inode 0) containing the
on-disk inode. This page has PageUptodate() clear because of Process 1
(see above) so read_cache_page() blocks when tries to take the page lock
for the page so it can call ntfs_read_page().
Thus Process 1 is holding the page lock on the page containing the on-disk
inode X and it is waiting on the inode X to be unlocked in ifind() so it
can write the page out and then unlock the page.
And Process 2 is holding the inode lock on inode X and is waiting for the
page to be unlocked so it can call ntfs_readpage() or discover that
Process 1 set PageUptodate() again and use the page.
Thus we have a deadlock due to ifind() waiting on the inode lock.
The only sensible solution: NTFS does not care whether the VFS inode is
locked or not when it calls ilookup5() (it doesn't use the VFS inode at
all, it just uses it to find the corresponding ntfs_inode which is of
course attached to the VFS inode (both are one single struct); and it uses
the ntfs_inode which is subject to its own locking so I_LOCK is irrelevant)
hence we want a modified ilookup5_nowait() which is the same as ilookup5()
but it does not wait on the inode lock.
Without such functionality I would have to keep my own ntfs_inode cache in
the NTFS driver just so I can find ntfs_inodes independent of their VFS
inodes which would be slow, memory and cpu cycle wasting, and incredibly
stupid given the icache already exists in the VFS.
Below is a patch that does the ilookup5_nowait() implementation in
fs/inode.c and exports it.
ilookup5_nowait.diff:
Introduce ilookup5_nowait() which is basically the same as ilookup5() but
it does not wait on the inode's lock (i.e. it omits the wait_on_inode()
done in ifind()).
This is needed to avoid a nasty deadlock in NTFS.
Signed-off-by: Anton Altaparmakov <aia21@cantab.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This moves the inotify sysctl knobs to "/proc/sys/fs/inotify" from
"/proc/sys/fs". Also some related cleanup.
Signed-off-by: Robert Love <rml@novell.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
All of the different xattr namespaces have different rules.
user.* and ACL's are not allowed on symlinks, and since these were the
first xattrs implemented, I assumed there was no need to support xattrs
on symlinks. This one-line patch should fix it.
Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
inotify is intended to correct the deficiencies of dnotify, particularly
its inability to scale and its terrible user interface:
* dnotify requires the opening of one fd per each directory
that you intend to watch. This quickly results in too many
open files and pins removable media, preventing unmount.
* dnotify is directory-based. You only learn about changes to
directories. Sure, a change to a file in a directory affects
the directory, but you are then forced to keep a cache of
stat structures.
* dnotify's interface to user-space is awful. Signals?
inotify provides a more usable, simple, powerful solution to file change
notification:
* inotify's interface is a system call that returns a fd, not SIGIO.
You get a single fd, which is select()-able.
* inotify has an event that says "the filesystem that the item
you were watching is on was unmounted."
* inotify can watch directories or files.
Inotify is currently used by Beagle (a desktop search infrastructure),
Gamin (a FAM replacement), and other projects.
See Documentation/filesystems/inotify.txt.
Signed-off-by: Robert Love <rml@novell.com>
Cc: John McCutchan <ttb@tentacle.dhs.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This was a pure indentation change, using:
scripts/Lindent fs/reiserfs/*.c include/linux/reiserfs_*.h
to make reiserfs match the regular Linux indentation style. As Jeff
Mahoney <jeffm@suse.com> writes:
The ReiserFS code is a mix of a number of different coding styles, sometimes
different even from line-to-line. Since the code has been relatively stable
for quite some time and there are few outstanding patches to be applied, it
is time to reformat the code to conform to the Linux style standard outlined
in Documentation/CodingStyle.
This patch contains the result of running scripts/Lindent against
fs/reiserfs/*.c and include/linux/reiserfs_*.h. There are places where the
code can be made to look better, but I'd rather keep those patches separate
so that there isn't a subtle by-hand hand accident in the middle of a huge
patch. To be clear: This patch is reformatting *only*.
A number of patches may follow that continue to make the code more consistent
with the Linux coding style.
Hans wasn't particularly enthusiastic about these patches, but said he
wouldn't really oppose them either.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
indent(1) doesn't know how to handle the "do not compile" error. It results
in the item_ops array declaration being indented a tab stop in when it should
not be. This patch replaces it with a #error that describes why it's failing.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
While fixing an oops in the st driver in a dirty release path, I
encountered an oops in cdev_put for cdevs allocated using cdev_alloc. If
cdev_del is called when the cdev kobject still has an open user, when the
last cdev_put is called, the cdev_put will call kobject_put, which will end
up ultimately releasing the cdev in cdev_dynamic_release. Patch fixes the
oops by preventing cdev_put from accessing freed memory.
Signed-off-by: Brian King <brking@us.ibm.com>
Cc: <viro@parcelfarce.linux.theplanet.co.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Restore old set of ext2 mount options when remounting of a filesystem
fails.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix a problem with ext3 mount option parsing. When remount of a filesystem
fails, old options are now restored.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When a noninitial thread does exec, it becomes the new group leader. If
there is a ITIMER_REAL timer running, it points at the old group leader and
when it fires it can follow a stale pointer. The timer data needs to be
reset to point at the exec'ing thread that is becoming the group leader.
This has to synchronize with any concurrent firing of the timer to make
sure that it_real_fn can never run when the data points to a thread that
might have been reaped already.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Bug symptoms
~~~~~~~~~~~~
For the same inode VFS calls read_inode() twice and doesn't call
clear_inode() between the two read_inode() invocations.
Bug description
~~~~~~~~~~~~~~~
Suppose we have an inode which has zero reference count but is still in
the inode cache. Suppose kswapd invokes shrink_icache_memory() to free
some RAM. In prune_icache() inodes are removed from i_hash. prune_icache
() is then going to call clear_inode(), but drops the inode_lock
spinlock before this. If in this moment another task calls iget() for an
inode which was just removed from i_hash by prune_icache(), then iget()
invokes read_inode() for this inode, because it is *already removed*
from i_hash.
The end result is: we call iget(#N) then iput(#N); inode #N has zero
i_count now and is in the inode cache; kswapd starts. kswapd removes the
inode #N from i_hash ans is preempted; we call iget(#N) again;
read_inode() is invoked as the result; but we expect clear_inode()
before.
Fix
~~~~~~~
To fix the bug I remove inodes from i_hash later, when clear_inode() is
actually called. I remove them from i_hash under spinlock protection.
Since the i_state is set to I_FREEING, it is safe to do this. The others
will sleep waiting for the inode state change.
I also postpone removing inodes from i_sb_list. It is not compulsory to
do so but I do it for readability reasons. Inodes are added/removed to
the lists together everywhere in the code and there is no point to
change this rule. This is harmless because the only user of i_sb_list
which somehow may interfere with me (invalidate_list()) is excluded by
the iprune_sem mutex.
The same race is possible in invalidate_list() so I do the same for it.
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch fixes queer behavior in __wait_on_freeing_inode().
If I_LOCK was not set it called yield(), effectively busy waiting for the
removal of the inode from the hash. This change was introduced within
"[PATCH] eliminate inode waitqueue hashtable" Changeset 1.1938.166.16 last
october by wli.
The solution is to restore the old behavior, of unconditionally waiting on
the waitqueue. It doesn't matter if I_LOCK is not set initally, the task
will go to sleep, and wake up when wake_up_inode() is called from
generic_delete_inode() after removing the inode from the hash chain.
Comment is also updated to better reflect current behavior.
This condition is very hard to trigger normally (simultaneous clear_inode()
with iget()) so probably only heavy stress testing can reveal any change of
behavior.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In case of a mount error locks might be uninitialized but
accessed by the resulting call to jffs2_kill_sb().
Signed-off-by: Artem B. Bityuckiy <dedekind@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We recently changed the method of collecting and sorting of
tmp_dnode objects to use a temporary RB-tree instead of a
temporary list. Rename function and update comments.
Signed-off-by: Artem B. Bityuckiy <dedekind@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
After discussion at the recent NFSv4 bake-a-thon, I realized that my
assumption that NFS4_FH_PERSISTENT required filehandles to persist was a
misreading of the spec. This also fixes an interoperability problem with the
Solaris client.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We shouldn't be allowing, e.g., write locks on files not open for read. To
enforce this, we add a pointer from the lock stateid back to the open stateid
it came from, so that the check will continue to be correct even after the
open is upgraded or downgraded.
Signed-off-by: Andy Adamson <andros@citi.umich.edu>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
As long as we're here, do some miscellaneous cleanup.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The handling of close_lru in preprocess_stateid_op was a source of some
confusion here recently. Try to make the logic a little clearer, by renaming
find_openstateowner_id to make its purpose clearer and untangling some
unnecessarily complicated goto's.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
nfs4_preprocess_seqid_op is called by NFSv4 operations that imply an implicit
renewal of the client lease.
Signed-off-by: Andy Adamson <andros@citi.umich.edu>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
from RFC 3530:
"Share reservations are established by OPEN operations and by their
nature are mandatory in that when the OPEN denies READ or WRITE
operations, that denial results in such operations being rejected
with error NFS4ERR_LOCKED."
(Note that share_denied is really only a legal error for OPEN.)
Signed-off-by: Andy Adamson <andros@citi.umich.edu>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
An OPEN from the same client/open stateowner requires a stateid update because
of the share/deny access update.
Signed-off-by: Andy Adamson <andros@citi.umich.edu>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We're insisting that the lock sequence id field passed in the
open_to_lockowner struct always be zero. This is probably thanks to the
sentence in rfc3530: "The first request issued for any given lock_owner is
issued with a sequence number of zero."
But there doesn't seem to be any problem with allowing initial sequence
numbers other than zero. And currently this is causing lock reclaims from the
Linux client to fail.
In the spirit of "be liberal in what you accept, conservative in what you
send", we'll relax the check (and patch the Linux client as well).
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add some comments on the use of so_seqid, in an attempt to avoid some of the
confusion outlined in the previous patch....
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The sequence number we store in the sequence id is the last one we received
from the client. So on the next operation we'll check that the client gives
us the next higher number.
We increment sequence id's at the last moment, in encode, so that we're sure
of knowing the right error return. (The decision to increment the sequence id
depends on the exact error returned.)
However on the *first* use of a sequence number, if we set the sequence number
to the one received from the client and then let the increment happen on
encode, we'll be left with a sequence number one to high.
For that reason, ENCODE_SEQID_OP_TAIL only increments the sequence id on
*confirmed* stateowners.
This creates a problem for open reclaims, which are confirmed on first use.
Therefore the open reclaim code, as a special exception, *decrements* the
sequence id, cancelling out the undesired increment on encode. But this
prevents the sequence id from ever being incremented in the case where
multiple reclaims are sent with the same openowner. Yuch!
We could add another exception to the open reclaim code, decrementing the
sequence id only if this is the first use of the open owner.
But it's simpler by far to modify the meaning of the op_seqid field: instead
of representing the previous value sent by the client, we take op_seqid, after
encoding, to represent the *next* sequence id that we expect from the client.
This eliminates the need for special-case handling of the first use of a
stateowner.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Yeah, it's trivial, but this drives me up the wall....
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A misreading of the spec lead us to convert all errors on open and lock
reclaims to RECLAIM_BAD. This causes problems--for example, a reboot within
the grace period could lead to reclaims with stale stateid's, and we'd like to
return STALE errors in those cases.
What rfc3530 actually says about RECLAIM_BAD: "The reclaim provided by the
client does not match any of the server's state consistency checks and is
bad." I'm assuming that "state consistency checks" refers to checks for
consistency with the state recorded to stable storage, and that the error
should be reserved for that case.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A GRACE or NOGRACE response to a lock request should also bump the sequence
id. So we delay the handling of grace period errors till after we've found
the relevant owner.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The GRACE and NOGRACE errors should bump the sequence id on open. So we delay
the handling of these errors until nfsd4_process_open2, at which point we've
set the open owner, so the encode routine will be able to bump the sequence
id.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We oops in list_for_each_entry(), because release_stateowner frees something
on the list we're traversing.
Signed-off-by: Andy Adamson <andros@citi.umich.edu>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make sure we don't try to delete client recovery directories multiple times;
fixes some spurious error messages.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Oops, this lookup_one_len needs the i_sem.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We need to fsync the recovery directory after writing to it, but we weren't
doing this correctly. (For example, we weren't taking the i_sem when calling
->fsync().)
Just reuse the existing nfsd fsync code instead.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We need to remove the recovery directory here too. (This chunk just got lost
somehow in the process of commuting the reboot recovery patches past the other
patches.)
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch renames _mntput() to something a little more descriptive:
mntput_no_expire().
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch renames vfsmount->mnt_fslink to something a little more
descriptive: vfsmount->mnt_expire.
Signed-off-by: Mike Waychison <michael.waychison@sun.com>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Dcookies shouldn't play with the internals of dentry and vfsmnt
refcounting. It defeats grepping, and is prone to break if implementation
details change.
In addition the function doesn't even seem to be performance critical: it
calls kmem_cache_alloc().
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch sets ->mnt_namespace where it's actually added to the
namespace.
Previously mnt_namespace was set in do_kern_mount() even if the filesystem
was never added to any process's namespace (most kernel-internal
filesystems).
This discrepancy doesn't actually cause any problems, but it's cleaner if
mnt_namespace is NULL for these non exported filesystems.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch clears mnt_namespace in an expired mount.
If mnt_namespace is not cleared, it's possible to attach a new mount to the
already detached mount, because check_mnt() can return true.
The effect is a resource leak, since the resulting tree will never be
freed.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch fixes a bug noticed by Al Viro:
However, we still have a problem here - just what would
happen if vfsmount is detached while we were grabbing namespace
semaphore? Refcount alone is not useful here - we might be held by
whoever had detached the vfsmount. IOW, we should check that it's
still attached (i.e. that mnt->mnt_parent != mnt). If it's not -
just leave it alone, do mntput() and let whoever holds it deal with
the sucker. No need to put it back on lists.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <viro@parcelfarce.linux.theplanet.co.uk>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch splits the mark_mounts_for_expiry() function. It's too complex and
too deeply nested, even without the bugfix in the following patch.
Otherwise code is completely the same.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <viro@parcelfarce.linux.theplanet.co.uk>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch simplifies mark_mounts_for_expiry() by using detach_mnt() instead
of duplicating everything it does.
It should be an equivalent transformation except for righting the dput/mntput
order.
Al Viro said: "Looks sane".
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <viro@parcelfarce.linux.theplanet.co.uk>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch fixes a race found by Ram in mark_mounts_for_expiry() in
fs/namespace.c.
The bug can only be triggered with simultaneous exiting of a process having
a private namespace, and expiry of a mount from within that namespace.
It's practically impossible to trigger, and I haven't even tried. But
still, a bug is a bug.
The race happens when put_namespace() is called by another task, while
mark_mounts_for_expiry() is between atomic_read() and get_namespace(). In
that case get_namespace() will be called on an already dead namespace with
unforeseeable results.
The solution was suggested by Al Viro, with his own words:
Instead of screwing with atomic_read() in there, why don't we
simply do the following:
a) atomic_dec_and_lock() in put_namespace()
b) __put_namespace() called without dropping lock
c) the first thing done by __put_namespace would be
struct vfsmount *root = namespace->root;
namespace->root = NULL;
spin_unlock(...);
....
umount_tree(root);
...
d) check in mark_... would be simply namespace && namespace->root.
And we are all set; no screwing around with atomic_read(), no magic
at all. Dying namespace gets NULL ->root.
All changes of ->root happen under spinlock.
If under a spinlock we see non-NULL ->mnt_namespace, it won't be
freed until we drop the lock (we will set ->mnt_namespace to NULL
under that lock before we get to freeing namespace).
If under a spinlock we see non-NULL ->mnt_namespace and
->mnt_namespace->root, we can grab a reference to namespace and be
sure that it won't go away.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Acked-by: Al Viro <viro@parcelfarce.linux.theplanet.co.uk>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch clears mnt_namespace on unmount.
Not clearing mnt_namespace has two effects:
1) It is possible to attach a new mount to a detached mount,
because check_mnt() returns true.
This means, that when no other references to the detached mount
remain, it still can't be freed. This causes a resource leak,
and possibly un-removable modules.
2) If mnt_namespace is dereferenced (only in mark_mounts_for_expiry())
after the namspace has been freed, it can cause an Oops, memory
corruption, etc.
1) has been tested before and after the patch, 2) is only speculation.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We're dereferencing `flp' and then we're testing it for NULLness.
Either the compiler accidentally saved us or the existing null-pointer checdk
is redundant.
This defect was found automatically by Coverity Prevent, a static analysis tool.
Signed-off-by: Zaur Kambarov <zkambarov@coverity.com>
Cc: Matthew Wilcox <willy@debian.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix debugging printk.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We are not using the in-inode space for xattrs in reserved inodes because
mkfs.ext3 doesn't initialize it properly. For those inodes, we set
i_extra_isize to 0. Make sure that we also don't overwrite the
i_extra_isize field when writing out the inode in that case. This is for
cleanliness only, and doesn't fix an actual bug.
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add a new section called ".data.read_mostly" for data items that are read
frequently and rarely written to like cpumaps etc.
If these maps are placed in the .data section then these frequenly read
items may end up in cachelines with data is is frequently updated. In that
case all processors in an SMP system must needlessly reload the cachelines
again and again containing elements of those frequently used variables.
The ability to share these cachelines will allow each cpu in an SMP system
to keep local copies of those shared cachelines thereby optimizing
performance.
Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com>
Signed-off-by: Christoph Lameter <christoph@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Original patch from Matt Mackall <mpm@selenic.com>
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Use a bit spin lock in the first buffer of the page to synchronise asynch
IO buffer completions, instead of the global page_uptodate_lock, which is
showing some scalabilty problems.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Some time ago a trivial patch broke HPPFS (one var became a pointer, not
all uses were updated). It wasn't fixed at that time because not very
used, now it's been requested so I've fixed this, and it has been tested
positively (at least partially).
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- Make ioprio syscalls return long, like set/getpriority syscalls.
- Move function prototypes into syscalls.h so we can pick them up in the
32/64bit compat code.
Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
OCFS2 wants to mark an inode which has been orphaned by another node so
that during final iput it takes the correct path through the VFS and can
pass through the OCFS2 delete_inode callback. Since i_nlink can get out of
date with other nodes, the best way I see to accomplish this is by clearing
i_nlink on those inodes at drop_inode time. Other than this small amount
of work, nothing different needs to happen, so I think it would be cleanest
to be able to just call generic_drop_inode at the end of the OCFS2
drop_inode callback.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It isn't _normal_ that we allow key collision in rbtrees,
but it does not matter as long as the two nodes with the same
version are together.
Signed-off-by: Artem B. Bityuckiy <dedekind@infradead.org>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Use an rbtree instead of a simple linked list. We were wasting
an amazing amount of time in jffs2_add_tn_to_list().
Thanks to Artem Bityuckiy and Jarkko Jlavinen for noticing.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch addresses the following minor issues:
- Typo in printk
- Redundant casts
- Use C99 struct initializers instead of memset
- Parenthesis around return value
- Use inline instead of __inline__
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch removes 2.4 compatability header from freevxfs.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- fix a buffer_head leak in vxfs_getfsh()
- s/SLAB_KERNEL/GFP_KERNEL/
- check sb_bread() return value
- drop pointless buffer-mapped() test.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
udf_find_entry can never be called with a NULL argument, so we shouldn't
check for it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The following patch enables attrs by default if the reiserfs_attrs_cleared
bit is set in the superblock. This allows chattr-type attrs to be used
without any further action by the user.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
ReiserFS currently will allow the user to set/get attrs for files
regardless if they are enabled. The patch checks to see if they are
enabled, and returns -NOTTY if they are not.
ext[23] doesn't need this check because attrs are always enabled.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently in ext3 block reservation code, the global filesystem reservation
tree lock (rsv_block) is hold during the process of searching for a space
to make a new reservation window, including while scaning the block bitmap
to verify if the avalible window has a free block. Holding the lock during
bitmap scan is unnecessary and could possibly cause scalability issue and
latency issues.
This patch tries to address this by dropping the lock before scan the
bitmap. Before that we need to reserve the open window in case someone
else is targetting at the same window. Question was should we reserve the
whole free reservable space or just the window size we need. Reserve the
whole free reservable space will possibly force other threads which
intended to do block allocation nearby move to another block group(cause
bad layout). In this patch, we just reserve the desired size before drop
the lock and scan the block bitmap. This patch fixed a ext3 reservation
latency issue seen on a cvs check out test. Patch is tested with many fsx,
tiobench, dbench and untar a kernel test.
Signed-Off-By: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The return value of "match_int" is checked 27 out of 28 times
In lib/parser.c
142 /**
143 * match_int: - scan a decimal representation of an integer from a substring_t
144 * @s: substring_t to be scanned
145 * @result: resulting integer on success
146 *
147 * Description: Attempts to parse the &substring_t @s as a decimal integer. On
148 * success, sets @result to the integer represented by the string and returns 0.
149 * Returns either -ENOMEM or -EINVAL on failure.
150 */
151 int match_int(substring_t *s, int *result)
152 {
153 return match_number(s, result, 0);
154 }
Signed-off-by: Zaur Kambarov <zkambarov@coverity.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In the case of buffered AIO, in the aio retry path (aio_run_iocb), when the
retry method returns EIOCBRETRY the kicked iocb is added to the context run
list but is never queued onto the work queue. The request therefore is
never completed.
This patch fixes that by adding the appropriate call to aio_queue_work in
aio_run_aiocb so that subsequent retries will be handled by the aio worker
thread.
Signed-off-by: Sébastien Dugué <sebastien.dugue@bull.net>
Acked-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Looks like it sneaked back with the NFS ACL merge..
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This up() should be down() instead.
Signed-off-by: Wen-chien Jesse Sung <jesse@cola.voip.idv.tw>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This updates the CFQ io scheduler to the new time sliced design (cfq
v3). It provides full process fairness, while giving excellent
aggregate system throughput even for many competing processes. It
supports io priorities, either inherited from the cpu nice value or set
directly with the ioprio_get/set syscalls. The latter closely mimic
set/getpriority.
This import is based on my latest from -mm.
Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I'm finally getting around to cleaning out debug code that I've never used.
There has always been code ifdef'ed out by _JFS_DEBUG_DMAP, _JFS_DEBUG_IMAP,
_JFS_DEBUG_DTREE, and _JFS_DEBUG_XTREE, which I have personally never used,
and I doubt that anyone has since the design stage back in OS/2. There is
also a function, xtGather, that has never been used, and I don't know why it
was ever there.
Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
The situation: VFS inode X on a mounted ntfs volume is dirty. For
same inode X, the ntfs_inode is dirty and thus corresponding on-disk
inode, i.e. mft record, which is in a dirty PAGE_CACHE_PAGE belonging
to the table of inodes, i.e. $MFT, inode 0.
What happens:
Process 1: sys_sync()/umount()/whatever... calls
__sync_single_inode() for $MFT -> do_writepages() -> write_page for
the dirty page containing the on-disk inode X, the page is now locked
-> ntfs_write_mst_block() which clears PageUptodate() on the page to
prevent anyone else getting hold of it whilst it does the write out.
This is necessary as the on-disk inode needs "fixups" applied before
the write to disk which are removed again after the write and
PageUptodate is then set again. It then analyses the page looking
for dirty on-disk inodes and when it finds one it calls
ntfs_may_write_mft_record() to see if it is safe to write this
on-disk inode. This then calls ilookup5() to check if the
corresponding VFS inode is in icache(). This in turn calls ifind()
which waits on the inode lock via wait_on_inode whilst holding the
global inode_lock.
Process 2: pdflush results in a call to __sync_single_inode for the
same VFS inode X on the ntfs volume. This locks the inode (I_LOCK)
then calls write-inode -> ntfs_write_inode -> map_mft_record() ->
read_cache_page() for the page (in page cache of table of inodes
$MFT, inode 0) containing the on-disk inode. This page has
PageUptodate() clear because of Process 1 (see above) so
read_cache_page() blocks when it tries to take the page lock for the
page so it can call ntfs_read_page().
Thus Process 1 is holding the page lock on the page containing the
on-disk inode X and it is waiting on the inode X to be unlocked in
ifind() so it can write the page out and then unlock the page.
And Process 2 is holding the inode lock on inode X and is waiting for
the page to be unlocked so it can call ntfs_readpage() or discover
that Process 1 set PageUptodate() again and use the page.
Thus we have a deadlock due to ifind() waiting on the inode lock.
The solution: The fix is to use the newly introduced
ilookup5_nowait() which does not wait on the inode's lock and hence
avoids the deadlock. This is safe as we do not care about the VFS
inode and only use the fact that it is in the VFS inode cache and the
fact that the vfs and ntfs inodes are one struct in memory to find
the ntfs inode in memory if present. Also, the ntfs inode has its
own locking so it does not matter if the vfs inode is locked.
Signed-off-by: Anton Altaparmakov <aia21@cantab.net>
Missed conversion in the swsusp cleanup.
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
1. Establish a simple API for process freezing defined in linux/include/sched.h:
frozen(process) Check for frozen process
freezing(process) Check if a process is being frozen
freeze(process) Tell a process to freeze (go to refrigerator)
thaw_process(process) Restart process
frozen_process(process) Process is frozen now
2. Remove all references to PF_FREEZE and PF_FROZEN from all
kernel sources except sched.h
3. Fix numerous locations where try_to_freeze is manually done by a driver
4. Remove the argument that is no longer necessary from two function calls.
5. Some whitespace cleanup
6. Clear potential race in refrigerator (provides an open window of PF_FREEZE
cleared before setting PF_FROZEN, recalc_sigpending does not check
PF_FROZEN).
This patch does not address the problem of freeze_processes() violating the rule
that a task may only modify its own flags by setting PF_FREEZE. This is not clean
in an SMP environment. freeze(process) is therefore not SMP safe!
Signed-off-by: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>