Commit Graph

16615 Commits

Author SHA1 Message Date
Yan, Zheng
7a7965f83e Btrfs: Fix oopsen when dropping empty tree.
When dropping a empty tree, walk_down_tree() skips checking
extent information for the tree root. This will triggers a
BUG_ON in walk_up_proc().

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-02-04 11:31:45 -05:00
Miao Xie
d7ce5843bb Btrfs: remove BUG_ON() due to mounting bad filesystem
Mounting a bad filesystem caused a BUG_ON(). The following is steps to
reproduce it.
 # mkfs.btrfs /dev/sda2
 # mount /dev/sda2 /mnt
 # mkfs.btrfs /dev/sda1 /dev/sda2
 (the program says that /dev/sda2 was mounted, and then exits. )
 # umount /mnt
 # mount /dev/sda1 /mnt

At the third step, mkfs.btrfs exited in the way of make filesystem. So the
initialization of the filesystem didn't finish. So the filesystem was bad, and
it caused BUG_ON() when mounting it. But BUG_ON() should be called by the wrong
code, not user's operation, so I think it is a bug of btrfs.

This patch fixes it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-02-04 11:31:44 -05:00
Roel Kluin
014e4ac4f7 Btrfs: make error return negative in btrfs_sync_file()
It appears the error return should be negative

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-02-04 11:31:44 -05:00
Yan, Zheng
f044ba7835 Btrfs: fix race between allocate and release extent buffer.
Increase extent buffer's reference count while holding the lock.
Otherwise it can race with try_release_extent_buffer.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-02-04 11:31:44 -05:00
Sunil Mushran
cda70ba8c0 ocfs2/dlm: Remove BUG_ON in dlm recovery when freeing locks of a dead node
During recovery, the dlm frees the locks for the dead node. If it finds a
lock in a resource for the dead node, it expects that node to also have a
ref in that lock resource. If not, it BUGs.

ossbz#1175 was filed with the above BUG. Now, while it is correct that we
should be expecting the ref, I see no reason why we have to BUG. After all,
we are freeing up the lock and clearing the ref.

This patch replaces the BUG_ON with a printk(). Hopefully, that will give
us more clues next time this happens.

http://oss.oracle.com/bugzilla/show_bug.cgi?id=1175

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-03 17:51:41 -08:00
Sunil Mushran
079b805782 ocfs2: Plugs race between the dc thread and an unlock ast message
This patch plugs a race between the downconvert thread and an unlock ast message.
Specifically, after the downconvert worker has done its task, the dc thread needs
to check whether an unlock ast made the downconvert moot.

Reported-by: David Teigland <teigland@redhat.com>
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Acked-by: Mark Fasheh <mfasheh@sus.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-03 17:26:03 -08:00
Linus Torvalds
c1c0cbb878 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
  nilfs2: fix potential leak of dirty data on umount
2010-02-03 08:47:15 -08:00
Trond Myklebust
9b4b351346 NFS: Don't clobber the attribute type in nfs_update_inode()
If the NFS_ATTR_FATTR_TYPE field isn't set in fattr->valid, then we should
not set the S_IFMT part of inode->i_mode.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-02-03 08:27:35 -05:00
Trond Myklebust
387c149b54 NFS: Fix a umount race
Ensure that we unregister the bdi before kill_anon_super() calls
ida_remove() on our device name.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
2010-02-03 08:27:35 -05:00
Trond Myklebust
9f557cd807 NFS: Fix an Oops when truncating a file
The VM/VFS does not allow mapping->a_ops->invalidatepage() to fail.
Unfortunately, nfs_wb_page_cancel() may fail if a fatal signal occurs.
Since the NFS code assumes that the page stays mapped for as long as the
writeback is active, we can end up Oopsing (among other things).

The only safe fix here is to convert nfs_wait_on_request(), so as to make
it uninterruptible (as is already the case with wait_on_page_writeback()).


Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
2010-02-03 08:27:22 -05:00
Steven Whitehouse
8f05228ee7 GFS2: Extend umount wait coverage to full glock lifetime
Although all glocks are, by the time of the umount glock wait,
scheduled for demotion, some of them haven't made it far
enough through the process for the original set of waiting
code to wait for them.

This extends the ref count to the whole glock lifetime in order
to ensure that the waiting does catch all glocks. It does make
it a bit more invasive, but it seems the only sensible solution
at the moment.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-02-03 09:56:21 +00:00
Steven Whitehouse
e402746a94 GFS2: Wait for unlock completion on umount
This patch adds a wait on umount between the point at which we
dispose of all glocks and the point at which we unmount the
lock protocol. This ensures that we've received all the replies
to our unlock requests before we stop the locking.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Reported-by: Fabio M. Di Nitto <fdinitto@redhat.com>
2010-02-03 09:47:04 +00:00
Sunil Mushran
db0f6ce697 ocfs2: Remove overzealous BUG_ON during blocked lock processing
During blocked lock processing, we should consider the possibility that the
lock is no longer blocking.

Joel Becker <joel.becker@oracle.com> assisted in fixing this issue.

Reported-by: David Teigland <teigland@redhat.com>
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-02 23:51:16 -08:00
Sunil Mushran
0d74125a6a ocfs2: Do not downconvert if the lock level is already compatible
During upconvert, if the master were to send a BAST, dlmglue will detect the
upconversion in process and send a cancel convert to the master. Upon receiving
the AST for the cancel convert, it will re-process the lock resource to determine
whether it needs downconverting. Say, the up was from PR to EX and the BAST was
for EX. After the cancel convert, it will need to downconvert to NL.

However, if the node was originally upconverting from NL to EX, then there would
be no reason to downconvert (assuming the same message sequence).

This patch makes dlmglue consider the possibility that the current lock level
is already compatible and that downconverting is not required.

Joel Becker <joel.becker@oracle.com> assisted in fixing this issue.

Fixes ossbz#1178
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1178

Reported-by: Coly Li <coly.li@suse.de>
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-02 23:51:14 -08:00
Sunil Mushran
a191282601 ocfs2: Prevent a livelock in dlmglue
There is possibility of a livelock in __ocfs2_cluster_lock(). If a node were
to get an ast for an upconvert request, followed immediately by a bast,
there is a small window where the fs may downconvert the lock before the
process requesting the upconvert is able to take the lock.

This patch adds a new flag to indicate that the upconvert is still in
progress and that the dc thread should not downconvert it right now.

Wengang Wang <wen.gang.wang@oracle.com> and Joel Becker
<joel.becker@oracle.com> contributed heavily to this patch.

Reported-by: David Teigland <teigland@redhat.com>
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-02 23:51:13 -08:00
Wengang Wang
0b94a909eb ocfs2: Fix setting of OCFS2_LOCK_BLOCKED during bast
During bast, set the OCFS2_LOCK_BLOCKED flag only if the lock needs to
downconverted.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Acked-by: Sunil Mushran <sunil.mushran@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-02 23:50:55 -08:00
Tao Ma
34e6c59af0 ocfs2: Use compat_ptr in reflink_arguments.
Although we use u64 to pass userspace pointers to the kernel
to avoid compat_ioctl, it doesn't work in some ppc platform.
So wrap them with compat_ptr and add compat_ioctl.

The detailed discussion about compat_ptr can be found in thread
http://lkml.org/lkml/2009/10/27/423.

We indeed met with a bug when testing on ppc(-EFAULT is returned
when using old_path). This patch try to fix this.
I have tested in ppc64(with 32 bit reflink) and x86_64(with i686
reflink), both works.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-02 18:56:37 -08:00
Sunil Mushran
cd34edd8cf ocfs2/dlm: Handle EAGAIN for compatibility - v2
Mainline commit aad1b15310 made the
dlm_begin_reco_handler() return -EAGAIN instead of EAGAIN.

As this error is transmitted over the wire, we want the receiver,
dlm_send_begin_reco_message(), to understand both the older EAGAIN and
the newer -EAGAIN, to allow rolling upgrade of the cluster nodes.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-02 18:56:34 -08:00
Tao Ma
60c486744c ocfs2: Add parenthesis to wrap the check for O_DIRECT.
Add parenthesis to wrap the check for O_DIRECT.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-02 18:15:37 -08:00
Tao Ma
0a1ea437d8 ocfs2: Only bug out when page size is larger than cluster size.
In CoW, we have to make sure that the page is already written
out to the disk. So we have a BUG_ON(PageDirty(page)).

In ppc platform we have pagesize=64K, so if the cs=4K, if the
file have fragmented clusters, we will map the page many times.
See this file as an example.
Tree Depth: 0   Count: 19   Next Free Rec: 14
	## Offset        Clusters       Block#          Flags
	0  0             4              2164864         0x2 Refcounted
	1  4             2              9302792         0x2 Refcounted
...

We have to replace the extent recs one by one, so the page with index 0
will be mapped and dirtied twice.

I'd like to leave the BUG_ON there while adding a check so that in
case we meet with an error in other platforms, we can find it easily.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-02 18:15:35 -08:00
Tao Ma
d622b89a2f ocfs2: Fix memory overflow in cow_by_page.
In ocfs2_duplicate_clusters_by_page, we calculate map_end
by shifting page_index. But actually in case we meet with
a large offset(say in a i686 box, poff_t is only 32 bits
and page_index=2056240), we will overflow. So change the
type of page_index to loff_t.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-02 18:14:20 -08:00
anfei zhou
931e80e4b3 mm: flush dcache before writing into page to avoid alias
The cache alias problem will happen if the changes of user shared mapping
is not flushed before copying, then user and kernel mapping may be mapped
into two different cache line, it is impossible to guarantee the coherence
after iov_iter_copy_from_user_atomic.  So the right steps should be:

	flush_dcache_page(page);
	kmap_atomic(page);
	write to page;
	kunmap_atomic(page);
	flush_dcache_page(page);

More precisely, we might create two new APIs flush_dcache_user_page and
flush_dcache_kern_page to replace the two flush_dcache_page accordingly.

Here is a snippet tested on omap2430 with VIPT cache, and I think it is
not ARM-specific:

	int val = 0x11111111;
	fd = open("abc", O_RDWR);
	addr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
	*(addr+0) = 0x44444444;
	tmp = *(addr+0);
	*(addr+1) = 0x77777777;
	write(fd, &val, sizeof(int));
	close(fd);

The results are not always 0x11111111 0x77777777 at the beginning as expected.  Sometimes we see 0x44444444 0x77777777.

Signed-off-by: Anfei <anfei.zhou@gmail.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: <linux-arch@vger.kernel.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-02-02 18:11:21 -08:00
Linus Torvalds
1a45dcfe25 Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  cfq-iosched: Do not idle on async queues
  blk-cgroup: Fix potential deadlock in blk-cgroup
  block: fix bugs in bio-integrity mempool usage
  block: fix bio_add_page for non trivial merge_bvec_fn case
  drbd: null dereference bug
  drbd: fix max_segment_size initialization
2010-02-02 12:54:37 -08:00
Linus Torvalds
4dab75ec3e Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes:
  GFS2: Use GFP_NOFS for alloc structure
  GFS2: Fix previous patch
  GFS2: Don't withdraw on partial rindex entries
  GFS2: Fix refcnt leak on gfs2_follow_link() error path
2010-02-02 12:48:26 -08:00
Linus Torvalds
7ab02af428 Fix 'flush_old_exec()/setup_new_exec()' split
Commit 221af7f87b ("Split 'flush_old_exec' into two functions") split
the function at the point of no return - ie right where there were no
more error cases to check.  That made sense from a technical standpoint,
but when we then also combined it with the actual personality setting
going in between flush_old_exec() and setup_new_exec(), it needs to be a
bit more careful.

In particular, we need to make sure that we really flush the old
personality bits in the 'flush' stage, rather than later in the 'setup'
stage, since otherwise we might be flushing the _new_ personality state
that we're just setting up.

So this moves the flags and personality flushing (and 'flush_thread()',
which is the arch-specific function that generally resets lazy FP state
etc) of the old process into flush_old_exec(), so that it doesn't affect
any state that execve() is setting up for the new process environment.

This was reported by Michal Simek as breaking his Microblaze qemu
environment.

Reported-and-tested-by: Michal Simek <michal.simek@petalogix.com>
Cc: Peter Anvin <hpa@zytor.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-02-02 12:37:44 -08:00
Linus Torvalds
13af75740f Merge branch 'reiserfs/kill-bkl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing
* 'reiserfs/kill-bkl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing:
  reiserfs: Fix vmalloc call under reiserfs lock
2010-02-01 10:46:18 -08:00
Steven Whitehouse
ea8d62dadd GFS2: Use GFP_NOFS for alloc structure
This is called under a glock, so its a good plan to use GFP_NOFS

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-02-01 10:01:34 +00:00
Steven Whitehouse
7fe3ec6fe5 GFS2: Fix previous patch
The do_div() call needs to remain.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-02-01 10:00:23 +00:00
Benjamin Marzinski
55f0b4c546 GFS2: Don't withdraw on partial rindex entries
ince gfs2 writes the rindex file a block at a time, and releases the
exclusive lock after each block, it is possible that another process
will grab the lock in the middle of the write.  Since rindex entries are
not an even divisor of blocks, that other process may see partial
entries.  On grows, this is fine.  The process can simply ignore the the
partial entires. Previously, the code withdrew when it saw partial
entries. Now it simply ignores them.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-02-01 09:59:54 +00:00
Ryusuke Konishi
3256a05531 nilfs2: fix potential leak of dirty data on umount
This fixes incorrect usage of nilfs_segctor_confirm() test function in
nilfs_segctor_destroy(); nilfs_segctor_confirm() returns zero if the
filesystem is not clean, so its use in nilfs_segctor_destroy() needs
inversion.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-01-31 14:57:31 +09:00
Chuck Ebbert
9e9432c267 block: fix bugs in bio-integrity mempool usage
Fix two bugs in the bio integrity code:

 use_bip_pool() always returns 0 because it checks against the wrong limit,
 causing the mempool to be used only when regular allocation fails.

 When the mempool is used as a fallback we don't free the data properly.

Signed-Off-By: Chuck Ebbert <cebbert@redhat.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-01-30 20:28:19 +01:00
Linus Torvalds
67f15b06c1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: check total number of devices when removing missing
  Btrfs: check return value of open_bdev_exclusive properly
  Btrfs: do not mark the chunk as readonly if in degraded mode
  Btrfs: run orphan cleanup on default fs root
  Btrfs: fix a memory leak in btrfs_init_acl
  Btrfs: Use correct values when updating inode i_size on fallocate
  Btrfs: remove tree_search() in extent_map.c
  Btrfs: Add mount -o compress-force
2010-01-29 10:27:37 -08:00
Linus Torvalds
221af7f87b Split 'flush_old_exec' into two functions
'flush_old_exec()' is the point of no return when doing an execve(), and
it is pretty badly misnamed.  It doesn't just flush the old executable
environment, it also starts up the new one.

Which is very inconvenient for things like setting up the new
personality, because we want the new personality to affect the starting
of the new environment, but at the same time we do _not_ want the new
personality to take effect if flushing the old one fails.

As a result, the x86-64 '32-bit' personality is actually done using this
insane "I'm going to change the ABI, but I haven't done it yet" bit
(TIF_ABI_PENDING), with SET_PERSONALITY() not actually setting the
personality, but just the "pending" bit, so that "flush_thread()" can do
the actual personality magic.

This patch in no way changes any of that insanity, but it does split the
'flush_old_exec()' function up into a preparatory part that can fail
(still called flush_old_exec()), and a new part that will actually set
up the new exec environment (setup_new_exec()).  All callers are changed
to trivially comply with the new world order.

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-29 08:22:01 -08:00
Josef Bacik
035fe03a7a Btrfs: check total number of devices when removing missing
If you have a disk failure in RAID1 and then add a new disk to the
array, and then try to remove the missing volume, it will fail.  The
reason is the sanity check only looks at the total number of rw devices,
which is just 2 because we have 2 good disks and 1 bad one.  Instead
check the total number of devices in the array to make sure we can
actually remove the device.  Tested this with a failed disk setup and
with this test we can now run

btrfs-vol -r missing /mount/point

and it works fine.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-01-28 16:20:39 -05:00
Josef Bacik
7f59203abe Btrfs: check return value of open_bdev_exclusive properly
Hit this problem while testing RAID1 failure stuff.  open_bdev_exclusive
returns ERR_PTR(), not NULL.  So change the return value properly.  This
is important if you accidently specify a device that doesn't exist when
trying to add a new device to an array, you will panic the box
dereferencing bdev.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-01-28 16:20:39 -05:00
Josef Bacik
f48b90756b Btrfs: do not mark the chunk as readonly if in degraded mode
If a RAID setup has chunks that span multiple disks, and one of those
disks has failed, btrfs_chunk_readonly will return 1 since one of the
disks in that chunk's stripes is dead and therefore not writeable.  So
instead if we are in degraded mode, return 0 so we can go ahead and
allocate stuff.  Without this patch all of the block groups in a RAID1
setup will end up read-only, which will mean we can't add new disks to
the array since we won't be able to make allocations.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-01-28 16:20:39 -05:00
Josef Bacik
e3acc2a685 Btrfs: run orphan cleanup on default fs root
This patch revert's commit

6c090a11e1

Since it introduces this problem where we can run orphan cleanup on a
volume that can have orphan entries re-added.  Instead of my original
fix, Yan Zheng pointed out that we can just revert my original fix and
then run the orphan cleanup in open_ctree after we look up the fs_root.
I have tested this with all the tests that gave me problems and this
patch fixes both problems.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-01-28 16:20:39 -05:00
Yang Hongyang
f858153c36 Btrfs: fix a memory leak in btrfs_init_acl
In btrfs_init_acl() cloned acl is not released

Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-01-28 16:20:39 -05:00
Aneesh Kumar K.V
d1ea6a6145 Btrfs: Use correct values when updating inode i_size on fallocate
commit f2bc9dd07e3424c4ec5f3949961fe053d47bc825
Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Date:   Wed Jan 20 12:57:53 2010 +0530

    Btrfs: Use correct values when updating inode i_size on fallocate

    Even though we allocate more, we should be updating inode i_size
    as per the arguments passed

    Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-01-28 16:20:38 -05:00
Miao Xie
b8d9bfeb18 Btrfs: remove tree_search() in extent_map.c
This patch removes tree_search() in extent_map.c because it is not called by
anything.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-01-28 16:20:38 -05:00
Chris Mason
a555f810af Btrfs: Add mount -o compress-force
The default btrfs mount -o compress mode will quickly back off
compressing a file if it notices that compression does not reduce the
size of the data being written.  This can save considerable CPU because
all future writes to the file go through uncompressed.

But some files are both very large and have mixed data stored in
them.  In that case, we want to add the ability to always try
compressing data before writing it.

This commit adds mount -o compress-force.  A later commit will add
a new inode flag that does the same thing.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-01-28 16:18:15 -05:00
Dmitry Monakhov
1d6165851c block: fix bio_add_page for non trivial merge_bvec_fn case
We have to properly decrease bi_size in order to merge_bvec_fn return
right result.  Otherwise this result in false merge rejects for two
absolutely valid bio_vecs.  This may cause significant performance
penalty for example fs_block_size == 1k and block device is raid0 with
small chunk_size = 8k. Then it is impossible to merge 7-th fs-block in
to bio which already has 6 fs-blocks.

Cc: <stable@kernel.org>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-01-28 15:08:29 +01:00
Frederic Weisbecker
bbec919150 reiserfs: Fix vmalloc call under reiserfs lock
Vmalloc is called to allocate journal->j_cnode_free_list but
we hold the reiserfs lock at this time, which raises a
{RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} lock inversion.

Just drop the reiserfs lock at this time, as it's not even
needed but kept for paranoid reasons.

This fixes:

[ INFO: inconsistent lock state ]
2.6.33-rc5 #1
---------------------------------
inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
kswapd0/313 [HC0[0]:SC0[0]:HE1:SE1] takes:
 (&REISERFS_SB(s)->lock){+.+.?.}, at: [<c11118c8>]
reiserfs_write_lock_once+0x28/0x50
{RECLAIM_FS-ON-W} state was registered at:
  [<c104ee32>] mark_held_locks+0x62/0x90
  [<c104eefa>] lockdep_trace_alloc+0x9a/0xc0
  [<c108f7b6>] kmem_cache_alloc+0x26/0xf0
  [<c108621c>] __get_vm_area_node+0x6c/0xf0
  [<c108690e>] __vmalloc_node+0x7e/0xa0
  [<c1086aab>] vmalloc+0x2b/0x30
  [<c110e1fb>] journal_init+0x6cb/0xa10
  [<c10f90a2>] reiserfs_fill_super+0x342/0xb80
  [<c1095665>] get_sb_bdev+0x145/0x180
  [<c10f68e1>] get_super_block+0x21/0x30
  [<c1094520>] vfs_kern_mount+0x40/0xd0
  [<c1094609>] do_kern_mount+0x39/0xd0
  [<c10aaa97>] do_mount+0x2c7/0x6d0
  [<c10aaf06>] sys_mount+0x66/0xa0
  [<c16198a7>] mount_block_root+0xc4/0x245
  [<c1619a81>] mount_root+0x59/0x5f
  [<c1619b98>] prepare_namespace+0x111/0x14b
  [<c1619269>] kernel_init+0xcf/0xdb
  [<c100303a>] kernel_thread_helper+0x6/0x1c
irq event stamp: 63236801
hardirqs last  enabled at (63236801): [<c134e7fa>]
__mutex_unlock_slowpath+0x9a/0x120
hardirqs last disabled at (63236800): [<c134e799>]
__mutex_unlock_slowpath+0x39/0x120
softirqs last  enabled at (63218800): [<c102f451>] __do_softirq+0xc1/0x110
softirqs last disabled at (63218789): [<c102f4ed>] do_softirq+0x4d/0x60

other info that might help us debug this:
2 locks held by kswapd0/313:
 #0:  (shrinker_rwsem){++++..}, at: [<c1074bb4>] shrink_slab+0x24/0x170
 #1:  (&type->s_umount_key#19){++++..}, at: [<c10a2edd>]
shrink_dcache_memory+0xfd/0x1a0

stack backtrace:
Pid: 313, comm: kswapd0 Not tainted 2.6.33-rc5 #1
Call Trace:
 [<c134db2c>] ? printk+0x18/0x1c
 [<c104e7ef>] print_usage_bug+0x15f/0x1a0
 [<c104ebcf>] mark_lock+0x39f/0x5a0
 [<c104d66b>] ? trace_hardirqs_off+0xb/0x10
 [<c1052c50>] ? check_usage_forwards+0x0/0xf0
 [<c1050c24>] __lock_acquire+0x214/0xa70
 [<c10438c5>] ? sched_clock_cpu+0x95/0x110
 [<c10514fa>] lock_acquire+0x7a/0xa0
 [<c11118c8>] ? reiserfs_write_lock_once+0x28/0x50
 [<c134f03f>] mutex_lock_nested+0x5f/0x2b0
 [<c11118c8>] ? reiserfs_write_lock_once+0x28/0x50
 [<c11118c8>] ? reiserfs_write_lock_once+0x28/0x50
 [<c11118c8>] reiserfs_write_lock_once+0x28/0x50
 [<c10f05b0>] reiserfs_delete_inode+0x50/0x140
 [<c10a653f>] ? generic_delete_inode+0x5f/0x150
 [<c10f0560>] ? reiserfs_delete_inode+0x0/0x140
 [<c10a657c>] generic_delete_inode+0x9c/0x150
 [<c10a666d>] generic_drop_inode+0x3d/0x60
 [<c10a5597>] iput+0x47/0x50
 [<c10a2a4f>] dentry_iput+0x6f/0xf0
 [<c10a2af4>] d_kill+0x24/0x50
 [<c10a2d3d>] __shrink_dcache_sb+0x21d/0x2b0
 [<c10a2f0f>] shrink_dcache_memory+0x12f/0x1a0
 [<c1074c9e>] shrink_slab+0x10e/0x170
 [<c1075177>] kswapd+0x477/0x6a0
 [<c1072d10>] ? isolate_pages_global+0x0/0x1b0
 [<c103e160>] ? autoremove_wake_function+0x0/0x40
 [<c1074d00>] ? kswapd+0x0/0x6a0
 [<c103de6c>] kthread+0x6c/0x80
 [<c103de00>] ? kthread+0x0/0x80
 [<c100303a>] kernel_thread_helper+0x6/0x1c

Reported-by: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Chris Mason <chris.mason@oracle.com>
2010-01-28 13:43:50 +01:00
Al Viro
083c73c253 fix oops in fs/9p late mount failure
if 9P ->get_sb() fails late (at root inode or root dentry
allocation), we'll hit its ->kill_sb() with NULL ->s_root

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-26 22:22:27 -05:00
Al Viro
7e32b7bb73 fix leak in romfs_fill_super()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-26 22:22:26 -05:00
Al Viro
ef52c75e4b get rid of pointless checks after simple_pin_fs()
if we'd just got success from it, vfsmount won't be NULL

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-26 22:22:26 -05:00
Al Viro
5998649f77 Fix failure exits in bfs_fill_super()
double iput(), leaks...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-26 22:22:25 -05:00
Al Viro
217686e983 fix affs parse_options()
Error handling in that sucker got broken back in 2003.  If function
returns 0 on failure, it's not nice to add return -EINVAL into it.
Adding return 1 on other failure exits is also not a good thing (and
yes, original success exits with 1 and some of failure exits with 0
are still there; so's the original logics in callers).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-26 22:22:25 -05:00
Al Viro
29333920a5 Fix remount races with symlink handling in affs
A couple of fields in affs_sb_info is used in follow_link() and
symlink() for handling AFFS "absolute" symlinks.  Need locking
against affs_remount() updates.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-26 22:22:24 -05:00
Al Viro
afc70ed05a Fix a leak in affs_fill_super()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-26 22:22:24 -05:00
Greg Kroah-Hartman
b04da8bfdf fnctl: f_modown should call write_lock_irqsave/restore
Commit 7036251180 exposed that f_modown()
should call write_lock_irqsave instead of just write_lock_irq so that
because a caller could have a spinlock held and it would not be good to
renable interrupts.

Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Tavis Ormandy <taviso@google.com>
Cc: stable <stable@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-26 17:25:38 -08:00
Trond Myklebust
a2c0b9e291 NFS: Ensure that we handle NFS4ERR_STALE_STATEID correctly
Even if the server is crazy, we should be able to mark the stateid as being
bad, to ensure it gets recovered.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
2010-01-26 15:42:47 -05:00
Trond Myklebust
03391693a9 NFSv4.1: Don't call nfs4_schedule_state_recovery() unnecessarily
Currently, nfs4_handle_exception() will call it twice if called with an
error of -NFS4ERR_STALE_CLIENTID, -NFS4ERR_STALE_STATEID or
-NFS4ERR_EXPIRED.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
2010-01-26 15:42:38 -05:00
Trond Myklebust
8e469ebd6d NFSv4: Don't allow posix locking against servers that don't support it
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
2010-01-26 15:42:30 -05:00
Trond Myklebust
2bee72a6aa NFSv4: Ensure that the NFSv4 locking can recover from stateid errors
In most cases, we just want to mark the lock_stateid sequence id as being
uninitialised.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
2010-01-26 15:42:21 -05:00
David Howells
b0706ca415 NFS: Avoid warnings when CONFIG_NFS_V4=n
Avoid the following warnings when CONFIG_NFS_V4=n:

	fs/nfs/sysctl.c:19: warning: unused variable `nfs_set_port_max'
	fs/nfs/sysctl.c:18: warning: unused variable `nfs_set_port_min'

by making those variables contingent on NFSv4 being configured.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
2010-01-26 15:42:11 -05:00
H Hartley Sweeten
0aa05887af NFS: Make nfs_commitdata_release static
The symbol nfs_commitdata_release is only used locally
in this file. Make it static to prevent the following sparse warning:

warning: symbol 'nfs_commitdata_release' was not declared. Should it be static?

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
2010-01-26 15:42:03 -05:00
Trond Myklebust
82be934a59 NFS: Try to commit unstable writes in nfs_release_page()
If someone calls nfs_release_page(), we presumably already know that the
page is clean, however it may be holding an unstable write.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
2010-01-26 15:41:53 -05:00
Trond Myklebust
c9edda7140 NFS: Fix a reference leak in nfs_wb_cancel_page()
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
2010-01-26 15:41:34 -05:00
Sunil Mushran
26636bf6b2 ocfs2/dlm: Print more messages during lock migration
When a lock resource is migrated, the dlm compares the migrated
locks with that that was already existing on the new node. If the
comparison fails, it BUGs. This patch prints more messages when the
comparison fails inorder to help with the root cause analyis.

http://oss.oracle.com/bugzilla/show_bug.cgi?id=1206
This does not fix bz1206. However, if we run into it again, we will
have more information to chew on.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-01-25 19:21:09 -08:00
Sunil Mushran
71656fa6ec ocfs2/dlm: Ignore LVBs of locks in the Blocked list
During lock resource migration, o2dlm fills the packet with a LVB from the
first valid lock. For sanity, it ensures that the other valid locks have the
same LVB. If not, it BUGs.

The valid locks are ones that have granted EX or PR lock levels and are either
on the Granted or Converting lists. Locks in the Blocked list cannot have a
valid LVB.

This patch ensures that we skip the locks in the Blocked list.

Fixes oss bugzilla#1202
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1202

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-01-25 19:20:57 -08:00
Sunil Mushran
2bd632165c ocfs2/trivial: Remove trailing whitespaces
Patch removes trailing whitespaces.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-01-25 19:20:51 -08:00
Wengang Wang
e5f2cb2b1a ocfs2: fix a misleading variable name
a local variable "dlm_version" is used as a fs locking version.
rename it fs_version.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-01-25 19:20:48 -08:00
Tao Ma
1097df3ffe ocfs2: Sync max_inline_data_with_xattr from tools.
In ocfs2-tools, we have added ocfs2_max_inline_data_with_xattr,
so add it in the kernel's ocfs2_fs.h.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-01-25 19:20:45 -08:00
Linus Torvalds
9a3cbe3265 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: Drop EXT4_GET_BLOCKS_UPDATE_RESERVE_SPACE flag
  ext4: Fix quota accounting error with fallocate
  ext4: Handle -EDQUOT error on write
2010-01-25 19:05:06 -08:00
Davide Libenzi
cb289d6244 eventfd - allow atomic read and waitqueue remove
KVM needs a wait to atomically remove themselves from the eventfd ->poll()
wait queue head, in order to handle correctly their IRQfd deassign
operation.

This patch introduces such API, plus a way to read an eventfd from its
context.

Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-01-25 12:26:38 -02:00
Linus Torvalds
bdeef61cd0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6:
  tty: fix race in tty_fasync
  serial: serial_cs: oxsemi quirk breaks resume
  serial: imx: bit &/| confusion
  serial: Fix crash if the minimum rate of the device is > 9600 baud
  serial-core: resume serial hardware with no_console_suspend
  serial: 8250_pnp: use wildcard for serial Wacom tablets
  nozomi: quick fix for the close/close bug
  compat_ioctl: Supress "unknown cmd" message on serial /dev/console
2010-01-21 07:37:20 -08:00
Linus Torvalds
456eac9478 Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  fs/bio.c: fix shadows sparse warning
  drbd: The kernel code is now equivalent to out of tree release 8.3.7
  drbd: Allow online resizing of DRBD devices while peer not reachable (needs to be explicitly forced)
  drbd: Don't go into StandAlone mode when authentification failes because of network error
  drivers/block/drbd/drbd_receiver.c: correct NULL test
  cfq-iosched: Respect ioprio_class when preempting
  genhd: overlapping variable definition
  block: removed unused as_io_context
  DM: Fix device mapper topology stacking
  block: bdev_stack_limits wrapper
  block: Fix discard alignment calculation and printing
  block: Correct handling of bottom device misaligment
  drbd: check on CONFIG_LBDAF, not LBD
  drivers/block/drbd: Correct NULL test
  drbd: Silenced an assert that could triggered after changing write ordering method
  drbd: Kconfig fix
  drbd: Fix for a race between IO and a detach operation [Bugz 262]
  drbd: Use drbd_crypto_is_hash() instead of an open coded check
2010-01-21 07:32:11 -08:00
Linus Torvalds
15e551e52b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6:
  ecryptfs: use after free
  ecryptfs: Eliminate useless code
  ecryptfs: fix interpose/interpolate typos in comments
  ecryptfs: pass matching flags to interpose as defined and used there
  ecryptfs: remove unnecessary d_drop calls in ecryptfs_link
  ecryptfs: don't ignore return value from lock_rename
  ecryptfs: initialize private persistent file before dereferencing pointer
  eCryptfs: Remove mmap from directory operations
  eCryptfs: Add getattr function
  eCryptfs: Use notify_change for truncating lower inodes
2010-01-21 07:28:54 -08:00
Linus Torvalds
30a0f5e1fb Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: fix possible panic on unmount
  Btrfs: deal with NULL acl sent to btrfs_set_acl
  Btrfs: fix regression in orphan cleanup
  Btrfs: Fix race in btrfs_mark_extent_written
  Btrfs, fix memory leaks in error paths
  Btrfs: align offsets for btrfs_ordered_update_i_size
  btrfs: fix missing last-entry in readdir(3)
2010-01-21 07:28:05 -08:00
Atsushi Nemoto
3f00171125 compat_ioctl: Supress "unknown cmd" message on serial /dev/console
After the commit fb07a5f8 ("compat_ioctl: remove all VT ioctl
handling"), I got this error message on 64-bit mips kernel with 32-bit
busybox userland:

ioctl32(init:1): Unknown cmd fd(0) cmd(00005600){t:'V';sz:0} arg(7fd76480) on /dev/console

The cmd 5600 is VT_OPENQRY.  The busybox's init issues this ioctl to
know vt-console or serial-console.  If the console was serial console,
VT ioctls are not handled by the serial driver.

And by quick search, I found some programs using VT_GETMODE to check
vt-console is available or not.

Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-01-20 15:03:26 -08:00
Dan Carpenter
ece550f51b ecryptfs: use after free
The "full_alg_name" variable is used on a couple error paths, so we
shouldn't free it until the end.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2010-01-19 22:36:06 -06:00
Julia Lawall
4aa25bcb7d ecryptfs: Eliminate useless code
The variable lower_dentry is initialized twice to the same (side effect-free)
expression.  Drop one initialization.

A simplified version of the semantic match that finds this problem is:
(http://coccinelle.lip6.fr/)

// <smpl>
@forall@
idexpression *x;
identifier f!=ERR_PTR;
@@

x = f(...)
... when != x
(
x = f(...,<+...x...+>,...)
|
* x = f(...)
)
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2010-01-19 22:36:05 -06:00
Erez Zadok
fe0fc013cd ecryptfs: fix interpose/interpolate typos in comments
Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
Acked-by: Dustin Kirkland <kirkland@canonical.com>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2010-01-19 22:36:03 -06:00
Erez Zadok
3469b57329 ecryptfs: pass matching flags to interpose as defined and used there
ecryptfs_interpose checks if one of the flags passed is
ECRYPTFS_INTERPOSE_FLAG_D_ADD, defined as 0x00000001 in ecryptfs_kernel.h.
But the only user of ecryptfs_interpose to pass a non-zero flag to it, has
hard-coded the value as "1". This could spell trouble if any of these values
changes in the future.

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
Cc: Dustin Kirkland <kirkland@canonical.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2010-01-19 22:36:02 -06:00
Erez Zadok
c44a66d674 ecryptfs: remove unnecessary d_drop calls in ecryptfs_link
Unnecessary because it would unhash perfectly valid dentries, causing them
to have to be re-looked up the next time they're needed, which presumably is
right after.

Signed-off-by: Aseem Rastogi <arastogi@cs.sunysb.edu>
Signed-off-by: Shrikar archak <shrikar84@gmail.com>
Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
Cc: Saumitra Bhanage <sbhanage@cs.sunysb.edu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2010-01-19 22:36:00 -06:00
Erez Zadok
0d132f7364 ecryptfs: don't ignore return value from lock_rename
Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
Cc: Dustin Kirkland <kirkland@canonical.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2010-01-19 22:35:59 -06:00
Erez Zadok
e27759d7a3 ecryptfs: initialize private persistent file before dereferencing pointer
Ecryptfs_open dereferences a pointer to the private lower file (the one
stored in the ecryptfs inode), without checking if the pointer is NULL.
Right afterward, it initializes that pointer if it is NULL.  Swap order of
statements to first initialize.  Bug discovered by Duckjin Kang.

Signed-off-by: Duckjin Kang <fromdj2k@gmail.com>
Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
Cc: Dustin Kirkland <kirkland@canonical.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@kernel.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2010-01-19 22:32:54 -06:00
Tyler Hicks
38e3eaeedc eCryptfs: Remove mmap from directory operations
Adrian reported that mkfontscale didn't work inside of eCryptfs mounts.
Strace revealed the following:

open("./", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY|O_CLOEXEC) = 3
fcntl64(3, F_GETFD) = 0x1 (flags FD_CLOEXEC)
open("./fonts.scale", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4
getdents(3, /* 80 entries */, 32768) = 2304
open("./.", O_RDONLY) = 5
fcntl64(5, F_SETFD, FD_CLOEXEC) = 0
fstat64(5, {st_mode=S_IFDIR|0755, st_size=16384, ...}) = 0
mmap2(NULL, 16384, PROT_READ, MAP_PRIVATE, 5, 0) = 0xb7fcf000
close(5) = 0
--- SIGBUS (Bus error) @ 0 (0) ---
+++ killed by SIGBUS +++

The mmap2() on a directory was successful, resulting in a SIGBUS
signal later.  This patch removes mmap() from the list of possible
ecryptfs_dir_fops so that mmap() isn't possible on eCryptfs directory
files.

https://bugs.launchpad.net/ecryptfs/+bug/400443

Reported-by: Adrian C. <anrxc@sysphere.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2010-01-19 22:32:11 -06:00
Tyler Hicks
f8f484d1b6 eCryptfs: Add getattr function
The i_blocks field of an eCryptfs inode cannot be trusted, but
generic_fillattr() uses it to instantiate the blocks field of a stat()
syscall when a filesystem doesn't implement its own getattr().  Users
have noticed that the output of du is incorrect on newly created files.

This patch creates ecryptfs_getattr() which calls into the lower
filesystem's getattr() so that eCryptfs can use its kstat.blocks value
after calling generic_fillattr().  It is important to note that the
block count includes the eCryptfs metadata stored in the beginning of
the lower file plus any padding used to fill an extent before
encryption.

https://bugs.launchpad.net/ecryptfs/+bug/390833

Reported-by: Dominic Sacré <dominic.sacre@gmx.de>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2010-01-19 22:32:09 -06:00
Tyler Hicks
5f3ef64f4d eCryptfs: Use notify_change for truncating lower inodes
When truncating inodes in the lower filesystem, eCryptfs directly
invoked vmtruncate(). As Christoph Hellwig pointed out, vmtruncate() is
a filesystem helper function, but filesystems may need to do more than
just a call to vmtruncate().

This patch moves the lower inode truncation out of ecryptfs_truncate()
and renames the function to truncate_upper().  truncate_upper() updates
an iattr for the lower inode to indicate if the lower inode needs to be
truncated upon return.  ecryptfs_setattr() then calls notify_change(),
using the updated iattr for the lower inode, to complete the truncation.

For eCryptfs functions needing to truncate, ecryptfs_truncate() is
reintroduced as a simple way to truncate the upper inode to a specified
size and then truncate the lower inode accordingly.

https://bugs.launchpad.net/bugs/451368

Reported-by: Christoph Hellwig <hch@lst.de>
Acked-by: Dustin Kirkland <kirkland@canonical.com>
Cc: ecryptfs-devel@lists.launchpad.net
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2010-01-19 22:32:07 -06:00
Thiago Farina
f06f135d86 fs/bio.c: fix shadows sparse warning
fs/bio.c:81:33: warning: symbol 'bslab' shadows an earlier one
fs/bio.c:74:25: originally declared here

Signed-off-by: Thiago Farina <tfransosi@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-01-19 14:07:09 +01:00
Linus Torvalds
1e868d8e6d Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: xfs_swap_extents needs to handle dynamic fork offsets
  xfs: fix missing error check in xfs_rtfree_range
  xfs: fix stale inode flush avoidance
  xfs: Remove inode iolock held check during allocation
  xfs: reclaim all inodes by background tree walks
  xfs: Avoid inodes in reclaim when flushing from inode cache
  xfs: reclaim inodes under a write lock
2010-01-18 14:08:07 -08:00
Josef Bacik
11dfe35a01 Btrfs: fix possible panic on unmount
We can race with the unmount of an fs and the stopping of a kthread where we
will free the block group before we're done using it.  The reason for this is
because we do not hold a reference on the block group while its caching, since
the allocator drops its reference once it exits or moves on to the next block
group.  This patch fixes the problem by taking a reference to the block group
before we start caching and dropping it when we're done to make sure all
accesses to the block group are safe.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-01-17 20:40:30 -05:00
Chris Mason
a9cc71a60c Btrfs: deal with NULL acl sent to btrfs_set_acl
It is legal for btrfs_set_acl to be sent a NULL acl.  This
makes sure we don't dereference it.  A similar patch was sent by
Johannes Hirte <johannes.hirte@fem.tu-ilmenau.de>

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-01-17 20:40:22 -05:00
Josef Bacik
6c090a11e1 Btrfs: fix regression in orphan cleanup
Currently orphan cleanup only ever gets triggered if we cross subvolumes during
a lookup, which means that if we just mount a plain jane fs that has orphans in
it, they will never get cleaned up.  This results in panic's like these

http://www.kerneloops.org/oops.php?number=1109085

where adding an orphan entry results in -EEXIST being returned and we panic.  In
order to fix this, we check to see on lookup if our root has had the orphan
cleanup done, and if not go ahead and do it.  This is easily reproduceable by
running this testcase

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <stdio.h>

int main(int argc, char **argv)
{
	char data[4096];
	char newdata[4096];
	int fd1, fd2;

	memset(data, 'a', 4096);
	memset(newdata, 'b', 4096);

	while (1) {
		int i;

		fd1 = creat("file1", 0666);
		if (fd1 < 0)
			break;

		for (i = 0; i < 512; i++)
			write(fd1, data, 4096);

		fsync(fd1);
		close(fd1);

		fd2 = creat("file2", 0666);
		if (fd2 < 0)
			break;

		ftruncate(fd2, 4096 * 512);

		for (i = 0; i < 512; i++)
			write(fd2, newdata, 4096);
		close(fd2);

		i = rename("file2", "file1");
		unlink("file1");
	}

	return 0;
}

and then pulling the power on the box, and then trying to run that test again
when the box comes back up.  I've tested this locally and it fixes the problem.
Thanks to Tomas Carnecky for helping me track this down initially.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-01-17 20:40:21 -05:00
Yan, Zheng
6c7d54ac87 Btrfs: Fix race in btrfs_mark_extent_written
Fix bug reported by Johannes Hirte. The reason of that bug
is btrfs_del_items is called after btrfs_duplicate_item and
btrfs_del_items triggers tree balance. The fix is check that
case and call btrfs_search_slot when needed.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-01-17 20:40:21 -05:00
Jiri Slaby
2423fdfb96 Btrfs, fix memory leaks in error paths
Stanse found 2 memory leaks in relocate_block_group and
__btrfs_map_block. cluster and multi are not freed/assigned on all
paths. Fix that.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-01-17 20:40:20 -05:00
Yan, Zheng
a038fab0cb Btrfs: align offsets for btrfs_ordered_update_i_size
Some callers of btrfs_ordered_update_i_size can now pass in
a NULL for the ordered extent to update against.  This makes
sure we properly align the offset they pass in when deciding
how much to bump the on disk i_size.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-01-17 20:06:27 -05:00
Jan Engelhardt
406266ab9a btrfs: fix missing last-entry in readdir(3)
parent 49313cdac7b34c9f7ecbb1780cfc648b1c082cd7 (v2.6.32-1-g49313cd)
commit ff48c08e1c05c67e8348ab6f8a24de8034e0e34d
Author: Jan Engelhardt <jengelh@medozas.de>
Date:   Wed Dec 9 22:57:36 2009 +0100

Btrfs: fix missing last-entry in readdir(3)

When one does a 32-bit readdir(3), the last entry of a directory is
missing. This is however not due to passing a large value to filldir,
but it seems to have to do with glibc doing telldir or something
quirky. In any case, this patch fixes it in practice.

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-01-17 20:06:27 -05:00
Linus Torvalds
7dc9c484a7 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
  do_add_mount() should sanitize mnt_flags
  CIFS shouldn't make mountpoints shrinkable
  mnt_flags fixes in do_remount()
  attach_recursive_mnt() needs to hold vfsmount_lock over set_mnt_shared()
  may_umount() needs namespace_sem
  Fix configfs leak
  Fix the -ESTALE handling in do_filp_open()
  ecryptfs: Fix refcnt leak on ecryptfs_follow_link() error path
  Fix ACC_MODE() for real
  Unrot uml mconsole a bit
  hppfs: handle ->put_link()
  Kill 9p readlink()
  fix autofs/afs/etc. magic mountpoint breakage
2010-01-17 11:01:16 -08:00
David Howells
7e6608724c nommu: fix shared mmap after truncate shrinkage problems
Fix a problem in NOMMU mmap with ramfs whereby a shared mmap can happen
over the end of a truncation.  The problem is that
ramfs_nommu_check_mappings() checks that the reduced file size against the
VMA tree, but not the vm_region tree.

The following sequence of events can cause the problem:

	fd = open("/tmp/x", O_RDWR|O_TRUNC|O_CREAT, 0600);
	ftruncate(fd, 32 * 1024);
	a = mmap(NULL, 32 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
	b = mmap(NULL, 16 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
	munmap(a, 32 * 1024);
	ftruncate(fd, 16 * 1024);
	c = mmap(NULL, 32 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

Mapping 'a' creates a vm_region covering 32KB of the file.  Mapping 'b'
sees that the vm_region from 'a' is covering the region it wants and so
shares it, pinning it in memory.

Mapping 'a' then goes away and the file is truncated to the end of VMA
'b'.  However, the region allocated by 'a' is still in effect, and has
_not_ been reduced.

Mapping 'c' is then created, and because there's a vm_region covering the
desired region, get_unmapped_area() is _not_ called to repeat the check,
and the mapping is granted, even though the pages from the latter half of
the mapping have been discarded.

However:

	d = mmap(NULL, 16 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

Mapping 'd' should work, and should end up sharing the region allocated by
'a'.

To deal with this, we shrink the vm_region struct during the truncation,
lest do_mmap_pgoff() take it as licence to share the full region
automatically without calling the get_unmapped_area() file op again.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-16 12:15:40 -08:00
David Howells
81759b5b22 nommu: fix race between ramfs truncation and shared mmap
Fix the race between the truncation of a ramfs file and an attempt to make
a shared mmap of region of that file.

The problem is that do_mmap_pgoff() calls f_op->get_unmapped_area() to
verify that the file region is made of contiguous pages and to find its
base address - but there isn't any locking to guarantee this region until
vma_prio_tree_insert() is called by add_vma_to_mm().

Note that moving the functionality into f_op->mmap() doesn't help as that
is also called before vma_prio_tree_insert().

Instead make ramfs_nommu_check_mappings() grab nommu_region_sem whilst it
does its checks.  This means that this function will wait whilst mmaps
take place.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-16 12:15:40 -08:00
Al Viro
27d55f1f4c do_add_mount() should sanitize mnt_flags
MNT_WRITE_HOLD shouldn't leak into new vfsmount and neither
should MNT_SHARED (the latter will be set properly, along with
the rest of shared-subtree data structures)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-16 13:07:36 -05:00
Al Viro
7e1295d9f8 CIFS shouldn't make mountpoints shrinkable
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-16 13:06:32 -05:00
Al Viro
7b43a79f32 mnt_flags fixes in do_remount()
* need vfsmount_lock over modifying it
* need to preserve MNT_SHARED/MNT_UNBINDABLE

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-16 13:01:26 -05:00
Al Viro
df1a1ad297 attach_recursive_mnt() needs to hold vfsmount_lock over set_mnt_shared()
race in mnt_flags update

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-16 12:57:40 -05:00
Al Viro
8ad08d8a0c may_umount() needs namespace_sem
otherwise it races with clone_mnt() changing mnt_share/mnt_slaves

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-16 12:56:08 -05:00
Eric Paris
976ae32be4 inotify: only warn once for inotify problems
inotify will WARN() if it finds that the idr and the fsnotify internals
somehow got out of sync.  It was only supposed to do this once but due
to this stupid bug it would warn every single time a problem was
detected.

Signed-off-by: Eric Paris <eparis@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-15 14:49:23 -08:00
Eric Paris
9e572cc987 inotify: do not reuse watch descriptors
Since commit 7e790dd5fc ("inotify: fix
error paths in inotify_update_watch") inotify changed the manor in which
it gave watch descriptors back to userspace.  Previous to this commit
inotify acted like the following:

  inotify_add_watch(X, Y, Z) = 1
  inotify_rm_watch(X, 1);
  inotify_add_watch(X, Y, Z) = 2

but after this patch inotify would return watch descriptors like so:

  inotify_add_watch(X, Y, Z) = 1
  inotify_rm_watch(X, 1);
  inotify_add_watch(X, Y, Z) = 1

which I saw as equivalent to opening an fd where

  open(file) = 1;
  close(1);
  open(file) = 1;

seemed perfectly reasonable.  The issue is that quite a bit of userspace
apparently relies on the behavior in which watch descriptors will not be
quickly reused.  KDE relies on it, I know some selinux packages rely on
it, and I have heard complaints from other random sources such as debian
bug 558981.

Although the man page implies what we do is ok, we broke userspace so
this patch almost reverts us to the old behavior.  It is still slightly
racey and I have patches that would fix that, but they are rather large
and this will fix it for all real world cases.  The race is as follows:

 - task1 creates a watch and blocks in idr_new_watch() before it updates
   the hint.
 - task2 creates a watch and updates the hint.
 - task1 updates the hint with it's older wd
 - task removes the watch created by task2
 - task adds a new watch and will reuse the wd originally given to task2

it requires moving some locking around the hint (last_wd) but this should
solve it for the real world and be -stable safe.

As a side effect this patch papers over a bug in the lib/idr code which
is causing a large number WARN's to pop on people's system and many
reports in kerneloops.org.  I'm working on the root cause of that idr
bug seperately but this should make inotify immune to that issue.

Signed-off-by: Eric Paris <eparis@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-15 14:49:23 -08:00