kernel-ark

Author	SHA1	Message	Date
Theodore Ts'o	06705bff91	ext4: Regularize mount options Add support for using the mount options "barrier" and "nobarrier", and "auto_da_alloc" and "noauto_da_alloc", which is more consistent than "barrier=<0\|1>" or "auto_da_alloc=<0\|1>". Most other ext3/ext4 mount options use the foo/nofoo naming convention. We allow the old forms of these mount options for backwards compatibility. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-03-28 10:59:57 -04:00
Theodore Ts'o	e7c9e3e99a	ext4: fix locking typo in mballoc which could cause soft lockup hangs Smatch (http://repo.or.cz/w/smatch.git/) complains about the locking in ext4_mb_add_n_trim() from fs/ext4/mballoc.c 4438 list_for_each_entry_rcu(tmp_pa, &lg->lg_prealloc_list[order], 4439 pa_inode_list) { 4440 spin_lock(&tmp_pa->pa_lock); 4441 if (tmp_pa->pa_deleted) { 4442 spin_unlock(&pa->pa_lock); 4443 continue; 4444 } Brown paper bag time... Reported-by: Dan Carpenter <error27@gmail.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2009-03-27 19:43:21 -04:00
Dan Carpenter	a7b19448dd	ext4: fix typo which causes a memory leak on error path This was found by smatch (http://repo.or.cz/w/smatch.git/) Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2009-03-27 19:42:54 -04:00
Linus Torvalds	3ae5080f4c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (37 commits) fs: avoid I_NEW inodes Merge code for single and multiple-instance mounts Remove get_init_pts_sb() Move common mknod_ptmx() calls into caller Parse mount options just once and copy them to super block Unroll essentials of do_remount_sb() into devpts vfs: simple_set_mnt() should return void fs: move bdev code out of buffer.c constify dentry_operations: rest constify dentry_operations: configfs constify dentry_operations: sysfs constify dentry_operations: JFS constify dentry_operations: OCFS2 constify dentry_operations: GFS2 constify dentry_operations: FAT constify dentry_operations: FUSE constify dentry_operations: procfs constify dentry_operations: ecryptfs constify dentry_operations: CIFS constify dentry_operations: AFS ...	2009-03-27 16:23:12 -07:00
Linus Torvalds	2c9e15a011	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-quota-2.6 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-quota-2.6: (27 commits) ext2: Zero our b_size in ext2_quota_read() trivial: fix typos/grammar errors in fs/Kconfig quota: Coding style fixes quota: Remove superfluous inlines quota: Remove uppercase aliases for quota functions. nfsd: Use lowercase names of quota functions jfs: Use lowercase names of quota functions udf: Use lowercase names of quota functions ufs: Use lowercase names of quota functions reiserfs: Use lowercase names of quota functions ext4: Use lowercase names of quota functions ext3: Use lowercase names of quota functions ext2: Use lowercase names of quota functions ramfs: Remove quota call vfs: Use lowercase names of quota functions quota: Remove dqbuf_t and other cleanups quota: Remove NODQUOT macro quota: Make global quota locks cacheline aligned quota: Move quota files into separate directory ext4: quota reservation for delayed allocation ...	2009-03-27 14:48:34 -07:00
Linus Torvalds	805de022b1	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm: dlm: fix length calculation in compat code dlm: ignore cancel on granted lock dlm: clear defunct cancel state dlm: replace idr with hash table for connections dlm: comment typo fixes dlm: use ipv6_addr_copy dlm: Change rwlock which is only used in write mode to a spinlock	2009-03-27 14:48:07 -07:00
Jan Kara	86db97c87f	jbd2: Update locking coments Update information about locking in JBD2 revoke code. Inconsistency in comments found by Lin Tan <tammy000@gmail.com>. CC: Lin Tan <tammy000@gmail.com>. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-03-27 17:20:40 -04:00
Aneesh Kumar K.V	cc0fb9ad7d	ext4: Rename pa_linear to pa_type Impact: code cleanup This patch rename pa_linear to pa_type and add MB_INODE_PA and MB_GROUP_PA to indicate inode and group prealloc space. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-03-27 17:16:58 -04:00
Thiemo Nagel	fe2c8191fa	ext4: add checks of block references for non-extent inodes Check block references in the inode and indorect blocks for non-extent inodes to make sure they are valid, and flag an error if they are invalid. Signed-off-by: Thiemo Nagel <thiemo.nagel@ph.tum.de> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-03-31 08:36:10 -04:00
Nick Piggin	aabb8fdb41	fs: avoid I_NEW inodes To be on the safe side, it should be less fragile to exclude I_NEW inodes from inode list scans by default (unless there is an important reason to have them). Normally they will get excluded (eg. by zero refcount or writecount etc), however it is a bit fragile for list walkers to know exactly what parts of the inode state is set up and valid to test when in I_NEW. So along these lines, move I_NEW checks upward as well (sometimes taking I_FREEING etc checks with them too -- this shouldn't be a problem should it?) Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:05 -04:00
Sukadev Bhattiprolu	1bd7903560	Merge code for single and multiple-instance mounts new_pts_mount() (including the get_sb_nodev()), shares a lot of code with init_pts_mount(). The only difference between them is the 'test-super' function passed into sget(). Move all common code into devpts_get_sb() and remove the new_pts_mount() and init_pts_mount() functions, Changelog[v3]: [Serge Hallyn]: Remove unnecessary printk()s Changelog[v2]: (Christoph Hellwig): Merge code in 'do_pts_mount()' into devpts_get_sb() Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Tested-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:04 -04:00
Sukadev Bhattiprolu	289f00e225	Remove get_init_pts_sb() With mknod_ptmx() moved to devpts_get_sb(), init_pts_mount() becomes a wrapper around get_init_pts_sb(). Remove get_init_pts_sb() and fold code into init_pts_mount(). Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:04 -04:00
Sukadev Bhattiprolu	945cf2c79f	Move common mknod_ptmx() calls into caller We create 'ptmx' node in both single-instance and multiple-instance mounts. So devpts_get_sb() can call mknod_ptmx() once rather than have both modes calling mknod_ptmx() separately. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:04 -04:00
Sukadev Bhattiprolu	482984f06d	Parse mount options just once and copy them to super block Since all the mount option parsing is done in devpts, we could do it just once and pass it around in devpts functions and eventually store it in the super block. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:04 -04:00
Sukadev Bhattiprolu	fdbf534866	Unroll essentials of do_remount_sb() into devpts On remount, devpts fs only needs to parse the mount options. Users cannot directly create/dirty files in /dev/pts so the MS_RDONLY flag and shrinking the dcache does not really apply to devpts. So effectively on remount, devpts only parses the mount options and updates these options in its super block. As such, we could replace do_remount_sb() call with a direct parse_mount_options(). Doing so enables subsequent patches to avoid parsing the mount options twice and simplify the code. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:03 -04:00
Sukadev Bhattiprolu	a3ec947c85	vfs: simple_set_mnt() should return void simple_set_mnt() is defined as returning 'int' but always returns 0. Callers assume simple_set_mnt() never fails and don't properly cleanup if it were to _ever_ fail. For instance, get_sb_single() and get_sb_nodev() should: up_write(sb->s_unmount); deactivate_super(sb); if simple_set_mnt() fails. Since simple_set_mnt() never fails, would be cleaner if it did not return anything. [akpm@linux-foundation.org: fix build] Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:03 -04:00
Nick Piggin	585d3bc06f	fs: move bdev code out of buffer.c Move some block device related code out from buffer.c and put it in block_dev.c. I'm trying to move non-buffer_head code out of buffer.c Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:03 -04:00
Al Viro	3ba13d179e	constify dentry_operations: rest Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:03 -04:00
Al Viro	296c2d8663	constify dentry_operations: configfs Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:03 -04:00
Al Viro	ee1ec32903	constify dentry_operations: sysfs Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:02 -04:00
Al Viro	ad28b4ef19	constify dentry_operations: JFS Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:02 -04:00
Al Viro	d8fba0ffe5	constify dentry_operations: OCFS2 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:02 -04:00
Al Viro	92cecbbfa3	constify dentry_operations: GFS2 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:02 -04:00
Al Viro	ce6cdc474a	constify dentry_operations: FAT Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:01 -04:00
Al Viro	4269590a72	constify dentry_operations: FUSE Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:01 -04:00
Al Viro	d72f71eb0e	constify dentry_operations: procfs Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:01 -04:00
Al Viro	5a3fd05a9b	constify dentry_operations: ecryptfs Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:01 -04:00
Al Viro	4fd03e84d8	constify dentry_operations: CIFS Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:01 -04:00
Al Viro	79be57cc7f	constify dentry_operations: AFS Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:00 -04:00
Al Viro	08f11513fa	constify dentry_operations: autofs, autofs4 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:00 -04:00
Al Viro	a488257ce5	constify dentry_operations: 9p Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:00 -04:00
Al Viro	e16404ed0f	constify dentry_operations: misc filesystems Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:00 -04:00
Al Viro	f786aa90e0	constify dentry_operations: NFS Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:43:59 -04:00
Sukadev Bhattiprolu	a9f184f02a	devpts: Must release s_umount on error We should drop the ->s_umount mutex if an error occurs after the sget()/grab_super() call. This was introduced when adding support for multiple instances of devpts and noticed during a code review/reorg. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:43:59 -04:00
Cheng Renquan	10f303ae1e	do_pipe cleanup: drop its last user in arch/alpha/ The last user of do_pipe is in arch/alpha/, after replacing it with do_pipe_flags, the do_pipe can be totally dropped. Signed-off-by: Cheng Renquan <crquan@gmail.com> Acked-by: Richard Henderson <rth@twiddle.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:43:58 -04:00
Duane Griffin	723be1f300	ufs: copy symlink data into the correct union member Copy symlink data into the union member it is accessed through. Although this shouldn't make a difference to behaviour it makes the code easier to follow and grep through. It may also prevent problems if the struct/union definitions change in the future. Signed-off-by: Duane Griffin <duaneg@dghda.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:43:58 -04:00
Duane Griffin	b12903f138	ufs: ensure fast symlinks are NUL-terminated Ensure fast symlink targets are NUL-terminated, even if corrupted on-disk. Signed-off-by: Duane Griffin <duaneg@dghda.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:43:58 -04:00
Duane Griffin	f33219b7a9	ufs: don't truncate longer ufs2 fast symlinks ufs2 fast symlinks can be twice as long as ufs ones, however the code was using the ufs size in various places. Fix that so ufs2 symlinks over 60 characters aren't truncated. Note that we copy the entire area instead of using the maxsymlinklen field from the superblock. This way we will be more robust against corruption (of the superblock). While we are at it, use memcpy instead of open-coding it with for loops. Signed-off-by: Duane Griffin <duaneg@dghda.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:43:58 -04:00
Duane Griffin	9e6766cc8c	ufs: validate maximum fast symlink size from superblock The maximum fast symlink size is set in the superblock of certain types of UFS filesystem. Before using it we need to check that it isn't longer than the available space we have in the inode. Signed-off-by: Duane Griffin <duaneg@dghda.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:43:57 -04:00
Christoph Hellwig	c8fe8f30c7	cleanup may_open Add a switch for the various i_mode fmt cases, and remove the comment about writeability of devices nodes - that part is handled in inode_permission and comment on (briefly) there. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:43:57 -04:00
Christoph Hellwig	b6520c8193	cleanup d_add_ci Make sure that comments describe what's going on and not how, and always use __d_instantiate instead of two separate branches, one with d_instantiate and one with __d_instantiate. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:43:57 -04:00
Christoph Hellwig	2b1c6bd77d	generic compat_sys_ustat Due to a different size of ino_t ustat needs a compat handler, but currently only x86 and mips provide one. Add a generic compat_sys_ustat and switch all architectures over to it. Instead of doing various user copy hacks compat_sys_ustat just reimplements sys_ustat as it's trivial. This was suggested by Arnd Bergmann. Found by Eric Sandeen when running xfstests/017 on ppc64, which causes stack smashing warnings on RHEL/Fedora due to the too large amount of data writen by the syscall. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:43:57 -04:00
Christoph Hellwig	ec1ab0abde	affs: fix missing unlocks in affs_remove_link In two error cases affs_remove_link doesn't call affs_unlock_dir to release the i_hash_lock semaphore. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:43:56 -04:00
Linus Torvalds	8e9d208972	Merge branch 'bkl-removal' of git://git.lwn.net/linux-2.6 * 'bkl-removal' of git://git.lwn.net/linux-2.6: Rationalize fasync return values Move FASYNC bit handling to f_op->fasync() Use f_lock to protect f_flags Rename struct file->f_ep_lock	2009-03-26 16:14:02 -07:00
Linus Torvalds	21cdbc1378	Merge branch 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6 * 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6: (81 commits) [S390] remove duplicated #includes [S390] cpumask: use mm_cpumask() wrapper [S390] cpumask: Use accessors code. [S390] cpumask: prepare for iterators to only go to nr_cpu_ids/nr_cpumask_bits. [S390] cpumask: remove cpu_coregroup_map [S390] fix clock comparator save area usage [S390] Add hwcap flag for the etf3 enhancement facility [S390] Ensure that ipl panic notifier is called late. [S390] fix dfp elf hwcap/facility bit detection [S390] smp: perform initial cpu reset before starting a cpu [S390] smp: fix memory leak on __cpu_up [S390] ipl: Improve checking logic and remove switch defaults. [S390] s390dbf: Remove needless check for NULL pointer. [S390] s390dbf: Remove redundant initilizations. [S390] use kzfree() [S390] BUG to BUG_ON changes [S390] zfcpdump: Prevent zcore from beeing built as a kernel module. [S390] Use csum_partial in checksum.h [S390] cleanup lowcore.h [S390] eliminate ipl_device from lowcore ...	2009-03-26 16:04:22 -07:00
Linus Torvalds	86d9c07017	Merge branch 'for-2.6.30' of git://git.kernel.dk/linux-2.6-block * 'for-2.6.30' of git://git.kernel.dk/linux-2.6-block: Get rid of pdflush_operation() in emergency sync and remount btrfs: get rid of current_is_pdflush() in btrfs_btree_balance_dirty Move the default_backing_dev_info out of readahead.c and into backing-dev.c block: Repeated lines in switching-sched.txt bsg: Remove bogus check against request_queue->max_sectors block: WARN in __blk_put_request() for potential bio leak loop: fix circular locking in loop_clr_fd() loop: support barrier writes bsg: add support for tail queuing cpqarray: enable bus mastering block: genhd.h cleanup patch block: add private bio_set for bio integrity allocations block: genhd.h comment needs updating block: get rid of unused blkdev_free_rq() define block: remove various blk_queue_*() setting functions in blk_init_queue_node() cciss: add BUILD_BUG_ON() for catching bad CommandList_struct alignment block: don't create bio_vec slabs of less than the inline number block: cleanup bio_alloc_bioset()	2009-03-26 16:03:04 -07:00
Linus Torvalds	13220a94d3	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1750 commits) ixgbe: Allow Priority Flow Control settings to survive a device reset net: core: remove unneeded include in net/core/utils.c. e1000e: update version number e1000e: fix close interrupt race e1000e: fix loss of multicast packets e1000e: commonize tx cleanup routine to match e1000 & igb netfilter: fix nf_logger name in ebt_ulog. netfilter: fix warning in ebt_ulog init function. netfilter: fix warning about invalid const usage e1000: fix close race with interrupt e1000: cleanup clean_tx_irq routine so that it completely cleans ring e1000: fix tx hang detect logic and address dma mapping issues bridge: bad error handling when adding invalid ether address bonding: select current active slave when enslaving device for mode tlb and alb gianfar: reallocate skb when headroom is not enough for fcb Bump release date to 25Mar2009 and version to 0.22 r6040: Fix second PHY address qeth: fix wait_event_timeout handling qeth: check for completion of a running recovery qeth: unregister MAC addresses during recovery. ... Manually fixed up conflicts in: drivers/infiniband/hw/cxgb3/cxio_hal.h drivers/infiniband/hw/nes/nes_nic.c	2009-03-26 15:54:36 -07:00
Linus Torvalds	39f15003c7	Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: [CIFS] Fix memory overwrite when saving nativeFileSystem field during mount [CIFS] Rename compose_mount_options to cifs_compose_mount_options. [CIFS] work around bug in Samba server handling for posix open [CIFS] Use posix open on file open when server supports it cifs: fix buffer format byte on NT Rename/hardlink [CIFS] Add definitions for remoteably fsctl calls [CIFS] add extra null attr check [CIFS] fix build error [CIFS] reopen file via newer posix open protocol operation if available [CIFS] Add new nostrictsync cifs mount option to avoid slow SMB flush [CIFS] DFS no longer experimental [CIFS] Send SMB flush in cifs_fsync	2009-03-26 15:46:37 -07:00
Jan Kara	9e80d40773	ext3: Avoid starting a transaction in writepage when not necessary We don't have to start a transaction in writepage() when all the blocks are a properly allocated. Even in ordered mode either the data has been written via write() and they are thus already added to transaction's list or the data was written via mmap and then it's random in which transaction they get written anyway. This should help VM to pageout dirty memory without blocking on transaction commits. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-26 15:44:59 -07:00
David S. Miller	08abe18af1	Merge branch 'master' of /home/davem/src/GIT/linux-2.6/ Conflicts: drivers/net/wimax/i2400m/usb-notif.c	2009-03-26 15:23:24 -07:00
Linus Torvalds	0c93ea4064	Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (61 commits) Dynamic debug: fix pr_fmt() build error Dynamic debug: allow simple quoting of words dynamic debug: update docs dynamic debug: combine dprintk and dynamic printk sysfs: fix some bin_vm_ops errors kobject: don't block for each kobject_uevent sysfs: only allow one scheduled removal callback per kobj Driver core: Fix device_move() vs. dpm list ordering, v2 Driver core: some cleanup on drivers/base/sys.c Driver core: implement uevent suppress in kobject vcs: hook sysfs devices into object lifetime instead of "binding" driver core: fix passing platform_data driver core: move platform_data into platform_device sysfs: don't block indefinitely for unmapped files. driver core: move knode_bus into private structure driver core: move knode_driver into private structure driver core: move klist_children into private structure driver core: create a private portion of struct device driver core: remove polling for driver_probe_done(v5) sysfs: reference sysfs_dirent from sysfs inodes ... Fixed conflicts in drivers/sh/maple/maple.c manually	2009-03-26 11:17:04 -07:00
Linus Torvalds	8ff64b539b	Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw: GFS2: Fix freeze issue Fix a minor bug in the previous patch GFS2: Clean up of glops.c GFS2: Fix locking bug in failed shared to exclusive conversion GFS2: Pagecache usage optimization on GFS2 GFS2: fix sparse warning: Should it be static? GFS2: fix sparse warnings: constant is so big it is ... GFS2: Support quota/noquota mount arguments GFS2: Fix alignment issue and tidy gfs2_bitfit GFS2: Add a "demote a glock" interface to sysfs GFS2: Expose UUID via sysfs/uevent GFS2: Support generation of discard requests GFS2: Fix deadlock on journal flush GFS2: Fix error path ref counting for root inode GFS2: Remove unused field from glock GFS2: Merge lock_dlm module into GFS2 GFS2: Remove "double" locking in quota GFS2: change gfs2_quota_scan into a shrinker GFS2: Bring back lvb-related stuff to lock_nolock to support quotas GFS2: Fix remount argument parsing	2009-03-26 11:08:47 -07:00
Linus Torvalds	8d80ce80e1	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (71 commits) SELinux: inode_doinit_with_dentry drop no dentry printk SELinux: new permission between tty audit and audit socket SELinux: open perm for sock files smack: fixes for unlabeled host support keys: make procfiles per-user-namespace keys: skip keys from another user namespace keys: consider user namespace in key_permission keys: distinguish per-uid keys in different namespaces integrity: ima iint radix_tree_lookup locking fix TOMOYO: Do not call tomoyo_realpath_init unless registered. integrity: ima scatterlist bug fix smack: fix lots of kernel-doc notation TOMOYO: Don't create securityfs entries unless registered. TOMOYO: Fix exception policy read failure. SELinux: convert the avc cache hash list to an hlist SELinux: code readability with avc_cache SELinux: remove unused av.decided field SELinux: more careful use of avd in avc_has_perm_noaudit SELinux: remove the unused ae.used SELinux: check seqno when updating an avc_node ...	2009-03-26 11:03:39 -07:00
Matthew Garrett	0a1c01c947	Make relatime default Change the default behaviour of the kernel to use relatime for all filesystems. This can be overridden with the "strictatime" mount option. Signed-off-by: Matthew Garrett <mjg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-26 11:01:10 -07:00
Matthew Garrett	d0adde574b	Add a strictatime mount option Add support for explicitly requesting full atime updates. This makes it possible for kernels to default to relatime but still allow userspace to override it. Signed-off-by: Matthew Garrett <mjg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-26 10:56:35 -07:00
Matthew Garrett	11ff6f05f1	Allow relatime to update atime once a day Allow atime to be updated once per day even with relatime. This lets utilities like tmpreaper (which delete files based on last access time) continue working, making relatime a plausible default for distributions. Signed-off-by: Matthew Garrett <mjg@redhat.com> Reviewed-by: Matthew Wilcox <willy@linux.intel.com> Acked-by: Valerie Aurora Henson <vaurora@redhat.com> Acked-by: Alan Cox <alan@redhat.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-26 10:48:13 -07:00
Stefan Weinhuber	b44b0ab3ba	[S390] dasd: add large volume support The dasd device driver will now support ECKD devices with more then 65520 cylinders. In the traditional ECKD adressing scheme each track is addressed by a 16-bit cylinder and 16-bit head number. The new addressing scheme makes use of the fact that the actual number of heads is never larger then 15, so 12 bits of the head number can be redefined to be part of the cylinder address. Signed-off-by: Stefan Weinhuber <wein@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>	2009-03-26 15:24:05 +01:00
Jens Axboe	a2a9537ac0	Get rid of pdflush_operation() in emergency sync and remount Opencode a cheasy approach with kevent. The idea here is that we'll add some generic delayed work infrastructure, which probably wont be based on pdflush (or maybe it will, in which case we can just add it back). This is in preparation for getting rid of pdflush completely. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-03-26 11:01:36 +01:00
Jens Axboe	6933c02e9c	btrfs: get rid of current_is_pdflush() in btrfs_btree_balance_dirty Chris says it's safe to kill. Acked-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-03-26 11:01:35 +01:00
Theodore Ts'o	563bdd61fe	ext4: Check for an valid i_mode when reading the inode from disk Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-03-26 00:06:19 -04:00
Theodore Ts'o	7058548cd5	ext4: Use WRITE_SYNC for commits which are caused by fsync() If a commit is triggered by fsync(), set a flag indicating the journal blocks associated with the transaction should be flushed out using WRITE_SYNC. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-03-25 23:35:46 -04:00
Manish Katiyar	c16831b4cc	ext2: Zero our b_size in ext2_quota_read() ext2_quota_read() doesn't initialize tmp_bh.b_size before calling ext2_get_block() where we access it. Since it is a local variable it might contain some garbage. Make sure it is filled with reasonable value before passing. Signed-off-by: Manish Katiyar <mkatiyar@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>	2009-03-26 02:18:38 +01:00
Matt LaPlante	620372a9ff	trivial: fix typos/grammar errors in fs/Kconfig Signed-off-by: Matt LaPlante <kernel1@cyberdogtech.com> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Jan Kara <jack@suse.cz>	2009-03-26 02:18:38 +01:00
Jan Kara	268157ba67	quota: Coding style fixes Wrap long lines, remove assignments from conditions, rewrite two overcomplicated for loops. Signed-off-by: Jan Kara <jack@suse.cz>	2009-03-26 02:18:38 +01:00
Jan Kara	7a2435d874	quota: Remove superfluous inlines Remove inlines of large functions to decrease code size (saved 1543 bytes). Signed-off-by: Jan Kara <jack@suse.cz>	2009-03-26 02:18:37 +01:00
Jan Kara	90c0af05a5	nfsd: Use lowercase names of quota functions Use lowercase names of quota functions instead of old uppercase ones. CC: bfields@fieldses.org CC: neilb@suse.de Signed-off-by: Jan Kara <jack@suse.cz>	2009-03-26 02:18:37 +01:00
Jan Kara	c94d2a22f2	jfs: Use lowercase names of quota functions Use lowercase names of quota functions instead of old uppercase ones. Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Dave Kleikamp <shaggy@austin.ibm.com>	2009-03-26 02:18:37 +01:00
Jan Kara	bacfb7c2e5	udf: Use lowercase names of quota functions Use lowercase names of quota functions instead of old uppercase ones. Signed-off-by: Jan Kara <jack@suse.cz>	2009-03-26 02:18:36 +01:00
Jan Kara	5f5fa796c6	ufs: Use lowercase names of quota functions Use lowercase names of quota functions instead of old uppercase ones. Signed-off-by: Jan Kara <jack@suse.cz> CC: Evgeniy Dushistov <dushistov@mail.ru>	2009-03-26 02:18:36 +01:00
Jan Kara	77db4f25bc	reiserfs: Use lowercase names of quota functions Use lowercase names of quota functions instead of old uppercase ones. Signed-off-by: Jan Kara <jack@suse.cz> CC: reiserfs-devel@vger.kernel.org	2009-03-26 02:18:36 +01:00
Jan Kara	a269eb1829	ext4: Use lowercase names of quota functions Use lowercase names of quota functions instead of old uppercase ones. Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Mingming Cao <cmm@us.ibm.com> CC: linux-ext4@vger.kernel.org	2009-03-26 02:18:36 +01:00
Jan Kara	81a0522739	ext3: Use lowercase names of quota functions Use lowercase names of quota functions instead of old uppercase ones. Signed-off-by: Jan Kara <jack@suse.cz> CC: linux-ext4@vger.kernel.org	2009-03-26 02:18:36 +01:00
Jan Kara	6f90bee506	ext2: Use lowercase names of quota functions Use lowercase names of quota functions instead of old uppercase ones. Signed-off-by: Jan Kara <jack@suse.cz> CC: linux-ext4@vger.kernel.org	2009-03-26 02:18:36 +01:00
Jan Kara	314649558d	ramfs: Remove quota call Ramfs has no bussiness in quotas. Signed-off-by: Jan Kara <jack@suse.cz>	2009-03-26 02:18:35 +01:00
Jan Kara	9e3509e273	vfs: Use lowercase names of quota functions Use lowercase names of quota functions instead of old uppercase ones. Signed-off-by: Jan Kara <jack@suse.cz> CC: Alexander Viro <viro@zeniv.linux.org.uk>	2009-03-26 02:18:35 +01:00
Jan Kara	d26ac1a812	quota: Remove dqbuf_t and other cleanups Remove bogus typedef which is just a definition of char *. Remove unnecessary type casts. Substitute freedqbuf() with kfree. Signed-off-by: Jan Kara <jack@suse.cz>	2009-03-26 02:18:35 +01:00
Jan Kara	dd6f3c6d5a	quota: Remove NODQUOT macro Remove this macro which is just a definition of NULL. Fix a few coding style issues along the way. Signed-off-by: Jan Kara <jack@suse.cz>	2009-03-26 02:18:35 +01:00
Jan Kara	c516610cfe	quota: Make global quota locks cacheline aligned Andrew Morton has suggested that three global quota locks can end up in the same cacheline which can result in bad cacheline ping-pong on SMP machines. Make locks cacheline aligned so that we avoid this problem (thanks goes to Andrew for the idea). Signed-off-by: Jan Kara <jack@suse.cz> CC: Andrew Morton <akpm@linux-foundation.org>	2009-03-26 02:18:35 +01:00
Jan Kara	884d179dff	quota: Move quota files into separate directory Quota subsystem has more and more files. It's time to create a dir for it. Signed-off-by: Jan Kara <jack@suse.cz>	2009-03-26 02:18:35 +01:00
Mingming Cao	60e58e0f30	ext4: quota reservation for delayed allocation Uses quota reservation/claim/release to handle quota properly for delayed allocation in the three steps: 1) quotas are reserved when data being copied to cache when block allocation is defered 2) when new blocks are allocated. reserved quotas are converted to the real allocated quota, 2) over-booked quotas for metadata blocks are released back. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Acked-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz>	2009-03-26 02:18:34 +01:00
Jan Kara	643d00ccc3	reiserfs: Remove unnecessary quota functions reiserfs_dquot_initialize() and reiserfs_dquot_drop() is no longer needed because of modified quota locking. Signed-off-by: Jan Kara <jack@suse.cz>	2009-03-26 02:18:34 +01:00
Jan Kara	edf7245362	ext4: Remove unnecessary quota functions ext4_dquot_initialize() and ext4_dquot_drop() is no longer needed because of modified quota locking. Signed-off-by: Jan Kara <jack@suse.cz>	2009-03-26 02:18:34 +01:00
Jan Kara	a219ce3748	ext3: Remove unnecessary quota functions ext3_dquot_initialize() and ext3_dquot_drop() is no longer needed because of modified quota locking. Signed-off-by: Jan Kara <jack@suse.cz>	2009-03-26 02:18:34 +01:00
Mingming Cao	08d0350ce9	quota: Move EXPORT_SYMBOL immediately next to the functions/varibles According to checkpatch: EXPORT_SYMBOL(foo); should immediately follow its function/variable Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Jan Kara <jack@suse.cz>	2009-03-26 02:18:34 +01:00
Mingming Cao	740d9dcd94	quota: Add quota reservation claim and released operations Reserved quota will be claimed at the block allocation time. Over-booked quota could be returned back with the release callback function. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Jan Kara <jack@suse.cz>	2009-03-26 02:18:24 +01:00
Mingming Cao	f18df22899	quota: Add quota reservation support Delayed allocation defers the block allocation at the dirty pages flush-out time, doing quota charge/check at that time is too late. But we can't charge the quota blocks until blocks are really allocated, otherwise users could get overcharged after reboot from system crash. This patch adds quota reservation for delayed allocation. Quota blocks are reserved in memory, inode and quota won't gets dirtied until later block allocation time. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Jan Kara <jack@suse.cz>	2009-03-26 02:15:50 +01:00
Chris Mason	1a81af4d1d	Btrfs: make sure btrfs_update_delayed_ref doesn't increase ref_mod btrfs_update_delayed_ref is optimized to add and remove different references in one pass through the delayed ref tree. It is a zero sum on the total number of refs on a given extent. But, the code was recording an extra ref in the head node. This never made it down to the disk but was used when deciding if it was safe to free the extent while dropping snapshots. The fix used here is to make sure the ref_mod count is unchanged on the head ref when btrfs_update_delayed_ref is called. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-25 09:55:11 -04:00
Hugh Dickins	095160aee9	sysfs: fix some bin_vm_ops errors Commit 86c9508eb1c0ce5aa07b5cf1d36b60c54efc3d7a "sysfs: don't block indefinitely for unmapped files" in linux-next crashes the PowerMac G5 when X starts up. It's caught out by the way powerpc's pci_mmap of legacy_mem uses shmem_zero_setup(), substituting a new vma->vm_file whose private_data no longer points to the bin_buffer (substitution done because some versions of X crash if that mmap fails). The fix to this is straightforward: the original vm_file is fput() in that case, so this mmap won't block sysfs at all, so just don't switch over to bin_vm_ops if vm_file has changed. But more fixes made before realizing that was the problem:- It should not be an error if bin_page_mkwrite() finds no underlying page_mkwrite(). Check that a file already mmap'ed has the same underlying vm_ops _before_ pointing vma->vm_ops at bin_vm_ops. If the file being mmap'ed is a shmem/tmpfs file, don't fail the mmap on CONFIG_NUMA=y, just because that has a set_policy and get_policy: provide bin_set_policy, bin_get_policy and bin_migrate. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Eric Biederman <ebiederm@aristanetworks.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2009-03-24 16:38:26 -07:00
Alex Chiang	669420644c	sysfs: only allow one scheduled removal callback per kobj The only way for a sysfs attribute to remove itself (without deadlock) is to use the sysfs_schedule_callback() interface. Vegard Nossum discovered that a poorly written sysfs ->store callback can repeatedly schedule remove callbacks on the same device over and over, e.g. $ while true ; do echo 1 > /sys/devices/.../remove ; done If the 'remove' attribute uses the sysfs_schedule_callback API and also does not protect itself from concurrent accesses, its callback handler will be called multiple times, and will eventually attempt to perform operations on a freed kobject, leading to many problems. Instead of requiring all callers of sysfs_schedule_callback to implement their own synchronization, provide the protection in the infrastructure. Now, sysfs_schedule_callback will only allow one scheduled callback per kobject. On subsequent calls with the same kobject, return -EAGAIN. This is a short term fix. The long term fix is to allow sysfs attributes to remove themselves directly, without any of this callback hokey pokey. [cornelia.huck@de.ibm.com: s390 ccwgroup bits] Reported-by: vegard.nossum@gmail.com Signed-off-by: Alex Chiang <achiang@hp.com> Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2009-03-24 16:38:26 -07:00
Ming Lei	f67f129e51	Driver core: implement uevent suppress in kobject This patch implements uevent suppress in kobject and removes it from struct device, based on the following ideas: 1,Uevent sending should be one attribute of kobject, so suppressing it in kobject layer is more natural than in device layer. By this way, we can do it for other objects embedded with kobject. 2,It may save several bytes for each instance of struct device.(On my omap3(32bit ARM) based box, can save 8bytes per device object) This patch also introduces dev_set\|get_uevent_suppress() helpers to set and query uevent_suppress attribute in case to help kobject as private part of struct device in future. [This version is against the latest driver-core patch set of Greg,please ignore the last version.] Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2009-03-24 16:38:26 -07:00
Eric W. Biederman	e0edd3c65a	sysfs: don't block indefinitely for unmapped files. Modify sysfs bin files so that we can remove the bin file while they are still mapped. When the kobject is removed we unmap the bin file and arrange for future accesses to the mapping to receive SIGBUS. Implementing this prevents a nasty DOS when pci devices are hot plugged and unplugged. Where if any of their resources were mmaped the kernel could not free up their pci resources or release their pci data structures. [akpm@linux-foundation.org: remove unused var] Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com> Cc: Jesse Barnes <jbarnes@virtuousgeek.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2009-03-24 16:38:26 -07:00
Eric W. Biederman	04256b4a8f	sysfs: reference sysfs_dirent from sysfs inodes The sysfs_dirent serves as both an inode and a directory entry for sysfs. To prevent the sysfs inode numbers from being freed prematurely hold a reference to sysfs_dirent from the sysfs inode. [akpm@linux-foundation.org: add comment] Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com> Cc: Tejun Heo <tj@kernel.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2009-03-24 16:38:25 -07:00
Alex Chiang	425cb02912	sysfs: sysfs_add_one WARNs with full path to duplicate filename sysfs: sysfs_add_one WARNs with full path to duplicate filename As a debugging aid, it can be useful to know the full path to a duplicate file being created in sysfs. We now will display warnings such as: sysfs: cannot create duplicate filename '/foo' when attempting to create multiple files named 'foo' in the sysfs root, or: sysfs: cannot create duplicate filename '/bus/pci/slots/5/foo' when attempting to create multiple files named 'foo' under a given directory in sysfs. The path displayed is always a relative path to sysfs_root. The leading '/' in the path name refers to the sysfs_root mount point, and should not be confused with the "real" '/'. Thanks to Alex Williamson for essentially writing sysfs_pathname. Cc: Alex Williamson <alex.williamson@hp.com> Signed-off-by: Alex Chiang <achiang@hp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2009-03-24 16:38:25 -07:00
Eric W. Biederman	4a67a1bc0b	sysfs: Take sysfs_mutex when fetching the root inode. sysfs_get_inode ultimately calls sysfs_count_nlink when the a directory inode is fectched. sysfs_count_nlink needs to be called under the sysfs_mutex to guard against the unlikely but possible scenario that the root directory is changing as we are counting the number entries in it, and just in general to be consistent. Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2009-03-24 16:38:24 -07:00
Qinghuang Feng	8231f2f99a	SYSFS: use standard magic.h for sysfs SYSFS_MAGIC has been added into magic.h, so only use that definition in magic.h to avoid potential consistency problem. Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2009-03-24 16:38:24 -07:00
Chris Mason	af4176b49c	Btrfs: optimize fsyncs on old files The fsync log has code to make sure all of the parents of a file are in the log along with the file. It uses a minimal log of the parent directory inodes, just enough to get the parent directory on disk. If the transaction that originally created a file is fully on disk, and the file hasn't been renamed or linked into other directories, we can safely skip the parent directory walk. We know the file is on disk somewhere and we can go ahead and just log that single file. This is more important now because unrelated unlinks in the parent directory might make us force a commit if we try to log the parent. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:52 -04:00
Chris Mason	12fcfd22fe	Btrfs: tree logging unlink/rename fixes The tree logging code allows individual files or directories to be logged without including operations on other files and directories in the FS. It tries to commit the minimal set of changes to disk in order to fsync the single file or directory that was sent to fsync or O_SYNC. The tree logging code was allowing files and directories to be unlinked if they were part of a rename operation where only one directory in the rename was in the fsync log. This patch adds a few new rules to the tree logging. 1) on rename or unlink, if the inode being unlinked isn't in the fsync log, we must force a full commit before doing an fsync of the directory where the unlink was done. The commit isn't done during the unlink, but it is forced the next time we try to log the parent directory. Solution: record transid of last unlink/rename per directory when the directory wasn't already logged. For renames this is only done when renaming to a different directory. mkdir foo/some_dir normal commit rename foo/some_dir foo2/some_dir mkdir foo/some_dir fsync foo/some_dir/some_file The fsync above will unlink the original some_dir without recording it in its new location (foo2). After a crash, some_dir will be gone unless the fsync of some_file forces a full commit 2) we must log any new names for any file or dir that is in the fsync log. This way we make sure not to lose files that are unlinked during the same transaction. 2a) we must log any new names for any file or dir during rename when the directory they are being removed from was logged. 2a is actually the more important variant. Without the extra logging a crash might unlink the old name without recreating the new one 3) after a crash, we must go through any directories with a link count of zero and redo the rm -rf mkdir f1/foo normal commit rm -rf f1/foo fsync(f1) The directory f1 was fully removed from the FS, but fsync was never called on f1, only its parent dir. After a crash the rm -rf must be replayed. This must be able to recurse down the entire directory tree. The inode link count fixup code takes care of the ugly details. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:52 -04:00
Chris Mason	a74ac32207	Btrfs: Make sure i_nlink doesn't hit zero too soon during log replay During log replay, inodes are copied from the log to the main filesystem btrees. Sometimes they have a zero link count in the log but they actually gain links during the replay or have some in the main btree. This patch updates the link count to be at least one after copying the inode out of the log. This makes sure the inode is deleted during an iput while the rest of the replay code is still working on it. The log replay has fixup code to make sure that link counts are correct at the end of the replay, so we could use any non-zero number here and it would work fine. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:51 -04:00
Chris Mason	a4b6e07d1a	Btrfs: limit balancing work while flushing delayed refs The delayed reference mechanism is responsible for all updates to the extent allocation trees, including those updates created while processing the delayed references. This commit tries to limit the amount of work that gets created during the final run of delayed refs before a commit. It avoids cowing new blocks unless it is required to finish the commit, and so it avoids new allocations that were not really required. The goal is to avoid infinite loops where we are always making more work on the final run of delayed refs. Over the long term we'll make a special log for the last delayed ref updates as well. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:51 -04:00
Chris Mason	5d13a98f3b	Btrfs: readahead checksums during btrfs_finish_ordered_io This reads in blocks in the checksum btree before starting the transaction in btrfs_finish_ordered_io. It makes it much more likely we'll be able to do operations inside the transaction without needing any btree reads, which limits transaction latencies overall. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:51 -04:00
Chris Mason	b9473439d3	Btrfs: leave btree locks spinning more often btrfs_mark_buffer dirty would set dirty bits in the extent_io tree for the buffers it was dirtying. This may require a kmalloc and it was not atomic. So, anyone who called btrfs_mark_buffer_dirty had to set any btree locks they were holding to blocking first. This commit changes dirty tracking for extent buffers to just use a flag in the extent buffer. Now that we have one and only one extent buffer per page, this can be safely done without losing dirty bits along the way. This also introduces a path->leave_spinning flag that callers of btrfs_search_slot can use to indicate they will properly deal with a path returned where all the locks are spinning instead of blocking. Many of the btree search callers now expect spinning paths, resulting in better btree concurrency overall. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:28 -04:00
Chris Mason	89573b9c51	Btrfs: Only let very young transactions grow during commit Commits are fairly expensive, and so btrfs has code to sit around for a while during the commit and let new writers come in. But, while we're sitting there, new delayed refs might be added, and those can be expensive to process as well. Unless the transaction is very very young, it makes sense to go ahead and let the commit finish without hanging around. The commit grow loop isn't as important as it used to be, the fsync logging code handles most performance critical syncs now. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:28 -04:00
Chris Mason	66d7e85ea7	Btrfs: Check for a blocking lock before taking the spin This reduces contention on the extent buffer spin locks by testing for a blocking lock before trying to take the spinlock. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:27 -04:00
Chris Mason	7f366cfecf	Btrfs: reduce stack in cow_file_range The fs/btrfs/inode.c code to run delayed allocation during writout needed some stack usage optimization. This is the first pass, it does the check for compression earlier on, which allows us to do the common (no compression) case higher up in the call chain. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:27 -04:00
Chris Mason	b7ec40d784	Btrfs: reduce stalls during transaction commit To avoid deadlocks and reduce latencies during some critical operations, some transaction writers are allowed to jump into the running transaction and make it run a little longer, while others sit around and wait for the commit to finish. This is a bit unfair, especially when the callers that jump in do a bunch of IO that makes all the others procs on the box wait. This commit reduces the stalls this produces by pre-reading file extent pointers during btrfs_finish_ordered_io before the transaction is joined. It also tunes the drop_snapshot code to politely wait for transactions that have started writing out their delayed refs to finish. This avoids new delayed refs being flooded into the queue while we're trying to close off the transaction. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:26 -04:00
Chris Mason	c3e69d58e8	Btrfs: process the delayed reference queue in clusters The delayed reference queue maintains pending operations that need to be done to the extent allocation tree. These are processed by finding records in the tree that are not currently being processed one at a time. This is slow because it uses lots of time searching through the rbtree and because it creates lock contention on the extent allocation tree when lots of different procs are running delayed refs at the same time. This commit changes things to grab a cluster of refs for processing, using a cursor into the rbtree as the starting point of the next search. This way we walk smoothly through the rbtree. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:26 -04:00
Chris Mason	1887be66dc	Btrfs: try to cleanup delayed refs while freeing extents When extents are freed, it is likely that we've removed the last delayed reference update for the extent. This checks the delayed ref tree when things are freed, and if no ref updates area left it immediately processes the delayed ref. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:26 -04:00
Chris Mason	44871b1b24	Btrfs: reduce stack usage in some crucial tree balancing functions Many of the tree balancing functions follow the same pattern. 1) cow a block 2) do something to the result This commit breaks them up into two functions so the variables and code required for part two don't suck down stack during part one. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:25 -04:00
Chris Mason	56bec294de	Btrfs: do extent allocation and reference count updates in the background The extent allocation tree maintains a reference count and full back reference information for every extent allocated in the filesystem. For subvolume and snapshot trees, every time a block goes through COW, the new copy of the block adds a reference on every block it points to. If a btree node points to 150 leaves, then the COW code needs to go and add backrefs on 150 different extents, which might be spread all over the extent allocation tree. These updates currently happen during btrfs_cow_block, and most COWs happen during btrfs_search_slot. btrfs_search_slot has locks held on both the parent and the node we are COWing, and so we really want to avoid IO during the COW if we can. This commit adds an rbtree of pending reference count updates and extent allocations. The tree is ordered by byte number of the extent and byte number of the parent for the back reference. The tree allows us to: 1) Modify back references in something close to disk order, reducing seeks 2) Significantly reduce the number of modifications made as block pointers are balanced around 3) Do all of the extent insertion and back reference modifications outside of the performance critical btrfs_search_slot code. #3 has the added benefit of greatly reducing the btrfs stack footprint. The extent allocation tree modifications are done without the deep (and somewhat recursive) call chains used in the past. These delayed back reference updates must be done before the transaction commits, and so the rbtree is tied to the transaction. Throttling is implemented to help keep the queue of backrefs at a reasonable size. Since there was a similar mechanism in place for the extent tree extents, that is removed and replaced by the delayed reference tree. Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:25 -04:00
Chris Mason	9fa8cfe706	Btrfs: don't preallocate metadata blocks during btrfs_search_slot In order to avoid doing expensive extent management with tree locks held, btrfs_search_slot will preallocate tree blocks for use by COW without any tree locks held. A later commit moves all of the extent allocation work for COW into a delayed update mechanism, and this preallocation will no longer be required. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:25 -04:00
Martin K. Petersen	6d2a78e783	block: add private bio_set for bio integrity allocations The integrity bio allocation needs its own bio_set to avoid violating the mempool allocation rules and risking deadlocks. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-03-24 12:35:17 +01:00
Jens Axboe	a7fcd37cdc	block: don't create bio_vec slabs of less than the inline number If we don't have CONFIG_BLK_DEV_INTEGRITY set, then we don't have any external dependencies on the bio_vec slabs. So don't create the ones that we will inline anyway. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-03-24 12:35:16 +01:00
Ingo Molnar	34053979fb	block: cleanup bio_alloc_bioset() this warning (which got fixed by commit `b2bf968`): fs/bio.c: In function ‘bio_alloc_bioset’: fs/bio.c:305: warning: ‘p’ may be used uninitialized in this function Triggered because the code flow in bio_alloc_bioset() is correct but a bit complex for the compiler to see through. Streamline it a bit - this also makes the code a tiny bit more compact: text data bss dec hex filename 7540 256 40 7836 1e9c bio.o.before 7539 256 40 7835 1e9b bio.o.after Also remove an older compiler-warnings annotation from this function, it's not needed. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-03-24 12:35:16 +01:00
Steven Whitehouse	df3647b245	GFS2: Fix freeze issue This removes some old code that was causing issues during filesystem freeze. Reported-by: Andrew Price <andy@andrewprice.me.uk> Tested-by: Andrew Price <andy@andrewprice.me.uk> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-03-24 11:31:30 +00:00
Steven Whitehouse	9c538837d8	Fix a minor bug in the previous patch The logic requires that we mark the glock dirty in page_mkwrite otherwise we might not flush correctly in the case that no allocation was required in the process of dirying the page. Also we need to set the shared write flag early for the same reason. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-03-24 11:21:27 +00:00
Steven Whitehouse	6bac243f07	GFS2: Clean up of glops.c This cleans up a number of bits of code mostly based in glops.c. A couple of simple functions have been merged into the callers to make it more obvious what is going on, the mysterious raising of i_writecount around the truncate_inode_pages() call has been removed. The meta_go_* operations have been renamed rgrp_go_* since that is the only lock type that they are used with. The unused argument of gfs2_read_sb has been removed. Also a bug has been fixed where a check for the rindex inode was in the wrong callback. More comments are added, and the debugging code is improved too. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-03-24 11:21:27 +00:00
Benjamin Marzinski	02ffad08e8	GFS2: Fix locking bug in failed shared to exclusive conversion After calling out to the dlm, GFS2 sets the new state of a glock to gl_target in gdlm_ast(). However, gl_target is not always the lock state that was requested. If a conversion from shared to exclusive fails, finish_xmote() will call do_xmote() with LM_ST_UNLOCKED, instead of gl->gl_target, so that it can reacquire the lock in exlusive the next time around. In this case, setting the lock to gl_target in gdlm_ast() will make GFS2 think that it has the glock in exclusive mode, when really, it doesn't have the glock locked at all. This patch adds a new field to the gfs2_glock structure, gl_req, to track the mode that was requested. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-03-24 11:21:26 +00:00
Hisashi Hifumi	229615def3	GFS2: Pagecache usage optimization on GFS2 I introduced "is_partially_uptodate" aops for GFS2. A page can have multiple buffers and even if a page is not uptodate, some buffers can be uptodate on pagesize != blocksize environment. This aops checks that all buffers which correspond to a part of a file that we want to read are uptodate. If so, we do not have to issue actual read IO to HDD even if a page is not uptodate because the portion we want to read are uptodate. "block_is_partially_uptodate" function is already used by ext2/3/4. With the following patch random read/write mixed workloads or random read after random write workloads can be optimized and we can get performance improvement. I did a performance test using the sysbench. #sysbench --num-threads=16 --max-requests=200000 --test=fileio --file-num=1 --file-block-size=8K --file-total-size=2G --file-test-mode=rndrw --file-fsync-freq=0 --file-rw-ratio=1 run -2.6.29-rc6 Test execution summary: total time: 202.6389s total number of events: 200000 total time taken by event execution: 2580.0480 per-request statistics: min: 0.0000s avg: 0.0129s max: 49.5852s approx. 95 percentile: 0.0462s -2.6.29-rc6-patched Test execution summary: total time: 177.8639s total number of events: 200000 total time taken by event execution: 2419.0199 per-request statistics: min: 0.0000s avg: 0.0121s max: 52.4306s approx. 95 percentile: 0.0444s arch: ia64 pagesize: 16k blocksize: 4k Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-03-24 11:21:25 +00:00
Hannes Eder	02ab172159	GFS2: fix sparse warning: Should it be static? Impact: Make symbol static. Fix this sparse warning: fs/gfs2/rgrp.c:188:5: warning: symbol 'gfs2_bitfit' was not declared. Should it be static? Signed-off-by: Hannes Eder <hannes@hanneseder.net> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-03-24 11:21:25 +00:00
Hannes Eder	075ac44875	GFS2: fix sparse warnings: constant is so big it is ... Fix this sparse warnings: fs/gfs2/rgrp.c:156:23: warning: constant 0xffffffffffffffff is so big it is unsigned long long fs/gfs2/rgrp.c:157:23: warning: constant 0xaaaaaaaaaaaaaaaa is so big it is unsigned long long fs/gfs2/rgrp.c:158:23: warning: constant 0x5555555555555555 is so big it is long long fs/gfs2/rgrp.c:194:20: warning: constant 0x5555555555555555 is so big it is long long fs/gfs2/rgrp.c:204:44: warning: constant 0x5555555555555555 is so big it is long long Signed-off-by: Hannes Eder <hannes@hanneseder.net> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-03-24 11:21:24 +00:00
Steven Whitehouse	b9a9694570	GFS2: Support quota/noquota mount arguments This adds support for "quota" and "noquota" mount options in addition to the existing "quota=on/off/account" so that we are compatible with the names by which these options are more generally known. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-03-24 11:21:23 +00:00
Steven Whitehouse	223b2b889f	GFS2: Fix alignment issue and tidy gfs2_bitfit An alignment issue with the existing bitfit algorithm was reported on IA64. This patch attempts to fix that, and also to tidy up the code a bit. There is now more documentation about how this works and it has survived a number of different tests. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-03-24 11:21:22 +00:00
Steven Whitehouse	64d576ba23	GFS2: Add a "demote a glock" interface to sysfs This adds a sysfs file called demote_rq to GFS2's per filesystem directory. Its possible to use this file to demote arbitrary glocks in exactly the same way as if a request had come in from a remote node. This is intended for testing issues relating to caching of data under glocks. Despite that, the interface is generic enough to send requests to any type of glock, but be careful as its not always safe to send an arbitrary message to an arbitrary glock. For that reason and to prevent DoS, this interface is restricted to root only. The messages look like this: <type>:<glocknumber> <mode> Example: echo -n "2:13324 EX" >/sys/fs/gfs2/unity:myfs/demote_rq Which means "please demote inode glock (type 2) number 13324 so that I can get an EX (exclusive) lock". The lock modes are those which would normally be sent by a remote node in its callback so if you want to unlock a glock, you use EX, to demote to shared, use SH or PR (depending on whether you like GFS2 or DLM lock modes better!). If the glock doesn't exist, you'll get -ENOENT returned. If the arguments don't make sense, you'll get -EINVAL returned. The plan is that this interface will be used in combination with the blktrace patch which I recently posted for comments although it is, of course, still useful in its own right. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-03-24 11:21:22 +00:00
Steven Whitehouse	02e3cc70ec	GFS2: Expose UUID via sysfs/uevent Since we have a UUID, we ought to expose it to the user via sysfs and uevents. We already have the fs name in both of these places (a combination of the lock proto and lock table name) so if we add the UUID as well, we have a full set. For older filesystems (i.e. those created before mkfs.gfs2 was writing UUIDs by default) the sysfs file will appear zero length, and no UUID env var will be added to the uevents. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-03-24 11:21:21 +00:00
Steven Whitehouse	f15ab5619d	GFS2: Support generation of discard requests This patch allows GFS2 to generate discard requests for blocks which are no longer useful to the filesystem (i.e. those which have been freed as the result of an unlink operation). The requests are generated at the time which those blocks become available for reuse in the filesystem. In order to use this new feature, you have to specify the "discard" mount option. The code coalesces adjacent blocks into a single extent when generating the discard requests, thus generating the minimum number. If an error occurs when the request has been sent to the block device, then it will print a message and turn off the requests for that filesystem. If the problem is temporary, then you can use remount to turn the option back on again. There is also a nodiscard mount option so that you can use remount to turn discard requests off, if required. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-03-24 11:21:20 +00:00
Steven Whitehouse	d8348de06f	GFS2: Fix deadlock on journal flush This patch fixes a deadlock when the journal is flushed and there are dirty inodes other than the one which caused the journal flush. Originally the journal flushing code was trying to obtain the transaction glock while running the flush code for an inode glock. We no longer require the transaction glock at this point in time since we know that any attempt to get the transaction glock from another node will result in a journal flush. So if we are flushing the journal, we can be sure that the transaction lock is still cached from when the transaction was started. By inlining a version of gfs2_trans_begin() (minus the bit which gets the transaction glock) we can avoid the deadlock problems caused if there is a demote request queued up on the transaction glock. In addition I've also moved the umount rwsem so that it covers the glock workqueue, since it all demotions are done by this workqueue now. That fixes a bug on umount which I came across while fixing the original problem. Reported-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-03-24 11:21:18 +00:00
Steven Whitehouse	e7c8707ea2	GFS2: Fix error path ref counting for root inode We were keeping hold of an extra ref to the root inode in one of the error paths, that resulted in a hang. Reported-by: Nate Straz <nstraz@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Tested-by: Robert Peterson <rpeterso@redhat.com>	2009-03-24 11:21:17 +00:00
Steven Whitehouse	ac2425e7d3	GFS2: Remove unused field from glock The time stamp field is unused in the glock now that we are using a shrinker, so that we can remove it and save sizeof(unsigned long) bytes in each glock. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-03-24 11:21:17 +00:00
Steven Whitehouse	f057f6cdf6	GFS2: Merge lock_dlm module into GFS2 This is the big patch that I've been working on for some time now. There are many reasons for wanting to make this change such as: o Reducing overhead by eliminating duplicated fields between structures o Simplifcation of the code (reduces the code size by a fair bit) o The locking interface is now the DLM interface itself as proposed some time ago. o Fewer lookups of glocks when processing replies from the DLM o Fewer memory allocations/deallocations for each glock o Scope to do further optimisations in the future (but this patch is more than big enough for now!) Please note that (a) this patch relates to the lock_dlm module and not the DLM itself, that is still a separate module; and (b) that we retain the ability to build GFS2 as a standalone single node filesystem with out requiring the DLM. This patch needs a lot of testing, hence my keeping it I restarted my -git tree after the last merge window. That way, this has the maximum exposure before its merged. This is (modulo a few minor bug fixes) the same patch that I've been posting on and off the the last three months and its passed a number of different tests so far. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-03-24 11:21:14 +00:00
Steven Whitehouse	22077f57de	GFS2: Remove "double" locking in quota We only really need a single spin lock for the quota data, so lets just use the lru lock for now. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Cc: Abhijith Das <adas@redhat.com>	2009-03-24 11:21:13 +00:00
Abhijith Das	0a7ab79c5b	GFS2: change gfs2_quota_scan into a shrinker Deallocation of gfs2_quota_data objects now happens on-demand through a shrinker instead of routinely deallocating through the quotad daemon. Signed-off-by: Abhijith Das <adas@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-03-24 11:21:12 +00:00
Abhijith Das	2db2aac255	GFS2: Bring back lvb-related stuff to lock_nolock to support quotas The quota code uses lvbs and this is currently not implemented in lock_nolock, thereby causing panics when quota is enabled with lock_nolock. This patch adds the relevant bits. Signed-off-by: Abhijith Das <adas@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-03-24 11:21:11 +00:00
Steven Whitehouse	6f04c1c7fe	GFS2: Fix remount argument parsing The following patch fixes an issue relating to remount and argument parsing. After this fix is applied, remount becomes atomic in that it either succeeds changing the mount to the new state, or it fails and leaves it in the old state. Previously it was possible for the parsing of options to fail part way though and for the fs to be left in a state where some of the new arguments had been applied, but some had not. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-03-24 11:21:10 +00:00
James Morris	703a3cd728	Merge branch 'master' into next	2009-03-24 10:52:46 +11:00
Gertjan van Wingerde	f762dd6821	Update my email address Update all previous incarnations of my email address to the correct one. Signed-off-by: Gertjan van Wingerde <gwingerde@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-22 11:28:37 -07:00
Tyler Hicks	2aac0cf886	eCryptfs: NULL crypt_stat dereference during lookup If ecryptfs_encrypted_view or ecryptfs_xattr_metadata were being specified as mount options, a NULL pointer dereference of crypt_stat was possible during lookup. This patch moves the crypt_stat assignment into ecryptfs_lookup_and_interpose_lower(), ensuring that crypt_stat will not be NULL before we attempt to dereference it. Thanks to Dan Carpenter and his static analysis tool, smatch, for finding this bug. Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com> Acked-by: Dustin Kirkland <kirkland@canonical.com> Cc: Dan Carpenter <error27@gmail.com> Cc: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-22 11:20:43 -07:00
Tyler Hicks	8faece5f90	eCryptfs: Allocate a variable number of pages for file headers When allocating the memory used to store the eCryptfs header contents, a single, zeroed page was being allocated with get_zeroed_page(). However, the size of an eCryptfs header is either PAGE_CACHE_SIZE or ECRYPTFS_MINIMUM_HEADER_EXTENT_SIZE (8192), whichever is larger, and is stored in the file's private_data->crypt_stat->num_header_bytes_at_front field. ecryptfs_write_metadata_to_contents() was using num_header_bytes_at_front to decide how many bytes should be written to the lower filesystem for the file header. Unfortunately, at least 8K was being written from the page, despite the chance of the single, zeroed page being smaller than 8K. This resulted in random areas of kernel memory being written between the 0x1000 and 0x1FFF bytes offsets in the eCryptfs file headers if PAGE_SIZE was 4K. This patch allocates a variable number of pages, calculated with num_header_bytes_at_front, and passes the number of allocated pages along to ecryptfs_write_metadata_to_contents(). Thanks to Florian Streibelt for reporting the data leak and working with me to find the problem. 2.6.28 is the only kernel release with this vulnerability. Corresponds to CVE-2009-0787 Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com> Acked-by: Dustin Kirkland <kirkland@canonical.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Eugene Teo <eugeneteo@kernel.sg> Cc: Greg KH <greg@kroah.com> Cc: dann frazier <dannf@dannf.org> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: Florian Streibelt <florian@f-streibelt.de> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-22 11:20:43 -07:00
Jeff Moyer	65c24491b4	aio: lookup_ioctx can return the wrong value when looking up a bogus context The libaio test harness turned up a problem whereby lookup_ioctx on a bogus io context was returning the 1 valid io context from the list (harness/cases/3.p). Because of that, an extra put_iocontext was done, and when the process exited, it hit a BUG_ON in the put_iocontext macro called from exit_aio (since we expect a users count of 1 and instead get 0). The problem was introduced by "aio: make the lookup_ioctx() lockless" (commit `abf137dd77`). Thanks to Zach for pointing out that hlist_for_each_entry_rcu will not return with a NULL tpos at the end of the loop, even if the entry was not found. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Zach Brown <zach.brown@oracle.com> Acked-by: Jens Axboe <jens.axboe@oracle.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-19 15:57:18 -07:00
Davide Libenzi	87c3a86e1c	eventfd: remove fput() call from possible IRQ context Remove a source of fput() call from inside IRQ context. Myself, like Eric, wasn't able to reproduce an fput() call from IRQ context, but Jeff said he was able to, with the attached test program. Independently from this, the bug is conceptually there, so we might be better off fixing it. This patch adds an optimization similar to the one we already do on ->ki_filp, on ->ki_eventfd. Playing with ->f_count directly is not pretty in general, but the alternative here would be to add a brand new delayed fput() infrastructure, that I'm not sure is worth it. Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Cc: Zach Brown <zach.brown@oracle.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-19 15:57:18 -07:00
Linus Torvalds	fe2fd6cc34	Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: Clear space_info full when adding new devices Btrfs: Fix locking around adding new space_info	2009-03-19 14:49:55 -07:00
Trond Myklebust	7fe5c398fc	NFS: Optimise NFS close() Close-to-open cache consistency rules really only require us to flush out writes on calls to close(), and require us to revalidate attributes on the very last close of the file. Currently we appear to be doing a lot of extra attribute revalidation and cache flushes. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-19 15:35:50 -04:00
Trond Myklebust	b1e4adf4ea	NFS: Fix the notifications when renaming onto an existing file NFS appears to be returning an unnecessary "delete" notification when we're doing an atomic rename. See http://bugzilla.gnome.org/show_bug.cgi?id=575684 The fix is to get rid of the redundant call to d_delete(). Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-19 15:35:49 -04:00
Trond Myklebust	47c6256420	NFS: Fix up a mismerged patch Move the definition of nfs_need_commit() into the #ifdef CONFIG_NFS_V3 section as originally intended in the patch "NFS: cleanup - remove struct nfs_inode->ncommit" Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-19 15:17:40 -04:00
Linus Torvalds	a8e7d49aa7	Fix race in create_empty_buffers() vs __set_page_dirty_buffers() Nick Piggin noticed this (very unlikely) race between setting a page dirty and creating the buffers for it - we need to hold the mapping private_lock until we've set the page dirty bit in order to make sure that create_empty_buffers() might not build up a set of buffers without the dirty bits set when the page is dirty. I doubt anybody has ever hit this race (and it didn't solve the issue Nick was looking at), but as Nick says: "Still, it does appear to solve a real race, which we should close." Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-19 11:32:05 -07:00
Linus Torvalds	85bff8857c	Merge branch 'for-2.6.29' of git://linux-nfs.org/~bfields/linux * 'for-2.6.29' of git://linux-nfs.org/~bfields/linux: nfsd: nfsd should drop CAP_MKNOD for non-root NFSD: provide encode routine for OP_OPENATTR	2009-03-18 09:27:20 -07:00
Steve French	b363b3304b	[CIFS] Fix memory overwrite when saving nativeFileSystem field during mount CIFS can allocate a few bytes to little for the nativeFileSystem field during tree connect response processing during mount. This can result in a "Redzone overwritten" message to be logged. Signed-off-by: Sridhar Vinay <vinaysridhar@in.ibm.com> Acked-by: Shirish Pargaonkar <shirishp@us.ibm.com> CC: Stable <stable@kernel.org> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-03-18 05:57:22 +00:00
Steve French	c6c00919ab	[CIFS] Rename compose_mount_options to cifs_compose_mount_options. Make it available to others for reuse. Signed-off-by: Igor Mammedov <niallain@gmail.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-03-18 05:50:07 +00:00
Linus Torvalds	58cefd2b1e	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix bb_prealloc_list corruption due to wrong group locking ext4: fix bogus BUG_ONs in in mballoc code ext4: Print the find_group_flex() warning only once ext4: fix header check in ext4_ext_search_right() for deep extent trees.	2009-03-17 20:55:40 -07:00
Benny Halevy	84f09f46b4	NFSD: provide encode routine for OP_OPENATTR Although this operation is unsupported by our implementation we still need to provide an encode routine for it to merely encode its (error) status back in the compound reply. Thanks for Bill Baker at sun.com for testing with the Sun OpenSolaris' client, finding, and reporting this bug at Connectathon 2009. This bug was introduced in 2.6.27 Signed-off-by: Benny Halevy <bhalevy@panasas.com> Cc: stable@kernel.org Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-03-17 14:54:45 -04:00
Linus Torvalds	ee568b25ee	Avoid 64-bit "switch()" statements on 32-bit architectures Commit `ee6f779b9e` ("filp->f_pos not correctly updated in proc_task_readdir") changed the proc code to use filp->f_pos directly, rather than through a temporary variable. In the process, that caused the operations to be done on the full 64 bits, even though the offset is never that big. That's all fine and dandy per se, but for some unfathomable reason gcc generates absolutely horrid code when using 64-bit values in switch() statements. To the point of actually calling out to gcc helper functions like __cmpdi2 rather than just doing the trivial comparisons directly the way gcc does for normal compares. At which point we get link failures, because we really don't want to support that kind of crazy code. Fix this by just casting the f_pos value to "unsigned long", which is plenty big enough for /proc, and avoids the gcc code generation issue. Reported-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Zhang Le <r0bertz@gentoo.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-17 10:02:35 -07:00
Eric Sandeen	d33a1976fb	ext4: fix bb_prealloc_list corruption due to wrong group locking This is for Red Hat bug 490026: EXT4 panic, list corruption in ext4_mb_new_inode_pa ext4_lock_group(sb, group) is supposed to protect this list for each group, and a common code flow to remove an album is like this: ext4_get_group_no_and_offset(sb, pa->pa_pstart, &grp, NULL); ext4_lock_group(sb, grp); list_del(&pa->pa_group_list); ext4_unlock_group(sb, grp); so it's critical that we get the right group number back for this prealloc context, to lock the right group (the one associated with this pa) and prevent concurrent list manipulation. however, ext4_mb_put_pa() passes in (pa->pa_pstart - 1) with a comment, "-1 is to protect from crossing allocation group". This makes sense for the group_pa, where pa_pstart is advanced by the length which has been used (in ext4_mb_release_context()), and when the entire length has been used, pa_pstart has been advanced to the first block of the next group. However, for inode_pa, pa_pstart is never advanced; it's just set once to the first block in the group and not moved after that. So in this case, if we subtract one in ext4_mb_put_pa(), we are actually locking the previous group, and opening the race with the other threads which do not subtract off the extra block. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-03-16 23:25:40 -04:00
Theodore Ts'o	afd4672dc7	ext4: Add auto_da_alloc mount option Add a mount option which allows the user to disable automatic allocation of blocks whose allocation by delayed allocation when the file was originally truncated or when the file is renamed over an existing file. This feature is intended to save users from the effects of naive application writers, but it reduces the effectiveness of the delayed allocation code. This mount option disables this safety feature, which may be desirable for prodcutions systems where the risk of unclean shutdowns or unexpected system crashes is low. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-03-16 23:12:23 -04:00
Zhang Le	ee6f779b9e	filp->f_pos not correctly updated in proc_task_readdir filp->f_pos only get updated at the end of the function. Thus d_off of those dirents who are in the middle will be 0, and this will cause a problem in glibc's readdir implementation, specifically endless loop. Because when overflow occurs, f_pos will be set to next dirent to read, however it will be 0, unless the next one is the last one. So it will start over again and again. There is a sample program in man 2 gendents. This is the output of the program running on a multithread program's task dir before this patch is applied: $ ./a.out /proc/3807/task --------------- nread=128 --------------- i-node# file type d_reclen d_off d_name 506442 directory 16 1 . 506441 directory 16 0 .. 506443 directory 16 0 3807 506444 directory 16 0 3809 506445 directory 16 0 3812 506446 directory 16 0 3861 506447 directory 16 0 3862 506448 directory 16 8 3863 This is the output after this patch is applied $ ./a.out /proc/3807/task --------------- nread=128 --------------- i-node# file type d_reclen d_off d_name 506442 directory 16 1 . 506441 directory 16 2 .. 506443 directory 16 3 3807 506444 directory 16 4 3809 506445 directory 16 5 3812 506446 directory 16 6 3861 506447 directory 16 7 3862 506448 directory 16 8 3863 Signed-off-by: Zhang Le <r0bertz@gentoo.org> Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-16 07:51:33 -07:00
Jonathan Corbet	60aa49243d	Rationalize fasync return values Most fasync implementations do something like: return fasync_helper(...); But fasync_helper() will return a positive value at times - a feature used in at least one place. Thus, a number of other drivers do: err = fasync_helper(...); if (err < 0) return err; return 0; In the interests of consistency and more concise code, it makes sense to map positive return values onto zero where ->fasync() is called. Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Jonathan Corbet <corbet@lwn.net>	2009-03-16 08:34:35 -06:00
Jonathan Corbet	76398425bb	Move FASYNC bit handling to f_op->fasync() Removing the BKL from FASYNC handling ran into the challenge of keeping the setting of the FASYNC bit in filp->f_flags atomic with regard to calls to the underlying fasync() function. Andi Kleen suggested moving the handling of that bit into fasync(); this patch does exactly that. As a result, we have a couple of internal API changes: fasync() must now manage the FASYNC bit, and it will be called without the BKL held. As it happens, every fasync() implementation in the kernel with one exception calls fasync_helper(). So, if we make fasync_helper() set the FASYNC bit, we can avoid making any changes to the other fasync() functions - as long as those functions, themselves, have proper locking. Most fasync() implementations do nothing but call fasync_helper() - which has its own lock - so they are easily verified as correct. The BKL had already been pushed down into the rest. The networking code has its own version of fasync_helper(), so that code has been augmented with explicit FASYNC bit handling. Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: David Miller <davem@davemloft.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jonathan Corbet <corbet@lwn.net>	2009-03-16 08:32:27 -06:00
Jonathan Corbet	db1dd4d376	Use f_lock to protect f_flags Traditionally, changes to struct file->f_flags have been done under BKL protection, or with no protection at all. This patch causes all f_flags changes after file open/creation time to be done under protection of f_lock. This allows the removal of some BKL usage and fixes a number of longstanding (if microscopic) races. Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Jonathan Corbet <corbet@lwn.net>	2009-03-16 08:32:27 -06:00
Jonathan Corbet	6849991490	Rename struct file->f_ep_lock This lock moves out of the CONFIG_EPOLL ifdef and becomes f_lock. For now, epoll remains the only user, but a future patch will use it to protect f_flags as well. Cc: Davide Libenzi <davidel@xmailserver.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jonathan Corbet <corbet@lwn.net>	2009-03-16 08:32:27 -06:00
Linus Torvalds	18553c38bc	Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block * 'for-linus' of git://git.kernel.dk/linux-2.6-block: Fix Xilinx SystemACE driver to handle empty CF slot block: fix memory leak in bio_clone() block: Add gfp_mask parameter to bio_integrity_clone()	2009-03-14 13:43:18 -07:00
Li Zefan	059ea3318c	block: fix memory leak in bio_clone() If bio_integrity_clone() fails, bio_clone() returns NULL without freeing the newly allocated bio. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-03-14 21:06:52 +01:00
un'ichi Nomura	87092698c6	block: Add gfp_mask parameter to bio_integrity_clone() Stricter gfp_mask might be required for clone allocation. For example, request-based dm may clone bio in interrupt context so it has to use GFP_ATOMIC. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Cc: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-03-14 21:06:51 +01:00
Linus Torvalds	f1823acfbc	Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6 * 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: NFS: Fix the fix to Bugzilla #11061, when IPv6 isn't defined... SUNRPC: xprt_connect() don't abort the task if the transport isn't bound SUNRPC: Fix an Oops due to socket not set up yet... Bug 11061, NFS mounts dropped NFS: Handle -ESTALE error in access() NLM: Fix GRANT callback address comparison when IPv6 is enabled NLM: Shrink the IPv4-only version of nlm_cmp_addr() NFSv3: Fix posix ACL code NFS: Fix misparsing of nfsv4 fs_locations attribute (take 2) SUNRPC: Tighten up the task locking rules in __rpc_execute()	2009-03-14 12:00:18 -07:00
Linus Torvalds	ff9cb43ce0	Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2 * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2: ocfs2: Use xs->bucket to set xattr value outside ocfs2: Fix a bug found by sparse check. ocfs2: tweak to get the maximum inline data size with xattr ocfs2: reserve xattr block for new directory with inline data	2009-03-14 11:59:22 -07:00
Tyler Hicks	84814d642a	eCryptfs: don't encrypt file key with filename key eCryptfs has file encryption keys (FEK), file encryption key encryption keys (FEKEK), and filename encryption keys (FNEK). The per-file FEK is encrypted with one or more FEKEKs and stored in the header of the encrypted file. I noticed that the FEK is also being encrypted by the FNEK. This is a problem if a user wants to use a different FNEK than their FEKEK, as their file contents will still be accessible with the FNEK. This is a minimalistic patch which prevents the FNEKs signatures from being copied to the inode signatures list. Ultimately, it keeps the FEK from being encrypted with a FNEK. Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Acked-by: Dustin Kirkland <kirkland@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-14 11:57:22 -07:00
Johannes Weiner	15e7b87676	nommu: ramfs: don't leak pages when adding to page cache fails When a ramfs nommu mapping is expanded, contiguous pages are allocated and added to the pagecache. The caller's reference is then passed on by moving whole pagevecs to the file lru list. If the page cache adding fails, make sure that the error path also moves the pagevec contents which might still contain up to PAGEVEC_SIZE successfully added pages, of which we would leak references otherwise. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: David Howells <dhowells@redhat.com> Cc: Enrik Berkhan <Enrik.Berkhan@ge.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-14 11:57:22 -07:00
Enrik Berkhan	020fe22ff1	nommu: ramfs: pages allocated to an inode's pagecache may get wrongly discarded The pages attached to a ramfs inode's pagecache by truncation from nothing - as done by SYSV SHM for example - may get discarded under memory pressure. The problem is that the pages are not marked dirty. Anything that creates data in an MMU-based ramfs will cause the pages holding that data will cause the set_page_dirty() aop to be called. For the NOMMU-based mmap, set_page_dirty() may be called by write(), but it won't be called by page-writing faults on writable mmaps, and it isn't called by ramfs_nommu_expand_for_mapping() when a file is being truncated from nothing to allocate a contiguous run. The solution is to mark the pages dirty at the point of allocation by the truncation code. Signed-off-by: Enrik Berkhan <Enrik.Berkhan@ge.com> Signed-off-by: David Howells <dhowells@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-14 11:57:22 -07:00
Eric Sandeen	8d03c7a0c5	ext4: fix bogus BUG_ONs in in mballoc code Thiemo Nagel reported that: # dd if=/dev/zero of=image.ext4 bs=1M count=2 # mkfs.ext4 -v -F -b 1024 -m 0 -g 512 -G 4 -I 128 -N 1 \ -O large_file,dir_index,flex_bg,extent,sparse_super image.ext4 # mount -o loop image.ext4 mnt/ # dd if=/dev/zero of=mnt/file oopsed, with a BUG_ON in ext4_mb_normalize_request because size == EXT4_BLOCKS_PER_GROUP It appears to me (esp. after talking to Andreas) that the BUG_ON is bogus; a request of exactly EXT4_BLOCKS_PER_GROUP should be allowed, though larger sizes do indicate a problem. Fix that an another (apparently rare) codepath with a similar check. Reported-by: Thiemo Nagel <thiemo.nagel@ph.tum.de> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-03-14 11:51:46 -04:00
Tao Ma	712e53e46a	ocfs2: Use xs->bucket to set xattr value outside A long time ago, xs->base is allocated a 4K size and all the contents in the bucket are copied to the it. Now we use ocfs2_xattr_bucket to abstract xattr bucket and xs->base is initialized to the start of the bu_bhs[0]. So xs->base + offset will overflow when the value root is stored outside the first block. Then why we can survive the xattr test by now? It is because we always read the bucket contiguously now and kernel mm allocate continguous memory for us. We are lucky, but we should fix it. So just get the right value root as other callers do. Signed-off-by: Tao Ma <tao.ma@oracle.com> Acked-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-03-12 16:46:09 -07:00
Tao Ma	74e77eb30d	ocfs2: Fix a bug found by sparse check. We need to use le32_to_cpu to test rec->e_cpos in ocfs2_dinode_insert_check. Signed-off-by: Tao Ma <tao.ma@oracle.com> Acked-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-03-12 16:46:01 -07:00
Tiger Yang	d9ae49d6e2	ocfs2: tweak to get the maximum inline data size with xattr Replace max_inline_data with max_inline_data_with_xattr to ensure it correct when xattr inlined. Signed-off-by: Tiger Yang <tiger.yang@oracle.com> Acked-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-03-12 16:45:46 -07:00
Tiger Yang	6c9fd1dc0a	ocfs2: reserve xattr block for new directory with inline data If this is a new directory with inline data, we choose to reserve the entire inline area for directory contents and force an external xattr block. Signed-off-by: Tiger Yang <tiger.yang@oracle.com> Acked-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-03-12 16:45:40 -07:00
Linus Torvalds	188de5ec56	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus * git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus: Squashfs: Valid filesystems are flagged as bad by the corrupted fs patch	2009-03-12 16:32:36 -07:00
Nick Piggin	7ef0d7377c	fs: new inode i_state corruption fix There was a report of a data corruption http://lkml.org/lkml/2008/11/14/121. There is a script included to reproduce the problem. During testing, I encountered a number of strange things with ext3, so I tried ext2 to attempt to reduce complexity of the problem. I found that fsstress would quickly hang in wait_on_inode, waiting for I_LOCK to be cleared, even though instrumentation showed that unlock_new_inode had already been called for that inode. This points to memory scribble, or synchronisation problme. i_state of I_NEW inodes is not protected by inode_lock because other processes are not supposed to touch them until I_LOCK (and I_NEW) is cleared. Adding WARN_ON(inode->i_state & I_NEW) to sites where we modify i_state revealed that generic_sync_sb_inodes is picking up new inodes from the inode lists and passing them to __writeback_single_inode without waiting for I_NEW. Subsequently modifying i_state causes corruption. In my case it would look like this: CPU0 CPU1 unlock_new_inode() __sync_single_inode() reg <- inode->i_state reg -> reg & ~(I_LOCK\|I_NEW) reg <- inode->i_state reg -> inode->i_state reg -> reg \| I_SYNC reg -> inode->i_state Non-atomic RMW on CPU1 overwrites CPU0 store and sets I_LOCK\|I_NEW again. Fix for this is rather than wait for I_NEW inodes, just skip over them: inodes concurrently being created are not subject to data integrity operations, and should not significantly contribute to dirty memory either. After this change, I'm unable to reproduce any of the added warnings or hangs after ~1hour of running. Previously, the new warnings would start immediately and hang would happen in under 5 minutes. I'm also testing on ext3 now, and so far no problems there either. I don't know whether this fixes the problem reported above, but it fixes a real problem for me. Cc: "Jorge Boncompte [DTI2]" <jorge@dti2.net> Reported-by: Adrian Hunter <ext-adrian.hunter@nokia.com> Cc: Jan Kara <jack@suse.cz> Cc: <stable@kernel.org> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-12 16:20:24 -07:00
Li Zefan	a3cfbb53b1	vfs: add missing unlock in sget() In sget(), destroy_super(s) is called with s->s_umount held, which makes lockdep unhappy. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-12 16:20:23 -07:00
Oleg Nesterov	e5bc49ba74	pipe_rdwr_fasync: fix the error handling to prevent the leak/crash If the second fasync_helper() fails, pipe_rdwr_fasync() returns the error but leaves the file on ->fasync_readers. This was always wrong, but since `233e70f422` "saner FASYNC handling on file close" we have the new problem. Because in this case setfl() doesn't set FASYNC bit, __fput() will not do ->fasync(0), and we leak fasync_struct with ->fa_file pointing to the freed file. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Andi Kleen <andi@firstfloor.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-12 16:20:23 -07:00
Trond Myklebust	9f4c899c0d	NFS: Fix the fix to Bugzilla #11061 , when IPv6 isn't defined... Stephen Rothwell reports: Today's linux-next build (powerpc ppc64_defconfig) failed like this: fs/built-in.o: In function `.nfs_get_client': client.c:(.text+0x115010): undefined reference to `.__ipv6_addr_type' Fix by moving the IPV6 specific parts of commit `d7371c41b0` ("Bug 11061, NFS mounts dropped") into the '#ifdef IPV6..." section. Also fix up a couple of formatting issues. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-12 14:51:32 -04:00
Theodore Ts'o	2842c3b544	ext4: Print the find_group_flex() warning only once This is a short-term warning, and even printk_ratelimit() can result in too much noise in system logs. So only print it once as a warning. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-03-12 12:20:01 -04:00
Phillip Lougher	363911d027	Squashfs: Valid filesystems are flagged as bad by the corrupted fs patch The corrupted filesystem patch added a check against zlib trying to output too much data in the presence of data corruption. This check triggered if zlib_inflate asked to be called again (Z_OK) with avail_out == 0 and no more output buffers available. This check proves to be rather dumb, as it incorrectly catches the case where zlib has generated all the output, but there are still input bytes to be processed. This patch does a number of things. It removes the original check and replaces it with code to not move to the next output buffer if there are no more output buffers available, relying on zlib to error if it wants an extra output buffer in the case of data corruption. It also replaces the Z_NO_FLUSH flag with the more correct Z_SYNC_FLUSH flag, and makes the error messages more understandable to non-technical users. Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk> Reported-by: Stefan Lippers-Hollmann <s.L-H@gmx.de>	2009-03-12 03:23:48 +00:00
Steve French	64cc2c6369	[CIFS] work around bug in Samba server handling for posix open Samba server (version 3.3.1 and earlier, and 3.2.8 and earlier) incorrectly required the O_CREAT flag on posix open (even when a file was not being created). This disables posix open (create is still ok) after the first attempt returns EINVAL (and logs an error, once, recommending that they update their server). Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-03-12 01:36:21 +00:00
Steve French	276a74a483	[CIFS] Use posix open on file open when server supports it Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-03-12 01:36:21 +00:00
Jeff Layton	fcc7c09d94	cifs: fix buffer format byte on NT Rename/hardlink Discovered at Connnectathon 2009... The buffer format byte and the pad are transposed in NT_RENAME calls (which are used to set hardlinks). Most servers seem to ignore this fact, but NetApp filers throw back an error due to this problem. This patch fixes it. CC: Stable <stable@kernel.org> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-03-12 01:36:21 +00:00
Steve French	0382457744	[CIFS] Add definitions for remoteably fsctl calls There are about 60 fsctl calls which Windows claims would be able to be sent remotely and handled by the server. This adds the #defines for them. A few of them look immediately useful, but need to also add the structure definitions for them so they can be sent as SMBs. Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-03-12 01:36:21 +00:00
Steve French	1adcb71092	[CIFS] add extra null attr check Although attr == NULL can not happen, this makes cifs_set_file_info safer in the future since it may not be obvious that the caller can not set attr to NULL. Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-03-12 01:36:21 +00:00
Steve French	4717bed680	[CIFS] fix build error Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-03-12 01:36:20 +00:00
Steve French	7fc8f4e95b	[CIFS] reopen file via newer posix open protocol operation if available If the network connection crashes, and we have to reopen files, preferentially use the newer cifs posix open protocol operation if the server supports it. Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-03-12 01:36:20 +00:00
Steve French	be652445fd	[CIFS] Add new nostrictsync cifs mount option to avoid slow SMB flush If this mount option is set, when an application does an fsync call then the cifs client does not send an SMB Flush to the server (to force the server to write all dirty data for this file immediately to disk), although cifs still sends all dirty (cached) file data to the server and waits for the server to respond to the write write. Since SMB Flush can be very slow, and some servers may be reliable enough (to risk delaying slightly flushing the data to disk on the server), turning on this option may be useful to improve performance for applications that fsync too much, at a small risk of server crash. If this mount option is not set, by default cifs will send an SMB flush request (and wait for a response) on every fsync call. Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-03-12 01:36:20 +00:00
Steve French	10e70afa75	[CIFS] DFS no longer experimental Also updates some DFS flag definitions Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-03-12 01:36:20 +00:00
Steve French	b298f22355	[CIFS] Send SMB flush in cifs_fsync In contrast to the now-obsolete smbfs, cifs does not send SMB_COM_FLUSH in response to an explicit fsync(2) to guarantee that all volatile data is written to stable storage on the server side, provided the server honors the request (which, to my knowledge, is true for Windows and Samba with 'strict sync' enabled). This patch modifies the cifs_fsync implementation to restore the fsync-behavior of smbfs by triggering SMB_COM_FLUSH after sending outstanding data on the client side to the server. Signed-off-by: Horst Reiterer <horst.reiterer@gmail.com> Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-03-12 01:36:20 +00:00
Linus Torvalds	0789d8fccb	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: only issues a cache flush on unmount if barriers are enabled xfs: prevent lockdep false positive in xfs_iget_cache_miss xfs: prevent kernel crash due to corrupted inode log format	2009-03-11 14:29:03 -07:00
OGAWA Hirofumi	3a95ea1155	Fix _fat_bmap() locking On swapon() path, it has already i_mutex. So, this uses i_alloc_sem instead of it. Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Reported-by: Laurent GUERBY <laurent@guerby.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-11 12:04:18 -07:00
Tom Talpey	a67d18f89f	NFS: load the rpc/rdma transport module automatically When mounting an NFS/RDMA server with the "-o proto=rdma" or "-o rdma" options, attempt to dynamically load the necessary "xprtrdma" client transport module. Doing so improves usability, while avoiding a static module dependency and any unnecesary resources. Signed-off-by: Tom Talpey <tmtalpey@gmail.com> Cc: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-11 14:37:56 -04:00
Trond Myklebust	e1ebfd33be	NFS: Kill the "defined but not used" compile error on nommu machines Bryan Wu reports that when compiling NFS on nommu machines he gets a "defined but not used" error on nfs_file_mmap(). The easiest fix is simply to get rid of the special casing in NFS, and just always call generic_file_mmap() to set up the file. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-11 14:37:54 -04:00
Trond Myklebust	72cb77f4a5	NFS: Throttle page dirtying while we're flushing to disk The following patch is a combination of a patch by myself and Peter Staubach. Trond: If we allow other processes to dirty pages while a process is doing a consistency sync to disk, we can end up never making progress. Peter: Attached is a patch which addresses a continuing problem with the NFS client generating out of order WRITE requests. While this is compliant with all of the current protocol specifications, there are servers in the market which can not handle out of order WRITE requests very well. Also, this may lead to sub-optimal block allocations in the underlying file system on the server. This may cause the read throughputs to be reduced when reading the file from the server. Peter: There has been a lot of work recently done to address out of order issues on a systemic level. However, the NFS client is still susceptible to the problem. Out of order WRITE requests can occur when pdflush is in the middle of writing out pages while the process dirtying the pages calls generic_file_buffered_write which calls generic_perform_write which calls balance_dirty_pages_rate_limited which ends up calling writeback_inodes which ends up calling back into the NFS client to writes out dirty pages for the same file that pdflush happens to be working with. Signed-off-by: Peter Staubach <staubach@redhat.com> [modification by Trond to merge the two similar patches] Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-11 14:10:30 -04:00
Trond Myklebust	fb8a1f11b6	NFS: cleanup - remove struct nfs_inode->ncommit Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-11 14:10:29 -04:00
Trond Myklebust	a65318bf3a	NFSv4: Simplify some cache consistency post-op GETATTRs Certain asynchronous operations such as write() do not expect (or care) that other metadata such as the file owner, mode, acls, ... change. All they want to do is update and/or check the change attribute, ctime, and mtime. By skipping the file owner and group update, we also avoid having to do a potential idmapper upcall for these asynchronous RPC calls. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-11 14:10:28 -04:00
Trond Myklebust	69aaaae18f	NFSv4: A referral is assumed to always point to a directory. Fix a bug whereby we would fail to create a mount point for a referral. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-11 14:10:28 -04:00
Trond Myklebust	409924e4c9	NFSv4: Make decode_getfattr() set fattr->valid to reflect what was decoded Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-11 14:10:27 -04:00
Trond Myklebust	f26c7a7887	NFSv4: Clean up decode_getfattr() Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-11 14:10:26 -04:00
Trond Myklebust	bca794785c	NFS: Fix the type of struct nfs_fattr->mode There is no point in using anything other than umode_t, since we copy the content pretty much directly into inode->i_mode. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-11 14:10:26 -04:00
Trond Myklebust	1ca277d88d	NFS: Shrink the struct nfs_fattr We don't need the bitmap[] field anymore, since the 'valid' field tells us all we need to know about which attributes were filled in... Also move the pre-op attributes in order to improve the structure packing. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-11 14:10:25 -04:00
Trond Myklebust	9e6e70f8d8	NFSv4: Support NFSv4 optional attributes in the struct nfs_fattr Currently, filling struct nfs_fattr is more or less an all or nothing operation, since NFSv2 and NFSv3 have only mandatory attributes. In NFSv4, some attributes are optional, and so we may simply not be able to fill in those fields. Furthermore, NFSv4 allows you to specify which attributes you are interested in retrieving, thus permitting you to optimise away retrieval of attributes that you know will no change... Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-11 14:10:24 -04:00

... 2 3 4 5 6 ...

13209 Commits