Commit Graph

3937 Commits

Author SHA1 Message Date
Filipe Manana
21543baddc Btrfs: fix race when updating existing ref head
While we update an existing ref head's extent_op, we're not holding
its spinlock, so while we're updating its extent_op contents (key,
flags) we can have a task running __btrfs_run_delayed_refs() that
holds the ref head's lock and sets its extent_op to NULL right after
the task updating the ref head just checked its extent_op was not NULL.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-03-20 17:15:28 -07:00
Qu Wenruo
c3a468915a btrfs: Add trace for btrfs_workqueue alloc/destroy
Since a btrfs_workqueue is mostly printed as a pointer address, add trace
events for btrfs_workqueue alloc/destroy to make analysis easier.
This makes it possible to determine the workqueue that a given work belongs
to (by comparing the wq pointer address with the alloc trace event).

Signed-off-by: Qu Wenruo <quenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-03-20 17:15:28 -07:00
Filipe Manana
f094c9bd3e Btrfs: less fs tree lock contention when using autodefrag
When finding new extents during an autodefrag, don't do so many fs tree
lookups to find an extent with a size smaller than the target threshold.
Instead, after each fs tree forward search immediately unlock upper
levels and process the entire leaf while holding a read lock on the leaf,
since our leaf processing is very fast.
This reduces lock contention, allowing for higher concurrency when other
tasks want to write/update items related to other inodes in the fs tree,
as we're not holding read locks on upper tree levels while processing the
leaf and we do fewer tree searches.

Test:

    sysbench --test=fileio --file-num=512 --file-total-size=16G \
       --file-test-mode=rndrw --num-threads=32 --file-block-size=32768 \
       --file-rw-ratio=3 --file-io-mode=sync --max-time=1800 \
       --max-requests=10000000000 [prepare|run]

(filesystem mounted with -o autodefrag, averages of 5 runs)

Before this change: 58.852Mb/sec throughput, read 77.589Gb, written 25.863Gb
After this change:  63.034Mb/sec throughput, read 83.102Gb, written 27.701Gb
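
As a quick check of the figures above, the relative gain can be computed
directly from the two reported averages. This is only a sketch; the helper
name percent_gain is made up for illustration and is not part of the patch.

```python
def percent_gain(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

# Average throughput reported above, in Mb/sec (5 runs each).
throughput_gain = percent_gain(58.852, 63.034)
print(f"throughput gain: {throughput_gain:.1f}%")
```

The result is a gain of roughly 7% on this particular machine and workload.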

Test machine: quad core intel i5-3570K, 32Gb of RAM, SSD.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-03-20 17:15:27 -07:00
Guangyu Sun
72de6b5393 Btrfs: return EPERM when deleting a default subvolume
The error message is confusing:

 # btrfs sub delete /mnt/mysub/
 Delete subvolume '/mnt/mysub'
 ERROR: cannot delete '/mnt/mysub' - Directory not empty

The error message does not make sense to me: It's not about deleting a
directory but it's a subvolume, and it doesn't matter if the subvolume is
empty or not.

Maybe EPERM is more appropriate in this case, combined with an explanatory
kernel log message. (e.g. "subvolume with ID 123 cannot be deleted because
it is configured as default subvolume.")

Reported-by: Koen De Wit <koen.de.wit@oracle.com>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-03-20 17:15:27 -07:00
Filipe Manana
ef66af101a Btrfs: add missing kfree in btrfs_destroy_workqueue
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-03-20 17:15:27 -07:00
Filipe Manana
308d9800b2 Btrfs: cache extent states in defrag code path
When locking file ranges in the inode's io_tree, cache the first
extent state that belongs to the target range, so that when unlocking
the range we don't need to search the io_tree again. This reduces CPU
time and therefore means holding the io_tree's lock for a shorter
period.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-03-20 17:15:27 -07:00
Josef Bacik
3bbb24b20a Btrfs: fix deadlock with nested trans handles
Zach found this deadlock that would happen like this:

btrfs_end_transaction <- reduce trans->use_count to 0
  btrfs_run_delayed_refs
    btrfs_cow_block
      find_free_extent
        btrfs_start_transaction <- increase trans->use_count to 1
          allocate chunk
        btrfs_end_transaction <- decrease trans->use_count to 0
          btrfs_run_delayed_refs
            lock tree block we are cowing above ^^

We need to only decrease trans->use_count if it is above 1, otherwise leave it
alone.  This will make nested trans be the only ones who decrease their added
ref, and will let us get rid of the trans->use_count++ hack if we have to commit
the transaction.  Thanks,
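
The counting rule above can be illustrated with a toy model. This is a
sketch under stated assumptions, not the kernel code: TransHandle,
end_transaction and the decrement_first switch are hypothetical names, and
the "delayed refs" step is reduced to a counter plus one nested start/end
pair.

```python
class TransHandle:
    """Toy transaction handle; only the use count is modeled."""
    def __init__(self):
        self.use_count = 1
        self.outermost_runs = 0  # times the outermost end path was entered

def end_transaction(trans, decrement_first, nest=True):
    if decrement_first:
        # Old behaviour: always decrement, then decide.
        trans.use_count -= 1
        if trans.use_count > 0:
            return "nested"
    else:
        # Behaviour described above: only decrement when above 1.
        if trans.use_count > 1:
            trans.use_count -= 1
            return "nested"
    # Outermost path: running delayed refs may start and end a nested handle.
    trans.outermost_runs += 1
    if nest:
        trans.use_count += 1  # nested start reuses the same handle
        end_transaction(trans, decrement_first, nest=False)
    return "outermost"

old = TransHandle()
end_transaction(old, decrement_first=True)
new = TransHandle()
end_transaction(new, decrement_first=False)
old_runs, new_runs = old.outermost_runs, new.outermost_runs
```

In this model the old rule enters the outermost path twice (the re-entry
that leads to the deadlock), while the new rule enters it once.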

cc: stable@vger.kernel.org
Reported-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Tested-by: Zach Brown <zab@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-03-20 17:15:27 -07:00
Miao Xie
573bfb72f7 Btrfs: fix possible empty list access when flushing the delalloc inodes
We didn't have a lock to protect the access to the delalloc inodes list, so
we might access an empty delalloc inodes list if someone started flushing
delalloc inodes, because the delalloc inodes were temporarily moved to
another list. Fix it by wrapping the access with a lock.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:29 -04:00
Miao Xie
31f3d255c6 Btrfs: split the global ordered extents mutex
When we create a snapshot, we only need to wait for the ordered extents of
the source fs/file root. But because a global mutex protects the ordered
extents lists (to avoid accessing an empty list), we had to wait whenever
someone else held the mutex to access the ordered extents list of another
fs/file root.

This patch splits the global mutex, so that every fs/file root now has
its own mutex to protect its own list.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:28 -04:00
Miao Xie
6c255e67ce Btrfs: don't flush all delalloc inodes when we doesn't get s_umount lock
We needn't flush all delalloc inodes when we don't get the s_umount lock,
or we would make the tasks wait for a long time.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:27 -04:00
Miao Xie
24af7dd188 Btrfs: reclaim delalloc metadata more aggressively
generic/074 in xfstests sometimes failed with an ENOSPC error. The reason
is that we reclaimed only the space we needed from the space reserved for
delalloc, and then tried to reserve it. But if some other task did a
no-flush reservation between the reclamation and the reservation,

        Task1                           Task2
        shrink_delalloc()
        reclaim 1 block
        (The space that can
         be reserved now is 1
         block)
                                        do no-flush reservation
                                        reserve 1 block
                                        (The space that can
                                         be reserved now is 0
                                         blocks)
        reserving 1 block failed

the reservation of Task1 failed, although there would have been enough
space to reserve if we had reclaimed more space beforehand.

Fix this problem by reclaiming the reserved delalloc metadata space more
aggressively.
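
The interleaving above can be replayed with a small deterministic model.
This is only a sketch: SpaceInfo, task1_reserve and the block counts are
hypothetical, and the real reclaim path is far more involved.

```python
class SpaceInfo:
    """Toy counter of blocks that can currently be reserved."""
    def __init__(self):
        self.free_blocks = 0
        self.delalloc_reserved = 4  # blocks that flushing could give back

    def reclaim(self, nr):
        nr = min(nr, self.delalloc_reserved)
        self.delalloc_reserved -= nr
        self.free_blocks += nr

    def try_reserve(self, nr):
        if self.free_blocks >= nr:
            self.free_blocks -= nr
            return True
        return False

def task1_reserve(space, needed, reclaim_nr):
    space.reclaim(reclaim_nr)         # Task1: shrink_delalloc()
    space.try_reserve(1)              # Task2: no-flush reservation sneaks in
    return space.try_reserve(needed)  # Task1: tries its own reservation

exact = task1_reserve(SpaceInfo(), needed=1, reclaim_nr=1)
aggressive = task1_reserve(SpaceInfo(), needed=1, reclaim_nr=2)
```

Reclaiming exactly what is needed loses the race, while reclaiming more
than needed leaves room for both tasks.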

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:26 -04:00
Miao Xie
0424c54897 Btrfs: remove unnecessary lock in may_commit_transaction()
The reason is:
- The per-cpu counter has its own lock to protect itself.
- Here we needn't get an exact value.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:25 -04:00
Miao Xie
b88935bf98 Btrfs: remove the unnecessary flush when preparing the pages
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:25 -04:00
Miao Xie
41bd9ca459 Btrfs: just do dirty page flush for the inode with compression before direct IO
As the comment in btrfs_direct_IO says, only the compressed pages need to be
flushed again to make sure they are on disk; ordinary pages do not. So add
an if statement that checks whether the inode has compressed pages and
skips the flush if it has none.

In addition, to prevent the write ranges from intersecting, we need to wait
for the running ordered extents. But the current code waits for them twice:
once before the direct IO starts (in btrfs_wait_ordered_range()), and once
before we get the blocks. The first wait is unnecessary: because we can do
direct IO without holding i_mutex, intersecting ordered extents may still
appear during the direct IO, so the first wait cannot prevent the problem.
So use filemap_fdatawrite_range() instead of btrfs_wait_ordered_range() to
remove the first wait.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:24 -04:00
Miao Xie
af7a65097b Btrfs: wake up the tasks that wait for the io earlier
The tasks that wait for the IO_DONE flag only care about the IO of the dirty
pages, so it is better to wake them up immediately after all the pages are
written, rather than after the whole IO process completes.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:23 -04:00
Miao Xie
8b9d83cd6b Btrfs: fix early enospc due to the race of the two ordered extent wait
btrfs_wait_ordered_roots() moves all the list entries to a new list,
and then deals with them one by one. But if another task invokes this
function at that time, it gets an empty list, which makes the ENOSPC
error happen earlier. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:22 -04:00
Miao Xie
8257b2dc3c Btrfs: introduce btrfs_{start, end}_nocow_write() for each subvolume
If a snapshot is created after a nocow write but before the dirty data is
flushed, we fail to flush the dirty data because there is no space.

So we must keep track of when nocow write operations start and when they
end; while there are nocow writers, the snapshot creators must wait. To
implement this, I introduce btrfs_{start, end}_nocow_write(),
which is similar to mnt_{want,drop}_write().

These two functions are only used for nocow file write operations.
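
The start/end bracketing described above can be sketched as a writer count
that the snapshot creator checks. This is a single-threaded model with
hypothetical names (Root, nocow_writers, snapshot_pending, can_snapshot);
refusing new writers once a snapshot is pending is an assumption of this
sketch, and real code would need atomic counters and memory barriers.

```python
class Root:
    """Toy stand-in for a subvolume root; field names are illustrative."""
    def __init__(self):
        self.nocow_writers = 0
        self.snapshot_pending = False

def start_nocow_write(root):
    # Assumption: no new nocow writers once a snapshot has been announced.
    if root.snapshot_pending:
        return False
    root.nocow_writers += 1
    return True

def end_nocow_write(root):
    root.nocow_writers -= 1

def can_snapshot(root):
    # Announce the snapshot, then report whether it may proceed yet.
    root.snapshot_pending = True
    return root.nocow_writers == 0

root = Root()
started = start_nocow_write(root)       # a nocow write is in flight
blocked = not can_snapshot(root)        # the snapshot creator must wait
end_nocow_write(root)                   # the write finishes
ready = can_snapshot(root)              # now the snapshot may proceed
```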

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:22 -04:00
Qu Wenruo
52483bc26f btrfs: Add ftrace for btrfs_workqueue
Add ftrace for btrfs_workqueue for further workqueue tuning.
This patch needs to be applied after the workqueue replace patchset.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:21 -04:00
Qu Wenruo
6db8914f97 btrfs: Cleanup the btrfs_workqueue related function type
The new btrfs_workqueue still uses open-coded function definitions;
this patch changes them to the btrfs_func_t type, which is much the
same as in the kernel workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:20 -04:00
Liu Bo
2131bcd38b Btrfs: add readahead for send_write
Btrfs send reads data from disk and then writes to a stream via pipe or
a file via flush.

Currently we read one page at a time, so every page results
in a disk read, which is not friendly to disks, esp. HDD.  Given that,
performance can be gained by adding readahead for those pages.

Here is a quick test:
$ btrfs subvolume create send
$ xfs_io -f -c "pwrite 0 1G" send/foobar
$ btrfs subvolume snap -r send ro
$ time btrfs send ro -f /dev/null

           w/o             w
real    1m37.527s       0m9.097s
user    0m0.122s        0m0.086s
sys     0m53.191s       0m12.857s

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:19 -04:00
Liu Bo
a4d96d6254 Btrfs: share the same code for __record_{new,deleted}_ref
This has no functional change; it only factors out the common part of the
two functions and makes it shared.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:19 -04:00
Filipe Manana
fcbd2154d1 Btrfs: avoid unnecessary utimes update in incremental send
When we're finishing processing of an inode, if we're dealing with a
directory inode that has a pending move/rename operation, we don't
need to send a utimes update instruction to the send stream, as we'll
do it later after doing the move/rename operation. Therefore we save
some time here building paths and doing btree lookups.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:18 -04:00
Filipe Manana
e2127cf008 Btrfs: make defrag not fragment files when using prealloc extents
When using prealloc extents, a file defragment operation may actually
fragment the file and increase the amount of data space used by the file.
This change fixes that behaviour.

Example:

$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt
$ cd /mnt
$ xfs_io -f -c 'falloc 0 1048576' foobar && sync
$ xfs_io -c 'pwrite -S 0xff -b 100000 5000 100000' foobar
$ xfs_io -c 'pwrite -S 0xac -b 100000 200000 100000' foobar
$ xfs_io -c 'pwrite -S 0xe1 -b 100000 900000 100000' foobar && sync

Before defragmenting the file:

$ btrfs filesystem df /mnt
Data, single: total=8.00MiB, used=1.25MiB
System, DUP: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, DUP: total=1.00GiB, used=112.00KiB
Metadata, single: total=8.00MiB, used=0.00

$ btrfs-debug-tree /dev/sdb3
(...)
	item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
		prealloc data disk byte 12845056 nr 1048576
		prealloc data offset 0 nr 4096
	item 7 key (257 EXTENT_DATA 4096) itemoff 15757 itemsize 53
		extent data disk byte 12845056 nr 1048576
		extent data offset 4096 nr 102400 ram 1048576
		extent compression 0
	item 8 key (257 EXTENT_DATA 106496) itemoff 15704 itemsize 53
		prealloc data disk byte 12845056 nr 1048576
		prealloc data offset 106496 nr 90112
	item 9 key (257 EXTENT_DATA 196608) itemoff 15651 itemsize 53
		extent data disk byte 12845056 nr 1048576
		extent data offset 196608 nr 106496 ram 1048576
		extent compression 0
	item 10 key (257 EXTENT_DATA 303104) itemoff 15598 itemsize 53
		prealloc data disk byte 12845056 nr 1048576
		prealloc data offset 303104 nr 593920
	item 11 key (257 EXTENT_DATA 897024) itemoff 15545 itemsize 53
		extent data disk byte 12845056 nr 1048576
		extent data offset 897024 nr 106496 ram 1048576
		extent compression 0
	item 12 key (257 EXTENT_DATA 1003520) itemoff 15492 itemsize 53
		prealloc data disk byte 12845056 nr 1048576
		prealloc data offset 1003520 nr 45056
(...)

Now defragmenting the file results in more data space used than before:

$ btrfs filesystem defragment -f foobar && sync
$ btrfs filesystem df /mnt
Data, single: total=8.00MiB, used=1.55MiB
System, DUP: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, DUP: total=1.00GiB, used=112.00KiB
Metadata, single: total=8.00MiB, used=0.00

And the corresponding file extent items are now no longer perfectly sequential
as before, and we're now needlessly using more space from data block groups:

$ btrfs-debug-tree /dev/sdb3
(...)
	item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
		extent data disk byte 12845056 nr 1048576
		extent data offset 0 nr 4096 ram 1048576
		extent compression 0
	item 7 key (257 EXTENT_DATA 4096) itemoff 15757 itemsize 53
		extent data disk byte 13893632 nr 102400
		extent data offset 0 nr 102400 ram 102400
		extent compression 0
	item 8 key (257 EXTENT_DATA 106496) itemoff 15704 itemsize 53
		extent data disk byte 12845056 nr 1048576
		extent data offset 106496 nr 90112 ram 1048576
		extent compression 0
	item 9 key (257 EXTENT_DATA 196608) itemoff 15651 itemsize 53
		extent data disk byte 13996032 nr 106496
		extent data offset 0 nr 106496 ram 106496
		extent compression 0
	item 10 key (257 EXTENT_DATA 303104) itemoff 15598 itemsize 53
		prealloc data disk byte 12845056 nr 1048576
		prealloc data offset 303104 nr 593920
	item 11 key (257 EXTENT_DATA 897024) itemoff 15545 itemsize 53
		extent data disk byte 14102528 nr 106496
		extent data offset 0 nr 106496 ram 106496
		extent compression 0
	item 12 key (257 EXTENT_DATA 1003520) itemoff 15492 itemsize 53
		extent data disk byte 12845056 nr 1048576
		extent data offset 1003520 nr 45056 ram 1048576
		extent compression 0
(...)

With this change, the above example no longer causes allocation of new data
space or changes the sequentiality of the file extents; that is, defragment
has no effect, leaving all extent items pointing to the extent starting at
disk byte 12845056.

In a 20Gb filesystem I had, mounted with the autodefrag option and holding
20 files of 400Mb each, each initially consisting of a single prealloc
extent of 400Mb, random writes happening at a low rate led to a total of
over 17Gb of data space used, not far from eventually reaching an ENOSPC
state.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:17 -04:00
Filipe Manana
dec8ef9055 Btrfs: correctly flush data on defrag when compression is enabled
When the defrag flag BTRFS_DEFRAG_RANGE_START_IO is set and compression
is enabled, we weren't flushing completely, as writing compressed extents
is a two-step process: one step to compress the data and another to write
the compressed data to disk.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:16 -04:00
Qu Wenruo
d458b0540e btrfs: Cleanup the "_struct" suffix in btrfs_workequeue
Since the "_struct" suffix was mainly used to distinguish the newly
created btrfs_work from the original one, there is no need to keep the
suffix now that all btrfs_workers have been changed into
btrfs_workqueue.

This patch also fixes some code whose style had changed because of the
overly long "_struct" suffix.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:16 -04:00
Qu Wenruo
a046e9c88b btrfs: Cleanup the old btrfs_worker.
Since every btrfs_worker has been replaced with the newly created
btrfs_workqueue, the old code can easily be removed.

Signed-off-by: Quwenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:15 -04:00
Qu Wenruo
0339ef2f42 btrfs: Replace fs_info->scrub_* workqueue with btrfs_workqueue.
Replace the fs_info->scrub_* with the newly created
btrfs_workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:14 -04:00
Qu Wenruo
fc97fab0ea btrfs: Replace fs_info->qgroup_rescan_worker workqueue with btrfs_workqueue.
Replace the fs_info->qgroup_rescan_worker with the newly created
btrfs_workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:13 -04:00
Qu Wenruo
5b3bc44e2e btrfs: Replace fs_info->delayed_workers workqueue with btrfs_workqueue.
Replace the fs_info->delayed_workers with the newly created
btrfs_workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:12 -04:00
Qu Wenruo
dc6e320998 btrfs: Replace fs_info->fixup_workers workqueue with btrfs_workqueue.
Replace the fs_info->fixup_workers with the newly created
btrfs_workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:12 -04:00
Qu Wenruo
736cfa15e8 btrfs: Replace fs_info->readahead_workers workqueue with btrfs_workqueue.
Replace the fs_info->readahead_workers with the newly created
btrfs_workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:11 -04:00
Qu Wenruo
e66f0bb144 btrfs: Replace fs_info->cache_workers workqueue with btrfs_workqueue.
Replace the fs_info->cache_workers with the newly created
btrfs_workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:10 -04:00
Qu Wenruo
d05a33ac26 btrfs: Replace fs_info->rmw_workers workqueue with btrfs_workqueue.
Replace the fs_info->rmw_workers with the newly created
btrfs_workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:09 -04:00
Qu Wenruo
fccb5d86d8 btrfs: Replace fs_info->endio_* workqueue with btrfs_workqueue.
Replace the fs_info->endio_* workqueues with the newly created
btrfs_workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:08 -04:00
Qu Wenruo
a44903abe9 btrfs: Replace fs_info->flush_workers with btrfs_workqueue.
Replace the fs_info->flush_workers with the newly created
btrfs_workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:07 -04:00
Qu Wenruo
a8c93d4ef6 btrfs: Replace fs_info->submit_workers with btrfs_workqueue.
Much like the fs_info->workers, replace the fs_info->submit_workers
with the same btrfs_workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:07 -04:00
Qu Wenruo
afe3d24267 btrfs: Replace fs_info->delalloc_workers with btrfs_workqueue
Much like the fs_info->workers, replace the fs_info->delalloc_workers
with the same btrfs_workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:06 -04:00
Qu Wenruo
5cdc7ad337 btrfs: Replace fs_info->workers with btrfs_workqueue.
Use the newly created btrfs_workqueue_struct to replace the original
fs_info->workers.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:05 -04:00
Qu Wenruo
0bd9289c28 btrfs: Add threshold workqueue based on kernel workqueue
The original btrfs_workers has thresholding functions to dynamically
create or destroy kthreads.

Though the kernel workqueue has no such function, because workers are not
created manually, we can still use workqueue_set_max_active() to
simulate the behavior, mainly to achieve better HDD performance by
setting a high threshold on submit_workers.
(Sadly, no resources can be saved.)

So in this patch, extra workqueue pending counters are introduced to
dynamically change the max active of each btrfs_workqueue_struct, hoping
to restore the behavior of the original thresholding function.

Also, workqueue_set_max_active() uses a mutex to protect workqueue_struct
and is not meant to be called too frequently, so a new interval
mechanism is applied that only calls workqueue_set_max_active() after
a certain number of works have been queued, hoping to balance random and
sequential performance on HDD.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:04 -04:00
Qu Wenruo
1ca08976ae btrfs: Add high priority workqueue support for btrfs_workqueue_struct
Add high priority support to btrfs_workqueue.

This is implemented by embedding a btrfs_workqueue into a
btrfs_workqueue and using some helper functions to distinguish the normal
priority wq from the high priority wq.
So the high priority wq is completely independent from the normal
workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:03 -04:00
Qu Wenruo
08a9ff3264 btrfs: Added btrfs_workqueue_struct implemented ordered execution based on kernel workqueue
Use the kernel workqueue to implement a new btrfs_workqueue_struct, which
has the ordered execution feature of the btrfs_worker.

The func is executed concurrently, and the
ordered_func/ordered_free is executed in the sequence in which the works
were queued, after the corresponding func is done.

The new btrfs_workqueue works much like the original one: one workqueue
for normal work and a list for ordered work.
When a work is queued, the ordered work is added to the list and a helper
function is queued into the workqueue.
The helper function executes a normal work and then checks and executes as
many ordered works as possible in the sequence they were queued.

In this patch, the high priority workqueue and thresholding are not added yet.
They will be added in the following patches.
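
The ordering rule described above can be modeled in a few lines. This is a
sketch only: OrderedQueue, queue and func_done are hypothetical names, the
concurrent func stage is replaced by explicit "done" notifications, and no
locking is shown.

```python
from collections import deque

class OrderedQueue:
    """Toy model: func may finish in any order, ordered parts run in queue order."""
    def __init__(self):
        self.pending = deque()   # works in the order they were queued
        self.ordered_log = []    # order in which the ordered parts ran

    def queue(self, name):
        work = {"name": name, "done": False}
        self.pending.append(work)
        return work

    def func_done(self, work):
        work["done"] = True
        # Run the ordered part of every finished work at the head of the list.
        while self.pending and self.pending[0]["done"]:
            self.ordered_log.append(self.pending.popleft()["name"])

wq = OrderedQueue()
a, b, c = wq.queue("a"), wq.queue("b"), wq.queue("c")
wq.func_done(c)   # c finished first, but a and b are still ahead of it
wq.func_done(a)   # a is at the head, so only a's ordered part runs
wq.func_done(b)   # b completes, which also releases c
```

Even though c finishes first, its ordered part only runs once a and b have
completed theirs.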

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:03 -04:00
Qu Wenruo
f5961d41d7 btrfs: Cleanup the unused struct async_sched.
The struct async_sched is not used by any code and can be removed.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fusionio.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:02 -04:00
Liu Bo
644d1940ab Btrfs: skip search tree for REG files
It is really unnecessary to search tree again for @gen, @mode and @rdev
in the case of REG inodes' creation, as we've got btrfs_inode_item in sctx,
and @gen, @mode and @rdev can easily be fetched.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:01 -04:00
Miao Xie
7b2b70851f Btrfs: fix preallocate vs double nocow write
We cannot release the reserved metadata space for the first write if we
find that the write position is preallocated. The kernel might write
the data to disk before we do the second write but after the can-nocow
check, so if we released the space for the first write, we might fail to
update the metadata because of no space.

Fix this problem by ending the nocow write if there is dirty data in the
range whose space is preallocated.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:00 -04:00
Miao Xie
c933956ddf Btrfs: fix wrong lock range and write size in check_can_nocow()
The write range may not be sector-aligned, for example:

       |--------|--------|      <- write range, sector-unaligned, size: 2 blocks
  |--------|--------|--------|  <- correct lock range, size: 3 blocks

But the old code used the size of the write range to calculate the lock
range directly, without considering the offset, so we would get a wrong
lock range:

       |--------|--------|      <- write range, sector-unaligned, size: 2 blocks
  |--------|--------|           <- wrong lock range, size: 2 blocks

Besides that, the old code had the same problem when calculating
the real write size. Correct both.
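
The difference between the two lock ranges can be worked through with
concrete numbers. This is a sketch, not the patch itself: the function
names and the 4096-byte sector size are assumptions chosen for
illustration.

```python
def round_down(x, align):
    return x - (x % align)

def round_up(x, align):
    return round_down(x + align - 1, align)

def lock_range(pos, write_bytes, sectorsize):
    """Inclusive [start, end] byte range covering every sector the write touches."""
    start = round_down(pos, sectorsize)
    end = round_up(pos + write_bytes, sectorsize) - 1
    return start, end

def naive_lock_range(pos, write_bytes, sectorsize):
    """The mistake described above: the offset into the first sector is ignored."""
    start = round_down(pos, sectorsize)
    return start, start + write_bytes - 1

# A 2-block write that starts in the middle of a sector.
correct = lock_range(2048, 8192, 4096)
wrong = naive_lock_range(2048, 8192, 4096)
```

Here the correct range spans bytes 0 to 12287 (3 blocks), while the naive
one stops at byte 8191 (2 blocks) and leaves the tail of the write
unlocked.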

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:00 -04:00
David Sterba
9c9ca00bd3 btrfs: send: simplify allocation code in fs_path_ensure_buf
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:16:59 -04:00
David Sterba
1b2782c8ed btrfs: send: fix old buffer length in fs_path_ensure_buf
In "btrfs: send: lower memory requirements in common case" the code to
save the old_buf_len was incorrectly moved to a wrong place and broke
the original logic.

Reported-by: Filipe David Manana <fdmanana@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Filipe David Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:16:58 -04:00
Filipe Manana
176840b3aa Btrfs: more efficient btrfs_drop_extent_cache
While dropping extent map structures from the extent cache that cover our
target range, we would remove each extent map structure from the red black
tree and then add either 1 or 2 new extent map structures if the former
extent map covered sections outside our target range.

This change simply attempts to replace the existing extent map structure
with a new one that covers the subsection we're not interested in, instead
of doing a red black remove operation followed by an insertion operation.

The number of elements in an inode's extent map tree can get very high for large
files under random writes. For example, while running the following test:

    sysbench --test=fileio --file-num=1 --file-total-size=10G \
        --file-test-mode=rndrw --num-threads=32 --file-block-size=32768 \
        --max-requests=500000 --file-rw-ratio=2 [prepare|run]

I captured the following histogram of the number of extent_map items
in the red black tree while that test was running:

    Count: 122462
    Range:  1.000 - 172231.000; Mean: 96415.831; Median: 101855.000; Stddev: 49700.981
    Percentiles:  90th: 160120.000; 95th: 166335.000; 99th: 171070.000
       1.000 -    5.231:   452 |
       5.231 -  187.392:    87 |
     187.392 -  585.911:   206 |
     585.911 - 1827.438:   623 |
    1827.438 - 5695.245:  1962 #
    5695.245 - 17744.861:  6204 ####
   17744.861 - 55283.764: 21115 ############
   55283.764 - 172231.000: 91813 #####################################################

Benchmark:

    sysbench --test=fileio --file-num=1 --file-total-size=10G --file-test-mode=rndwr \
        --num-threads=64 --file-block-size=32768 --max-requests=0 --max-time=60 \
        --file-io-mode=sync --file-fsync-freq=0 [prepare|run]

Before this change: 122.1Mb/sec
After this change:  125.07Mb/sec
(averages of 5 test runs)

Test machine: quad core intel i5-3570K, 32Gb of ram, SSD

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:16:57 -04:00
Filipe Manana
f2071b2155 Btrfs: more efficient split extent state insertion
When we split an extent state there's no need to start the rbtree search
from the root node - we can start it from the original extent state node,
since we would end up in its subtree if we do the search starting at the
root node anyway.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:16:57 -04:00
Filipe Manana
cbc0e9287d Btrfs: remove unneeded field / smaller extent_map structure
We don't need to have an unsigned int field in the extent_map struct
to tell us whether the extent map is in the inode's extent_map tree or
not. We can use the rb_node struct field and the RB_CLEAR_NODE and
RB_EMPTY_NODE macros to achieve the same task.

This reduces sizeof(struct extent_map) from 152 bytes to 144 bytes (on a
64-bit system).
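
The idea behind RB_CLEAR_NODE and RB_EMPTY_NODE is that a node whose
parent link points at itself is known to be in no tree, so no separate
flag is needed. A minimal model of that convention follows; the class and
function names are illustrative, and a real rb_node also packs the color
into the parent word.

```python
class RBNode:
    """Only the parent link is modeled here."""
    def __init__(self):
        self.parent = None

def rb_clear_node(node):
    # Mark the node as "not in any tree" by pointing its parent at itself.
    node.parent = node

def rb_empty_node(node):
    return node.parent is node

em = RBNode()
rb_clear_node(em)                 # done once when the structure is set up
not_in_tree = rb_empty_node(em)   # nothing has linked it into a tree yet
em.parent = None                  # linking as a tree root overwrites the marker
in_tree = not rb_empty_node(em)
```

Note that the node must be cleared explicitly at setup time, since a
freshly zeroed node does not point at itself.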

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:16:56 -04:00