kernel-ark

Author	SHA1	Message	Date
Tejun Heo	c5cc2070b4	blk-throttle: add throtl_qnode for dispatch fairness With flat hierarchy, there's only single level of dispatching happening and fairness beyond that point is the responsibility of the rest of the block layer and driver, which usually works out okay; however, with the planned hierarchy support, service_queue->bio_lists[] can be filled up by bios from a single source. While the limits would still be honored, it'd be very easy to starve IOs from siblings or children. To avoid such starvation, this patch implements throtl_qnode and converts service_queue->bio_lists[] to lists of per-source qnodes which in turn contains the bio's. For example, when a bio is dispatched from a child group, the bio doesn't get queued on ->bio_lists[] directly but it first gets queued on the group's qnode which in turn gets queued on service_queue->queued[]. When dispatching for the upper level, the ->queued[] list is consumed in round-robing order so that the dispatch windows is consumed fairly by all IO sources. There are two ways a bio can come to a throtl_grp - directly queued to the group or dispatched from a child. For the former throtl_grp->qnode_on_self[rw] is used. For the latter, the child's ->qnode_on_parent[rw]. Note that this means that the child which is contributing a bio to its parent should stay pinned until all its bios are dispatched to its grand-parent. This patch moves blkg refcnting from bio add/remove spots to qnode activation/deactivation so that the blkg containing an active qnode is always pinned. As child pins the parent, this is sufficient for keeping the relevant sub-tree pinned while bios are in flight. The starvation issue was spotted by Vivek Goyal. v2: The original patch used the same throtl_grp->qnode_on_self/parent for reads and writes causing RWs to be queued incorrectly if there already are outstanding IOs in the other direction. They should be throtl_grp->qnode_on_self/parent[2] so that READs and WRITEs can use different qnodes. Spotted by Vivek Goyal. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-05-14 13:52:38 -07:00
Tejun Heo	2e48a530a3	blk-throttle: make throtl_pending_timer_fn() ready for hierarchy throtl_pending_timer_fn() currently assumes that the parent_sq is the top level one and the bio's dispatched are ready to be issued; however, this assumption will be wrong with proper hierarchy support. This patch makes the following changes to make throtl_pending_timer_fn() ready for hiearchy. * If the parent_sq isn't the top-level one, update the parent throtl_grp's dispatch time and schedule the next dispatch as necessary. If the parent's dispatch time is now, repeat the function for the parent throtl_grp. * If the parent_sq is the top-level one, kick issue work_item as before. * The debug message printed by throtl_log() now prints out the service_queue's nr_queued[] instead of the total nr_queued as the latter becomes uninteresting and misleading with hierarchical dispatch. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-05-14 13:52:38 -07:00
Tejun Heo	6bc9c2b464	blk-throttle: make tg_dispatch_one_bio() ready for hierarchy tg_dispatch_one_bio() currently assumes that the parent_sq is the top level one and the bio being dispatched is ready to be issued; however, this assumption will be wrong with proper hierarchy support. This patch makes the following changes to make tg_dispatch_on_bio() ready for hiearchy. * throtl_data->nr_queued[] is incremented in blk_throtl_bio() instead of throtl_add_bio_tg() so that throtl_add_bio_tg() can be used to transfer a bio from a child tg to its parent. * tg_dispatch_one_bio() is updated to distinguish whether its parent is another throtl_grp or the throtl_data. If former, the bio is transferred to the parent throtl_grp using throtl_add_bio_tg(). If latter, the bio is ready to be issued and put on the top-level service_queue's bio_lists[] and throtl_data->nr_queued is decremented. As all throtl_grps currently have the top level service_queue as their ->parent_sq, this patch in itself doesn't make any behavior difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-05-14 13:52:38 -07:00
Tejun Heo	9e660acffc	blk-throttle: make blk_throtl_bio() ready for hierarchy Currently, blk_throtl_bio() issues the passed in bio directly if it's within limits of its associated tg (throtl_grp). This behavior becomes incorrect with hierarchy support as the bio should be accounted to and throttled by the ancestor throtl_grps too. This patch makes the direct issue path of blk_throtl_bio() to loop until it reaches the top-level service_queue or gets throttled. If the former, the bio can be issued directly; otherwise, it gets queued at the first layer it was above limits. As tg->parent_sq is always the top-level service queue currently, this patch in itself doesn't make any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-05-14 13:52:38 -07:00
Tejun Heo	2a12f0dcda	blk-throttle: make blk_throtl_drain() ready for hierarchy The current blk_throtl_drain() assumes that all active throtl_grps are queued on throtl_data->service_queue, which won't be true once hierarchy support is implemented. This patch makes blk_throtl_drain() perform post-order walk of the blkg hierarchy draining each associated throtl_grp, which guarantees that all bios will eventually be pushed to the top-level service_queue in throtl_data. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-05-14 13:52:37 -07:00
Tejun Heo	6e1a5704cb	blk-throttle: dispatch from throtl_pending_timer_fn() Currently, blk_throtl_dispatch_work_fn() is responsible for both dispatching bio's from throtl_grp's according to their limits and then issuing the dispatched bios. This patch moves the dispatch part to throtl_pending_timer_fn() so that the work item is kicked iff there are bio's to issue. This is to avoid work item execution at each step when hierarchy support is enabled. bio's will be dispatched towards the top-level service_queue from the timers at each layer and the work item will only be used to issue the bio's which reached the top-level service_queue. While fetching bio's to issue from bio_lists[], blk_throtl_dispatch_work_fn() fetches all READs before WRITEs. While the original code also dispatched READs first, if multiple throtl_grps are dispatched on the same run, WRITEs from throtl_grp which is dispatched first would precede READs from throtl_grps which are dispatched later. While this is a behavior change, given that the previous code already prioritized READs and block layer generally prioritizes and segregates READs from WRITEs, this isn't likely to make any noticeable differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-05-14 13:52:37 -07:00
Tejun Heo	7f52f98c2a	blk-throttle: implement dispatch looping throtl_select_dispatch() only dispatches throtl_quantum bios on each invocation. blk_throtl_dispatch_work_fn() in turn depends on throtl_schedule_next_dispatch() scheduling the next dispatch window immediately so that undue delays aren't incurred. This effectively chains multiple dispatch work item executions back-to-back when there are more than throtl_quantum bios to dispatch on a given tick. There is no reason to finish the current work item just to repeat it immediately. This patch makes throtl_schedule_next_dispatch() return %false without doing anything if the current dispatch window is still open and updates blk_throtl_dispatch_work_fn() repeat dispatching after cpu_relax() on %false return. This change will help implementing hierarchy support as dispatching will be done from pending_timer and immediate reschedule of timer function isn't supported and doesn't make much sense. While this patch changes how dispatch behaves when there are more than throtl_quantum bios to dispatch on a single tick, the behavior change is immaterial. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-05-14 13:52:37 -07:00
Tejun Heo	69df0ab030	blk-throttle: separate out throtl_service_queue->pending_timer from throtl_data->dispatch_work Currently, throtl_data->dispatch_work is a delayed_work item which handles both delayed dispatch and issuing bios. The two tasks will be separated to support proper hierarchy. To prepare for that, this patch separates out the timer into throtl_service_queue->pending_timer from throtl_data->dispatch_work and make the latter a work_struct. * As the timer is now per-service_queue, it's initialized and del_sync'd as its corresponding service_queue is created and destroyed. The timer, when triggered, simply schedules throtl_data->dispathc_work for execution. * throtl_schedule_delayed_work() is renamed to throtl_schedule_pending_timer() and takes @sq and @expires now. * Simiarly, throtl_schedule_next_dispatch() now takes @sq, which should be the parent_sq of the service_queue which just got a new bio or updated. As the parent_sq is always the top-level service_queue now, this doesn't change anything at this point. This patch doesn't introduce any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-05-14 13:52:36 -07:00
Tejun Heo	2a0f61e6ec	blk-throttle: set REQ_THROTTLED from throtl_charge_bio() and gate stats update with it With proper hierarchy support, a bio can be dispatched multiple times until it reaches the top-level service_queue and we don't want to update dispatch stats at each step. They are local stats and will be kept local. If recursive stats are necessary, they should be implemented separately and definitely not by updating counters recursively on each dispatch. This patch moves REQ_THROTTLED setting to throtl_charge_bio() and gate stats update with it so that dispatch stats are updated only on the first time the bio is charged to a throtl_grp, which will always be the throtl_grp the bio was originally queued to. This means that REQ_THROTTLED would be set even for bios which don't get throttled. As we don't want bios to leave blk-throtl with the flag set, move REQ_THROTLLED clearing to the end of blk_throtl_bio() and clear if the bio is being issued directly. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-05-14 13:52:36 -07:00
Tejun Heo	fda6f272c7	blk-throttle: implement sq_to_tg(), sq_to_td() and throtl_log() Now that both throtl_data and throtl_grp embed throtl_service_queue, we can unify throtl_log() and throtl_log_tg(). * sq_to_tg() is added. This returns the throtl_grp a service_queue is embedded in. If the service_queue is the top-level one embedded in throtl_data, NULL is returned. * sq_to_td() is added. A service_queue is always associated with a throtl_data. This function finds the associated td and returns it. * throtl_log() is updated to take throtl_service_queue instead of throtl_data. If the service_queue is one embedded in throtl_grp, it prints the same header as throtl_log_tg() did. If it's one embedded in throtl_data, it behaves the same as before. This renders throtl_log_tg() unnecessary. Removed. This change is necessary for hierarchy support as we're gonna be using the same code paths to dispatch bios to intermediate service_queues embedded in throtl_grps and the top-level service_queue embedded in throtl_data. This patch doesn't make any behavior changes. v2: throtl_log() didn't print a space after blkg path. Updated so that it prints a space after throtl_grp path. Spotted by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-05-14 13:52:36 -07:00
Tejun Heo	77216b0484	blk-throttle: add throtl_service_queue->parent_sq To prepare for hierarchy support, this patch adds throtl_service_queue->service_sq which points to the arent service_queue. Currently, for all service_queues embedded in throtl_grps, it points to throtl_data->service_queue. As throtl_data->service_queue doesn't have a parent its parent_sq is set to NULL. There are a number of functions which take both throtl_grp tg and throtl_service_queue parent_sq. With this patch, the parent service_queue can be determined from @tg and the @parent_sq arguments are removed. This patch doesn't make any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-05-14 13:52:36 -07:00
Tejun Heo	0e9f4164ba	blk-throttle: generalize update_disptime optimization in blk_throtl_bio() When blk_throtl_bio() wants to queue a bio to a tg (throtl_grp), it avoids invoking tg_update_disptime() and throtl_schedule_next_dispatch() if the tg already has bios queued in that direction. As a new bio is appeneded after the existing ones, it can't change the tg's next dispatch time or the parent's dispatch schedule. This optimization is currently open coded in blk_throtl_bio(). Whether the target biolist was occupied was recorded in a local variable and later used to skip disptime update. This patch moves generalizes it so that throtl_add_bio_tg() sets a new flag THROTL_TG_WAS_EMPTY if the biolist was empty before the new bio was added. tg_update_disptime() clears the flag automatically. blk_throtl_bio() is updated to simply test the flag before updating disptime. This patch doesn't make any functional differences now but will enable using the same optimization for recursive dispatch. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-05-14 13:52:35 -07:00
Tejun Heo	651930bc1c	blk-throttle: dispatch to throtl_data->service_queue.bio_lists[] throtl_service_queues will eventually form a tree which is anchored at throtl_data->service_queue and queue bios will climb the tree to the top service_queue to be executed. This patch makes the dispatch paths in blk_throtl_dispatch_work_fn() and blk_throtl_drain() to dispatch bios to throtl_data->service_queue.bio_lists[] instead of the on-stack bio_lists. This will keep the final dispatch to the top level service_queue share the same mechanism as dispatches through the rest of the hierarchy. As bio's should be issued in a sleepable context, blk_throtl_dispatch_work_fn() transfers all dispatched bio's from the service_queue bio_lists[] into an onstack one before dropping queue_lock and issuing the bio's. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-05-14 13:52:35 -07:00
Tejun Heo	73f0d49a96	blk-throttle: move bio_lists[] and friends to throtl_service_queue throtl_service_queues will eventually form a tree which is anchored at throtl_data->service_queue and queue bios will climb the tree to the top service_queue to be executed. This patch moves bio_lists[] and nr_queued[] from throtl_grp to its service_queue to prepare for that. As currently only the throtl_data->service_queue is in use, this patch just ends up moving throtl_grp->bio_lists[] and ->nr_queued[] to throtl_grp->service_queue.bio_lists[] and ->nr_queued[] without making any functional differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-05-14 13:52:35 -07:00
Tejun Heo	49a2f1e3f2	blk-throttle: add throtl_grp->service_queue Currently, there's single service_queue per queue - throtl_data->service_queue. All active throtl_grp's are queued on the queue and dispatched according to their limits. To support hierarchy, this will be expanded such that active throtl_grp's form a tree anchored at throtl_data->service_queue and chained through each intermediate throtl_grp's service_queue. This patch adds throtl_grp->service_queue to prepare for hierarchy support. The initialization function - throtl_service_queue_init() - is added and replaces the macro initializer. The newly added tg->service_queue isn't used yet. Following patches will do. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-05-14 13:52:34 -07:00
Tejun Heo	0049af73bb	blk-throttle: reorganize throtl_service_queue passed around as argument throtl_service_queue will be the building block of hierarchy support and will form a tree. This patch updates its usages as arguments to reduce confusion. * When a service queue is used as the parent role - the host of the rbtree - use @parent_sq instead of @sq. * For functions taking both @tg and @parent_sq, reorder them so that the order is (@tg, @parent_sq) not the other way around. This makes the code follow the usual convention of specifying the primary target of the operation as the first argument. This patch doesn't make any functional differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-05-14 13:52:33 -07:00
Tejun Heo	e2d57e6019	blk-throttle: pass around throtl_service_queue instead of throtl_data throtl_service_queue will be used as the basic block to implement hierarchy support. Pass around throtl_service_queue sq instead of throtl_data td in the following functions which will be used across multiple levels of hierarchy. * [__]throtl_enqueue/dequeue_tg() * throtl_add_bio_tg() * tg_update_disptime() * throtl_select_dispatch() Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-05-14 13:52:33 -07:00
Tejun Heo	0f3457f60e	blk-throttle: add backlink pointer from throtl_grp to throtl_data Add throtl_grp->td so that the td (throtl_data) a given tg (throtl_grp) belongs to can be determined, and remove @td argument from functions which take both @td and @tg as the former now can be determined from the latter. This generally simplifies the code and removes a number of cases where @td is passed as an argument without being actually used. This will also help hierarchy support implementation. While at it, in multi-line conditions, move the logical operators leading broken lines to the end of the previous line. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-05-14 13:52:32 -07:00
Tejun Heo	5b2c16aae0	blk-throttle: simplify throtl_grp flag handling blk-throttle is still using function-defining macros to define flag handling functions, which went out style at least a decade ago. Just define the flag as bitmask and use direct bit operations. This patch doesn't make any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-05-14 13:52:32 -07:00
Tejun Heo	c9e0332e87	blk-throttle: rename throtl_rb_root to throtl_service_queue throtl_rb_root will be expanded to cover more roles for hierarchy support. Rename it to throtl_service_queue and make its fields more descriptive. * rb -> pending_tree * left -> first_pending * count -> nr_pending * min_disptime -> first_pending_disptime This patch is purely cosmetic. Signed-off-by: Tejun Heo <tj@kernel.org Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-05-14 13:52:32 -07:00
Tejun Heo	6a525600ff	blk-throttle: remove pointless throtl_nr_queued() optimizations throtl_nr_queued() is used in several places to avoid performing certain operations when the throtl_data is empty. This usually is useless as those paths usually aren't traveled if there's no bio queued. * throtl_schedule_delayed_work() skips scheduling dispatch work item if @td doesn't have any bios queued; however, the only case it can be called when @td is empty is from tg_set_conf() which isn't something we should be optimizing for. * throtl_schedule_next_dispatch() takes a quick exit if @td is empty; however, right after that it triggers BUG if the service tree is empty. The two conditions are equivalent and it can just test @st->count for the quick exit. * blk_throtl_dispatch_work_fn() skips dispatch if @td is empty. This work function isn't usually invoked when @td is empty. The only possibility is from tg_set_conf() and when it happens the normal dispatching path can handle empty @td fine. No need to add special skip path. This patch removes the above three unnecessary optimizations, which leave throtl_log() call in blk_throtl_dispatch_work_fn() the only user of throtl_nr_queued(). Remove throtl_nr_queued() and open code it in throtl_log(). I don't think we need td->nr_queued[] at all. Maybe we can remove it later. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-05-14 13:52:32 -07:00
Tejun Heo	a9131a27e2	blk-throttle: relocate throtl_schedule_delayed_work() Move throtl_schedule_delayed_work() above its first user so that the forward declaration can be removed. This patch is pure relocaiton. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-05-14 13:52:31 -07:00
Tejun Heo	cb76199c36	blk-throttle: collapse throtl_dispatch() into the work function blk-throttle is about to go through major restructuring to support hierarchy. Do cosmetic updates in preparation. * s/throtl_data->throtl_work/throtl_data->dispatch_work/ * s/blk_throtl_work()/blk_throtl_dispatch_work_fn()/ * Collapse throtl_dispatch() into blk_throtl_dispatch_work_fn() This patch is purely cosmetic. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-05-14 13:52:31 -07:00
Tejun Heo	632b44935f	blk-throttle: remove deferred config application mechanism When bps or iops configuration changes, blk-throttle records the new configuration and sets a flag indicating that the config has changed. The flag is checked in the bio dispatch path and applied. This deferred config application was necessary due to limitations in blkcg framework, which haven't existed for quite a while now. This patch removes the deferred config application mechanism and applies new configurations directly from tg_set_conf(), which is simpler. v2: Dropped unnecessary throtl_schedule_delayed_work() call from tg_set_conf() as suggested by Vivek Goyal. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-05-14 13:52:31 -07:00
Tejun Heo	2db6314c21	blk-throttle: remove spurious throtl_enqueue_tg() call from throtl_select_dispatch() throtl_select_dispatch() calls throtl_enqueue_tg() right after tg_update_disptime(), which always calls the function anyway. The call is, while harmless, unnecessary. Remove it. This patch doesn't introduce any behavior difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-05-14 13:52:31 -07:00
Tejun Heo	2a4fd070ee	blkcg: move bulk of blkcg_gq release operations to the RCU callback Currently, when the last reference of a blkcg_gq is put, all then release operations sans the actual freeing happen directly in blkg_put(). As blkg_put() may be called under queue_lock, all pd_exit_fn()s may be too. This makes it impossible for pd_exit_fn()s to use del_timer_sync() on timers which grab the queue_lock which is an irq-safe lock due to the deadlock possibility described in the comment on top of del_timer_sync(). This can be easily avoided by perfoming the release operations in the RCU callback instead of directly from blkg_put(). This patch moves the blkcg_gq release operations to the RCU callback. As this leaves __blkg_release() with only call_rcu() invocation, blkg_rcu_free() is renamed to __blkg_release_rcu(), exported and call_rcu() invocation is now done directly from blkg_put() instead of going through __blkg_release() which is removed. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-05-14 13:52:31 -07:00
Tejun Heo	db61367038	blkcg: invoke blkcg_policy->pd_init() after parent is linked Currently, when creating a new blkcg_gq, each policy's pd_init_fn() is invoked in blkg_alloc() before the parent is linked. This makes it difficult for policies to perform initializations which are dependent on the parent. This patch moves pd_init_fn() invocations to blkg_create() after the parent blkg is linked where the new blkg is fully initialized. As this means that blkg_free() can't assume that pd's are initialized, pd_exit_fn() invocations are moved to __blkg_release(). This guarantees that pd_exit_fn() is also invoked with fully initialized blkgs with valid parent pointers. This will help implementing hierarchy support in blk-throttle. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-05-14 13:52:31 -07:00
Tejun Heo	aa539cb38f	blkcg: implement blkg_for_each_descendant_post() This will be used by blk-throttle hierarchy support. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-05-14 13:52:31 -07:00
Tejun Heo	dd4a4ffc0a	blkcg: move blkg_for_each_descendant_pre() to block/blk-cgroup.h blk-throttle hierarchy support will make use of it. Move blkg_for_each_descendant_pre() from block/blk-cgroup.c to block/blk-cgroup.h. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-05-14 13:52:30 -07:00
Tejun Heo	2423c9c3f0	blkcg: fix error return path in blkg_create() In blkg_create(), after lookup of parent fails, the control jumps to error path with the error code encoded into @blkg. The error path doesn't use @blkg for the return value. It returns ERR_PTR(ret). Make lookup fail path set @ret instead of @blkg. Note that the parent lookup is guaranteed to succeed at that point and the condition check is purely for sanity and triggers WARN when fails. As such, I don't think it's necessary to mark it for -stable. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-05-14 13:52:30 -07:00
Viresh Kumar	695588f945	block: queue work on power efficient wq Block layer uses workqueues for multiple purposes. There is no real dependency of scheduling these on the cpu which scheduled them. On a idle system, it is observed that and idle cpu wakes up many times just to service this work. It would be better if we can schedule it on a cpu which the scheduler believes to be the most appropriate one. This patch replaces normal workqueues with power efficient versions. Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-05-14 10:50:07 -07:00
Linus Torvalds	4de13d7aa8	Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-block Pull block core updates from Jens Axboe: - Major bit is Kents prep work for immutable bio vecs. - Stable candidate fix for a scheduling-while-atomic in the queue bypass operation. - Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging discard bios. - Tejuns changes to convert the writeback thread pool to the generic workqueue mechanism. - Runtime PM framework, SCSI patches exists on top of these in James' tree. - A few random fixes. * 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits) relay: move remove_buf_file inside relay_close_buf partitions/efi.c: replace useless kzalloc's by kmalloc's fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read() block: fix max discard sectors limit blkcg: fix "scheduling while atomic" in blk_queue_bypass_start Documentation: cfq-iosched: update documentation help for cfq tunables writeback: expose the bdi_wq workqueue writeback: replace custom worker pool implementation with unbound workqueue writeback: remove unused bdi_pending_list aoe: Fix unitialized var usage bio-integrity: Add explicit field for owner of bip_buf block: Add an explicit bio flag for bios that own their bvec block: Add bio_alloc_pages() block: Convert some code to bio_for_each_segment_all() block: Add bio_for_each_segment_all() bounce: Refactor __blk_queue_bounce to not use bi_io_vec raid1: use bio_copy_data() pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage pktcdvd: use bio_copy_data() block: Add bio_copy_data() ...	2013-05-08 10:13:35 -07:00
Kent Overstreet	a27bb332c0	aio: don't include aio.h in sched.h Faster kernel compiles by way of fewer unnecessary includes. [akpm@linux-foundation.org: fix fallout] [akpm@linux-foundation.org: fix build] Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-05-07 20:16:25 -07:00
Linus Torvalds	736a2dd257	Lots of virtio work which wasn't quite ready for last merge window. Plus I dived into lguest again, reworking the pagetable code so we can move the switcher page: our fixmaps sometimes take more than 2MB now... Cheers, Rusty. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iQIcBAABAgAGBQJRga7lAAoJENkgDmzRrbjx/yIQAKpqIBtxOJeYH3SY+Uoe7Cfp toNYcpJEldvb0UcWN8M2cSZpHoxl1SUoq9djwcM29tcKa7EZAjHaGtb/Q1qMTDgv +B3WAfiGU2pmXFxLAkbrlLNGnysy24JspqJQ5hcYV84EiBxQdZp+nCYgOphd+GMK ww16vo9ya8jFjzt3GeRp/Heb3vEzV4Cp6BC3i0m8A3WNpEpbRb66pqXNk5o8ggJO SxQOKSXmUM+0m+jKSul5xn3e2Ls2LOrZZ8/DIHA+gW66N4Zab7n2/j1Q9VRxb4lh FqnR7KwgBX8OCh9IsBDqQYS7MohvMYge6eUdLtFrq84jvMleMEhrC8q9v2tucFUb 5t18CLwvyK7Gdg6UCKiZ7YSPcuURAILO16al9bh5IseeBDsuX+43VsvQoBmFn9k6 cLOVTZ6BlOmahK5PyRYFSvLa9Rxzr/05Mr7oYq9UgshD9io78dnqczFYIORF53rW zD7C4HuTZfYJFfNd0wAJ0RfVXnf8QvDlMdo7zPC26DSXNWqj8OexCY0qqSWUB+2F vcfJP6NkV4fZB8aawWIFUVwc64yqtt2uPVLa7ATZWqk16PgKrchGewmw3tiEwOgu 1l7xgffTRRUIJsqaCZoXdgw3yezcKRjuUBcOxL09lDAAhc+NxWNvzZBsKp66DwDk yZQKn0OdXnuf0CeEOfFf =1tYL -----END PGP SIGNATURE----- Merge tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux Pull virtio & lguest updates from Rusty Russell: "Lots of virtio work which wasn't quite ready for last merge window. Plus I dived into lguest again, reworking the pagetable code so we can move the switcher page: our fixmaps sometimes take more than 2MB now..." Ugh. Annoying conflicts with the tcm_vhost -> vhost_scsi rename. Hopefully correctly resolved. * tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (57 commits) caif_virtio: Remove bouncing email addresses lguest: improve code readability in lg_cpu_start. virtio-net: fill only rx queues which are being used lguest: map Switcher below fixmap. lguest: cache last cpu we ran on. lguest: map Switcher text whenever we allocate a new pagetable. lguest: don't share Switcher PTE pages between guests. lguest: expost switcher_pages array (as lg_switcher_pages). lguest: extract shadow PTE walking / allocating. lguest: make check_gpte et. al return bool. lguest: assume Switcher text is a single page. lguest: rename switcher_page to switcher_pages. lguest: remove RESERVE_MEM constant. lguest: check vaddr not pgd for Switcher protection. lguest: prepare to make SWITCHER_ADDR a variable. virtio: console: replace EMFILE with EBUSY for already-open port virtio-scsi: reset virtqueue affinity when doing cpu hotplug virtio-scsi: introduce multiqueue support virtio-scsi: push vq lock/unlock into virtscsi_vq_done virtio-scsi: pass struct virtio_scsi to virtqueue completion function ...	2013-05-02 14:14:04 -07:00
Philippe De Muyter	ea56505bed	partitions/efi.c: replace useless kzalloc's by kmalloc's In alloc_read_gpt_entries and alloc_read_gpt_header, the kzalloc'ated zones are either totally overwritten by the following read_lba call, or freed. As kmalloc is cheaper than kzalloc, use kmalloc. Signed-off-by: Philippe De Muyter <phdm@macqel.be> Cc: Matt Domsch <Matt_Domsch@dell.com> Cc: Panagiotis Issaris <takis@issaris.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-04-30 08:34:25 +02:00
Linus Torvalds	191a712090	Merge branch 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - Fixes and a lot of cleanups. Locking cleanup is finally complete. cgroup_mutex is no longer exposed to individual controlelrs which used to cause nasty deadlock issues. Li fixed and cleaned up quite a bit including long standing ones like racy cgroup_path(). - device cgroup now supports proper hierarchy thanks to Aristeu. - perf_event cgroup now supports proper hierarchy. - A new mount option "__DEVEL__sane_behavior" is added. As indicated by the name, this option is to be used for development only at this point and generates a warning message when used. Unfortunately, cgroup interface currently has too many brekages and inconsistencies to implement a consistent and unified hierarchy on top. The new flag is used to collect the behavior changes which are necessary to implement consistent unified hierarchy. It's likely that this flag won't be used verbatim when it becomes ready but will be enabled implicitly along with unified hierarchy. The option currently disables some of broken behaviors in cgroup core and also .use_hierarchy switch in memcg (will be routed through -mm), which can be used to make very unusual hierarchy where nesting is partially honored. It will also be used to implement hierarchy support for blk-throttle which would be impossible otherwise without introducing a full separate set of control knobs. This is essentially versioning of interface which isn't very nice but at this point I can't see any other options which would allow keeping the interface the same while moving towards hierarchy behavior which is at least somewhat sane. The planned unified hierarchy is likely to require some level of adaptation from userland anyway, so I think it'd be best to take the chance and update the interface such that it's supportable in the long term. Maintaining the existing interface does complicate cgroup core but shouldn't put too much strain on individual controllers and I think it'd be manageable for the foreseeable future. Maybe we'll be able to drop it in a decade. Fix up conflicts (including a semantic one adding a new #include to ppc that was uncovered by header the file changes) as per Tejun. * 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (45 commits) cpuset: fix compile warning when CONFIG_SMP=n cpuset: fix cpu hotplug vs rebuild_sched_domains() race cpuset: use rebuild_sched_domains() in cpuset_hotplug_workfn() cgroup: restore the call to eventfd->poll() cgroup: fix use-after-free when umounting cgroupfs cgroup: fix broken file xattrs devcg: remove parent_cgroup. memcg: force use_hierarchy if sane_behavior cgroup: remove cgrp->top_cgroup cgroup: introduce sane_behavior mount option move cgroupfs_root to include/linux/cgroup.h cgroup: convert cgroupfs_root flag bits to masks and add CGRP_ prefix cgroup: make cgroup_path() not print double slashes Revert "cgroup: remove bind() method from cgroup_subsys." perf: make perf_event cgroup hierarchical cgroup: implement cgroup_is_descendant() cgroup: make sure parent won't be destroyed before its children cgroup: remove bind() method from cgroup_subsys. devcg: remove broken_hierarchy tag cgroup: remove cgroup_lock_is_held() ...	2013-04-29 19:14:20 -07:00
Linus Torvalds	2794b5d408	Driver core update for 3.10-rc1 Here's the merge request for the driver core tree for 3.10-rc1 It's pretty small, just a number of driver core and sysfs updates and fixes, all of which have been in linux-next for a while now. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (GNU/Linux) iEYEABECAAYFAlF+m4cACgkQMUfUDdst+ymp+wCgv/F7zAhZsKW5YT9A/FsTNl3m Ge8AnRlfYPwxM1Zt4kIuDAwfKuLTYV/B =swS7 -----END PGP SIGNATURE----- Merge tag 'driver-core-3.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core update from Greg Kroah-Hartman: "Here's the merge request for the driver core tree for 3.10-rc1 It's pretty small, just a number of driver core and sysfs updates and fixes, all of which have been in linux-next for a while now. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>" Fixed conflict in kernel/rtmutex-tester.c, the locking tree had a better fix for the same sysfs file mode problem. * tag 'driver-core-3.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: PM / Runtime: Idle devices asynchronously after probe\|release driver core: handle user namespaces properly with the uid/gid devtmpfs change driver core: devtmpfs: fix compile failure with CONFIG_UIDGID_STRICT_TYPE_CHECKS devtmpfs: add base.h include driver core: add uid and gid to devtmpfs sysfs: check if one entry has been removed before freeing sysfs: fix crash_notes_size build warning sysfs: fix use after free in case of concurrent read/write and readdir rtmutex-tester: fix mode of sysfs files Documentation: Add ABI entry for crash_notes and crash_notes_size sysfs: Add crash_notes_size to export percpu note size driver core: platform_device.h: fix checkpatch errors and warnings driver core: platform.c: fix checkpatch errors and warnings driver core: warn that platform_driver_probe can not use deferred probing sysfs: use atomic_inc_unless_negative in sysfs_get_active base: core: WARN() about bogus permissions on device attributes device: separate all subsys mutexes	2013-04-29 11:31:50 -07:00
Linus Torvalds	0a82a8d132	Revert "block: add missing block_bio_complete() tracepoint" This reverts commit `3a366e614d`. Wanlong Gao reports that it causes a kernel panic on his machine several minutes after boot. Reverting it removes the panic. Jens says: "It's not quite clear why that is yet, so I think we should just revert the commit for 3.9 final (which I'm assuming is pretty close). The wifi is crap at the LSF hotel, so sending this email instead of queueing up a revert and pull request." Reported-by: Wanlong Gao <gaowanlong@cn.fujitsu.com> Requested-by: Jens Axboe <axboe@kernel.dk> Cc: Tejun Heo <tj@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-18 09:00:26 -07:00
Greg Kroah-Hartman	0d1d392f01	Merge 3.9-rc7 into driver-core-next Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-04-14 18:37:05 -07:00
Greg Kroah-Hartman	4e4098a3e0	driver core: handle user namespaces properly with the uid/gid devtmpfs change Now that devtmpfs is caring about uid/gid, we need to use the correct internal types so users who have USER_NS enabled will have things work properly for them. Thanks to Eric for pointing this out, and the patch review. Reported-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Kay Sievers <kay@vrfy.org> Cc: Ming Lei <ming.lei@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-04-11 11:43:29 -07:00
Jun'ichi Nomura	e5072664f8	blkcg: fix "scheduling while atomic" in blk_queue_bypass_start Since `749fefe677` in v3.7 ("block: lift the initial queue bypass mode on blk_register_queue() instead of blk_init_allocated_queue()"), the following warning appears when multipath is used with CONFIG_PREEMPT=y. This patch moves blk_queue_bypass_start() before radix_tree_preload() to avoid the sleeping call while preemption is disabled. BUG: scheduling while atomic: multipath/2460/0x00000002 1 lock held by multipath/2460: #0: (&md->type_lock){......}, at: [<ffffffffa019fb05>] dm_lock_md_type+0x17/0x19 [dm_mod] Modules linked in: ... Pid: 2460, comm: multipath Tainted: G W 3.7.0-rc2 #1 Call Trace: [<ffffffff810723ae>] __schedule_bug+0x6a/0x78 [<ffffffff81428ba2>] __schedule+0xb4/0x5e0 [<ffffffff814291e6>] schedule+0x64/0x66 [<ffffffff8142773a>] schedule_timeout+0x39/0xf8 [<ffffffff8108ad5f>] ? put_lock_stats+0xe/0x29 [<ffffffff8108ae30>] ? lock_release_holdtime+0xb6/0xbb [<ffffffff814289e3>] wait_for_common+0x9d/0xee [<ffffffff8107526c>] ? try_to_wake_up+0x206/0x206 [<ffffffff810c0eb8>] ? kfree_call_rcu+0x1c/0x1c [<ffffffff81428aec>] wait_for_completion+0x1d/0x1f [<ffffffff810611f9>] wait_rcu_gp+0x5d/0x7a [<ffffffff81061216>] ? wait_rcu_gp+0x7a/0x7a [<ffffffff8106fb18>] ? complete+0x21/0x53 [<ffffffff810c0556>] synchronize_rcu+0x1e/0x20 [<ffffffff811dd903>] blk_queue_bypass_start+0x5d/0x62 [<ffffffff811ee109>] blkcg_activate_policy+0x73/0x270 [<ffffffff81130521>] ? kmem_cache_alloc_node_trace+0xc7/0x108 [<ffffffff811f04b3>] cfq_init_queue+0x80/0x28e [<ffffffffa01a1600>] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod] [<ffffffff811d8c41>] elevator_init+0xe1/0x115 [<ffffffff811e229f>] ? blk_queue_make_request+0x54/0x59 [<ffffffff811dd743>] blk_init_allocated_queue+0x8c/0x9e [<ffffffffa019ffcd>] dm_setup_md_queue+0x36/0xaa [dm_mod] [<ffffffffa01a60e6>] table_load+0x1bd/0x2c8 [dm_mod] [<ffffffffa01a7026>] ctl_ioctl+0x1d6/0x236 [dm_mod] [<ffffffffa01a5f29>] ? table_clear+0xaa/0xaa [dm_mod] [<ffffffffa01a7099>] dm_ctl_ioctl+0x13/0x17 [dm_mod] [<ffffffff811479fc>] do_vfs_ioctl+0x3fb/0x441 [<ffffffff811b643c>] ? file_has_perm+0x8a/0x99 [<ffffffff81147aa0>] sys_ioctl+0x5e/0x82 [<ffffffff812010be>] ? trace_hardirqs_on_thunk+0x3a/0x3f [<ffffffff814310d9>] system_call_fastpath+0x16/0x1b Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Alasdair G Kergon <agk@redhat.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-04-09 15:01:21 +02:00
Kay Sievers	3c2670e651	driver core: add uid and gid to devtmpfs Some drivers want to tell userspace what uid and gid should be used for their device nodes, so allow that information to percolate through the driver core to userspace in order to make this happen. This means that some systems (i.e. Android and friends) will not need to even run a udev-like daemon for their device node manager and can just rely in devtmpfs fully, reducing their footprint even more. Signed-off-by: Kay Sievers <kay@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-04-08 08:21:48 -07:00
Jens Axboe	c2fccc1c9f	Revert "loop: cleanup partitions when detaching loop device" This reverts commit `8761a3dc1f`. There are situations where the destruction path is called with the bdev->bd_mutex already held, which then deadlocks in loop_clr_fd(). The normal partition cleanup does a trylock() on the mutex, but it'd be nice to have a more bullet proof method in loop. So punt this more involved fix to the next merge window, and just back out this buggy fix for now. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-04-08 10:12:11 +02:00
Arnd Bergmann	c678ef5286	block: avoid using uninitialized value in from queue_var_store As found by gcc-4.8, the QUEUE_SYSFS_BIT_FNS macro creates functions that use a value generated by queue_var_store independent of whether that value was set or not. block/blk-sysfs.c: In function 'queue_store_nonrot': block/blk-sysfs.c:244:385: warning: 'val' may be used uninitialized in this function [-Wmaybe-uninitialized] Unlike most other such warnings, this one is not a false positive, writing any non-number string into the sysfs files indeed has an undefined result, rather than returning an error. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-04-03 21:53:57 +02:00
Jens Axboe	64f8de4da7	Merge branch 'writeback-workqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq into for-3.10/core Tejun writes: ----- This is the pull request for the earlier patchset[1] with the same name. It's only three patches (the first one was committed to workqueue tree) but the merge strategy is a bit involved due to the dependencies. * Because the conversion needs features from wq/for-3.10, block/for-3.10/core is based on rc3, and wq/for-3.10 has conflicts with rc3, I pulled mainline (rc5) into wq/for-3.10 to prevent those workqueue conflicts from flaring up in block tree. * Resolving the issue that Jan and Dave raised about debugging requires arch-wide changes. The patchset is being worked on[2] but it'll have to go through -mm after these changes show up in -next, and not included in this pull request. The three commits are located in the following git branch. git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git writeback-workqueue Pulling it into block/for-3.10/core produces a conflict in drivers/md/raid5.c between the following two commits. `e3620a3ad5` ("MD RAID5: Avoid accessing gendisk or queue structs when not available") `2f6db2a707` ("raid5: use bio_reset()") The conflict is trivial - one removes an "if ()" conditional while the other removes "rbi->bi_next = NULL" right above it. We just need to remove both. The merged branch is available at git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git block-test-merge so that you can use it for verification. The test merge commit has proper merge description. While these changes are a bit of pain to route, they make code simpler and even have, while minute, measureable performance gain[3] even on a workload which isn't particularly favorable to showing the benefits of this conversion. ---- Fixed up the conflict. Conflicts: drivers/md/raid5.c Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-04-02 10:04:39 +02:00
Jens Axboe	705cd0ea1c	Merge branch 'for-jens' of http://evilpiepirate.org/git/linux-bcache into for-3.10/core This contains Kents prep work for the immutable bio_vecs.	2013-03-24 21:38:59 -06:00
Kent Overstreet	f73a1c7d11	block: Add bio_end_sector() Just a little convenience macro - main reason to add it now is preparing for immutable bio vecs, it'll reduce the size of the patch that puts bi_sector/bi_size/bi_idx into a struct bvec_iter. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: Lars Ellenberg <drbd-dev@lists.linbit.com> CC: Jiri Kosina <jkosina@suse.cz> CC: Alasdair Kergon <agk@redhat.com> CC: dm-devel@redhat.com CC: Neil Brown <neilb@suse.de> CC: Martin Schwidefsky <schwidefsky@de.ibm.com> CC: Heiko Carstens <heiko.carstens@de.ibm.com> CC: linux-s390@vger.kernel.org CC: Chris Mason <chris.mason@fusionio.com> CC: Steven Whitehouse <swhiteho@redhat.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com>	2013-03-23 14:15:29 -07:00
Kent Overstreet	f79ea41614	block: Refactor blk_update_request() Converts it to use bio_advance(), simplifying it quite a bit in the process. Note that req_bio_endio() now always calls bio_advance() - which means it always loops over the biovec, not just on partial completions. Don't expect it to affect performance, but worth noting. Tested it by forcing partial updates, and dumping before and after on various bio/bvec fields when doing a partial update. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk>	2013-03-23 14:15:28 -07:00
Lin Ming	c8158819d5	block: implement runtime pm strategy When a request is added: If device is suspended or is suspending and the request is not a PM request, resume the device. When the last request finishes: Call pm_runtime_mark_last_busy(). When pick a request: If device is resuming/suspending, then only PM request is allowed to go. The idea and API is designed by Alan Stern and described here: http://marc.info/?l=linux-scsi&m=133727953625963&w=2 Signed-off-by: Lin Ming <ming.m.lin@intel.com> Signed-off-by: Aaron Lu <aaron.lu@intel.com> Acked-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-22 22:22:15 -06:00
Lin Ming	6c95466758	block: add runtime pm helpers Add runtime pm helper functions: void blk_pm_runtime_init(struct request_queue q, struct device dev) - Initialization function for drivers to call. int blk_pre_runtime_suspend(struct request_queue q) - If any requests are in the queue, mark last busy and return -EBUSY. Otherwise set q->rpm_status to RPM_SUSPENDING and return 0. void blk_post_runtime_suspend(struct request_queue q, int err) - If the suspend succeeded then set q->rpm_status to RPM_SUSPENDED. Otherwise set it to RPM_ACTIVE and mark last busy. void blk_pre_runtime_resume(struct request_queue q) - Set q->rpm_status to RPM_RESUMING. void blk_post_runtime_resume(struct request_queue q, int err) - If the resume succeeded then set q->rpm_status to RPM_ACTIVE and call __blk_run_queue, then mark last busy and autosuspend. Otherwise set q->rpm_status to RPM_SUSPENDED. The idea and API is designed by Alan Stern and described here: http://marc.info/?l=linux-scsi&m=133727953625963&w=2 Signed-off-by: Lin Ming <ming.m.lin@intel.com> Signed-off-by: Aaron Lu <aaron.lu@intel.com> Acked-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-22 22:22:15 -06:00
Alice Ferrazzi	f2fc7d0edd	Block: blk-flush: Fixed indent code style Fixed code indent should use tabs where possible. Signed-off-by: Alice Ferrazzi <alice.ferrazzi@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-22 12:22:51 -06:00
Phillip Susi	8761a3dc1f	loop: cleanup partitions when detaching loop device Any partitions added by user space to the loop device were being left in place after detaching the loop device. This was because the detach path issued a BLKRRPART to clean up partitions if LO_FLAGS_PARTSCAN was set, meaning that the partitions were auto scanned on attach. Replace this BLKRRPART with code that unconditionally cleans up partitions on detach instead. Signed-off-by: Phillip Susi <psusi@ubuntu.com> Modified by Jens to export delete_partition(). Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-22 12:21:53 -06:00
Paolo Bonzini	c8164d8931	scatterlist: introduce sg_unmark_end This is useful in places that recycle the same scatterlist multiple times, and do not want to incur the cost of sg_init_table every time in hot paths. Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2013-03-20 15:43:04 +10:30
Li Zefan	65dff759d2	cgroup: fix cgroup_path() vs rename() race rename() will change dentry->d_name. The result of this race can be worse than seeing partially rewritten name, but we might access a stale pointer because rename() will re-allocate memory to hold a longer name. As accessing dentry->name must be protected by dentry->d_lock or parent inode's i_mutex, while on the other hand cgroup-path() can be called with some irq-safe spinlocks held, we can't generate cgroup path using dentry->d_name. Alternatively we make a copy of dentry->d_name and save it in cgrp->name when a cgroup is created, and update cgrp->name at rename(). v5: use flexible array instead of zero-size array. v4: - allocate root_cgroup_name and all root_cgroup->name points to it. - add cgroup_name() wrapper. v3: use kfree_rcu() instead of synchronize_rcu() in user-visible path. v2: make cgrp->name RCU safe. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-03-04 09:50:08 -08:00
Linus Torvalds	ee89f81252	Merge branch 'for-3.9/core' of git://git.kernel.dk/linux-block Pull block IO core bits from Jens Axboe: "Below are the core block IO bits for 3.9. It was delayed a few days since my workstation kept crashing every 2-8h after pulling it into current -git, but turns out it is a bug in the new pstate code (divide by zero, will report separately). In any case, it contains: - The big cfq/blkcg update from Tejun and and Vivek. - Additional block and writeback tracepoints from Tejun. - Improvement of the should sort (based on queues) logic in the plug flushing. - _io() variants of the wait_for_completion() interface, using io_schedule() instead of schedule() to contribute to io wait properly. - Various little fixes. You'll get two trivial merge conflicts, which should be easy enough to fix up" Fix up the trivial conflicts due to hlist traversal cleanups (commit `b67bfe0d42`: "hlist: drop the node parameter from iterators"). * 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits) block: remove redundant check to bd_openers() block: use i_size_write() in bd_set_size() cfq: fix lock imbalance with failed allocations drivers/block/swim3.c: fix null pointer dereference block: don't select PERCPU_RWSEM block: account iowait time when waiting for completion of IO request sched: add wait_for_completion_io[_timeout] writeback: add more tracepoints block: add block_{touch\|dirty}_buffer tracepoint buffer: make touch_buffer() an exported function block: add @req to bio_{front\|back}_merge tracepoints block: add missing block_bio_complete() tracepoint block: Remove should_sort judgement when flush blk_plug block,elevator: use new hashtable implementation cfq-iosched: add hierarchical cfq_group statistics cfq-iosched: collect stats from dead cfqgs cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats() blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock block: RCU free request_queue blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge() ...	2013-02-28 12:52:24 -08:00
Sasha Levin	b67bfe0d42	hlist: drop the node parameter from iterators I'm not sure why, but the hlist for each entry iterators were conceived list_for_each_entry(pos, head, member) The hlist ones were greedy and wanted an extra parameter: hlist_for_each_entry(tpos, pos, head, member) Why did they need an extra pos parameter? I'm not quite sure. Not only they don't really need it, it also prevents the iterator from looking exactly like the list iterator, which is unfortunate. Besides the semantic patch, there was some manual work required: - Fix up the actual hlist iterators in linux/list.h - Fix up the declaration of other iterators based on the hlist ones. - A very small amount of places were using the 'node' parameter, this was modified to use 'obj->member' instead. - Coccinelle didn't handle the hlist_for_each_entry_safe iterator properly, so those had to be fixed up manually. The semantic patch which is mostly the work of Peter Senna Tschudin is here: @@ iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host; type T; expression a,c,d,e; identifier b; statement S; @@ -T b; <+... when != b ( hlist_for_each_entry(a, - b, c, d) S \| hlist_for_each_entry_continue(a, - b, c) S \| hlist_for_each_entry_from(a, - b, c) S \| hlist_for_each_entry_rcu(a, - b, c, d) S \| hlist_for_each_entry_rcu_bh(a, - b, c, d) S \| hlist_for_each_entry_continue_rcu_bh(a, - b, c) S \| for_each_busy_worker(a, c, - b, d) S \| ax25_uid_for_each(a, - b, c) S \| ax25_for_each(a, - b, c) S \| inet_bind_bucket_for_each(a, - b, c) S \| sctp_for_each_hentry(a, - b, c) S \| sk_for_each(a, - b, c) S \| sk_for_each_rcu(a, - b, c) S \| sk_for_each_from -(a, b) +(a) S + sk_for_each_from(a) S \| sk_for_each_safe(a, - b, c, d) S \| sk_for_each_bound(a, - b, c) S \| hlist_for_each_entry_safe(a, - b, c, d, e) S \| hlist_for_each_entry_continue_rcu(a, - b, c) S \| nr_neigh_for_each(a, - b, c) S \| nr_neigh_for_each_safe(a, - b, c, d) S \| nr_node_for_each(a, - b, c) S \| nr_node_for_each_safe(a, - b, c, d) S \| - for_each_gfn_sp(a, c, d, b) S + for_each_gfn_sp(a, c, d) S \| - for_each_gfn_indirect_valid_sp(a, c, d, b) S + for_each_gfn_indirect_valid_sp(a, c, d) S \| for_each_host(a, - b, c) S \| for_each_host_safe(a, - b, c, d) S \| for_each_mesh_entry(a, - b, c, d) S ) ...+> [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c] [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c] [akpm@linux-foundation.org: checkpatch fixes] [akpm@linux-foundation.org: fix warnings] [akpm@linux-foudnation.org: redo intrusive kvm changes] Tested-by: Peter Senna Tschudin <peter.senna@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Gleb Natapov <gleb@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-27 19:10:24 -08:00
Ming Lei	ac2e5327a5	block/partitions: optimize memory allocation in check_partition() Currently, sizeof(struct parsed_partitions) may be 64KB in 32bit arch, so it is easy to trigger page allocation failure by check_partition, especially in hotplug block device situation(such as, USB mass storage, MMC card, ...), and Felipe Balbi has observed the failure. This patch does below optimizations on the allocation of struct parsed_partitions to try to address the issue: - make parsed_partitions.parts as pointer so that the pointed memory can fit in 32KB buffer, then approximate 32KB memory can be saved - vmalloc the buffer pointed by parsed_partitions.parts because 32KB is still a bit big for kmalloc - given that many devices have the partition count limit, so only allocate disk_max_parts() partitions instead of 256 partitions always Signed-off-by: Ming Lei <ming.lei@canonical.com> Reported-by: Felipe Balbi <balbi@ti.com> Cc: Jens Axboe <axboe@kernel.dk> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-27 19:10:21 -08:00
Ming Lei	06004e6eeb	block/partitions/mac.c: obey the state->limit constraint It isn't necessary to read the information of partitions whose number is equal and more than state->limit since only maximum state->limit partitions will be added inside rescan_partitions(). That is also what other kind of partitions are doing. Signed-off-by: Ming Lei <ming.lei@canonical.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-27 19:10:21 -08:00
Peter Jones	8b8a6e1881	block/partitions/efi.c: ensure that the GPT header is at least the size of the structure. UEFI 2.3.1D will include a change to the spec language mandating that a GPT header must be greater than or equal to the size of the defined structure. While verifying that this would work on Linux, I discovered that we're not actually checking the minimum bound at all. The result of this is that when we verify the checksum, it's possible that on a malformed header (with header_size of 0), we won't actually verify any data. [akpm@linux-foundation.org: fix printk warning] Signed-off-by: Peter Jones <pjones@redhat.com> Acked-by: Matt Fleming <matt.fleming@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Stephen Warren <swarren@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-27 19:10:21 -08:00
Philippe De Muyter	86ee8ba64d	block/partition/msdos: detect AIX formatted disks even without 55aa AIX formatted disks do not always have the MSDOS 55aa signature. This happens e.g. for unbootable AIX disks. Up to now, such disks were not recognized as AIX disks, because of the missing 55aa. Fix that by inverting the two tests. Let's first check for the AIX magic strings, and only if that fails check for the MSDOS magic word. Signed-off-by: Philippe De Muyter <phdm@macqel.be> Cc: Andreas Mohr <andi@lisas.de> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Jens Axboe <axboe@kernel.dk> Cc: Olaf Hering <olh@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-27 19:10:21 -08:00
Tejun Heo	bab998d62f	block: convert to idr_alloc() Convert to the much saner new idr interface. Both bsg and genhd protect idr w/ mutex making preloading unnecessary. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-27 19:10:15 -08:00
Tejun Heo	ce23bba842	block: fix synchronization and limit check in blk_alloc_devt() idr allocation in blk_alloc_devt() wasn't synchronized against lookup and removal, and its limit check was off by one - 1 << MINORBITS is the number of minors allowed, not the maximum allowed minor. Add locking and rename MAX_EXT_DEVT to NR_EXT_DEVT and fix limit checking. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Jens Axboe <axboe@kernel.dk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-27 19:10:14 -08:00
Tomas Henzl	7b74e91278	block: fix ext_devt_idr handling While adding and removing a lot of disks disks and partitions this sometimes shows up: WARNING: at fs/sysfs/dir.c:512 sysfs_add_one+0xc9/0x130() (Not tainted) Hardware name: sysfs: cannot create duplicate filename '/dev/block/259:751' Modules linked in: raid1 autofs4 bnx2fc cnic uio fcoe libfcoe libfc 8021q scsi_transport_fc scsi_tgt garp stp llc sunrpc cpufreq_ondemand powernow_k8 freq_table mperf ipv6 dm_mirror dm_region_hash dm_log power_meter microcode dcdbas serio_raw amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core k10temp bnx2 sg ixgbe dca mdio ext4 mbcache jbd2 dm_round_robin sr_mod cdrom sd_mod crc_t10dif ata_generic pata_acpi pata_atiixp ahci mptsas mptscsih mptbase scsi_transport_sas dm_multipath dm_mod [last unloaded: scsi_wait_scan] Pid: 44103, comm: async/16 Not tainted 2.6.32-195.el6.x86_64 #1 Call Trace: warn_slowpath_common+0x87/0xc0 warn_slowpath_fmt+0x46/0x50 sysfs_add_one+0xc9/0x130 sysfs_do_create_link+0x12b/0x170 sysfs_create_link+0x13/0x20 device_add+0x317/0x650 idr_get_new+0x13/0x50 add_partition+0x21c/0x390 rescan_partitions+0x32b/0x470 sd_open+0x81/0x1f0 [sd_mod] __blkdev_get+0x1b6/0x3c0 blkdev_get+0x10/0x20 register_disk+0x155/0x170 add_disk+0xa6/0x160 sd_probe_async+0x13b/0x210 [sd_mod] add_wait_queue+0x46/0x60 async_thread+0x102/0x250 default_wake_function+0x0/0x20 async_thread+0x0/0x250 kthread+0x96/0xa0 child_rip+0xa/0x20 kthread+0x0/0xa0 child_rip+0x0/0x20 This most likely happens because dev_t is freed while the number is still used and idr_get_new() is not protected on every use. The fix adds a mutex where it wasn't before and moves the dev_t free function so it is called after device del. Signed-off-by: Tomas Henzl <thenzl@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-27 19:10:12 -08:00
Ming Lei	25e823c8c3	block/genhd.c: apply pm_runtime_set_memalloc_noio on block devices Apply the introduced pm_runtime_set_memalloc_noio on block device so that PM core will teach mm to not allocate memory with GFP_IOFS when calling the runtime_resume and runtime_suspend callback for block devices and its ancestors. Signed-off-by: Ming Lei <ming.lei@canonical.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Oliver Neukum <oneukum@suse.de> Cc: Jiri Kosina <jiri.kosina@suse.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Greg KH <greg@kroah.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Decotigny <david.decotigny@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-23 17:50:16 -08:00
Glauber Costa	a3cc86c2f0	cfq: fix lock imbalance with failed allocations While stress-running very-small container scenarios with the Kernel Memory Controller, I've run into a lockdep-detected lock imbalance in cfq-iosched.c. I'll apologize beforehand for not posting a backlog: I didn't anticipate it would be so hard to reproduce, so I didn't save my serial output and went directly on debugging. Turns out that it did not happen again in more than 20 runs, making it a quite rare pattern. But here is my analysis: When we are in very low-memory situations, we will arrive at cfq_find_alloc_queue and may not find a queue, having to resort to the oom queue, in an rcu-locked condition: if (!cfqq \|\| cfqq == &cfqd->oom_cfqq) [ ... ] Next, we will release the rcu lock, and try to allocate a queue, retrying if we succeed: rcu_read_unlock(); spin_unlock_irq(cfqd->queue->queue_lock); new_cfqq = kmem_cache_alloc_node(cfq_pool, gfp_mask \| __GFP_ZERO, cfqd->queue->node); spin_lock_irq(cfqd->queue->queue_lock); if (new_cfqq) goto retry; We are unlocked at this point, but it should be fine, since we will reacquire the rcu_read_lock when we retry. Except of course, that we may not retry: the allocation may very well fail and we'll keep on going through the flow: The next branch is: if (cfqq) { [ ... ] } else cfqq = &cfqd->oom_cfqq; And right before exiting, we'll issue rcu_read_unlock(). Being already unlocked, this is the likely source of our imbalance. Since cfqq is either already NULL or made NULL in the first statement of the outter branch, the only viable alternative here seems to be to return the oom queue right away in case of allocation failure. Please review the following patch and apply if you agree with my analysis. Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-02-22 10:42:46 +01:00
Mikulas Patocka	79d0b7f0e3	block: don't select PERCPU_RWSEM The block device doesn't use percpu rw-semaphore anymore, so don't select it for compilation. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-02-22 10:42:45 +01:00
Darrick J. Wong	ffecfd1a72	block: optionally snapshot page contents to provide stable pages during write This provides a band-aid to provide stable page writes on jbd without needing to backport the fixed locking and page writeback bit handling schemes of jbd2. The band-aid works by using bounce buffers to snapshot page contents instead of waiting. For those wondering about the ext3 bandage -- fixing the jbd locking (which was done as part of ext4dev years ago) is a lot of surgery, and setting PG_writeback on data pages when we actually hold the page lock dropped ext3 performance by nearly an order of magnitude. If we're going to migrate iscsi and raid to use stable page writes, the complaints about high latency will likely return. We might as well centralize their page snapshotting thing to one place. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Tested-by: Andy Lutomirski <luto@amacapital.net> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Artem Bityutskiy <dedekind1@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-21 17:22:20 -08:00
Darrick J. Wong	7d311cdab6	bdi: allow block devices to say that they require stable page writes This patchset ("stable page writes, part 2") makes some key modifications to the original 'stable page writes' patchset. First, it provides creators (devices and filesystems) of a backing_dev_info a flag that declares whether or not it is necessary to ensure that page contents cannot change during writeout. It is no longer assumed that this is true of all devices (which was never true anyway). Second, the flag is used to relaxed the wait_on_page_writeback calls so that wait only occurs if the device needs it. Third, it fixes up the remaining disk-backed filesystems to use this improved conditional-wait logic to provide stable page writes on those filesystems. It is hoped that (for people not using checksumming devices, anyway) this patchset will give back unnecessary performance decreases since the original stable page write patchset went into 3.0. Sorry about not fixing it sooner. Complaints were registered by several people about the long write latencies introduced by the original stable page write patchset. Generally speaking, the kernel ought to allocate as little extra memory as possible to facilitate writeout, but for people who simply cannot wait, a second page stability strategy is (re)introduced: snapshotting page contents. The waiting behavior is still the default strategy; to enable page snapshotting, a superblock flag (MS_SNAP_STABLE) must be set. This flag is used to bandaid^Henable stable page writeback on ext3[1], and is not used anywhere else. Given that there are already a few storage devices and network FSes that have rolled their own page stability wait/page snapshot code, it would be nice to move towards consolidating all of these. It seems possible that iscsi and raid5 may wish to use the new stable page write support to enable zero-copy writeout. Thank you to Jan Kara for helping fix a couple more filesystems. Per Andrew Morton's request, here are the result of using dbench to measure latencies on ext2: 3.8.0-rc3: Operation Count AvgLat MaxLat ---------------------------------------- WriteX 109347 0.028 59.817 ReadX 347180 0.004 3.391 Flush 15514 29.828 287.283 Throughput 57.429 MB/sec 4 clients 4 procs max_latency=287.290 ms 3.8.0-rc3 + patches: WriteX 105556 0.029 4.273 ReadX 335004 0.005 4.112 Flush 14982 30.540 298.634 Throughput 55.4496 MB/sec 4 clients 4 procs max_latency=298.650 ms As you can see, for ext2 the maximum write latency decreases from ~60ms on a laptop hard disk to ~4ms. I'm not sure why the flush latencies increase, though I suspect that being able to dirty pages faster gives the flusher more work to do. On ext4, the average write latency decreases as well as all the maximum latencies: 3.8.0-rc3: WriteX 85624 0.152 33.078 ReadX 272090 0.010 61.210 Flush 12129 36.219 168.260 Throughput 44.8618 MB/sec 4 clients 4 procs max_latency=168.276 ms 3.8.0-rc3 + patches: WriteX 86082 0.141 30.928 ReadX 273358 0.010 36.124 Flush 12214 34.800 165.689 Throughput 44.9941 MB/sec 4 clients 4 procs max_latency=165.722 ms XFS seems to exhibit similar latency improvements as ext2: 3.8.0-rc3: WriteX 125739 0.028 104.343 ReadX 399070 0.005 4.115 Flush 17851 25.004 131.390 Throughput 66.0024 MB/sec 4 clients 4 procs max_latency=131.406 ms 3.8.0-rc3 + patches: WriteX 123529 0.028 6.299 ReadX 392434 0.005 4.287 Flush 17549 25.120 188.687 Throughput 64.9113 MB/sec 4 clients 4 procs max_latency=188.704 ms ...and btrfs, just to round things out, also shows some latency decreases: 3.8.0-rc3: WriteX 67122 0.083 82.355 ReadX 212719 0.005 2.828 Flush 9547 47.561 147.418 Throughput 35.3391 MB/sec 4 clients 4 procs max_latency=147.433 ms 3.8.0-rc3 + patches: WriteX 64898 0.101 71.631 ReadX 206673 0.005 7.123 Flush 9190 47.963 219.034 Throughput 34.0795 MB/sec 4 clients 4 procs max_latency=219.044 ms Before this patchset, all filesystems would block, regardless of whether or not it was necessary. ext3 would wait, but still generate occasional checksum errors. The network filesystems were left to do their own thing, so they'd wait too. After this patchset, all the disk filesystems except ext3 and btrfs will wait only if the hardware requires it. ext3 (if necessary) snapshots pages instead of blocking, and btrfs provides its own bdi so the mm will never wait. Network filesystems haven't been touched, so either they provide their own wait code, or they don't block at all. The blocking behavior is back to what it was before 3.0 if you don't have a disk requiring stable page writes. This patchset has been tested on 3.8.0-rc3 on x64 with ext3, ext4, and xfs. I've spot-checked 3.8.0-rc4 and seem to be getting the same results as -rc3. [1] The alternative fixes to ext3 include fixing the locking order and page bit handling like we did for ext4 (but then why not just use ext4?), or setting PG_writeback so early that ext3 becomes extremely slow. I tried that, but the number of write()s I could initiate dropped by nearly an order of magnitude. That was a bit much even for the author of the stable page series! :) This patch: Creates a per-backing-device flag that tracks whether or not pages must be held immutable during writeout. Eventually it will be used to waive wait_for_page_writeback() if nothing requires stable pages. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Artem Bityutskiy <dedekind1@gmail.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-21 17:22:19 -08:00
Linus Torvalds	ece8e0b2f9	Merge branch 'for-3.9-async' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull async changes from Tejun Heo: "These are followups for the earlier deadlock issue involving async ending up waiting for itself through block requesting module[1]. The following changes are made by these commits. - Instead of requesting default elevator on each request_queue init, block now requests it once early during boot. - Kmod triggers warning if invoked from an async worker. - Async synchronization implementation has been reimplemented. It's a lot simpler now." * 'for-3.9-async' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: async: initialise list heads to fix crash async: replace list of active domains with global list of pending items async: keep pending tasks on async_domain and remove async_pending async: use ULLONG_MAX for infinity cookie value async: bring sanity to the use of words domain and running async, kmod: warn on synchronous request_module() from async workers block: don't request module during elevator init init, block: try to load default elevator module early during boot	2013-02-19 22:10:26 -08:00
Linus Torvalds	d652e1eb8e	Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler changes from Ingo Molnar: "Main changes: - scheduler side full-dynticks (user-space execution is undisturbed and receives no timer IRQs) preparation changes that convert the cputime accounting code to be full-dynticks ready, from Frederic Weisbecker. - Initial sched.h split-up changes, by Clark Williams - select_idle_sibling() performance improvement by Mike Galbraith: " 1 tbench pair (worst case) in a 10 core + SMT package: pre 15.22 MB/sec 1 procs post 252.01 MB/sec 1 procs " - sched_rr_get_interval() ABI fix/change. We think this detail is not used by apps (so it's not an ABI in practice), but lets keep it under observation. - misc RT scheduling cleanups, optimizations" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits) sched/rt: Add <linux/sched/rt.h> header to <linux/init_task.h> cputime: Remove irqsave from seqlock readers sched, powerpc: Fix sched.h split-up build failure cputime: Restore CPU_ACCOUNTING config defaults for PPC64 sched/rt: Move rt specific bits into new header file sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice sched: Move sched.h sysctl bits into separate header sched: Fix signedness bug in yield_to() sched: Fix select_idle_sibling() bouncing cow syndrome sched/rt: Further simplify pick_rt_task() sched/rt: Do not account zero delta_exec in update_curr_rt() cputime: Safely read cputime of full dynticks CPUs kvm: Prepare to add generic guest entry/exit callbacks cputime: Use accessors to read task cputime stats cputime: Allow dynamic switch between tick/virtual based cputime accounting cputime: Generic on-demand virtual cputime accounting cputime: Move default nsecs_to_cputime() to jiffies based cputime file cputime: Librarize per nsecs resolution cputime definitions cputime: Avoid multiplication overflow on utime scaling context_tracking: Export context state for generic vtime ... Fix up conflict in kernel/context_tracking.c due to comment additions.	2013-02-19 18:19:48 -08:00
Vladimir Davydov	5577022f4e	block: account iowait time when waiting for completion of IO request Using wait_for_completion() for waiting for a IO request to be executed results in wrong iowait time accounting. For example, a system having the only task doing write() and fdatasync() on a block device can be reported being idle instead of iowaiting as it should because blkdev_issue_flush() calls wait_for_completion() which in turn calls schedule() that does not increment the iowait proc counter and thus does not turn on iowait time accounting. The patch makes block layer use wait_for_completion_io() instead of wait_for_completion() where appropriate to account iowait time correctly. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-02-15 16:45:07 +01:00
Clark Williams	cf4aebc292	sched: Move sched.h sysctl bits into separate header Move the sysctl-related bits from include/linux/sched.h into a new file: include/linux/sched/sysctl.h. Then update source files requiring access to those bits by including the new header file. Signed-off-by: Clark Williams <williams@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20130207094659.06dced96@riff.lan Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-02-07 20:50:54 +01:00
Tejun Heo	21c3c5d280	block: don't request module during elevator init Block layer allows selecting an elevator which is built as a module to be selected as system default via kernel param "elevator=". This is achieved by automatically invoking request_module() whenever a new block device is initialized and the elevator is not available. This led to an interesting deadlock problem involving async and module init. Block device probing running off an async job invokes request_module(). While the module is being loaded, it performs async_synchronize_full() which ends up waiting for the async job which is already waiting for request_module() to finish, leading to deadlock. Invoking request_module() from deep in block device init path is already nasty in itself. It seems best to avoid these situations from the beginning by moving on-demand module loading out of block init path. The previous patch made sure that the default elevator module is loaded early during boot if available. This patch removes on-demand loading of the default elevator from elevator init path. As the module would have been loaded during boot, userland-visible behavior difference should be minimal. For more details, please refer to the following thread. http://thread.gmane.org/gmane.linux.kernel/1420814 v2: The bool parameter was named @request_module which conflicted with request_module(). This built okay w/ CONFIG_MODULES because request_module() was defined as a macro. W/o CONFIG_MODULES, it causes build breakage. Rename the parameter to @try_loading. Reported by Fengguang. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Alex Riesen <raa.lkml@gmail.com> Cc: Fengguang Wu <fengguang.wu@intel.com>	2013-01-22 16:48:03 -08:00
Tejun Heo	bb813f4c93	init, block: try to load default elevator module early during boot This patch adds default module loading and uses it to load the default block elevator. During boot, it's called right after initramfs or initrd is made available and right before control is passed to userland. This ensures that as long as the modules are available in the usual places in initramfs, initrd or the root filesystem, the default modules are loaded as soon as possible. This will replace the on-demand elevator module loading from elevator init path. v2: Fixed build breakage when !CONFIG_BLOCK. Reported by kbuild test robot. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Alex Riesen <raa.lkml@gmail.com> Cc: Fengguang We <fengguang.wu@intel.com>	2013-01-18 14:05:56 -08:00
Tejun Heo	8c1cf6bb02	block: add @req to bio_{front\|back}_merge tracepoints bio_{front\|back}_merge tracepoints report a bio merging into an existing request but didn't specify which request the bio is being merged into. Add @req to it. This makes it impossible to share the event template with block_bio_queue - split it out. @req isn't used or exported to userland at this point and there is no userland visible behavior change. Later changes will make use of the extra parameter. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-01-14 15:00:36 +01:00
Tejun Heo	3a366e614d	block: add missing block_bio_complete() tracepoint bio completion didn't kick block_bio_complete TP. Only dm was explicitly triggering the TP on IO completion. This makes block_bio_complete TP useless for tracers which want to know about bios, and all other bio based drivers skip generating blktrace completion events. This patch makes all bio completions via bio_endio() generate block_bio_complete TP. * Explicit trace_block_bio_complete() invocation removed from dm and the trace point is unexported. * @rq dropped from trace_block_bio_complete(). bios may fly around w/o queue associated. Verifying and accessing the assocaited queue belongs to TP probes. * blktrace now gets both request and bio completions. Make it ignore bio completions if request completion path is happening. This makes all bio based drivers generate blktrace completion events properly and makes the block_bio_complete TP actually useful. v2: With this change, block_bio_complete TP could be invoked on sg commands which have bio's with %NULL bi_bdev. Update TP assignment code to check whether bio->bi_bdev is %NULL before dereferencing. Signed-off-by: Tejun Heo <tj@kernel.org> Original-patch-by: Namhyung Kim <namhyung@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Alasdair Kergon <agk@redhat.com> Cc: dm-devel@redhat.com Cc: Neil Brown <neilb@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-01-14 15:00:36 +01:00
Jens Axboe	ac9a197451	Merge branch 'blkcg-cfq-hierarchy' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup into for-3.9/core Tejun writes: Hello, Jens. Please consider pulling from the following branch to receive cfq blkcg hierarchy support. The branch is based on top of v3.8-rc2. git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git blkcg-cfq-hierarchy The patchset was reviewd in the following thread. http://thread.gmane.org/gmane.linux.kernel.cgroups/5571	2013-01-11 19:53:53 +01:00
Jianpeng Ma	422765c263	block: Remove should_sort judgement when flush blk_plug In commit 975927b942c932,it add blk_rq_pos to sort rq when flushing. Although this commit was used for the situation which blk_plug handled multi devices on the same time like md device. I think there must be some situations like this but only single device. So remove the should_sort judgement. Because the parameter should_sort is only for this purpose,it can delete should_sort from blk_plug. CC: Shaohua Li <shli@kernel.org> Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-01-11 14:46:09 +01:00
Sasha Levin	242d98f077	block,elevator: use new hashtable implementation Switch elevator to use the new hashtable implementation. This reduces the amount of generic unrelated code in the elevator. This also removes the dymanic allocation of the hash table. The size of the table is constant so there's no point in paying the price of an extra dereference when accessing it. This patch depends on `d9b482c` ("hashtable: introduce a small and naive hashtable") which was merged in v3.6. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-01-11 14:43:13 +01:00
Tejun Heo	43114018cb	cfq-iosched: add hierarchical cfq_group statistics Unfortunately, at this point, there's no way to make the existing statistics hierarchical without creating nasty surprises for the existing users. Just create recursive counterpart of the existing stats. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-01-09 08:05:13 -08:00
Tejun Heo	0b39920b5f	cfq-iosched: collect stats from dead cfqgs To support hierarchical stats, it's necessary to remember stats from dead children. Add cfqg->dead_stats and make a dying cfqg transfer its stats to the parent's dead-stats. The transfer happens form ->pd_offline_fn() and it is possible that there are some residual IOs completing afterwards. Currently, we lose these stats. Given that cgroup removal isn't a very high frequency operation and the amount of residual IOs on offline are likely to be nil or small, this shouldn't be a big deal and the complexity needed to handle residual IOs - another callback and rather elaborate synchronization to reach and lock the matching q - doesn't seem justified. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-01-09 08:05:13 -08:00
Tejun Heo	689665af44	cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats() Separate out cfqg_stats_reset() which takes struct cfqg_stats * from cfq_pd_reset_stats() and move the latter to where other pd methods are defined. cfqg_stats_reset() will be used to implement hierarchical stats. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-01-09 08:05:13 -08:00
Tejun Heo	810ecfa765	blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock Instead of holding blkcg->lock while walking ->blkg_list and executing prfill(), RCU walk ->blkg_list and hold the blkg's queue lock while executing prfill(). This makes prfill() implementations easier as stats are mostly protected by queue lock. This will be used to implement hierarchical stats. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-01-09 08:05:13 -08:00
Tejun Heo	548bc8e1b3	block: RCU free request_queue RCU free request_queue so that blkcg_gq->q can be dereferenced under RCU lock. This will be used to implement hierarchical stats. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-01-09 08:05:13 -08:00
Tejun Heo	16b3de6652	blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge() Implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge(). The former two collect the [rw]stats designated by the target policy data and offset from the pd's subtree. The latter two add one [rw]stat to another. Note that the recursive sum functions require the queue lock to be held on entry to make blkg online test reliable. This is necessary to properly handle stats of a dying blkg. These will be used to implement hierarchical stats. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-01-09 08:05:12 -08:00
Tejun Heo	b50da39f51	blkcg: export __blkg_prfill_rwstat() Hierarchical stats for cfq-iosched will need __blkg_prfill_rwstat(). Export it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Fengguang Wu <fengguang.wu@intel.com>	2013-01-09 08:05:12 -08:00
Tejun Heo	4d5e80a760	blkcg: s/blkg_rwstat_sum()/blkg_rwstat_total()/ Rename blkg_rwstat_sum() to blkg_rwstat_total(). sum will be used for summing up stats from multiple blkgs. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-01-09 08:05:12 -08:00
Tejun Heo	f427d90964	blkcg: implement blkcg_policy->on/offline_pd_fn() and blkcg_gq->online Add two blkcg_policy methods, ->online_pd_fn() and ->offline_pd_fn(), which are invoked as the policy_data gets activated and deactivated while holding both blkcg and q locks. Also, add blkcg_gq->online bool, which is set and cleared as the blkcg_gq gets activated and deactivated. This flag also is toggled while holding both blkcg and q locks. These will be used to implement hierarchical stats. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-01-09 08:05:12 -08:00
Tejun Heo	b276a876a0	blkcg: add blkg_policy_data->plid Add pd->plid so that the policy a pd belongs to can be identified easily. This will be used to implement hierarchical blkg_[rw]stats. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-01-09 08:05:12 -08:00
Tejun Heo	d02f7aa8dc	cfq-iosched: enable full blkcg hierarchy support With the previous two patches, all cfqg scheduling decisions are based on vfraction and ready for hierarchy support. The only thing which keeps the behavior flat is cfqg_flat_parent() which makes vfraction calculation consider all non-root cfqgs children of the root cfqg. Replace it with cfqg_parent() which returns the real parent. This enables full blkcg hierarchy support for cfq-iosched. For example, consider the following hierarchy. root / \ A:500 B:250 / \ AA:500 AB:1000 For simplicity, let's say all the leaf nodes have active tasks and are on service tree. For each leaf node, vfraction would be AA: (500 / 1500) * (500 / 750) =~ 0.2222 AB: (1000 / 1500) * (500 / 750) =~ 0.4444 B: (250 / 750) =~ 0.3333 and vdisktime will be distributed accordingly. For more detail, please refer to Documentation/block/cfq-iosched.txt. v2: cfq-iosched.txt updated to describe group scheduling as suggested by Vivek. v3: blkio-controller.txt updated. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-01-09 08:05:11 -08:00
Tejun Heo	41cad6ab2c	cfq-iosched: convert cfq_group_slice() to use cfqg->vfraction cfq_group_slice() calculates slice by taking a fraction of cfq_target_latency according to the ratio of cfqg->weight against service_tree->total_weight. This currently works only because all cfqgs are treated to be at the same level. To prepare for proper hierarchy support, convert cfq_group_slice() to base the calculation on cfqg->vfraction. As cfqg->vfraction is always a fraction of 1 and represents the fraction allocated to the cfqg with hierarchy considered, the slice can be simply calculated by multiplying cfqg->vfraction to cfq_target_latency (with fixed point shift factored in). As vfraction calculation currently treats all non-root cfqgs as children of the root cfqg, this patch doesn't introduce noticeable behavior difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-01-09 08:05:11 -08:00
Tejun Heo	1d3650f713	cfq-iosched: implement hierarchy-ready cfq_group charge scaling Currently, cfqg charges are scaled directly according to cfqg->weight. Regardless of the number of active cfqgs or the amount of active weights, a given weight value always scales charge the same way. This works fine as long as all cfqgs are treated equally regardless of their positions in the hierarchy, which is what cfq currently implements. It can't work in hierarchical settings because the interpretation of a given weight value depends on where the weight is located in the hierarchy. This patch reimplements cfqg charge scaling so that it can be used to support hierarchy properly. The scheme is fairly simple and light-weight. * When a cfqg is added to the service tree, v(disktime)weight is calculated. It walks up the tree to root calculating the fraction it has in the hierarchy. At each level, the fraction can be calculated as cfqg->weight / parent->level_weight By compounding these, the global fraction of vdisktime the cfqg has claim to - vfraction - can be determined. * When the cfqg needs to be charged, the charge is scaled inversely proportionally to the vfraction. The new scaling scheme uses the same CFQ_SERVICE_SHIFT for fixed point representation as before; however, the smallest scaling factor is now 1 (ie. 1 << CFQ_SERVICE_SHIFT). This is different from before where 1 was for CFQ_WEIGHT_DEFAULT and higher weight would result in smaller scaling factor. While this shifts the global scale of vdisktime a bit, it doesn't change the relative relationships among cfqgs and the scheduling result isn't different. cfq_group_notify_queue_add uses fixed CFQ_IDLE_DELAY when appending new cfqg to the service tree. The specific value of CFQ_IDLE_DELAY didn't have any relevance to vdisktime before and is unlikely to cause any visible behavior difference now especially as the scale shift isn't that large. As the new scheme now makes proper distinction between cfqg->weight and ->leaf_weight, reverse the weight aliasing for root cfqgs. For root, both weights are now mapped to ->leaf_weight instead of the other way around. Because we're still using cfqg_flat_parent(), this patch shouldn't change the scheduling behavior in any noticeable way. v2: Beefed up comments on vfraction as requested by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-01-09 08:05:11 -08:00
Tejun Heo	7918ffb5b8	cfq-iosched: implement cfq_group->nr_active and ->children_weight To prepare for blkcg hierarchy support, add cfqg->nr_active and ->children_weight. cfqg->nr_active counts the number of active cfqgs at the cfqg's level and ->children_weight is sum of weights of those cfqgs. The level covers itself (cfqg->leaf_weight) and immediate children. The two values are updated when a cfqg enters and leaves the group service tree. Unless the hierarchy is very deep, the added overhead should be negligible. Currently, the parent is determined using cfqg_flat_parent() which makes the root cfqg the parent of all other cfqgs. This is to make the transition to hierarchy-aware scheduling gradual. Scheduling logic will be converted to use cfqg->children_weight without actually changing the behavior. When everything is ready, blkcg_weight_parent() will be replaced with proper parent function. This patch doesn't introduce any behavior chagne. v2: s/cfqg->level_weight/cfqg->children_weight/ as per Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-01-09 08:05:11 -08:00
Tejun Heo	e71357e118	cfq-iosched: add leaf_weight cfq blkcg is about to grow proper hierarchy handling, where a child blkg's weight would nest inside the parent's. This makes tasks in a blkg to compete against both tasks in the sibling blkgs and the tasks of child blkgs. We're gonna use the existing weight as the group weight which decides the blkg's weight against its siblings. This patch introduces a new weight - leaf_weight - which decides the weight of a blkg against the child blkgs. It's named leaf_weight because another way to look at it is that each internal blkg nodes have a hidden child leaf node which contains all its tasks and leaf_weight is the weight of the leaf node and handled the same as the weight of the child blkgs. This patch only adds leaf_weight fields and exposes it to userland. The new weight isn't actually used anywhere yet. Note that cfq-iosched currently offcially supports only single level hierarchy and root blkgs compete with the first level blkgs - ie. root weight is basically being used as leaf_weight. For root blkgs, the two weights are kept in sync for backward compatibility. v2: cfqd->root_group->leaf_weight initialization was missing from cfq_init_queue() causing divide by zero when !CONFIG_CFQ_GROUP_SCHED. Fix it. Reported by Fengguang. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Fengguang Wu <fengguang.wu@intel.com>	2013-01-09 08:05:10 -08:00
Tejun Heo	3c54786590	blkcg: make blkcg_gq's hierarchical Currently a child blkg (blkcg_gq) can be created even if its parent doesn't exist. ie. Given a blkg, it's not guaranteed that its ancestors will exist. This makes it difficult to implement proper hierarchy support for blkcg policies. Always create blkgs recursively and make a child blkg hold a reference to its parent. blkg->parent is added so that finding the parent is easy. blkcg_parent() is also added in the process. This change can be visible to userland. e.g. while issuing IO in a nested cgroup didn't affect the ancestors at all, now it will initialize all ancestor blkgs and zero stats for the request_queue will always appear on them. While this is userland visible, this shouldn't cause any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-01-09 08:05:10 -08:00
Tejun Heo	93e6d5d8f5	blkcg: cosmetic updates to blkg_create() * Rename out_* labels to err_. Do ERR_PTR() conversion once in the error return path. This patch is cosmetic and to prepare for the hierarchy support. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-01-09 08:05:10 -08:00
Tejun Heo	86cde6b623	blkcg: reorganize blkg_lookup_create() and friends Reorganize such that * __blkg_lookup() takes bool param @update_hint to determine whether to update hint. * __blkg_lookup_create() no longer performs lookup before trying to create. Renamed to blkg_create(). * blkg_lookup_create() now performs lookup and then invokes blkg_create() if lookup fails. * root_blkg creation in blkcg_activate_policy() updated accordingly. Note that blkcg_activate_policy() no longer updates lookup hint if root_blkg already exists. Except for the last lookup hint bit which is immaterial, this is pure reorganization and doesn't introduce any visible behavior change. This is to prepare for proper hierarchy support. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-01-09 08:05:10 -08:00
Tejun Heo	356d2e5810	blkcg: fix minor bug in blkg_alloc() blkg_alloc() was mistakenly checking blkcg_policy_enabled() twice. The latter test should have been on whether pol->pd_init_fn() exists. This doesn't cause actual problems because both blkcg policies implement pol->pd_init_fn(). Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-01-09 08:05:10 -08:00
Vivek Goyal	b226e5c411	cfq-iosched: Print sync-noidle information in blktrace messages Currently we attach a character "S" or "A" to the cfqq<pid>, to represent whether queues is sync or async. Add one more character "N" to represent whether it is sync-noidle queue or sync queue. So now three different type of queues will look as follows. cfq1234S --> sync queus cfq1234SN --> sync noidle queue cfq1234A --> Async queue Previously S/A classification was being printed only if group scheduling was enabled. This patch also makes sure that this classification is displayed even if group idling is disabled. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-01-09 08:05:09 -08:00
Vivek Goyal	1f23f12151	cfq-iosched: Get rid of unnecessary local variable Use of local varibale "n" seems to be unnecessary. Remove it. This brings it inline with function __cfq_group_st_add(), which is also doing the similar operation of adding a group to a rb tree. No functionality change here. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-01-09 08:05:09 -08:00

1 2 3 4 5 ...

1983 Commits