kernel-ark

Author	SHA1	Message	Date
Shaohua Li	4a0b75c7d0	block, cfq: fix empty queue crash caused by request merge All requests of a queue could be merged to other requests of other queue. Such queue will not have request in it, but it's in service tree. This will cause kernel oops. I encounter a BUG_ON() in cfq_dispatch_request() with next patch, but the issue should exist without the patch. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-16 14:00:22 +01:00
Tejun Heo	4eabc94125	block: don't kick empty queue in blk_drain_queue() While probing, fd sets up queue, probes hardware and tears down the queue if probing fails. In the process, blk_drain_queue() kicks the queue which failed to finish initialization and fd is unhappy about that. floppy0: no floppy controllers found ------------[ cut here ]------------ WARNING: at drivers/block/floppy.c:2929 do_fd_request+0xbf/0xd0() Hardware name: To Be Filled By O.E.M. VFS: do_fd_request called on non-open device Modules linked in: Pid: 1, comm: swapper Not tainted 3.2.0-rc4-00077-g5983fe2 #2 Call Trace: [<ffffffff81039a6a>] warn_slowpath_common+0x7a/0xb0 [<ffffffff81039b41>] warn_slowpath_fmt+0x41/0x50 [<ffffffff813d657f>] do_fd_request+0xbf/0xd0 [<ffffffff81322b95>] blk_drain_queue+0x65/0x80 [<ffffffff81322c93>] blk_cleanup_queue+0xe3/0x1a0 [<ffffffff818a809d>] floppy_init+0xdeb/0xe28 [<ffffffff818a72b2>] ? daring+0x6b/0x6b [<ffffffff810002af>] do_one_initcall+0x3f/0x170 [<ffffffff81884b34>] kernel_init+0x9d/0x11e [<ffffffff810317c2>] ? schedule_tail+0x22/0xa0 [<ffffffff815dbb14>] kernel_thread_helper+0x4/0x10 [<ffffffff81884a97>] ? start_kernel+0x2be/0x2be [<ffffffff815dbb10>] ? gs_change+0xb/0xb Avoid it by making blk_drain_queue() kick queue iff dispatch queue has something on it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Ralf Hildebrandt <Ralf.Hildebrandt@charite.de> Reported-by: Wu Fengguang <fengguang.wu@intel.com> Tested-by: Sergei Trofimovich <slyich@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-15 20:03:04 +01:00
Tejun Heo	f1f8cc9465	block, cfq: move icq creation and rq->elv.icq association to block core Now block layer knows everything necessary to create and associate icq's with requests. Move ioc_create_icq() to blk-ioc.c and update get_request() such that, if elevator_type->icq_size is set, requests are automatically associated with their matching icq's before elv_set_request(). io_context reference is also managed by block core on request alloc/free. * Only ioprio/cgroup changed handling remains from cfq_get_cic(). Collapsed into cfq_set_request(). * This removes queue kicking on icq allocation failure (for now). As icq allocation failure is rare and the only effect of queue kicking achieved was possibily accelerating queue processing, this change shouldn't be noticeable. There is a larger underlying problem. Unlike request allocation, icq allocation is not guaranteed to succeed eventually after retries. The number of icq is unbound and thus mempool can't be the solution either. This effectively adds allocation dependency on memory free path and thus possibility of deadlock. This usually wouldn't happen because icq allocation is not a hot path and, even when the condition triggers, it's highly unlikely that none of the writeback workers already has icq. However, this is still possible especially if elevator is being switched under high memory pressure, so we better get it fixed. Probably the only solution is just bypassing elevator and appending to dispatch queue on any elevator allocation failure. * Comment added to explain how icq's are managed and synchronized. This completes cleanup of io_context interface. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:42 +01:00
Tejun Heo	9b84cacd01	block, cfq: restructure io_cq creation path for io_context interface cleanup Add elevator_ops->elevator_init_icq_fn() and restructure cfq_create_cic() and rename it to ioc_create_icq(). The new function expects its caller to pass in io_context, uses elevator_type->icq_cache, handles generic init, calls the new elevator operation for elevator specific initialization, and returns pointer to created or looked up icq. This leaves cfq_icq_pool variable without any user. Removed. This prepares for io_context interface cleanup and doesn't introduce any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:42 +01:00
Tejun Heo	7e5a879449	block, cfq: move io_cq exit/release to blk-ioc.c With kmem_cache managed by blk-ioc, io_cq exit/release can be moved to blk-ioc too. The odd ->io_cq->exit/release() callbacks are replaced with elevator_ops->elevator_exit_icq_fn() with unlinking from both ioc and q, and freeing automatically handled by blk-ioc. The elevator operation only need to perform exit operation specific to the elevator - in cfq's case, exiting the cfqq's. Also, clearing of io_cq's on q detach is moved to block core and automatically performed on elevator switch and q release. Because the q io_cq points to might be freed before RCU callback for the io_cq runs, blk-ioc code should remember to which cache the io_cq needs to be freed when the io_cq is released. New field io_cq->__rcu_icq_cache is added for this purpose. As both the new field and rcu_head are used only after io_cq is released and the q/ioc_node fields aren't, they are put into unions. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:42 +01:00
Tejun Heo	3d3c2379fe	block, cfq: move icq cache management to block core Let elevators set ->icq_size and ->icq_align in elevator_type and elv_register() and elv_unregister() respectively create and destroy kmem_cache for icq. * elv_register() now can return failure. All callers updated. * icq caches are automatically named "ELVNAME_io_cq". * cfq_slab_setup/kill() are collapsed into cfq_init/exit(). * While at it, minor indentation change for iosched_cfq.elevator_name for consistency. This will help moving icq management to block core. This doesn't introduce any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:42 +01:00
Tejun Heo	47fdd4ca96	block, cfq: move io_cq lookup to blk-ioc.c Now that all io_cq related data structures are in block core layer, io_cq lookup can be moved from cfq-iosched.c to blk-ioc.c. Lookup logic from cfq_cic_lookup() is moved to ioc_lookup_icq() with parameter return type changes (cfqd -> request_queue, cfq_io_cq -> io_cq) and cfq_cic_lookup() becomes thin wrapper around cfq_cic_lookup(). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:42 +01:00
Tejun Heo	a612fddf0d	block, cfq: move cfqd->icq_list to request_queue and add request->elv.icq Most of icq management is about to be moved out of cfq into blk-ioc. This patch prepares for it. * Move cfqd->icq_list to request_queue->icq_list * Make request explicitly point to icq instead of through elevator private data. ->elevator_private[3] is replaced with sub struct elv which contains icq pointer and priv[2]. cfq is updated accordingly. * Meaningless clearing of ->elevator_private[0] removed from elv_set_request(). At that point in code, the field was guaranteed to be %NULL anyway. This patch doesn't introduce any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:41 +01:00
Tejun Heo	c586980732	block, cfq: reorganize cfq_io_context into generic and cfq specific parts Currently io_context and cfq logics are mixed without clear boundary. Most of io_context is independent from cfq but cfq_io_context handling logic is dispersed between generic ioc code and cfq. cfq_io_context represents association between an io_context and a request_queue, which is a concept useful outside of cfq, but it also contains fields which are useful only to cfq. This patch takes out generic part and put it into io_cq (io context-queue) and the rest into cfq_io_cq (cic moniker remains the same) which contains io_cq. The following changes are made together. * cfq_ttime and cfq_io_cq now live in cfq-iosched.c. * All related fields, functions and constants are renamed accordingly. * ioc->ioc_data is now "struct io_cq " instead of "void " and renamed to icq_hint. This prepares for io_context API cleanup. Documentation is currently sparse. It will be added later. Changes in this patch are mechanical and don't cause functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:41 +01:00
Tejun Heo	22f746e235	block: remove elevator_queue->ops elevator_queue->ops points to the same ops struct ->elevator_type.ops is pointing to. The only effect of caching it in elevator_queue is shorter notation - it doesn't save any indirect derefence. Relocate elevator_type->list which used only during module init/exit to the end of the structure, rename elevator_queue->elevator_type to ->type, and replace elevator_queue->ops with elevator_queue->type.ops. This doesn't introduce any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:41 +01:00
Tejun Heo	f8fc877d3c	block: reorder elevator switch sequence Elevator switch sequence first attached the new elevator, then tried registering it (sysfs) and if that failed attached back the old elevator. However, sysfs registration doesn't require the elevator to be attached, so there is no reason to do the "detach, attach new, register, maybe re-attach old" sequence. It can just do "register, detach, attach". * elevator_init_queue() is updated to set ->elevator_data directly and return 0 / -errno. This allows elevator_exit() on an unattached elevator. * __elv_unregister_queue() which was necessary to unregister unattached q is removed in favor of __elv_register_queue() which can register unattached q. * elevator_attach() becomes a single assignment and obscures more then it helps. Dropped. This will help cleaning up io_context handling across elevator switch. This patch doesn't introduce visible behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:40 +01:00
Tejun Heo	f2dbd76a0a	block, cfq: replace current_io_context() with create_io_context() When called under queue_lock, current_io_context() triggers lockdep warning if it hits allocation path. This is because io_context installation is protected by task_lock which is not IRQ safe, so it triggers irq-unsafe-lock -> irq -> irq-safe-lock -> irq-unsafe-lock deadlock warning. Given the restriction, accessor + creator rolled into one doesn't work too well. Drop current_io_context() and let the users access task->io_context directly inside queue_lock combined with explicit creation using create_io_context(). Future ioc updates will further consolidate ioc access and the create interface will be unexported. While at it, relocate ioc internal interface declarations in blk.h and add section comments before and after. This patch does not introduce functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:40 +01:00
Tejun Heo	1238033c79	block, cfq: kill cic->key Now that lazy paths are removed, cfqd_dead_key() is meaningless and cic->q can be used whereever cic->key is used. Kill cic->key. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:40 +01:00
Tejun Heo	b50b636bce	block, cfq: kill ioc_gone Now that cic's are immediately unlinked under both locks, there's no need to count and drain cic's before module unload. RCU callback completion is waited with rcu_barrier(). While at it, remove residual RCU operations on cic_list. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:39 +01:00
Tejun Heo	b9a1920837	block, cfq: remove delayed unlink Now that all cic's are immediately unlinked from both ioc and queue, lazy dropping from lookup path and trimming on elevator unregister are unnecessary. Kill them and remove now unused elevator_ops->trim(). This also leaves call_for_each_cic() without any user. Removed. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:39 +01:00
Tejun Heo	b2efa05265	block, cfq: unlink cfq_io_context's immediately cic is association between io_context and request_queue. A cic is linked from both ioc and q and should be destroyed when either one goes away. As ioc and q both have their own locks, locking becomes a bit complex - both orders work for removal from one but not from the other. Currently, cfq tries to circumvent this locking order issue with RCU. ioc->lock nests inside queue_lock but the radix tree and cic's are also protected by RCU allowing either side to walk their lists without grabbing lock. This rather unconventional use of RCU quickly devolves into extremely fragile convolution. e.g. The following is from cfqd going away too soon after ioc and q exits raced. general protection fault: 0000 [#1] PREEMPT SMP CPU 2 Modules linked in: [ 88.503444] Pid: 599, comm: hexdump Not tainted 3.1.0-rc10-work+ #158 Bochs Bochs RIP: 0010:[<ffffffff81397628>] [<ffffffff81397628>] cfq_exit_single_io_context+0x58/0xf0 ... Call Trace: [<ffffffff81395a4a>] call_for_each_cic+0x5a/0x90 [<ffffffff81395ab5>] cfq_exit_io_context+0x15/0x20 [<ffffffff81389130>] exit_io_context+0x100/0x140 [<ffffffff81098a29>] do_exit+0x579/0x850 [<ffffffff81098d5b>] do_group_exit+0x5b/0xd0 [<ffffffff81098de7>] sys_exit_group+0x17/0x20 [<ffffffff81b02f2b>] system_call_fastpath+0x16/0x1b The only real hot path here is cic lookup during request initialization and avoiding extra locking requires very confined use of RCU. This patch makes cic removal from both ioc and request_queue perform double-locking and unlink immediately. * From q side, the change is almost trivial as ioc->lock nests inside queue_lock. It just needs to grab each ioc->lock as it walks cic_list and unlink it. * From ioc side, it's a bit more difficult because of inversed lock order. ioc needs its lock to walk its cic_list but can't grab the matching queue_lock and needs to perform unlock-relock dancing. Unlinking is now wholly done from put_io_context() and fast path is optimized by using the queue_lock the caller already holds, which is by far the most common case. If the ioc accessed multiple devices, it tries with trylock. In unlikely cases of fast path failure, it falls back to full double-locking dance from workqueue. Double-locking isn't the prettiest thing in the world but it's far simpler and more understandable than RCU trick without adding any meaningful overhead. This still leaves a lot of now unnecessary RCU logics. Future patches will trim them. -v2: Vivek pointed out that cic->q was being dereferenced after cic->release() was called. Updated to use local variable @this_q instead. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:39 +01:00
Tejun Heo	f1a4f4d35f	block, cfq: fix cic lookup locking * cfq_cic_lookup() may be called without queue_lock and multiple tasks can execute it simultaneously for the same shared ioc. Nothing prevents them racing each other and trying to drop the same dead cic entry multiple times. * smp_wmb() in cfq_exit_cic() doesn't really do anything and nothing prevents cfq_cic_lookup() seeing stale cic->key. This usually doesn't blow up because by the time cic is exited, all requests have been drained and new requests are terminated before going through elevator. However, it can still be triggered by plug merge path which doesn't grab queue_lock and thus can't check DEAD state reliably. This patch updates lookup locking such that, * Lookup is always performed under queue_lock. This doesn't add any more locking. The only issue is cfq_allow_merge() which can be called from plug merge path without holding any lock. For now, this is worked around by using cic of the request to merge into, which is guaranteed to have the same ioc. For longer term, I think it would be best to separate out plug merge method from regular one. * Spurious ioc->lock locking around cic lookup hint assignment dropped. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:39 +01:00
Tejun Heo	216284c352	block, cfq: fix race condition in cic creation path and tighten locking cfq_get_io_context() would fail if multiple tasks race to insert cic's for the same association. This patch restructures cfq_get_io_context() such that slow path insertion race is handled properly. Note that the restructuring also makes cfq_get_io_context() called under queue_lock and performs both ioc and cfqd insertions while holding both ioc and queue locks. This is part of on-going locking tightening and will be used to simplify synchronization rules. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:38 +01:00
Tejun Heo	dc86900e0a	block, cfq: move ioc ioprio/cgroup changed handling to cic ioprio/cgroup change was handled by marking the changed state in ioc and, on the following access to the ioc, performing RCU-protected iteration through all cic's grabbing the matching queue_lock. This patch moves the changed state to each cic. When ioprio or cgroup changes, the respective bit is set on all cic's of the ioc and when each of those cic (not ioc) is accessed, change is applied for that specific ioc-queue pair. This also fixes the following two race conditions between setting and clearing of changed states. * Missing barrier between assign/load of ioprio and ioprio_changed allowed applying old ioprio. * Change requests could happen between application of change and clearing of changed variables. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:38 +01:00
Tejun Heo	283287a52e	block, cfq: misc updates to cfq_io_context Make the following changes to prepare for ioc/cic management cleanup. * Add cic->q so that ioc can determine the associated queue without querying cfq. This will eventually replace ->key. * Factor out cfq_release_cic() from cic_free_func(). This function assumes that the caller handled locking. * Rename __cfq_exit_single_io_context() to cfq_exit_cic() and make it take only @cic. * Restructure cfq_cic_link() for future updates. This patch doesn't introduce any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:38 +01:00
Tejun Heo	09ac46c429	block: misc updates to blk_get_queue() * blk_get_queue() is peculiar in that it returns 0 on success and 1 on failure instead of 0 / -errno or boolean. Update it such that it returns %true on success and %false on failure. * Make sure the caller checks for the return value. * Separate out __blk_get_queue() which doesn't check whether @q is dead and put it in blk.h. This will be used later. This patch doesn't introduce any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:38 +01:00
Tejun Heo	6e736be7f2	block: make ioc get/put interface more conventional and fix race on alloction Ignoring copy_io() during fork, io_context can be allocated from two places - current_io_context() and set_task_ioprio(). The former is always called from local task while the latter can be called from different task. The synchornization between them are peculiar and dubious. * current_io_context() doesn't grab task_lock() and assumes that if it saw %NULL ->io_context, it would stay that way until allocation and assignment is complete. It has smp_wmb() between alloc/init and assignment. * set_task_ioprio() grabs task_lock() for assignment and does smp_read_barrier_depends() between "ioc = task->io_context" and "if (ioc)". Unfortunately, this doesn't achieve anything - the latter is not a dependent load of the former. ie, if ioc itself were being dereferenced "ioc->xxx", it would mean something (not sure what tho) but as the code currently stands, the dependent read barrier is noop. As only one of the the two test-assignment sequences is task_lock() protected, the task_lock() can't do much about race between the two. Nothing prevents current_io_context() and set_task_ioprio() allocating its own ioc for the same task and overwriting the other's. Also, set_task_ioprio() can race with exiting task and create a new ioc after exit_io_context() is finished. ioc get/put doesn't have any reason to be complex. The only hot path is accessing the existing ioc of %current, which is simple to achieve given that ->io_context is never destroyed as long as the task is alive. All other paths can happily go through task_lock() like all other task sub structures without impacting anything. This patch updates ioc get/put so that it becomes more conventional. * alloc_io_context() is replaced with get_task_io_context(). This is the only interface which can acquire access to ioc of another task. On return, the caller has an explicit reference to the object which should be put using put_io_context() afterwards. * The functionality of current_io_context() remains the same but when creating a new ioc, it shares the code path with get_task_io_context() and always goes through task_lock(). * get_io_context() now means incrementing ref on an ioc which the caller already has access to (be that an explicit refcnt or implicit %current one). * PF_EXITING inhibits creation of new io_context and once exit_io_context() is finished, it's guaranteed that both ioc acquisition functions return %NULL. * All users are updated. Most are trivial but smp_read_barrier_depends() removal from cfq_get_io_context() needs a bit of explanation. I suppose the original intention was to ensure ioc->ioprio is visible when set_task_ioprio() allocates new io_context and installs it; however, this wouldn't have worked because set_task_ioprio() doesn't have wmb between init and install. There are other problems with this which will be fixed in another patch. * While at it, use NUMA_NO_NODE instead of -1 for wildcard node specification. -v2: Vivek spotted contamination from debug patch. Removed. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:38 +01:00
Tejun Heo	42ec57a8f6	block: misc ioc cleanups * int return from put_io_context() wasn't used by anybody. Make it return void like other put functions and docbook-fy the function comment. * Reorder dummy declarations for !CONFIG_BLOCK case a bit. * Make alloc_ioc_context() use __GFP_ZERO allocation, take init out of if block and drop 0'ing. * Docbook-fy current_io_context() comment. This patch doesn't introduce any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:37 +01:00
Tejun Heo	a73f730d01	block, cfq: move cfqd->cic_index to q->id cfq allocates per-queue id using ida and uses it to index cic radix tree from io_context. Move it to q->id and allocate on queue init and free on queue release. This simplifies cfq a bit and will allow for further improvements of io context life-cycle management. This patch doesn't introduce any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:37 +01:00
Tejun Heo	8ba61435d7	block: add missing blk_queue_dead() checks blk_insert_cloned_request(), blk_execute_rq_nowait() and blk_flush_plug_list() either didn't check whether the queue was dead or did it without holding queue_lock. Update them so that dead state is checked while holding queue_lock. AFAICS, this plugs all holes (requeue doesn't matter as the request is transitioning atomically from in_flight to queued). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:37 +01:00
Tejun Heo	481a7d6479	block: fix drain_all condition in blk_drain_queue() When trying to drain all requests, blk_drain_queue() checked only q->rq.count[]; however, this only tracks REQ_ALLOCED requests. This patch updates blk_drain_queue() such that it looks at all the counters and queues so that request_queue is actually empty on completion. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:37 +01:00
Tejun Heo	34f6055c80	block: add blk_queue_dead() There are a number of QUEUE_FLAG_DEAD tests. Add blk_queue_dead() macro and use it. This patch doesn't introduce any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:37 +01:00
Tejun Heo	1ba64edef6	block, sx8: kill blk_insert_request() The only user left for blk_insert_request() is sx8 and it can be trivially switched to use blk_execute_rq_nowait() - special requests aren't included in io stat and sx8 doesn't use block layer tagging. Switch sx8 and kill blk_insert_requeset(). This patch doesn't introduce any functional difference. Only compile tested. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Jeff Garzik <jgarzik@pobox.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:37 +01:00
Tejun Heo	bb9d97b6df	cgroup: don't use subsys->can_attach_task() or ->attach_task() Now that subsys->can_attach() and attach() take @tset instead of @task, they can handle per-task operations. Convert ->can_attach_task() and ->attach_task() users to use ->can_attach() and attach() instead. Most converions are straight-forward. Noteworthy changes are, * In cgroup_freezer, remove unnecessary NULL assignments to unused methods. It's useless and very prone to get out of sync, which already happened. * In cpuset, PF_THREAD_BOUND test is checked for each task. This doesn't make any practical difference but is conceptually cleaner. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <paul@paulmenage.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: James Morris <jmorris@namei.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org>	2011-12-12 18:12:21 -08:00
Yasuaki Ishimatsu	5eb46851de	cfq-iosched: fix cfq_cic_link() race confition cfq_cic_link() has race condition. When some processes which shared ioc issue I/O to same block device simultaneously, cfq_cic_link() returns -EEXIST sometimes. The race condition might stop I/O by following steps: step 1: Process A: Issue an I/O to /dev/sda step 2: Process A: Get an ioc (iocA here) in get_io_context() which does not linked with a cic for the device step 3: Process A: Get a new cic for the device (cicA here) in cfq_alloc_io_context() step 4: Process B: Issue an I/O to /dev/sda step 5: Process B: Get iocA in get_io_context() since process A and B share the same ioc step 6: Process B: Get a new cic for the device (cicB here) in cfq_alloc_io_context() since iocA has not been linked with a cic for the device yet step 7: Process A: Link cicA to iocA in cfq_cic_link() step 8: Process A: Dispatch I/O to driver and finish it step 9: Process B: Try to link cicB to iocA in cfq_cic_link() But it fails with showing "cfq: cic link failed!" kernel message, since iocA has already linked with cicA at step 7. step 10: Process B: Wait for finishig I/O in get_request_wait() The function does not wake up, when there is no I/O to the device. When cfq_cic_link() returns -EEXIST, it means ioc has already linked with cic. So when cfq_cic_link() return -EEXIST, retry cfq_cic_lookup(). Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-02 10:07:07 +01:00
majianpeng	2984ff38cc	cfq-iosched: free cic_index if blkio_alloc_blkg_stats fails If we fail allocating the blkpg stats, we free cfqd and cfgq. But we need to free the IDA cfqd->cic_index as well. Signed-off-by: majianpeng <majianpeng@gmail.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-11-30 15:47:48 +01:00
Mike Snitzer	5151412dd4	block: initialize request_queue's numa node during struct request_queue is allocated with __GFP_ZERO so its "node" field is zero before initialization. This causes an oops if node 0 is offline in the page allocator because its zonelists are not initialized. From Dave Young's dmesg: SRAT: Node 1 PXM 2 0-d0000000 SRAT: Node 1 PXM 2 100000000-330000000 SRAT: Node 0 PXM 1 330000000-630000000 Initmem setup node 1 0000000000000000-000000000affb000 ... Built 1 zonelists in Node order, mobility grouping on. ... BUG: unable to handle kernel paging request at 0000000000001c08 IP: [<ffffffff8111c355>] __alloc_pages_nodemask+0xb5/0x870 and __alloc_pages_nodemask+0xb5 translates to a NULL pointer on zonelist->_zonerefs. The fix is to initialize q->node at the time of allocation so the correct node is passed to the slab allocator later. Since blk_init_allocated_queue_node() is no longer needed, merge it with blk_init_allocated_queue(). [rientjes@google.com: changelog, initializing q->node] Cc: stable@vger.kernel.org [2.6.37+] Reported-by: Dave Young <dyoung@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Tested-by: Dave Young <dyoung@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-11-23 10:59:13 +01:00
Shaohua Li	019ceb7d5d	block: add missed trace_block_plug After flush plug list, the list has no request, so we need to add a trace_block_plug(). Signed-off-by: Shaohua Li <shaohua.li@intel.com> Reviewed-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-11-16 09:21:50 +01:00
Shaohua Li	3540d5e89b	block: avoid unnecessary plug list flush get_request_wait() could sleep and flush the plug list. If the list is already flushed, don't flush again. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Reviewed-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-11-16 09:21:50 +01:00
Ben Hutchings	6b76106d8e	block: Always check length of all iov entries in blk_rq_map_user_iov() Even after commit `5478755616` ("block: check for proper length of iov entries earlier ...") we still won't check for zero-length entries after an unaligned entry. Remove the break-statement, so all entries are checked. Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-11-13 19:58:09 +01:00
Tejun Heo	d0985394e7	block: Revert "[SCSI] genhd: add a new attribute "alias" in gendisk" This reverts commit `a72c5e5eb7`. The commit introduced alias for block devices which is intended to be used during logging although actual usage hasn't been committed yet. This approach adds very limited benefit (raw log might be easier to follow) which can be trivially implemented in userland but has a lot of problems. It is much worse than netif renames because it doesn't rename the actual device but just adds conveninence name which isn't used universally or enforced. Everything internal including device lookup and sysfs still uses the internal name and nothing prevents two devices from using conflicting alias - ie. sda can have sdb as its alias. This has been nacked by people working on device driver core, block layer and kernel-userland interface and shouldn't have been upstreamed. Revert it. http://thread.gmane.org/gmane.linux.kernel/1155104 http://thread.gmane.org/gmane.linux.scsi/68632 http://thread.gmane.org/gmane.linux.scsi/69776 Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Acked-by: Kay Sievers <kay.sievers@vrfy.org> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Nao Nishijima <nao.nishijima.xt@hitachi.com> Cc: Alan Cox <alan@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-11-10 09:03:55 +01:00
Linus Torvalds	32aaeffbd4	Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits) Revert "tracing: Include module.h in define_trace.h" irq: don't put module.h into irq.h for tracking irqgen modules. bluetooth: macroize two small inlines to avoid module.h ip_vs.h: fix implicit use of module_get/module_put from module.h nf_conntrack.h: fix up fallout from implicit moduleparam.h presence include: replace linux/module.h with "struct module" wherever possible include: convert various register fcns to macros to avoid include chaining crypto.h: remove unused crypto_tfm_alg_modname() inline uwb.h: fix implicit use of asm/page.h for PAGE_SIZE pm_runtime.h: explicitly requires notifier.h linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h miscdevice.h: fix up implicit use of lists and types stop_machine.h: fix implicit use of smp.h for smp_processor_id of: fix implicit use of errno.h in include/linux/of.h of_platform.h: delete needless include <linux/module.h> acpi: remove module.h include from platform/aclinux.h miscdevice.h: delete unnecessary inclusion of module.h device_cgroup.h: delete needless include <linux/module.h> net: sch_generic remove redundant use of <linux/module.h> net: inet_timewait_sock doesnt need <linux/module.h> ... Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in - drivers/media/dvb/frontends/dibx000_common.c - drivers/media/video/{mt9m111.c,ov6650.c} - drivers/mfd/ab3550-core.c - include/linux/dmaengine.h	2011-11-06 19:44:47 -08:00
Linus Torvalds	3d0a8d10cf	Merge branch 'for-3.2/drivers' of git://git.kernel.dk/linux-block * 'for-3.2/drivers' of git://git.kernel.dk/linux-block: (30 commits) virtio-blk: use ida to allocate disk index hpsa: add small delay when using PCI Power Management to reset for kump cciss: add small delay when using PCI Power Management to reset for kump xen/blkback: Fix two races in the handling of barrier requests. xen/blkback: Check for proper operation. xen/blkback: Fix the inhibition to map pages when discarding sector ranges. xen/blkback: Report VBD_WSECT (wr_sect) properly. xen/blkback: Support 'feature-barrier' aka old-style BARRIER requests. xen-blkfront: plug device number leak in xlblk_init() error path xen-blkfront: If no barrier or flush is supported, use invalid operation. xen-blkback: use kzalloc() in favor of kmalloc()+memset() xen-blkback: fixed indentation and comments xen-blkfront: fix a deadlock while handling discard response xen-blkfront: Handle discard requests. xen-blkback: Implement discard requests ('feature-discard') xen-blkfront: add BLKIF_OP_DISCARD and discard request struct drivers/block/loop.c: remove unnecessary bdev argument from loop_clr_fd() drivers/block/loop.c: emit uevent on auto release drivers/block/cpqarray.c: use pci_dev->revision loop: always allow userspace partitions and optionally support automatic scanning ... Fic up trivial header file includsion conflict in drivers/block/loop.c	2011-11-04 17:22:14 -07:00
Linus Torvalds	b4fdcb02f1	Merge branch 'for-3.2/core' of git://git.kernel.dk/linux-block * 'for-3.2/core' of git://git.kernel.dk/linux-block: (29 commits) block: don't call blk_drain_queue() if elevator is not up blk-throttle: use queue_is_locked() instead of lockdep_is_held() blk-throttle: Take blkcg->lock while traversing blkcg->policy_list blk-throttle: Free up policy node associated with deleted rule block: warn if tag is greater than real_max_depth. block: make gendisk hold a reference to its queue blk-flush: move the queue kick into blk-flush: fix invalid BUG_ON in blk_insert_flush block: Remove the control of complete cpu from bio. block: fix a typo in the blk-cgroup.h file block: initialize the bounce pool if high memory may be added later block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown block: drop @tsk from attempt_plug_merge() and explain sync rules block: make get_request[_wait]() fail if queue is dead block: reorganize throtl_get_tg() and blk_throtl_bio() block: reorganize queue draining block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg() block: pass around REQ_* flags instead of broken down booleans during request alloc/free block: move blk_throtl prototypes to block/blk.h block: fix genhd refcounting in blkio_policy_parse_and_set() ... Fix up trivial conflicts due to "mddev_t" -> "struct mddev" conversion and making the request functions be of type "void" instead of "int" in - drivers/md/{faulty.c,linear.c,md.c,md.h,multipath.c,raid0.c,raid1.c,raid10.c,raid5.c} - drivers/staging/zram/zram_drv.c	2011-11-04 17:06:58 -07:00
Tejun Heo	6dd9ad7df2	block: don't call blk_drain_queue() if elevator is not up blk_cleanup_queue() may be called before elevator is set up on a queue which triggers the following oops. BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff8125a69c>] elv_drain_elevator+0x1c/0x70 ... Pid: 830, comm: kworker/0:2 Not tainted 3.1.0-next-20111025_64+ #1590 Bochs Bochs RIP: 0010:[<ffffffff8125a69c>] [<ffffffff8125a69c>] elv_drain_elevator+0x1c/0x70 ... Call Trace: [<ffffffff8125da92>] blk_drain_queue+0x42/0x70 [<ffffffff8125db90>] blk_cleanup_queue+0xd0/0x1c0 [<ffffffff81469640>] md_free+0x50/0x70 [<ffffffff8126f43b>] kobject_release+0x8b/0x1d0 [<ffffffff81270d56>] kref_put+0x36/0xa0 [<ffffffff8126f2b7>] kobject_put+0x27/0x60 [<ffffffff814693af>] mddev_delayed_delete+0x2f/0x40 [<ffffffff81083450>] process_one_work+0x100/0x3b0 [<ffffffff8108527f>] worker_thread+0x15f/0x3a0 [<ffffffff81089937>] kthread+0x87/0x90 [<ffffffff81621834>] kernel_thread_helper+0x4/0x10 Fix it by making blk_cleanup_queue() check whether q->elevator is set up before invoking blk_drain_queue. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-11-03 18:52:11 +01:00
Paul Gortmaker	6adb1236b5	block: Change module.h -> export.h in bsg-lib.c This file isn't using full modular functionality, and hence can be "downgraded" to just using the export.h header. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>	2011-10-31 19:31:13 -04:00
Paul Gortmaker	d5decd3b95	block: add export.h to files using EXPORT_SYMBOL/THIS_MODULE macros These files were getting <linux/module.h> via an implicit include path, but we want to crush those out of existence since they cost time during compiles of processing thousands of lines of headers for no reason. Give them the lightweight header that just contains the EXPORT_SYMBOL infrastructure. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>	2011-10-31 19:31:12 -04:00
Linus Torvalds	ec7ae51753	Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (204 commits) [SCSI] qla4xxx: export address/port of connection (fix udev disk names) [SCSI] ipr: Fix BUG on adapter dump timeout [SCSI] megaraid_sas: Fix instance access in megasas_reset_timer [SCSI] hpsa: change confusing message to be more clear [SCSI] iscsi class: fix vlan configuration [SCSI] qla4xxx: fix data alignment and use nl helpers [SCSI] iscsi class: fix link local mispelling [SCSI] iscsi class: Replace iscsi_get_next_target_id with IDA [SCSI] aacraid: use lower snprintf() limit [SCSI] lpfc 8.3.27: Change driver version to 8.3.27 [SCSI] lpfc 8.3.27: T10 additions for SLI4 [SCSI] lpfc 8.3.27: Fix queue allocation failure recovery [SCSI] lpfc 8.3.27: Change algorithm for getting physical port name [SCSI] lpfc 8.3.27: Changed worst case mailbox timeout [SCSI] lpfc 8.3.27: Miscellanous logic and interface fixes [SCSI] megaraid_sas: Changelog and version update [SCSI] megaraid_sas: Add driver workaround for PERC5/1068 kdump kernel panic [SCSI] megaraid_sas: Add multiple MSI-X vector/multiple reply queue support [SCSI] megaraid_sas: Add support for MegaRAID 9360/9380 12GB/s controllers [SCSI] megaraid_sas: Clear FUSION_IN_RESET before enabling interrupts ...	2011-10-28 16:44:18 -07:00
Jens Axboe	334c2b0b8b	blk-throttle: use queue_is_locked() instead of lockdep_is_held() We can't use the latter if !CONFIG_LOCKDEP. Reported-by: Sedat Dilek <sedat.dilek@googlemail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-10-25 15:51:48 +02:00
Vivek Goyal	a38eb630fa	blk-throttle: Take blkcg->lock while traversing blkcg->policy_list blkcg->policy_list is protected by blkcg->lock. Its not rcu protected list. So even for readers, they need to take blkcg->lock. There are few functions which were reading the list without taking lock. Fix it. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-10-25 15:48:12 +02:00
Vivek Goyal	e060f00bee	blk-throttle: Free up policy node associated with deleted rule If a rule is being deleted, free up associated policy node. Otherwise that memory is leaked. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-10-25 15:48:12 +02:00
Tao Ma	5e08159197	block: warn if tag is greater than real_max_depth. In case tag depth is reduced, it is max_depth not real_max_depth. So we should allow a request with tag >= max_depth, but for a tag >= real_max_depth, there really should be some problem. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-10-25 10:20:05 +02:00
Jens Axboe	83157223de	Merge branch 'for-linus' into for-3.2/core	2011-10-24 16:24:38 +02:00
Tejun Heo	f992ae801a	block: make gendisk hold a reference to its queue The following command sequence triggers an oops. # mount /dev/sdb1 /mnt # echo 1 > /sys/class/scsi_device/0\:0\:1\:0/device/delete # umount /mnt general protection fault: 0000 [#1] PREEMPT SMP CPU 2 Modules linked in: Pid: 791, comm: umount Not tainted 3.1.0-rc3-work+ #8 Bochs Bochs RIP: 0010:[<ffffffff810d0879>] [<ffffffff810d0879>] __lock_acquire+0x389/0x1d60 ... Call Trace: [<ffffffff810d2845>] lock_acquire+0x95/0x140 [<ffffffff81aed87b>] _raw_spin_lock+0x3b/0x50 [<ffffffff811573bc>] bdi_lock_two+0x5c/0x70 [<ffffffff811c2f6c>] bdev_inode_switch_bdi+0x4c/0xf0 [<ffffffff811c3fcb>] __blkdev_put+0x11b/0x1d0 [<ffffffff811c4010>] __blkdev_put+0x160/0x1d0 [<ffffffff811c40df>] blkdev_put+0x5f/0x190 [<ffffffff8118f18d>] kill_block_super+0x4d/0x80 [<ffffffff8118f4a5>] deactivate_locked_super+0x45/0x70 [<ffffffff8119003a>] deactivate_super+0x4a/0x70 [<ffffffff811ac4ad>] mntput_no_expire+0xed/0x130 [<ffffffff811acf2e>] sys_umount+0x7e/0x3a0 [<ffffffff81aeeeab>] system_call_fastpath+0x16/0x1b This is because bdev holds on to disk but disk doesn't pin the associated queue. If a SCSI device is removed while the device is still open, the sdev puts the base reference to the queue on release. When the bdev is finally released, the associated queue is already gone along with the bdi and bdev_inode_switch_bdi() ends up dereferencing already freed bdi. Even if it were not for this bug, disk not holding onto the associated queue is very unusual and error-prone. Fix it by making add_disk() take an extra reference to its queue and put it on disk_release() and ensuring that disk and its fops owner are put in that order after all accesses to the disk and queue are complete. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-10-24 16:24:31 +02:00
Jeff Moyer	e67b77c791	blk-flush: move the queue kick into A dm-multipath user reported[1] a problem when trying to boot a kernel with commit `4853abaae7` (block: fix flush machinery for stacking drivers with differring flush flags) applied. It turns out that an empty flush request can be sent into blk_insert_flush. When the BUG_ON was fixed to allow for this, I/O on the underlying device would stall. The reason is that blk_insert_cloned_request does not kick the queue. In the aforementioned commit, I had added a special case to kick the queue if data was sent down but the queue flags did not require a flush. A better solution is to push the queue kick up into blk_insert_cloned_request. This patch, along with a follow-on which fixes the BUG_ON, fixes the issue reported. [1] http://www.redhat.com/archives/dm-devel/2011-September/msg00154.html Reported-by: Christophe Saout <christophe@saout.de> Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Stable note: 3.1 Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-10-24 16:24:31 +02:00
Jeff Moyer	834f9f61a5	blk-flush: fix invalid BUG_ON in blk_insert_flush A user reported a regression due to commit `4853abaae7` (block: fix flush machinery for stacking drivers with differring flush flags). Part of the problem is that blk_insert_flush required a single bio be attached to the request. In reality, having no attached bio is also a valid case, as can be observed with an empty flush. [1] http://www.redhat.com/archives/dm-devel/2011-September/msg00154.html Reported-by: Christophe Saout <christophe@saout.de> Signed-off-by: Jeff Moyer <jmoyer@redhat.com Acked-by: Tejun Heo <tj@kernel.org> Stable note: 3.1 Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-10-24 16:24:30 +02:00
Tao Ma	9562ad9ab3	block: Remove the control of complete cpu from bio. bio originally has the functionality to set the complete cpu, but it is broken. Chirstoph said that "This code is unused, and from the all the discussions lately pretty obviously broken. The only thing keeping it serves is creating more confusion and possibly more bugs." And Jens replied with "We can kill bio_set_completion_cpu(). I'm fine with leaving cpu control to the request based drivers, they are the only ones that can toggle the setting anyway". So this patch tries to remove all the work of controling complete cpu from a bio. Cc: Shaohua Li <shaohua.li@intel.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-10-24 16:11:30 +02:00
Jie Liu	e890413af4	block: fix a typo in the blk-cgroup.h file byptes -> bytes. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-10-24 16:08:38 +02:00
Tejun Heo	c9a929dde3	block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown request_queue is refcounted but actually depdends on lifetime management from the queue owner - on blk_cleanup_queue(), block layer expects that there's no request passing through request_queue and no new one will. This is fundamentally broken. The queue owner (e.g. SCSI layer) doesn't have a way to know whether there are other active users before calling blk_cleanup_queue() and other users (e.g. bsg) don't have any guarantee that the queue is and would stay valid while it's holding a reference. With delay added in blk_queue_bio() before queue_lock is grabbed, the following oops can be easily triggered when a device is removed with in-flight IOs. sd 0:0:1:0: [sdb] Stopping disk ata1.01: disabled general protection fault: 0000 [#1] PREEMPT SMP CPU 2 Modules linked in: Pid: 648, comm: test_rawio Not tainted 3.1.0-rc3-work+ #56 Bochs Bochs RIP: 0010:[<ffffffff8137d651>] [<ffffffff8137d651>] elv_rqhash_find+0x61/0x100 ... Process test_rawio (pid: 648, threadinfo ffff880019efa000, task ffff880019ef8a80) ... Call Trace: [<ffffffff8137d774>] elv_merge+0x84/0xe0 [<ffffffff81385b54>] blk_queue_bio+0xf4/0x400 [<ffffffff813838ea>] generic_make_request+0xca/0x100 [<ffffffff81383994>] submit_bio+0x74/0x100 [<ffffffff811c53ec>] dio_bio_submit+0xbc/0xc0 [<ffffffff811c610e>] __blockdev_direct_IO+0x92e/0xb40 [<ffffffff811c39f7>] blkdev_direct_IO+0x57/0x60 [<ffffffff8113b1c5>] generic_file_aio_read+0x6d5/0x760 [<ffffffff8118c1ca>] do_sync_read+0xda/0x120 [<ffffffff8118ce55>] vfs_read+0xc5/0x180 [<ffffffff8118cfaa>] sys_pread64+0x9a/0xb0 [<ffffffff81afaf6b>] system_call_fastpath+0x16/0x1b This happens because blk_queue_cleanup() destroys the queue and elevator whether IOs are in progress or not and DEAD tests are sprinkled in the request processing path without proper synchronization. Similar problem exists for blk-throtl. On queue cleanup, blk-throtl is shutdown whether it has requests in it or not. Depending on timing, it either oopses or throttled bios are lost putting tasks which are waiting for bio completion into eternal D state. The way it should work is having the usual clear distinction between shutdown and release. Shutdown drains all currently pending requests, marks the queue dead, and performs partial teardown of the now unnecessary part of the queue. Even after shutdown is complete, reference holders are still allowed to issue requests to the queue although they will be immmediately failed. The rest of teardown happens on release. This patch makes the following changes to make blk_queue_cleanup() behave as proper shutdown. * QUEUE_FLAG_DEAD is now set while holding both q->exit_mutex and queue_lock. * Unsynchronized DEAD check in generic_make_request_checks() removed. This couldn't make any meaningful difference as the queue could die after the check. * blk_drain_queue() updated such that it can drain all requests and is now called during cleanup. * blk_throtl updated such that it checks DEAD on grabbing queue_lock, drains all throttled bios during cleanup and free td when queue is released. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-10-19 14:42:16 +02:00
Tejun Heo	bd87b5898a	block: drop @tsk from attempt_plug_merge() and explain sync rules attempt_plug_merge() accesses elevator without holding queue_lock and may call into ->elevator_bio_merge_fn(). The elvator is guaranteed to be valid because it's accessed iff the plugged list has requests and elevator is never exited with live requests, so as long as the elevator method can deal with unlocked access, this is safe. Explain the sync rules around attempt_plug_merge() and drop the unnecessary @tsk parameter. This patch doesn't introduce any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-10-19 14:33:08 +02:00
Tejun Heo	da8303c63b	block: make get_request[_wait]() fail if queue is dead Currently get_request[_wait]() allocates request whether queue is dead or not. This patch makes get_request[_wait]() return NULL if @q is dead. blk_queue_bio() is updated to fail the submitted bio if request allocation fails. While at it, add docbook comments for get_request[_wait](). Note that the current code has rather unclear (there are spurious DEAD tests scattered around) assumption that the owner of a queue guarantees that no request travels block layer if the queue is dead and this patch in itself doesn't change much; however, this will allow fixing the broken assumption in the next patch. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-10-19 14:33:05 +02:00
Tejun Heo	bc16a4f933	block: reorganize throtl_get_tg() and blk_throtl_bio() blk_throtl_bio() and throtl_get_tg() have rather unusual interface. * throtl_get_tg() returns pointer to a valid tg or ERR_PTR(-ENODEV), and drops queue_lock in the latter case. Different locking context depending on return value is error-prone and DEAD state is scheduled to be protected by queue_lock anyway. Move DEAD check inside queue_lock and return valid tg or NULL. * blk_throtl_bio() indicates return status both with its return value and in/out param *@bio. The former is used to indicate whether queue is found to be dead during throtl processing. The latter whether the bio is throttled. There's no point in returning DEAD check result from blk_throtl_bio(). The queue can die after blk_throtl_bio() is finished but before make_request_fn() grabs queue lock. Make it take @bio instead and return boolean result indicating whether the request is throttled or not. This patch doesn't cause any visible functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-10-19 14:33:01 +02:00
Tejun Heo	e3c78ca524	block: reorganize queue draining Reorganize queue draining related code in preparation of queue exit changes. * Factor out actual draining from elv_quiesce_start() to blk_drain_queue(). * Make elv_quiesce_start/end() responsible for their own locking. * Replace open-coded ELVSWITCH clearing in elevator_switch() with elv_quiesce_end(). This patch doesn't cause any visible functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-10-19 14:32:38 +02:00
Tejun Heo	315fceee81	block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg() blk_get/put_queue() in scsi_cmd_ioctl() and throtl_get_tg() are completely bogus. The caller must have a reference to the queue on entry and taking an extra reference doesn't change anything. For scsi_cmd_ioctl(), the only effect is that it ends up checking QUEUE_FLAG_DEAD on entry; however, this is bogus as queue can die right after blk_get_queue(). Dead queue should be and is handled in request issue path (it's somewhat broken now but that's a separate problem and doesn't affect this one much). throtl_get_tg() incorrectly assumes that q is rcu freed. Also, it doesn't check return value of blk_get_queue(). If the queue is already dead, it ends up doing an extra put. Drop them. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-10-19 14:31:25 +02:00
Tejun Heo	75eb6c372d	block: pass around REQ_* flags instead of broken down booleans during request alloc/free blk_alloc_request() and freed_request() take different combinations of REQ_* @flags, @priv and @is_sync when @flags is superset of the latter two. Make them take @flags only. This cleans up the code a bit and will ease updating allocation related REQ_* flags. This patch doesn't introduce any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-10-19 14:31:22 +02:00
Tejun Heo	bc9fcbf9cb	block: move blk_throtl prototypes to block/blk.h blk_throtl interface is block internal and there's no reason to have them in linux/blkdev.h. Move them to block/blk.h. This patch doesn't introduce any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-10-19 14:31:18 +02:00
Tejun Heo	ece84241b9	block: fix genhd refcounting in blkio_policy_parse_and_set() blkio_policy_parse_and_set() calls blkio_check_dev_num() to check whether the given dev_t is valid. blkio_check_dev_num() uses get_gendisk() for verification but never puts the returned genhd leaking the reference. This patch collapses blkio_check_dev_num() into its caller and updates it such that the genhd is put before returning. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-10-19 14:31:15 +02:00
Tejun Heo	523e1d399c	block: make gendisk hold a reference to its queue The following command sequence triggers an oops. # mount /dev/sdb1 /mnt # echo 1 > /sys/class/scsi_device/0\:0\:1\:0/device/delete # umount /mnt general protection fault: 0000 [#1] PREEMPT SMP CPU 2 Modules linked in: Pid: 791, comm: umount Not tainted 3.1.0-rc3-work+ #8 Bochs Bochs RIP: 0010:[<ffffffff810d0879>] [<ffffffff810d0879>] __lock_acquire+0x389/0x1d60 ... Call Trace: [<ffffffff810d2845>] lock_acquire+0x95/0x140 [<ffffffff81aed87b>] _raw_spin_lock+0x3b/0x50 [<ffffffff811573bc>] bdi_lock_two+0x5c/0x70 [<ffffffff811c2f6c>] bdev_inode_switch_bdi+0x4c/0xf0 [<ffffffff811c3fcb>] __blkdev_put+0x11b/0x1d0 [<ffffffff811c4010>] __blkdev_put+0x160/0x1d0 [<ffffffff811c40df>] blkdev_put+0x5f/0x190 [<ffffffff8118f18d>] kill_block_super+0x4d/0x80 [<ffffffff8118f4a5>] deactivate_locked_super+0x45/0x70 [<ffffffff8119003a>] deactivate_super+0x4a/0x70 [<ffffffff811ac4ad>] mntput_no_expire+0xed/0x130 [<ffffffff811acf2e>] sys_umount+0x7e/0x3a0 [<ffffffff81aeeeab>] system_call_fastpath+0x16/0x1b This is because bdev holds on to disk but disk doesn't pin the associated queue. If a SCSI device is removed while the device is still open, the sdev puts the base reference to the queue on release. When the bdev is finally released, the associated queue is already gone along with the bdi and bdev_inode_switch_bdi() ends up dereferencing already freed bdi. Even if it were not for this bug, disk not holding onto the associated queue is very unusual and error-prone. Fix it by making add_disk() take an extra reference to its queue and put it on disk_release() and ensuring that disk and its fops owner are put in that order after all accesses to the disk and queue are complete. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-10-19 14:31:07 +02:00
Jens Axboe	5c04b426f2	Merge branch 'v3.1-rc10' into for-3.2/core Conflicts: block/blk-core.c include/linux/blkdev.h Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-10-19 14:30:42 +02:00
Hannes Reinecke	777eb1bf15	block: Free queue resources at blk_release_queue() A kernel crash is observed when a mounted ext3/ext4 filesystem is physically removed. The problem is that blk_cleanup_queue() frees up some resources eg by calling elevator_exit(), which are not checked for in normal operation. So we should rather move these calls to the destructor function blk_release_queue() as at that point all remaining references are gone. However, in doing so we have to ensure that any externally supplied queue_lock is disconnected as the driver might free up the lock after the call of blk_cleanup_queue(), Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-09-28 08:07:01 -06:00
Wanlong Gao	d11bb4462c	blk-cgroup: be able to remove the record of unplugged device The bug is we're not able to remove the device from blkio cgroup's per-device control files if it gets unplugged. To reproduce the bug: # mount -t cgroup -o blkio xxx /cgroup # cd /cgroup # echo "8:0 1000" > blkio.throttle.read_bps_device # unplug the device # cat blkio.throttle.read_bps_device 8:0 1000 # echo "8:0 0" > blkio.throttle.read_bps_device -bash: echo: write error: No such device After patching, the device removal will succeed. Thanks for the comments of Paul, Zefan, and Vivek. Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <paul@paulmenage.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-09-21 10:22:10 +02:00
Andrew Morton	499337bb65	block/blk-sysfs.c: fix kerneldoc references The kerneldoc for blk_release_queue() is referring to blk_cleanup_queue(). Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-09-21 10:01:22 +02:00
Suresh Jayaraman	75df713627	block: document blk-plug Thus spake Andrew Morton: "And I have the usual maintainability whine. If someone comes up to vmscan.c and sees it calling blk_start_plug(), how are they supposed to work out why that call is there? They go look at the blk_start_plug() definition and it is undocumented. I think we can do better than this?" Adapted from the LWN article - http://lwn.net/Articles/438256/ by Jens Axboe and from an earlier attempt by Shaohua Li to document blk-plug. [akpm@linux-foundation.org: grammatical and spelling tweaks] Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de> Cc: Shaohua Li <shaohua.li@intel.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-09-21 10:00:16 +02:00
Christoph Hellwig	27a84d54c0	block: refactor generic_make_request Move all the checks performed on a bio into a new helper, and call it as soon as bio is submitted even if it is a re-submission from ->make_request. We explicitly mark the new helper as beeing non-inlined as the stack usage for printing the block device name in the failure case is quite high and this a patch where we have to be extremely conservative about stack usage. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-09-15 14:01:40 +02:00
Tao Ma	8ad6a56f56	block: Don't check QUEUE_FLAG_SAME_COMP in __blk_complete_request In __blk_complete_request, we check both QUEUE_FLAG_SAME_COMP and req->cpu to decide whether we should use req->cpu. Actually the user can also select the complete cpu by either setting BIO_CPU_AFFINE or by calling bio_set_completion_cpu. Current solution makes these 2 ways don't work any more. So we'd better just check req->cpu. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-09-14 09:31:01 +02:00
Christoph Hellwig	5a7bbad27a	block: remove support for bio remapping from ->make_request There is very little benefit in allowing to let a ->make_request instance update the bios device and sector and loop around it in __generic_make_request when we can archive the same through calling generic_make_request from the driver and letting the loop in generic_make_request handle it. Note that various drivers got the return value from ->make_request and returned non-zero values for errors. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-09-12 12:12:01 +02:00
Jens Axboe	c20e8de27f	block: rename __make_request() to blk_queue_bio() Now that it's exported, lets put it in a more sane namespace. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-09-12 12:08:31 +02:00
Christoph Hellwig	166e1f901b	block: export __make_request Avoid the hacks need for request based device mappers currently by simply exporting the symbol instead of trying to get it through the back door. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-09-12 12:08:27 +02:00
Wang Sheng-Hui	484fc254b8	elevator: use ELV_NAME_MAX instead of magic number 16 for chosen_elevator We have ELV_NAME_MAX defined to 16, and hence we should use it instead of the magic nubmer 16 for elevator's name string. Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-09-12 08:59:20 +02:00
Nao Nishijima	a72c5e5eb7	[SCSI] genhd: add a new attribute "alias" in gendisk This patch allows the user to set an "alias" of the disk via sysfs interface. This patch only adds a new attribute "alias" in gendisk structure. To show the alias instead of the device name in kernel messages, we need to revise printk messages and use alias_name() in them. Example: (current) printk("disk name is %s\n", disk->disk_name); (new) printk("disk name is %s\n", alias_name(disk)); Users can use alphabets, numbers, '-' and '_' in "alias" attribute. A disk can have an "alias" which length is up to 255 bytes. This attribute is write-once. Suggested-by: James Bottomley <James.Bottomley@HansenPartnership.com> Suggested-by: Jon Masters <jcm@redhat.com> Signed-off-by: Nao Nishijima <nao.nishijima.xt@hitachi.com> Signed-off-by: James Bottomley <JBottomley@Parallels.com>	2011-08-29 00:16:19 -07:00
Shaohua Li	56ebdaf2fa	block: simplify force plug flush code a little bit Cleaning up the code a little bit. attempt_plug_merge() traverses the plug list anyway, we can do the request counting there, so stack size is reduced a little bit. The motivation here is I suspect if we should count the requests for each queue (task could handle multiple disks in the meantime), but my test doesn't show it's worthy doing. If somebody proves we should do it, below change will make that more easier. Signed-off-by: Shaohua Li <shli@kernel.org> Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-08-24 16:04:34 +02:00
Shaohua Li	a632716275	block: change force plug flush call order Do blk_flush_plug_list() first and then add new request aDo blk_flush_plug_list() first and then add new request aDo blk_flush_plug_list() first and then add new request at the tail. New request can't be merged to existing requests, but later new requests might be merged with this new one. If blk_flush_plug_list() is done later, the merge doesn't happen. Believe it or not, this fixes a 10% regression running sysbench workload. Signed-off-by: Shaohua Li <shli@kernel.org> Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-08-24 16:04:32 +02:00
Eric Seppanen	e8037d4983	block: Fix queue_flag update when rq_affinity goes from 2 to 1 Commit `5757a6d76c` added the QUEUE_FLAG_SAME_FORCE flag, but fails to clear that flag when the current state is '2' (SAME_COMP + SAME_FORCE) and the new state is '1' (SAME_COMP). Acked-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Roland Dreier <roland@purestorage.com> Signed-off-by: Eric Seppanen <eric@purestorage.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-08-24 08:51:34 +02:00
Tejun Heo	d27769ec3d	block: add GENHD_FL_NO_PART_SCAN There are cases where suppressing partition scan is useful - e.g. for lo devices and pseudo SATA devices which advertise to be a disk but get upset on partition scan (some port multiplier control devices show such behavior). This patch adds GENHD_FL_NO_PART_SCAN which suppresses partition scan regardless of the number of possible partitions. disk_partitionable() is renamed to disk_part_scan_enabled() as suppressing partition scan doesn't imply the device can't be partitioned using BLKPG_ADD/DEL_PARTITION calls from userland. show_partition() now directly tests disk_max_parts() to maintain backward-compatibility. -v2: Updated to make it clear that only partition scan is suppressed not partitioning itself as suggested by Kay Sievers. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-08-23 20:01:04 +02:00
Christoph Hellwig	65299a3b78	block: separate priority boosting from REQ_META Add a new REQ_PRIO to let requests preempt others in the cfq I/O schedule, and lave REQ_META purely for marking requests as metadata in blktrace. All existing callers of REQ_META except for XFS are updated to also set REQ_PRIO for now. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-08-23 14:50:29 +02:00
Linus Torvalds	5ccc38740a	Merge branch 'for-linus' of git://git.kernel.dk/linux-block * 'for-linus' of git://git.kernel.dk/linux-block: (23 commits) Revert "cfq: Remove special treatment for metadata rqs." block: fix flush machinery for stacking drivers with differring flush flags block: improve rq_affinity placement blktrace: add FLUSH/FUA support Move some REQ flags to the common bio/request area allow blk_flush_policy to return REQ_FSEQ_DATA independent of *FLUSH xen/blkback: Make description more obvious. cfq-iosched: Add documentation about idling block: Make rq_affinity = 1 work as expected block: swim3: fix unterminated of_device_id table block/genhd.c: remove useless cast in diskstats_show() drivers/cdrom/cdrom.c: relax check on dvd manufacturer value drivers/block/drbd/drbd_nl.c: use bitmap_parse instead of __bitmap_parse bsg-lib: add module.h include cfq-iosched: Reduce linked group count upon group destruction blk-throttle: correctly determine sync bio loop: fix deadlock when sysfs and LOOP_CLR_FD race against each other loop: add BLK_DEV_LOOP_MIN_COUNT=%i to allow distros 0 pre-allocated loop devices loop: add management interface for on-demand device allocation loop: replace linked list of allocated devices with an idr index ...	2011-08-19 10:47:07 -07:00
Jens Axboe	b53d1ed734	Revert "cfq: Remove special treatment for metadata rqs." We have a kernel build regression since 3.1-rc1, which is about 10% regression. The kernel source is in an ext3 filesystem. Alex Shi bisect it to commit: commit `a07405b780` Author: Justin TerAvest <teravest@google.com> Date: Sun Jul 10 22:09:19 2011 +0200 cfq: Remove special treatment for metadata rqs. Apparently this is caused by lack metadata preemption, where ext3/ext4 do use READ_META. I didn't see a way to fix the issue, so suggest reverting the patch. This reverts commit `a07405b780`. Reported-by: Alex Shi<alex.shi@intel.com> Reported-by: Shaohua Li<shaohua.li@intel.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-08-19 08:34:48 +02:00
Jeff Moyer	4853abaae7	block: fix flush machinery for stacking drivers with differring flush flags Commit `ae1b153962`, block: reimplement FLUSH/FUA to support merge, introduced a performance regression when running any sort of fsyncing workload using dm-multipath and certain storage (in our case, an HP EVA). The test I ran was fs_mark, and it dropped from ~800 files/sec on ext4 to ~100 files/sec. It turns out that dm-multipath always advertised flush+fua support, and passed commands on down the stack, where those flags used to get stripped off. The above commit changed that behavior: static inline struct request __elv_next_request(struct request_queue q) { struct request rq; while (1) { - while (!list_empty(&q->queue_head)) { + if (!list_empty(&q->queue_head)) { rq = list_entry_rq(q->queue_head.next); - if (!(rq->cmd_flags & (REQ_FLUSH \| REQ_FUA)) \|\| - (rq->cmd_flags & REQ_FLUSH_SEQ)) - return rq; - rq = blk_do_flush(q, rq); - if (rq) - return rq; + return rq; } Note that previously, a command would come in here, have REQ_FLUSH\|REQ_FUA set, and then get handed off to blk_do_flush: struct request blk_do_flush(struct request_queue q, struct request rq) { unsigned int fflags = q->flush_flags; /* may change, cache it */ bool has_flush = fflags & REQ_FLUSH, has_fua = fflags & REQ_FUA; bool do_preflush = has_flush && (rq->cmd_flags & REQ_FLUSH); bool do_postflush = has_flush && !has_fua && (rq->cmd_flags & REQ_FUA); unsigned skip = 0; ... if (blk_rq_sectors(rq) && !do_preflush && !do_postflush) { rq->cmd_flags &= ~REQ_FLUSH; if (!has_fua) rq->cmd_flags &= ~REQ_FUA; return rq; } So, the flush machinery was bypassed in such cases (q->flush_flags == 0 && rq->cmd_flags & (REQ_FLUSH\|REQ_FUA)). Now, however, we don't get into the flush machinery at all. Instead, __elv_next_request just hands a request with flush and fua bits set to the scsi_request_fn, even if the underlying request_queue does not support flush or fua. The agreed upon approach is to fix the flush machinery to allow stacking. While this isn't used in practice (since there is only one request-based dm target, and that target will now reflect the flush flags of the underlying device), it does future-proof the solution, and make it function as designed. In order to make this work, I had to add a field to the struct request, inside the flush structure (to store the original req->end_io). Shaohua had suggested overloading the union with rb_node and completion_data, but the completion data is used by device mapper and can also be used by other drivers. So, I didn't see a way around the additional field. I tested this patch on an HP EVA with both ext4 and xfs, and it recovers the lost performance. Comments and other testers, as always, are appreciated. Cheers, Jeff Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-08-15 21:37:25 +02:00
Shaohua Li	bcf30e75b7	block: improve rq_affinity placement This patch reverts commit 35ae66e0a09ab70ed(block: Make rq_affinity = 1 work as expected). The purpose is to avoid an unnecessary IPI. Let's take an example. My test box has cpu 0-7, one socket. Say request is added from CPU 1, blk_complete_request() occurs at CPU 7. Without the reverted patch, softirq will be done at CPU 7. With it, an IPI will be directed to CPU 0, and softirq will be done at CPU 0. In this case, doing softirq at CPU 0 and CPU 7 have no difference from cache sharing point view and we can avoid an ipi if doing it in CPU 7. An immediate concern is this is just like QUEUE_FLAG_SAME_FORCE, but actually not. blk_complete_request() is running in interrupt handler, and currently I/O controller doesn't support multiple interrupts (I checked several LSI cards and AHCI), so only one CPU can run blk_complete_request(). This is still quite different as QUEUE_FLAG_SAME_FORCE. Since only one CPU runs softirq, the only difference with below patch is softirq not always runs at the first CPU of a group. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-08-11 10:39:04 +02:00
Jeff Moyer	fa1bf42ff9	allow blk_flush_policy to return REQ_FSEQ_DATA independent of FLUSH blk_insert_flush has the following check: / * If there's data but flush is not necessary, the request can be * processed directly without going through flush machinery. Queue * for normal execution. / if ((policy & REQ_FSEQ_DATA) && !(policy & (REQ_FSEQ_PREFLUSH \| REQ_FSEQ_POSTFLUSH))) { list_add_tail(&rq->queuelist, &q->queue_head); return; } However, blk_flush_policy will not return with policy set to only REQ_FSEQ_DATA: static unsigned int blk_flush_policy(unsigned int fflags, struct request rq) { unsigned int policy = 0; if (fflags & REQ_FLUSH) { if (rq->cmd_flags & REQ_FLUSH) policy \|= REQ_FSEQ_PREFLUSH; if (blk_rq_sectors(rq)) policy \|= REQ_FSEQ_DATA; if (!(fflags & REQ_FUA) && (rq->cmd_flags & REQ_FUA)) policy \|= REQ_FSEQ_POSTFLUSH; } return policy; } Notice that REQ_FSEQ_DATA is only set if REQ_FLUSH is set. Fix this mismatch by moving the setting of REQ_FSEQ_DATA outside of the REQ_FLUSH check. Tejun notes: Hmmm... yes, this can become a correctness issue if (and only if) blk_queue_flush() is called to change q->flush_flags while requests are in-flight; otherwise, requests wouldn't reach the function at all. Also, I think it would be a generally good idea to always set FSEQ_DATA if the request has data. Cheers, Jeff Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-08-09 20:32:09 +02:00
Tao Ma	35ae66e0a0	block: Make rq_affinity = 1 work as expected Commit `5757a6d76c` introduced a new rq_affinity = 2 so as to make the request completed in the __make_request cpu. But it makes the old rq_affinity = 1 not work any more. The root cause is that if the 'cpu' and 'req->cpu' is in the same group and cpu != req->cpu, ccpu will be the same as group_cpu, so the completion will be excuted in the 'cpu' not 'group_cpu'. This patch fix problem by simpling removing group_cpu and the codes are more explicit now. If ccpu == cpu, we complete in cpu, otherwise we raise_blk_irq to ccpu. Cc: Christoph Hellwig <hch@infradead.org> Cc: Roland Dreier <roland@purestorage.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jens Axboe <jaxboe@fusionio.com> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Reviewed-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-08-05 09:37:47 +02:00
Akinobu Mita	dd48c085c1	fault-injection: add ability to export fault_attr in arbitrary directory init_fault_attr_dentries() is used to export fault_attr via debugfs. But it can only export it in debugfs root directory. Per Forlin is working on mmc_fail_request which adds support to inject data errors after a completed host transfer in MMC subsystem. The fault_attr for mmc_fail_request should be defined per mmc host and export it in debugfs directory per mmc host like /sys/kernel/debug/mmc0/mmc_fail_request. init_fault_attr_dentries() doesn't help for mmc_fail_request. So this introduces fault_create_debugfs_attr() which is able to create a directory in the arbitrary directory and replace init_fault_attr_dentries(). [akpm@linux-foundation.org: extraneous semicolon, per Randy] Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Tested-by: Per Forlin <per.forlin@linaro.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Randy Dunlap <rdunlap@xenotime.net> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-08-03 14:25:20 -10:00
Herbert Poetzl	f95fe9cfb4	block/genhd.c: remove useless cast in diskstats_show() Remove the (unsigned long long) cast in diskstats_show() and adjusts the seq_printf() format string to 'unsigned long' diskstats_show() uses part_stat_read() to get the stats, which either accesses the specified field in the struct disk_stats directly (non SMP) or sums up the per CPU values in a variable of the same type as the field, so in any case the result will have the same type and range as the specified field which for all disk_stats entries is unsigned long Also, for unsigned long ranges the output of %lu should be identical to the one of %llu, so no change in the actual proc entry contents. Signed-off-by: Herbert Poetzl <herbert@13thfloor.at> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-08-02 12:43:50 +02:00
Jens Axboe	e2a5429ff7	bsg-lib: add module.h include Due to conflicts with the moduleh tree in linux-next, we run into an include file mess. We really need export.h in that tree, but if we add module.h locally then the issue is easier to resolve. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-08-02 10:43:35 +02:00
Vivek Goyal	a5395b83b7	cfq-iosched: Reduce linked group count upon group destruction FQ keeps track of number of groups which are linked on blkcg->blkg_list. This is useful to avoid races between queue exit and cgroup exit code paths. So if at the request queue exit time linked group count is not zero, that means there are some group out there which is yet to be deleted under rcu read period and queue exit code should wait for on rcu period. In my previous patch I forgot to decrease the number of group count. So in current form, we nr_blkcg_linked_grps is always non-zero and we will always wait one rcu period (if BLK_CGROUP=y). The side effect of this is that it can increase boot time. I am surprised, nobody complained so far. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-08-02 09:24:09 +02:00
Shaohua Li	e5a94f5684	blk-throttle: correctly determine sync bio read request is always sync. Using rw_is_sync() to determine if a bio is sync. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-08-01 10:31:06 +02:00
Mike Christie	aa387cc895	block: add bsg helper library This moves the FC classes bsg code to the block layer and makes it a lib so that other classes like iscsi and SAS can use it. It is helpful because working with the request queue, bios, creating scatterlists, etc are a pain that the LLD does not have to worry about with normal IOs and should not have to worry about for bsg requests. Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-07-31 22:05:09 +02:00
Akinobu Mita	b2c9cd3793	fail_make_request: cleanup should_fail_request This changes should_fail_request() to more usable wrapper function of should_fail(). It can avoid putting #ifdef CONFIG_FAIL_MAKE_REQUEST in the middle of a function. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-26 16:49:46 -07:00
Jens Axboe	11ccf116d0	block: fix warning with calling smp_processor_id() in preemptible section After commit `5757a6d7` introduced an unsafe calling of smp_processor_id(), with preempt debuggin turned on we spew a lot of: BUG: using smp_processor_id() in preemptible [00000000] code: kjournald/514 caller is __make_request+0x1b8/0x308 [<c0019f44>] (unwind_backtrace+0x0/0xe8) from [<c024b4cc>] (debug_smp_processor_id+0xbc/0xf0) [<c024b4cc>] (debug_smp_processor_id+0xbc/0xf0) from [<c0223d14>] (__make_request+0x1b8/0x308) [<c0223d14>] (__make_request+0x1b8/0x308) from [<c02215ac>] (generic_make_request+0x4dc/0x558) [<c02215ac>] (generic_make_request+0x4dc/0x558) from [<c022173c>] (submit_bio+0x114/0x138) [<c022173c>] (submit_bio+0x114/0x138) from [<c011f504>] (submit_bh+0x148/0x16c) [<c011f504>] (submit_bh+0x148/0x16c) from [<c0121ed8>] (__sync_dirty_buffer+0x88/0xd8) [<c0121ed8>] (__sync_dirty_buffer+0x88/0xd8) from [<c01aff78>] (journal_commit_transaction+0x1198/0x1688) [<c01aff78>] (journal_commit_transaction+0x1198/0x1688) from [<c01b4034>] (kjournald+0xb4/0x224) [<c01b4034>] (kjournald+0xb4/0x224) from [<c0069ea0>] (kthread+0x8c/0x94) [<c0069ea0>] (kthread+0x8c/0x94) from [<c00137f8>] (kernel_thread_exit+0x0/0x8) Fix this by just using raw_smp_processor_id(), it's just a hint after all. There's no pinning of the CPU or accessing per-cpu structures involved. Reported-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-07-26 15:01:15 +02:00
Linus Torvalds	8ded371f81	Merge branch 'for-3.1/drivers' of git://git.kernel.dk/linux-block * 'for-3.1/drivers' of git://git.kernel.dk/linux-block: cciss: do not attempt to read from a write-only register xen/blkback: Add module alias for autoloading xen/blkback: Don't let in-flight requests defer pending ones. bsg: fix address space warning from sparse bsg: remove unnecessary conditional expressions bsg: fix bsg_poll() to return POLLOUT properly	2011-07-25 10:38:18 -07:00
Linus Torvalds	096a705bbc	Merge branch 'for-3.1/core' of git://git.kernel.dk/linux-block * 'for-3.1/core' of git://git.kernel.dk/linux-block: (24 commits) block: strict rq_affinity backing-dev: use synchronize_rcu_expedited instead of synchronize_rcu block: fix patch import error in max_discard_sectors check block: reorder request_queue to remove 64 bit alignment padding CFQ: add think time check for group CFQ: add think time check for service tree CFQ: move think time check variables to a separate struct fixlet: Remove fs_excl from struct task. cfq: Remove special treatment for metadata rqs. block: document blk_plug list access block: avoid building too big plug list compat_ioctl: fix make headers_check regression block: eliminate potential for infinite loop in blkdev_issue_discard compat_ioctl: fix warning caused by qemu block: flush MEDIA_CHANGE from drivers on close(2) blk-throttle: Make total_nr_queued unsigned block: Add __attribute__((format(printf...) and fix fallout fs/partitions/check.c: make local symbols static block:remove some spare spaces in genhd.c block:fix the comment error in blkdev.h ...	2011-07-25 10:33:36 -07:00
Dan Williams	5757a6d76c	block: strict rq_affinity Some systems benefit from completions always being steered to the strict requester cpu rather than the looser "per-socket" steering that blk_cpu_to_group() attempts by default. This is because the first CPU in the group mask ends up being completely overloaded with work, while the others (including the original submitter) has power left to spare. Allow the strict mode to be set by writing '2' to the sysfs control file. This is identical to the scheme used for the nomerges file, where '2' is a more aggressive setting than just being turned on. echo 2 > /sys/block/<bdev>/queue/rq_affinity Cc: Christoph Hellwig <hch@infradead.org> Cc: Roland Dreier <roland@purestorage.com> Tested-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-07-23 20:44:25 +02:00
Jens Axboe	4c64500ead	block: fix patch import error in max_discard_sectors check A '!' snuck in before the unlikely, rendering it useless. Reported-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-07-23 20:34:59 +02:00
Linus Torvalds	d4e06701b8	Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (77 commits) [SCSI] fix crash in scsi_dispatch_cmd() [SCSI] sr: check_events() ignore GET_EVENT when TUR says otherwise [SCSI] bnx2i: Fixed kernel panic due to illegal usage of sc->request->cpu [SCSI] bfa: Update the driver version to 3.0.2.1 [SCSI] bfa: Driver and BSG enhancements. [SCSI] bfa: Added support to query PHY. [SCSI] bfa: Added HBA diagnostics support. [SCSI] bfa: Added support for flash configuration [SCSI] bfa: Added support to obtain SFP info. [SCSI] bfa: Added support for CEE info and stats query. [SCSI] bfa: Extend BSG interface. [SCSI] bfa: FCS bug fixes. [SCSI] bfa: DMA memory allocation enhancement. [SCSI] bfa: Brocade-1860 Fabric Adapter vHBA support. [SCSI] bfa: Brocade-1860 Fabric Adapter PLL init fixes. [SCSI] bfa: Added Fabric Assigned Address(FAA) support [SCSI] bfa: IOC bug fixes. [SCSI] bfa: Enable ASIC block configuration and query. [SCSI] bnx2i: Updated copyright and bump version [SCSI] bnx2i: Modified to skip CNIC registration if iSCSI is not supported ... Fix up some trivial conflicts in: - drivers/scsi/bnx2fc/{bnx2fc.h,bnx2fc_fcoe.c}: Crazy broadcom version number conflicts - drivers/target/tcm_fc/tfc_cmd.c Just trivial cleanups done on adjacent lines	2011-07-23 11:13:11 -07:00
James Bottomley	bfe159a512	[SCSI] fix crash in scsi_dispatch_cmd() USB surprise removal of sr is triggering an oops in scsi_dispatch_command(). What seems to be happening is that USB is hanging on to a queue reference until the last close of the upper device, so the crash is caused by surprise remove of a mounted CD followed by attempted unmount. The problem is that USB doesn't issue its final commands as part of the SCSI teardown path, but on last close when the block queue is long gone. The long term fix is probably to make sr do the teardown in the same way as sd (so remove all the lower bits on ejection, but keep the upper disk alive until last close of user space). However, the current oops can be simply fixed by not allowing any commands to be sent to a dead queue. Cc: stable@kernel.org Signed-off-by: James Bottomley <JBottomley@Parallels.com>	2011-07-21 14:21:18 -07:00

1 2 3 4 5 ...

1640 Commits