2019-04-30 18:42:43 +00:00
|
|
|
// SPDX-License-Identifier: GPL-2.0
|
2014-05-28 16:15:41 +00:00
|
|
|
/*
|
|
|
|
* Block multiqueue core code
|
|
|
|
*
|
|
|
|
* Copyright (C) 2013-2014 Jens Axboe
|
|
|
|
* Copyright (C) 2013-2014 Christoph Hellwig
|
|
|
|
*/
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
#include <linux/kernel.h>
|
|
|
|
#include <linux/module.h>
|
|
|
|
#include <linux/backing-dev.h>
|
|
|
|
#include <linux/bio.h>
|
|
|
|
#include <linux/blkdev.h>
|
2015-09-14 17:16:02 +00:00
|
|
|
#include <linux/kmemleak.h>
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
#include <linux/mm.h>
|
|
|
|
#include <linux/init.h>
|
|
|
|
#include <linux/slab.h>
|
|
|
|
#include <linux/workqueue.h>
|
|
|
|
#include <linux/smp.h>
|
|
|
|
#include <linux/llist.h>
|
|
|
|
#include <linux/list_sort.h>
|
|
|
|
#include <linux/cpu.h>
|
|
|
|
#include <linux/cache.h>
|
|
|
|
#include <linux/sched/sysctl.h>
|
2017-02-01 15:36:40 +00:00
|
|
|
#include <linux/sched/topology.h>
|
2017-02-02 18:15:33 +00:00
|
|
|
#include <linux/sched/signal.h>
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
#include <linux/delay.h>
|
2014-09-17 14:27:03 +00:00
|
|
|
#include <linux/crash_dump.h>
|
2016-08-25 14:07:30 +00:00
|
|
|
#include <linux/prefetch.h>
|
block: Inline encryption support for blk-mq
We must have some way of letting a storage device driver know what
encryption context it should use for en/decrypting a request. However,
it's the upper layers (like the filesystem/fscrypt) that know about and
manages encryption contexts. As such, when the upper layer submits a bio
to the block layer, and this bio eventually reaches a device driver with
support for inline encryption, the device driver will need to have been
told the encryption context for that bio.
We want to communicate the encryption context from the upper layer to the
storage device along with the bio, when the bio is submitted to the block
layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can
represent an encryption context (note that we can't use the bi_private
field in struct bio to do this because that field does not function to pass
information across layers in the storage stack). We also introduce various
functions to manipulate the bio_crypt_ctx and make the bio/request merging
logic aware of the bio_crypt_ctx.
We also make changes to blk-mq to make it handle bios with encryption
contexts. blk-mq can merge many bios into the same request. These bios need
to have contiguous data unit numbers (the necessary changes to blk-merge
are also made to ensure this) - as such, it suffices to keep the data unit
number of just the first bio, since that's all a storage driver needs to
infer the data unit number to use for each data block in each bio in a
request. blk-mq keeps track of the encryption context to be used for all
the bios in a request with the request's rq_crypt_ctx. When the first bio
is added to an empty request, blk-mq will program the encryption context
of that bio into the request_queue's keyslot manager, and store the
returned keyslot in the request's rq_crypt_ctx. All the functions to
operate on encryption contexts are in blk-crypto.c.
Upper layers only need to call bio_crypt_set_ctx with the encryption key,
algorithm and data_unit_num; they don't have to worry about getting a
keyslot for each encryption context, as blk-mq/blk-crypto handles that.
Blk-crypto also makes it possible for request-based layered devices like
dm-rq to make use of inline encryption hardware by cloning the
rq_crypt_ctx and programming a keyslot in the new request_queue when
necessary.
Note that any user of the block layer can submit bios with an
encryption context, such as filesystems, device-mapper targets, etc.
Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-14 00:37:18 +00:00
|
|
|
#include <linux/blk-crypto.h>
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
|
|
|
#include <trace/events/block.h>
|
|
|
|
|
|
|
|
#include <linux/blk-mq.h>
|
2019-09-16 15:44:29 +00:00
|
|
|
#include <linux/t10-pi.h>
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
#include "blk.h"
|
|
|
|
#include "blk-mq.h"
|
2017-05-04 14:17:21 +00:00
|
|
|
#include "blk-mq-debugfs.h"
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
#include "blk-mq-tag.h"
|
2018-09-26 21:01:10 +00:00
|
|
|
#include "blk-pm.h"
|
2016-11-08 04:32:37 +00:00
|
|
|
#include "blk-stat.h"
|
2017-01-17 13:03:22 +00:00
|
|
|
#include "blk-mq-sched.h"
|
2018-07-03 15:14:59 +00:00
|
|
|
#include "blk-rq-qos.h"
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
2021-01-23 20:10:27 +00:00
|
|
|
static DEFINE_PER_CPU(struct llist_head, blk_cpu_done);
|
2020-06-11 06:44:41 +00:00
|
|
|
|
blk-stat: convert to callback-based statistics reporting
Currently, statistics are gathered in ~0.13s windows, and users grab the
statistics whenever they need them. This is not ideal for both in-tree
users:
1. Writeback throttling wants its own dynamically sized window of
statistics. Since the blk-stats statistics are reset after every
window and the wbt windows don't line up with the blk-stats windows,
wbt doesn't see every I/O.
2. Polling currently grabs the statistics on every I/O. Again, depending
on how the window lines up, we may miss some I/Os. It's also
unnecessary overhead to get the statistics on every I/O; the hybrid
polling heuristic would be just as happy with the statistics from the
previous full window.
This reworks the blk-stats infrastructure to be callback-based: users
register a callback that they want called at a given time with all of
the statistics from the window during which the callback was active.
Users can dynamically bucketize the statistics. wbt and polling both
currently use read vs. write, but polling can be extended to further
subdivide based on request size.
The callbacks are kept on an RCU list, and each callback has percpu
stats buffers. There will only be a few users, so the overhead on the
I/O completion side is low. The stats flushing is also simplified
considerably: since the timer function is responsible for clearing the
statistics, we don't have to worry about stale statistics.
wbt is a trivial conversion. After the conversion, the windowing problem
mentioned above is fixed.
For polling, we register an extra callback that caches the previous
window's statistics in the struct request_queue for the hybrid polling
heuristic to use.
Since we no longer have a single stats buffer for the request queue,
this also removes the sysfs and debugfs stats entries. To replace those,
we add a debugfs entry for the poll statistics.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-21 15:56:08 +00:00
|
|
|
static void blk_mq_poll_stats_start(struct request_queue *q);
|
|
|
|
static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
|
|
|
|
|
2017-04-07 12:24:03 +00:00
|
|
|
static int blk_mq_poll_stats_bkt(const struct request *rq)
|
|
|
|
{
|
2019-05-21 07:59:03 +00:00
|
|
|
int ddir, sectors, bucket;
|
2017-04-07 12:24:03 +00:00
|
|
|
|
2017-04-21 13:55:42 +00:00
|
|
|
ddir = rq_data_dir(rq);
|
2019-05-21 07:59:03 +00:00
|
|
|
sectors = blk_rq_stats_sectors(rq);
|
2017-04-07 12:24:03 +00:00
|
|
|
|
2019-05-21 07:59:03 +00:00
|
|
|
bucket = ddir + 2 * ilog2(sectors);
|
2017-04-07 12:24:03 +00:00
|
|
|
|
|
|
|
if (bucket < 0)
|
|
|
|
return -1;
|
|
|
|
else if (bucket >= BLK_MQ_POLL_STATS_BKTS)
|
|
|
|
return ddir + BLK_MQ_POLL_STATS_BKTS - 2;
|
|
|
|
|
|
|
|
return bucket;
|
|
|
|
}
|
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
/*
|
2019-03-24 09:57:08 +00:00
|
|
|
* Check if any of the ctx, dispatch list or elevator
|
|
|
|
* have pending work in this hardware queue.
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
*/
|
2017-11-10 16:13:21 +00:00
|
|
|
static bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
{
|
2017-11-10 16:13:21 +00:00
|
|
|
return !list_empty_careful(&hctx->dispatch) ||
|
|
|
|
sbitmap_any_bit_set(&hctx->ctx_map) ||
|
2017-01-17 13:03:22 +00:00
|
|
|
blk_mq_sched_has_work(hctx);
|
2014-05-19 15:23:55 +00:00
|
|
|
}
|
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
/*
|
|
|
|
* Mark this ctx as having pending work in this hardware queue
|
|
|
|
*/
|
|
|
|
static void blk_mq_hctx_mark_pending(struct blk_mq_hw_ctx *hctx,
|
|
|
|
struct blk_mq_ctx *ctx)
|
|
|
|
{
|
2018-10-29 19:13:29 +00:00
|
|
|
const int bit = ctx->index_hw[hctx->type];
|
|
|
|
|
|
|
|
if (!sbitmap_test_bit(&hctx->ctx_map, bit))
|
|
|
|
sbitmap_set_bit(&hctx->ctx_map, bit);
|
2014-05-19 15:23:55 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx,
|
|
|
|
struct blk_mq_ctx *ctx)
|
|
|
|
{
|
2018-10-29 19:13:29 +00:00
|
|
|
const int bit = ctx->index_hw[hctx->type];
|
|
|
|
|
|
|
|
sbitmap_clear_bit(&hctx->ctx_map, bit);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
}
|
|
|
|
|
2017-08-08 23:51:45 +00:00
|
|
|
struct mq_inflight {
|
2020-11-24 08:36:54 +00:00
|
|
|
struct block_device *part;
|
2019-09-30 18:55:34 +00:00
|
|
|
unsigned int inflight[2];
|
2017-08-08 23:51:45 +00:00
|
|
|
};
|
|
|
|
|
2018-11-08 17:24:07 +00:00
|
|
|
static bool blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
|
2017-08-08 23:51:45 +00:00
|
|
|
struct request *rq, void *priv,
|
|
|
|
bool reserved)
|
|
|
|
{
|
|
|
|
struct mq_inflight *mi = priv;
|
|
|
|
|
2020-12-02 11:11:45 +00:00
|
|
|
if ((!mi->part->bd_partno || rq->part == mi->part) &&
|
|
|
|
blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT)
|
2019-09-30 18:55:33 +00:00
|
|
|
mi->inflight[rq_data_dir(rq)]++;
|
2018-11-08 17:24:07 +00:00
|
|
|
|
|
|
|
return true;
|
2017-08-08 23:51:45 +00:00
|
|
|
}
|
|
|
|
|
2020-11-24 08:36:54 +00:00
|
|
|
unsigned int blk_mq_in_flight(struct request_queue *q,
|
|
|
|
struct block_device *part)
|
2017-08-08 23:51:45 +00:00
|
|
|
{
|
2019-09-30 18:55:34 +00:00
|
|
|
struct mq_inflight mi = { .part = part };
|
2017-08-08 23:51:45 +00:00
|
|
|
|
|
|
|
blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi);
|
2018-12-06 16:41:21 +00:00
|
|
|
|
2019-09-30 18:55:34 +00:00
|
|
|
return mi.inflight[0] + mi.inflight[1];
|
2018-04-26 07:21:59 +00:00
|
|
|
}
|
|
|
|
|
2020-11-24 08:36:54 +00:00
|
|
|
void blk_mq_in_flight_rw(struct request_queue *q, struct block_device *part,
|
|
|
|
unsigned int inflight[2])
|
2018-04-26 07:21:59 +00:00
|
|
|
{
|
2019-09-30 18:55:34 +00:00
|
|
|
struct mq_inflight mi = { .part = part };
|
2018-04-26 07:21:59 +00:00
|
|
|
|
2019-09-30 18:55:33 +00:00
|
|
|
blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi);
|
2019-09-30 18:55:34 +00:00
|
|
|
inflight[0] = mi.inflight[0];
|
|
|
|
inflight[1] = mi.inflight[1];
|
2018-04-26 07:21:59 +00:00
|
|
|
}
|
|
|
|
|
2017-03-27 12:06:57 +00:00
|
|
|
void blk_freeze_queue_start(struct request_queue *q)
|
2013-12-26 13:31:35 +00:00
|
|
|
{
|
2019-05-21 03:25:55 +00:00
|
|
|
mutex_lock(&q->mq_freeze_lock);
|
|
|
|
if (++q->mq_freeze_depth == 1) {
|
2015-10-21 17:20:12 +00:00
|
|
|
percpu_ref_kill(&q->q_usage_counter);
|
2019-05-21 03:25:55 +00:00
|
|
|
mutex_unlock(&q->mq_freeze_lock);
|
2018-11-15 19:22:51 +00:00
|
|
|
if (queue_is_mq(q))
|
2017-11-09 18:49:53 +00:00
|
|
|
blk_mq_run_hw_queues(q, false);
|
2019-05-21 03:25:55 +00:00
|
|
|
} else {
|
|
|
|
mutex_unlock(&q->mq_freeze_lock);
|
2014-08-16 12:02:24 +00:00
|
|
|
}
|
2014-11-04 18:52:27 +00:00
|
|
|
}
|
2017-03-27 12:06:57 +00:00
|
|
|
EXPORT_SYMBOL_GPL(blk_freeze_queue_start);
|
2014-11-04 18:52:27 +00:00
|
|
|
|
2017-03-01 19:22:10 +00:00
|
|
|
void blk_mq_freeze_queue_wait(struct request_queue *q)
|
2014-11-04 18:52:27 +00:00
|
|
|
{
|
2015-10-21 17:20:12 +00:00
|
|
|
wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));
|
2013-12-26 13:31:35 +00:00
|
|
|
}
|
2017-03-01 19:22:10 +00:00
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait);
|
2013-12-26 13:31:35 +00:00
|
|
|
|
2017-03-01 19:22:11 +00:00
|
|
|
int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
|
|
|
|
unsigned long timeout)
|
|
|
|
{
|
|
|
|
return wait_event_timeout(q->mq_freeze_wq,
|
|
|
|
percpu_ref_is_zero(&q->q_usage_counter),
|
|
|
|
timeout);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait_timeout);
|
2013-12-26 13:31:35 +00:00
|
|
|
|
2014-11-04 18:52:27 +00:00
|
|
|
/*
|
|
|
|
* Guarantee no request is in use, so we can change any data structure of
|
|
|
|
* the queue afterward.
|
|
|
|
*/
|
2015-10-21 17:20:12 +00:00
|
|
|
void blk_freeze_queue(struct request_queue *q)
|
2014-11-04 18:52:27 +00:00
|
|
|
{
|
2015-10-21 17:20:12 +00:00
|
|
|
/*
|
|
|
|
* In the !blk_mq case we are only calling this to kill the
|
|
|
|
* q_usage_counter, otherwise this increases the freeze depth
|
|
|
|
* and waits for it to return to zero. For this reason there is
|
|
|
|
* no blk_unfreeze_queue(), and blk_freeze_queue() is not
|
|
|
|
* exported to drivers as the only user for unfreeze is blk_mq.
|
|
|
|
*/
|
2017-03-27 12:06:57 +00:00
|
|
|
blk_freeze_queue_start(q);
|
2014-11-04 18:52:27 +00:00
|
|
|
blk_mq_freeze_queue_wait(q);
|
|
|
|
}
|
2015-10-21 17:20:12 +00:00
|
|
|
|
|
|
|
void blk_mq_freeze_queue(struct request_queue *q)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* ...just an alias to keep freeze and unfreeze actions balanced
|
|
|
|
* in the blk_mq_* namespace
|
|
|
|
*/
|
|
|
|
blk_freeze_queue(q);
|
|
|
|
}
|
2015-01-02 22:05:12 +00:00
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_freeze_queue);
|
2014-11-04 18:52:27 +00:00
|
|
|
|
2014-12-20 00:54:14 +00:00
|
|
|
void blk_mq_unfreeze_queue(struct request_queue *q)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
{
|
2019-05-21 03:25:55 +00:00
|
|
|
mutex_lock(&q->mq_freeze_lock);
|
|
|
|
q->mq_freeze_depth--;
|
|
|
|
WARN_ON_ONCE(q->mq_freeze_depth < 0);
|
|
|
|
if (!q->mq_freeze_depth) {
|
2018-09-26 21:01:08 +00:00
|
|
|
percpu_ref_resurrect(&q->q_usage_counter);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
wake_up_all(&q->mq_freeze_wq);
|
2014-07-01 16:34:38 +00:00
|
|
|
}
|
2019-05-21 03:25:55 +00:00
|
|
|
mutex_unlock(&q->mq_freeze_lock);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
}
|
2014-12-20 00:54:14 +00:00
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
2017-06-21 17:55:47 +00:00
|
|
|
/*
|
|
|
|
* FIXME: replace the scsi_internal_device_*block_nowait() calls in the
|
|
|
|
* mpt3sas driver such that this function can be removed.
|
|
|
|
*/
|
|
|
|
void blk_mq_quiesce_queue_nowait(struct request_queue *q)
|
|
|
|
{
|
2018-03-08 01:10:04 +00:00
|
|
|
blk_queue_flag_set(QUEUE_FLAG_QUIESCED, q);
|
2017-06-21 17:55:47 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue_nowait);
|
|
|
|
|
2016-11-02 16:09:51 +00:00
|
|
|
/**
|
2017-06-06 15:22:07 +00:00
|
|
|
* blk_mq_quiesce_queue() - wait until all ongoing dispatches have finished
|
2016-11-02 16:09:51 +00:00
|
|
|
* @q: request queue.
|
|
|
|
*
|
|
|
|
* Note: this function does not prevent that the struct request end_io()
|
2017-06-06 15:22:07 +00:00
|
|
|
* callback function is invoked. Once this function is returned, we make
|
|
|
|
* sure no dispatch can happen until the queue is unquiesced via
|
|
|
|
* blk_mq_unquiesce_queue().
|
2016-11-02 16:09:51 +00:00
|
|
|
*/
|
|
|
|
void blk_mq_quiesce_queue(struct request_queue *q)
|
|
|
|
{
|
|
|
|
struct blk_mq_hw_ctx *hctx;
|
|
|
|
unsigned int i;
|
|
|
|
bool rcu = false;
|
|
|
|
|
2017-06-06 15:22:08 +00:00
|
|
|
blk_mq_quiesce_queue_nowait(q);
|
2017-06-18 20:24:27 +00:00
|
|
|
|
2016-11-02 16:09:51 +00:00
|
|
|
queue_for_each_hw_ctx(q, hctx, i) {
|
|
|
|
if (hctx->flags & BLK_MQ_F_BLOCKING)
|
2018-01-09 16:29:53 +00:00
|
|
|
synchronize_srcu(hctx->srcu);
|
2016-11-02 16:09:51 +00:00
|
|
|
else
|
|
|
|
rcu = true;
|
|
|
|
}
|
|
|
|
if (rcu)
|
|
|
|
synchronize_rcu();
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue);
|
|
|
|
|
2017-06-06 15:22:03 +00:00
|
|
|
/*
|
|
|
|
* blk_mq_unquiesce_queue() - counterpart of blk_mq_quiesce_queue()
|
|
|
|
* @q: request queue.
|
|
|
|
*
|
|
|
|
* This function recovers queue into the state before quiescing
|
|
|
|
* which is done by blk_mq_quiesce_queue.
|
|
|
|
*/
|
|
|
|
void blk_mq_unquiesce_queue(struct request_queue *q)
|
|
|
|
{
|
2018-03-08 01:10:04 +00:00
|
|
|
blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
|
2017-06-18 20:24:27 +00:00
|
|
|
|
2017-06-06 15:22:08 +00:00
|
|
|
/* dispatch requests which are inserted during quiescing */
|
|
|
|
blk_mq_run_hw_queues(q, true);
|
2017-06-06 15:22:03 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_unquiesce_queue);
|
|
|
|
|
2014-12-22 21:04:42 +00:00
|
|
|
void blk_mq_wake_waiters(struct request_queue *q)
|
|
|
|
{
|
|
|
|
struct blk_mq_hw_ctx *hctx;
|
|
|
|
unsigned int i;
|
|
|
|
|
|
|
|
queue_for_each_hw_ctx(q, hctx, i)
|
|
|
|
if (blk_mq_hw_queue_mapped(hctx))
|
|
|
|
blk_mq_tag_wakeup_all(hctx->tags, true);
|
|
|
|
}
|
|
|
|
|
2018-11-28 17:50:07 +00:00
|
|
|
/*
|
2019-05-21 07:59:04 +00:00
|
|
|
* Only need start/end time stamping if we have iostat or
|
|
|
|
* blk stats enabled, or using an IO scheduler.
|
2018-11-28 17:50:07 +00:00
|
|
|
*/
|
|
|
|
static inline bool blk_mq_need_time_stamp(struct request *rq)
|
|
|
|
{
|
2019-05-21 07:59:04 +00:00
|
|
|
return (rq->rq_flags & (RQF_IO_STAT | RQF_STATS)) || rq->q->elevator;
|
2018-11-28 17:50:07 +00:00
|
|
|
}
|
|
|
|
|
2017-06-16 16:15:27 +00:00
|
|
|
static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
|
2020-05-29 13:53:10 +00:00
|
|
|
unsigned int tag, u64 alloc_time_ns)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
{
|
2017-06-16 16:15:27 +00:00
|
|
|
struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
|
|
|
|
struct request *rq = tags->static_rqs[tag];
|
2017-06-20 18:15:43 +00:00
|
|
|
|
2020-06-29 15:08:34 +00:00
|
|
|
if (data->q->elevator) {
|
2020-05-29 13:53:12 +00:00
|
|
|
rq->tag = BLK_MQ_NO_TAG;
|
2017-06-16 16:15:27 +00:00
|
|
|
rq->internal_tag = tag;
|
|
|
|
} else {
|
|
|
|
rq->tag = tag;
|
2020-05-29 13:53:12 +00:00
|
|
|
rq->internal_tag = BLK_MQ_NO_TAG;
|
2017-06-16 16:15:27 +00:00
|
|
|
}
|
|
|
|
|
2014-05-06 10:12:45 +00:00
|
|
|
/* csd/requeue_work/fifo_time is initialized before use */
|
2017-06-16 16:15:27 +00:00
|
|
|
rq->q = data->q;
|
|
|
|
rq->mq_ctx = data->ctx;
|
2018-10-29 21:06:13 +00:00
|
|
|
rq->mq_hctx = data->hctx;
|
2020-07-06 14:41:11 +00:00
|
|
|
rq->rq_flags = 0;
|
2020-05-29 13:53:10 +00:00
|
|
|
rq->cmd_flags = data->cmd_flags;
|
2020-12-09 05:29:45 +00:00
|
|
|
if (data->flags & BLK_MQ_REQ_PM)
|
|
|
|
rq->rq_flags |= RQF_PM;
|
2017-06-16 16:15:27 +00:00
|
|
|
if (blk_queue_io_stat(data->q))
|
2016-10-20 13:12:13 +00:00
|
|
|
rq->rq_flags |= RQF_IO_STAT;
|
2018-01-10 18:46:39 +00:00
|
|
|
INIT_LIST_HEAD(&rq->queuelist);
|
2014-05-06 10:12:45 +00:00
|
|
|
INIT_HLIST_NODE(&rq->hash);
|
|
|
|
RB_CLEAR_NODE(&rq->rb_node);
|
|
|
|
rq->rq_disk = NULL;
|
|
|
|
rq->part = NULL;
|
2019-08-28 22:05:57 +00:00
|
|
|
#ifdef CONFIG_BLK_RQ_ALLOC_TIME
|
|
|
|
rq->alloc_time_ns = alloc_time_ns;
|
|
|
|
#endif
|
2018-11-28 17:50:07 +00:00
|
|
|
if (blk_mq_need_time_stamp(rq))
|
|
|
|
rq->start_time_ns = ktime_get_ns();
|
|
|
|
else
|
|
|
|
rq->start_time_ns = 0;
|
2018-05-09 09:08:50 +00:00
|
|
|
rq->io_start_time_ns = 0;
|
2019-05-21 07:59:03 +00:00
|
|
|
rq->stats_sectors = 0;
|
2014-05-06 10:12:45 +00:00
|
|
|
rq->nr_phys_segments = 0;
|
|
|
|
#if defined(CONFIG_BLK_DEV_INTEGRITY)
|
|
|
|
rq->nr_integrity_segments = 0;
|
|
|
|
#endif
|
block: Inline encryption support for blk-mq
We must have some way of letting a storage device driver know what
encryption context it should use for en/decrypting a request. However,
it's the upper layers (like the filesystem/fscrypt) that know about and
manages encryption contexts. As such, when the upper layer submits a bio
to the block layer, and this bio eventually reaches a device driver with
support for inline encryption, the device driver will need to have been
told the encryption context for that bio.
We want to communicate the encryption context from the upper layer to the
storage device along with the bio, when the bio is submitted to the block
layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can
represent an encryption context (note that we can't use the bi_private
field in struct bio to do this because that field does not function to pass
information across layers in the storage stack). We also introduce various
functions to manipulate the bio_crypt_ctx and make the bio/request merging
logic aware of the bio_crypt_ctx.
We also make changes to blk-mq to make it handle bios with encryption
contexts. blk-mq can merge many bios into the same request. These bios need
to have contiguous data unit numbers (the necessary changes to blk-merge
are also made to ensure this) - as such, it suffices to keep the data unit
number of just the first bio, since that's all a storage driver needs to
infer the data unit number to use for each data block in each bio in a
request. blk-mq keeps track of the encryption context to be used for all
the bios in a request with the request's rq_crypt_ctx. When the first bio
is added to an empty request, blk-mq will program the encryption context
of that bio into the request_queue's keyslot manager, and store the
returned keyslot in the request's rq_crypt_ctx. All the functions to
operate on encryption contexts are in blk-crypto.c.
Upper layers only need to call bio_crypt_set_ctx with the encryption key,
algorithm and data_unit_num; they don't have to worry about getting a
keyslot for each encryption context, as blk-mq/blk-crypto handles that.
Blk-crypto also makes it possible for request-based layered devices like
dm-rq to make use of inline encryption hardware by cloning the
rq_crypt_ctx and programming a keyslot in the new request_queue when
necessary.
Note that any user of the block layer can submit bios with an
encryption context, such as filesystems, device-mapper targets, etc.
Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-14 00:37:18 +00:00
|
|
|
blk_crypto_rq_set_defaults(rq);
|
2014-05-06 10:12:45 +00:00
|
|
|
/* tag was already set */
|
2018-11-14 16:02:05 +00:00
|
|
|
WRITE_ONCE(rq->deadline, 0);
|
2014-05-06 10:12:45 +00:00
|
|
|
|
2014-06-06 17:03:48 +00:00
|
|
|
rq->timeout = 0;
|
|
|
|
|
2014-05-06 10:12:45 +00:00
|
|
|
rq->end_io = NULL;
|
|
|
|
rq->end_io_data = NULL;
|
|
|
|
|
2020-05-29 13:53:10 +00:00
|
|
|
data->ctx->rq_dispatched[op_is_sync(data->cmd_flags)]++;
|
2018-05-29 13:52:28 +00:00
|
|
|
refcount_set(&rq->ref, 1);
|
2020-05-29 13:53:10 +00:00
|
|
|
|
|
|
|
if (!op_is_flush(data->cmd_flags)) {
|
|
|
|
struct elevator_queue *e = data->q->elevator;
|
|
|
|
|
|
|
|
rq->elv.icq = NULL;
|
|
|
|
if (e && e->type->ops.prepare_request) {
|
|
|
|
if (e->type->icq_cache)
|
|
|
|
blk_mq_sched_assign_ioc(rq);
|
|
|
|
|
|
|
|
e->type->ops.prepare_request(rq);
|
|
|
|
rq->rq_flags |= RQF_ELVPRIV;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
data->hctx->queued++;
|
2017-06-16 16:15:27 +00:00
|
|
|
return rq;
|
2014-05-27 18:59:47 +00:00
|
|
|
}
|
|
|
|
|
2020-05-29 13:53:09 +00:00
|
|
|
static struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data)
|
2017-06-16 16:15:19 +00:00
|
|
|
{
|
2020-05-29 13:53:09 +00:00
|
|
|
struct request_queue *q = data->q;
|
2017-06-16 16:15:19 +00:00
|
|
|
struct elevator_queue *e = q->elevator;
|
2019-08-28 22:05:57 +00:00
|
|
|
u64 alloc_time_ns = 0;
|
2020-05-29 13:53:13 +00:00
|
|
|
unsigned int tag;
|
2017-06-16 16:15:19 +00:00
|
|
|
|
2019-08-28 22:05:57 +00:00
|
|
|
/* alloc_time includes depth and tag waits */
|
|
|
|
if (blk_queue_rq_alloc_time(q))
|
|
|
|
alloc_time_ns = ktime_get_ns();
|
|
|
|
|
2018-10-29 19:11:38 +00:00
|
|
|
if (data->cmd_flags & REQ_NOWAIT)
|
2017-06-20 12:05:46 +00:00
|
|
|
data->flags |= BLK_MQ_REQ_NOWAIT;
|
2017-06-16 16:15:19 +00:00
|
|
|
|
|
|
|
if (e) {
|
|
|
|
/*
|
2021-04-15 03:39:20 +00:00
|
|
|
* Flush/passthrough requests are special and go directly to the
|
2018-05-09 19:28:50 +00:00
|
|
|
* dispatch list. Don't include reserved tags in the
|
|
|
|
* limiting, as it isn't useful.
|
2017-06-16 16:15:19 +00:00
|
|
|
*/
|
2018-10-29 19:11:38 +00:00
|
|
|
if (!op_is_flush(data->cmd_flags) &&
|
2021-04-15 03:39:20 +00:00
|
|
|
!blk_op_is_passthrough(data->cmd_flags) &&
|
2018-10-29 19:11:38 +00:00
|
|
|
e->type->ops.limit_depth &&
|
2018-05-09 19:28:50 +00:00
|
|
|
!(data->flags & BLK_MQ_REQ_RESERVED))
|
2018-10-29 19:11:38 +00:00
|
|
|
e->type->ops.limit_depth(data->cmd_flags, data);
|
2017-06-16 16:15:19 +00:00
|
|
|
}
|
|
|
|
|
2020-05-29 13:53:15 +00:00
|
|
|
retry:
|
2020-05-29 13:53:13 +00:00
|
|
|
data->ctx = blk_mq_get_ctx(q);
|
|
|
|
data->hctx = blk_mq_map_queue(q, data->cmd_flags, data->ctx);
|
2020-06-29 15:08:34 +00:00
|
|
|
if (!e)
|
2020-05-29 13:53:13 +00:00
|
|
|
blk_mq_tag_busy(data->hctx);
|
|
|
|
|
2020-05-29 13:53:15 +00:00
|
|
|
/*
|
|
|
|
* Waiting allocations only fail because of an inactive hctx. In that
|
|
|
|
* case just retry the hctx assignment and tag allocation as CPU hotplug
|
|
|
|
* should have migrated us to an online CPU by now.
|
|
|
|
*/
|
2017-06-16 16:15:27 +00:00
|
|
|
tag = blk_mq_get_tag(data);
|
2020-05-29 13:53:15 +00:00
|
|
|
if (tag == BLK_MQ_NO_TAG) {
|
|
|
|
if (data->flags & BLK_MQ_REQ_NOWAIT)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Give up the CPU and sleep for a random short time to ensure
|
|
|
|
* that thread using a realtime scheduling class are migrated
|
2020-07-31 01:42:31 +00:00
|
|
|
* off the CPU, and thus off the hctx that is going away.
|
2020-05-29 13:53:15 +00:00
|
|
|
*/
|
|
|
|
msleep(3);
|
|
|
|
goto retry;
|
|
|
|
}
|
2020-05-29 13:53:10 +00:00
|
|
|
return blk_mq_rq_ctx_init(data, tag, alloc_time_ns);
|
2017-06-16 16:15:19 +00:00
|
|
|
}
|
|
|
|
|
2017-06-20 18:15:39 +00:00
|
|
|
struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
|
2017-11-09 18:49:59 +00:00
|
|
|
blk_mq_req_flags_t flags)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
{
|
2020-05-29 13:53:09 +00:00
|
|
|
struct blk_mq_alloc_data data = {
|
|
|
|
.q = q,
|
|
|
|
.flags = flags,
|
|
|
|
.cmd_flags = op,
|
|
|
|
};
|
2017-01-17 13:03:22 +00:00
|
|
|
struct request *rq;
|
2014-08-28 14:15:21 +00:00
|
|
|
int ret;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
2017-11-09 18:49:58 +00:00
|
|
|
ret = blk_queue_enter(q, flags);
|
2014-08-28 14:15:21 +00:00
|
|
|
if (ret)
|
|
|
|
return ERR_PTR(ret);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
2020-05-29 13:53:09 +00:00
|
|
|
rq = __blk_mq_alloc_request(&data);
|
2017-01-17 13:03:22 +00:00
|
|
|
if (!rq)
|
2020-05-16 18:27:58 +00:00
|
|
|
goto out_queue_exit;
|
2016-07-19 09:31:50 +00:00
|
|
|
rq->__data_len = 0;
|
|
|
|
rq->__sector = (sector_t) -1;
|
|
|
|
rq->bio = rq->biotail = NULL;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
return rq;
|
2020-05-16 18:27:58 +00:00
|
|
|
out_queue_exit:
|
|
|
|
blk_queue_exit(q);
|
|
|
|
return ERR_PTR(-EWOULDBLOCK);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
}
|
2014-05-09 15:36:49 +00:00
|
|
|
EXPORT_SYMBOL(blk_mq_alloc_request);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
2017-06-20 18:15:39 +00:00
|
|
|
struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
|
2017-11-09 18:49:59 +00:00
|
|
|
unsigned int op, blk_mq_req_flags_t flags, unsigned int hctx_idx)
|
2016-06-13 14:45:21 +00:00
|
|
|
{
|
2020-05-29 13:53:09 +00:00
|
|
|
struct blk_mq_alloc_data data = {
|
|
|
|
.q = q,
|
|
|
|
.flags = flags,
|
|
|
|
.cmd_flags = op,
|
|
|
|
};
|
2020-05-29 13:53:13 +00:00
|
|
|
u64 alloc_time_ns = 0;
|
2017-02-27 18:28:27 +00:00
|
|
|
unsigned int cpu;
|
2020-05-29 13:53:13 +00:00
|
|
|
unsigned int tag;
|
2016-06-13 14:45:21 +00:00
|
|
|
int ret;
|
|
|
|
|
2020-05-29 13:53:13 +00:00
|
|
|
/* alloc_time includes depth and tag waits */
|
|
|
|
if (blk_queue_rq_alloc_time(q))
|
|
|
|
alloc_time_ns = ktime_get_ns();
|
|
|
|
|
2016-06-13 14:45:21 +00:00
|
|
|
/*
|
|
|
|
* If the tag allocator sleeps we could get an allocation for a
|
|
|
|
* different hardware context. No need to complicate the low level
|
|
|
|
* allocator for this for the rare use case of a command tied to
|
|
|
|
* a specific queue.
|
|
|
|
*/
|
2020-05-29 13:53:13 +00:00
|
|
|
if (WARN_ON_ONCE(!(flags & (BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_RESERVED))))
|
2016-06-13 14:45:21 +00:00
|
|
|
return ERR_PTR(-EINVAL);
|
|
|
|
|
|
|
|
if (hctx_idx >= q->nr_hw_queues)
|
|
|
|
return ERR_PTR(-EIO);
|
|
|
|
|
2017-11-09 18:49:58 +00:00
|
|
|
ret = blk_queue_enter(q, flags);
|
2016-06-13 14:45:21 +00:00
|
|
|
if (ret)
|
|
|
|
return ERR_PTR(ret);
|
|
|
|
|
2016-09-23 16:25:48 +00:00
|
|
|
/*
|
|
|
|
* Check if the hardware context is actually mapped to anything.
|
|
|
|
* If not tell the caller that it should skip this queue.
|
|
|
|
*/
|
2020-05-16 18:27:58 +00:00
|
|
|
ret = -EXDEV;
|
2020-05-29 13:53:09 +00:00
|
|
|
data.hctx = q->queue_hw_ctx[hctx_idx];
|
|
|
|
if (!blk_mq_hw_queue_mapped(data.hctx))
|
2020-05-16 18:27:58 +00:00
|
|
|
goto out_queue_exit;
|
2020-05-29 13:53:09 +00:00
|
|
|
cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
|
|
|
|
data.ctx = __blk_mq_get_ctx(q, cpu);
|
2016-06-13 14:45:21 +00:00
|
|
|
|
2020-06-29 15:08:34 +00:00
|
|
|
if (!q->elevator)
|
2020-05-29 13:53:13 +00:00
|
|
|
blk_mq_tag_busy(data.hctx);
|
|
|
|
|
2020-05-16 18:27:58 +00:00
|
|
|
ret = -EWOULDBLOCK;
|
2020-05-29 13:53:13 +00:00
|
|
|
tag = blk_mq_get_tag(&data);
|
|
|
|
if (tag == BLK_MQ_NO_TAG)
|
2020-05-16 18:27:58 +00:00
|
|
|
goto out_queue_exit;
|
2020-05-29 13:53:13 +00:00
|
|
|
return blk_mq_rq_ctx_init(&data, tag, alloc_time_ns);
|
|
|
|
|
2020-05-16 18:27:58 +00:00
|
|
|
out_queue_exit:
|
|
|
|
blk_queue_exit(q);
|
|
|
|
return ERR_PTR(ret);
|
2016-06-13 14:45:21 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_alloc_request_hctx);
|
|
|
|
|
2018-05-29 13:52:28 +00:00
|
|
|
static void __blk_mq_free_request(struct request *rq)
|
|
|
|
{
|
|
|
|
struct request_queue *q = rq->q;
|
|
|
|
struct blk_mq_ctx *ctx = rq->mq_ctx;
|
2018-10-29 21:06:13 +00:00
|
|
|
struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
|
2018-05-29 13:52:28 +00:00
|
|
|
const int sched_tag = rq->internal_tag;
|
|
|
|
|
block: Inline encryption support for blk-mq
We must have some way of letting a storage device driver know what
encryption context it should use for en/decrypting a request. However,
it's the upper layers (like the filesystem/fscrypt) that know about and
manages encryption contexts. As such, when the upper layer submits a bio
to the block layer, and this bio eventually reaches a device driver with
support for inline encryption, the device driver will need to have been
told the encryption context for that bio.
We want to communicate the encryption context from the upper layer to the
storage device along with the bio, when the bio is submitted to the block
layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can
represent an encryption context (note that we can't use the bi_private
field in struct bio to do this because that field does not function to pass
information across layers in the storage stack). We also introduce various
functions to manipulate the bio_crypt_ctx and make the bio/request merging
logic aware of the bio_crypt_ctx.
We also make changes to blk-mq to make it handle bios with encryption
contexts. blk-mq can merge many bios into the same request. These bios need
to have contiguous data unit numbers (the necessary changes to blk-merge
are also made to ensure this) - as such, it suffices to keep the data unit
number of just the first bio, since that's all a storage driver needs to
infer the data unit number to use for each data block in each bio in a
request. blk-mq keeps track of the encryption context to be used for all
the bios in a request with the request's rq_crypt_ctx. When the first bio
is added to an empty request, blk-mq will program the encryption context
of that bio into the request_queue's keyslot manager, and store the
returned keyslot in the request's rq_crypt_ctx. All the functions to
operate on encryption contexts are in blk-crypto.c.
Upper layers only need to call bio_crypt_set_ctx with the encryption key,
algorithm and data_unit_num; they don't have to worry about getting a
keyslot for each encryption context, as blk-mq/blk-crypto handles that.
Blk-crypto also makes it possible for request-based layered devices like
dm-rq to make use of inline encryption hardware by cloning the
rq_crypt_ctx and programming a keyslot in the new request_queue when
necessary.
Note that any user of the block layer can submit bios with an
encryption context, such as filesystems, device-mapper targets, etc.
Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-14 00:37:18 +00:00
|
|
|
blk_crypto_free_request(rq);
|
2018-09-26 21:01:10 +00:00
|
|
|
blk_pm_mark_last_busy(rq);
|
2018-10-29 21:06:13 +00:00
|
|
|
rq->mq_hctx = NULL;
|
2020-05-29 13:53:12 +00:00
|
|
|
if (rq->tag != BLK_MQ_NO_TAG)
|
2020-02-26 12:10:15 +00:00
|
|
|
blk_mq_put_tag(hctx->tags, ctx, rq->tag);
|
2020-05-29 13:53:12 +00:00
|
|
|
if (sched_tag != BLK_MQ_NO_TAG)
|
2020-02-26 12:10:15 +00:00
|
|
|
blk_mq_put_tag(hctx->sched_tags, ctx, sched_tag);
|
2018-05-29 13:52:28 +00:00
|
|
|
blk_mq_sched_restart(hctx);
|
|
|
|
blk_queue_exit(q);
|
|
|
|
}
|
|
|
|
|
2017-06-16 16:15:22 +00:00
|
|
|
void blk_mq_free_request(struct request *rq)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
{
|
|
|
|
struct request_queue *q = rq->q;
|
2017-06-16 16:15:22 +00:00
|
|
|
struct elevator_queue *e = q->elevator;
|
|
|
|
struct blk_mq_ctx *ctx = rq->mq_ctx;
|
2018-10-29 21:06:13 +00:00
|
|
|
struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
|
2017-06-16 16:15:22 +00:00
|
|
|
|
2017-06-16 16:15:26 +00:00
|
|
|
if (rq->rq_flags & RQF_ELVPRIV) {
|
2018-11-01 22:41:41 +00:00
|
|
|
if (e && e->type->ops.finish_request)
|
|
|
|
e->type->ops.finish_request(rq);
|
2017-06-16 16:15:22 +00:00
|
|
|
if (rq->elv.icq) {
|
|
|
|
put_io_context(rq->elv.icq->ioc);
|
|
|
|
rq->elv.icq = NULL;
|
|
|
|
}
|
|
|
|
}
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
2017-06-16 16:15:22 +00:00
|
|
|
ctx->rq_completed[rq_is_sync(rq)]++;
|
2016-10-20 13:12:13 +00:00
|
|
|
if (rq->rq_flags & RQF_MQ_INFLIGHT)
|
2020-08-19 15:20:26 +00:00
|
|
|
__blk_mq_dec_active_requests(hctx);
|
block: hook up writeback throttling
Enable throttling of buffered writeback to make it a lot
more smooth, and has way less impact on other system activity.
Background writeback should be, by definition, background
activity. The fact that we flush huge bundles of it at the time
means that it potentially has heavy impacts on foreground workloads,
which isn't ideal. We can't easily limit the sizes of writes that
we do, since that would impact file system layout in the presence
of delayed allocation. So just throttle back buffered writeback,
unless someone is waiting for it.
The algorithm for when to throttle takes its inspiration in the
CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
the minimum latencies of requests over a window of time. In that
window of time, if the minimum latency of any request exceeds a
given target, then a scale count is incremented and the queue depth
is shrunk. The next monitoring window is shrunk accordingly. Unlike
CoDel, if we hit a window that exhibits good behavior, then we
simply increment the scale count and re-calculate the limits for that
scale value. This prevents us from oscillating between a
close-to-ideal value and max all the time, instead remaining in the
windows where we get good behavior.
Unlike CoDel, blk-wb allows the scale count to to negative. This
happens if we primarily have writes going on. Unlike positive
scale counts, this doesn't change the size of the monitoring window.
When the heavy writers finish, blk-bw quickly snaps back to it's
stable state of a zero scale count.
The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
target to me met. It defaults to 2 msec for non-rotational storage, and
75 msec for rotational storage. Setting this value to '0' disables
blk-wb. Generally, a user would not have to touch this setting.
We don't enable WBT on devices that are managed with CFQ, and have
a non-root block cgroup attached. If we have a proportional share setup
on this particular disk, then the wbt throttling will interfere with
that. We don't have a strong need for wbt for that case, since we will
rely on CFQ doing that for us.
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-09 19:38:14 +00:00
|
|
|
|
2017-09-30 08:08:24 +00:00
|
|
|
if (unlikely(laptop_mode && !blk_rq_is_passthrough(rq)))
|
|
|
|
laptop_io_completion(q->backing_dev_info);
|
|
|
|
|
2018-07-03 15:32:35 +00:00
|
|
|
rq_qos_done(q, rq);
|
2014-05-13 21:10:52 +00:00
|
|
|
|
2018-05-29 13:52:28 +00:00
|
|
|
WRITE_ONCE(rq->state, MQ_RQ_IDLE);
|
|
|
|
if (refcount_dec_and_test(&rq->ref))
|
|
|
|
__blk_mq_free_request(rq);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
}
|
2014-11-17 17:40:48 +00:00
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_free_request);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
2017-06-03 07:38:04 +00:00
|
|
|
inline void __blk_mq_end_request(struct request *rq, blk_status_t error)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
{
|
2018-11-28 17:50:07 +00:00
|
|
|
u64 now = 0;
|
|
|
|
|
|
|
|
if (blk_mq_need_time_stamp(rq))
|
|
|
|
now = ktime_get_ns();
|
block: consolidate struct request timestamp fields
Currently, struct request has four timestamp fields:
- A start time, set at get_request time, in jiffies, used for iostats
- An I/O start time, set at start_request time, in ktime nanoseconds,
used for blk-stats (i.e., wbt, kyber, hybrid polling)
- Another start time and another I/O start time, used for cfq and bfq
These can all be consolidated into one start time and one I/O start
time, both in ktime nanoseconds, shaving off up to 16 bytes from struct
request depending on the kernel config.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-09 09:08:53 +00:00
|
|
|
|
2018-05-09 09:08:52 +00:00
|
|
|
if (rq->rq_flags & RQF_STATS) {
|
|
|
|
blk_mq_poll_stats_start(rq->q);
|
block: consolidate struct request timestamp fields
Currently, struct request has four timestamp fields:
- A start time, set at get_request time, in jiffies, used for iostats
- An I/O start time, set at start_request time, in ktime nanoseconds,
used for blk-stats (i.e., wbt, kyber, hybrid polling)
- Another start time and another I/O start time, used for cfq and bfq
These can all be consolidated into one start time and one I/O start
time, both in ktime nanoseconds, shaving off up to 16 bytes from struct
request depending on the kernel config.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-09 09:08:53 +00:00
|
|
|
blk_stat_add(rq, now);
|
2018-05-09 09:08:52 +00:00
|
|
|
}
|
|
|
|
|
2020-07-04 07:28:21 +00:00
|
|
|
blk_mq_sched_completed_request(rq, now);
|
2018-09-27 22:55:51 +00:00
|
|
|
|
block: consolidate struct request timestamp fields
Currently, struct request has four timestamp fields:
- A start time, set at get_request time, in jiffies, used for iostats
- An I/O start time, set at start_request time, in ktime nanoseconds,
used for blk-stats (i.e., wbt, kyber, hybrid polling)
- Another start time and another I/O start time, used for cfq and bfq
These can all be consolidated into one start time and one I/O start
time, both in ktime nanoseconds, shaving off up to 16 bytes from struct
request depending on the kernel config.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-09 09:08:53 +00:00
|
|
|
blk_account_io_done(rq, now);
|
2013-12-05 17:50:39 +00:00
|
|
|
|
2014-04-16 07:44:53 +00:00
|
|
|
if (rq->end_io) {
|
2018-07-03 15:32:35 +00:00
|
|
|
rq_qos_done(rq->q, rq);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
rq->end_io(rq, error);
|
2014-04-16 07:44:53 +00:00
|
|
|
} else {
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
blk_mq_free_request(rq);
|
2014-04-16 07:44:53 +00:00
|
|
|
}
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
}
|
2014-09-13 23:40:10 +00:00
|
|
|
EXPORT_SYMBOL(__blk_mq_end_request);
|
2014-04-16 07:44:52 +00:00
|
|
|
|
2017-06-03 07:38:04 +00:00
|
|
|
void blk_mq_end_request(struct request *rq, blk_status_t error)
|
2014-04-16 07:44:52 +00:00
|
|
|
{
|
|
|
|
if (blk_update_request(rq, error, blk_rq_bytes(rq)))
|
|
|
|
BUG();
|
2014-09-13 23:40:10 +00:00
|
|
|
__blk_mq_end_request(rq, error);
|
2014-04-16 07:44:52 +00:00
|
|
|
}
|
2014-09-13 23:40:10 +00:00
|
|
|
EXPORT_SYMBOL(blk_mq_end_request);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
2021-01-23 20:10:27 +00:00
|
|
|
static void blk_complete_reqs(struct llist_head *list)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
{
|
2021-01-23 20:10:27 +00:00
|
|
|
struct llist_node *entry = llist_reverse_order(llist_del_all(list));
|
|
|
|
struct request *rq, *next;
|
2020-06-11 06:44:41 +00:00
|
|
|
|
2021-01-23 20:10:27 +00:00
|
|
|
llist_for_each_entry_safe(rq, next, entry, ipi_list)
|
2020-06-11 06:44:41 +00:00
|
|
|
rq->q->mq_ops->complete(rq);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
}
|
|
|
|
|
2021-01-23 20:10:27 +00:00
|
|
|
static __latent_entropy void blk_done_softirq(struct softirq_action *h)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
{
|
2021-01-23 20:10:27 +00:00
|
|
|
blk_complete_reqs(this_cpu_ptr(&blk_cpu_done));
|
2020-06-11 06:44:42 +00:00
|
|
|
}
|
|
|
|
|
2020-06-11 06:44:41 +00:00
|
|
|
static int blk_softirq_cpu_dead(unsigned int cpu)
|
|
|
|
{
|
2021-01-23 20:10:27 +00:00
|
|
|
blk_complete_reqs(&per_cpu(blk_cpu_done, cpu));
|
2020-06-11 06:44:41 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2020-06-11 06:44:50 +00:00
|
|
|
static void __blk_mq_complete_request_remote(void *data)
|
2020-06-11 06:44:41 +00:00
|
|
|
{
|
2021-01-23 20:10:27 +00:00
|
|
|
__raise_softirq_irqoff(BLOCK_SOFTIRQ);
|
2020-06-11 06:44:41 +00:00
|
|
|
}
|
|
|
|
|
2020-06-11 06:44:49 +00:00
|
|
|
static inline bool blk_mq_complete_need_ipi(struct request *rq)
|
|
|
|
{
|
|
|
|
int cpu = raw_smp_processor_id();
|
|
|
|
|
|
|
|
if (!IS_ENABLED(CONFIG_SMP) ||
|
|
|
|
!test_bit(QUEUE_FLAG_SAME_COMP, &rq->q->queue_flags))
|
|
|
|
return false;
|
2020-12-04 19:13:54 +00:00
|
|
|
/*
|
|
|
|
* With force threaded interrupts enabled, raising softirq from an SMP
|
|
|
|
* function call will always result in waking the ksoftirqd thread.
|
|
|
|
* This is probably worse than completing the request on a different
|
|
|
|
* cache domain.
|
|
|
|
*/
|
|
|
|
if (force_irqthreads)
|
|
|
|
return false;
|
2020-06-11 06:44:49 +00:00
|
|
|
|
|
|
|
/* same CPU or cache domain? Complete locally */
|
|
|
|
if (cpu == rq->mq_ctx->cpu ||
|
|
|
|
(!test_bit(QUEUE_FLAG_SAME_FORCE, &rq->q->queue_flags) &&
|
|
|
|
cpus_share_cache(cpu, rq->mq_ctx->cpu)))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
/* don't try to IPI to an offline CPU */
|
|
|
|
return cpu_online(rq->mq_ctx->cpu);
|
|
|
|
}
|
|
|
|
|
2021-01-23 20:10:27 +00:00
|
|
|
static void blk_mq_complete_send_ipi(struct request *rq)
|
|
|
|
{
|
|
|
|
struct llist_head *list;
|
|
|
|
unsigned int cpu;
|
|
|
|
|
|
|
|
cpu = rq->mq_ctx->cpu;
|
|
|
|
list = &per_cpu(blk_cpu_done, cpu);
|
|
|
|
if (llist_add(&rq->ipi_list, list)) {
|
|
|
|
INIT_CSD(&rq->csd, __blk_mq_complete_request_remote, rq);
|
|
|
|
smp_call_function_single_async(cpu, &rq->csd);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static void blk_mq_raise_softirq(struct request *rq)
|
|
|
|
{
|
|
|
|
struct llist_head *list;
|
|
|
|
|
|
|
|
preempt_disable();
|
|
|
|
list = this_cpu_ptr(&blk_cpu_done);
|
|
|
|
if (llist_add(&rq->ipi_list, list))
|
|
|
|
raise_softirq(BLOCK_SOFTIRQ);
|
|
|
|
preempt_enable();
|
|
|
|
}
|
|
|
|
|
2020-06-11 06:44:50 +00:00
|
|
|
bool blk_mq_complete_request_remote(struct request *rq)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
{
|
2018-11-26 16:54:30 +00:00
|
|
|
WRITE_ONCE(rq->state, MQ_RQ_COMPLETE);
|
2018-09-28 08:42:20 +00:00
|
|
|
|
2018-11-18 23:15:35 +00:00
|
|
|
/*
|
|
|
|
* For a polled request, always complete locallly, it's pointless
|
|
|
|
* to redirect the completion.
|
|
|
|
*/
|
2020-06-11 06:44:50 +00:00
|
|
|
if (rq->cmd_flags & REQ_HIPRI)
|
|
|
|
return false;
|
2014-04-25 09:32:53 +00:00
|
|
|
|
2020-06-11 06:44:49 +00:00
|
|
|
if (blk_mq_complete_need_ipi(rq)) {
|
2021-01-23 20:10:27 +00:00
|
|
|
blk_mq_complete_send_ipi(rq);
|
|
|
|
return true;
|
2014-01-08 17:33:37 +00:00
|
|
|
}
|
2020-06-11 06:44:50 +00:00
|
|
|
|
2021-01-23 20:10:27 +00:00
|
|
|
if (rq->q->nr_hw_queues == 1) {
|
|
|
|
blk_mq_raise_softirq(rq);
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
return false;
|
2020-06-11 06:44:50 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_complete_request_remote);
|
|
|
|
|
|
|
|
/**
|
|
|
|
* blk_mq_complete_request - end I/O on a request
|
|
|
|
* @rq: the request being processed
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* Complete a request by scheduling the ->complete_rq operation.
|
|
|
|
**/
|
|
|
|
void blk_mq_complete_request(struct request *rq)
|
|
|
|
{
|
|
|
|
if (!blk_mq_complete_request_remote(rq))
|
|
|
|
rq->q->mq_ops->complete(rq);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
}
|
2020-06-11 06:44:47 +00:00
|
|
|
EXPORT_SYMBOL(blk_mq_complete_request);
|
2014-02-10 11:24:38 +00:00
|
|
|
|
2018-01-09 16:29:46 +00:00
|
|
|
static void hctx_unlock(struct blk_mq_hw_ctx *hctx, int srcu_idx)
|
2018-01-10 19:34:27 +00:00
|
|
|
__releases(hctx->srcu)
|
2018-01-09 16:29:46 +00:00
|
|
|
{
|
|
|
|
if (!(hctx->flags & BLK_MQ_F_BLOCKING))
|
|
|
|
rcu_read_unlock();
|
|
|
|
else
|
2018-01-09 16:29:53 +00:00
|
|
|
srcu_read_unlock(hctx->srcu, srcu_idx);
|
2018-01-09 16:29:46 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
|
2018-01-10 19:34:27 +00:00
|
|
|
__acquires(hctx->srcu)
|
2018-01-09 16:29:46 +00:00
|
|
|
{
|
2018-01-09 16:32:25 +00:00
|
|
|
if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
|
|
|
|
/* shut up gcc false positive */
|
|
|
|
*srcu_idx = 0;
|
2018-01-09 16:29:46 +00:00
|
|
|
rcu_read_lock();
|
2018-01-09 16:32:25 +00:00
|
|
|
} else
|
2018-01-09 16:29:53 +00:00
|
|
|
*srcu_idx = srcu_read_lock(hctx->srcu);
|
2018-01-09 16:29:46 +00:00
|
|
|
}
|
|
|
|
|
2020-01-06 18:08:18 +00:00
|
|
|
/**
|
|
|
|
* blk_mq_start_request - Start processing a request
|
|
|
|
* @rq: Pointer to request to be started
|
|
|
|
*
|
|
|
|
* Function used by device drivers to notify the block layer that a request
|
|
|
|
* is going to be processed now, so blk layer can do proper initializations
|
|
|
|
* such as starting the timeout timer.
|
|
|
|
*/
|
2014-09-13 23:40:09 +00:00
|
|
|
void blk_mq_start_request(struct request *rq)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
{
|
|
|
|
struct request_queue *q = rq->q;
|
|
|
|
|
2020-12-03 16:21:39 +00:00
|
|
|
trace_block_rq_issue(rq);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
2016-11-08 04:32:37 +00:00
|
|
|
if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {
|
2018-05-09 09:08:50 +00:00
|
|
|
rq->io_start_time_ns = ktime_get_ns();
|
2019-05-21 07:59:03 +00:00
|
|
|
rq->stats_sectors = blk_rq_sectors(rq);
|
2016-11-08 04:32:37 +00:00
|
|
|
rq->rq_flags |= RQF_STATS;
|
2018-07-03 15:32:35 +00:00
|
|
|
rq_qos_issue(q, rq);
|
2016-11-08 04:32:37 +00:00
|
|
|
}
|
|
|
|
|
blk-mq: replace timeout synchronization with a RCU and generation based scheme
Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules. Unfortunately, it contains quite a few holes.
There's a complex dancing around REQ_ATOM_STARTED and
REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
they don't have a synchronization point across request recycle
instances and it isn't clear what the barriers add.
blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
In fact, it's pretty easy to make blk_mq_check_expired() terminate a
later instance of a request. If we induce 5 sec delay before
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly. Nothing actually timed out. It
just made the call on a recycle instance of a request and then
terminated a later instance long after the original instance finished.
The scenario isn't theoretical either.
This patch replaces the broken synchronization mechanism with a RCU
and generation number based one.
1. Each request has a u64 generation + state value, which can be
updated only by the request owner. Whenever a request becomes
in-flight, the generation number gets bumped up too. This provides
the basis for the timeout path to distinguish different recycle
instances of the request.
Also, marking a request in-flight and setting its deadline are
protected with a seqcount so that the timeout path can fetch both
values coherently.
2. The timeout path fetches the generation, state and deadline. If
the verdict is timeout, it records the generation into a dedicated
request abortion field and does RCU wait.
3. The completion path is also protected by RCU (from the previous
patch) and checks whether the current generation number and state
match the abortion field. If so, it skips completion.
4. The timeout path, after RCU wait, scans requests again and
terminates the ones whose generation and state still match the ones
requested for abortion.
By now, the timeout path knows that either the generation number
and state changed if it lost the race or the completion will yield
to it and can safely timeout the request.
While it's more lines of code, it's conceptually simpler, doesn't
depend on direct use of subtle memory ordering or coherence, and
hopefully doesn't terminate the wrong instance.
While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
removed yet as it's still used in other places. Future patches will
move all state tracking to the new mechanism and remove all bitops in
the hot paths.
Note that this patch adds a comment explaining a race condition in
BLK_EH_RESET_TIMER path. The race has always been there and this
patch doesn't change it. It's just documenting the existing race.
v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
- s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
- READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
v3: - Fixed possible extended seqcount / u64_stats_sync read looping
spotted by Peter.
- MQ_RQ_IDLE was incorrectly being set in complete_request instead
of free_request. Fixed.
v4: - Rebased on top of hctx_lock() refactoring patch.
- Added comment explaining the use of hctx_lock() in completion path.
v5: - Added comments requested by Bart.
- Note the addition of BLK_EH_RESET_TIMER race condition in the
commit message.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09 16:29:48 +00:00
|
|
|
WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
|
2014-09-16 16:37:37 +00:00
|
|
|
|
blk-mq: replace timeout synchronization with a RCU and generation based scheme
Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules. Unfortunately, it contains quite a few holes.
There's a complex dancing around REQ_ATOM_STARTED and
REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
they don't have a synchronization point across request recycle
instances and it isn't clear what the barriers add.
blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
In fact, it's pretty easy to make blk_mq_check_expired() terminate a
later instance of a request. If we induce 5 sec delay before
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly. Nothing actually timed out. It
just made the call on a recycle instance of a request and then
terminated a later instance long after the original instance finished.
The scenario isn't theoretical either.
This patch replaces the broken synchronization mechanism with a RCU
and generation number based one.
1. Each request has a u64 generation + state value, which can be
updated only by the request owner. Whenever a request becomes
in-flight, the generation number gets bumped up too. This provides
the basis for the timeout path to distinguish different recycle
instances of the request.
Also, marking a request in-flight and setting its deadline are
protected with a seqcount so that the timeout path can fetch both
values coherently.
2. The timeout path fetches the generation, state and deadline. If
the verdict is timeout, it records the generation into a dedicated
request abortion field and does RCU wait.
3. The completion path is also protected by RCU (from the previous
patch) and checks whether the current generation number and state
match the abortion field. If so, it skips completion.
4. The timeout path, after RCU wait, scans requests again and
terminates the ones whose generation and state still match the ones
requested for abortion.
By now, the timeout path knows that either the generation number
and state changed if it lost the race or the completion will yield
to it and can safely timeout the request.
While it's more lines of code, it's conceptually simpler, doesn't
depend on direct use of subtle memory ordering or coherence, and
hopefully doesn't terminate the wrong instance.
While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
removed yet as it's still used in other places. Future patches will
move all state tracking to the new mechanism and remove all bitops in
the hot paths.
Note that this patch adds a comment explaining a race condition in
BLK_EH_RESET_TIMER path. The race has always been there and this
patch doesn't change it. It's just documenting the existing race.
v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
- s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
- READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
v3: - Fixed possible extended seqcount / u64_stats_sync read looping
spotted by Peter.
- MQ_RQ_IDLE was incorrectly being set in complete_request instead
of free_request. Fixed.
v4: - Rebased on top of hctx_lock() refactoring patch.
- Added comment explaining the use of hctx_lock() in completion path.
v5: - Added comments requested by Bart.
- Note the addition of BLK_EH_RESET_TIMER race condition in the
commit message.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09 16:29:48 +00:00
|
|
|
blk_add_timer(rq);
|
2018-05-29 13:52:28 +00:00
|
|
|
WRITE_ONCE(rq->state, MQ_RQ_IN_FLIGHT);
|
2014-02-11 16:27:14 +00:00
|
|
|
|
2019-09-16 15:44:29 +00:00
|
|
|
#ifdef CONFIG_BLK_DEV_INTEGRITY
|
|
|
|
if (blk_integrity_rq(rq) && req_op(rq) == REQ_OP_WRITE)
|
|
|
|
q->integrity.profile->prepare_fn(rq);
|
|
|
|
#endif
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
}
|
2014-09-13 23:40:09 +00:00
|
|
|
EXPORT_SYMBOL(blk_mq_start_request);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
2014-04-16 07:44:57 +00:00
|
|
|
static void __blk_mq_requeue_request(struct request *rq)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
{
|
|
|
|
struct request_queue *q = rq->q;
|
|
|
|
|
2017-11-02 15:24:38 +00:00
|
|
|
blk_mq_put_driver_tag(rq);
|
|
|
|
|
2020-12-03 16:21:39 +00:00
|
|
|
trace_block_rq_requeue(rq);
|
2018-07-03 15:32:35 +00:00
|
|
|
rq_qos_requeue(q, rq);
|
2014-02-11 16:27:14 +00:00
|
|
|
|
2018-05-29 13:52:28 +00:00
|
|
|
if (blk_mq_request_started(rq)) {
|
|
|
|
WRITE_ONCE(rq->state, MQ_RQ_IDLE);
|
2018-06-14 11:58:45 +00:00
|
|
|
rq->rq_flags &= ~RQF_TIMED_OUT;
|
2014-09-13 23:40:09 +00:00
|
|
|
}
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
}
|
|
|
|
|
2016-10-29 00:21:41 +00:00
|
|
|
void blk_mq_requeue_request(struct request *rq, bool kick_requeue_list)
|
2014-04-16 07:44:57 +00:00
|
|
|
{
|
|
|
|
__blk_mq_requeue_request(rq);
|
|
|
|
|
2018-02-23 15:36:56 +00:00
|
|
|
/* this request will be re-inserted to io scheduler queue */
|
|
|
|
blk_mq_sched_requeue_request(rq);
|
|
|
|
|
2018-10-24 16:48:12 +00:00
|
|
|
BUG_ON(!list_empty(&rq->queuelist));
|
2016-10-29 00:21:41 +00:00
|
|
|
blk_mq_add_to_requeue_list(rq, true, kick_requeue_list);
|
2014-04-16 07:44:57 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_mq_requeue_request);
|
|
|
|
|
2014-05-28 14:08:02 +00:00
|
|
|
static void blk_mq_requeue_work(struct work_struct *work)
|
|
|
|
{
|
|
|
|
struct request_queue *q =
|
2016-09-14 17:28:30 +00:00
|
|
|
container_of(work, struct request_queue, requeue_work.work);
|
2014-05-28 14:08:02 +00:00
|
|
|
LIST_HEAD(rq_list);
|
|
|
|
struct request *rq, *next;
|
|
|
|
|
2017-07-27 14:03:57 +00:00
|
|
|
spin_lock_irq(&q->requeue_lock);
|
2014-05-28 14:08:02 +00:00
|
|
|
list_splice_init(&q->requeue_list, &rq_list);
|
2017-07-27 14:03:57 +00:00
|
|
|
spin_unlock_irq(&q->requeue_lock);
|
2014-05-28 14:08:02 +00:00
|
|
|
|
|
|
|
list_for_each_entry_safe(rq, next, &rq_list, queuelist) {
|
blk-mq: insert rq with DONTPREP to hctx dispatch list when requeue
When requeue, if RQF_DONTPREP, rq has contained some driver
specific data, so insert it to hctx dispatch list to avoid any
merge. Take scsi as example, here is the trace event log (no
io scheduler, because RQF_STARTED would prevent merging),
kworker/0:1H-339 [000] ...1 2037.209289: block_rq_insert: 8,0 R 4096 () 32768 + 8 [kworker/0:1H]
scsi_inert_test-1987 [000] .... 2037.220465: block_bio_queue: 8,0 R 32776 + 8 [scsi_inert_test]
scsi_inert_test-1987 [000] ...2 2037.220466: block_bio_backmerge: 8,0 R 32776 + 8 [scsi_inert_test]
kworker/0:1H-339 [000] .... 2047.220913: block_rq_issue: 8,0 R 8192 () 32768 + 16 [kworker/0:1H]
scsi_inert_test-1996 [000] ..s1 2047.221007: block_rq_complete: 8,0 R () 32768 + 8 [0]
scsi_inert_test-1996 [000] .Ns1 2047.221045: block_rq_requeue: 8,0 R () 32776 + 8 [0]
kworker/0:1H-339 [000] ...1 2047.221054: block_rq_insert: 8,0 R 4096 () 32776 + 8 [kworker/0:1H]
kworker/0:1H-339 [000] ...1 2047.221056: block_rq_issue: 8,0 R 4096 () 32776 + 8 [kworker/0:1H]
scsi_inert_test-1986 [000] ..s1 2047.221119: block_rq_complete: 8,0 R () 32776 + 8 [0]
(32768 + 8) was requeued by scsi_queue_insert and had RQF_DONTPREP.
Then it was merged with (32776 + 8) and issued. Due to RQF_DONTPREP,
the sdb only contained the part of (32768 + 8), then only that part
was completed. The lucky thing was that scsi_io_completion detected
it and requeued the remaining part. So we didn't get corrupted data.
However, the requeue of (32776 + 8) is not expected.
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-12 01:56:25 +00:00
|
|
|
if (!(rq->rq_flags & (RQF_SOFTBARRIER | RQF_DONTPREP)))
|
2014-05-28 14:08:02 +00:00
|
|
|
continue;
|
|
|
|
|
2016-10-20 13:12:13 +00:00
|
|
|
rq->rq_flags &= ~RQF_SOFTBARRIER;
|
2014-05-28 14:08:02 +00:00
|
|
|
list_del_init(&rq->queuelist);
|
blk-mq: insert rq with DONTPREP to hctx dispatch list when requeue
When requeue, if RQF_DONTPREP, rq has contained some driver
specific data, so insert it to hctx dispatch list to avoid any
merge. Take scsi as example, here is the trace event log (no
io scheduler, because RQF_STARTED would prevent merging),
kworker/0:1H-339 [000] ...1 2037.209289: block_rq_insert: 8,0 R 4096 () 32768 + 8 [kworker/0:1H]
scsi_inert_test-1987 [000] .... 2037.220465: block_bio_queue: 8,0 R 32776 + 8 [scsi_inert_test]
scsi_inert_test-1987 [000] ...2 2037.220466: block_bio_backmerge: 8,0 R 32776 + 8 [scsi_inert_test]
kworker/0:1H-339 [000] .... 2047.220913: block_rq_issue: 8,0 R 8192 () 32768 + 16 [kworker/0:1H]
scsi_inert_test-1996 [000] ..s1 2047.221007: block_rq_complete: 8,0 R () 32768 + 8 [0]
scsi_inert_test-1996 [000] .Ns1 2047.221045: block_rq_requeue: 8,0 R () 32776 + 8 [0]
kworker/0:1H-339 [000] ...1 2047.221054: block_rq_insert: 8,0 R 4096 () 32776 + 8 [kworker/0:1H]
kworker/0:1H-339 [000] ...1 2047.221056: block_rq_issue: 8,0 R 4096 () 32776 + 8 [kworker/0:1H]
scsi_inert_test-1986 [000] ..s1 2047.221119: block_rq_complete: 8,0 R () 32776 + 8 [0]
(32768 + 8) was requeued by scsi_queue_insert and had RQF_DONTPREP.
Then it was merged with (32776 + 8) and issued. Due to RQF_DONTPREP,
the sdb only contained the part of (32768 + 8), then only that part
was completed. The lucky thing was that scsi_io_completion detected
it and requeued the remaining part. So we didn't get corrupted data.
However, the requeue of (32776 + 8) is not expected.
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-12 01:56:25 +00:00
|
|
|
/*
|
|
|
|
* If RQF_DONTPREP, rq has contained some driver specific
|
|
|
|
* data, so insert it to hctx dispatch list to avoid any
|
|
|
|
* merge.
|
|
|
|
*/
|
|
|
|
if (rq->rq_flags & RQF_DONTPREP)
|
2020-02-25 01:04:32 +00:00
|
|
|
blk_mq_request_bypass_insert(rq, false, false);
|
blk-mq: insert rq with DONTPREP to hctx dispatch list when requeue
When requeue, if RQF_DONTPREP, rq has contained some driver
specific data, so insert it to hctx dispatch list to avoid any
merge. Take scsi as example, here is the trace event log (no
io scheduler, because RQF_STARTED would prevent merging),
kworker/0:1H-339 [000] ...1 2037.209289: block_rq_insert: 8,0 R 4096 () 32768 + 8 [kworker/0:1H]
scsi_inert_test-1987 [000] .... 2037.220465: block_bio_queue: 8,0 R 32776 + 8 [scsi_inert_test]
scsi_inert_test-1987 [000] ...2 2037.220466: block_bio_backmerge: 8,0 R 32776 + 8 [scsi_inert_test]
kworker/0:1H-339 [000] .... 2047.220913: block_rq_issue: 8,0 R 8192 () 32768 + 16 [kworker/0:1H]
scsi_inert_test-1996 [000] ..s1 2047.221007: block_rq_complete: 8,0 R () 32768 + 8 [0]
scsi_inert_test-1996 [000] .Ns1 2047.221045: block_rq_requeue: 8,0 R () 32776 + 8 [0]
kworker/0:1H-339 [000] ...1 2047.221054: block_rq_insert: 8,0 R 4096 () 32776 + 8 [kworker/0:1H]
kworker/0:1H-339 [000] ...1 2047.221056: block_rq_issue: 8,0 R 4096 () 32776 + 8 [kworker/0:1H]
scsi_inert_test-1986 [000] ..s1 2047.221119: block_rq_complete: 8,0 R () 32776 + 8 [0]
(32768 + 8) was requeued by scsi_queue_insert and had RQF_DONTPREP.
Then it was merged with (32776 + 8) and issued. Due to RQF_DONTPREP,
the sdb only contained the part of (32768 + 8), then only that part
was completed. The lucky thing was that scsi_io_completion detected
it and requeued the remaining part. So we didn't get corrupted data.
However, the requeue of (32776 + 8) is not expected.
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-12 01:56:25 +00:00
|
|
|
else
|
|
|
|
blk_mq_sched_insert_request(rq, true, false, false);
|
2014-05-28 14:08:02 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
while (!list_empty(&rq_list)) {
|
|
|
|
rq = list_entry(rq_list.next, struct request, queuelist);
|
|
|
|
list_del_init(&rq->queuelist);
|
2018-01-17 16:25:58 +00:00
|
|
|
blk_mq_sched_insert_request(rq, false, false, false);
|
2014-05-28 14:08:02 +00:00
|
|
|
}
|
|
|
|
|
2016-10-29 00:20:32 +00:00
|
|
|
blk_mq_run_hw_queues(q, false);
|
2014-05-28 14:08:02 +00:00
|
|
|
}
|
|
|
|
|
2016-10-29 00:21:41 +00:00
|
|
|
void blk_mq_add_to_requeue_list(struct request *rq, bool at_head,
|
|
|
|
bool kick_requeue_list)
|
2014-05-28 14:08:02 +00:00
|
|
|
{
|
|
|
|
struct request_queue *q = rq->q;
|
|
|
|
unsigned long flags;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We abuse this flag that is otherwise used by the I/O scheduler to
|
2017-11-11 05:05:12 +00:00
|
|
|
* request head insertion from the workqueue.
|
2014-05-28 14:08:02 +00:00
|
|
|
*/
|
2016-10-20 13:12:13 +00:00
|
|
|
BUG_ON(rq->rq_flags & RQF_SOFTBARRIER);
|
2014-05-28 14:08:02 +00:00
|
|
|
|
|
|
|
spin_lock_irqsave(&q->requeue_lock, flags);
|
|
|
|
if (at_head) {
|
2016-10-20 13:12:13 +00:00
|
|
|
rq->rq_flags |= RQF_SOFTBARRIER;
|
2014-05-28 14:08:02 +00:00
|
|
|
list_add(&rq->queuelist, &q->requeue_list);
|
|
|
|
} else {
|
|
|
|
list_add_tail(&rq->queuelist, &q->requeue_list);
|
|
|
|
}
|
|
|
|
spin_unlock_irqrestore(&q->requeue_lock, flags);
|
2016-10-29 00:21:41 +00:00
|
|
|
|
|
|
|
if (kick_requeue_list)
|
|
|
|
blk_mq_kick_requeue_list(q);
|
2014-05-28 14:08:02 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
void blk_mq_kick_requeue_list(struct request_queue *q)
|
|
|
|
{
|
2018-01-19 16:58:55 +00:00
|
|
|
kblockd_mod_delayed_work_on(WORK_CPU_UNBOUND, &q->requeue_work, 0);
|
2014-05-28 14:08:02 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_mq_kick_requeue_list);
|
|
|
|
|
2016-09-14 17:28:30 +00:00
|
|
|
void blk_mq_delay_kick_requeue_list(struct request_queue *q,
|
|
|
|
unsigned long msecs)
|
|
|
|
{
|
2017-08-09 18:28:06 +00:00
|
|
|
kblockd_mod_delayed_work_on(WORK_CPU_UNBOUND, &q->requeue_work,
|
|
|
|
msecs_to_jiffies(msecs));
|
2016-09-14 17:28:30 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_mq_delay_kick_requeue_list);
|
|
|
|
|
2014-06-04 16:23:49 +00:00
|
|
|
struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
|
|
|
|
{
|
2016-08-25 14:07:30 +00:00
|
|
|
if (tag < tags->nr_tags) {
|
|
|
|
prefetch(tags->rqs[tag]);
|
2016-03-15 19:03:28 +00:00
|
|
|
return tags->rqs[tag];
|
2016-08-25 14:07:30 +00:00
|
|
|
}
|
2016-03-15 19:03:28 +00:00
|
|
|
|
|
|
|
return NULL;
|
2014-04-15 20:14:00 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_mq_tag_to_rq);
|
|
|
|
|
2018-12-18 04:11:17 +00:00
|
|
|
static bool blk_mq_rq_inflight(struct blk_mq_hw_ctx *hctx, struct request *rq,
|
|
|
|
void *priv, bool reserved)
|
2018-11-08 16:03:51 +00:00
|
|
|
{
|
|
|
|
/*
|
2020-07-07 15:04:33 +00:00
|
|
|
* If we find a request that isn't idle and the queue matches,
|
2018-12-18 04:11:17 +00:00
|
|
|
* we know the queue is busy. Return false to stop the iteration.
|
2018-11-08 16:03:51 +00:00
|
|
|
*/
|
2020-07-07 15:04:33 +00:00
|
|
|
if (blk_mq_request_started(rq) && rq->q == hctx->queue) {
|
2018-11-08 16:03:51 +00:00
|
|
|
bool *busy = priv;
|
|
|
|
|
|
|
|
*busy = true;
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2018-12-18 04:11:17 +00:00
|
|
|
bool blk_mq_queue_inflight(struct request_queue *q)
|
2018-11-08 16:03:51 +00:00
|
|
|
{
|
|
|
|
bool busy = false;
|
|
|
|
|
2018-12-18 04:11:17 +00:00
|
|
|
blk_mq_queue_tag_busy_iter(q, blk_mq_rq_inflight, &busy);
|
2018-11-08 16:03:51 +00:00
|
|
|
return busy;
|
|
|
|
}
|
2018-12-18 04:11:17 +00:00
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_queue_inflight);
|
2018-11-08 16:03:51 +00:00
|
|
|
|
2018-01-09 16:29:50 +00:00
|
|
|
static void blk_mq_rq_timed_out(struct request *req, bool reserved)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
{
|
2018-06-14 11:58:45 +00:00
|
|
|
req->rq_flags |= RQF_TIMED_OUT;
|
2018-05-29 13:52:39 +00:00
|
|
|
if (req->q->mq_ops->timeout) {
|
|
|
|
enum blk_eh_timer_return ret;
|
|
|
|
|
|
|
|
ret = req->q->mq_ops->timeout(req, reserved);
|
|
|
|
if (ret == BLK_EH_DONE)
|
|
|
|
return;
|
|
|
|
WARN_ON_ONCE(ret != BLK_EH_RESET_TIMER);
|
2014-09-13 23:40:12 +00:00
|
|
|
}
|
2018-05-29 13:52:39 +00:00
|
|
|
|
|
|
|
blk_add_timer(req);
|
2014-04-24 14:51:47 +00:00
|
|
|
}
|
2015-01-08 01:55:46 +00:00
|
|
|
|
2018-05-29 13:52:28 +00:00
|
|
|
static bool blk_mq_req_expired(struct request *rq, unsigned long *next)
|
2014-09-13 23:40:11 +00:00
|
|
|
{
|
2018-05-29 13:52:28 +00:00
|
|
|
unsigned long deadline;
|
2014-04-24 14:51:47 +00:00
|
|
|
|
2018-05-29 13:52:28 +00:00
|
|
|
if (blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT)
|
|
|
|
return false;
|
2018-06-14 11:58:45 +00:00
|
|
|
if (rq->rq_flags & RQF_TIMED_OUT)
|
|
|
|
return false;
|
2017-09-06 08:00:22 +00:00
|
|
|
|
2018-11-14 16:02:05 +00:00
|
|
|
deadline = READ_ONCE(rq->deadline);
|
2018-05-29 13:52:28 +00:00
|
|
|
if (time_after_eq(jiffies, deadline))
|
|
|
|
return true;
|
2017-09-06 08:00:22 +00:00
|
|
|
|
2018-05-29 13:52:28 +00:00
|
|
|
if (*next == 0)
|
|
|
|
*next = deadline;
|
|
|
|
else if (time_after(*next, deadline))
|
|
|
|
*next = deadline;
|
|
|
|
return false;
|
2014-04-24 14:51:47 +00:00
|
|
|
}
|
|
|
|
|
2021-05-11 15:22:34 +00:00
|
|
|
void blk_mq_put_rq_ref(struct request *rq)
|
|
|
|
{
|
|
|
|
if (is_flush_rq(rq, rq->mq_hctx))
|
|
|
|
rq->end_io(rq, 0);
|
|
|
|
else if (refcount_dec_and_test(&rq->ref))
|
|
|
|
__blk_mq_free_request(rq);
|
|
|
|
}
|
|
|
|
|
2018-11-08 17:24:07 +00:00
|
|
|
static bool blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
|
blk-mq: replace timeout synchronization with a RCU and generation based scheme
Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules. Unfortunately, it contains quite a few holes.
There's a complex dancing around REQ_ATOM_STARTED and
REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
they don't have a synchronization point across request recycle
instances and it isn't clear what the barriers add.
blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
In fact, it's pretty easy to make blk_mq_check_expired() terminate a
later instance of a request. If we induce 5 sec delay before
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly. Nothing actually timed out. It
just made the call on a recycle instance of a request and then
terminated a later instance long after the original instance finished.
The scenario isn't theoretical either.
This patch replaces the broken synchronization mechanism with a RCU
and generation number based one.
1. Each request has a u64 generation + state value, which can be
updated only by the request owner. Whenever a request becomes
in-flight, the generation number gets bumped up too. This provides
the basis for the timeout path to distinguish different recycle
instances of the request.
Also, marking a request in-flight and setting its deadline are
protected with a seqcount so that the timeout path can fetch both
values coherently.
2. The timeout path fetches the generation, state and deadline. If
the verdict is timeout, it records the generation into a dedicated
request abortion field and does RCU wait.
3. The completion path is also protected by RCU (from the previous
patch) and checks whether the current generation number and state
match the abortion field. If so, it skips completion.
4. The timeout path, after RCU wait, scans requests again and
terminates the ones whose generation and state still match the ones
requested for abortion.
By now, the timeout path knows that either the generation number
and state changed if it lost the race or the completion will yield
to it and can safely timeout the request.
While it's more lines of code, it's conceptually simpler, doesn't
depend on direct use of subtle memory ordering or coherence, and
hopefully doesn't terminate the wrong instance.
While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
removed yet as it's still used in other places. Future patches will
move all state tracking to the new mechanism and remove all bitops in
the hot paths.
Note that this patch adds a comment explaining a race condition in
BLK_EH_RESET_TIMER path. The race has always been there and this
patch doesn't change it. It's just documenting the existing race.
v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
- s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
- READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
v3: - Fixed possible extended seqcount / u64_stats_sync read looping
spotted by Peter.
- MQ_RQ_IDLE was incorrectly being set in complete_request instead
of free_request. Fixed.
v4: - Rebased on top of hctx_lock() refactoring patch.
- Added comment explaining the use of hctx_lock() in completion path.
v5: - Added comments requested by Bart.
- Note the addition of BLK_EH_RESET_TIMER race condition in the
commit message.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09 16:29:48 +00:00
|
|
|
struct request *rq, void *priv, bool reserved)
|
|
|
|
{
|
2018-05-29 13:52:28 +00:00
|
|
|
unsigned long *next = priv;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Just do a quick check if it is expired before locking the request in
|
|
|
|
* so we're not unnecessarilly synchronizing across CPUs.
|
|
|
|
*/
|
|
|
|
if (!blk_mq_req_expired(rq, next))
|
2018-11-08 17:24:07 +00:00
|
|
|
return true;
|
2018-05-29 13:52:28 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We have reason to believe the request may be expired. Take a
|
|
|
|
* reference on the request to lock this request lifetime into its
|
|
|
|
* currently allocated context to prevent it from being reallocated in
|
|
|
|
* the event the completion by-passes this timeout handler.
|
|
|
|
*
|
|
|
|
* If the reference was already released, then the driver beat the
|
|
|
|
* timeout handler to posting a natural completion.
|
|
|
|
*/
|
|
|
|
if (!refcount_inc_not_zero(&rq->ref))
|
2018-11-08 17:24:07 +00:00
|
|
|
return true;
|
2018-05-29 13:52:28 +00:00
|
|
|
|
blk-mq: replace timeout synchronization with a RCU and generation based scheme
Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules. Unfortunately, it contains quite a few holes.
There's a complex dancing around REQ_ATOM_STARTED and
REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
they don't have a synchronization point across request recycle
instances and it isn't clear what the barriers add.
blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
In fact, it's pretty easy to make blk_mq_check_expired() terminate a
later instance of a request. If we induce 5 sec delay before
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly. Nothing actually timed out. It
just made the call on a recycle instance of a request and then
terminated a later instance long after the original instance finished.
The scenario isn't theoretical either.
This patch replaces the broken synchronization mechanism with a RCU
and generation number based one.
1. Each request has a u64 generation + state value, which can be
updated only by the request owner. Whenever a request becomes
in-flight, the generation number gets bumped up too. This provides
the basis for the timeout path to distinguish different recycle
instances of the request.
Also, marking a request in-flight and setting its deadline are
protected with a seqcount so that the timeout path can fetch both
values coherently.
2. The timeout path fetches the generation, state and deadline. If
the verdict is timeout, it records the generation into a dedicated
request abortion field and does RCU wait.
3. The completion path is also protected by RCU (from the previous
patch) and checks whether the current generation number and state
match the abortion field. If so, it skips completion.
4. The timeout path, after RCU wait, scans requests again and
terminates the ones whose generation and state still match the ones
requested for abortion.
By now, the timeout path knows that either the generation number
and state changed if it lost the race or the completion will yield
to it and can safely timeout the request.
While it's more lines of code, it's conceptually simpler, doesn't
depend on direct use of subtle memory ordering or coherence, and
hopefully doesn't terminate the wrong instance.
While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
removed yet as it's still used in other places. Future patches will
move all state tracking to the new mechanism and remove all bitops in
the hot paths.
Note that this patch adds a comment explaining a race condition in
BLK_EH_RESET_TIMER path. The race has always been there and this
patch doesn't change it. It's just documenting the existing race.
v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
- s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
- READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
v3: - Fixed possible extended seqcount / u64_stats_sync read looping
spotted by Peter.
- MQ_RQ_IDLE was incorrectly being set in complete_request instead
of free_request. Fixed.
v4: - Rebased on top of hctx_lock() refactoring patch.
- Added comment explaining the use of hctx_lock() in completion path.
v5: - Added comments requested by Bart.
- Note the addition of BLK_EH_RESET_TIMER race condition in the
commit message.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09 16:29:48 +00:00
|
|
|
/*
|
2018-05-29 13:52:28 +00:00
|
|
|
* The request is now locked and cannot be reallocated underneath the
|
|
|
|
* timeout handler's processing. Re-verify this exact request is truly
|
|
|
|
* expired; if it is not expired, then the request was completed and
|
|
|
|
* reallocated as a new request.
|
blk-mq: replace timeout synchronization with a RCU and generation based scheme
Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules. Unfortunately, it contains quite a few holes.
There's a complex dancing around REQ_ATOM_STARTED and
REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
they don't have a synchronization point across request recycle
instances and it isn't clear what the barriers add.
blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
In fact, it's pretty easy to make blk_mq_check_expired() terminate a
later instance of a request. If we induce 5 sec delay before
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly. Nothing actually timed out. It
just made the call on a recycle instance of a request and then
terminated a later instance long after the original instance finished.
The scenario isn't theoretical either.
This patch replaces the broken synchronization mechanism with a RCU
and generation number based one.
1. Each request has a u64 generation + state value, which can be
updated only by the request owner. Whenever a request becomes
in-flight, the generation number gets bumped up too. This provides
the basis for the timeout path to distinguish different recycle
instances of the request.
Also, marking a request in-flight and setting its deadline are
protected with a seqcount so that the timeout path can fetch both
values coherently.
2. The timeout path fetches the generation, state and deadline. If
the verdict is timeout, it records the generation into a dedicated
request abortion field and does RCU wait.
3. The completion path is also protected by RCU (from the previous
patch) and checks whether the current generation number and state
match the abortion field. If so, it skips completion.
4. The timeout path, after RCU wait, scans requests again and
terminates the ones whose generation and state still match the ones
requested for abortion.
By now, the timeout path knows that either the generation number
and state changed if it lost the race or the completion will yield
to it and can safely timeout the request.
While it's more lines of code, it's conceptually simpler, doesn't
depend on direct use of subtle memory ordering or coherence, and
hopefully doesn't terminate the wrong instance.
While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
removed yet as it's still used in other places. Future patches will
move all state tracking to the new mechanism and remove all bitops in
the hot paths.
Note that this patch adds a comment explaining a race condition in
BLK_EH_RESET_TIMER path. The race has always been there and this
patch doesn't change it. It's just documenting the existing race.
v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
- s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
- READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
v3: - Fixed possible extended seqcount / u64_stats_sync read looping
spotted by Peter.
- MQ_RQ_IDLE was incorrectly being set in complete_request instead
of free_request. Fixed.
v4: - Rebased on top of hctx_lock() refactoring patch.
- Added comment explaining the use of hctx_lock() in completion path.
v5: - Added comments requested by Bart.
- Note the addition of BLK_EH_RESET_TIMER race condition in the
commit message.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09 16:29:48 +00:00
|
|
|
*/
|
2018-05-29 13:52:28 +00:00
|
|
|
if (blk_mq_req_expired(rq, next))
|
blk-mq: replace timeout synchronization with a RCU and generation based scheme
Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules. Unfortunately, it contains quite a few holes.
There's a complex dancing around REQ_ATOM_STARTED and
REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
they don't have a synchronization point across request recycle
instances and it isn't clear what the barriers add.
blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
In fact, it's pretty easy to make blk_mq_check_expired() terminate a
later instance of a request. If we induce 5 sec delay before
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly. Nothing actually timed out. It
just made the call on a recycle instance of a request and then
terminated a later instance long after the original instance finished.
The scenario isn't theoretical either.
This patch replaces the broken synchronization mechanism with a RCU
and generation number based one.
1. Each request has a u64 generation + state value, which can be
updated only by the request owner. Whenever a request becomes
in-flight, the generation number gets bumped up too. This provides
the basis for the timeout path to distinguish different recycle
instances of the request.
Also, marking a request in-flight and setting its deadline are
protected with a seqcount so that the timeout path can fetch both
values coherently.
2. The timeout path fetches the generation, state and deadline. If
the verdict is timeout, it records the generation into a dedicated
request abortion field and does RCU wait.
3. The completion path is also protected by RCU (from the previous
patch) and checks whether the current generation number and state
match the abortion field. If so, it skips completion.
4. The timeout path, after RCU wait, scans requests again and
terminates the ones whose generation and state still match the ones
requested for abortion.
By now, the timeout path knows that either the generation number
and state changed if it lost the race or the completion will yield
to it and can safely timeout the request.
While it's more lines of code, it's conceptually simpler, doesn't
depend on direct use of subtle memory ordering or coherence, and
hopefully doesn't terminate the wrong instance.
While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
removed yet as it's still used in other places. Future patches will
move all state tracking to the new mechanism and remove all bitops in
the hot paths.
Note that this patch adds a comment explaining a race condition in
BLK_EH_RESET_TIMER path. The race has always been there and this
patch doesn't change it. It's just documenting the existing race.
v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
- s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
- READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
v3: - Fixed possible extended seqcount / u64_stats_sync read looping
spotted by Peter.
- MQ_RQ_IDLE was incorrectly being set in complete_request instead
of free_request. Fixed.
v4: - Rebased on top of hctx_lock() refactoring patch.
- Added comment explaining the use of hctx_lock() in completion path.
v5: - Added comments requested by Bart.
- Note the addition of BLK_EH_RESET_TIMER race condition in the
commit message.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09 16:29:48 +00:00
|
|
|
blk_mq_rq_timed_out(rq, reserved);
|
block: fix null pointer dereference in blk_mq_rq_timed_out()
We got a null pointer deference BUG_ON in blk_mq_rq_timed_out()
as following:
[ 108.825472] BUG: kernel NULL pointer dereference, address: 0000000000000040
[ 108.827059] PGD 0 P4D 0
[ 108.827313] Oops: 0000 [#1] SMP PTI
[ 108.827657] CPU: 6 PID: 198 Comm: kworker/6:1H Not tainted 5.3.0-rc8+ #431
[ 108.829503] Workqueue: kblockd blk_mq_timeout_work
[ 108.829913] RIP: 0010:blk_mq_check_expired+0x258/0x330
[ 108.838191] Call Trace:
[ 108.838406] bt_iter+0x74/0x80
[ 108.838665] blk_mq_queue_tag_busy_iter+0x204/0x450
[ 108.839074] ? __switch_to_asm+0x34/0x70
[ 108.839405] ? blk_mq_stop_hw_queue+0x40/0x40
[ 108.839823] ? blk_mq_stop_hw_queue+0x40/0x40
[ 108.840273] ? syscall_return_via_sysret+0xf/0x7f
[ 108.840732] blk_mq_timeout_work+0x74/0x200
[ 108.841151] process_one_work+0x297/0x680
[ 108.841550] worker_thread+0x29c/0x6f0
[ 108.841926] ? rescuer_thread+0x580/0x580
[ 108.842344] kthread+0x16a/0x1a0
[ 108.842666] ? kthread_flush_work+0x170/0x170
[ 108.843100] ret_from_fork+0x35/0x40
The bug is caused by the race between timeout handle and completion for
flush request.
When timeout handle function blk_mq_rq_timed_out() try to read
'req->q->mq_ops', the 'req' have completed and reinitiated by next
flush request, which would call blk_rq_init() to clear 'req' as 0.
After commit 12f5b93145 ("blk-mq: Remove generation seqeunce"),
normal requests lifetime are protected by refcount. Until 'rq->ref'
drop to zero, the request can really be free. Thus, these requests
cannot been reused before timeout handle finish.
However, flush request has defined .end_io and rq->end_io() is still
called even if 'rq->ref' doesn't drop to zero. After that, the 'flush_rq'
can be reused by the next flush request handle, resulting in null
pointer deference BUG ON.
We fix this problem by covering flush request with 'rq->ref'.
If the refcount is not zero, flush_end_io() return and wait the
last holder recall it. To record the request status, we add a new
entry 'rq_status', which will be used in flush_end_io().
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: stable@vger.kernel.org # v4.18+
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
-------
v2:
- move rq_status from struct request to struct blk_flush_queue
v3:
- remove unnecessary '{}' pair.
v4:
- let spinlock to protect 'fq->rq_status'
v5:
- move rq_status after flush_running_idx member of struct blk_flush_queue
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-27 08:19:55 +00:00
|
|
|
|
2021-05-11 15:22:34 +00:00
|
|
|
blk_mq_put_rq_ref(rq);
|
2018-11-08 17:24:07 +00:00
|
|
|
return true;
|
blk-mq: replace timeout synchronization with a RCU and generation based scheme
Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules. Unfortunately, it contains quite a few holes.
There's a complex dancing around REQ_ATOM_STARTED and
REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
they don't have a synchronization point across request recycle
instances and it isn't clear what the barriers add.
blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
In fact, it's pretty easy to make blk_mq_check_expired() terminate a
later instance of a request. If we induce 5 sec delay before
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly. Nothing actually timed out. It
just made the call on a recycle instance of a request and then
terminated a later instance long after the original instance finished.
The scenario isn't theoretical either.
This patch replaces the broken synchronization mechanism with a RCU
and generation number based one.
1. Each request has a u64 generation + state value, which can be
updated only by the request owner. Whenever a request becomes
in-flight, the generation number gets bumped up too. This provides
the basis for the timeout path to distinguish different recycle
instances of the request.
Also, marking a request in-flight and setting its deadline are
protected with a seqcount so that the timeout path can fetch both
values coherently.
2. The timeout path fetches the generation, state and deadline. If
the verdict is timeout, it records the generation into a dedicated
request abortion field and does RCU wait.
3. The completion path is also protected by RCU (from the previous
patch) and checks whether the current generation number and state
match the abortion field. If so, it skips completion.
4. The timeout path, after RCU wait, scans requests again and
terminates the ones whose generation and state still match the ones
requested for abortion.
By now, the timeout path knows that either the generation number
and state changed if it lost the race or the completion will yield
to it and can safely timeout the request.
While it's more lines of code, it's conceptually simpler, doesn't
depend on direct use of subtle memory ordering or coherence, and
hopefully doesn't terminate the wrong instance.
While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
removed yet as it's still used in other places. Future patches will
move all state tracking to the new mechanism and remove all bitops in
the hot paths.
Note that this patch adds a comment explaining a race condition in
BLK_EH_RESET_TIMER path. The race has always been there and this
patch doesn't change it. It's just documenting the existing race.
v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
- s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
- READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
v3: - Fixed possible extended seqcount / u64_stats_sync read looping
spotted by Peter.
- MQ_RQ_IDLE was incorrectly being set in complete_request instead
of free_request. Fixed.
v4: - Rebased on top of hctx_lock() refactoring patch.
- Added comment explaining the use of hctx_lock() in completion path.
v5: - Added comments requested by Bart.
- Note the addition of BLK_EH_RESET_TIMER race condition in the
commit message.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09 16:29:48 +00:00
|
|
|
}
|
|
|
|
|
2015-10-30 12:57:30 +00:00
|
|
|
static void blk_mq_timeout_work(struct work_struct *work)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
{
|
2015-10-30 12:57:30 +00:00
|
|
|
struct request_queue *q =
|
|
|
|
container_of(work, struct request_queue, timeout_work);
|
2018-05-29 13:52:28 +00:00
|
|
|
unsigned long next = 0;
|
blk-mq: replace timeout synchronization with a RCU and generation based scheme
Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules. Unfortunately, it contains quite a few holes.
There's a complex dancing around REQ_ATOM_STARTED and
REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
they don't have a synchronization point across request recycle
instances and it isn't clear what the barriers add.
blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
In fact, it's pretty easy to make blk_mq_check_expired() terminate a
later instance of a request. If we induce 5 sec delay before
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly. Nothing actually timed out. It
just made the call on a recycle instance of a request and then
terminated a later instance long after the original instance finished.
The scenario isn't theoretical either.
This patch replaces the broken synchronization mechanism with a RCU
and generation number based one.
1. Each request has a u64 generation + state value, which can be
updated only by the request owner. Whenever a request becomes
in-flight, the generation number gets bumped up too. This provides
the basis for the timeout path to distinguish different recycle
instances of the request.
Also, marking a request in-flight and setting its deadline are
protected with a seqcount so that the timeout path can fetch both
values coherently.
2. The timeout path fetches the generation, state and deadline. If
the verdict is timeout, it records the generation into a dedicated
request abortion field and does RCU wait.
3. The completion path is also protected by RCU (from the previous
patch) and checks whether the current generation number and state
match the abortion field. If so, it skips completion.
4. The timeout path, after RCU wait, scans requests again and
terminates the ones whose generation and state still match the ones
requested for abortion.
By now, the timeout path knows that either the generation number
and state changed if it lost the race or the completion will yield
to it and can safely timeout the request.
While it's more lines of code, it's conceptually simpler, doesn't
depend on direct use of subtle memory ordering or coherence, and
hopefully doesn't terminate the wrong instance.
While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
removed yet as it's still used in other places. Future patches will
move all state tracking to the new mechanism and remove all bitops in
the hot paths.
Note that this patch adds a comment explaining a race condition in
BLK_EH_RESET_TIMER path. The race has always been there and this
patch doesn't change it. It's just documenting the existing race.
v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
- s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
- READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
v3: - Fixed possible extended seqcount / u64_stats_sync read looping
spotted by Peter.
- MQ_RQ_IDLE was incorrectly being set in complete_request instead
of free_request. Fixed.
v4: - Rebased on top of hctx_lock() refactoring patch.
- Added comment explaining the use of hctx_lock() in completion path.
v5: - Added comments requested by Bart.
- Note the addition of BLK_EH_RESET_TIMER race condition in the
commit message.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09 16:29:48 +00:00
|
|
|
struct blk_mq_hw_ctx *hctx;
|
2014-09-13 23:40:11 +00:00
|
|
|
int i;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
blk-mq: Allow timeouts to run while queue is freezing
In case a submitted request gets stuck for some reason, the block layer
can prevent the request starvation by starting the scheduled timeout work.
If this stuck request occurs at the same time another thread has started
a queue freeze, the blk_mq_timeout_work will not be able to acquire the
queue reference and will return silently, thus not issuing the timeout.
But since the request is already holding a q_usage_counter reference and
is unable to complete, it will never release its reference, preventing
the queue from completing the freeze started by first thread. This puts
the request_queue in a hung state, forever waiting for the freeze
completion.
This was observed while running IO to a NVMe device at the same time we
toggled the CPU hotplug code. Eventually, once a request got stuck
requiring a timeout during a queue freeze, we saw the CPU Hotplug
notification code get stuck inside blk_mq_freeze_queue_wait, as shown in
the trace below.
[c000000deaf13690] [c000000deaf13738] 0xc000000deaf13738 (unreliable)
[c000000deaf13860] [c000000000015ce8] __switch_to+0x1f8/0x350
[c000000deaf138b0] [c000000000ade0e4] __schedule+0x314/0x990
[c000000deaf13940] [c000000000ade7a8] schedule+0x48/0xc0
[c000000deaf13970] [c0000000005492a4] blk_mq_freeze_queue_wait+0x74/0x110
[c000000deaf139e0] [c00000000054b6a8] blk_mq_queue_reinit_notify+0x1a8/0x2e0
[c000000deaf13a40] [c0000000000e7878] notifier_call_chain+0x98/0x100
[c000000deaf13a90] [c0000000000b8e08] cpu_notify_nofail+0x48/0xa0
[c000000deaf13ac0] [c0000000000b92f0] _cpu_down+0x2a0/0x400
[c000000deaf13b90] [c0000000000b94a8] cpu_down+0x58/0xa0
[c000000deaf13bc0] [c0000000006d5dcc] cpu_subsys_offline+0x2c/0x50
[c000000deaf13bf0] [c0000000006cd244] device_offline+0x104/0x140
[c000000deaf13c30] [c0000000006cd40c] online_store+0x6c/0xc0
[c000000deaf13c80] [c0000000006c8c78] dev_attr_store+0x68/0xa0
[c000000deaf13cc0] [c0000000003974d0] sysfs_kf_write+0x80/0xb0
[c000000deaf13d00] [c0000000003963e8] kernfs_fop_write+0x188/0x200
[c000000deaf13d50] [c0000000002e0f6c] __vfs_write+0x6c/0xe0
[c000000deaf13d90] [c0000000002e1ca0] vfs_write+0xc0/0x230
[c000000deaf13de0] [c0000000002e2cdc] SyS_write+0x6c/0x110
[c000000deaf13e30] [c000000000009204] system_call+0x38/0xb4
The fix is to allow the timeout work to execute in the window between
dropping the initial refcount reference and the release of the last
reference, which actually marks the freeze completion. This can be
achieved with percpu_refcount_tryget, which does not require the counter
to be alive. This way the timeout work can do it's job and terminate a
stuck request even during a freeze, returning its reference and avoiding
the deadlock.
Allowing the timeout to run is just a part of the fix, since for some
devices, we might get stuck again inside the device driver's timeout
handler, should it attempt to allocate a new request in that path -
which is a quite common action for Abort commands, which need to be sent
after a timeout. In NVMe, for instance, we call blk_mq_alloc_request
from inside the timeout handler, which will fail during a freeze, since
it also tries to acquire a queue reference.
I considered a similar change to blk_mq_alloc_request as a generic
solution for further device driver hangs, but we can't do that, since it
would allow new requests to disturb the freeze process. I thought about
creating a new function in the block layer to support unfreezable
requests for these occasions, but after working on it for a while, I
feel like this should be handled in a per-driver basis. I'm now
experimenting with changes to the NVMe timeout path, but I'm open to
suggestions of ways to make this generic.
Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: linux-nvme@lists.infradead.org
Cc: linux-block@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-01 14:23:39 +00:00
|
|
|
/* A deadlock might occur if a request is stuck requiring a
|
|
|
|
* timeout at the same time a queue freeze is waiting
|
|
|
|
* completion, since the timeout code would not be able to
|
|
|
|
* acquire the queue reference here.
|
|
|
|
*
|
|
|
|
* That's why we don't use blk_queue_enter here; instead, we use
|
|
|
|
* percpu_ref_tryget directly, because we need to be able to
|
|
|
|
* obtain a reference even in the short window between the queue
|
|
|
|
* starting to freeze, by dropping the first reference in
|
2017-03-27 12:06:57 +00:00
|
|
|
* blk_freeze_queue_start, and the moment the last request is
|
blk-mq: Allow timeouts to run while queue is freezing
In case a submitted request gets stuck for some reason, the block layer
can prevent the request starvation by starting the scheduled timeout work.
If this stuck request occurs at the same time another thread has started
a queue freeze, the blk_mq_timeout_work will not be able to acquire the
queue reference and will return silently, thus not issuing the timeout.
But since the request is already holding a q_usage_counter reference and
is unable to complete, it will never release its reference, preventing
the queue from completing the freeze started by first thread. This puts
the request_queue in a hung state, forever waiting for the freeze
completion.
This was observed while running IO to a NVMe device at the same time we
toggled the CPU hotplug code. Eventually, once a request got stuck
requiring a timeout during a queue freeze, we saw the CPU Hotplug
notification code get stuck inside blk_mq_freeze_queue_wait, as shown in
the trace below.
[c000000deaf13690] [c000000deaf13738] 0xc000000deaf13738 (unreliable)
[c000000deaf13860] [c000000000015ce8] __switch_to+0x1f8/0x350
[c000000deaf138b0] [c000000000ade0e4] __schedule+0x314/0x990
[c000000deaf13940] [c000000000ade7a8] schedule+0x48/0xc0
[c000000deaf13970] [c0000000005492a4] blk_mq_freeze_queue_wait+0x74/0x110
[c000000deaf139e0] [c00000000054b6a8] blk_mq_queue_reinit_notify+0x1a8/0x2e0
[c000000deaf13a40] [c0000000000e7878] notifier_call_chain+0x98/0x100
[c000000deaf13a90] [c0000000000b8e08] cpu_notify_nofail+0x48/0xa0
[c000000deaf13ac0] [c0000000000b92f0] _cpu_down+0x2a0/0x400
[c000000deaf13b90] [c0000000000b94a8] cpu_down+0x58/0xa0
[c000000deaf13bc0] [c0000000006d5dcc] cpu_subsys_offline+0x2c/0x50
[c000000deaf13bf0] [c0000000006cd244] device_offline+0x104/0x140
[c000000deaf13c30] [c0000000006cd40c] online_store+0x6c/0xc0
[c000000deaf13c80] [c0000000006c8c78] dev_attr_store+0x68/0xa0
[c000000deaf13cc0] [c0000000003974d0] sysfs_kf_write+0x80/0xb0
[c000000deaf13d00] [c0000000003963e8] kernfs_fop_write+0x188/0x200
[c000000deaf13d50] [c0000000002e0f6c] __vfs_write+0x6c/0xe0
[c000000deaf13d90] [c0000000002e1ca0] vfs_write+0xc0/0x230
[c000000deaf13de0] [c0000000002e2cdc] SyS_write+0x6c/0x110
[c000000deaf13e30] [c000000000009204] system_call+0x38/0xb4
The fix is to allow the timeout work to execute in the window between
dropping the initial refcount reference and the release of the last
reference, which actually marks the freeze completion. This can be
achieved with percpu_refcount_tryget, which does not require the counter
to be alive. This way the timeout work can do it's job and terminate a
stuck request even during a freeze, returning its reference and avoiding
the deadlock.
Allowing the timeout to run is just a part of the fix, since for some
devices, we might get stuck again inside the device driver's timeout
handler, should it attempt to allocate a new request in that path -
which is a quite common action for Abort commands, which need to be sent
after a timeout. In NVMe, for instance, we call blk_mq_alloc_request
from inside the timeout handler, which will fail during a freeze, since
it also tries to acquire a queue reference.
I considered a similar change to blk_mq_alloc_request as a generic
solution for further device driver hangs, but we can't do that, since it
would allow new requests to disturb the freeze process. I thought about
creating a new function in the block layer to support unfreezable
requests for these occasions, but after working on it for a while, I
feel like this should be handled in a per-driver basis. I'm now
experimenting with changes to the NVMe timeout path, but I'm open to
suggestions of ways to make this generic.
Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: linux-nvme@lists.infradead.org
Cc: linux-block@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-01 14:23:39 +00:00
|
|
|
* consumed, marked by the instant q_usage_counter reaches
|
|
|
|
* zero.
|
|
|
|
*/
|
|
|
|
if (!percpu_ref_tryget(&q->q_usage_counter))
|
2015-10-30 12:57:30 +00:00
|
|
|
return;
|
|
|
|
|
2018-05-29 13:52:28 +00:00
|
|
|
blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
2018-05-29 13:52:28 +00:00
|
|
|
if (next != 0) {
|
|
|
|
mod_timer(&q->timeout, next);
|
2014-05-13 21:10:52 +00:00
|
|
|
} else {
|
2018-01-10 16:33:33 +00:00
|
|
|
/*
|
|
|
|
* Request timeouts are handled as a forward rolling timer. If
|
|
|
|
* we end up here it means that no requests are pending and
|
|
|
|
* also that no request has been pending for a while. Mark
|
|
|
|
* each hctx as idle.
|
|
|
|
*/
|
2015-04-21 02:00:19 +00:00
|
|
|
queue_for_each_hw_ctx(q, hctx, i) {
|
|
|
|
/* the hctx may be unmapped, so check it here */
|
|
|
|
if (blk_mq_hw_queue_mapped(hctx))
|
|
|
|
blk_mq_tag_idle(hctx);
|
|
|
|
}
|
2014-05-13 21:10:52 +00:00
|
|
|
}
|
2015-10-30 12:57:30 +00:00
|
|
|
blk_queue_exit(q);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
}
|
|
|
|
|
2016-09-17 14:38:44 +00:00
|
|
|
struct flush_busy_ctx_data {
|
|
|
|
struct blk_mq_hw_ctx *hctx;
|
|
|
|
struct list_head *list;
|
|
|
|
};
|
|
|
|
|
|
|
|
static bool flush_busy_ctx(struct sbitmap *sb, unsigned int bitnr, void *data)
|
|
|
|
{
|
|
|
|
struct flush_busy_ctx_data *flush_data = data;
|
|
|
|
struct blk_mq_hw_ctx *hctx = flush_data->hctx;
|
|
|
|
struct blk_mq_ctx *ctx = hctx->ctxs[bitnr];
|
2018-12-17 15:44:05 +00:00
|
|
|
enum hctx_type type = hctx->type;
|
2016-09-17 14:38:44 +00:00
|
|
|
|
|
|
|
spin_lock(&ctx->lock);
|
2018-12-17 15:44:05 +00:00
|
|
|
list_splice_tail_init(&ctx->rq_lists[type], flush_data->list);
|
2018-02-28 00:56:42 +00:00
|
|
|
sbitmap_clear_bit(sb, bitnr);
|
2016-09-17 14:38:44 +00:00
|
|
|
spin_unlock(&ctx->lock);
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2014-05-19 15:23:55 +00:00
|
|
|
/*
|
|
|
|
* Process software queues that have been marked busy, splicing them
|
|
|
|
* to the for-dispatch
|
|
|
|
*/
|
2016-12-14 21:34:47 +00:00
|
|
|
void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
|
2014-05-19 15:23:55 +00:00
|
|
|
{
|
2016-09-17 14:38:44 +00:00
|
|
|
struct flush_busy_ctx_data data = {
|
|
|
|
.hctx = hctx,
|
|
|
|
.list = list,
|
|
|
|
};
|
2014-05-19 15:23:55 +00:00
|
|
|
|
2016-09-17 14:38:44 +00:00
|
|
|
sbitmap_for_each_set(&hctx->ctx_map, flush_busy_ctx, &data);
|
2014-05-19 15:23:55 +00:00
|
|
|
}
|
2016-12-14 21:34:47 +00:00
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_flush_busy_ctxs);
|
2014-05-19 15:23:55 +00:00
|
|
|
|
blk-mq-sched: improve dispatching from sw queue
SCSI devices use host-wide tagset, and the shared driver tag space is
often quite big. However, there is also a queue depth for each lun(
.cmd_per_lun), which is often small, for example, on both lpfc and
qla2xxx, .cmd_per_lun is just 3.
So lots of requests may stay in sw queue, and we always flush all
belonging to same hw queue and dispatch them all to driver.
Unfortunately it is easy to cause queue busy because of the small
.cmd_per_lun. Once these requests are flushed out, they have to stay in
hctx->dispatch, and no bio merge can happen on these requests, and
sequential IO performance is harmed.
This patch introduces blk_mq_dequeue_from_ctx for dequeuing a request
from a sw queue, so that we can dispatch them in scheduler's way. We can
then avoid dequeueing too many requests from sw queue, since we don't
flush ->dispatch completely.
This patch improves dispatching from sw queue by using the .get_budget
and .put_budget callbacks.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-14 09:22:30 +00:00
|
|
|
struct dispatch_rq_data {
|
|
|
|
struct blk_mq_hw_ctx *hctx;
|
|
|
|
struct request *rq;
|
|
|
|
};
|
|
|
|
|
|
|
|
static bool dispatch_rq_from_ctx(struct sbitmap *sb, unsigned int bitnr,
|
|
|
|
void *data)
|
|
|
|
{
|
|
|
|
struct dispatch_rq_data *dispatch_data = data;
|
|
|
|
struct blk_mq_hw_ctx *hctx = dispatch_data->hctx;
|
|
|
|
struct blk_mq_ctx *ctx = hctx->ctxs[bitnr];
|
2018-12-17 15:44:05 +00:00
|
|
|
enum hctx_type type = hctx->type;
|
blk-mq-sched: improve dispatching from sw queue
SCSI devices use host-wide tagset, and the shared driver tag space is
often quite big. However, there is also a queue depth for each lun(
.cmd_per_lun), which is often small, for example, on both lpfc and
qla2xxx, .cmd_per_lun is just 3.
So lots of requests may stay in sw queue, and we always flush all
belonging to same hw queue and dispatch them all to driver.
Unfortunately it is easy to cause queue busy because of the small
.cmd_per_lun. Once these requests are flushed out, they have to stay in
hctx->dispatch, and no bio merge can happen on these requests, and
sequential IO performance is harmed.
This patch introduces blk_mq_dequeue_from_ctx for dequeuing a request
from a sw queue, so that we can dispatch them in scheduler's way. We can
then avoid dequeueing too many requests from sw queue, since we don't
flush ->dispatch completely.
This patch improves dispatching from sw queue by using the .get_budget
and .put_budget callbacks.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-14 09:22:30 +00:00
|
|
|
|
|
|
|
spin_lock(&ctx->lock);
|
2018-12-17 15:44:05 +00:00
|
|
|
if (!list_empty(&ctx->rq_lists[type])) {
|
|
|
|
dispatch_data->rq = list_entry_rq(ctx->rq_lists[type].next);
|
blk-mq-sched: improve dispatching from sw queue
SCSI devices use host-wide tagset, and the shared driver tag space is
often quite big. However, there is also a queue depth for each lun(
.cmd_per_lun), which is often small, for example, on both lpfc and
qla2xxx, .cmd_per_lun is just 3.
So lots of requests may stay in sw queue, and we always flush all
belonging to same hw queue and dispatch them all to driver.
Unfortunately it is easy to cause queue busy because of the small
.cmd_per_lun. Once these requests are flushed out, they have to stay in
hctx->dispatch, and no bio merge can happen on these requests, and
sequential IO performance is harmed.
This patch introduces blk_mq_dequeue_from_ctx for dequeuing a request
from a sw queue, so that we can dispatch them in scheduler's way. We can
then avoid dequeueing too many requests from sw queue, since we don't
flush ->dispatch completely.
This patch improves dispatching from sw queue by using the .get_budget
and .put_budget callbacks.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-14 09:22:30 +00:00
|
|
|
list_del_init(&dispatch_data->rq->queuelist);
|
2018-12-17 15:44:05 +00:00
|
|
|
if (list_empty(&ctx->rq_lists[type]))
|
blk-mq-sched: improve dispatching from sw queue
SCSI devices use host-wide tagset, and the shared driver tag space is
often quite big. However, there is also a queue depth for each lun(
.cmd_per_lun), which is often small, for example, on both lpfc and
qla2xxx, .cmd_per_lun is just 3.
So lots of requests may stay in sw queue, and we always flush all
belonging to same hw queue and dispatch them all to driver.
Unfortunately it is easy to cause queue busy because of the small
.cmd_per_lun. Once these requests are flushed out, they have to stay in
hctx->dispatch, and no bio merge can happen on these requests, and
sequential IO performance is harmed.
This patch introduces blk_mq_dequeue_from_ctx for dequeuing a request
from a sw queue, so that we can dispatch them in scheduler's way. We can
then avoid dequeueing too many requests from sw queue, since we don't
flush ->dispatch completely.
This patch improves dispatching from sw queue by using the .get_budget
and .put_budget callbacks.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-14 09:22:30 +00:00
|
|
|
sbitmap_clear_bit(sb, bitnr);
|
|
|
|
}
|
|
|
|
spin_unlock(&ctx->lock);
|
|
|
|
|
|
|
|
return !dispatch_data->rq;
|
|
|
|
}
|
|
|
|
|
|
|
|
struct request *blk_mq_dequeue_from_ctx(struct blk_mq_hw_ctx *hctx,
|
|
|
|
struct blk_mq_ctx *start)
|
|
|
|
{
|
2018-10-29 19:13:29 +00:00
|
|
|
unsigned off = start ? start->index_hw[hctx->type] : 0;
|
blk-mq-sched: improve dispatching from sw queue
SCSI devices use host-wide tagset, and the shared driver tag space is
often quite big. However, there is also a queue depth for each lun(
.cmd_per_lun), which is often small, for example, on both lpfc and
qla2xxx, .cmd_per_lun is just 3.
So lots of requests may stay in sw queue, and we always flush all
belonging to same hw queue and dispatch them all to driver.
Unfortunately it is easy to cause queue busy because of the small
.cmd_per_lun. Once these requests are flushed out, they have to stay in
hctx->dispatch, and no bio merge can happen on these requests, and
sequential IO performance is harmed.
This patch introduces blk_mq_dequeue_from_ctx for dequeuing a request
from a sw queue, so that we can dispatch them in scheduler's way. We can
then avoid dequeueing too many requests from sw queue, since we don't
flush ->dispatch completely.
This patch improves dispatching from sw queue by using the .get_budget
and .put_budget callbacks.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-14 09:22:30 +00:00
|
|
|
struct dispatch_rq_data data = {
|
|
|
|
.hctx = hctx,
|
|
|
|
.rq = NULL,
|
|
|
|
};
|
|
|
|
|
|
|
|
__sbitmap_for_each_set(&hctx->ctx_map, off,
|
|
|
|
dispatch_rq_from_ctx, &data);
|
|
|
|
|
|
|
|
return data.rq;
|
|
|
|
}
|
|
|
|
|
2016-09-16 19:59:14 +00:00
|
|
|
static inline unsigned int queued_to_index(unsigned int queued)
|
|
|
|
{
|
|
|
|
if (!queued)
|
|
|
|
return 0;
|
2014-05-19 15:23:55 +00:00
|
|
|
|
2016-09-16 19:59:14 +00:00
|
|
|
return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
|
2014-05-19 15:23:55 +00:00
|
|
|
}
|
|
|
|
|
2020-06-30 14:03:55 +00:00
|
|
|
static bool __blk_mq_get_driver_tag(struct request *rq)
|
|
|
|
{
|
2020-08-19 15:20:23 +00:00
|
|
|
struct sbitmap_queue *bt = rq->mq_hctx->tags->bitmap_tags;
|
2020-06-30 14:03:55 +00:00
|
|
|
unsigned int tag_offset = rq->mq_hctx->tags->nr_reserved_tags;
|
|
|
|
int tag;
|
|
|
|
|
2020-07-06 14:41:11 +00:00
|
|
|
blk_mq_tag_busy(rq->mq_hctx);
|
|
|
|
|
2020-06-30 14:03:55 +00:00
|
|
|
if (blk_mq_tag_is_reserved(rq->mq_hctx->sched_tags, rq->internal_tag)) {
|
2020-08-19 15:20:23 +00:00
|
|
|
bt = rq->mq_hctx->tags->breserved_tags;
|
2020-06-30 14:03:55 +00:00
|
|
|
tag_offset = 0;
|
2020-09-11 10:41:14 +00:00
|
|
|
} else {
|
|
|
|
if (!hctx_may_queue(rq->mq_hctx, bt))
|
|
|
|
return false;
|
2020-06-30 14:03:55 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
tag = __sbitmap_queue_get(bt);
|
|
|
|
if (tag == BLK_MQ_NO_TAG)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
rq->tag = tag + tag_offset;
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2021-06-03 10:47:21 +00:00
|
|
|
bool blk_mq_get_driver_tag(struct request *rq)
|
2020-06-30 14:03:55 +00:00
|
|
|
{
|
2020-07-06 14:41:11 +00:00
|
|
|
struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
|
|
|
|
|
|
|
|
if (rq->tag == BLK_MQ_NO_TAG && !__blk_mq_get_driver_tag(rq))
|
|
|
|
return false;
|
|
|
|
|
2020-08-19 15:20:19 +00:00
|
|
|
if ((hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) &&
|
2020-07-06 14:41:11 +00:00
|
|
|
!(rq->rq_flags & RQF_MQ_INFLIGHT)) {
|
|
|
|
rq->rq_flags |= RQF_MQ_INFLIGHT;
|
2020-08-19 15:20:26 +00:00
|
|
|
__blk_mq_inc_active_requests(hctx);
|
2020-07-06 14:41:11 +00:00
|
|
|
}
|
|
|
|
hctx->tags->rqs[rq->tag] = rq;
|
|
|
|
return true;
|
2020-06-30 14:03:55 +00:00
|
|
|
}
|
|
|
|
|
2017-11-09 15:32:43 +00:00
|
|
|
static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode,
|
|
|
|
int flags, void *key)
|
2017-02-22 18:58:29 +00:00
|
|
|
{
|
|
|
|
struct blk_mq_hw_ctx *hctx;
|
|
|
|
|
|
|
|
hctx = container_of(wait, struct blk_mq_hw_ctx, dispatch_wait);
|
|
|
|
|
2018-06-25 11:31:47 +00:00
|
|
|
spin_lock(&hctx->dispatch_wait_lock);
|
2019-03-25 18:34:10 +00:00
|
|
|
if (!list_empty(&wait->entry)) {
|
|
|
|
struct sbitmap_queue *sbq;
|
|
|
|
|
|
|
|
list_del_init(&wait->entry);
|
2020-08-19 15:20:23 +00:00
|
|
|
sbq = hctx->tags->bitmap_tags;
|
2019-03-25 18:34:10 +00:00
|
|
|
atomic_dec(&sbq->ws_active);
|
|
|
|
}
|
2018-06-25 11:31:47 +00:00
|
|
|
spin_unlock(&hctx->dispatch_wait_lock);
|
|
|
|
|
2017-02-22 18:58:29 +00:00
|
|
|
blk_mq_run_hw_queue(hctx, true);
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2017-11-09 23:10:13 +00:00
|
|
|
/*
|
|
|
|
* Mark us waiting for a tag. For shared tags, this involves hooking us into
|
2018-01-09 18:09:15 +00:00
|
|
|
* the tag wakeups. For non-shared tags, we can simply mark us needing a
|
|
|
|
* restart. For both cases, take care to check the condition again after
|
2017-11-09 23:10:13 +00:00
|
|
|
* marking us as waiting.
|
|
|
|
*/
|
2018-06-25 11:31:46 +00:00
|
|
|
static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx *hctx,
|
2017-11-09 23:10:13 +00:00
|
|
|
struct request *rq)
|
2017-02-22 18:58:29 +00:00
|
|
|
{
|
2020-08-19 15:20:23 +00:00
|
|
|
struct sbitmap_queue *sbq = hctx->tags->bitmap_tags;
|
2018-06-25 11:31:47 +00:00
|
|
|
struct wait_queue_head *wq;
|
2017-11-09 23:10:13 +00:00
|
|
|
wait_queue_entry_t *wait;
|
|
|
|
bool ret;
|
2017-02-22 18:58:29 +00:00
|
|
|
|
2020-08-19 15:20:19 +00:00
|
|
|
if (!(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
|
2019-03-15 03:05:10 +00:00
|
|
|
blk_mq_sched_mark_restart_hctx(hctx);
|
2017-11-09 23:10:13 +00:00
|
|
|
|
2018-01-10 21:41:21 +00:00
|
|
|
/*
|
|
|
|
* It's possible that a tag was freed in the window between the
|
|
|
|
* allocation failure and adding the hardware queue to the wait
|
|
|
|
* queue.
|
|
|
|
*
|
|
|
|
* Don't clear RESTART here, someone else could have set it.
|
|
|
|
* At most this will cost an extra queue run.
|
|
|
|
*/
|
2018-06-25 11:31:45 +00:00
|
|
|
return blk_mq_get_driver_tag(rq);
|
2017-11-09 15:32:43 +00:00
|
|
|
}
|
|
|
|
|
2018-06-25 11:31:46 +00:00
|
|
|
wait = &hctx->dispatch_wait;
|
2018-01-10 21:41:21 +00:00
|
|
|
if (!list_empty_careful(&wait->entry))
|
|
|
|
return false;
|
|
|
|
|
2019-03-25 18:34:10 +00:00
|
|
|
wq = &bt_wait_ptr(sbq, hctx)->wait;
|
2018-06-25 11:31:47 +00:00
|
|
|
|
|
|
|
spin_lock_irq(&wq->lock);
|
|
|
|
spin_lock(&hctx->dispatch_wait_lock);
|
2018-01-10 21:41:21 +00:00
|
|
|
if (!list_empty(&wait->entry)) {
|
2018-06-25 11:31:47 +00:00
|
|
|
spin_unlock(&hctx->dispatch_wait_lock);
|
|
|
|
spin_unlock_irq(&wq->lock);
|
2018-01-10 21:41:21 +00:00
|
|
|
return false;
|
2017-11-09 15:32:43 +00:00
|
|
|
}
|
|
|
|
|
2019-03-25 18:34:10 +00:00
|
|
|
atomic_inc(&sbq->ws_active);
|
2018-06-25 11:31:47 +00:00
|
|
|
wait->flags &= ~WQ_FLAG_EXCLUSIVE;
|
|
|
|
__add_wait_queue(wq, wait);
|
2018-01-10 21:41:21 +00:00
|
|
|
|
2017-02-22 18:58:29 +00:00
|
|
|
/*
|
2017-11-09 15:32:43 +00:00
|
|
|
* It's possible that a tag was freed in the window between the
|
|
|
|
* allocation failure and adding the hardware queue to the wait
|
|
|
|
* queue.
|
2017-02-22 18:58:29 +00:00
|
|
|
*/
|
2018-06-25 11:31:45 +00:00
|
|
|
ret = blk_mq_get_driver_tag(rq);
|
2018-01-10 21:41:21 +00:00
|
|
|
if (!ret) {
|
2018-06-25 11:31:47 +00:00
|
|
|
spin_unlock(&hctx->dispatch_wait_lock);
|
|
|
|
spin_unlock_irq(&wq->lock);
|
2018-01-10 21:41:21 +00:00
|
|
|
return false;
|
2017-11-09 15:32:43 +00:00
|
|
|
}
|
2018-01-10 21:41:21 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We got a tag, remove ourselves from the wait queue to ensure
|
|
|
|
* someone else gets the wakeup.
|
|
|
|
*/
|
|
|
|
list_del_init(&wait->entry);
|
2019-03-25 18:34:10 +00:00
|
|
|
atomic_dec(&sbq->ws_active);
|
2018-06-25 11:31:47 +00:00
|
|
|
spin_unlock(&hctx->dispatch_wait_lock);
|
|
|
|
spin_unlock_irq(&wq->lock);
|
2018-01-10 21:41:21 +00:00
|
|
|
|
|
|
|
return true;
|
2017-02-22 18:58:29 +00:00
|
|
|
}
|
|
|
|
|
2018-07-03 15:03:16 +00:00
|
|
|
#define BLK_MQ_DISPATCH_BUSY_EWMA_WEIGHT 8
|
|
|
|
#define BLK_MQ_DISPATCH_BUSY_EWMA_FACTOR 4
|
|
|
|
/*
|
|
|
|
* Update dispatch busy with the Exponential Weighted Moving Average(EWMA):
|
|
|
|
* - EWMA is one simple way to compute running average value
|
|
|
|
* - weight(7/8 and 1/8) is applied so that it can decrease exponentially
|
|
|
|
* - take 4 as factor for avoiding to get too small(0) result, and this
|
|
|
|
* factor doesn't matter because EWMA decreases exponentially
|
|
|
|
*/
|
|
|
|
static void blk_mq_update_dispatch_busy(struct blk_mq_hw_ctx *hctx, bool busy)
|
|
|
|
{
|
|
|
|
unsigned int ewma;
|
|
|
|
|
|
|
|
ewma = hctx->dispatch_busy;
|
|
|
|
|
|
|
|
if (!ewma && !busy)
|
|
|
|
return;
|
|
|
|
|
|
|
|
ewma *= BLK_MQ_DISPATCH_BUSY_EWMA_WEIGHT - 1;
|
|
|
|
if (busy)
|
|
|
|
ewma += 1 << BLK_MQ_DISPATCH_BUSY_EWMA_FACTOR;
|
|
|
|
ewma /= BLK_MQ_DISPATCH_BUSY_EWMA_WEIGHT;
|
|
|
|
|
|
|
|
hctx->dispatch_busy = ewma;
|
|
|
|
}
|
|
|
|
|
2018-01-31 03:04:57 +00:00
|
|
|
#define BLK_MQ_RESOURCE_DELAY 3 /* ms units */
|
|
|
|
|
2020-03-24 15:24:44 +00:00
|
|
|
static void blk_mq_handle_dev_resource(struct request *rq,
|
|
|
|
struct list_head *list)
|
|
|
|
{
|
|
|
|
struct request *next =
|
|
|
|
list_first_entry_or_null(list, struct request, queuelist);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If an I/O scheduler has been configured and we got a driver tag for
|
|
|
|
* the next request already, free it.
|
|
|
|
*/
|
|
|
|
if (next)
|
|
|
|
blk_mq_put_driver_tag(next);
|
|
|
|
|
|
|
|
list_add(&rq->queuelist, list);
|
|
|
|
__blk_mq_requeue_request(rq);
|
|
|
|
}
|
|
|
|
|
2020-05-12 08:55:47 +00:00
|
|
|
static void blk_mq_handle_zone_resource(struct request *rq,
|
|
|
|
struct list_head *zone_list)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* If we end up here it is because we cannot dispatch a request to a
|
|
|
|
* specific zone due to LLD level zone-write locking or other zone
|
|
|
|
* related resource not being available. In this case, set the request
|
|
|
|
* aside in zone_list for retrying it later.
|
|
|
|
*/
|
|
|
|
list_add(&rq->queuelist, zone_list);
|
|
|
|
__blk_mq_requeue_request(rq);
|
|
|
|
}
|
|
|
|
|
2020-06-30 10:24:58 +00:00
|
|
|
enum prep_dispatch {
|
|
|
|
PREP_DISPATCH_OK,
|
|
|
|
PREP_DISPATCH_NO_TAG,
|
|
|
|
PREP_DISPATCH_NO_BUDGET,
|
|
|
|
};
|
|
|
|
|
|
|
|
static enum prep_dispatch blk_mq_prep_dispatch_rq(struct request *rq,
|
|
|
|
bool need_budget)
|
|
|
|
{
|
|
|
|
struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
|
2021-01-22 02:33:12 +00:00
|
|
|
int budget_token = -1;
|
2020-06-30 10:24:58 +00:00
|
|
|
|
2021-01-22 02:33:12 +00:00
|
|
|
if (need_budget) {
|
|
|
|
budget_token = blk_mq_get_dispatch_budget(rq->q);
|
|
|
|
if (budget_token < 0) {
|
|
|
|
blk_mq_put_driver_tag(rq);
|
|
|
|
return PREP_DISPATCH_NO_BUDGET;
|
|
|
|
}
|
|
|
|
blk_mq_set_rq_budget_token(rq, budget_token);
|
2020-06-30 10:24:58 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
if (!blk_mq_get_driver_tag(rq)) {
|
|
|
|
/*
|
|
|
|
* The initial allocation attempt failed, so we need to
|
|
|
|
* rerun the hardware queue when a tag is freed. The
|
|
|
|
* waitqueue takes care of that. If the queue is run
|
|
|
|
* before we add this entry back on the dispatch list,
|
|
|
|
* we'll re-run it below.
|
|
|
|
*/
|
|
|
|
if (!blk_mq_mark_tag_wait(hctx, rq)) {
|
2020-06-30 10:25:00 +00:00
|
|
|
/*
|
|
|
|
* All budgets not got from this function will be put
|
|
|
|
* together during handling partial dispatch
|
|
|
|
*/
|
|
|
|
if (need_budget)
|
2021-01-22 02:33:12 +00:00
|
|
|
blk_mq_put_dispatch_budget(rq->q, budget_token);
|
2020-06-30 10:24:58 +00:00
|
|
|
return PREP_DISPATCH_NO_TAG;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return PREP_DISPATCH_OK;
|
|
|
|
}
|
|
|
|
|
2020-06-30 10:25:00 +00:00
|
|
|
/* release all allocated budgets before calling to blk_mq_dispatch_rq_list */
|
|
|
|
static void blk_mq_release_budgets(struct request_queue *q,
|
2021-01-22 02:33:12 +00:00
|
|
|
struct list_head *list)
|
2020-06-30 10:25:00 +00:00
|
|
|
{
|
2021-01-22 02:33:12 +00:00
|
|
|
struct request *rq;
|
2020-06-30 10:25:00 +00:00
|
|
|
|
2021-01-22 02:33:12 +00:00
|
|
|
list_for_each_entry(rq, list, queuelist) {
|
|
|
|
int budget_token = blk_mq_get_rq_budget_token(rq);
|
2020-06-30 10:25:00 +00:00
|
|
|
|
2021-01-22 02:33:12 +00:00
|
|
|
if (budget_token >= 0)
|
|
|
|
blk_mq_put_dispatch_budget(q, budget_token);
|
|
|
|
}
|
2020-06-30 10:25:00 +00:00
|
|
|
}
|
|
|
|
|
blk-mq: don't queue more if we get a busy return
Some devices have different queue limits depending on the type of IO. A
classic case is SATA NCQ, where some commands can queue, but others
cannot. If we have NCQ commands inflight and encounter a non-queueable
command, the driver returns busy. Currently we attempt to dispatch more
from the scheduler, if we were able to queue some commands. But for the
case where we ended up stopping due to BUSY, we should not attempt to
retrieve more from the scheduler. If we do, we can get into a situation
where we attempt to queue a non-queueable command, get BUSY, then
successfully retrieve more commands from that scheduler and queue those.
This can repeat forever, starving the non-queuable command indefinitely.
Fix this by NOT attempting to pull more commands from the scheduler, if
we get a BUSY return. This should also be more optimal in terms of
letting requests stay in the scheduler for as long as possible, if we
get a BUSY due to the regular out-of-tags condition.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-28 17:54:01 +00:00
|
|
|
/*
|
|
|
|
* Returns true if we did some work AND can potentially do more.
|
|
|
|
*/
|
2020-06-30 10:24:57 +00:00
|
|
|
bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list,
|
2020-06-30 10:25:00 +00:00
|
|
|
unsigned int nr_budgets)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
{
|
2020-06-30 10:24:58 +00:00
|
|
|
enum prep_dispatch prep;
|
2020-06-30 10:24:57 +00:00
|
|
|
struct request_queue *q = hctx->queue;
|
2017-11-02 15:24:32 +00:00
|
|
|
struct request *rq, *nxt;
|
2017-06-03 07:38:05 +00:00
|
|
|
int errors, queued;
|
2018-01-31 03:04:57 +00:00
|
|
|
blk_status_t ret = BLK_STS_OK;
|
2020-05-12 08:55:47 +00:00
|
|
|
LIST_HEAD(zone_list);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
blk-mq: use the right hctx when getting a driver tag fails
While dispatching requests, if we fail to get a driver tag, we mark the
hardware queue as waiting for a tag and put the requests on a
hctx->dispatch list to be run later when a driver tag is freed. However,
blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
queues if using a single-queue scheduler with a multiqueue device. If
blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we
are processing. This means we end up using the hardware queue of the
previous request, which may or may not be the same as that of the
current request. If it isn't, the wrong hardware queue will end up
waiting for a tag, and the requests will be on the wrong dispatch list,
leading to a hang.
The fix is twofold:
1. Make sure we save which hardware queue we were trying to get a
request for in blk_mq_get_driver_tag() regardless of whether it
succeeds or not.
2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a
blk_mq_hw_queue to make it clear that it must handle multiple
hardware queues, since I've already messed this up on a couple of
occasions.
This didn't appear in testing with nvme and mq-deadline because nvme has
more driver tags than the default number of scheduler tags. However,
with the blk_mq_update_nr_hw_queues() fix, it showed up with nbd.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07 14:56:26 +00:00
|
|
|
if (list_empty(list))
|
|
|
|
return false;
|
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
/*
|
|
|
|
* Now process all the entries, sending them to the driver.
|
|
|
|
*/
|
2017-03-24 18:04:19 +00:00
|
|
|
errors = queued = 0;
|
blk-mq: use the right hctx when getting a driver tag fails
While dispatching requests, if we fail to get a driver tag, we mark the
hardware queue as waiting for a tag and put the requests on a
hctx->dispatch list to be run later when a driver tag is freed. However,
blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
queues if using a single-queue scheduler with a multiqueue device. If
blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we
are processing. This means we end up using the hardware queue of the
previous request, which may or may not be the same as that of the
current request. If it isn't, the wrong hardware queue will end up
waiting for a tag, and the requests will be on the wrong dispatch list,
leading to a hang.
The fix is twofold:
1. Make sure we save which hardware queue we were trying to get a
request for in blk_mq_get_driver_tag() regardless of whether it
succeeds or not.
2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a
blk_mq_hw_queue to make it clear that it must handle multiple
hardware queues, since I've already messed this up on a couple of
occasions.
This didn't appear in testing with nvme and mq-deadline because nvme has
more driver tags than the default number of scheduler tags. However,
with the blk_mq_update_nr_hw_queues() fix, it showed up with nbd.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07 14:56:26 +00:00
|
|
|
do {
|
2014-10-29 17:14:52 +00:00
|
|
|
struct blk_mq_queue_data bd;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
2016-12-07 15:41:17 +00:00
|
|
|
rq = list_first_entry(list, struct request, queuelist);
|
blk-mq: order getting budget and driver tag
This patch orders getting budget and driver tag by making sure to acquire
driver tag after budget is got, this way can help to avoid the following
race:
1) before dispatch request from scheduler queue, get one budget first, then
dequeue a request, call it request A.
2) in another IO path for dispatching request B which is from hctx->dispatch,
driver tag is got, then try to get budget in blk_mq_dispatch_rq_list(),
unfortunately the budget is held by request A.
3) meantime blk_mq_dispatch_rq_list() is called for dispatching request
A, and try to get driver tag first, unfortunately no driver tag is
available because the driver tag is held by request B
4) both two IO pathes can't move on, and IO stall is caused.
This issue can be observed when running dbench on USB storage.
This patch fixes this issue by always getting budget before getting
driver tag.
Cc: stable@vger.kernel.org
Fixes: de1482974080ec9e ("blk-mq: introduce .get_budget and .put_budget in blk_mq_ops")
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-04 16:35:21 +00:00
|
|
|
|
2020-06-30 10:24:57 +00:00
|
|
|
WARN_ON_ONCE(hctx != rq->mq_hctx);
|
2020-06-30 10:25:00 +00:00
|
|
|
prep = blk_mq_prep_dispatch_rq(rq, !nr_budgets);
|
2020-06-30 10:24:58 +00:00
|
|
|
if (prep != PREP_DISPATCH_OK)
|
blk-mq: order getting budget and driver tag
This patch orders getting budget and driver tag by making sure to acquire
driver tag after budget is got, this way can help to avoid the following
race:
1) before dispatch request from scheduler queue, get one budget first, then
dequeue a request, call it request A.
2) in another IO path for dispatching request B which is from hctx->dispatch,
driver tag is got, then try to get budget in blk_mq_dispatch_rq_list(),
unfortunately the budget is held by request A.
3) meantime blk_mq_dispatch_rq_list() is called for dispatching request
A, and try to get driver tag first, unfortunately no driver tag is
available because the driver tag is held by request B
4) both two IO pathes can't move on, and IO stall is caused.
This issue can be observed when running dbench on USB storage.
This patch fixes this issue by always getting budget before getting
driver tag.
Cc: stable@vger.kernel.org
Fixes: de1482974080ec9e ("blk-mq: introduce .get_budget and .put_budget in blk_mq_ops")
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-04 16:35:21 +00:00
|
|
|
break;
|
2017-10-14 09:22:29 +00:00
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
list_del_init(&rq->queuelist);
|
|
|
|
|
2014-10-29 17:14:52 +00:00
|
|
|
bd.rq = rq;
|
2017-03-02 20:26:04 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Flag last if we have no more requests, or if we have more
|
|
|
|
* but can't assign a driver tag to it.
|
|
|
|
*/
|
|
|
|
if (list_empty(list))
|
|
|
|
bd.last = true;
|
|
|
|
else {
|
|
|
|
nxt = list_first_entry(list, struct request, queuelist);
|
2018-06-25 11:31:45 +00:00
|
|
|
bd.last = !blk_mq_get_driver_tag(nxt);
|
2017-03-02 20:26:04 +00:00
|
|
|
}
|
2014-10-29 17:14:52 +00:00
|
|
|
|
2020-06-30 10:25:00 +00:00
|
|
|
/*
|
|
|
|
* once the request is queued to lld, no need to cover the
|
|
|
|
* budget any more
|
|
|
|
*/
|
|
|
|
if (nr_budgets)
|
|
|
|
nr_budgets--;
|
2014-10-29 17:14:52 +00:00
|
|
|
ret = q->mq_ops->queue_rq(hctx, &bd);
|
2020-07-01 13:58:57 +00:00
|
|
|
switch (ret) {
|
|
|
|
case BLK_STS_OK:
|
|
|
|
queued++;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
break;
|
2020-07-01 13:58:57 +00:00
|
|
|
case BLK_STS_RESOURCE:
|
|
|
|
case BLK_STS_DEV_RESOURCE:
|
|
|
|
blk_mq_handle_dev_resource(rq, list);
|
|
|
|
goto out;
|
|
|
|
case BLK_STS_ZONE_RESOURCE:
|
2020-05-12 08:55:47 +00:00
|
|
|
/*
|
|
|
|
* Move the request to zone_list and keep going through
|
|
|
|
* the dispatch list to find more requests the drive can
|
|
|
|
* accept.
|
|
|
|
*/
|
|
|
|
blk_mq_handle_zone_resource(rq, &zone_list);
|
2020-07-01 13:58:57 +00:00
|
|
|
break;
|
|
|
|
default:
|
2017-03-24 18:04:19 +00:00
|
|
|
errors++;
|
2020-09-30 08:02:53 +00:00
|
|
|
blk_mq_end_request(rq, ret);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
}
|
blk-mq: use the right hctx when getting a driver tag fails
While dispatching requests, if we fail to get a driver tag, we mark the
hardware queue as waiting for a tag and put the requests on a
hctx->dispatch list to be run later when a driver tag is freed. However,
blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
queues if using a single-queue scheduler with a multiqueue device. If
blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we
are processing. This means we end up using the hardware queue of the
previous request, which may or may not be the same as that of the
current request. If it isn't, the wrong hardware queue will end up
waiting for a tag, and the requests will be on the wrong dispatch list,
leading to a hang.
The fix is twofold:
1. Make sure we save which hardware queue we were trying to get a
request for in blk_mq_get_driver_tag() regardless of whether it
succeeds or not.
2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a
blk_mq_hw_queue to make it clear that it must handle multiple
hardware queues, since I've already messed this up on a couple of
occasions.
This didn't appear in testing with nvme and mq-deadline because nvme has
more driver tags than the default number of scheduler tags. However,
with the blk_mq_update_nr_hw_queues() fix, it showed up with nbd.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07 14:56:26 +00:00
|
|
|
} while (!list_empty(list));
|
2020-07-01 13:58:57 +00:00
|
|
|
out:
|
2020-05-12 08:55:47 +00:00
|
|
|
if (!list_empty(&zone_list))
|
|
|
|
list_splice_tail_init(&zone_list, list);
|
|
|
|
|
2016-09-16 19:59:14 +00:00
|
|
|
hctx->dispatched[queued_to_index(queued)]++;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
2020-09-05 11:25:56 +00:00
|
|
|
/* If we didn't flush the entire list, we could have told the driver
|
|
|
|
* there was more coming, but that turned out to be a lie.
|
|
|
|
*/
|
|
|
|
if ((!list_empty(list) || errors) && q->mq_ops->commit_rqs && queued)
|
|
|
|
q->mq_ops->commit_rqs(hctx);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
/*
|
|
|
|
* Any items that need requeuing? Stuff them into hctx->dispatch,
|
|
|
|
* that is where we will continue on next queue run.
|
|
|
|
*/
|
2016-12-07 15:41:17 +00:00
|
|
|
if (!list_empty(list)) {
|
2018-01-31 03:04:57 +00:00
|
|
|
bool needs_restart;
|
2020-06-30 10:24:58 +00:00
|
|
|
/* For non-shared tags, the RESTART check will suffice */
|
|
|
|
bool no_tag = prep == PREP_DISPATCH_NO_TAG &&
|
2020-08-19 15:20:19 +00:00
|
|
|
(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED);
|
2020-06-30 10:24:58 +00:00
|
|
|
bool no_budget_avail = prep == PREP_DISPATCH_NO_BUDGET;
|
2018-01-31 03:04:57 +00:00
|
|
|
|
2021-01-22 02:33:12 +00:00
|
|
|
if (nr_budgets)
|
|
|
|
blk_mq_release_budgets(q, list);
|
2018-01-31 03:04:57 +00:00
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
spin_lock(&hctx->lock);
|
2020-02-25 01:04:32 +00:00
|
|
|
list_splice_tail_init(list, &hctx->dispatch);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
spin_unlock(&hctx->lock);
|
2016-12-07 15:41:17 +00:00
|
|
|
|
2020-08-17 10:01:15 +00:00
|
|
|
/*
|
|
|
|
* Order adding requests to hctx->dispatch and checking
|
|
|
|
* SCHED_RESTART flag. The pair of this smp_mb() is the one
|
|
|
|
* in blk_mq_sched_restart(). Avoid restart code path to
|
|
|
|
* miss the new added requests to hctx->dispatch, meantime
|
|
|
|
* SCHED_RESTART is observed here.
|
|
|
|
*/
|
|
|
|
smp_mb();
|
|
|
|
|
2015-05-04 20:32:48 +00:00
|
|
|
/*
|
2017-04-07 18:16:51 +00:00
|
|
|
* If SCHED_RESTART was set by the caller of this function and
|
|
|
|
* it is no longer set that means that it was cleared by another
|
|
|
|
* thread and hence that a queue rerun is needed.
|
2015-05-04 20:32:48 +00:00
|
|
|
*
|
2017-11-09 15:32:43 +00:00
|
|
|
* If 'no_tag' is set, that means that we failed getting
|
|
|
|
* a driver tag with an I/O scheduler attached. If our dispatch
|
|
|
|
* waitqueue is no longer active, ensure that we run the queue
|
|
|
|
* AFTER adding our entries back to the list.
|
2017-01-17 13:03:22 +00:00
|
|
|
*
|
2017-04-07 18:16:51 +00:00
|
|
|
* If no I/O scheduler has been configured it is possible that
|
|
|
|
* the hardware queue got stopped and restarted before requests
|
|
|
|
* were pushed back onto the dispatch list. Rerun the queue to
|
|
|
|
* avoid starvation. Notes:
|
|
|
|
* - blk_mq_run_hw_queue() checks whether or not a queue has
|
|
|
|
* been stopped before rerunning a queue.
|
|
|
|
* - Some but not all block drivers stop a queue before
|
2017-06-03 07:38:05 +00:00
|
|
|
* returning BLK_STS_RESOURCE. Two exceptions are scsi-mq
|
2017-04-07 18:16:51 +00:00
|
|
|
* and dm-rq.
|
2018-01-31 03:04:57 +00:00
|
|
|
*
|
|
|
|
* If driver returns BLK_STS_RESOURCE and SCHED_RESTART
|
|
|
|
* bit is set, run queue after a delay to avoid IO stalls
|
2020-04-20 16:24:51 +00:00
|
|
|
* that could otherwise occur if the queue is idle. We'll do
|
|
|
|
* similar if we couldn't get budget and SCHED_RESTART is set.
|
2017-01-17 13:03:22 +00:00
|
|
|
*/
|
2018-01-31 03:04:57 +00:00
|
|
|
needs_restart = blk_mq_sched_needs_restart(hctx);
|
|
|
|
if (!needs_restart ||
|
2017-11-09 15:32:43 +00:00
|
|
|
(no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
|
2017-01-17 13:03:22 +00:00
|
|
|
blk_mq_run_hw_queue(hctx, true);
|
2020-04-20 16:24:51 +00:00
|
|
|
else if (needs_restart && (ret == BLK_STS_RESOURCE ||
|
|
|
|
no_budget_avail))
|
2018-01-31 03:04:57 +00:00
|
|
|
blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
|
blk-mq: don't queue more if we get a busy return
Some devices have different queue limits depending on the type of IO. A
classic case is SATA NCQ, where some commands can queue, but others
cannot. If we have NCQ commands inflight and encounter a non-queueable
command, the driver returns busy. Currently we attempt to dispatch more
from the scheduler, if we were able to queue some commands. But for the
case where we ended up stopping due to BUSY, we should not attempt to
retrieve more from the scheduler. If we do, we can get into a situation
where we attempt to queue a non-queueable command, get BUSY, then
successfully retrieve more commands from that scheduler and queue those.
This can repeat forever, starving the non-queuable command indefinitely.
Fix this by NOT attempting to pull more commands from the scheduler, if
we get a BUSY return. This should also be more optimal in terms of
letting requests stay in the scheduler for as long as possible, if we
get a BUSY due to the regular out-of-tags condition.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-28 17:54:01 +00:00
|
|
|
|
2018-07-03 15:03:16 +00:00
|
|
|
blk_mq_update_dispatch_busy(hctx, true);
|
blk-mq: don't queue more if we get a busy return
Some devices have different queue limits depending on the type of IO. A
classic case is SATA NCQ, where some commands can queue, but others
cannot. If we have NCQ commands inflight and encounter a non-queueable
command, the driver returns busy. Currently we attempt to dispatch more
from the scheduler, if we were able to queue some commands. But for the
case where we ended up stopping due to BUSY, we should not attempt to
retrieve more from the scheduler. If we do, we can get into a situation
where we attempt to queue a non-queueable command, get BUSY, then
successfully retrieve more commands from that scheduler and queue those.
This can repeat forever, starving the non-queuable command indefinitely.
Fix this by NOT attempting to pull more commands from the scheduler, if
we get a BUSY return. This should also be more optimal in terms of
letting requests stay in the scheduler for as long as possible, if we
get a BUSY due to the regular out-of-tags condition.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-28 17:54:01 +00:00
|
|
|
return false;
|
2018-07-03 15:03:16 +00:00
|
|
|
} else
|
|
|
|
blk_mq_update_dispatch_busy(hctx, false);
|
2016-12-07 15:41:17 +00:00
|
|
|
|
2017-03-24 18:04:19 +00:00
|
|
|
return (queued + errors) != 0;
|
2016-12-07 15:41:17 +00:00
|
|
|
}
|
|
|
|
|
2020-01-06 18:08:18 +00:00
|
|
|
/**
|
|
|
|
* __blk_mq_run_hw_queue - Run a hardware queue.
|
|
|
|
* @hctx: Pointer to the hardware queue to run.
|
|
|
|
*
|
|
|
|
* Send pending requests to the hardware.
|
|
|
|
*/
|
2016-11-02 16:09:51 +00:00
|
|
|
static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
|
|
|
|
{
|
|
|
|
int srcu_idx;
|
|
|
|
|
2017-08-01 15:28:24 +00:00
|
|
|
/*
|
|
|
|
* We can't run the queue inline with ints disabled. Ensure that
|
|
|
|
* we catch bad users of this early.
|
|
|
|
*/
|
|
|
|
WARN_ON_ONCE(in_interrupt());
|
|
|
|
|
2018-01-09 16:29:46 +00:00
|
|
|
might_sleep_if(hctx->flags & BLK_MQ_F_BLOCKING);
|
2017-03-30 18:30:39 +00:00
|
|
|
|
2018-01-09 16:29:46 +00:00
|
|
|
hctx_lock(hctx, &srcu_idx);
|
|
|
|
blk_mq_sched_dispatch_requests(hctx);
|
|
|
|
hctx_unlock(hctx, srcu_idx);
|
2016-11-02 16:09:51 +00:00
|
|
|
}
|
|
|
|
|
2018-04-08 09:48:10 +00:00
|
|
|
static inline int blk_mq_first_mapped_cpu(struct blk_mq_hw_ctx *hctx)
|
|
|
|
{
|
|
|
|
int cpu = cpumask_first_and(hctx->cpumask, cpu_online_mask);
|
|
|
|
|
|
|
|
if (cpu >= nr_cpu_ids)
|
|
|
|
cpu = cpumask_first(hctx->cpumask);
|
|
|
|
return cpu;
|
|
|
|
}
|
|
|
|
|
2014-05-07 16:26:44 +00:00
|
|
|
/*
|
|
|
|
* It'd be great if the workqueue API had a way to pass
|
|
|
|
* in a mask and had some smarts for more clever placement.
|
|
|
|
* For now we just round-robin here, switching for every
|
|
|
|
* BLK_MQ_CPU_WORK_BATCH queued items.
|
|
|
|
*/
|
|
|
|
static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
|
|
|
|
{
|
blk-mq: make sure hctx->next_cpu is set correctly
When hctx->next_cpu is set from possible online CPUs, there is one
race in which hctx->next_cpu may be set as >= nr_cpu_ids, and finally
break workqueue.
The race can be triggered in the following two sitations:
1) when one CPU is becoming DEAD, blk_mq_hctx_notify_dead() is called
to dispatch requests from the DEAD cpu context, but at that
time, this DEAD CPU has been cleared from 'cpu_online_mask', so all
CPUs in hctx->cpumask may become offline, and cause hctx->next_cpu set
a bad value.
2) blk_mq_delay_run_hw_queue() is called from CPU B, and found the queue
should be run on the other CPU A, then CPU A may become offline at the
same time and all CPUs in hctx->cpumask become offline.
This patch deals with this issue by re-selecting next CPU, and making
sure it is set correctly.
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-by: "jianchao.wang" <jianchao.w.wang@oracle.com>
Tested-by: "jianchao.wang" <jianchao.w.wang@oracle.com>
Fixes: 20e4d81393 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-17 16:41:51 +00:00
|
|
|
bool tried = false;
|
2018-04-08 09:48:09 +00:00
|
|
|
int next_cpu = hctx->next_cpu;
|
blk-mq: make sure hctx->next_cpu is set correctly
When hctx->next_cpu is set from possible online CPUs, there is one
race in which hctx->next_cpu may be set as >= nr_cpu_ids, and finally
break workqueue.
The race can be triggered in the following two sitations:
1) when one CPU is becoming DEAD, blk_mq_hctx_notify_dead() is called
to dispatch requests from the DEAD cpu context, but at that
time, this DEAD CPU has been cleared from 'cpu_online_mask', so all
CPUs in hctx->cpumask may become offline, and cause hctx->next_cpu set
a bad value.
2) blk_mq_delay_run_hw_queue() is called from CPU B, and found the queue
should be run on the other CPU A, then CPU A may become offline at the
same time and all CPUs in hctx->cpumask become offline.
This patch deals with this issue by re-selecting next CPU, and making
sure it is set correctly.
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-by: "jianchao.wang" <jianchao.w.wang@oracle.com>
Tested-by: "jianchao.wang" <jianchao.w.wang@oracle.com>
Fixes: 20e4d81393 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-17 16:41:51 +00:00
|
|
|
|
2014-11-24 08:27:23 +00:00
|
|
|
if (hctx->queue->nr_hw_queues == 1)
|
|
|
|
return WORK_CPU_UNBOUND;
|
2014-05-07 16:26:44 +00:00
|
|
|
|
|
|
|
if (--hctx->next_cpu_batch <= 0) {
|
blk-mq: make sure hctx->next_cpu is set correctly
When hctx->next_cpu is set from possible online CPUs, there is one
race in which hctx->next_cpu may be set as >= nr_cpu_ids, and finally
break workqueue.
The race can be triggered in the following two sitations:
1) when one CPU is becoming DEAD, blk_mq_hctx_notify_dead() is called
to dispatch requests from the DEAD cpu context, but at that
time, this DEAD CPU has been cleared from 'cpu_online_mask', so all
CPUs in hctx->cpumask may become offline, and cause hctx->next_cpu set
a bad value.
2) blk_mq_delay_run_hw_queue() is called from CPU B, and found the queue
should be run on the other CPU A, then CPU A may become offline at the
same time and all CPUs in hctx->cpumask become offline.
This patch deals with this issue by re-selecting next CPU, and making
sure it is set correctly.
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-by: "jianchao.wang" <jianchao.w.wang@oracle.com>
Tested-by: "jianchao.wang" <jianchao.w.wang@oracle.com>
Fixes: 20e4d81393 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-17 16:41:51 +00:00
|
|
|
select_cpu:
|
2018-04-08 09:48:09 +00:00
|
|
|
next_cpu = cpumask_next_and(next_cpu, hctx->cpumask,
|
2018-01-12 02:53:06 +00:00
|
|
|
cpu_online_mask);
|
2014-05-07 16:26:44 +00:00
|
|
|
if (next_cpu >= nr_cpu_ids)
|
2018-04-08 09:48:10 +00:00
|
|
|
next_cpu = blk_mq_first_mapped_cpu(hctx);
|
2014-05-07 16:26:44 +00:00
|
|
|
hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
|
|
|
|
}
|
|
|
|
|
blk-mq: make sure hctx->next_cpu is set correctly
When hctx->next_cpu is set from possible online CPUs, there is one
race in which hctx->next_cpu may be set as >= nr_cpu_ids, and finally
break workqueue.
The race can be triggered in the following two sitations:
1) when one CPU is becoming DEAD, blk_mq_hctx_notify_dead() is called
to dispatch requests from the DEAD cpu context, but at that
time, this DEAD CPU has been cleared from 'cpu_online_mask', so all
CPUs in hctx->cpumask may become offline, and cause hctx->next_cpu set
a bad value.
2) blk_mq_delay_run_hw_queue() is called from CPU B, and found the queue
should be run on the other CPU A, then CPU A may become offline at the
same time and all CPUs in hctx->cpumask become offline.
This patch deals with this issue by re-selecting next CPU, and making
sure it is set correctly.
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-by: "jianchao.wang" <jianchao.w.wang@oracle.com>
Tested-by: "jianchao.wang" <jianchao.w.wang@oracle.com>
Fixes: 20e4d81393 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-17 16:41:51 +00:00
|
|
|
/*
|
|
|
|
* Do unbound schedule if we can't find a online CPU for this hctx,
|
|
|
|
* and it should only happen in the path of handling CPU DEAD.
|
|
|
|
*/
|
2018-04-08 09:48:09 +00:00
|
|
|
if (!cpu_online(next_cpu)) {
|
blk-mq: make sure hctx->next_cpu is set correctly
When hctx->next_cpu is set from possible online CPUs, there is one
race in which hctx->next_cpu may be set as >= nr_cpu_ids, and finally
break workqueue.
The race can be triggered in the following two sitations:
1) when one CPU is becoming DEAD, blk_mq_hctx_notify_dead() is called
to dispatch requests from the DEAD cpu context, but at that
time, this DEAD CPU has been cleared from 'cpu_online_mask', so all
CPUs in hctx->cpumask may become offline, and cause hctx->next_cpu set
a bad value.
2) blk_mq_delay_run_hw_queue() is called from CPU B, and found the queue
should be run on the other CPU A, then CPU A may become offline at the
same time and all CPUs in hctx->cpumask become offline.
This patch deals with this issue by re-selecting next CPU, and making
sure it is set correctly.
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-by: "jianchao.wang" <jianchao.w.wang@oracle.com>
Tested-by: "jianchao.wang" <jianchao.w.wang@oracle.com>
Fixes: 20e4d81393 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-17 16:41:51 +00:00
|
|
|
if (!tried) {
|
|
|
|
tried = true;
|
|
|
|
goto select_cpu;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Make sure to re-select CPU next time once after CPUs
|
|
|
|
* in hctx->cpumask become online again.
|
|
|
|
*/
|
2018-04-08 09:48:09 +00:00
|
|
|
hctx->next_cpu = next_cpu;
|
blk-mq: make sure hctx->next_cpu is set correctly
When hctx->next_cpu is set from possible online CPUs, there is one
race in which hctx->next_cpu may be set as >= nr_cpu_ids, and finally
break workqueue.
The race can be triggered in the following two sitations:
1) when one CPU is becoming DEAD, blk_mq_hctx_notify_dead() is called
to dispatch requests from the DEAD cpu context, but at that
time, this DEAD CPU has been cleared from 'cpu_online_mask', so all
CPUs in hctx->cpumask may become offline, and cause hctx->next_cpu set
a bad value.
2) blk_mq_delay_run_hw_queue() is called from CPU B, and found the queue
should be run on the other CPU A, then CPU A may become offline at the
same time and all CPUs in hctx->cpumask become offline.
This patch deals with this issue by re-selecting next CPU, and making
sure it is set correctly.
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-by: "jianchao.wang" <jianchao.w.wang@oracle.com>
Tested-by: "jianchao.wang" <jianchao.w.wang@oracle.com>
Fixes: 20e4d81393 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-17 16:41:51 +00:00
|
|
|
hctx->next_cpu_batch = 1;
|
|
|
|
return WORK_CPU_UNBOUND;
|
|
|
|
}
|
2018-04-08 09:48:09 +00:00
|
|
|
|
|
|
|
hctx->next_cpu = next_cpu;
|
|
|
|
return next_cpu;
|
2014-05-07 16:26:44 +00:00
|
|
|
}
|
|
|
|
|
2020-01-06 18:08:18 +00:00
|
|
|
/**
|
|
|
|
* __blk_mq_delay_run_hw_queue - Run (or schedule to run) a hardware queue.
|
|
|
|
* @hctx: Pointer to the hardware queue to run.
|
|
|
|
* @async: If we want to run the queue asynchronously.
|
2020-12-04 15:20:55 +00:00
|
|
|
* @msecs: Milliseconds of delay to wait before running the queue.
|
2020-01-06 18:08:18 +00:00
|
|
|
*
|
|
|
|
* If !@async, try to run the queue now. Else, run the queue asynchronously and
|
|
|
|
* with a delay of @msecs.
|
|
|
|
*/
|
2017-04-07 18:16:52 +00:00
|
|
|
static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async,
|
|
|
|
unsigned long msecs)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
{
|
2017-06-20 18:15:49 +00:00
|
|
|
if (unlikely(blk_mq_hctx_stopped(hctx)))
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
return;
|
|
|
|
|
2016-09-21 16:12:13 +00:00
|
|
|
if (!async && !(hctx->flags & BLK_MQ_F_BLOCKING)) {
|
2014-11-07 22:04:00 +00:00
|
|
|
int cpu = get_cpu();
|
|
|
|
if (cpumask_test_cpu(cpu, hctx->cpumask)) {
|
2014-11-07 22:03:59 +00:00
|
|
|
__blk_mq_run_hw_queue(hctx);
|
2014-11-07 22:04:00 +00:00
|
|
|
put_cpu();
|
2014-11-07 22:03:59 +00:00
|
|
|
return;
|
|
|
|
}
|
2014-04-09 16:18:23 +00:00
|
|
|
|
2014-11-07 22:04:00 +00:00
|
|
|
put_cpu();
|
2014-04-09 16:18:23 +00:00
|
|
|
}
|
2014-11-07 22:03:59 +00:00
|
|
|
|
2018-01-19 16:58:55 +00:00
|
|
|
kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx), &hctx->run_work,
|
|
|
|
msecs_to_jiffies(msecs));
|
2017-04-07 18:16:52 +00:00
|
|
|
}
|
|
|
|
|
2020-01-06 18:08:18 +00:00
|
|
|
/**
|
|
|
|
* blk_mq_delay_run_hw_queue - Run a hardware queue asynchronously.
|
|
|
|
* @hctx: Pointer to the hardware queue to run.
|
2020-12-04 15:20:55 +00:00
|
|
|
* @msecs: Milliseconds of delay to wait before running the queue.
|
2020-01-06 18:08:18 +00:00
|
|
|
*
|
|
|
|
* Run a hardware queue asynchronously with a delay of @msecs.
|
|
|
|
*/
|
2017-04-07 18:16:52 +00:00
|
|
|
void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
|
|
|
|
{
|
|
|
|
__blk_mq_delay_run_hw_queue(hctx, true, msecs);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_mq_delay_run_hw_queue);
|
|
|
|
|
2020-01-06 18:08:18 +00:00
|
|
|
/**
|
|
|
|
* blk_mq_run_hw_queue - Start to run a hardware queue.
|
|
|
|
* @hctx: Pointer to the hardware queue to run.
|
|
|
|
* @async: If we want to run the queue asynchronously.
|
|
|
|
*
|
|
|
|
* Check if the request queue is not in a quiesced state and if there are
|
|
|
|
* pending requests to be sent. If this is true, run the queue to send requests
|
|
|
|
* to hardware.
|
|
|
|
*/
|
2019-10-29 16:59:30 +00:00
|
|
|
void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
|
2017-04-07 18:16:52 +00:00
|
|
|
{
|
2018-01-06 08:27:38 +00:00
|
|
|
int srcu_idx;
|
|
|
|
bool need_run;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* When queue is quiesced, we may be switching io scheduler, or
|
|
|
|
* updating nr_hw_queues, or other things, and we can't run queue
|
|
|
|
* any more, even __blk_mq_hctx_has_pending() can't be called safely.
|
|
|
|
*
|
|
|
|
* And queue will be rerun in blk_mq_unquiesce_queue() if it is
|
|
|
|
* quiesced.
|
|
|
|
*/
|
2018-01-09 16:29:46 +00:00
|
|
|
hctx_lock(hctx, &srcu_idx);
|
|
|
|
need_run = !blk_queue_quiesced(hctx->queue) &&
|
|
|
|
blk_mq_hctx_has_pending(hctx);
|
|
|
|
hctx_unlock(hctx, srcu_idx);
|
2018-01-06 08:27:38 +00:00
|
|
|
|
2019-10-29 16:59:30 +00:00
|
|
|
if (need_run)
|
2017-11-10 16:13:21 +00:00
|
|
|
__blk_mq_delay_run_hw_queue(hctx, async, 0);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
}
|
2017-04-14 08:00:00 +00:00
|
|
|
EXPORT_SYMBOL(blk_mq_run_hw_queue);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
2021-01-11 16:47:17 +00:00
|
|
|
/*
|
|
|
|
* Is the request queue handled by an IO scheduler that does not respect
|
|
|
|
* hardware queues when dispatching?
|
|
|
|
*/
|
|
|
|
static bool blk_mq_has_sqsched(struct request_queue *q)
|
|
|
|
{
|
|
|
|
struct elevator_queue *e = q->elevator;
|
|
|
|
|
|
|
|
if (e && e->type->ops.dispatch_request &&
|
|
|
|
!(e->type->elevator_features & ELEVATOR_F_MQ_AWARE))
|
|
|
|
return true;
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Return prefered queue to dispatch from (if any) for non-mq aware IO
|
|
|
|
* scheduler.
|
|
|
|
*/
|
|
|
|
static struct blk_mq_hw_ctx *blk_mq_get_sq_hctx(struct request_queue *q)
|
|
|
|
{
|
|
|
|
struct blk_mq_hw_ctx *hctx;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the IO scheduler does not respect hardware queues when
|
|
|
|
* dispatching, we just don't bother with multiple HW queues and
|
|
|
|
* dispatch from hctx for the current CPU since running multiple queues
|
|
|
|
* just causes lock contention inside the scheduler and pointless cache
|
|
|
|
* bouncing.
|
|
|
|
*/
|
|
|
|
hctx = blk_mq_map_queue_type(q, HCTX_TYPE_DEFAULT,
|
|
|
|
raw_smp_processor_id());
|
|
|
|
if (!blk_mq_hctx_stopped(hctx))
|
|
|
|
return hctx;
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2020-01-06 18:08:18 +00:00
|
|
|
/**
|
2020-10-23 16:32:54 +00:00
|
|
|
* blk_mq_run_hw_queues - Run all hardware queues in a request queue.
|
2020-01-06 18:08:18 +00:00
|
|
|
* @q: Pointer to the request queue to run.
|
|
|
|
* @async: If we want to run the queue asynchronously.
|
|
|
|
*/
|
2015-03-12 03:56:38 +00:00
|
|
|
void blk_mq_run_hw_queues(struct request_queue *q, bool async)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
{
|
2021-01-11 16:47:17 +00:00
|
|
|
struct blk_mq_hw_ctx *hctx, *sq_hctx;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
int i;
|
|
|
|
|
2021-01-11 16:47:17 +00:00
|
|
|
sq_hctx = NULL;
|
|
|
|
if (blk_mq_has_sqsched(q))
|
|
|
|
sq_hctx = blk_mq_get_sq_hctx(q);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
queue_for_each_hw_ctx(q, hctx, i) {
|
2017-11-10 16:13:21 +00:00
|
|
|
if (blk_mq_hctx_stopped(hctx))
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
continue;
|
2021-01-11 16:47:17 +00:00
|
|
|
/*
|
|
|
|
* Dispatch from this hctx either if there's no hctx preferred
|
|
|
|
* by IO scheduler or if it has requests that bypass the
|
|
|
|
* scheduler.
|
|
|
|
*/
|
|
|
|
if (!sq_hctx || sq_hctx == hctx ||
|
|
|
|
!list_empty_careful(&hctx->dispatch))
|
|
|
|
blk_mq_run_hw_queue(hctx, async);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
}
|
|
|
|
}
|
2015-03-12 03:56:38 +00:00
|
|
|
EXPORT_SYMBOL(blk_mq_run_hw_queues);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
2020-04-20 16:24:52 +00:00
|
|
|
/**
|
|
|
|
* blk_mq_delay_run_hw_queues - Run all hardware queues asynchronously.
|
|
|
|
* @q: Pointer to the request queue to run.
|
2020-12-04 15:20:55 +00:00
|
|
|
* @msecs: Milliseconds of delay to wait before running the queues.
|
2020-04-20 16:24:52 +00:00
|
|
|
*/
|
|
|
|
void blk_mq_delay_run_hw_queues(struct request_queue *q, unsigned long msecs)
|
|
|
|
{
|
2021-01-11 16:47:17 +00:00
|
|
|
struct blk_mq_hw_ctx *hctx, *sq_hctx;
|
2020-04-20 16:24:52 +00:00
|
|
|
int i;
|
|
|
|
|
2021-01-11 16:47:17 +00:00
|
|
|
sq_hctx = NULL;
|
|
|
|
if (blk_mq_has_sqsched(q))
|
|
|
|
sq_hctx = blk_mq_get_sq_hctx(q);
|
2020-04-20 16:24:52 +00:00
|
|
|
queue_for_each_hw_ctx(q, hctx, i) {
|
|
|
|
if (blk_mq_hctx_stopped(hctx))
|
|
|
|
continue;
|
2021-01-11 16:47:17 +00:00
|
|
|
/*
|
|
|
|
* Dispatch from this hctx either if there's no hctx preferred
|
|
|
|
* by IO scheduler or if it has requests that bypass the
|
|
|
|
* scheduler.
|
|
|
|
*/
|
|
|
|
if (!sq_hctx || sq_hctx == hctx ||
|
|
|
|
!list_empty_careful(&hctx->dispatch))
|
|
|
|
blk_mq_delay_run_hw_queue(hctx, msecs);
|
2020-04-20 16:24:52 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_mq_delay_run_hw_queues);
|
|
|
|
|
2016-10-29 00:19:37 +00:00
|
|
|
/**
|
|
|
|
* blk_mq_queue_stopped() - check whether one or more hctxs have been stopped
|
|
|
|
* @q: request queue.
|
|
|
|
*
|
|
|
|
* The caller is responsible for serializing this function against
|
|
|
|
* blk_mq_{start,stop}_hw_queue().
|
|
|
|
*/
|
|
|
|
bool blk_mq_queue_stopped(struct request_queue *q)
|
|
|
|
{
|
|
|
|
struct blk_mq_hw_ctx *hctx;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
queue_for_each_hw_ctx(q, hctx, i)
|
|
|
|
if (blk_mq_hctx_stopped(hctx))
|
|
|
|
return true;
|
|
|
|
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_mq_queue_stopped);
|
|
|
|
|
2017-06-06 15:22:09 +00:00
|
|
|
/*
|
|
|
|
* This function is often used for pausing .queue_rq() by driver when
|
|
|
|
* there isn't enough resource or some conditions aren't satisfied, and
|
2017-08-17 23:23:00 +00:00
|
|
|
* BLK_STS_RESOURCE is usually returned.
|
2017-06-06 15:22:09 +00:00
|
|
|
*
|
|
|
|
* We do not guarantee that dispatch can be drained or blocked
|
|
|
|
* after blk_mq_stop_hw_queue() returns. Please use
|
|
|
|
* blk_mq_quiesce_queue() for that requirement.
|
|
|
|
*/
|
2017-05-03 17:08:14 +00:00
|
|
|
void blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx)
|
|
|
|
{
|
2017-06-06 15:22:10 +00:00
|
|
|
cancel_delayed_work(&hctx->run_work);
|
2013-10-25 13:45:58 +00:00
|
|
|
|
2017-06-06 15:22:10 +00:00
|
|
|
set_bit(BLK_MQ_S_STOPPED, &hctx->state);
|
2017-05-03 17:08:14 +00:00
|
|
|
}
|
2017-06-06 15:22:10 +00:00
|
|
|
EXPORT_SYMBOL(blk_mq_stop_hw_queue);
|
2017-05-03 17:08:14 +00:00
|
|
|
|
2017-06-06 15:22:09 +00:00
|
|
|
/*
|
|
|
|
* This function is often used for pausing .queue_rq() by driver when
|
|
|
|
* there isn't enough resource or some conditions aren't satisfied, and
|
2017-08-17 23:23:00 +00:00
|
|
|
* BLK_STS_RESOURCE is usually returned.
|
2017-06-06 15:22:09 +00:00
|
|
|
*
|
|
|
|
* We do not guarantee that dispatch can be drained or blocked
|
|
|
|
* after blk_mq_stop_hw_queues() returns. Please use
|
|
|
|
* blk_mq_quiesce_queue() for that requirement.
|
|
|
|
*/
|
2017-05-03 17:08:14 +00:00
|
|
|
void blk_mq_stop_hw_queues(struct request_queue *q)
|
|
|
|
{
|
2017-06-06 15:22:10 +00:00
|
|
|
struct blk_mq_hw_ctx *hctx;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
queue_for_each_hw_ctx(q, hctx, i)
|
|
|
|
blk_mq_stop_hw_queue(hctx);
|
2013-10-25 13:45:58 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_mq_stop_hw_queues);
|
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
void blk_mq_start_hw_queue(struct blk_mq_hw_ctx *hctx)
|
|
|
|
{
|
|
|
|
clear_bit(BLK_MQ_S_STOPPED, &hctx->state);
|
2014-04-09 16:18:23 +00:00
|
|
|
|
2014-06-25 14:22:34 +00:00
|
|
|
blk_mq_run_hw_queue(hctx, false);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_mq_start_hw_queue);
|
|
|
|
|
2014-04-16 07:44:56 +00:00
|
|
|
void blk_mq_start_hw_queues(struct request_queue *q)
|
|
|
|
{
|
|
|
|
struct blk_mq_hw_ctx *hctx;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
queue_for_each_hw_ctx(q, hctx, i)
|
|
|
|
blk_mq_start_hw_queue(hctx);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_mq_start_hw_queues);
|
|
|
|
|
2016-12-08 20:19:30 +00:00
|
|
|
void blk_mq_start_stopped_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
|
|
|
|
{
|
|
|
|
if (!blk_mq_hctx_stopped(hctx))
|
|
|
|
return;
|
|
|
|
|
|
|
|
clear_bit(BLK_MQ_S_STOPPED, &hctx->state);
|
|
|
|
blk_mq_run_hw_queue(hctx, async);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_start_stopped_hw_queue);
|
|
|
|
|
2014-04-16 07:44:54 +00:00
|
|
|
void blk_mq_start_stopped_hw_queues(struct request_queue *q, bool async)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
{
|
|
|
|
struct blk_mq_hw_ctx *hctx;
|
|
|
|
int i;
|
|
|
|
|
2016-12-08 20:19:30 +00:00
|
|
|
queue_for_each_hw_ctx(q, hctx, i)
|
|
|
|
blk_mq_start_stopped_hw_queue(hctx, async);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_mq_start_stopped_hw_queues);
|
|
|
|
|
2014-04-16 16:48:08 +00:00
|
|
|
static void blk_mq_run_work_fn(struct work_struct *work)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
{
|
|
|
|
struct blk_mq_hw_ctx *hctx;
|
|
|
|
|
2017-04-10 15:54:54 +00:00
|
|
|
hctx = container_of(work, struct blk_mq_hw_ctx, run_work.work);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
2017-04-10 15:54:56 +00:00
|
|
|
/*
|
2018-04-08 09:48:11 +00:00
|
|
|
* If we are stopped, don't run the queue.
|
2017-04-10 15:54:56 +00:00
|
|
|
*/
|
2020-10-09 03:26:30 +00:00
|
|
|
if (blk_mq_hctx_stopped(hctx))
|
2018-06-04 09:03:55 +00:00
|
|
|
return;
|
2017-04-07 18:16:52 +00:00
|
|
|
|
|
|
|
__blk_mq_run_hw_queue(hctx);
|
|
|
|
}
|
|
|
|
|
2015-10-20 15:13:57 +00:00
|
|
|
static inline void __blk_mq_insert_req_list(struct blk_mq_hw_ctx *hctx,
|
|
|
|
struct request *rq,
|
|
|
|
bool at_head)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
{
|
2016-08-24 21:34:35 +00:00
|
|
|
struct blk_mq_ctx *ctx = rq->mq_ctx;
|
2018-12-17 15:44:05 +00:00
|
|
|
enum hctx_type type = hctx->type;
|
2016-08-24 21:34:35 +00:00
|
|
|
|
2017-06-20 18:15:47 +00:00
|
|
|
lockdep_assert_held(&ctx->lock);
|
|
|
|
|
2020-12-03 16:21:39 +00:00
|
|
|
trace_block_rq_insert(rq);
|
2013-11-20 01:59:10 +00:00
|
|
|
|
2014-02-07 18:22:36 +00:00
|
|
|
if (at_head)
|
2018-12-17 15:44:05 +00:00
|
|
|
list_add(&rq->queuelist, &ctx->rq_lists[type]);
|
2014-02-07 18:22:36 +00:00
|
|
|
else
|
2018-12-17 15:44:05 +00:00
|
|
|
list_add_tail(&rq->queuelist, &ctx->rq_lists[type]);
|
2015-10-20 15:13:57 +00:00
|
|
|
}
|
2014-05-09 15:36:49 +00:00
|
|
|
|
2016-12-14 21:34:47 +00:00
|
|
|
void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
|
|
|
|
bool at_head)
|
2015-10-20 15:13:57 +00:00
|
|
|
{
|
|
|
|
struct blk_mq_ctx *ctx = rq->mq_ctx;
|
|
|
|
|
2017-06-20 18:15:47 +00:00
|
|
|
lockdep_assert_held(&ctx->lock);
|
|
|
|
|
2016-08-24 21:34:35 +00:00
|
|
|
__blk_mq_insert_req_list(hctx, rq, at_head);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
blk_mq_hctx_mark_pending(hctx, ctx);
|
|
|
|
}
|
|
|
|
|
2020-01-06 18:08:18 +00:00
|
|
|
/**
|
|
|
|
* blk_mq_request_bypass_insert - Insert a request at dispatch list.
|
|
|
|
* @rq: Pointer to request to be inserted.
|
2020-08-16 23:39:34 +00:00
|
|
|
* @at_head: true if the request should be inserted at the head of the list.
|
2020-01-06 18:08:18 +00:00
|
|
|
* @run_queue: If we should run the hardware queue after inserting the request.
|
|
|
|
*
|
2017-09-11 22:43:57 +00:00
|
|
|
* Should only be used carefully, when the caller knows we want to
|
|
|
|
* bypass a potential IO scheduler on the target device.
|
|
|
|
*/
|
2020-02-25 01:04:32 +00:00
|
|
|
void blk_mq_request_bypass_insert(struct request *rq, bool at_head,
|
|
|
|
bool run_queue)
|
2017-09-11 22:43:57 +00:00
|
|
|
{
|
2018-10-29 21:06:13 +00:00
|
|
|
struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
|
2017-09-11 22:43:57 +00:00
|
|
|
|
|
|
|
spin_lock(&hctx->lock);
|
2020-02-25 01:04:32 +00:00
|
|
|
if (at_head)
|
|
|
|
list_add(&rq->queuelist, &hctx->dispatch);
|
|
|
|
else
|
|
|
|
list_add_tail(&rq->queuelist, &hctx->dispatch);
|
2017-09-11 22:43:57 +00:00
|
|
|
spin_unlock(&hctx->lock);
|
|
|
|
|
2017-11-02 15:24:34 +00:00
|
|
|
if (run_queue)
|
|
|
|
blk_mq_run_hw_queue(hctx, false);
|
2017-09-11 22:43:57 +00:00
|
|
|
}
|
|
|
|
|
2017-01-17 13:03:22 +00:00
|
|
|
void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
|
|
|
|
struct list_head *list)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
|
|
|
{
|
2018-07-02 09:35:58 +00:00
|
|
|
struct request *rq;
|
2018-12-17 15:44:05 +00:00
|
|
|
enum hctx_type type = hctx->type;
|
2018-07-02 09:35:58 +00:00
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
/*
|
|
|
|
* preemption doesn't flush plug list, so it's possible ctx->cpu is
|
|
|
|
* offline now
|
|
|
|
*/
|
2018-07-02 09:35:58 +00:00
|
|
|
list_for_each_entry(rq, list, queuelist) {
|
2016-08-24 21:34:35 +00:00
|
|
|
BUG_ON(rq->mq_ctx != ctx);
|
2020-12-03 16:21:39 +00:00
|
|
|
trace_block_rq_insert(rq);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
}
|
2018-07-02 09:35:58 +00:00
|
|
|
|
|
|
|
spin_lock(&ctx->lock);
|
2018-12-17 15:44:05 +00:00
|
|
|
list_splice_tail_init(list, &ctx->rq_lists[type]);
|
2015-10-20 15:13:57 +00:00
|
|
|
blk_mq_hctx_mark_pending(hctx, ctx);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
spin_unlock(&ctx->lock);
|
|
|
|
}
|
|
|
|
|
2021-04-08 18:28:34 +00:00
|
|
|
static int plug_rq_cmp(void *priv, const struct list_head *a,
|
|
|
|
const struct list_head *b)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
{
|
|
|
|
struct request *rqa = container_of(a, struct request, queuelist);
|
|
|
|
struct request *rqb = container_of(b, struct request, queuelist);
|
|
|
|
|
2019-11-28 21:11:53 +00:00
|
|
|
if (rqa->mq_ctx != rqb->mq_ctx)
|
|
|
|
return rqa->mq_ctx > rqb->mq_ctx;
|
|
|
|
if (rqa->mq_hctx != rqb->mq_hctx)
|
|
|
|
return rqa->mq_hctx > rqb->mq_hctx;
|
2018-10-30 18:24:04 +00:00
|
|
|
|
|
|
|
return blk_rq_pos(rqa) > blk_rq_pos(rqb);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
|
|
|
|
{
|
|
|
|
LIST_HEAD(list);
|
|
|
|
|
2019-11-28 21:11:55 +00:00
|
|
|
if (list_empty(&plug->mq_list))
|
|
|
|
return;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
list_splice_init(&plug->mq_list, &list);
|
|
|
|
|
2018-11-28 00:13:56 +00:00
|
|
|
if (plug->rq_count > 2 && plug->multiple_queues)
|
|
|
|
list_sort(NULL, &list, plug_rq_cmp);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
2019-04-04 02:57:44 +00:00
|
|
|
plug->rq_count = 0;
|
|
|
|
|
2019-11-28 21:11:55 +00:00
|
|
|
do {
|
|
|
|
struct list_head rq_list;
|
|
|
|
struct request *rq, *head_rq = list_entry_rq(list.next);
|
|
|
|
struct list_head *pos = &head_rq->queuelist; /* skip first */
|
|
|
|
struct blk_mq_hw_ctx *this_hctx = head_rq->mq_hctx;
|
|
|
|
struct blk_mq_ctx *this_ctx = head_rq->mq_ctx;
|
|
|
|
unsigned int depth = 1;
|
|
|
|
|
|
|
|
list_for_each_continue(pos, &list) {
|
|
|
|
rq = list_entry_rq(pos);
|
|
|
|
BUG_ON(!rq->q);
|
|
|
|
if (rq->mq_hctx != this_hctx || rq->mq_ctx != this_ctx)
|
|
|
|
break;
|
|
|
|
depth++;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
}
|
|
|
|
|
2019-11-28 21:11:55 +00:00
|
|
|
list_cut_before(&rq_list, &list, pos);
|
|
|
|
trace_block_unplug(head_rq->q, depth, !from_schedule);
|
2018-10-30 17:31:51 +00:00
|
|
|
blk_mq_sched_insert_requests(this_hctx, this_ctx, &rq_list,
|
2017-01-17 13:03:22 +00:00
|
|
|
from_schedule);
|
2019-11-28 21:11:55 +00:00
|
|
|
} while(!list_empty(&list));
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
}
|
|
|
|
|
2019-06-06 10:29:01 +00:00
|
|
|
static void blk_mq_bio_to_request(struct request *rq, struct bio *bio,
|
|
|
|
unsigned int nr_segs)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
{
|
2020-09-16 03:53:14 +00:00
|
|
|
int err;
|
|
|
|
|
2019-06-06 10:29:00 +00:00
|
|
|
if (bio->bi_opf & REQ_RAHEAD)
|
|
|
|
rq->cmd_flags |= REQ_FAILFAST_MASK;
|
|
|
|
|
|
|
|
rq->__sector = bio->bi_iter.bi_sector;
|
|
|
|
rq->write_hint = bio->bi_write_hint;
|
2019-06-06 10:29:01 +00:00
|
|
|
blk_rq_bio_prep(rq, bio, nr_segs);
|
2020-09-16 03:53:14 +00:00
|
|
|
|
|
|
|
/* This can't fail, since GFP_NOIO includes __GFP_DIRECT_RECLAIM. */
|
|
|
|
err = blk_crypto_rq_bio_prep(rq, bio, GFP_NOIO);
|
|
|
|
WARN_ON_ONCE(err);
|
2014-05-29 17:00:11 +00:00
|
|
|
|
2020-05-27 05:24:16 +00:00
|
|
|
blk_account_io_start(rq);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
}
|
|
|
|
|
2018-01-17 16:25:56 +00:00
|
|
|
static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
|
|
|
|
struct request *rq,
|
2018-11-24 17:15:46 +00:00
|
|
|
blk_qc_t *cookie, bool last)
|
2015-05-08 17:51:32 +00:00
|
|
|
{
|
|
|
|
struct request_queue *q = rq->q;
|
|
|
|
struct blk_mq_queue_data bd = {
|
|
|
|
.rq = rq,
|
2018-11-24 17:15:46 +00:00
|
|
|
.last = last,
|
2015-05-08 17:51:32 +00:00
|
|
|
};
|
2017-01-17 13:03:22 +00:00
|
|
|
blk_qc_t new_cookie;
|
2017-06-12 17:22:46 +00:00
|
|
|
blk_status_t ret;
|
2018-01-17 16:25:56 +00:00
|
|
|
|
|
|
|
new_cookie = request_to_qc_t(hctx, rq);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* For OK queue, we are done. For error, caller may kill it.
|
|
|
|
* Any other error (busy), just add it to our list as we
|
|
|
|
* previously would have done.
|
|
|
|
*/
|
|
|
|
ret = q->mq_ops->queue_rq(hctx, &bd);
|
|
|
|
switch (ret) {
|
|
|
|
case BLK_STS_OK:
|
2018-07-10 01:03:31 +00:00
|
|
|
blk_mq_update_dispatch_busy(hctx, false);
|
2018-01-17 16:25:56 +00:00
|
|
|
*cookie = new_cookie;
|
|
|
|
break;
|
|
|
|
case BLK_STS_RESOURCE:
|
2018-01-31 03:04:57 +00:00
|
|
|
case BLK_STS_DEV_RESOURCE:
|
2018-07-10 01:03:31 +00:00
|
|
|
blk_mq_update_dispatch_busy(hctx, true);
|
2018-01-17 16:25:56 +00:00
|
|
|
__blk_mq_requeue_request(rq);
|
|
|
|
break;
|
|
|
|
default:
|
2018-07-10 01:03:31 +00:00
|
|
|
blk_mq_update_dispatch_busy(hctx, false);
|
2018-01-17 16:25:56 +00:00
|
|
|
*cookie = BLK_QC_T_NONE;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2019-04-04 17:08:43 +00:00
|
|
|
static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
|
2018-01-17 16:25:56 +00:00
|
|
|
struct request *rq,
|
2018-01-17 16:25:57 +00:00
|
|
|
blk_qc_t *cookie,
|
2019-04-04 17:08:43 +00:00
|
|
|
bool bypass_insert, bool last)
|
2018-01-17 16:25:56 +00:00
|
|
|
{
|
|
|
|
struct request_queue *q = rq->q;
|
2017-06-06 15:22:00 +00:00
|
|
|
bool run_queue = true;
|
2021-01-22 02:33:12 +00:00
|
|
|
int budget_token;
|
2017-06-06 15:22:00 +00:00
|
|
|
|
2018-01-18 04:06:59 +00:00
|
|
|
/*
|
2019-04-04 17:08:43 +00:00
|
|
|
* RCU or SRCU read lock is needed before checking quiesced flag.
|
2018-01-18 04:06:59 +00:00
|
|
|
*
|
2019-04-04 17:08:43 +00:00
|
|
|
* When queue is stopped or quiesced, ignore 'bypass_insert' from
|
|
|
|
* blk_mq_request_issue_directly(), and return BLK_STS_OK to caller,
|
|
|
|
* and avoid driver to try to dispatch again.
|
2018-01-18 04:06:59 +00:00
|
|
|
*/
|
2019-04-04 17:08:43 +00:00
|
|
|
if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)) {
|
2017-06-06 15:22:00 +00:00
|
|
|
run_queue = false;
|
2019-04-04 17:08:43 +00:00
|
|
|
bypass_insert = false;
|
|
|
|
goto insert;
|
2017-06-06 15:22:00 +00:00
|
|
|
}
|
2015-05-08 17:51:32 +00:00
|
|
|
|
2019-04-04 17:08:43 +00:00
|
|
|
if (q->elevator && !bypass_insert)
|
|
|
|
goto insert;
|
2016-10-29 00:20:02 +00:00
|
|
|
|
2021-01-22 02:33:12 +00:00
|
|
|
budget_token = blk_mq_get_dispatch_budget(q);
|
|
|
|
if (budget_token < 0)
|
2019-04-04 17:08:43 +00:00
|
|
|
goto insert;
|
2017-01-17 13:03:22 +00:00
|
|
|
|
2021-01-22 02:33:12 +00:00
|
|
|
blk_mq_set_rq_budget_token(rq, budget_token);
|
|
|
|
|
2018-06-25 11:31:45 +00:00
|
|
|
if (!blk_mq_get_driver_tag(rq)) {
|
2021-01-22 02:33:12 +00:00
|
|
|
blk_mq_put_dispatch_budget(q, budget_token);
|
2019-04-04 17:08:43 +00:00
|
|
|
goto insert;
|
2017-11-04 18:21:12 +00:00
|
|
|
}
|
2017-10-14 09:22:29 +00:00
|
|
|
|
2019-04-04 17:08:43 +00:00
|
|
|
return __blk_mq_issue_directly(hctx, rq, cookie, last);
|
|
|
|
insert:
|
|
|
|
if (bypass_insert)
|
|
|
|
return BLK_STS_RESOURCE;
|
|
|
|
|
2020-08-18 09:07:28 +00:00
|
|
|
blk_mq_sched_insert_request(rq, false, run_queue, false);
|
|
|
|
|
2019-04-04 17:08:43 +00:00
|
|
|
return BLK_STS_OK;
|
|
|
|
}
|
|
|
|
|
2020-01-06 18:08:18 +00:00
|
|
|
/**
|
|
|
|
* blk_mq_try_issue_directly - Try to send a request directly to device driver.
|
|
|
|
* @hctx: Pointer of the associated hardware queue.
|
|
|
|
* @rq: Pointer to request to be sent.
|
|
|
|
* @cookie: Request queue cookie.
|
|
|
|
*
|
|
|
|
* If the device has enough resources to accept a new request now, send the
|
|
|
|
* request directly to device driver. Else, insert at hctx->dispatch queue, so
|
|
|
|
* we can try send it another time in the future. Requests inserted at this
|
|
|
|
* queue have higher priority.
|
|
|
|
*/
|
2019-04-04 17:08:43 +00:00
|
|
|
static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
|
|
|
|
struct request *rq, blk_qc_t *cookie)
|
|
|
|
{
|
|
|
|
blk_status_t ret;
|
|
|
|
int srcu_idx;
|
|
|
|
|
|
|
|
might_sleep_if(hctx->flags & BLK_MQ_F_BLOCKING);
|
|
|
|
|
|
|
|
hctx_lock(hctx, &srcu_idx);
|
|
|
|
|
|
|
|
ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false, true);
|
|
|
|
if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE)
|
2020-02-25 01:04:32 +00:00
|
|
|
blk_mq_request_bypass_insert(rq, false, true);
|
2019-04-04 17:08:43 +00:00
|
|
|
else if (ret != BLK_STS_OK)
|
|
|
|
blk_mq_end_request(rq, ret);
|
|
|
|
|
|
|
|
hctx_unlock(hctx, srcu_idx);
|
|
|
|
}
|
|
|
|
|
|
|
|
blk_status_t blk_mq_request_issue_directly(struct request *rq, bool last)
|
|
|
|
{
|
|
|
|
blk_status_t ret;
|
|
|
|
int srcu_idx;
|
|
|
|
blk_qc_t unused_cookie;
|
|
|
|
struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
|
|
|
|
|
|
|
|
hctx_lock(hctx, &srcu_idx);
|
|
|
|
ret = __blk_mq_try_issue_directly(hctx, rq, &unused_cookie, true, last);
|
2018-01-09 16:29:46 +00:00
|
|
|
hctx_unlock(hctx, srcu_idx);
|
2018-12-14 01:28:18 +00:00
|
|
|
|
|
|
|
return ret;
|
2017-03-22 19:01:51 +00:00
|
|
|
}
|
|
|
|
|
2018-07-10 01:03:31 +00:00
|
|
|
void blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx,
|
|
|
|
struct list_head *list)
|
|
|
|
{
|
2020-04-06 18:13:48 +00:00
|
|
|
int queued = 0;
|
2020-09-05 11:25:56 +00:00
|
|
|
int errors = 0;
|
2020-04-06 18:13:48 +00:00
|
|
|
|
2018-07-10 01:03:31 +00:00
|
|
|
while (!list_empty(list)) {
|
2019-04-04 17:08:43 +00:00
|
|
|
blk_status_t ret;
|
2018-07-10 01:03:31 +00:00
|
|
|
struct request *rq = list_first_entry(list, struct request,
|
|
|
|
queuelist);
|
|
|
|
|
|
|
|
list_del_init(&rq->queuelist);
|
2019-04-04 17:08:43 +00:00
|
|
|
ret = blk_mq_request_issue_directly(rq, list_empty(list));
|
|
|
|
if (ret != BLK_STS_OK) {
|
|
|
|
if (ret == BLK_STS_RESOURCE ||
|
|
|
|
ret == BLK_STS_DEV_RESOURCE) {
|
2020-02-25 01:04:32 +00:00
|
|
|
blk_mq_request_bypass_insert(rq, false,
|
blk-mq: punt failed direct issue to dispatch list
After the direct dispatch corruption fix, we permanently disallow direct
dispatch of non read/write requests. This works fine off the normal IO
path, as they will be retried like any other failed direct dispatch
request. But for the blk_insert_cloned_request() that only DM uses to
bypass the bottom level scheduler, we always first attempt direct
dispatch. For some types of requests, that's now a permanent failure,
and no amount of retrying will make that succeed. This results in a
livelock.
Instead of making special cases for what we can direct issue, and now
having to deal with DM solving the livelock while still retaining a BUSY
condition feedback loop, always just add a request that has been through
->queue_rq() to the hardware queue dispatch list. These are safe to use
as no merging can take place there. Additionally, if requests do have
prepped data from drivers, we aren't dependent on them not sharing space
in the request structure to safely add them to the IO scheduler lists.
This basically reverts ffe81d45322c and is based on a patch from Ming,
but with the list insert case covered as well.
Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
Cc: stable@vger.kernel.org
Suggested-by: Ming Lei <ming.lei@redhat.com>
Reported-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 05:17:44 +00:00
|
|
|
list_empty(list));
|
2019-04-04 17:08:43 +00:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
blk_mq_end_request(rq, ret);
|
2020-09-05 11:25:56 +00:00
|
|
|
errors++;
|
2020-04-06 18:13:48 +00:00
|
|
|
} else
|
|
|
|
queued++;
|
2018-07-10 01:03:31 +00:00
|
|
|
}
|
2018-11-28 00:02:25 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If we didn't flush the entire list, we could have told
|
|
|
|
* the driver there was more coming, but that turned out to
|
|
|
|
* be a lie.
|
|
|
|
*/
|
2020-09-05 11:25:56 +00:00
|
|
|
if ((!list_empty(list) || errors) &&
|
|
|
|
hctx->queue->mq_ops->commit_rqs && queued)
|
2018-11-28 00:02:25 +00:00
|
|
|
hctx->queue->mq_ops->commit_rqs(hctx);
|
2018-07-10 01:03:31 +00:00
|
|
|
}
|
|
|
|
|
2018-11-28 00:13:56 +00:00
|
|
|
static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq)
|
|
|
|
{
|
|
|
|
list_add_tail(&rq->queuelist, &plug->mq_list);
|
|
|
|
plug->rq_count++;
|
|
|
|
if (!plug->multiple_queues && !list_is_singular(&plug->mq_list)) {
|
|
|
|
struct request *tmp;
|
|
|
|
|
|
|
|
tmp = list_first_entry(&plug->mq_list, struct request,
|
|
|
|
queuelist);
|
|
|
|
if (tmp->q != rq->q)
|
|
|
|
plug->multiple_queues = true;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2020-01-06 18:08:18 +00:00
|
|
|
/**
|
2020-07-01 08:59:43 +00:00
|
|
|
* blk_mq_submit_bio - Create and send a request to block device.
|
2020-01-06 18:08:18 +00:00
|
|
|
* @bio: Bio pointer.
|
|
|
|
*
|
|
|
|
* Builds up a request structure from @q and @bio and send to the device. The
|
|
|
|
* request may not be queued directly to hardware if:
|
|
|
|
* * This request can be merged with another one
|
|
|
|
* * We want to place request at plug queue for possible future merging
|
|
|
|
* * There is an IO scheduler active at this queue
|
|
|
|
*
|
|
|
|
* It will not queue the request if there is an error with the bio, or at the
|
|
|
|
* request creation.
|
|
|
|
*
|
|
|
|
* Returns: Request queue cookie.
|
|
|
|
*/
|
2020-07-01 08:59:43 +00:00
|
|
|
blk_qc_t blk_mq_submit_bio(struct bio *bio)
|
2014-05-22 16:40:51 +00:00
|
|
|
{
|
2021-01-24 10:02:34 +00:00
|
|
|
struct request_queue *q = bio->bi_bdev->bd_disk->queue;
|
2016-10-28 14:48:16 +00:00
|
|
|
const int is_sync = op_is_sync(bio->bi_opf);
|
2017-01-27 15:30:47 +00:00
|
|
|
const int is_flush_fua = op_is_flush(bio->bi_opf);
|
2020-05-29 13:53:09 +00:00
|
|
|
struct blk_mq_alloc_data data = {
|
|
|
|
.q = q,
|
|
|
|
};
|
2014-05-22 16:40:51 +00:00
|
|
|
struct request *rq;
|
2015-05-08 17:51:32 +00:00
|
|
|
struct blk_plug *plug;
|
2015-05-08 17:51:33 +00:00
|
|
|
struct request *same_queue_rq = NULL;
|
2019-06-06 10:29:01 +00:00
|
|
|
unsigned int nr_segs;
|
2015-11-05 17:41:40 +00:00
|
|
|
blk_qc_t cookie;
|
block: Inline encryption support for blk-mq
We must have some way of letting a storage device driver know what
encryption context it should use for en/decrypting a request. However,
it's the upper layers (like the filesystem/fscrypt) that know about and
manages encryption contexts. As such, when the upper layer submits a bio
to the block layer, and this bio eventually reaches a device driver with
support for inline encryption, the device driver will need to have been
told the encryption context for that bio.
We want to communicate the encryption context from the upper layer to the
storage device along with the bio, when the bio is submitted to the block
layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can
represent an encryption context (note that we can't use the bi_private
field in struct bio to do this because that field does not function to pass
information across layers in the storage stack). We also introduce various
functions to manipulate the bio_crypt_ctx and make the bio/request merging
logic aware of the bio_crypt_ctx.
We also make changes to blk-mq to make it handle bios with encryption
contexts. blk-mq can merge many bios into the same request. These bios need
to have contiguous data unit numbers (the necessary changes to blk-merge
are also made to ensure this) - as such, it suffices to keep the data unit
number of just the first bio, since that's all a storage driver needs to
infer the data unit number to use for each data block in each bio in a
request. blk-mq keeps track of the encryption context to be used for all
the bios in a request with the request's rq_crypt_ctx. When the first bio
is added to an empty request, blk-mq will program the encryption context
of that bio into the request_queue's keyslot manager, and store the
returned keyslot in the request's rq_crypt_ctx. All the functions to
operate on encryption contexts are in blk-crypto.c.
Upper layers only need to call bio_crypt_set_ctx with the encryption key,
algorithm and data_unit_num; they don't have to worry about getting a
keyslot for each encryption context, as blk-mq/blk-crypto handles that.
Blk-crypto also makes it possible for request-based layered devices like
dm-rq to make use of inline encryption hardware by cloning the
rq_crypt_ctx and programming a keyslot in the new request_queue when
necessary.
Note that any user of the block layer can submit bios with an
encryption context, such as filesystems, device-mapper targets, etc.
Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-14 00:37:18 +00:00
|
|
|
blk_status_t ret;
|
block: disable iopoll for split bio
iopoll is initially for small size, latency sensitive IO. It doesn't
work well for big IO, especially when it needs to be split to multiple
bios. In this case, the returned cookie of __submit_bio_noacct_mq() is
indeed the cookie of the last split bio. The completion of *this* last
split bio done by iopoll doesn't mean the whole original bio has
completed. Callers of iopoll still need to wait for completion of other
split bios.
Besides bio splitting may cause more trouble for iopoll which isn't
supposed to be used in case of big IO.
iopoll for split bio may cause potential race if CPU migration happens
during bio submission. Since the returned cookie is that of the last
split bio, polling on the corresponding hardware queue doesn't help
complete other split bios, if these split bios are enqueued into
different hardware queues. Since interrupts are disabled for polling
queues, the completion of these other split bios depends on timeout
mechanism, thus causing a potential hang.
iopoll for split bio may also cause hang for sync polling. Currently
both the blkdev and iomap-based fs (ext4/xfs, etc) support sync polling
in direct IO routine. These routines will submit bio without REQ_NOWAIT
flag set, and then start sync polling in current process context. The
process may hang in blk_mq_get_tag() if the submitted bio has to be
split into multiple bios and can rapidly exhaust the queue depth. The
process are waiting for the completion of the previously allocated
requests, which should be reaped by the following polling, and thus
causing a deadlock.
To avoid these subtle trouble described above, just disable iopoll for
split bio and return BLK_QC_T_NONE in this case. The side effect is that
non-HIPRI IO also returns BLK_QC_T_NONE now. It should be acceptable
since the returned cookie is never used for non-HIPRI IO.
Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-26 09:18:52 +00:00
|
|
|
bool hipri;
|
2014-05-22 16:40:51 +00:00
|
|
|
|
|
|
|
blk_queue_bounce(q, &bio);
|
2020-07-01 08:59:39 +00:00
|
|
|
__blk_queue_split(&bio, &nr_segs);
|
blk-mq: NVMe 512B/4K+T10 DIF/DIX format returns I/O error on dd with split op
When formatting NVMe to 512B/4K + T10 DIf/DIX, dd with split op returns
"Input/output error". Looks block layer split the bio after calling
bio_integrity_prep(bio). This patch fixes the issue.
Below is how we debug this issue:
(1)format nvme to 4K block # size with type 2 DIF
(2)dd with block size bigger than 1024k.
oflag=direct
dd: error writing '/dev/nvme0n1': Input/output error
We added some debug code in nvme device driver. It showed us the first
op and the second op have the same bi and pi address. This is not
correct.
1st op: nvme0n1 Op:Wr slba 0x505 length 0x100, PI ctrl=0x1400,
dsmgmt=0x0, AT=0x0 & RT=0x505
Guard 0x00b1, AT 0x0000, RT physical 0x00000505 RT virtual 0x00002828
2nd op: nvme0n1 Op:Wr slba 0x605 length 0x1, PI ctrl=0x1400, dsmgmt=0x0,
AT=0x0 & RT=0x605 ==> This op fails and subsequent 5 retires..
Guard 0x00b1, AT 0x0000, RT physical 0x00000605 RT virtual 0x00002828
With the fix, It showed us both of the first op and the second op have
correct bi and pi address.
1st op: nvme2n1 Op:Wr slba 0x505 length 0x100, PI ctrl=0x1400,
dsmgmt=0x0, AT=0x0 & RT=0x505
Guard 0x5ccb, AT 0x0000, RT physical 0x00000505 RT virtual
0x00002828
2nd op: nvme2n1 Op:Wr slba 0x605 length 0x1, PI ctrl=0x1400, dsmgmt=0x0,
AT=0x0 & RT=0x605
Guard 0xab4c, AT 0x0000, RT physical 0x00000605 RT virtual
0x00003028
Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-10 13:54:11 +00:00
|
|
|
|
2017-06-29 18:31:11 +00:00
|
|
|
if (!bio_integrity_prep(bio))
|
2020-05-16 18:28:01 +00:00
|
|
|
goto queue_exit;
|
2014-05-22 16:40:51 +00:00
|
|
|
|
2016-06-02 05:18:48 +00:00
|
|
|
if (!is_flush_fua && !blk_queue_nomerges(q) &&
|
2019-06-06 10:29:01 +00:00
|
|
|
blk_attempt_plug_merge(q, bio, nr_segs, &same_queue_rq))
|
2020-05-16 18:28:01 +00:00
|
|
|
goto queue_exit;
|
2015-05-08 17:51:32 +00:00
|
|
|
|
2019-06-06 10:29:01 +00:00
|
|
|
if (blk_mq_sched_bio_merge(q, bio, nr_segs))
|
2020-05-16 18:28:01 +00:00
|
|
|
goto queue_exit;
|
2017-01-17 13:03:22 +00:00
|
|
|
|
2018-11-14 16:02:09 +00:00
|
|
|
rq_qos_throttle(q, bio);
|
block: hook up writeback throttling
Enable throttling of buffered writeback to make it a lot
more smooth, and has way less impact on other system activity.
Background writeback should be, by definition, background
activity. The fact that we flush huge bundles of it at the time
means that it potentially has heavy impacts on foreground workloads,
which isn't ideal. We can't easily limit the sizes of writes that
we do, since that would impact file system layout in the presence
of delayed allocation. So just throttle back buffered writeback,
unless someone is waiting for it.
The algorithm for when to throttle takes its inspiration in the
CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
the minimum latencies of requests over a window of time. In that
window of time, if the minimum latency of any request exceeds a
given target, then a scale count is incremented and the queue depth
is shrunk. The next monitoring window is shrunk accordingly. Unlike
CoDel, if we hit a window that exhibits good behavior, then we
simply increment the scale count and re-calculate the limits for that
scale value. This prevents us from oscillating between a
close-to-ideal value and max all the time, instead remaining in the
windows where we get good behavior.
Unlike CoDel, blk-wb allows the scale count to to negative. This
happens if we primarily have writes going on. Unlike positive
scale counts, this doesn't change the size of the monitoring window.
When the heavy writers finish, blk-bw quickly snaps back to it's
stable state of a zero scale count.
The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
target to me met. It defaults to 2 msec for non-rotational storage, and
75 msec for rotational storage. Setting this value to '0' disables
blk-wb. Generally, a user would not have to touch this setting.
We don't enable WBT on devices that are managed with CFQ, and have
a non-root block cgroup attached. If we have a proportional share setup
on this particular disk, then the wbt throttling will interfere with
that. We don't have a strong need for wbt for that case, since we will
rely on CFQ doing that for us.
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-09 19:38:14 +00:00
|
|
|
|
block: disable iopoll for split bio
iopoll is initially for small size, latency sensitive IO. It doesn't
work well for big IO, especially when it needs to be split to multiple
bios. In this case, the returned cookie of __submit_bio_noacct_mq() is
indeed the cookie of the last split bio. The completion of *this* last
split bio done by iopoll doesn't mean the whole original bio has
completed. Callers of iopoll still need to wait for completion of other
split bios.
Besides bio splitting may cause more trouble for iopoll which isn't
supposed to be used in case of big IO.
iopoll for split bio may cause potential race if CPU migration happens
during bio submission. Since the returned cookie is that of the last
split bio, polling on the corresponding hardware queue doesn't help
complete other split bios, if these split bios are enqueued into
different hardware queues. Since interrupts are disabled for polling
queues, the completion of these other split bios depends on timeout
mechanism, thus causing a potential hang.
iopoll for split bio may also cause hang for sync polling. Currently
both the blkdev and iomap-based fs (ext4/xfs, etc) support sync polling
in direct IO routine. These routines will submit bio without REQ_NOWAIT
flag set, and then start sync polling in current process context. The
process may hang in blk_mq_get_tag() if the submitted bio has to be
split into multiple bios and can rapidly exhaust the queue depth. The
process are waiting for the completion of the previously allocated
requests, which should be reaped by the following polling, and thus
causing a deadlock.
To avoid these subtle trouble described above, just disable iopoll for
split bio and return BLK_QC_T_NONE in this case. The side effect is that
non-HIPRI IO also returns BLK_QC_T_NONE now. It should be acceptable
since the returned cookie is never used for non-HIPRI IO.
Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-26 09:18:52 +00:00
|
|
|
hipri = bio->bi_opf & REQ_HIPRI;
|
|
|
|
|
2019-01-16 11:08:15 +00:00
|
|
|
data.cmd_flags = bio->bi_opf;
|
2020-05-29 13:53:09 +00:00
|
|
|
rq = __blk_mq_alloc_request(&data);
|
block: hook up writeback throttling
Enable throttling of buffered writeback to make it a lot
more smooth, and has way less impact on other system activity.
Background writeback should be, by definition, background
activity. The fact that we flush huge bundles of it at the time
means that it potentially has heavy impacts on foreground workloads,
which isn't ideal. We can't easily limit the sizes of writes that
we do, since that would impact file system layout in the presence
of delayed allocation. So just throttle back buffered writeback,
unless someone is waiting for it.
The algorithm for when to throttle takes its inspiration in the
CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
the minimum latencies of requests over a window of time. In that
window of time, if the minimum latency of any request exceeds a
given target, then a scale count is incremented and the queue depth
is shrunk. The next monitoring window is shrunk accordingly. Unlike
CoDel, if we hit a window that exhibits good behavior, then we
simply increment the scale count and re-calculate the limits for that
scale value. This prevents us from oscillating between a
close-to-ideal value and max all the time, instead remaining in the
windows where we get good behavior.
Unlike CoDel, blk-wb allows the scale count to to negative. This
happens if we primarily have writes going on. Unlike positive
scale counts, this doesn't change the size of the monitoring window.
When the heavy writers finish, blk-bw quickly snaps back to it's
stable state of a zero scale count.
The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
target to me met. It defaults to 2 msec for non-rotational storage, and
75 msec for rotational storage. Setting this value to '0' disables
blk-wb. Generally, a user would not have to touch this setting.
We don't enable WBT on devices that are managed with CFQ, and have
a non-root block cgroup attached. If we have a proportional share setup
on this particular disk, then the wbt throttling will interfere with
that. We don't have a strong need for wbt for that case, since we will
rely on CFQ doing that for us.
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-09 19:38:14 +00:00
|
|
|
if (unlikely(!rq)) {
|
2018-07-03 15:14:59 +00:00
|
|
|
rq_qos_cleanup(q, bio);
|
2019-08-15 17:09:16 +00:00
|
|
|
if (bio->bi_opf & REQ_NOWAIT)
|
2017-06-20 12:05:46 +00:00
|
|
|
bio_wouldblock_error(bio);
|
2020-05-16 18:28:01 +00:00
|
|
|
goto queue_exit;
|
block: hook up writeback throttling
Enable throttling of buffered writeback to make it a lot
more smooth, and has way less impact on other system activity.
Background writeback should be, by definition, background
activity. The fact that we flush huge bundles of it at the time
means that it potentially has heavy impacts on foreground workloads,
which isn't ideal. We can't easily limit the sizes of writes that
we do, since that would impact file system layout in the presence
of delayed allocation. So just throttle back buffered writeback,
unless someone is waiting for it.
The algorithm for when to throttle takes its inspiration in the
CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
the minimum latencies of requests over a window of time. In that
window of time, if the minimum latency of any request exceeds a
given target, then a scale count is incremented and the queue depth
is shrunk. The next monitoring window is shrunk accordingly. Unlike
CoDel, if we hit a window that exhibits good behavior, then we
simply increment the scale count and re-calculate the limits for that
scale value. This prevents us from oscillating between a
close-to-ideal value and max all the time, instead remaining in the
windows where we get good behavior.
Unlike CoDel, blk-wb allows the scale count to to negative. This
happens if we primarily have writes going on. Unlike positive
scale counts, this doesn't change the size of the monitoring window.
When the heavy writers finish, blk-bw quickly snaps back to it's
stable state of a zero scale count.
The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
target to me met. It defaults to 2 msec for non-rotational storage, and
75 msec for rotational storage. Setting this value to '0' disables
blk-wb. Generally, a user would not have to touch this setting.
We don't enable WBT on devices that are managed with CFQ, and have
a non-root block cgroup attached. If we have a proportional share setup
on this particular disk, then the wbt throttling will interfere with
that. We don't have a strong need for wbt for that case, since we will
rely on CFQ doing that for us.
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-09 19:38:14 +00:00
|
|
|
}
|
|
|
|
|
2020-12-03 16:21:36 +00:00
|
|
|
trace_block_getrq(bio);
|
2018-10-23 14:30:50 +00:00
|
|
|
|
2018-07-03 15:14:59 +00:00
|
|
|
rq_qos_track(q, rq, bio);
|
2014-05-22 16:40:51 +00:00
|
|
|
|
2017-01-12 17:04:45 +00:00
|
|
|
cookie = request_to_qc_t(data.hctx, rq);
|
2014-05-22 16:40:51 +00:00
|
|
|
|
2019-07-01 15:47:30 +00:00
|
|
|
blk_mq_bio_to_request(rq, bio, nr_segs);
|
|
|
|
|
block: Inline encryption support for blk-mq
We must have some way of letting a storage device driver know what
encryption context it should use for en/decrypting a request. However,
it's the upper layers (like the filesystem/fscrypt) that know about and
manages encryption contexts. As such, when the upper layer submits a bio
to the block layer, and this bio eventually reaches a device driver with
support for inline encryption, the device driver will need to have been
told the encryption context for that bio.
We want to communicate the encryption context from the upper layer to the
storage device along with the bio, when the bio is submitted to the block
layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can
represent an encryption context (note that we can't use the bi_private
field in struct bio to do this because that field does not function to pass
information across layers in the storage stack). We also introduce various
functions to manipulate the bio_crypt_ctx and make the bio/request merging
logic aware of the bio_crypt_ctx.
We also make changes to blk-mq to make it handle bios with encryption
contexts. blk-mq can merge many bios into the same request. These bios need
to have contiguous data unit numbers (the necessary changes to blk-merge
are also made to ensure this) - as such, it suffices to keep the data unit
number of just the first bio, since that's all a storage driver needs to
infer the data unit number to use for each data block in each bio in a
request. blk-mq keeps track of the encryption context to be used for all
the bios in a request with the request's rq_crypt_ctx. When the first bio
is added to an empty request, blk-mq will program the encryption context
of that bio into the request_queue's keyslot manager, and store the
returned keyslot in the request's rq_crypt_ctx. All the functions to
operate on encryption contexts are in blk-crypto.c.
Upper layers only need to call bio_crypt_set_ctx with the encryption key,
algorithm and data_unit_num; they don't have to worry about getting a
keyslot for each encryption context, as blk-mq/blk-crypto handles that.
Blk-crypto also makes it possible for request-based layered devices like
dm-rq to make use of inline encryption hardware by cloning the
rq_crypt_ctx and programming a keyslot in the new request_queue when
necessary.
Note that any user of the block layer can submit bios with an
encryption context, such as filesystems, device-mapper targets, etc.
Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-14 00:37:18 +00:00
|
|
|
ret = blk_crypto_init_request(rq);
|
|
|
|
if (ret != BLK_STS_OK) {
|
|
|
|
bio->bi_status = ret;
|
|
|
|
bio_endio(bio);
|
|
|
|
blk_mq_free_request(rq);
|
|
|
|
return BLK_QC_T_NONE;
|
|
|
|
}
|
|
|
|
|
block: Disable write plugging for zoned block devices
Simultaneously writing to a sequential zone of a zoned block device
from multiple contexts requires mutual exclusion for BIO issuing to
ensure that writes happen sequentially. However, even for a well
behaved user correctly implementing such synchronization, BIO plugging
may interfere and result in BIOs from the different contextx to be
reordered if plugging is done outside of the mutual exclusion section,
e.g. the plug was started by a function higher in the call chain than
the function issuing BIOs.
Context A Context B
| blk_start_plug()
| ...
| seq_write_zone()
| mutex_lock(zone)
| bio-0->bi_iter.bi_sector = zone->wp
| zone->wp += bio_sectors(bio-0)
| submit_bio(bio-0)
| bio-1->bi_iter.bi_sector = zone->wp
| zone->wp += bio_sectors(bio-1)
| submit_bio(bio-1)
| mutex_unlock(zone)
| return
| -----------------------> | seq_write_zone()
| mutex_lock(zone)
| bio-2->bi_iter.bi_sector = zone->wp
| zone->wp += bio_sectors(bio-2)
| submit_bio(bio-2)
| mutex_unlock(zone)
| <------------------------- |
| blk_finish_plug()
In the above example, despite the mutex synchronization ensuring the
correct BIO issuing order 0, 1, 2, context A BIOs 0 and 1 end up being
issued after BIO 2 of context B, when the plug is released with
blk_finish_plug().
While this problem can be addressed using the blk_flush_plug_list()
function (in the above example, the call must be inserted before the
zone mutex lock is released), a simple generic solution in the block
layer avoid this additional code in all zoned block device user code.
The simple generic solution implemented with this patch is to introduce
the internal helper function blk_mq_plug() to access the current
context plug on BIO submission. This helper returns the current plug
only if the target device is not a zoned block device or if the BIO to
be plugged is not a write operation. Otherwise, the caller context plug
is ignored and NULL returned, resulting is all writes to zoned block
device to never be plugged.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-07-10 16:18:31 +00:00
|
|
|
plug = blk_mq_plug(q, bio);
|
2014-05-22 16:40:51 +00:00
|
|
|
if (unlikely(is_flush_fua)) {
|
2020-01-06 18:08:18 +00:00
|
|
|
/* Bypass scheduler for flush requests */
|
2017-11-02 15:24:38 +00:00
|
|
|
blk_insert_flush(rq);
|
|
|
|
blk_mq_run_hw_queue(data.hctx, true);
|
2021-05-14 02:20:52 +00:00
|
|
|
} else if (plug && (q->nr_hw_queues == 1 ||
|
|
|
|
blk_mq_is_sbitmap_shared(rq->mq_hctx->flags) ||
|
|
|
|
q->mq_ops->commit_rqs || !blk_queue_nonrot(q))) {
|
2018-11-29 17:03:42 +00:00
|
|
|
/*
|
|
|
|
* Use plugging if we have a ->commit_rqs() hook as well, as
|
|
|
|
* we know the driver uses bd->last in a smart fashion.
|
2019-09-27 07:24:31 +00:00
|
|
|
*
|
|
|
|
* Use normal plugging if this disk is slow HDD, as sequential
|
|
|
|
* IO may benefit a lot from plug merging.
|
2018-11-29 17:03:42 +00:00
|
|
|
*/
|
2018-11-24 05:04:33 +00:00
|
|
|
unsigned int request_count = plug->rq_count;
|
2016-11-04 00:03:54 +00:00
|
|
|
struct request *last = NULL;
|
|
|
|
|
2015-10-20 15:13:56 +00:00
|
|
|
if (!request_count)
|
2015-05-08 17:51:30 +00:00
|
|
|
trace_block_plug(q);
|
2016-11-04 00:03:54 +00:00
|
|
|
else
|
|
|
|
last = list_entry_rq(plug->mq_list.prev);
|
blk-mq: fix calling unplug callbacks with preempt disabled
Liu reported that running certain parts of xfstests threw the
following error:
BUG: sleeping function called from invalid context at mm/page_alloc.c:3190
in_atomic(): 1, irqs_disabled(): 0, pid: 6, name: kworker/u16:0
3 locks held by kworker/u16:0/6:
#0: ("writeback"){++++.+}, at: [<ffffffff8107f083>] process_one_work+0x173/0x730
#1: ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff8107f083>] process_one_work+0x173/0x730
#2: (&type->s_umount_key#44){+++++.}, at: [<ffffffff811e6805>] trylock_super+0x25/0x60
CPU: 5 PID: 6 Comm: kworker/u16:0 Tainted: G OE 4.3.0+ #3
Hardware name: Red Hat KVM, BIOS Bochs 01/01/2011
Workqueue: writeback wb_workfn (flush-btrfs-108)
ffffffff81a3abab ffff88042e282ba8 ffffffff8130191b ffffffff81a3abab
0000000000000c76 ffff88042e282ba8 ffff88042e27c180 ffff88042e282bd8
ffffffff8108ed95 ffff880400000004 0000000000000000 0000000000000c76
Call Trace:
[<ffffffff8130191b>] dump_stack+0x4f/0x74
[<ffffffff8108ed95>] ___might_sleep+0x185/0x240
[<ffffffff8108eea2>] __might_sleep+0x52/0x90
[<ffffffff811817e8>] __alloc_pages_nodemask+0x268/0x410
[<ffffffff8109a43c>] ? sched_clock_local+0x1c/0x90
[<ffffffff8109a6d1>] ? local_clock+0x21/0x40
[<ffffffff810b9eb0>] ? __lock_release+0x420/0x510
[<ffffffff810b534c>] ? __lock_acquired+0x16c/0x3c0
[<ffffffff811ca265>] alloc_pages_current+0xc5/0x210
[<ffffffffa0577105>] ? rbio_is_full+0x55/0x70 [btrfs]
[<ffffffff810b7ed8>] ? mark_held_locks+0x78/0xa0
[<ffffffff81666d50>] ? _raw_spin_unlock_irqrestore+0x40/0x60
[<ffffffffa0578c0a>] full_stripe_write+0x5a/0xc0 [btrfs]
[<ffffffffa0578ca9>] __raid56_parity_write+0x39/0x60 [btrfs]
[<ffffffffa0578deb>] run_plug+0x11b/0x140 [btrfs]
[<ffffffffa0578e33>] btrfs_raid_unplug+0x23/0x70 [btrfs]
[<ffffffff812d36c2>] blk_flush_plug_list+0x82/0x1f0
[<ffffffff812e0349>] blk_sq_make_request+0x1f9/0x740
[<ffffffff812ceba2>] ? generic_make_request_checks+0x222/0x7c0
[<ffffffff812cf264>] ? blk_queue_enter+0x124/0x310
[<ffffffff812cf1d2>] ? blk_queue_enter+0x92/0x310
[<ffffffff812d0ae2>] generic_make_request+0x172/0x2c0
[<ffffffff812d0ad4>] ? generic_make_request+0x164/0x2c0
[<ffffffff812d0ca0>] submit_bio+0x70/0x140
[<ffffffffa0577b29>] ? rbio_add_io_page+0x99/0x150 [btrfs]
[<ffffffffa0578a89>] finish_rmw+0x4d9/0x600 [btrfs]
[<ffffffffa0578c4c>] full_stripe_write+0x9c/0xc0 [btrfs]
[<ffffffffa057ab7f>] raid56_parity_write+0xef/0x160 [btrfs]
[<ffffffffa052bd83>] btrfs_map_bio+0xe3/0x2d0 [btrfs]
[<ffffffffa04fbd6d>] btrfs_submit_bio_hook+0x8d/0x1d0 [btrfs]
[<ffffffffa05173c4>] submit_one_bio+0x74/0xb0 [btrfs]
[<ffffffffa0517f55>] submit_extent_page+0xe5/0x1c0 [btrfs]
[<ffffffffa0519b18>] __extent_writepage_io+0x408/0x4c0 [btrfs]
[<ffffffffa05179c0>] ? alloc_dummy_extent_buffer+0x140/0x140 [btrfs]
[<ffffffffa051dc88>] __extent_writepage+0x218/0x3a0 [btrfs]
[<ffffffff810b7ed8>] ? mark_held_locks+0x78/0xa0
[<ffffffffa051e2c9>] extent_write_cache_pages.clone.0+0x2f9/0x400 [btrfs]
[<ffffffffa051e422>] extent_writepages+0x52/0x70 [btrfs]
[<ffffffffa05001f0>] ? btrfs_set_inode_index+0x70/0x70 [btrfs]
[<ffffffffa04fcc17>] btrfs_writepages+0x27/0x30 [btrfs]
[<ffffffff81184df3>] do_writepages+0x23/0x40
[<ffffffff81212229>] __writeback_single_inode+0x89/0x4d0
[<ffffffff81212a60>] ? writeback_sb_inodes+0x260/0x480
[<ffffffff81212a60>] ? writeback_sb_inodes+0x260/0x480
[<ffffffff8121295f>] ? writeback_sb_inodes+0x15f/0x480
[<ffffffff81212ad2>] writeback_sb_inodes+0x2d2/0x480
[<ffffffff810b1397>] ? down_read_trylock+0x57/0x60
[<ffffffff811e6805>] ? trylock_super+0x25/0x60
[<ffffffff810d629f>] ? rcu_read_lock_sched_held+0x4f/0x90
[<ffffffff81212d0c>] __writeback_inodes_wb+0x8c/0xc0
[<ffffffff812130b5>] wb_writeback+0x2b5/0x500
[<ffffffff810b7ed8>] ? mark_held_locks+0x78/0xa0
[<ffffffff810660a8>] ? __local_bh_enable_ip+0x68/0xc0
[<ffffffff81213362>] ? wb_do_writeback+0x62/0x310
[<ffffffff812133c1>] wb_do_writeback+0xc1/0x310
[<ffffffff8107c3d9>] ? set_worker_desc+0x79/0x90
[<ffffffff81213842>] wb_workfn+0x92/0x330
[<ffffffff8107f133>] process_one_work+0x223/0x730
[<ffffffff8107f083>] ? process_one_work+0x173/0x730
[<ffffffff8108035f>] ? worker_thread+0x18f/0x430
[<ffffffff810802ed>] worker_thread+0x11d/0x430
[<ffffffff810801d0>] ? maybe_create_worker+0xf0/0xf0
[<ffffffff810801d0>] ? maybe_create_worker+0xf0/0xf0
[<ffffffff810858df>] kthread+0xef/0x110
[<ffffffff8108f74e>] ? schedule_tail+0x1e/0xd0
[<ffffffff810857f0>] ? __init_kthread_worker+0x70/0x70
[<ffffffff816673bf>] ret_from_fork+0x3f/0x70
[<ffffffff810857f0>] ? __init_kthread_worker+0x70/0x70
The issue is that we've got the software context pinned while
calling blk_flush_plug_list(), which flushes callbacks that
are allowed to sleep. btrfs and raid has such callbacks.
Flip the checks around a bit, so we can enable preempt a bit
earlier and flush plugs without having preempt disabled.
This only affects blk-mq driven devices, and only those that
register a single queue.
Reported-by: Liu Bo <bo.li.liu@oracle.com>
Tested-by: Liu Bo <bo.li.liu@oracle.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-11-21 03:29:45 +00:00
|
|
|
|
2016-11-04 00:03:54 +00:00
|
|
|
if (request_count >= BLK_MAX_REQUEST_COUNT || (last &&
|
|
|
|
blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
|
2015-05-08 17:51:30 +00:00
|
|
|
blk_flush_plug_list(plug, false);
|
|
|
|
trace_block_plug(q);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
}
|
blk-mq: fix calling unplug callbacks with preempt disabled
Liu reported that running certain parts of xfstests threw the
following error:
BUG: sleeping function called from invalid context at mm/page_alloc.c:3190
in_atomic(): 1, irqs_disabled(): 0, pid: 6, name: kworker/u16:0
3 locks held by kworker/u16:0/6:
#0: ("writeback"){++++.+}, at: [<ffffffff8107f083>] process_one_work+0x173/0x730
#1: ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff8107f083>] process_one_work+0x173/0x730
#2: (&type->s_umount_key#44){+++++.}, at: [<ffffffff811e6805>] trylock_super+0x25/0x60
CPU: 5 PID: 6 Comm: kworker/u16:0 Tainted: G OE 4.3.0+ #3
Hardware name: Red Hat KVM, BIOS Bochs 01/01/2011
Workqueue: writeback wb_workfn (flush-btrfs-108)
ffffffff81a3abab ffff88042e282ba8 ffffffff8130191b ffffffff81a3abab
0000000000000c76 ffff88042e282ba8 ffff88042e27c180 ffff88042e282bd8
ffffffff8108ed95 ffff880400000004 0000000000000000 0000000000000c76
Call Trace:
[<ffffffff8130191b>] dump_stack+0x4f/0x74
[<ffffffff8108ed95>] ___might_sleep+0x185/0x240
[<ffffffff8108eea2>] __might_sleep+0x52/0x90
[<ffffffff811817e8>] __alloc_pages_nodemask+0x268/0x410
[<ffffffff8109a43c>] ? sched_clock_local+0x1c/0x90
[<ffffffff8109a6d1>] ? local_clock+0x21/0x40
[<ffffffff810b9eb0>] ? __lock_release+0x420/0x510
[<ffffffff810b534c>] ? __lock_acquired+0x16c/0x3c0
[<ffffffff811ca265>] alloc_pages_current+0xc5/0x210
[<ffffffffa0577105>] ? rbio_is_full+0x55/0x70 [btrfs]
[<ffffffff810b7ed8>] ? mark_held_locks+0x78/0xa0
[<ffffffff81666d50>] ? _raw_spin_unlock_irqrestore+0x40/0x60
[<ffffffffa0578c0a>] full_stripe_write+0x5a/0xc0 [btrfs]
[<ffffffffa0578ca9>] __raid56_parity_write+0x39/0x60 [btrfs]
[<ffffffffa0578deb>] run_plug+0x11b/0x140 [btrfs]
[<ffffffffa0578e33>] btrfs_raid_unplug+0x23/0x70 [btrfs]
[<ffffffff812d36c2>] blk_flush_plug_list+0x82/0x1f0
[<ffffffff812e0349>] blk_sq_make_request+0x1f9/0x740
[<ffffffff812ceba2>] ? generic_make_request_checks+0x222/0x7c0
[<ffffffff812cf264>] ? blk_queue_enter+0x124/0x310
[<ffffffff812cf1d2>] ? blk_queue_enter+0x92/0x310
[<ffffffff812d0ae2>] generic_make_request+0x172/0x2c0
[<ffffffff812d0ad4>] ? generic_make_request+0x164/0x2c0
[<ffffffff812d0ca0>] submit_bio+0x70/0x140
[<ffffffffa0577b29>] ? rbio_add_io_page+0x99/0x150 [btrfs]
[<ffffffffa0578a89>] finish_rmw+0x4d9/0x600 [btrfs]
[<ffffffffa0578c4c>] full_stripe_write+0x9c/0xc0 [btrfs]
[<ffffffffa057ab7f>] raid56_parity_write+0xef/0x160 [btrfs]
[<ffffffffa052bd83>] btrfs_map_bio+0xe3/0x2d0 [btrfs]
[<ffffffffa04fbd6d>] btrfs_submit_bio_hook+0x8d/0x1d0 [btrfs]
[<ffffffffa05173c4>] submit_one_bio+0x74/0xb0 [btrfs]
[<ffffffffa0517f55>] submit_extent_page+0xe5/0x1c0 [btrfs]
[<ffffffffa0519b18>] __extent_writepage_io+0x408/0x4c0 [btrfs]
[<ffffffffa05179c0>] ? alloc_dummy_extent_buffer+0x140/0x140 [btrfs]
[<ffffffffa051dc88>] __extent_writepage+0x218/0x3a0 [btrfs]
[<ffffffff810b7ed8>] ? mark_held_locks+0x78/0xa0
[<ffffffffa051e2c9>] extent_write_cache_pages.clone.0+0x2f9/0x400 [btrfs]
[<ffffffffa051e422>] extent_writepages+0x52/0x70 [btrfs]
[<ffffffffa05001f0>] ? btrfs_set_inode_index+0x70/0x70 [btrfs]
[<ffffffffa04fcc17>] btrfs_writepages+0x27/0x30 [btrfs]
[<ffffffff81184df3>] do_writepages+0x23/0x40
[<ffffffff81212229>] __writeback_single_inode+0x89/0x4d0
[<ffffffff81212a60>] ? writeback_sb_inodes+0x260/0x480
[<ffffffff81212a60>] ? writeback_sb_inodes+0x260/0x480
[<ffffffff8121295f>] ? writeback_sb_inodes+0x15f/0x480
[<ffffffff81212ad2>] writeback_sb_inodes+0x2d2/0x480
[<ffffffff810b1397>] ? down_read_trylock+0x57/0x60
[<ffffffff811e6805>] ? trylock_super+0x25/0x60
[<ffffffff810d629f>] ? rcu_read_lock_sched_held+0x4f/0x90
[<ffffffff81212d0c>] __writeback_inodes_wb+0x8c/0xc0
[<ffffffff812130b5>] wb_writeback+0x2b5/0x500
[<ffffffff810b7ed8>] ? mark_held_locks+0x78/0xa0
[<ffffffff810660a8>] ? __local_bh_enable_ip+0x68/0xc0
[<ffffffff81213362>] ? wb_do_writeback+0x62/0x310
[<ffffffff812133c1>] wb_do_writeback+0xc1/0x310
[<ffffffff8107c3d9>] ? set_worker_desc+0x79/0x90
[<ffffffff81213842>] wb_workfn+0x92/0x330
[<ffffffff8107f133>] process_one_work+0x223/0x730
[<ffffffff8107f083>] ? process_one_work+0x173/0x730
[<ffffffff8108035f>] ? worker_thread+0x18f/0x430
[<ffffffff810802ed>] worker_thread+0x11d/0x430
[<ffffffff810801d0>] ? maybe_create_worker+0xf0/0xf0
[<ffffffff810801d0>] ? maybe_create_worker+0xf0/0xf0
[<ffffffff810858df>] kthread+0xef/0x110
[<ffffffff8108f74e>] ? schedule_tail+0x1e/0xd0
[<ffffffff810857f0>] ? __init_kthread_worker+0x70/0x70
[<ffffffff816673bf>] ret_from_fork+0x3f/0x70
[<ffffffff810857f0>] ? __init_kthread_worker+0x70/0x70
The issue is that we've got the software context pinned while
calling blk_flush_plug_list(), which flushes callbacks that
are allowed to sleep. btrfs and raid has such callbacks.
Flip the checks around a bit, so we can enable preempt a bit
earlier and flush plugs without having preempt disabled.
This only affects blk-mq driven devices, and only those that
register a single queue.
Reported-by: Liu Bo <bo.li.liu@oracle.com>
Tested-by: Liu Bo <bo.li.liu@oracle.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-11-21 03:29:45 +00:00
|
|
|
|
2018-11-28 00:13:56 +00:00
|
|
|
blk_add_rq_to_plug(plug, rq);
|
2019-09-27 07:24:30 +00:00
|
|
|
} else if (q->elevator) {
|
2020-01-06 18:08:18 +00:00
|
|
|
/* Insert the request at the IO scheduler queue */
|
2019-09-27 07:24:30 +00:00
|
|
|
blk_mq_sched_insert_request(rq, false, true, true);
|
2017-03-22 19:01:52 +00:00
|
|
|
} else if (plug && !blk_queue_nomerges(q)) {
|
2014-05-22 16:40:51 +00:00
|
|
|
/*
|
2016-11-02 16:09:51 +00:00
|
|
|
* We do limited plugging. If the bio can be merged, do that.
|
2015-05-08 17:51:32 +00:00
|
|
|
* Otherwise the existing request in the plug list will be
|
|
|
|
* issued. So the plug list will have one request at most
|
2017-03-22 19:01:52 +00:00
|
|
|
* The plug list might get flushed before this. If that happens,
|
|
|
|
* the plug list is empty, and same_queue_rq is invalid.
|
2014-05-22 16:40:51 +00:00
|
|
|
*/
|
2017-03-22 19:01:52 +00:00
|
|
|
if (list_empty(&plug->mq_list))
|
|
|
|
same_queue_rq = NULL;
|
2018-11-28 00:07:17 +00:00
|
|
|
if (same_queue_rq) {
|
2017-03-22 19:01:52 +00:00
|
|
|
list_del_init(&same_queue_rq->queuelist);
|
2018-11-28 00:07:17 +00:00
|
|
|
plug->rq_count--;
|
|
|
|
}
|
2018-11-28 00:13:56 +00:00
|
|
|
blk_add_rq_to_plug(plug, rq);
|
2019-03-26 13:19:25 +00:00
|
|
|
trace_block_plug(q);
|
2017-03-22 19:01:52 +00:00
|
|
|
|
2017-06-06 15:21:59 +00:00
|
|
|
if (same_queue_rq) {
|
2018-10-29 21:06:13 +00:00
|
|
|
data.hctx = same_queue_rq->mq_hctx;
|
2019-03-26 13:19:25 +00:00
|
|
|
trace_block_unplug(q, 1, true);
|
2017-03-22 19:01:52 +00:00
|
|
|
blk_mq_try_issue_directly(data.hctx, same_queue_rq,
|
2019-04-04 17:08:43 +00:00
|
|
|
&cookie);
|
2017-06-06 15:21:59 +00:00
|
|
|
}
|
2019-09-27 07:24:30 +00:00
|
|
|
} else if ((q->nr_hw_queues > 1 && is_sync) ||
|
|
|
|
!data.hctx->dispatch_busy) {
|
2020-01-06 18:08:18 +00:00
|
|
|
/*
|
|
|
|
* There is no scheduler and we can try to send directly
|
|
|
|
* to the hardware.
|
|
|
|
*/
|
2019-04-04 17:08:43 +00:00
|
|
|
blk_mq_try_issue_directly(data.hctx, rq, &cookie);
|
2017-05-26 11:53:19 +00:00
|
|
|
} else {
|
2020-01-06 18:08:18 +00:00
|
|
|
/* Default case. */
|
2018-05-16 14:21:21 +00:00
|
|
|
blk_mq_sched_insert_request(rq, false, true, true);
|
2017-05-26 11:53:19 +00:00
|
|
|
}
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
block: disable iopoll for split bio
iopoll is initially for small size, latency sensitive IO. It doesn't
work well for big IO, especially when it needs to be split to multiple
bios. In this case, the returned cookie of __submit_bio_noacct_mq() is
indeed the cookie of the last split bio. The completion of *this* last
split bio done by iopoll doesn't mean the whole original bio has
completed. Callers of iopoll still need to wait for completion of other
split bios.
Besides bio splitting may cause more trouble for iopoll which isn't
supposed to be used in case of big IO.
iopoll for split bio may cause potential race if CPU migration happens
during bio submission. Since the returned cookie is that of the last
split bio, polling on the corresponding hardware queue doesn't help
complete other split bios, if these split bios are enqueued into
different hardware queues. Since interrupts are disabled for polling
queues, the completion of these other split bios depends on timeout
mechanism, thus causing a potential hang.
iopoll for split bio may also cause hang for sync polling. Currently
both the blkdev and iomap-based fs (ext4/xfs, etc) support sync polling
in direct IO routine. These routines will submit bio without REQ_NOWAIT
flag set, and then start sync polling in current process context. The
process may hang in blk_mq_get_tag() if the submitted bio has to be
split into multiple bios and can rapidly exhaust the queue depth. The
process are waiting for the completion of the previously allocated
requests, which should be reaped by the following polling, and thus
causing a deadlock.
To avoid these subtle trouble described above, just disable iopoll for
split bio and return BLK_QC_T_NONE in this case. The side effect is that
non-HIPRI IO also returns BLK_QC_T_NONE now. It should be acceptable
since the returned cookie is never used for non-HIPRI IO.
Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-26 09:18:52 +00:00
|
|
|
if (!hipri)
|
|
|
|
return BLK_QC_T_NONE;
|
2015-11-05 17:41:40 +00:00
|
|
|
return cookie;
|
2020-05-16 18:28:01 +00:00
|
|
|
queue_exit:
|
|
|
|
blk_queue_exit(q);
|
|
|
|
return BLK_QC_T_NONE;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
}
|
|
|
|
|
2021-05-11 15:22:35 +00:00
|
|
|
static size_t order_to_size(unsigned int order)
|
|
|
|
{
|
|
|
|
return (size_t)PAGE_SIZE << order;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* called before freeing request pool in @tags */
|
|
|
|
static void blk_mq_clear_rq_mapping(struct blk_mq_tag_set *set,
|
|
|
|
struct blk_mq_tags *tags, unsigned int hctx_idx)
|
|
|
|
{
|
|
|
|
struct blk_mq_tags *drv_tags = set->tags[hctx_idx];
|
|
|
|
struct page *page;
|
|
|
|
unsigned long flags;
|
|
|
|
|
|
|
|
list_for_each_entry(page, &tags->page_list, lru) {
|
|
|
|
unsigned long start = (unsigned long)page_address(page);
|
|
|
|
unsigned long end = start + order_to_size(page->private);
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < set->queue_depth; i++) {
|
|
|
|
struct request *rq = drv_tags->rqs[i];
|
|
|
|
unsigned long rq_addr = (unsigned long)rq;
|
|
|
|
|
|
|
|
if (rq_addr >= start && rq_addr < end) {
|
|
|
|
WARN_ON_ONCE(refcount_read(&rq->ref) != 0);
|
|
|
|
cmpxchg(&drv_tags->rqs[i], rq, NULL);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Wait until all pending iteration is done.
|
|
|
|
*
|
|
|
|
* Request reference is cleared and it is guaranteed to be observed
|
|
|
|
* after the ->lock is released.
|
|
|
|
*/
|
|
|
|
spin_lock_irqsave(&drv_tags->lock, flags);
|
|
|
|
spin_unlock_irqrestore(&drv_tags->lock, flags);
|
|
|
|
}
|
|
|
|
|
2017-01-11 21:29:56 +00:00
|
|
|
void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
|
|
|
|
unsigned int hctx_idx)
|
2014-03-14 16:43:15 +00:00
|
|
|
{
|
2014-04-15 19:59:10 +00:00
|
|
|
struct page *page;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
2014-04-15 20:14:00 +00:00
|
|
|
if (tags->rqs && set->ops->exit_request) {
|
2014-04-15 19:59:10 +00:00
|
|
|
int i;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
2014-04-15 20:14:00 +00:00
|
|
|
for (i = 0; i < tags->nr_tags; i++) {
|
2017-01-13 21:39:30 +00:00
|
|
|
struct request *rq = tags->static_rqs[i];
|
|
|
|
|
|
|
|
if (!rq)
|
2014-04-15 19:59:10 +00:00
|
|
|
continue;
|
2017-05-01 16:19:08 +00:00
|
|
|
set->ops->exit_request(set, rq, hctx_idx);
|
2017-01-13 21:39:30 +00:00
|
|
|
tags->static_rqs[i] = NULL;
|
2014-04-15 19:59:10 +00:00
|
|
|
}
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
}
|
|
|
|
|
2021-05-11 15:22:35 +00:00
|
|
|
blk_mq_clear_rq_mapping(set, tags, hctx_idx);
|
|
|
|
|
2014-04-15 20:14:00 +00:00
|
|
|
while (!list_empty(&tags->page_list)) {
|
|
|
|
page = list_first_entry(&tags->page_list, struct page, lru);
|
2014-01-09 03:17:46 +00:00
|
|
|
list_del_init(&page->lru);
|
2015-09-14 17:16:02 +00:00
|
|
|
/*
|
|
|
|
* Remove kmemleak object previously allocated in
|
2019-05-02 19:48:11 +00:00
|
|
|
* blk_mq_alloc_rqs().
|
2015-09-14 17:16:02 +00:00
|
|
|
*/
|
|
|
|
kmemleak_free(page_address(page));
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
__free_pages(page, page->private);
|
|
|
|
}
|
2017-01-11 21:29:56 +00:00
|
|
|
}
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
2020-08-19 15:20:22 +00:00
|
|
|
void blk_mq_free_rq_map(struct blk_mq_tags *tags, unsigned int flags)
|
2017-01-11 21:29:56 +00:00
|
|
|
{
|
2014-04-15 20:14:00 +00:00
|
|
|
kfree(tags->rqs);
|
2017-01-11 21:29:56 +00:00
|
|
|
tags->rqs = NULL;
|
2017-01-13 21:39:30 +00:00
|
|
|
kfree(tags->static_rqs);
|
|
|
|
tags->static_rqs = NULL;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
2020-08-19 15:20:22 +00:00
|
|
|
blk_mq_free_tags(tags, flags);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
}
|
|
|
|
|
2017-01-11 21:29:56 +00:00
|
|
|
struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set,
|
|
|
|
unsigned int hctx_idx,
|
|
|
|
unsigned int nr_tags,
|
2020-08-19 15:20:22 +00:00
|
|
|
unsigned int reserved_tags,
|
|
|
|
unsigned int flags)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
{
|
2014-04-15 20:14:00 +00:00
|
|
|
struct blk_mq_tags *tags;
|
2017-02-01 17:53:14 +00:00
|
|
|
int node;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
2019-02-27 13:35:01 +00:00
|
|
|
node = blk_mq_hw_queue_to_node(&set->map[HCTX_TYPE_DEFAULT], hctx_idx);
|
2017-02-01 17:53:14 +00:00
|
|
|
if (node == NUMA_NO_NODE)
|
|
|
|
node = set->numa_node;
|
|
|
|
|
2020-08-19 15:20:22 +00:00
|
|
|
tags = blk_mq_init_tags(nr_tags, reserved_tags, node, flags);
|
2014-04-15 20:14:00 +00:00
|
|
|
if (!tags)
|
|
|
|
return NULL;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
treewide: kzalloc_node() -> kcalloc_node()
The kzalloc_node() function has a 2-factor argument form, kcalloc_node(). This
patch replaces cases of:
kzalloc_node(a * b, gfp, node)
with:
kcalloc_node(a * b, gfp, node)
as well as handling cases of:
kzalloc_node(a * b * c, gfp, node)
with:
kzalloc_node(array3_size(a, b, c), gfp, node)
as it's slightly less ugly than:
kcalloc_node(array_size(a, b), c, gfp, node)
This does, however, attempt to ignore constant size factors like:
kzalloc_node(4 * 1024, gfp, node)
though any constants defined via macros get caught up in the conversion.
Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.
The Coccinelle script used for this was:
// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@
(
kzalloc_node(
- (sizeof(TYPE)) * E
+ sizeof(TYPE) * E
, ...)
|
kzalloc_node(
- (sizeof(THING)) * E
+ sizeof(THING) * E
, ...)
)
// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@
(
kzalloc_node(
- sizeof(u8) * (COUNT)
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(__u8) * (COUNT)
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(char) * (COUNT)
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(unsigned char) * (COUNT)
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(u8) * COUNT
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(__u8) * COUNT
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(char) * COUNT
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(unsigned char) * COUNT
+ COUNT
, ...)
)
// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@
(
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * (COUNT_ID)
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * COUNT_ID
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * (COUNT_CONST)
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * COUNT_CONST
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * (COUNT_ID)
+ COUNT_ID, sizeof(THING)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * COUNT_ID
+ COUNT_ID, sizeof(THING)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * (COUNT_CONST)
+ COUNT_CONST, sizeof(THING)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * COUNT_CONST
+ COUNT_CONST, sizeof(THING)
, ...)
)
// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@
- kzalloc_node
+ kcalloc_node
(
- SIZE * COUNT
+ COUNT, SIZE
, ...)
// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@
(
kzalloc_node(
- sizeof(TYPE) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc_node(
- sizeof(TYPE) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc_node(
- sizeof(TYPE) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc_node(
- sizeof(TYPE) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc_node(
- sizeof(THING) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc_node(
- sizeof(THING) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc_node(
- sizeof(THING) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc_node(
- sizeof(THING) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
)
// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@
(
kzalloc_node(
- sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kzalloc_node(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kzalloc_node(
- sizeof(THING1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kzalloc_node(
- sizeof(THING1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kzalloc_node(
- sizeof(TYPE1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
|
kzalloc_node(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
)
// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@
(
kzalloc_node(
- (COUNT) * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- COUNT * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- COUNT * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- (COUNT) * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- COUNT * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- (COUNT) * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- (COUNT) * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- COUNT * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
)
// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@
(
kzalloc_node(C1 * C2 * C3, ...)
|
kzalloc_node(
- (E1) * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc_node(
- (E1) * (E2) * E3
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc_node(
- (E1) * (E2) * (E3)
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc_node(
- E1 * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
)
// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@
(
kzalloc_node(sizeof(THING) * C2, ...)
|
kzalloc_node(sizeof(TYPE) * C2, ...)
|
kzalloc_node(C1 * C2 * C3, ...)
|
kzalloc_node(C1 * C2, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * (E2)
+ E2, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * E2
+ E2, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * (E2)
+ E2, sizeof(THING)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * E2
+ E2, sizeof(THING)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- (E1) * E2
+ E1, E2
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- (E1) * (E2)
+ E1, E2
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- E1 * E2
+ E1, E2
, ...)
)
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 21:04:20 +00:00
|
|
|
tags->rqs = kcalloc_node(nr_tags, sizeof(struct request *),
|
2016-12-06 15:31:44 +00:00
|
|
|
GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
|
2017-02-01 17:53:14 +00:00
|
|
|
node);
|
2014-04-15 20:14:00 +00:00
|
|
|
if (!tags->rqs) {
|
2020-08-19 15:20:22 +00:00
|
|
|
blk_mq_free_tags(tags, flags);
|
2014-04-15 20:14:00 +00:00
|
|
|
return NULL;
|
|
|
|
}
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
treewide: kzalloc_node() -> kcalloc_node()
The kzalloc_node() function has a 2-factor argument form, kcalloc_node(). This
patch replaces cases of:
kzalloc_node(a * b, gfp, node)
with:
kcalloc_node(a * b, gfp, node)
as well as handling cases of:
kzalloc_node(a * b * c, gfp, node)
with:
kzalloc_node(array3_size(a, b, c), gfp, node)
as it's slightly less ugly than:
kcalloc_node(array_size(a, b), c, gfp, node)
This does, however, attempt to ignore constant size factors like:
kzalloc_node(4 * 1024, gfp, node)
though any constants defined via macros get caught up in the conversion.
Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.
The Coccinelle script used for this was:
// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@
(
kzalloc_node(
- (sizeof(TYPE)) * E
+ sizeof(TYPE) * E
, ...)
|
kzalloc_node(
- (sizeof(THING)) * E
+ sizeof(THING) * E
, ...)
)
// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@
(
kzalloc_node(
- sizeof(u8) * (COUNT)
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(__u8) * (COUNT)
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(char) * (COUNT)
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(unsigned char) * (COUNT)
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(u8) * COUNT
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(__u8) * COUNT
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(char) * COUNT
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(unsigned char) * COUNT
+ COUNT
, ...)
)
// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@
(
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * (COUNT_ID)
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * COUNT_ID
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * (COUNT_CONST)
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * COUNT_CONST
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * (COUNT_ID)
+ COUNT_ID, sizeof(THING)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * COUNT_ID
+ COUNT_ID, sizeof(THING)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * (COUNT_CONST)
+ COUNT_CONST, sizeof(THING)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * COUNT_CONST
+ COUNT_CONST, sizeof(THING)
, ...)
)
// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@
- kzalloc_node
+ kcalloc_node
(
- SIZE * COUNT
+ COUNT, SIZE
, ...)
// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@
(
kzalloc_node(
- sizeof(TYPE) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc_node(
- sizeof(TYPE) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc_node(
- sizeof(TYPE) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc_node(
- sizeof(TYPE) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc_node(
- sizeof(THING) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc_node(
- sizeof(THING) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc_node(
- sizeof(THING) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc_node(
- sizeof(THING) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
)
// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@
(
kzalloc_node(
- sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kzalloc_node(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kzalloc_node(
- sizeof(THING1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kzalloc_node(
- sizeof(THING1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kzalloc_node(
- sizeof(TYPE1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
|
kzalloc_node(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
)
// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@
(
kzalloc_node(
- (COUNT) * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- COUNT * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- COUNT * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- (COUNT) * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- COUNT * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- (COUNT) * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- (COUNT) * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- COUNT * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
)
// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@
(
kzalloc_node(C1 * C2 * C3, ...)
|
kzalloc_node(
- (E1) * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc_node(
- (E1) * (E2) * E3
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc_node(
- (E1) * (E2) * (E3)
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc_node(
- E1 * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
)
// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@
(
kzalloc_node(sizeof(THING) * C2, ...)
|
kzalloc_node(sizeof(TYPE) * C2, ...)
|
kzalloc_node(C1 * C2 * C3, ...)
|
kzalloc_node(C1 * C2, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * (E2)
+ E2, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * E2
+ E2, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * (E2)
+ E2, sizeof(THING)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * E2
+ E2, sizeof(THING)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- (E1) * E2
+ E1, E2
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- (E1) * (E2)
+ E1, E2
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- E1 * E2
+ E1, E2
, ...)
)
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 21:04:20 +00:00
|
|
|
tags->static_rqs = kcalloc_node(nr_tags, sizeof(struct request *),
|
|
|
|
GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
|
|
|
|
node);
|
2017-01-13 21:39:30 +00:00
|
|
|
if (!tags->static_rqs) {
|
|
|
|
kfree(tags->rqs);
|
2020-08-19 15:20:22 +00:00
|
|
|
blk_mq_free_tags(tags, flags);
|
2017-01-13 21:39:30 +00:00
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2017-01-11 21:29:56 +00:00
|
|
|
return tags;
|
|
|
|
}
|
|
|
|
|
blk-mq: replace timeout synchronization with a RCU and generation based scheme
Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules. Unfortunately, it contains quite a few holes.
There's a complex dancing around REQ_ATOM_STARTED and
REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
they don't have a synchronization point across request recycle
instances and it isn't clear what the barriers add.
blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
In fact, it's pretty easy to make blk_mq_check_expired() terminate a
later instance of a request. If we induce 5 sec delay before
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly. Nothing actually timed out. It
just made the call on a recycle instance of a request and then
terminated a later instance long after the original instance finished.
The scenario isn't theoretical either.
This patch replaces the broken synchronization mechanism with a RCU
and generation number based one.
1. Each request has a u64 generation + state value, which can be
updated only by the request owner. Whenever a request becomes
in-flight, the generation number gets bumped up too. This provides
the basis for the timeout path to distinguish different recycle
instances of the request.
Also, marking a request in-flight and setting its deadline are
protected with a seqcount so that the timeout path can fetch both
values coherently.
2. The timeout path fetches the generation, state and deadline. If
the verdict is timeout, it records the generation into a dedicated
request abortion field and does RCU wait.
3. The completion path is also protected by RCU (from the previous
patch) and checks whether the current generation number and state
match the abortion field. If so, it skips completion.
4. The timeout path, after RCU wait, scans requests again and
terminates the ones whose generation and state still match the ones
requested for abortion.
By now, the timeout path knows that either the generation number
and state changed if it lost the race or the completion will yield
to it and can safely timeout the request.
While it's more lines of code, it's conceptually simpler, doesn't
depend on direct use of subtle memory ordering or coherence, and
hopefully doesn't terminate the wrong instance.
While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
removed yet as it's still used in other places. Future patches will
move all state tracking to the new mechanism and remove all bitops in
the hot paths.
Note that this patch adds a comment explaining a race condition in
BLK_EH_RESET_TIMER path. The race has always been there and this
patch doesn't change it. It's just documenting the existing race.
v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
- s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
- READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
v3: - Fixed possible extended seqcount / u64_stats_sync read looping
spotted by Peter.
- MQ_RQ_IDLE was incorrectly being set in complete_request instead
of free_request. Fixed.
v4: - Rebased on top of hctx_lock() refactoring patch.
- Added comment explaining the use of hctx_lock() in completion path.
v5: - Added comments requested by Bart.
- Note the addition of BLK_EH_RESET_TIMER race condition in the
commit message.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09 16:29:48 +00:00
|
|
|
static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
|
|
|
|
unsigned int hctx_idx, int node)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
if (set->ops->init_request) {
|
|
|
|
ret = set->ops->init_request(set, rq, hctx_idx, node);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2018-05-29 13:52:28 +00:00
|
|
|
WRITE_ONCE(rq->state, MQ_RQ_IDLE);
|
blk-mq: replace timeout synchronization with a RCU and generation based scheme
Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules. Unfortunately, it contains quite a few holes.
There's a complex dancing around REQ_ATOM_STARTED and
REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
they don't have a synchronization point across request recycle
instances and it isn't clear what the barriers add.
blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
In fact, it's pretty easy to make blk_mq_check_expired() terminate a
later instance of a request. If we induce 5 sec delay before
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly. Nothing actually timed out. It
just made the call on a recycle instance of a request and then
terminated a later instance long after the original instance finished.
The scenario isn't theoretical either.
This patch replaces the broken synchronization mechanism with a RCU
and generation number based one.
1. Each request has a u64 generation + state value, which can be
updated only by the request owner. Whenever a request becomes
in-flight, the generation number gets bumped up too. This provides
the basis for the timeout path to distinguish different recycle
instances of the request.
Also, marking a request in-flight and setting its deadline are
protected with a seqcount so that the timeout path can fetch both
values coherently.
2. The timeout path fetches the generation, state and deadline. If
the verdict is timeout, it records the generation into a dedicated
request abortion field and does RCU wait.
3. The completion path is also protected by RCU (from the previous
patch) and checks whether the current generation number and state
match the abortion field. If so, it skips completion.
4. The timeout path, after RCU wait, scans requests again and
terminates the ones whose generation and state still match the ones
requested for abortion.
By now, the timeout path knows that either the generation number
and state changed if it lost the race or the completion will yield
to it and can safely timeout the request.
While it's more lines of code, it's conceptually simpler, doesn't
depend on direct use of subtle memory ordering or coherence, and
hopefully doesn't terminate the wrong instance.
While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
removed yet as it's still used in other places. Future patches will
move all state tracking to the new mechanism and remove all bitops in
the hot paths.
Note that this patch adds a comment explaining a race condition in
BLK_EH_RESET_TIMER path. The race has always been there and this
patch doesn't change it. It's just documenting the existing race.
v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
- s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
- READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
v3: - Fixed possible extended seqcount / u64_stats_sync read looping
spotted by Peter.
- MQ_RQ_IDLE was incorrectly being set in complete_request instead
of free_request. Fixed.
v4: - Rebased on top of hctx_lock() refactoring patch.
- Added comment explaining the use of hctx_lock() in completion path.
v5: - Added comments requested by Bart.
- Note the addition of BLK_EH_RESET_TIMER race condition in the
commit message.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09 16:29:48 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2017-01-11 21:29:56 +00:00
|
|
|
int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
|
|
|
|
unsigned int hctx_idx, unsigned int depth)
|
|
|
|
{
|
|
|
|
unsigned int i, j, entries_per_page, max_order = 4;
|
|
|
|
size_t rq_size, left;
|
2017-02-01 17:53:14 +00:00
|
|
|
int node;
|
|
|
|
|
2019-02-27 13:35:01 +00:00
|
|
|
node = blk_mq_hw_queue_to_node(&set->map[HCTX_TYPE_DEFAULT], hctx_idx);
|
2017-02-01 17:53:14 +00:00
|
|
|
if (node == NUMA_NO_NODE)
|
|
|
|
node = set->numa_node;
|
2017-01-11 21:29:56 +00:00
|
|
|
|
|
|
|
INIT_LIST_HEAD(&tags->page_list);
|
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
/*
|
|
|
|
* rq_size is the size of the request plus driver payload, rounded
|
|
|
|
* to the cacheline size
|
|
|
|
*/
|
2014-04-15 20:14:00 +00:00
|
|
|
rq_size = round_up(sizeof(struct request) + set->cmd_size,
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
cache_line_size());
|
2017-01-11 21:29:56 +00:00
|
|
|
left = rq_size * depth;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
2017-01-11 21:29:56 +00:00
|
|
|
for (i = 0; i < depth; ) {
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
int this_order = max_order;
|
|
|
|
struct page *page;
|
|
|
|
int to_do;
|
|
|
|
void *p;
|
|
|
|
|
2016-05-16 15:54:47 +00:00
|
|
|
while (this_order && left < order_to_size(this_order - 1))
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
this_order--;
|
|
|
|
|
|
|
|
do {
|
2017-02-01 17:53:14 +00:00
|
|
|
page = alloc_pages_node(node,
|
2016-12-06 15:31:44 +00:00
|
|
|
GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO,
|
2014-09-10 15:02:03 +00:00
|
|
|
this_order);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
if (page)
|
|
|
|
break;
|
|
|
|
if (!this_order--)
|
|
|
|
break;
|
|
|
|
if (order_to_size(this_order) < rq_size)
|
|
|
|
break;
|
|
|
|
} while (1);
|
|
|
|
|
|
|
|
if (!page)
|
2014-04-15 20:14:00 +00:00
|
|
|
goto fail;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
|
|
|
page->private = this_order;
|
2014-04-15 20:14:00 +00:00
|
|
|
list_add_tail(&page->lru, &tags->page_list);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
|
|
|
p = page_address(page);
|
2015-09-14 17:16:02 +00:00
|
|
|
/*
|
|
|
|
* Allow kmemleak to scan these pages as they contain pointers
|
|
|
|
* to additional allocations like via ops->init_request().
|
|
|
|
*/
|
2016-12-06 15:31:44 +00:00
|
|
|
kmemleak_alloc(p, order_to_size(this_order), 1, GFP_NOIO);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
entries_per_page = order_to_size(this_order) / rq_size;
|
2017-01-11 21:29:56 +00:00
|
|
|
to_do = min(entries_per_page, depth - i);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
left -= to_do * rq_size;
|
|
|
|
for (j = 0; j < to_do; j++) {
|
2017-01-13 21:39:30 +00:00
|
|
|
struct request *rq = p;
|
|
|
|
|
|
|
|
tags->static_rqs[i] = rq;
|
blk-mq: replace timeout synchronization with a RCU and generation based scheme
Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules. Unfortunately, it contains quite a few holes.
There's a complex dancing around REQ_ATOM_STARTED and
REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
they don't have a synchronization point across request recycle
instances and it isn't clear what the barriers add.
blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
In fact, it's pretty easy to make blk_mq_check_expired() terminate a
later instance of a request. If we induce 5 sec delay before
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly. Nothing actually timed out. It
just made the call on a recycle instance of a request and then
terminated a later instance long after the original instance finished.
The scenario isn't theoretical either.
This patch replaces the broken synchronization mechanism with a RCU
and generation number based one.
1. Each request has a u64 generation + state value, which can be
updated only by the request owner. Whenever a request becomes
in-flight, the generation number gets bumped up too. This provides
the basis for the timeout path to distinguish different recycle
instances of the request.
Also, marking a request in-flight and setting its deadline are
protected with a seqcount so that the timeout path can fetch both
values coherently.
2. The timeout path fetches the generation, state and deadline. If
the verdict is timeout, it records the generation into a dedicated
request abortion field and does RCU wait.
3. The completion path is also protected by RCU (from the previous
patch) and checks whether the current generation number and state
match the abortion field. If so, it skips completion.
4. The timeout path, after RCU wait, scans requests again and
terminates the ones whose generation and state still match the ones
requested for abortion.
By now, the timeout path knows that either the generation number
and state changed if it lost the race or the completion will yield
to it and can safely timeout the request.
While it's more lines of code, it's conceptually simpler, doesn't
depend on direct use of subtle memory ordering or coherence, and
hopefully doesn't terminate the wrong instance.
While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
removed yet as it's still used in other places. Future patches will
move all state tracking to the new mechanism and remove all bitops in
the hot paths.
Note that this patch adds a comment explaining a race condition in
BLK_EH_RESET_TIMER path. The race has always been there and this
patch doesn't change it. It's just documenting the existing race.
v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
- s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
- READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
v3: - Fixed possible extended seqcount / u64_stats_sync read looping
spotted by Peter.
- MQ_RQ_IDLE was incorrectly being set in complete_request instead
of free_request. Fixed.
v4: - Rebased on top of hctx_lock() refactoring patch.
- Added comment explaining the use of hctx_lock() in completion path.
v5: - Added comments requested by Bart.
- Note the addition of BLK_EH_RESET_TIMER race condition in the
commit message.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09 16:29:48 +00:00
|
|
|
if (blk_mq_init_request(set, rq, hctx_idx, node)) {
|
|
|
|
tags->static_rqs[i] = NULL;
|
|
|
|
goto fail;
|
2014-04-15 19:59:10 +00:00
|
|
|
}
|
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
p += rq_size;
|
|
|
|
i++;
|
|
|
|
}
|
|
|
|
}
|
2017-01-11 21:29:56 +00:00
|
|
|
return 0;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
2014-04-15 20:14:00 +00:00
|
|
|
fail:
|
2017-01-11 21:29:56 +00:00
|
|
|
blk_mq_free_rqs(set, tags, hctx_idx);
|
|
|
|
return -ENOMEM;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
}
|
|
|
|
|
2020-05-29 13:53:15 +00:00
|
|
|
struct rq_iter_data {
|
|
|
|
struct blk_mq_hw_ctx *hctx;
|
|
|
|
bool has_rq;
|
|
|
|
};
|
|
|
|
|
|
|
|
static bool blk_mq_has_request(struct request *rq, void *data, bool reserved)
|
|
|
|
{
|
|
|
|
struct rq_iter_data *iter_data = data;
|
|
|
|
|
|
|
|
if (rq->mq_hctx != iter_data->hctx)
|
|
|
|
return true;
|
|
|
|
iter_data->has_rq = true;
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
|
|
|
|
{
|
|
|
|
struct blk_mq_tags *tags = hctx->sched_tags ?
|
|
|
|
hctx->sched_tags : hctx->tags;
|
|
|
|
struct rq_iter_data data = {
|
|
|
|
.hctx = hctx,
|
|
|
|
};
|
|
|
|
|
|
|
|
blk_mq_all_tag_iter(tags, blk_mq_has_request, &data);
|
|
|
|
return data.has_rq;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool blk_mq_last_cpu_in_hctx(unsigned int cpu,
|
|
|
|
struct blk_mq_hw_ctx *hctx)
|
|
|
|
{
|
|
|
|
if (cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) != cpu)
|
|
|
|
return false;
|
|
|
|
if (cpumask_next_and(cpu, hctx->cpumask, cpu_online_mask) < nr_cpu_ids)
|
|
|
|
return false;
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
|
|
|
|
{
|
|
|
|
struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
|
|
|
|
struct blk_mq_hw_ctx, cpuhp_online);
|
|
|
|
|
|
|
|
if (!cpumask_test_cpu(cpu, hctx->cpumask) ||
|
|
|
|
!blk_mq_last_cpu_in_hctx(cpu, hctx))
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Prevent new request from being allocated on the current hctx.
|
|
|
|
*
|
|
|
|
* The smp_mb__after_atomic() Pairs with the implied barrier in
|
|
|
|
* test_and_set_bit_lock in sbitmap_get(). Ensures the inactive flag is
|
|
|
|
* seen once we return from the tag allocator.
|
|
|
|
*/
|
|
|
|
set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
|
|
|
|
smp_mb__after_atomic();
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Try to grab a reference to the queue and wait for any outstanding
|
|
|
|
* requests. If we could not grab a reference the queue has been
|
|
|
|
* frozen and there are no requests.
|
|
|
|
*/
|
|
|
|
if (percpu_ref_tryget(&hctx->queue->q_usage_counter)) {
|
|
|
|
while (blk_mq_hctx_has_requests(hctx))
|
|
|
|
msleep(5);
|
|
|
|
percpu_ref_put(&hctx->queue->q_usage_counter);
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
|
|
|
|
{
|
|
|
|
struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
|
|
|
|
struct blk_mq_hw_ctx, cpuhp_online);
|
|
|
|
|
|
|
|
if (cpumask_test_cpu(cpu, hctx->cpumask))
|
|
|
|
clear_bit(BLK_MQ_S_INACTIVE, &hctx->state);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2016-08-24 21:34:35 +00:00
|
|
|
/*
|
|
|
|
* 'cpu' is going away. splice any existing rq_list entries from this
|
|
|
|
* software queue to the hw queue dispatch list, and ensure that it
|
|
|
|
* gets run.
|
|
|
|
*/
|
2016-09-22 14:05:17 +00:00
|
|
|
static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
|
2014-05-21 20:01:15 +00:00
|
|
|
{
|
2016-09-22 14:05:17 +00:00
|
|
|
struct blk_mq_hw_ctx *hctx;
|
2014-05-21 20:01:15 +00:00
|
|
|
struct blk_mq_ctx *ctx;
|
|
|
|
LIST_HEAD(tmp);
|
2018-12-17 15:44:05 +00:00
|
|
|
enum hctx_type type;
|
2014-05-21 20:01:15 +00:00
|
|
|
|
2016-09-22 14:05:17 +00:00
|
|
|
hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
|
2020-05-29 13:53:15 +00:00
|
|
|
if (!cpumask_test_cpu(cpu, hctx->cpumask))
|
|
|
|
return 0;
|
|
|
|
|
2016-08-24 21:34:35 +00:00
|
|
|
ctx = __blk_mq_get_ctx(hctx->queue, cpu);
|
2018-12-17 15:44:05 +00:00
|
|
|
type = hctx->type;
|
2014-05-21 20:01:15 +00:00
|
|
|
|
|
|
|
spin_lock(&ctx->lock);
|
2018-12-17 15:44:05 +00:00
|
|
|
if (!list_empty(&ctx->rq_lists[type])) {
|
|
|
|
list_splice_init(&ctx->rq_lists[type], &tmp);
|
2014-05-21 20:01:15 +00:00
|
|
|
blk_mq_hctx_clear_pending(hctx, ctx);
|
|
|
|
}
|
|
|
|
spin_unlock(&ctx->lock);
|
|
|
|
|
|
|
|
if (list_empty(&tmp))
|
2016-09-22 14:05:17 +00:00
|
|
|
return 0;
|
2014-05-21 20:01:15 +00:00
|
|
|
|
2016-08-24 21:34:35 +00:00
|
|
|
spin_lock(&hctx->lock);
|
|
|
|
list_splice_tail_init(&tmp, &hctx->dispatch);
|
|
|
|
spin_unlock(&hctx->lock);
|
2014-05-21 20:01:15 +00:00
|
|
|
|
|
|
|
blk_mq_run_hw_queue(hctx, true);
|
2016-09-22 14:05:17 +00:00
|
|
|
return 0;
|
2014-05-21 20:01:15 +00:00
|
|
|
}
|
|
|
|
|
2016-09-22 14:05:17 +00:00
|
|
|
static void blk_mq_remove_cpuhp(struct blk_mq_hw_ctx *hctx)
|
2014-05-21 20:01:15 +00:00
|
|
|
{
|
2020-05-29 13:53:15 +00:00
|
|
|
if (!(hctx->flags & BLK_MQ_F_STACKING))
|
|
|
|
cpuhp_state_remove_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
|
|
|
|
&hctx->cpuhp_online);
|
2016-09-22 14:05:17 +00:00
|
|
|
cpuhp_state_remove_instance_nocalls(CPUHP_BLK_MQ_DEAD,
|
|
|
|
&hctx->cpuhp_dead);
|
2014-05-21 20:01:15 +00:00
|
|
|
}
|
|
|
|
|
2021-05-11 15:22:36 +00:00
|
|
|
/*
|
|
|
|
* Before freeing hw queue, clearing the flush request reference in
|
|
|
|
* tags->rqs[] for avoiding potential UAF.
|
|
|
|
*/
|
|
|
|
static void blk_mq_clear_flush_rq_mapping(struct blk_mq_tags *tags,
|
|
|
|
unsigned int queue_depth, struct request *flush_rq)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
unsigned long flags;
|
|
|
|
|
|
|
|
/* The hw queue may not be mapped yet */
|
|
|
|
if (!tags)
|
|
|
|
return;
|
|
|
|
|
|
|
|
WARN_ON_ONCE(refcount_read(&flush_rq->ref) != 0);
|
|
|
|
|
|
|
|
for (i = 0; i < queue_depth; i++)
|
|
|
|
cmpxchg(&tags->rqs[i], flush_rq, NULL);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Wait until all pending iteration is done.
|
|
|
|
*
|
|
|
|
* Request reference is cleared and it is guaranteed to be observed
|
|
|
|
* after the ->lock is released.
|
|
|
|
*/
|
|
|
|
spin_lock_irqsave(&tags->lock, flags);
|
|
|
|
spin_unlock_irqrestore(&tags->lock, flags);
|
|
|
|
}
|
|
|
|
|
2015-06-04 14:25:04 +00:00
|
|
|
/* hctx->ctxs will be freed in queue's release handler */
|
2014-09-25 15:23:38 +00:00
|
|
|
static void blk_mq_exit_hctx(struct request_queue *q,
|
|
|
|
struct blk_mq_tag_set *set,
|
|
|
|
struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
|
|
|
|
{
|
2021-05-11 15:22:36 +00:00
|
|
|
struct request *flush_rq = hctx->fq->flush_rq;
|
|
|
|
|
2018-01-09 13:28:29 +00:00
|
|
|
if (blk_mq_hw_queue_mapped(hctx))
|
|
|
|
blk_mq_tag_idle(hctx);
|
2014-09-25 15:23:38 +00:00
|
|
|
|
2021-05-11 15:22:36 +00:00
|
|
|
blk_mq_clear_flush_rq_mapping(set->tags[hctx_idx],
|
|
|
|
set->queue_depth, flush_rq);
|
2014-09-25 15:23:47 +00:00
|
|
|
if (set->ops->exit_request)
|
2021-05-11 15:22:36 +00:00
|
|
|
set->ops->exit_request(set, flush_rq, hctx_idx);
|
2014-09-25 15:23:47 +00:00
|
|
|
|
2014-09-25 15:23:38 +00:00
|
|
|
if (set->ops->exit_hctx)
|
|
|
|
set->ops->exit_hctx(hctx, hctx_idx);
|
|
|
|
|
2016-09-22 14:05:17 +00:00
|
|
|
blk_mq_remove_cpuhp(hctx);
|
blk-mq: always free hctx after request queue is freed
In normal queue cleanup path, hctx is released after request queue
is freed, see blk_mq_release().
However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because
of hw queues shrinking. This way is easy to cause use-after-free,
because: one implicit rule is that it is safe to call almost all block
layer APIs if the request queue is alive; and one hctx may be retrieved
by one API, then the hctx can be freed by blk_mq_update_nr_hw_queues();
finally use-after-free is triggered.
Fixes this issue by always freeing hctx after releasing request queue.
If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce
a per-queue list to hold them, then try to resuse these hctxs if numa
node is matched.
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 01:52:27 +00:00
|
|
|
|
|
|
|
spin_lock(&q->unused_hctx_lock);
|
|
|
|
list_add(&hctx->hctx_list, &q->unused_hctx_list);
|
|
|
|
spin_unlock(&q->unused_hctx_lock);
|
2014-09-25 15:23:38 +00:00
|
|
|
}
|
|
|
|
|
2014-05-27 15:35:13 +00:00
|
|
|
static void blk_mq_exit_hw_queues(struct request_queue *q,
|
|
|
|
struct blk_mq_tag_set *set, int nr_queue)
|
|
|
|
{
|
|
|
|
struct blk_mq_hw_ctx *hctx;
|
|
|
|
unsigned int i;
|
|
|
|
|
|
|
|
queue_for_each_hw_ctx(q, hctx, i) {
|
|
|
|
if (i == nr_queue)
|
|
|
|
break;
|
2018-10-12 10:07:25 +00:00
|
|
|
blk_mq_debugfs_unregister_hctx(hctx);
|
2014-09-25 15:23:38 +00:00
|
|
|
blk_mq_exit_hctx(q, set, hctx, i);
|
2014-05-27 15:35:13 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2019-04-30 01:52:26 +00:00
|
|
|
static int blk_mq_hw_ctx_size(struct blk_mq_tag_set *tag_set)
|
|
|
|
{
|
|
|
|
int hw_ctx_size = sizeof(struct blk_mq_hw_ctx);
|
|
|
|
|
|
|
|
BUILD_BUG_ON(ALIGN(offsetof(struct blk_mq_hw_ctx, srcu),
|
|
|
|
__alignof__(struct blk_mq_hw_ctx)) !=
|
|
|
|
sizeof(struct blk_mq_hw_ctx));
|
|
|
|
|
|
|
|
if (tag_set->flags & BLK_MQ_F_BLOCKING)
|
|
|
|
hw_ctx_size += sizeof(struct srcu_struct);
|
|
|
|
|
|
|
|
return hw_ctx_size;
|
|
|
|
}
|
|
|
|
|
2014-09-25 15:23:38 +00:00
|
|
|
static int blk_mq_init_hctx(struct request_queue *q,
|
|
|
|
struct blk_mq_tag_set *set,
|
|
|
|
struct blk_mq_hw_ctx *hctx, unsigned hctx_idx)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
{
|
2019-04-30 01:52:26 +00:00
|
|
|
hctx->queue_num = hctx_idx;
|
|
|
|
|
2020-05-29 13:53:15 +00:00
|
|
|
if (!(hctx->flags & BLK_MQ_F_STACKING))
|
|
|
|
cpuhp_state_add_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
|
|
|
|
&hctx->cpuhp_online);
|
2019-04-30 01:52:26 +00:00
|
|
|
cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
|
|
|
|
|
|
|
|
hctx->tags = set->tags[hctx_idx];
|
|
|
|
|
|
|
|
if (set->ops->init_hctx &&
|
|
|
|
set->ops->init_hctx(hctx, set->driver_data, hctx_idx))
|
|
|
|
goto unregister_cpu_notifier;
|
2014-09-25 15:23:38 +00:00
|
|
|
|
2019-04-30 01:52:26 +00:00
|
|
|
if (blk_mq_init_request(set, hctx->fq->flush_rq, hctx_idx,
|
|
|
|
hctx->numa_node))
|
|
|
|
goto exit_hctx;
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
exit_hctx:
|
|
|
|
if (set->ops->exit_hctx)
|
|
|
|
set->ops->exit_hctx(hctx, hctx_idx);
|
|
|
|
unregister_cpu_notifier:
|
|
|
|
blk_mq_remove_cpuhp(hctx);
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct blk_mq_hw_ctx *
|
|
|
|
blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
|
|
|
|
int node)
|
|
|
|
{
|
|
|
|
struct blk_mq_hw_ctx *hctx;
|
|
|
|
gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY;
|
|
|
|
|
|
|
|
hctx = kzalloc_node(blk_mq_hw_ctx_size(set), gfp, node);
|
|
|
|
if (!hctx)
|
|
|
|
goto fail_alloc_hctx;
|
|
|
|
|
|
|
|
if (!zalloc_cpumask_var_node(&hctx->cpumask, gfp, node))
|
|
|
|
goto free_hctx;
|
|
|
|
|
|
|
|
atomic_set(&hctx->nr_active, 0);
|
2014-09-25 15:23:38 +00:00
|
|
|
if (node == NUMA_NO_NODE)
|
2019-04-30 01:52:26 +00:00
|
|
|
node = set->numa_node;
|
|
|
|
hctx->numa_node = node;
|
2014-09-25 15:23:38 +00:00
|
|
|
|
2017-04-10 15:54:54 +00:00
|
|
|
INIT_DELAYED_WORK(&hctx->run_work, blk_mq_run_work_fn);
|
2014-09-25 15:23:38 +00:00
|
|
|
spin_lock_init(&hctx->lock);
|
|
|
|
INIT_LIST_HEAD(&hctx->dispatch);
|
|
|
|
hctx->queue = q;
|
2020-08-19 15:20:19 +00:00
|
|
|
hctx->flags = set->flags & ~BLK_MQ_F_TAG_QUEUE_SHARED;
|
2014-09-25 15:23:38 +00:00
|
|
|
|
blk-mq: always free hctx after request queue is freed
In normal queue cleanup path, hctx is released after request queue
is freed, see blk_mq_release().
However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because
of hw queues shrinking. This way is easy to cause use-after-free,
because: one implicit rule is that it is safe to call almost all block
layer APIs if the request queue is alive; and one hctx may be retrieved
by one API, then the hctx can be freed by blk_mq_update_nr_hw_queues();
finally use-after-free is triggered.
Fixes this issue by always freeing hctx after releasing request queue.
If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce
a per-queue list to hold them, then try to resuse these hctxs if numa
node is matched.
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 01:52:27 +00:00
|
|
|
INIT_LIST_HEAD(&hctx->hctx_list);
|
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
/*
|
2014-09-25 15:23:38 +00:00
|
|
|
* Allocate space for all possible cpus to avoid allocation at
|
|
|
|
* runtime
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
*/
|
2017-11-16 01:32:33 +00:00
|
|
|
hctx->ctxs = kmalloc_array_node(nr_cpu_ids, sizeof(void *),
|
2019-04-30 01:52:26 +00:00
|
|
|
gfp, node);
|
2014-09-25 15:23:38 +00:00
|
|
|
if (!hctx->ctxs)
|
2019-04-30 01:52:26 +00:00
|
|
|
goto free_cpumask;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
2018-10-12 10:07:26 +00:00
|
|
|
if (sbitmap_init_node(&hctx->ctx_map, nr_cpu_ids, ilog2(8),
|
2021-01-22 02:33:08 +00:00
|
|
|
gfp, node, false, false))
|
2014-09-25 15:23:38 +00:00
|
|
|
goto free_ctxs;
|
|
|
|
hctx->nr_ctx = 0;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
2018-06-25 11:31:47 +00:00
|
|
|
spin_lock_init(&hctx->dispatch_wait_lock);
|
2017-11-09 15:32:43 +00:00
|
|
|
init_waitqueue_func_entry(&hctx->dispatch_wait, blk_mq_dispatch_wake);
|
|
|
|
INIT_LIST_HEAD(&hctx->dispatch_wait.entry);
|
|
|
|
|
2020-03-09 21:41:37 +00:00
|
|
|
hctx->fq = blk_alloc_flush_queue(hctx->numa_node, set->cmd_size, gfp);
|
2014-09-25 15:23:47 +00:00
|
|
|
if (!hctx->fq)
|
2019-04-30 01:52:26 +00:00
|
|
|
goto free_bitmap;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
2016-11-02 16:09:51 +00:00
|
|
|
if (hctx->flags & BLK_MQ_F_BLOCKING)
|
2018-01-09 16:29:53 +00:00
|
|
|
init_srcu_struct(hctx->srcu);
|
2019-04-30 01:52:26 +00:00
|
|
|
blk_mq_hctx_kobj_init(hctx);
|
2016-11-02 16:09:51 +00:00
|
|
|
|
2019-04-30 01:52:26 +00:00
|
|
|
return hctx;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
2014-09-25 15:23:38 +00:00
|
|
|
free_bitmap:
|
2016-09-17 14:38:44 +00:00
|
|
|
sbitmap_free(&hctx->ctx_map);
|
2014-09-25 15:23:38 +00:00
|
|
|
free_ctxs:
|
|
|
|
kfree(hctx->ctxs);
|
2019-04-30 01:52:26 +00:00
|
|
|
free_cpumask:
|
|
|
|
free_cpumask_var(hctx->cpumask);
|
|
|
|
free_hctx:
|
|
|
|
kfree(hctx);
|
|
|
|
fail_alloc_hctx:
|
|
|
|
return NULL;
|
2014-09-25 15:23:38 +00:00
|
|
|
}
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
|
|
|
static void blk_mq_init_cpu_queues(struct request_queue *q,
|
|
|
|
unsigned int nr_hw_queues)
|
|
|
|
{
|
2018-10-30 16:36:06 +00:00
|
|
|
struct blk_mq_tag_set *set = q->tag_set;
|
|
|
|
unsigned int i, j;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
|
|
|
for_each_possible_cpu(i) {
|
|
|
|
struct blk_mq_ctx *__ctx = per_cpu_ptr(q->queue_ctx, i);
|
|
|
|
struct blk_mq_hw_ctx *hctx;
|
2018-12-17 15:44:05 +00:00
|
|
|
int k;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
|
|
|
__ctx->cpu = i;
|
|
|
|
spin_lock_init(&__ctx->lock);
|
2018-12-17 15:44:05 +00:00
|
|
|
for (k = HCTX_TYPE_DEFAULT; k < HCTX_MAX_TYPES; k++)
|
|
|
|
INIT_LIST_HEAD(&__ctx->rq_lists[k]);
|
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
__ctx->queue = q;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Set local node, IFF we have more than one hw queue. If
|
|
|
|
* not, we remain on the home node of the device
|
|
|
|
*/
|
2018-10-30 16:36:06 +00:00
|
|
|
for (j = 0; j < set->nr_maps; j++) {
|
|
|
|
hctx = blk_mq_map_queue_type(q, j, i);
|
|
|
|
if (nr_hw_queues > 1 && hctx->numa_node == NUMA_NO_NODE)
|
2020-10-19 08:20:47 +00:00
|
|
|
hctx->numa_node = cpu_to_node(i);
|
2018-10-30 16:36:06 +00:00
|
|
|
}
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2020-05-07 13:04:22 +00:00
|
|
|
static bool __blk_mq_alloc_map_and_request(struct blk_mq_tag_set *set,
|
|
|
|
int hctx_idx)
|
2017-01-11 21:29:56 +00:00
|
|
|
{
|
2020-08-19 15:20:22 +00:00
|
|
|
unsigned int flags = set->flags;
|
2017-01-11 21:29:56 +00:00
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
set->tags[hctx_idx] = blk_mq_alloc_rq_map(set, hctx_idx,
|
2020-08-19 15:20:22 +00:00
|
|
|
set->queue_depth, set->reserved_tags, flags);
|
2017-01-11 21:29:56 +00:00
|
|
|
if (!set->tags[hctx_idx])
|
|
|
|
return false;
|
|
|
|
|
|
|
|
ret = blk_mq_alloc_rqs(set, set->tags[hctx_idx], hctx_idx,
|
|
|
|
set->queue_depth);
|
|
|
|
if (!ret)
|
|
|
|
return true;
|
|
|
|
|
2020-08-19 15:20:22 +00:00
|
|
|
blk_mq_free_rq_map(set->tags[hctx_idx], flags);
|
2017-01-11 21:29:56 +00:00
|
|
|
set->tags[hctx_idx] = NULL;
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void blk_mq_free_map_and_requests(struct blk_mq_tag_set *set,
|
|
|
|
unsigned int hctx_idx)
|
|
|
|
{
|
2020-08-19 15:20:22 +00:00
|
|
|
unsigned int flags = set->flags;
|
|
|
|
|
2018-11-29 10:56:54 +00:00
|
|
|
if (set->tags && set->tags[hctx_idx]) {
|
2017-01-17 13:03:22 +00:00
|
|
|
blk_mq_free_rqs(set, set->tags[hctx_idx], hctx_idx);
|
2020-08-19 15:20:22 +00:00
|
|
|
blk_mq_free_rq_map(set->tags[hctx_idx], flags);
|
2017-01-17 13:03:22 +00:00
|
|
|
set->tags[hctx_idx] = NULL;
|
|
|
|
}
|
2017-01-11 21:29:56 +00:00
|
|
|
}
|
|
|
|
|
2017-06-26 10:20:57 +00:00
|
|
|
static void blk_mq_map_swqueue(struct request_queue *q)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
{
|
2018-10-30 16:36:06 +00:00
|
|
|
unsigned int i, j, hctx_idx;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
struct blk_mq_hw_ctx *hctx;
|
|
|
|
struct blk_mq_ctx *ctx;
|
2015-04-21 02:00:20 +00:00
|
|
|
struct blk_mq_tag_set *set = q->tag_set;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
|
|
|
queue_for_each_hw_ctx(q, hctx, i) {
|
2014-04-09 16:18:23 +00:00
|
|
|
cpumask_clear(hctx->cpumask);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
hctx->nr_ctx = 0;
|
2018-05-18 14:32:30 +00:00
|
|
|
hctx->dispatch_from = NULL;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2017-06-26 10:20:57 +00:00
|
|
|
* Map software to hardware queues.
|
2018-04-24 20:01:44 +00:00
|
|
|
*
|
|
|
|
* If the cpu isn't present, the cpu is mapped to first hctx.
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
*/
|
2018-01-12 02:53:06 +00:00
|
|
|
for_each_possible_cpu(i) {
|
2018-04-24 20:01:44 +00:00
|
|
|
|
2016-03-19 10:30:33 +00:00
|
|
|
ctx = per_cpu_ptr(q->queue_ctx, i);
|
2018-10-30 16:36:06 +00:00
|
|
|
for (j = 0; j < set->nr_maps; j++) {
|
2019-01-24 10:25:33 +00:00
|
|
|
if (!set->map[j].nr_queues) {
|
|
|
|
ctx->hctxs[j] = blk_mq_map_queue_type(q,
|
|
|
|
HCTX_TYPE_DEFAULT, i);
|
2018-12-17 17:28:56 +00:00
|
|
|
continue;
|
2019-01-24 10:25:33 +00:00
|
|
|
}
|
block: alloc map and request for new hardware queue
Alloc new map and request for new hardware queue when increse
hardware queue count. Before this patch, it will show a
warning for each new hardware queue, but it's not enough, these
hctx have no maps and reqeust, when a bio was mapped to these
hardware queue, it will trigger kernel panic when get request
from these hctx.
Test environment:
* A NVMe disk supports 128 io queues
* 96 cpus in system
A corner case can always trigger this panic, there are 96
io queues allocated for HCTX_TYPE_DEFAULT type, the corresponding kernel
log: nvme nvme0: 96/0/0 default/read/poll queues. Now we set nvme write
queues to 96, then nvme will alloc others(32) queues for read, but
blk_mq_update_nr_hw_queues does not alloc map and request for these new
added io queues. So when process read nvme disk, it will trigger kernel
panic when get request from these hardware context.
Reproduce script:
nr=$(expr `cat /sys/block/nvme0n1/device/queue_count` - 1)
echo $nr > /sys/module/nvme/parameters/write_queues
echo 1 > /sys/block/nvme0n1/device/reset_controller
dd if=/dev/nvme0n1 of=/dev/null bs=4K count=1
[ 8040.805626] ------------[ cut here ]------------
[ 8040.805627] WARNING: CPU: 82 PID: 12921 at block/blk-mq.c:2578 blk_mq_map_swqueue+0x2b6/0x2c0
[ 8040.805627] Modules linked in: nvme nvme_core nf_conntrack_netlink xt_addrtype br_netfilter overlay xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nft_counter nf_nat_tftp nf_conntrack_tftp nft_masq nf_tables_set nft_fib_inet nft_f
ib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack tun bridge nf_defrag_ipv6 nf_defrag_ipv4 stp llc ip6_tables ip_tables nft_compat rfkill ip_set nf_tables nfne
tlink sunrpc intel_rapl_msr intel_rapl_common skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass ipmi_ssif crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support ghash_clmulni_intel intel_
cstate intel_uncore raid0 joydev intel_rapl_perf ipmi_si pcspkr mei_me ioatdma sg ipmi_devintf mei i2c_i801 dca lpc_ich ipmi_msghandler acpi_power_meter acpi_pad xfs libcrc32c sd_mod ast i2c_algo_bit drm_vram_helper drm_ttm_helper ttm d
rm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
[ 8040.805637] ahci drm i40e libahci crc32c_intel libata t10_pi wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nvme_core]
[ 8040.805640] CPU: 82 PID: 12921 Comm: kworker/u194:2 Kdump: loaded Tainted: G W 5.6.0-rc5.78317c+ #2
[ 8040.805640] Hardware name: Inspur SA5212M5/YZMB-00882-104, BIOS 4.0.9 08/27/2019
[ 8040.805641] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
[ 8040.805642] RIP: 0010:blk_mq_map_swqueue+0x2b6/0x2c0
[ 8040.805643] Code: 00 00 00 00 00 41 83 c5 01 44 39 6d 50 77 b8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b bb 98 00 00 00 89 d6 e8 8c 81 03 00 eb 83 <0f> 0b e9 52 ff ff ff 0f 1f 00 0f 1f 44 00 00 41 57 48 89 f1 41 56
[ 8040.805643] RSP: 0018:ffffba590d2e7d48 EFLAGS: 00010246
[ 8040.805643] RAX: 0000000000000000 RBX: ffff9f013e1ba800 RCX: 000000000000003d
[ 8040.805644] RDX: ffff9f00ffff6000 RSI: 0000000000000003 RDI: ffff9ed200246d90
[ 8040.805644] RBP: ffff9f00f6a79860 R08: 0000000000000000 R09: 000000000000003d
[ 8040.805645] R10: 0000000000000001 R11: ffff9f0138c3d000 R12: ffff9f00fb3a9008
[ 8040.805645] R13: 000000000000007f R14: ffffffff96822660 R15: 000000000000005f
[ 8040.805645] FS: 0000000000000000(0000) GS:ffff9f013fa80000(0000) knlGS:0000000000000000
[ 8040.805646] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8040.805646] CR2: 00007f7f397fa6f8 CR3: 0000003d8240a002 CR4: 00000000007606e0
[ 8040.805647] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8040.805647] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 8040.805647] PKRU: 55555554
[ 8040.805647] Call Trace:
[ 8040.805649] blk_mq_update_nr_hw_queues+0x31b/0x390
[ 8040.805650] nvme_reset_work+0xb4b/0xeab [nvme]
[ 8040.805651] process_one_work+0x1a7/0x370
[ 8040.805652] worker_thread+0x1c9/0x380
[ 8040.805653] ? max_active_store+0x80/0x80
[ 8040.805655] kthread+0x112/0x130
[ 8040.805656] ? __kthread_parkme+0x70/0x70
[ 8040.805657] ret_from_fork+0x35/0x40
[ 8040.805658] ---[ end trace b5f13b1e73ccb5d3 ]---
[ 8229.365135] BUG: kernel NULL pointer dereference, address: 0000000000000004
[ 8229.365165] #PF: supervisor read access in kernel mode
[ 8229.365178] #PF: error_code(0x0000) - not-present page
[ 8229.365191] PGD 0 P4D 0
[ 8229.365201] Oops: 0000 [#1] SMP PTI
[ 8229.365212] CPU: 77 PID: 13024 Comm: dd Kdump: loaded Tainted: G W 5.6.0-rc5.78317c+ #2
[ 8229.365232] Hardware name: Inspur SA5212M5/YZMB-00882-104, BIOS 4.0.9 08/27/2019
[ 8229.365253] RIP: 0010:blk_mq_get_tag+0x227/0x250
[ 8229.365265] Code: 44 24 04 44 01 e0 48 8b 74 24 38 65 48 33 34 25 28 00 00 00 75 33 48 83 c4 40 5b 5d 41 5c 41 5d 41 5e c3 48 8d 68 10 4c 89 ef <44> 8b 60 04 48 89 ee e8 dd f9 ff ff 83 f8 ff 75 c8 e9 67 fe ff ff
[ 8229.365304] RSP: 0018:ffffba590e977970 EFLAGS: 00010246
[ 8229.365317] RAX: 0000000000000000 RBX: ffff9f00f6a79860 RCX: ffffba590e977998
[ 8229.365333] RDX: 0000000000000000 RSI: ffff9f012039b140 RDI: ffffba590e977a38
[ 8229.365349] RBP: 0000000000000010 R08: ffffda58ff94e190 R09: ffffda58ff94e198
[ 8229.365365] R10: 0000000000000011 R11: ffff9f00f6a79860 R12: 0000000000000000
[ 8229.365381] R13: ffffba590e977a38 R14: ffff9f012039b140 R15: 0000000000000001
[ 8229.365397] FS: 00007f481c230580(0000) GS:ffff9f013f940000(0000) knlGS:0000000000000000
[ 8229.365415] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8229.365428] CR2: 0000000000000004 CR3: 0000005f35e26004 CR4: 00000000007606e0
[ 8229.365444] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8229.365460] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 8229.365476] PKRU: 55555554
[ 8229.365484] Call Trace:
[ 8229.365498] ? finish_wait+0x80/0x80
[ 8229.365512] blk_mq_get_request+0xcb/0x3f0
[ 8229.365525] blk_mq_make_request+0x143/0x5d0
[ 8229.365538] generic_make_request+0xcf/0x310
[ 8229.365553] ? scan_shadow_nodes+0x30/0x30
[ 8229.365564] submit_bio+0x3c/0x150
[ 8229.365576] mpage_readpages+0x163/0x1a0
[ 8229.365588] ? blkdev_direct_IO+0x490/0x490
[ 8229.365601] read_pages+0x6b/0x190
[ 8229.365612] __do_page_cache_readahead+0x1c1/0x1e0
[ 8229.365626] ondemand_readahead+0x182/0x2f0
[ 8229.365639] generic_file_buffered_read+0x590/0xab0
[ 8229.365655] new_sync_read+0x12a/0x1c0
[ 8229.365666] vfs_read+0x8a/0x140
[ 8229.365676] ksys_read+0x59/0xd0
[ 8229.365688] do_syscall_64+0x55/0x1d0
[ 8229.365700] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
Tested-by: Weiping Zhang <zhangweiping@didiglobal.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-07 13:04:08 +00:00
|
|
|
hctx_idx = set->map[j].mq_map[i];
|
|
|
|
/* unmapped hw queue can be remapped after CPU topo changed */
|
|
|
|
if (!set->tags[hctx_idx] &&
|
2020-05-07 13:04:22 +00:00
|
|
|
!__blk_mq_alloc_map_and_request(set, hctx_idx)) {
|
block: alloc map and request for new hardware queue
Alloc new map and request for new hardware queue when increse
hardware queue count. Before this patch, it will show a
warning for each new hardware queue, but it's not enough, these
hctx have no maps and reqeust, when a bio was mapped to these
hardware queue, it will trigger kernel panic when get request
from these hctx.
Test environment:
* A NVMe disk supports 128 io queues
* 96 cpus in system
A corner case can always trigger this panic, there are 96
io queues allocated for HCTX_TYPE_DEFAULT type, the corresponding kernel
log: nvme nvme0: 96/0/0 default/read/poll queues. Now we set nvme write
queues to 96, then nvme will alloc others(32) queues for read, but
blk_mq_update_nr_hw_queues does not alloc map and request for these new
added io queues. So when process read nvme disk, it will trigger kernel
panic when get request from these hardware context.
Reproduce script:
nr=$(expr `cat /sys/block/nvme0n1/device/queue_count` - 1)
echo $nr > /sys/module/nvme/parameters/write_queues
echo 1 > /sys/block/nvme0n1/device/reset_controller
dd if=/dev/nvme0n1 of=/dev/null bs=4K count=1
[ 8040.805626] ------------[ cut here ]------------
[ 8040.805627] WARNING: CPU: 82 PID: 12921 at block/blk-mq.c:2578 blk_mq_map_swqueue+0x2b6/0x2c0
[ 8040.805627] Modules linked in: nvme nvme_core nf_conntrack_netlink xt_addrtype br_netfilter overlay xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nft_counter nf_nat_tftp nf_conntrack_tftp nft_masq nf_tables_set nft_fib_inet nft_f
ib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack tun bridge nf_defrag_ipv6 nf_defrag_ipv4 stp llc ip6_tables ip_tables nft_compat rfkill ip_set nf_tables nfne
tlink sunrpc intel_rapl_msr intel_rapl_common skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass ipmi_ssif crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support ghash_clmulni_intel intel_
cstate intel_uncore raid0 joydev intel_rapl_perf ipmi_si pcspkr mei_me ioatdma sg ipmi_devintf mei i2c_i801 dca lpc_ich ipmi_msghandler acpi_power_meter acpi_pad xfs libcrc32c sd_mod ast i2c_algo_bit drm_vram_helper drm_ttm_helper ttm d
rm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
[ 8040.805637] ahci drm i40e libahci crc32c_intel libata t10_pi wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nvme_core]
[ 8040.805640] CPU: 82 PID: 12921 Comm: kworker/u194:2 Kdump: loaded Tainted: G W 5.6.0-rc5.78317c+ #2
[ 8040.805640] Hardware name: Inspur SA5212M5/YZMB-00882-104, BIOS 4.0.9 08/27/2019
[ 8040.805641] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
[ 8040.805642] RIP: 0010:blk_mq_map_swqueue+0x2b6/0x2c0
[ 8040.805643] Code: 00 00 00 00 00 41 83 c5 01 44 39 6d 50 77 b8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b bb 98 00 00 00 89 d6 e8 8c 81 03 00 eb 83 <0f> 0b e9 52 ff ff ff 0f 1f 00 0f 1f 44 00 00 41 57 48 89 f1 41 56
[ 8040.805643] RSP: 0018:ffffba590d2e7d48 EFLAGS: 00010246
[ 8040.805643] RAX: 0000000000000000 RBX: ffff9f013e1ba800 RCX: 000000000000003d
[ 8040.805644] RDX: ffff9f00ffff6000 RSI: 0000000000000003 RDI: ffff9ed200246d90
[ 8040.805644] RBP: ffff9f00f6a79860 R08: 0000000000000000 R09: 000000000000003d
[ 8040.805645] R10: 0000000000000001 R11: ffff9f0138c3d000 R12: ffff9f00fb3a9008
[ 8040.805645] R13: 000000000000007f R14: ffffffff96822660 R15: 000000000000005f
[ 8040.805645] FS: 0000000000000000(0000) GS:ffff9f013fa80000(0000) knlGS:0000000000000000
[ 8040.805646] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8040.805646] CR2: 00007f7f397fa6f8 CR3: 0000003d8240a002 CR4: 00000000007606e0
[ 8040.805647] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8040.805647] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 8040.805647] PKRU: 55555554
[ 8040.805647] Call Trace:
[ 8040.805649] blk_mq_update_nr_hw_queues+0x31b/0x390
[ 8040.805650] nvme_reset_work+0xb4b/0xeab [nvme]
[ 8040.805651] process_one_work+0x1a7/0x370
[ 8040.805652] worker_thread+0x1c9/0x380
[ 8040.805653] ? max_active_store+0x80/0x80
[ 8040.805655] kthread+0x112/0x130
[ 8040.805656] ? __kthread_parkme+0x70/0x70
[ 8040.805657] ret_from_fork+0x35/0x40
[ 8040.805658] ---[ end trace b5f13b1e73ccb5d3 ]---
[ 8229.365135] BUG: kernel NULL pointer dereference, address: 0000000000000004
[ 8229.365165] #PF: supervisor read access in kernel mode
[ 8229.365178] #PF: error_code(0x0000) - not-present page
[ 8229.365191] PGD 0 P4D 0
[ 8229.365201] Oops: 0000 [#1] SMP PTI
[ 8229.365212] CPU: 77 PID: 13024 Comm: dd Kdump: loaded Tainted: G W 5.6.0-rc5.78317c+ #2
[ 8229.365232] Hardware name: Inspur SA5212M5/YZMB-00882-104, BIOS 4.0.9 08/27/2019
[ 8229.365253] RIP: 0010:blk_mq_get_tag+0x227/0x250
[ 8229.365265] Code: 44 24 04 44 01 e0 48 8b 74 24 38 65 48 33 34 25 28 00 00 00 75 33 48 83 c4 40 5b 5d 41 5c 41 5d 41 5e c3 48 8d 68 10 4c 89 ef <44> 8b 60 04 48 89 ee e8 dd f9 ff ff 83 f8 ff 75 c8 e9 67 fe ff ff
[ 8229.365304] RSP: 0018:ffffba590e977970 EFLAGS: 00010246
[ 8229.365317] RAX: 0000000000000000 RBX: ffff9f00f6a79860 RCX: ffffba590e977998
[ 8229.365333] RDX: 0000000000000000 RSI: ffff9f012039b140 RDI: ffffba590e977a38
[ 8229.365349] RBP: 0000000000000010 R08: ffffda58ff94e190 R09: ffffda58ff94e198
[ 8229.365365] R10: 0000000000000011 R11: ffff9f00f6a79860 R12: 0000000000000000
[ 8229.365381] R13: ffffba590e977a38 R14: ffff9f012039b140 R15: 0000000000000001
[ 8229.365397] FS: 00007f481c230580(0000) GS:ffff9f013f940000(0000) knlGS:0000000000000000
[ 8229.365415] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8229.365428] CR2: 0000000000000004 CR3: 0000005f35e26004 CR4: 00000000007606e0
[ 8229.365444] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8229.365460] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 8229.365476] PKRU: 55555554
[ 8229.365484] Call Trace:
[ 8229.365498] ? finish_wait+0x80/0x80
[ 8229.365512] blk_mq_get_request+0xcb/0x3f0
[ 8229.365525] blk_mq_make_request+0x143/0x5d0
[ 8229.365538] generic_make_request+0xcf/0x310
[ 8229.365553] ? scan_shadow_nodes+0x30/0x30
[ 8229.365564] submit_bio+0x3c/0x150
[ 8229.365576] mpage_readpages+0x163/0x1a0
[ 8229.365588] ? blkdev_direct_IO+0x490/0x490
[ 8229.365601] read_pages+0x6b/0x190
[ 8229.365612] __do_page_cache_readahead+0x1c1/0x1e0
[ 8229.365626] ondemand_readahead+0x182/0x2f0
[ 8229.365639] generic_file_buffered_read+0x590/0xab0
[ 8229.365655] new_sync_read+0x12a/0x1c0
[ 8229.365666] vfs_read+0x8a/0x140
[ 8229.365676] ksys_read+0x59/0xd0
[ 8229.365688] do_syscall_64+0x55/0x1d0
[ 8229.365700] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
Tested-by: Weiping Zhang <zhangweiping@didiglobal.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-07 13:04:08 +00:00
|
|
|
/*
|
|
|
|
* If tags initialization fail for some hctx,
|
|
|
|
* that hctx won't be brought online. In this
|
|
|
|
* case, remap the current ctx to hctx[0] which
|
|
|
|
* is guaranteed to always have tags allocated
|
|
|
|
*/
|
|
|
|
set->map[j].mq_map[i] = 0;
|
|
|
|
}
|
2018-12-17 17:28:56 +00:00
|
|
|
|
2018-10-30 16:36:06 +00:00
|
|
|
hctx = blk_mq_map_queue_type(q, j, i);
|
2019-01-24 10:25:32 +00:00
|
|
|
ctx->hctxs[j] = hctx;
|
2018-10-30 16:36:06 +00:00
|
|
|
/*
|
|
|
|
* If the CPU is already set in the mask, then we've
|
|
|
|
* mapped this one already. This can happen if
|
|
|
|
* devices share queues across queue maps.
|
|
|
|
*/
|
|
|
|
if (cpumask_test_cpu(i, hctx->cpumask))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
cpumask_set_cpu(i, hctx->cpumask);
|
|
|
|
hctx->type = j;
|
|
|
|
ctx->index_hw[hctx->type] = hctx->nr_ctx;
|
|
|
|
hctx->ctxs[hctx->nr_ctx++] = ctx;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the nr_ctx type overflows, we have exceeded the
|
|
|
|
* amount of sw queues we can support.
|
|
|
|
*/
|
|
|
|
BUG_ON(!hctx->nr_ctx);
|
|
|
|
}
|
2019-01-24 10:25:33 +00:00
|
|
|
|
|
|
|
for (; j < HCTX_MAX_TYPES; j++)
|
|
|
|
ctx->hctxs[j] = blk_mq_map_queue_type(q,
|
|
|
|
HCTX_TYPE_DEFAULT, i);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
}
|
2014-05-07 16:26:44 +00:00
|
|
|
|
|
|
|
queue_for_each_hw_ctx(q, hctx, i) {
|
2018-04-24 20:01:44 +00:00
|
|
|
/*
|
|
|
|
* If no software queues are mapped to this hardware queue,
|
|
|
|
* disable it and free the request entries.
|
|
|
|
*/
|
|
|
|
if (!hctx->nr_ctx) {
|
|
|
|
/* Never unmap queue 0. We need it as a
|
|
|
|
* fallback in case of a new remap fails
|
|
|
|
* allocation
|
|
|
|
*/
|
|
|
|
if (i && set->tags[i])
|
|
|
|
blk_mq_free_map_and_requests(set, i);
|
|
|
|
|
|
|
|
hctx->tags = NULL;
|
|
|
|
continue;
|
|
|
|
}
|
2014-05-21 20:01:15 +00:00
|
|
|
|
2015-04-21 02:00:20 +00:00
|
|
|
hctx->tags = set->tags[i];
|
|
|
|
WARN_ON(!hctx->tags);
|
|
|
|
|
2015-04-15 17:39:29 +00:00
|
|
|
/*
|
|
|
|
* Set the map size to the number of mapped software queues.
|
|
|
|
* This is more accurate and more efficient than looping
|
|
|
|
* over all possibly mapped software queues.
|
|
|
|
*/
|
2016-09-17 14:38:44 +00:00
|
|
|
sbitmap_resize(&hctx->ctx_map, hctx->nr_ctx);
|
2015-04-15 17:39:29 +00:00
|
|
|
|
2014-05-21 20:01:15 +00:00
|
|
|
/*
|
|
|
|
* Initialize batch roundrobin counts
|
|
|
|
*/
|
2018-04-08 09:48:10 +00:00
|
|
|
hctx->next_cpu = blk_mq_first_mapped_cpu(hctx);
|
2014-05-07 16:26:44 +00:00
|
|
|
hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
|
|
|
|
}
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
}
|
|
|
|
|
2017-06-20 23:56:13 +00:00
|
|
|
/*
|
|
|
|
* Caller needs to ensure that we're either frozen/quiesced, or that
|
|
|
|
* the queue isn't live yet.
|
|
|
|
*/
|
2015-11-03 15:40:06 +00:00
|
|
|
static void queue_set_hctx_shared(struct request_queue *q, bool shared)
|
2014-05-13 21:10:52 +00:00
|
|
|
{
|
|
|
|
struct blk_mq_hw_ctx *hctx;
|
|
|
|
int i;
|
|
|
|
|
2015-11-03 15:40:06 +00:00
|
|
|
queue_for_each_hw_ctx(q, hctx, i) {
|
blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set()
We have to remove synchronize_rcu() from blk_queue_cleanup(),
otherwise long delay can be caused during lun probe. For removing
it, we have to avoid to iterate the set->tag_list in IO path, eg,
blk_mq_sched_restart().
This patch reverts 5b79413946d (Revert "blk-mq: don't handle
TAG_SHARED in restart"). Given we have fixed enough IO hang issue,
and there isn't any reason to restart all queues in one tags any more,
see the following reasons:
1) blk-mq core can deal with shared-tags case well via blk_mq_get_driver_tag(),
which can wake up queues waiting for driver tag.
2) SCSI is a bit special because it may return BLK_STS_RESOURCE if queue,
target or host is ready, but SCSI built-in restart can cover all these well,
see scsi_end_request(), queue will be rerun after any request initiated from
this host/target is completed.
In my test on scsi_debug(8 luns), this patch may improve IOPS by 20% ~ 30%
when running I/O on these 8 luns concurrently.
Fixes: 705cda97ee3a ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list")
Cc: Omar Sandoval <osandov@fb.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: linux-scsi@vger.kernel.org
Reported-by: Andrew Jones <drjones@redhat.com>
Tested-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-25 11:31:48 +00:00
|
|
|
if (shared)
|
2020-08-19 15:20:19 +00:00
|
|
|
hctx->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
|
blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set()
We have to remove synchronize_rcu() from blk_queue_cleanup(),
otherwise long delay can be caused during lun probe. For removing
it, we have to avoid to iterate the set->tag_list in IO path, eg,
blk_mq_sched_restart().
This patch reverts 5b79413946d (Revert "blk-mq: don't handle
TAG_SHARED in restart"). Given we have fixed enough IO hang issue,
and there isn't any reason to restart all queues in one tags any more,
see the following reasons:
1) blk-mq core can deal with shared-tags case well via blk_mq_get_driver_tag(),
which can wake up queues waiting for driver tag.
2) SCSI is a bit special because it may return BLK_STS_RESOURCE if queue,
target or host is ready, but SCSI built-in restart can cover all these well,
see scsi_end_request(), queue will be rerun after any request initiated from
this host/target is completed.
In my test on scsi_debug(8 luns), this patch may improve IOPS by 20% ~ 30%
when running I/O on these 8 luns concurrently.
Fixes: 705cda97ee3a ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list")
Cc: Omar Sandoval <osandov@fb.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: linux-scsi@vger.kernel.org
Reported-by: Andrew Jones <drjones@redhat.com>
Tested-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-25 11:31:48 +00:00
|
|
|
else
|
2020-08-19 15:20:19 +00:00
|
|
|
hctx->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
|
2015-11-03 15:40:06 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2020-08-19 15:20:20 +00:00
|
|
|
static void blk_mq_update_tag_set_shared(struct blk_mq_tag_set *set,
|
|
|
|
bool shared)
|
2015-11-03 15:40:06 +00:00
|
|
|
{
|
|
|
|
struct request_queue *q;
|
2014-05-13 21:10:52 +00:00
|
|
|
|
2017-04-07 18:16:49 +00:00
|
|
|
lockdep_assert_held(&set->tag_list_lock);
|
|
|
|
|
2014-05-13 21:10:52 +00:00
|
|
|
list_for_each_entry(q, &set->tag_list, tag_set_list) {
|
|
|
|
blk_mq_freeze_queue(q);
|
2015-11-03 15:40:06 +00:00
|
|
|
queue_set_hctx_shared(q, shared);
|
2014-05-13 21:10:52 +00:00
|
|
|
blk_mq_unfreeze_queue(q);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static void blk_mq_del_queue_tag_set(struct request_queue *q)
|
|
|
|
{
|
|
|
|
struct blk_mq_tag_set *set = q->tag_set;
|
|
|
|
|
|
|
|
mutex_lock(&set->tag_list_lock);
|
2020-07-28 13:29:51 +00:00
|
|
|
list_del(&q->tag_set_list);
|
2015-11-03 15:40:06 +00:00
|
|
|
if (list_is_singular(&set->tag_list)) {
|
|
|
|
/* just transitioned to unshared */
|
2020-08-19 15:20:19 +00:00
|
|
|
set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
|
2015-11-03 15:40:06 +00:00
|
|
|
/* update existing queue */
|
2020-08-19 15:20:20 +00:00
|
|
|
blk_mq_update_tag_set_shared(set, false);
|
2015-11-03 15:40:06 +00:00
|
|
|
}
|
2014-05-13 21:10:52 +00:00
|
|
|
mutex_unlock(&set->tag_list_lock);
|
2018-06-10 20:38:24 +00:00
|
|
|
INIT_LIST_HEAD(&q->tag_set_list);
|
2014-05-13 21:10:52 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
|
|
|
|
struct request_queue *q)
|
|
|
|
{
|
|
|
|
mutex_lock(&set->tag_list_lock);
|
2015-11-03 15:40:06 +00:00
|
|
|
|
2017-11-11 05:05:12 +00:00
|
|
|
/*
|
|
|
|
* Check to see if we're transitioning to shared (from 1 to 2 queues).
|
|
|
|
*/
|
|
|
|
if (!list_empty(&set->tag_list) &&
|
2020-08-19 15:20:19 +00:00
|
|
|
!(set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
|
|
|
|
set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
|
2015-11-03 15:40:06 +00:00
|
|
|
/* update existing queue */
|
2020-08-19 15:20:20 +00:00
|
|
|
blk_mq_update_tag_set_shared(set, true);
|
2015-11-03 15:40:06 +00:00
|
|
|
}
|
2020-08-19 15:20:19 +00:00
|
|
|
if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
|
2015-11-03 15:40:06 +00:00
|
|
|
queue_set_hctx_shared(q, true);
|
2020-07-28 13:29:51 +00:00
|
|
|
list_add_tail(&q->tag_set_list, &set->tag_list);
|
2015-11-03 15:40:06 +00:00
|
|
|
|
2014-05-13 21:10:52 +00:00
|
|
|
mutex_unlock(&set->tag_list_lock);
|
|
|
|
}
|
|
|
|
|
2018-11-20 01:44:35 +00:00
|
|
|
/* All allocations will be freed in release handler of q->mq_kobj */
|
|
|
|
static int blk_mq_alloc_ctxs(struct request_queue *q)
|
|
|
|
{
|
|
|
|
struct blk_mq_ctxs *ctxs;
|
|
|
|
int cpu;
|
|
|
|
|
|
|
|
ctxs = kzalloc(sizeof(*ctxs), GFP_KERNEL);
|
|
|
|
if (!ctxs)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
ctxs->queue_ctx = alloc_percpu(struct blk_mq_ctx);
|
|
|
|
if (!ctxs->queue_ctx)
|
|
|
|
goto fail;
|
|
|
|
|
|
|
|
for_each_possible_cpu(cpu) {
|
|
|
|
struct blk_mq_ctx *ctx = per_cpu_ptr(ctxs->queue_ctx, cpu);
|
|
|
|
ctx->ctxs = ctxs;
|
|
|
|
}
|
|
|
|
|
|
|
|
q->mq_kobj = &ctxs->kobj;
|
|
|
|
q->queue_ctx = ctxs->queue_ctx;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
fail:
|
|
|
|
kfree(ctxs);
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
|
2015-01-29 12:17:27 +00:00
|
|
|
/*
|
|
|
|
* It is the actual release handler for mq, but we do it from
|
|
|
|
* request queue's release handler for avoiding use-after-free
|
|
|
|
* and headache because q->mq_kobj shouldn't have been introduced,
|
|
|
|
* but we can't group ctx/kctx kobj without it.
|
|
|
|
*/
|
|
|
|
void blk_mq_release(struct request_queue *q)
|
|
|
|
{
|
blk-mq: always free hctx after request queue is freed
In normal queue cleanup path, hctx is released after request queue
is freed, see blk_mq_release().
However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because
of hw queues shrinking. This way is easy to cause use-after-free,
because: one implicit rule is that it is safe to call almost all block
layer APIs if the request queue is alive; and one hctx may be retrieved
by one API, then the hctx can be freed by blk_mq_update_nr_hw_queues();
finally use-after-free is triggered.
Fixes this issue by always freeing hctx after releasing request queue.
If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce
a per-queue list to hold them, then try to resuse these hctxs if numa
node is matched.
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 01:52:27 +00:00
|
|
|
struct blk_mq_hw_ctx *hctx, *next;
|
|
|
|
int i;
|
2015-01-29 12:17:27 +00:00
|
|
|
|
blk-mq: always free hctx after request queue is freed
In normal queue cleanup path, hctx is released after request queue
is freed, see blk_mq_release().
However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because
of hw queues shrinking. This way is easy to cause use-after-free,
because: one implicit rule is that it is safe to call almost all block
layer APIs if the request queue is alive; and one hctx may be retrieved
by one API, then the hctx can be freed by blk_mq_update_nr_hw_queues();
finally use-after-free is triggered.
Fixes this issue by always freeing hctx after releasing request queue.
If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce
a per-queue list to hold them, then try to resuse these hctxs if numa
node is matched.
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 01:52:27 +00:00
|
|
|
queue_for_each_hw_ctx(q, hctx, i)
|
|
|
|
WARN_ON_ONCE(hctx && list_empty(&hctx->hctx_list));
|
|
|
|
|
|
|
|
/* all hctx are in .unused_hctx_list now */
|
|
|
|
list_for_each_entry_safe(hctx, next, &q->unused_hctx_list, hctx_list) {
|
|
|
|
list_del_init(&hctx->hctx_list);
|
2017-02-22 10:14:01 +00:00
|
|
|
kobject_put(&hctx->kobj);
|
2015-06-04 14:25:04 +00:00
|
|
|
}
|
2015-01-29 12:17:27 +00:00
|
|
|
|
|
|
|
kfree(q->queue_hw_ctx);
|
|
|
|
|
2017-02-22 10:14:00 +00:00
|
|
|
/*
|
|
|
|
* release .mq_kobj and sw queue's kobject now because
|
|
|
|
* both share lifetime with request queue.
|
|
|
|
*/
|
|
|
|
blk_mq_sysfs_deinit(q);
|
2015-01-29 12:17:27 +00:00
|
|
|
}
|
|
|
|
|
2021-06-24 08:10:12 +00:00
|
|
|
static struct request_queue *blk_mq_init_queue_data(struct blk_mq_tag_set *set,
|
2020-03-27 08:30:08 +00:00
|
|
|
void *queuedata)
|
2015-03-13 03:56:02 +00:00
|
|
|
{
|
2021-06-02 06:53:17 +00:00
|
|
|
struct request_queue *q;
|
|
|
|
int ret;
|
2015-03-13 03:56:02 +00:00
|
|
|
|
2021-06-02 06:53:17 +00:00
|
|
|
q = blk_alloc_queue(set->numa_node);
|
|
|
|
if (!q)
|
2015-03-13 03:56:02 +00:00
|
|
|
return ERR_PTR(-ENOMEM);
|
2021-06-02 06:53:17 +00:00
|
|
|
q->queuedata = queuedata;
|
|
|
|
ret = blk_mq_init_allocated_queue(set, q);
|
|
|
|
if (ret) {
|
|
|
|
blk_cleanup_queue(q);
|
|
|
|
return ERR_PTR(ret);
|
|
|
|
}
|
2015-03-13 03:56:02 +00:00
|
|
|
return q;
|
|
|
|
}
|
2020-03-27 08:30:08 +00:00
|
|
|
|
|
|
|
struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set)
|
|
|
|
{
|
|
|
|
return blk_mq_init_queue_data(set, NULL);
|
|
|
|
}
|
2015-03-13 03:56:02 +00:00
|
|
|
EXPORT_SYMBOL(blk_mq_init_queue);
|
|
|
|
|
2021-06-02 06:53:18 +00:00
|
|
|
struct gendisk *__blk_mq_alloc_disk(struct blk_mq_tag_set *set, void *queuedata)
|
2018-10-15 14:40:37 +00:00
|
|
|
{
|
|
|
|
struct request_queue *q;
|
2021-06-02 06:53:18 +00:00
|
|
|
struct gendisk *disk;
|
2018-10-15 14:40:37 +00:00
|
|
|
|
2021-06-02 06:53:18 +00:00
|
|
|
q = blk_mq_init_queue_data(set, queuedata);
|
|
|
|
if (IS_ERR(q))
|
|
|
|
return ERR_CAST(q);
|
2018-10-15 14:40:37 +00:00
|
|
|
|
2021-06-02 06:53:18 +00:00
|
|
|
disk = __alloc_disk_node(0, set->numa_node);
|
|
|
|
if (!disk) {
|
|
|
|
blk_cleanup_queue(q);
|
|
|
|
return ERR_PTR(-ENOMEM);
|
2018-10-15 14:40:37 +00:00
|
|
|
}
|
2021-06-02 06:53:18 +00:00
|
|
|
disk->queue = q;
|
|
|
|
return disk;
|
2018-10-15 14:40:37 +00:00
|
|
|
}
|
2021-06-02 06:53:18 +00:00
|
|
|
EXPORT_SYMBOL(__blk_mq_alloc_disk);
|
2018-10-15 14:40:37 +00:00
|
|
|
|
2018-10-12 10:07:27 +00:00
|
|
|
static struct blk_mq_hw_ctx *blk_mq_alloc_and_init_hctx(
|
|
|
|
struct blk_mq_tag_set *set, struct request_queue *q,
|
|
|
|
int hctx_idx, int node)
|
|
|
|
{
|
blk-mq: always free hctx after request queue is freed
In normal queue cleanup path, hctx is released after request queue
is freed, see blk_mq_release().
However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because
of hw queues shrinking. This way is easy to cause use-after-free,
because: one implicit rule is that it is safe to call almost all block
layer APIs if the request queue is alive; and one hctx may be retrieved
by one API, then the hctx can be freed by blk_mq_update_nr_hw_queues();
finally use-after-free is triggered.
Fixes this issue by always freeing hctx after releasing request queue.
If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce
a per-queue list to hold them, then try to resuse these hctxs if numa
node is matched.
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 01:52:27 +00:00
|
|
|
struct blk_mq_hw_ctx *hctx = NULL, *tmp;
|
2018-10-12 10:07:27 +00:00
|
|
|
|
blk-mq: always free hctx after request queue is freed
In normal queue cleanup path, hctx is released after request queue
is freed, see blk_mq_release().
However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because
of hw queues shrinking. This way is easy to cause use-after-free,
because: one implicit rule is that it is safe to call almost all block
layer APIs if the request queue is alive; and one hctx may be retrieved
by one API, then the hctx can be freed by blk_mq_update_nr_hw_queues();
finally use-after-free is triggered.
Fixes this issue by always freeing hctx after releasing request queue.
If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce
a per-queue list to hold them, then try to resuse these hctxs if numa
node is matched.
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 01:52:27 +00:00
|
|
|
/* reuse dead hctx first */
|
|
|
|
spin_lock(&q->unused_hctx_lock);
|
|
|
|
list_for_each_entry(tmp, &q->unused_hctx_list, hctx_list) {
|
|
|
|
if (tmp->numa_node == node) {
|
|
|
|
hctx = tmp;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (hctx)
|
|
|
|
list_del_init(&hctx->hctx_list);
|
|
|
|
spin_unlock(&q->unused_hctx_lock);
|
|
|
|
|
|
|
|
if (!hctx)
|
|
|
|
hctx = blk_mq_alloc_hctx(q, set, node);
|
2018-10-12 10:07:27 +00:00
|
|
|
if (!hctx)
|
2019-04-30 01:52:26 +00:00
|
|
|
goto fail;
|
2018-10-12 10:07:27 +00:00
|
|
|
|
2019-04-30 01:52:26 +00:00
|
|
|
if (blk_mq_init_hctx(q, set, hctx, hctx_idx))
|
|
|
|
goto free_hctx;
|
2018-10-12 10:07:27 +00:00
|
|
|
|
|
|
|
return hctx;
|
2019-04-30 01:52:26 +00:00
|
|
|
|
|
|
|
free_hctx:
|
|
|
|
kobject_put(&hctx->kobj);
|
|
|
|
fail:
|
|
|
|
return NULL;
|
2018-10-12 10:07:27 +00:00
|
|
|
}
|
|
|
|
|
2015-12-18 00:08:14 +00:00
|
|
|
static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
|
|
|
|
struct request_queue *q)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
{
|
blk-mq: fallback to previous nr_hw_queues when updating fails
When we try to increate the nr_hw_queues, we may fail due to
shortage of memory or other reason, then blk_mq_realloc_hw_ctxs stops
and some entries in q->queue_hw_ctx are left with NULL. However,
because queue map has been updated with new nr_hw_queues, some cpus
have been mapped to hw queue which just encounters allocation failure,
thus blk_mq_map_queue could return NULL. This will cause panic in
following blk_mq_map_swqueue.
To fix it, when increase nr_hw_queues fails, fallback to previous
nr_hw_queues and post warning. At the same time, driver's .map_queues
usually use completion irq affinity to map hw and cpu, fallback
nr_hw_queues will cause lack of some cpu's map to hw, so use default
blk_mq_map_queues to do that.
Reported-by: syzbot+83e8cbe702263932d9d4@syzkaller.appspotmail.com
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-12 10:07:28 +00:00
|
|
|
int i, j, end;
|
2015-12-18 00:08:14 +00:00
|
|
|
struct blk_mq_hw_ctx **hctxs = q->queue_hw_ctx;
|
2014-05-27 18:06:53 +00:00
|
|
|
|
2019-10-25 16:50:09 +00:00
|
|
|
if (q->nr_hw_queues < set->nr_hw_queues) {
|
|
|
|
struct blk_mq_hw_ctx **new_hctxs;
|
|
|
|
|
|
|
|
new_hctxs = kcalloc_node(set->nr_hw_queues,
|
|
|
|
sizeof(*new_hctxs), GFP_KERNEL,
|
|
|
|
set->numa_node);
|
|
|
|
if (!new_hctxs)
|
|
|
|
return;
|
|
|
|
if (hctxs)
|
|
|
|
memcpy(new_hctxs, hctxs, q->nr_hw_queues *
|
|
|
|
sizeof(*hctxs));
|
|
|
|
q->queue_hw_ctx = new_hctxs;
|
|
|
|
kfree(hctxs);
|
|
|
|
hctxs = new_hctxs;
|
|
|
|
}
|
|
|
|
|
2018-01-06 08:27:40 +00:00
|
|
|
/* protect against switching io scheduler */
|
|
|
|
mutex_lock(&q->sysfs_lock);
|
2014-04-15 20:14:00 +00:00
|
|
|
for (i = 0; i < set->nr_hw_queues; i++) {
|
2015-12-18 00:08:14 +00:00
|
|
|
int node;
|
2018-10-12 10:07:27 +00:00
|
|
|
struct blk_mq_hw_ctx *hctx;
|
2015-12-18 00:08:14 +00:00
|
|
|
|
2019-02-27 13:35:01 +00:00
|
|
|
node = blk_mq_hw_queue_to_node(&set->map[HCTX_TYPE_DEFAULT], i);
|
2018-10-12 10:07:27 +00:00
|
|
|
/*
|
|
|
|
* If the hw queue has been mapped to another numa node,
|
|
|
|
* we need to realloc the hctx. If allocation fails, fallback
|
|
|
|
* to use the previous one.
|
|
|
|
*/
|
|
|
|
if (hctxs[i] && (hctxs[i]->numa_node == node))
|
|
|
|
continue;
|
2015-12-18 00:08:14 +00:00
|
|
|
|
2018-10-12 10:07:27 +00:00
|
|
|
hctx = blk_mq_alloc_and_init_hctx(set, q, i, node);
|
|
|
|
if (hctx) {
|
blk-mq: always free hctx after request queue is freed
In normal queue cleanup path, hctx is released after request queue
is freed, see blk_mq_release().
However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because
of hw queues shrinking. This way is easy to cause use-after-free,
because: one implicit rule is that it is safe to call almost all block
layer APIs if the request queue is alive; and one hctx may be retrieved
by one API, then the hctx can be freed by blk_mq_update_nr_hw_queues();
finally use-after-free is triggered.
Fixes this issue by always freeing hctx after releasing request queue.
If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce
a per-queue list to hold them, then try to resuse these hctxs if numa
node is matched.
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 01:52:27 +00:00
|
|
|
if (hctxs[i])
|
2018-10-12 10:07:27 +00:00
|
|
|
blk_mq_exit_hctx(q, set, hctxs[i], i);
|
|
|
|
hctxs[i] = hctx;
|
|
|
|
} else {
|
|
|
|
if (hctxs[i])
|
|
|
|
pr_warn("Allocate new hctx on node %d fails,\
|
|
|
|
fallback to previous one on node %d\n",
|
|
|
|
node, hctxs[i]->numa_node);
|
|
|
|
else
|
|
|
|
break;
|
2015-12-18 00:08:14 +00:00
|
|
|
}
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
}
|
blk-mq: fallback to previous nr_hw_queues when updating fails
When we try to increate the nr_hw_queues, we may fail due to
shortage of memory or other reason, then blk_mq_realloc_hw_ctxs stops
and some entries in q->queue_hw_ctx are left with NULL. However,
because queue map has been updated with new nr_hw_queues, some cpus
have been mapped to hw queue which just encounters allocation failure,
thus blk_mq_map_queue could return NULL. This will cause panic in
following blk_mq_map_swqueue.
To fix it, when increase nr_hw_queues fails, fallback to previous
nr_hw_queues and post warning. At the same time, driver's .map_queues
usually use completion irq affinity to map hw and cpu, fallback
nr_hw_queues will cause lack of some cpu's map to hw, so use default
blk_mq_map_queues to do that.
Reported-by: syzbot+83e8cbe702263932d9d4@syzkaller.appspotmail.com
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-12 10:07:28 +00:00
|
|
|
/*
|
|
|
|
* Increasing nr_hw_queues fails. Free the newly allocated
|
|
|
|
* hctxs and keep the previous q->nr_hw_queues.
|
|
|
|
*/
|
|
|
|
if (i != set->nr_hw_queues) {
|
|
|
|
j = q->nr_hw_queues;
|
|
|
|
end = i;
|
|
|
|
} else {
|
|
|
|
j = i;
|
|
|
|
end = q->nr_hw_queues;
|
|
|
|
q->nr_hw_queues = set->nr_hw_queues;
|
|
|
|
}
|
2018-10-12 10:07:27 +00:00
|
|
|
|
blk-mq: fallback to previous nr_hw_queues when updating fails
When we try to increate the nr_hw_queues, we may fail due to
shortage of memory or other reason, then blk_mq_realloc_hw_ctxs stops
and some entries in q->queue_hw_ctx are left with NULL. However,
because queue map has been updated with new nr_hw_queues, some cpus
have been mapped to hw queue which just encounters allocation failure,
thus blk_mq_map_queue could return NULL. This will cause panic in
following blk_mq_map_swqueue.
To fix it, when increase nr_hw_queues fails, fallback to previous
nr_hw_queues and post warning. At the same time, driver's .map_queues
usually use completion irq affinity to map hw and cpu, fallback
nr_hw_queues will cause lack of some cpu's map to hw, so use default
blk_mq_map_queues to do that.
Reported-by: syzbot+83e8cbe702263932d9d4@syzkaller.appspotmail.com
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-12 10:07:28 +00:00
|
|
|
for (; j < end; j++) {
|
2015-12-18 00:08:14 +00:00
|
|
|
struct blk_mq_hw_ctx *hctx = hctxs[j];
|
|
|
|
|
|
|
|
if (hctx) {
|
2017-01-11 21:29:56 +00:00
|
|
|
if (hctx->tags)
|
|
|
|
blk_mq_free_map_and_requests(set, j);
|
2015-12-18 00:08:14 +00:00
|
|
|
blk_mq_exit_hctx(q, set, hctx, j);
|
|
|
|
hctxs[j] = NULL;
|
|
|
|
}
|
|
|
|
}
|
2018-01-06 08:27:40 +00:00
|
|
|
mutex_unlock(&q->sysfs_lock);
|
2015-12-18 00:08:14 +00:00
|
|
|
}
|
|
|
|
|
2021-06-02 06:53:17 +00:00
|
|
|
int blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
|
|
|
|
struct request_queue *q)
|
2015-12-18 00:08:14 +00:00
|
|
|
{
|
2016-02-12 07:27:00 +00:00
|
|
|
/* mark the queue as mq asap */
|
|
|
|
q->mq_ops = set->ops;
|
|
|
|
|
blk-stat: convert to callback-based statistics reporting
Currently, statistics are gathered in ~0.13s windows, and users grab the
statistics whenever they need them. This is not ideal for both in-tree
users:
1. Writeback throttling wants its own dynamically sized window of
statistics. Since the blk-stats statistics are reset after every
window and the wbt windows don't line up with the blk-stats windows,
wbt doesn't see every I/O.
2. Polling currently grabs the statistics on every I/O. Again, depending
on how the window lines up, we may miss some I/Os. It's also
unnecessary overhead to get the statistics on every I/O; the hybrid
polling heuristic would be just as happy with the statistics from the
previous full window.
This reworks the blk-stats infrastructure to be callback-based: users
register a callback that they want called at a given time with all of
the statistics from the window during which the callback was active.
Users can dynamically bucketize the statistics. wbt and polling both
currently use read vs. write, but polling can be extended to further
subdivide based on request size.
The callbacks are kept on an RCU list, and each callback has percpu
stats buffers. There will only be a few users, so the overhead on the
I/O completion side is low. The stats flushing is also simplified
considerably: since the timer function is responsible for clearing the
statistics, we don't have to worry about stale statistics.
wbt is a trivial conversion. After the conversion, the windowing problem
mentioned above is fixed.
For polling, we register an extra callback that caches the previous
window's statistics in the struct request_queue for the hybrid polling
heuristic to use.
Since we no longer have a single stats buffer for the request queue,
this also removes the sysfs and debugfs stats entries. To replace those,
we add a debugfs entry for the poll statistics.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-21 15:56:08 +00:00
|
|
|
q->poll_cb = blk_stat_alloc_callback(blk_mq_poll_stats_fn,
|
2017-04-07 12:24:03 +00:00
|
|
|
blk_mq_poll_stats_bkt,
|
|
|
|
BLK_MQ_POLL_STATS_BKTS, q);
|
blk-stat: convert to callback-based statistics reporting
Currently, statistics are gathered in ~0.13s windows, and users grab the
statistics whenever they need them. This is not ideal for both in-tree
users:
1. Writeback throttling wants its own dynamically sized window of
statistics. Since the blk-stats statistics are reset after every
window and the wbt windows don't line up with the blk-stats windows,
wbt doesn't see every I/O.
2. Polling currently grabs the statistics on every I/O. Again, depending
on how the window lines up, we may miss some I/Os. It's also
unnecessary overhead to get the statistics on every I/O; the hybrid
polling heuristic would be just as happy with the statistics from the
previous full window.
This reworks the blk-stats infrastructure to be callback-based: users
register a callback that they want called at a given time with all of
the statistics from the window during which the callback was active.
Users can dynamically bucketize the statistics. wbt and polling both
currently use read vs. write, but polling can be extended to further
subdivide based on request size.
The callbacks are kept on an RCU list, and each callback has percpu
stats buffers. There will only be a few users, so the overhead on the
I/O completion side is low. The stats flushing is also simplified
considerably: since the timer function is responsible for clearing the
statistics, we don't have to worry about stale statistics.
wbt is a trivial conversion. After the conversion, the windowing problem
mentioned above is fixed.
For polling, we register an extra callback that caches the previous
window's statistics in the struct request_queue for the hybrid polling
heuristic to use.
Since we no longer have a single stats buffer for the request queue,
this also removes the sysfs and debugfs stats entries. To replace those,
we add a debugfs entry for the poll statistics.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-21 15:56:08 +00:00
|
|
|
if (!q->poll_cb)
|
|
|
|
goto err_exit;
|
|
|
|
|
2018-11-20 01:44:35 +00:00
|
|
|
if (blk_mq_alloc_ctxs(q))
|
2019-04-19 20:35:44 +00:00
|
|
|
goto err_poll;
|
2015-12-18 00:08:14 +00:00
|
|
|
|
2017-02-22 10:13:59 +00:00
|
|
|
/* init q->mq_kobj and sw queues' kobjects */
|
|
|
|
blk_mq_sysfs_init(q);
|
|
|
|
|
blk-mq: always free hctx after request queue is freed
In normal queue cleanup path, hctx is released after request queue
is freed, see blk_mq_release().
However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because
of hw queues shrinking. This way is easy to cause use-after-free,
because: one implicit rule is that it is safe to call almost all block
layer APIs if the request queue is alive; and one hctx may be retrieved
by one API, then the hctx can be freed by blk_mq_update_nr_hw_queues();
finally use-after-free is triggered.
Fixes this issue by always freeing hctx after releasing request queue.
If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce
a per-queue list to hold them, then try to resuse these hctxs if numa
node is matched.
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 01:52:27 +00:00
|
|
|
INIT_LIST_HEAD(&q->unused_hctx_list);
|
|
|
|
spin_lock_init(&q->unused_hctx_lock);
|
|
|
|
|
2015-12-18 00:08:14 +00:00
|
|
|
blk_mq_realloc_hw_ctxs(set, q);
|
|
|
|
if (!q->nr_hw_queues)
|
|
|
|
goto err_hctxs;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
2015-10-30 12:57:30 +00:00
|
|
|
INIT_WORK(&q->timeout_work, blk_mq_timeout_work);
|
2015-07-16 11:53:22 +00:00
|
|
|
blk_queue_rq_timeout(q, set->timeout ? set->timeout : 30 * HZ);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
2018-10-16 20:23:06 +00:00
|
|
|
q->tag_set = set;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
2013-11-19 16:25:07 +00:00
|
|
|
q->queue_flags |= QUEUE_FLAG_MQ_DEFAULT;
|
2018-12-18 04:15:29 +00:00
|
|
|
if (set->nr_maps > HCTX_TYPE_POLL &&
|
|
|
|
set->map[HCTX_TYPE_POLL].nr_queues)
|
2018-12-02 16:46:28 +00:00
|
|
|
blk_queue_flag_set(QUEUE_FLAG_POLL, q);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
2014-02-07 18:22:39 +00:00
|
|
|
q->sg_reserved_size = INT_MAX;
|
|
|
|
|
2016-09-14 17:28:30 +00:00
|
|
|
INIT_DELAYED_WORK(&q->requeue_work, blk_mq_requeue_work);
|
2014-05-28 14:08:02 +00:00
|
|
|
INIT_LIST_HEAD(&q->requeue_list);
|
|
|
|
spin_lock_init(&q->requeue_lock);
|
|
|
|
|
2014-05-20 21:17:27 +00:00
|
|
|
q->nr_requests = set->queue_depth;
|
|
|
|
|
2016-11-14 20:03:03 +00:00
|
|
|
/*
|
|
|
|
* Default to classic polling
|
|
|
|
*/
|
2019-03-18 14:44:41 +00:00
|
|
|
q->poll_nsec = BLK_MQ_POLL_CLASSIC;
|
2016-11-14 20:03:03 +00:00
|
|
|
|
2014-04-15 20:14:00 +00:00
|
|
|
blk_mq_init_cpu_queues(q, set->nr_hw_queues);
|
2014-05-13 21:10:52 +00:00
|
|
|
blk_mq_add_queue_tag_set(set, q);
|
2017-06-26 10:20:57 +00:00
|
|
|
blk_mq_map_swqueue(q);
|
2021-06-02 06:53:17 +00:00
|
|
|
return 0;
|
2014-02-10 16:29:00 +00:00
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
err_hctxs:
|
2015-12-18 00:08:14 +00:00
|
|
|
kfree(q->queue_hw_ctx);
|
2019-07-23 14:10:42 +00:00
|
|
|
q->nr_hw_queues = 0;
|
2018-11-20 01:44:35 +00:00
|
|
|
blk_mq_sysfs_deinit(q);
|
2019-04-19 20:35:44 +00:00
|
|
|
err_poll:
|
|
|
|
blk_stat_free_callback(q->poll_cb);
|
|
|
|
q->poll_cb = NULL;
|
2016-05-26 06:23:27 +00:00
|
|
|
err_exit:
|
|
|
|
q->mq_ops = NULL;
|
2021-06-02 06:53:17 +00:00
|
|
|
return -ENOMEM;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
}
|
2015-03-13 03:56:02 +00:00
|
|
|
EXPORT_SYMBOL(blk_mq_init_allocated_queue);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
blk-mq: free hw queue's resource in hctx's release handler
Once blk_cleanup_queue() returns, tags shouldn't be used any more,
because blk_mq_free_tag_set() may be called. Commit 45a9c9d909b2
("blk-mq: Fix a use-after-free") fixes this issue exactly.
However, that commit introduces another issue. Before 45a9c9d909b2,
we are allowed to run queue during cleaning up queue if the queue's
kobj refcount is held. After that commit, queue can't be run during
queue cleaning up, otherwise oops can be triggered easily because
some fields of hctx are freed by blk_mq_free_queue() in blk_cleanup_queue().
We have invented ways for addressing this kind of issue before, such as:
8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done")
c2856ae2f315 ("blk-mq: quiesce queue before freeing queue")
But still can't cover all cases, recently James reports another such
kind of issue:
https://marc.info/?l=linux-scsi&m=155389088124782&w=2
This issue can be quite hard to address by previous way, given
scsi_run_queue() may run requeues for other LUNs.
Fixes the above issue by freeing hctx's resources in its release handler, and this
way is safe becasue tags isn't needed for freeing such hctx resource.
This approach follows typical design pattern wrt. kobject's release handler.
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reported-by: James Smart <james.smart@broadcom.com>
Fixes: 45a9c9d909b2 ("blk-mq: Fix a use-after-free")
Cc: stable@vger.kernel.org
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 01:52:25 +00:00
|
|
|
/* tags can _not_ be used after returning from blk_mq_exit_queue */
|
|
|
|
void blk_mq_exit_queue(struct request_queue *q)
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
{
|
2021-05-13 17:15:29 +00:00
|
|
|
struct blk_mq_tag_set *set = q->tag_set;
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
|
2021-05-13 17:15:29 +00:00
|
|
|
/* Checks hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED. */
|
2014-05-27 15:35:13 +00:00
|
|
|
blk_mq_exit_hw_queues(q, set, set->nr_hw_queues);
|
2021-05-13 17:15:29 +00:00
|
|
|
/* May clear BLK_MQ_F_TAG_QUEUE_SHARED in hctx->flags. */
|
|
|
|
blk_mq_del_queue_tag_set(q);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
}
|
|
|
|
|
2014-09-10 15:02:03 +00:00
|
|
|
static int __blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
blk-mq: add cond_resched() in __blk_mq_alloc_rq_maps()
We found blk_mq_alloc_rq_maps() takes more time in kernel space when
testing nvme device hot-plugging. The test and anlysis as below.
Debug code,
1, blk_mq_alloc_rq_maps():
u64 start, end;
depth = set->queue_depth;
start = ktime_get_ns();
pr_err("[%d:%s switch:%ld,%ld] queue depth %d, nr_hw_queues %d\n",
current->pid, current->comm, current->nvcsw, current->nivcsw,
set->queue_depth, set->nr_hw_queues);
do {
err = __blk_mq_alloc_rq_maps(set);
if (!err)
break;
set->queue_depth >>= 1;
if (set->queue_depth < set->reserved_tags + BLK_MQ_TAG_MIN) {
err = -ENOMEM;
break;
}
} while (set->queue_depth);
end = ktime_get_ns();
pr_err("[%d:%s switch:%ld,%ld] all hw queues init cost time %lld ns\n",
current->pid, current->comm,
current->nvcsw, current->nivcsw, end - start);
2, __blk_mq_alloc_rq_maps():
u64 start, end;
for (i = 0; i < set->nr_hw_queues; i++) {
start = ktime_get_ns();
if (!__blk_mq_alloc_rq_map(set, i))
goto out_unwind;
end = ktime_get_ns();
pr_err("hw queue %d init cost time %lld ns\n", i, end - start);
}
Test nvme hot-plugging with above debug code, we found it totally cost more
than 3ms in kernel space without being scheduled out when alloc rqs for all
16 hw queues with depth 1023, each hw queue cost about 140-250us. The cost
time will be increased with hw queue number and queue depth increasing. And
in an extreme case, if __blk_mq_alloc_rq_maps() returns -ENOMEM, it will try
"queue_depth >>= 1", more time will be consumed.
[ 428.428771] nvme nvme0: pci function 10000:01:00.0
[ 428.428798] nvme 10000:01:00.0: enabling device (0000 -> 0002)
[ 428.428806] pcieport 10000:00:00.0: can't derive routing for PCI INT A
[ 428.428809] nvme 10000:01:00.0: PCI INT A: no GSI
[ 432.593374] [4688:kworker/u33:8 switch:663,2] queue depth 30, nr_hw_queues 1
[ 432.593404] hw queue 0 init cost time 22883 ns
[ 432.593408] [4688:kworker/u33:8 switch:663,2] all hw queues init cost time 35960 ns
[ 432.595953] nvme nvme0: 16/0/0 default/read/poll queues
[ 432.595958] [4688:kworker/u33:8 switch:700,2] queue depth 1023, nr_hw_queues 16
[ 432.596203] hw queue 0 init cost time 242630 ns
[ 432.596441] hw queue 1 init cost time 235913 ns
[ 432.596659] hw queue 2 init cost time 216461 ns
[ 432.596877] hw queue 3 init cost time 215851 ns
[ 432.597107] hw queue 4 init cost time 228406 ns
[ 432.597336] hw queue 5 init cost time 227298 ns
[ 432.597564] hw queue 6 init cost time 224633 ns
[ 432.597785] hw queue 7 init cost time 219954 ns
[ 432.597937] hw queue 8 init cost time 150930 ns
[ 432.598082] hw queue 9 init cost time 143496 ns
[ 432.598231] hw queue 10 init cost time 147261 ns
[ 432.598397] hw queue 11 init cost time 164522 ns
[ 432.598542] hw queue 12 init cost time 143401 ns
[ 432.598692] hw queue 13 init cost time 148934 ns
[ 432.598841] hw queue 14 init cost time 147194 ns
[ 432.598991] hw queue 15 init cost time 148942 ns
[ 432.598993] [4688:kworker/u33:8 switch:700,2] all hw queues init cost time 3035099 ns
[ 432.602611] nvme0n1: p1
So use this patch to trigger schedule between each hw queue init, to avoid
other threads getting stuck. It is not in atomic context when executing
__blk_mq_alloc_rq_maps(), so it is safe to call cond_resched().
Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-26 02:39:47 +00:00
|
|
|
for (i = 0; i < set->nr_hw_queues; i++) {
|
2020-05-07 13:04:22 +00:00
|
|
|
if (!__blk_mq_alloc_map_and_request(set, i))
|
2014-09-10 15:02:03 +00:00
|
|
|
goto out_unwind;
|
blk-mq: add cond_resched() in __blk_mq_alloc_rq_maps()
We found blk_mq_alloc_rq_maps() takes more time in kernel space when
testing nvme device hot-plugging. The test and anlysis as below.
Debug code,
1, blk_mq_alloc_rq_maps():
u64 start, end;
depth = set->queue_depth;
start = ktime_get_ns();
pr_err("[%d:%s switch:%ld,%ld] queue depth %d, nr_hw_queues %d\n",
current->pid, current->comm, current->nvcsw, current->nivcsw,
set->queue_depth, set->nr_hw_queues);
do {
err = __blk_mq_alloc_rq_maps(set);
if (!err)
break;
set->queue_depth >>= 1;
if (set->queue_depth < set->reserved_tags + BLK_MQ_TAG_MIN) {
err = -ENOMEM;
break;
}
} while (set->queue_depth);
end = ktime_get_ns();
pr_err("[%d:%s switch:%ld,%ld] all hw queues init cost time %lld ns\n",
current->pid, current->comm,
current->nvcsw, current->nivcsw, end - start);
2, __blk_mq_alloc_rq_maps():
u64 start, end;
for (i = 0; i < set->nr_hw_queues; i++) {
start = ktime_get_ns();
if (!__blk_mq_alloc_rq_map(set, i))
goto out_unwind;
end = ktime_get_ns();
pr_err("hw queue %d init cost time %lld ns\n", i, end - start);
}
Test nvme hot-plugging with above debug code, we found it totally cost more
than 3ms in kernel space without being scheduled out when alloc rqs for all
16 hw queues with depth 1023, each hw queue cost about 140-250us. The cost
time will be increased with hw queue number and queue depth increasing. And
in an extreme case, if __blk_mq_alloc_rq_maps() returns -ENOMEM, it will try
"queue_depth >>= 1", more time will be consumed.
[ 428.428771] nvme nvme0: pci function 10000:01:00.0
[ 428.428798] nvme 10000:01:00.0: enabling device (0000 -> 0002)
[ 428.428806] pcieport 10000:00:00.0: can't derive routing for PCI INT A
[ 428.428809] nvme 10000:01:00.0: PCI INT A: no GSI
[ 432.593374] [4688:kworker/u33:8 switch:663,2] queue depth 30, nr_hw_queues 1
[ 432.593404] hw queue 0 init cost time 22883 ns
[ 432.593408] [4688:kworker/u33:8 switch:663,2] all hw queues init cost time 35960 ns
[ 432.595953] nvme nvme0: 16/0/0 default/read/poll queues
[ 432.595958] [4688:kworker/u33:8 switch:700,2] queue depth 1023, nr_hw_queues 16
[ 432.596203] hw queue 0 init cost time 242630 ns
[ 432.596441] hw queue 1 init cost time 235913 ns
[ 432.596659] hw queue 2 init cost time 216461 ns
[ 432.596877] hw queue 3 init cost time 215851 ns
[ 432.597107] hw queue 4 init cost time 228406 ns
[ 432.597336] hw queue 5 init cost time 227298 ns
[ 432.597564] hw queue 6 init cost time 224633 ns
[ 432.597785] hw queue 7 init cost time 219954 ns
[ 432.597937] hw queue 8 init cost time 150930 ns
[ 432.598082] hw queue 9 init cost time 143496 ns
[ 432.598231] hw queue 10 init cost time 147261 ns
[ 432.598397] hw queue 11 init cost time 164522 ns
[ 432.598542] hw queue 12 init cost time 143401 ns
[ 432.598692] hw queue 13 init cost time 148934 ns
[ 432.598841] hw queue 14 init cost time 147194 ns
[ 432.598991] hw queue 15 init cost time 148942 ns
[ 432.598993] [4688:kworker/u33:8 switch:700,2] all hw queues init cost time 3035099 ns
[ 432.602611] nvme0n1: p1
So use this patch to trigger schedule between each hw queue init, to avoid
other threads getting stuck. It is not in atomic context when executing
__blk_mq_alloc_rq_maps(), so it is safe to call cond_resched().
Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-26 02:39:47 +00:00
|
|
|
cond_resched();
|
|
|
|
}
|
2014-09-10 15:02:03 +00:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
out_unwind:
|
|
|
|
while (--i >= 0)
|
2020-05-07 13:03:39 +00:00
|
|
|
blk_mq_free_map_and_requests(set, i);
|
2014-09-10 15:02:03 +00:00
|
|
|
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Allocate the request maps associated with this tag_set. Note that this
|
|
|
|
* may reduce the depth asked for, if memory is tight. set->queue_depth
|
|
|
|
* will be updated to reflect the allocated depth.
|
|
|
|
*/
|
2020-05-07 13:04:42 +00:00
|
|
|
static int blk_mq_alloc_map_and_requests(struct blk_mq_tag_set *set)
|
2014-09-10 15:02:03 +00:00
|
|
|
{
|
|
|
|
unsigned int depth;
|
|
|
|
int err;
|
|
|
|
|
|
|
|
depth = set->queue_depth;
|
|
|
|
do {
|
|
|
|
err = __blk_mq_alloc_rq_maps(set);
|
|
|
|
if (!err)
|
|
|
|
break;
|
|
|
|
|
|
|
|
set->queue_depth >>= 1;
|
|
|
|
if (set->queue_depth < set->reserved_tags + BLK_MQ_TAG_MIN) {
|
|
|
|
err = -ENOMEM;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
} while (set->queue_depth);
|
|
|
|
|
|
|
|
if (!set->queue_depth || err) {
|
|
|
|
pr_err("blk-mq: failed to allocate request map\n");
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (depth != set->queue_depth)
|
|
|
|
pr_info("blk-mq: reduced tag depth (%u -> %u)\n",
|
|
|
|
depth, set->queue_depth);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2017-04-07 14:53:11 +00:00
|
|
|
static int blk_mq_update_queue_map(struct blk_mq_tag_set *set)
|
|
|
|
{
|
2020-03-10 04:26:17 +00:00
|
|
|
/*
|
|
|
|
* blk_mq_map_queues() and multiple .map_queues() implementations
|
|
|
|
* expect that set->map[HCTX_TYPE_DEFAULT].nr_queues is set to the
|
|
|
|
* number of hardware queues.
|
|
|
|
*/
|
|
|
|
if (set->nr_maps == 1)
|
|
|
|
set->map[HCTX_TYPE_DEFAULT].nr_queues = set->nr_hw_queues;
|
|
|
|
|
2018-12-07 03:03:53 +00:00
|
|
|
if (set->ops->map_queues && !is_kdump_kernel()) {
|
2018-10-30 16:36:06 +00:00
|
|
|
int i;
|
|
|
|
|
2018-01-06 08:27:39 +00:00
|
|
|
/*
|
|
|
|
* transport .map_queues is usually done in the following
|
|
|
|
* way:
|
|
|
|
*
|
|
|
|
* for (queue = 0; queue < set->nr_hw_queues; queue++) {
|
|
|
|
* mask = get_cpu_mask(queue)
|
|
|
|
* for_each_cpu(cpu, mask)
|
2018-10-30 16:36:06 +00:00
|
|
|
* set->map[x].mq_map[cpu] = queue;
|
2018-01-06 08:27:39 +00:00
|
|
|
* }
|
|
|
|
*
|
|
|
|
* When we need to remap, the table has to be cleared for
|
|
|
|
* killing stale mapping since one CPU may not be mapped
|
|
|
|
* to any hw queue.
|
|
|
|
*/
|
2018-10-30 16:36:06 +00:00
|
|
|
for (i = 0; i < set->nr_maps; i++)
|
|
|
|
blk_mq_clear_mq_map(&set->map[i]);
|
2018-01-06 08:27:39 +00:00
|
|
|
|
2017-04-07 14:53:11 +00:00
|
|
|
return set->ops->map_queues(set);
|
2018-10-30 16:36:06 +00:00
|
|
|
} else {
|
|
|
|
BUG_ON(set->nr_maps > 1);
|
2019-02-27 13:35:01 +00:00
|
|
|
return blk_mq_map_queues(&set->map[HCTX_TYPE_DEFAULT]);
|
2018-10-30 16:36:06 +00:00
|
|
|
}
|
2017-04-07 14:53:11 +00:00
|
|
|
}
|
|
|
|
|
2019-10-25 16:50:10 +00:00
|
|
|
static int blk_mq_realloc_tag_set_tags(struct blk_mq_tag_set *set,
|
|
|
|
int cur_nr_hw_queues, int new_nr_hw_queues)
|
|
|
|
{
|
|
|
|
struct blk_mq_tags **new_tags;
|
|
|
|
|
|
|
|
if (cur_nr_hw_queues >= new_nr_hw_queues)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
new_tags = kcalloc_node(new_nr_hw_queues, sizeof(struct blk_mq_tags *),
|
|
|
|
GFP_KERNEL, set->numa_node);
|
|
|
|
if (!new_tags)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
if (set->tags)
|
|
|
|
memcpy(new_tags, set->tags, cur_nr_hw_queues *
|
|
|
|
sizeof(*set->tags));
|
|
|
|
kfree(set->tags);
|
|
|
|
set->tags = new_tags;
|
|
|
|
set->nr_hw_queues = new_nr_hw_queues;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2020-12-04 15:20:53 +00:00
|
|
|
static int blk_mq_alloc_tag_set_tags(struct blk_mq_tag_set *set,
|
|
|
|
int new_nr_hw_queues)
|
|
|
|
{
|
|
|
|
return blk_mq_realloc_tag_set_tags(set, 0, new_nr_hw_queues);
|
|
|
|
}
|
|
|
|
|
2014-06-05 21:21:56 +00:00
|
|
|
/*
|
|
|
|
* Alloc a tag set to be associated with one or more request queues.
|
|
|
|
* May fail with EINVAL for various error conditions. May adjust the
|
2018-06-30 13:12:41 +00:00
|
|
|
* requested depth down, if it's too large. In that case, the set
|
2014-06-05 21:21:56 +00:00
|
|
|
* value will be stored in set->queue_depth.
|
|
|
|
*/
|
2014-04-15 20:14:00 +00:00
|
|
|
int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
|
|
|
|
{
|
2018-10-30 16:36:06 +00:00
|
|
|
int i, ret;
|
2016-09-14 14:18:55 +00:00
|
|
|
|
2014-10-30 13:45:11 +00:00
|
|
|
BUILD_BUG_ON(BLK_MQ_MAX_DEPTH > 1 << BLK_MQ_UNIQUE_TAG_BITS);
|
|
|
|
|
2014-04-15 20:14:00 +00:00
|
|
|
if (!set->nr_hw_queues)
|
|
|
|
return -EINVAL;
|
2014-06-05 21:21:56 +00:00
|
|
|
if (!set->queue_depth)
|
2014-04-15 20:14:00 +00:00
|
|
|
return -EINVAL;
|
|
|
|
if (set->queue_depth < set->reserved_tags + BLK_MQ_TAG_MIN)
|
|
|
|
return -EINVAL;
|
|
|
|
|
2016-09-14 14:18:54 +00:00
|
|
|
if (!set->ops->queue_rq)
|
2014-04-15 20:14:00 +00:00
|
|
|
return -EINVAL;
|
|
|
|
|
2017-10-14 09:22:29 +00:00
|
|
|
if (!set->ops->get_budget ^ !set->ops->put_budget)
|
|
|
|
return -EINVAL;
|
|
|
|
|
2014-06-05 21:21:56 +00:00
|
|
|
if (set->queue_depth > BLK_MQ_MAX_DEPTH) {
|
|
|
|
pr_info("blk-mq: reduced tag depth to %u\n",
|
|
|
|
BLK_MQ_MAX_DEPTH);
|
|
|
|
set->queue_depth = BLK_MQ_MAX_DEPTH;
|
|
|
|
}
|
2014-04-15 20:14:00 +00:00
|
|
|
|
2018-10-30 16:36:06 +00:00
|
|
|
if (!set->nr_maps)
|
|
|
|
set->nr_maps = 1;
|
|
|
|
else if (set->nr_maps > HCTX_MAX_TYPES)
|
|
|
|
return -EINVAL;
|
|
|
|
|
2014-12-01 00:00:58 +00:00
|
|
|
/*
|
|
|
|
* If a crashdump is active, then we are potentially in a very
|
|
|
|
* memory constrained environment. Limit us to 1 queue and
|
|
|
|
* 64 tags to prevent using too much memory.
|
|
|
|
*/
|
|
|
|
if (is_kdump_kernel()) {
|
|
|
|
set->nr_hw_queues = 1;
|
2018-12-07 03:03:53 +00:00
|
|
|
set->nr_maps = 1;
|
2014-12-01 00:00:58 +00:00
|
|
|
set->queue_depth = min(64U, set->queue_depth);
|
|
|
|
}
|
2015-12-18 00:08:14 +00:00
|
|
|
/*
|
2018-10-29 19:25:27 +00:00
|
|
|
* There is no use for more h/w queues than cpus if we just have
|
|
|
|
* a single map
|
2015-12-18 00:08:14 +00:00
|
|
|
*/
|
2018-10-29 19:25:27 +00:00
|
|
|
if (set->nr_maps == 1 && set->nr_hw_queues > nr_cpu_ids)
|
2015-12-18 00:08:14 +00:00
|
|
|
set->nr_hw_queues = nr_cpu_ids;
|
2014-12-01 00:00:58 +00:00
|
|
|
|
2020-12-04 15:20:53 +00:00
|
|
|
if (blk_mq_alloc_tag_set_tags(set, set->nr_hw_queues) < 0)
|
2014-09-10 15:02:03 +00:00
|
|
|
return -ENOMEM;
|
2014-04-15 20:14:00 +00:00
|
|
|
|
2016-09-14 14:18:55 +00:00
|
|
|
ret = -ENOMEM;
|
2018-10-30 16:36:06 +00:00
|
|
|
for (i = 0; i < set->nr_maps; i++) {
|
|
|
|
set->map[i].mq_map = kcalloc_node(nr_cpu_ids,
|
2018-12-17 10:42:45 +00:00
|
|
|
sizeof(set->map[i].mq_map[0]),
|
2018-10-30 16:36:06 +00:00
|
|
|
GFP_KERNEL, set->numa_node);
|
|
|
|
if (!set->map[i].mq_map)
|
|
|
|
goto out_free_mq_map;
|
2018-12-07 03:03:53 +00:00
|
|
|
set->map[i].nr_queues = is_kdump_kernel() ? 1 : set->nr_hw_queues;
|
2018-10-30 16:36:06 +00:00
|
|
|
}
|
2016-09-14 14:18:53 +00:00
|
|
|
|
2017-04-07 14:53:11 +00:00
|
|
|
ret = blk_mq_update_queue_map(set);
|
2016-09-14 14:18:55 +00:00
|
|
|
if (ret)
|
|
|
|
goto out_free_mq_map;
|
|
|
|
|
2020-05-07 13:04:42 +00:00
|
|
|
ret = blk_mq_alloc_map_and_requests(set);
|
2016-09-14 14:18:55 +00:00
|
|
|
if (ret)
|
2016-09-14 14:18:53 +00:00
|
|
|
goto out_free_mq_map;
|
2014-04-15 20:14:00 +00:00
|
|
|
|
blk-mq: Facilitate a shared sbitmap per tagset
Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3 ..) support
multiple reply queues with single hostwide tags.
In addition, these drivers want to use interrupt assignment in
pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed in [0],
CPU hotplug may cause in-flight IO completion to not be serviced when an
interrupt is shutdown. That problem is solved in commit bf0beec0607d
("blk-mq: drain I/O when all CPUs in a hctx are offline").
However, to take advantage of that blk-mq feature, the HBA HW queuess are
required to be mapped to that of the blk-mq hctx's; to do that, the HBA HW
queues need to be exposed to the upper layer.
In making that transition, the per-SCSI command request tags are no
longer unique per Scsi host - they are just unique per hctx. As such, the
HBA LLDD would have to generate this tag internally, which has a certain
performance overhead.
However another problem is that blk-mq assumes the host may accept
(Scsi_host.can_queue * #hw queue) commands. In commit 6eb045e092ef ("scsi:
core: avoid host-wide host_busy counter for scsi_mq"), the Scsi host busy
counter was removed, which would stop the LLDD being sent more than
.can_queue commands; however, it should still be ensured that the block
layer does not issue more than .can_queue commands to the Scsi host.
To solve this problem, introduce a shared sbitmap per blk_mq_tag_set,
which may be requested at init time.
New flag BLK_MQ_F_TAG_HCTX_SHARED should be set when requesting the
tagset to indicate whether the shared sbitmap should be used.
Even when BLK_MQ_F_TAG_HCTX_SHARED is set, a full set of tags and requests
are still allocated per hctx; the reason for this is that if tags and
requests were only allocated for a single hctx - like hctx0 - it may break
block drivers which expect a request be associated with a specific hctx,
i.e. not always hctx0. This will introduce extra memory usage.
This change is based on work originally from Ming Lei in [1] and from
Bart's suggestion in [2].
[0] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
[1] https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@redhat.com/
[2] https://lore.kernel.org/linux-block/ff77beff-5fd9-9f05-12b6-826922bace1f@huawei.com/T/#m3db0a602f095cbcbff27e9c884d6b4ae826144be
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Don Brace<don.brace@microsemi.com> #SCSI resv cmds patches used
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-19 15:20:24 +00:00
|
|
|
if (blk_mq_is_sbitmap_shared(set->flags)) {
|
2020-08-19 15:20:27 +00:00
|
|
|
atomic_set(&set->active_queues_shared_sbitmap, 0);
|
|
|
|
|
2021-05-13 12:00:57 +00:00
|
|
|
if (blk_mq_init_shared_sbitmap(set)) {
|
blk-mq: Facilitate a shared sbitmap per tagset
Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3 ..) support
multiple reply queues with single hostwide tags.
In addition, these drivers want to use interrupt assignment in
pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed in [0],
CPU hotplug may cause in-flight IO completion to not be serviced when an
interrupt is shutdown. That problem is solved in commit bf0beec0607d
("blk-mq: drain I/O when all CPUs in a hctx are offline").
However, to take advantage of that blk-mq feature, the HBA HW queuess are
required to be mapped to that of the blk-mq hctx's; to do that, the HBA HW
queues need to be exposed to the upper layer.
In making that transition, the per-SCSI command request tags are no
longer unique per Scsi host - they are just unique per hctx. As such, the
HBA LLDD would have to generate this tag internally, which has a certain
performance overhead.
However another problem is that blk-mq assumes the host may accept
(Scsi_host.can_queue * #hw queue) commands. In commit 6eb045e092ef ("scsi:
core: avoid host-wide host_busy counter for scsi_mq"), the Scsi host busy
counter was removed, which would stop the LLDD being sent more than
.can_queue commands; however, it should still be ensured that the block
layer does not issue more than .can_queue commands to the Scsi host.
To solve this problem, introduce a shared sbitmap per blk_mq_tag_set,
which may be requested at init time.
New flag BLK_MQ_F_TAG_HCTX_SHARED should be set when requesting the
tagset to indicate whether the shared sbitmap should be used.
Even when BLK_MQ_F_TAG_HCTX_SHARED is set, a full set of tags and requests
are still allocated per hctx; the reason for this is that if tags and
requests were only allocated for a single hctx - like hctx0 - it may break
block drivers which expect a request be associated with a specific hctx,
i.e. not always hctx0. This will introduce extra memory usage.
This change is based on work originally from Ming Lei in [1] and from
Bart's suggestion in [2].
[0] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
[1] https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@redhat.com/
[2] https://lore.kernel.org/linux-block/ff77beff-5fd9-9f05-12b6-826922bace1f@huawei.com/T/#m3db0a602f095cbcbff27e9c884d6b4ae826144be
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Don Brace<don.brace@microsemi.com> #SCSI resv cmds patches used
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-19 15:20:24 +00:00
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out_free_mq_rq_maps;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2014-05-13 21:10:52 +00:00
|
|
|
mutex_init(&set->tag_list_lock);
|
|
|
|
INIT_LIST_HEAD(&set->tag_list);
|
|
|
|
|
2014-04-15 20:14:00 +00:00
|
|
|
return 0;
|
2016-09-14 14:18:53 +00:00
|
|
|
|
blk-mq: Facilitate a shared sbitmap per tagset
Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3 ..) support
multiple reply queues with single hostwide tags.
In addition, these drivers want to use interrupt assignment in
pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed in [0],
CPU hotplug may cause in-flight IO completion to not be serviced when an
interrupt is shutdown. That problem is solved in commit bf0beec0607d
("blk-mq: drain I/O when all CPUs in a hctx are offline").
However, to take advantage of that blk-mq feature, the HBA HW queuess are
required to be mapped to that of the blk-mq hctx's; to do that, the HBA HW
queues need to be exposed to the upper layer.
In making that transition, the per-SCSI command request tags are no
longer unique per Scsi host - they are just unique per hctx. As such, the
HBA LLDD would have to generate this tag internally, which has a certain
performance overhead.
However another problem is that blk-mq assumes the host may accept
(Scsi_host.can_queue * #hw queue) commands. In commit 6eb045e092ef ("scsi:
core: avoid host-wide host_busy counter for scsi_mq"), the Scsi host busy
counter was removed, which would stop the LLDD being sent more than
.can_queue commands; however, it should still be ensured that the block
layer does not issue more than .can_queue commands to the Scsi host.
To solve this problem, introduce a shared sbitmap per blk_mq_tag_set,
which may be requested at init time.
New flag BLK_MQ_F_TAG_HCTX_SHARED should be set when requesting the
tagset to indicate whether the shared sbitmap should be used.
Even when BLK_MQ_F_TAG_HCTX_SHARED is set, a full set of tags and requests
are still allocated per hctx; the reason for this is that if tags and
requests were only allocated for a single hctx - like hctx0 - it may break
block drivers which expect a request be associated with a specific hctx,
i.e. not always hctx0. This will introduce extra memory usage.
This change is based on work originally from Ming Lei in [1] and from
Bart's suggestion in [2].
[0] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
[1] https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@redhat.com/
[2] https://lore.kernel.org/linux-block/ff77beff-5fd9-9f05-12b6-826922bace1f@huawei.com/T/#m3db0a602f095cbcbff27e9c884d6b4ae826144be
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Don Brace<don.brace@microsemi.com> #SCSI resv cmds patches used
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-19 15:20:24 +00:00
|
|
|
out_free_mq_rq_maps:
|
|
|
|
for (i = 0; i < set->nr_hw_queues; i++)
|
|
|
|
blk_mq_free_map_and_requests(set, i);
|
2016-09-14 14:18:53 +00:00
|
|
|
out_free_mq_map:
|
2018-10-30 16:36:06 +00:00
|
|
|
for (i = 0; i < set->nr_maps; i++) {
|
|
|
|
kfree(set->map[i].mq_map);
|
|
|
|
set->map[i].mq_map = NULL;
|
|
|
|
}
|
2014-09-02 16:38:44 +00:00
|
|
|
kfree(set->tags);
|
|
|
|
set->tags = NULL;
|
2016-09-14 14:18:55 +00:00
|
|
|
return ret;
|
2014-04-15 20:14:00 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_mq_alloc_tag_set);
|
|
|
|
|
2021-06-02 06:53:16 +00:00
|
|
|
/* allocate and initialize a tagset for a simple single-queue device */
|
|
|
|
int blk_mq_alloc_sq_tag_set(struct blk_mq_tag_set *set,
|
|
|
|
const struct blk_mq_ops *ops, unsigned int queue_depth,
|
|
|
|
unsigned int set_flags)
|
|
|
|
{
|
|
|
|
memset(set, 0, sizeof(*set));
|
|
|
|
set->ops = ops;
|
|
|
|
set->nr_hw_queues = 1;
|
|
|
|
set->nr_maps = 1;
|
|
|
|
set->queue_depth = queue_depth;
|
|
|
|
set->numa_node = NUMA_NO_NODE;
|
|
|
|
set->flags = set_flags;
|
|
|
|
return blk_mq_alloc_tag_set(set);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_alloc_sq_tag_set);
|
|
|
|
|
2014-04-15 20:14:00 +00:00
|
|
|
void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
|
|
|
|
{
|
2018-10-30 16:36:06 +00:00
|
|
|
int i, j;
|
2014-04-15 20:14:00 +00:00
|
|
|
|
2019-10-25 16:50:10 +00:00
|
|
|
for (i = 0; i < set->nr_hw_queues; i++)
|
2017-01-11 21:29:56 +00:00
|
|
|
blk_mq_free_map_and_requests(set, i);
|
2014-05-21 20:01:15 +00:00
|
|
|
|
blk-mq: Facilitate a shared sbitmap per tagset
Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3 ..) support
multiple reply queues with single hostwide tags.
In addition, these drivers want to use interrupt assignment in
pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed in [0],
CPU hotplug may cause in-flight IO completion to not be serviced when an
interrupt is shutdown. That problem is solved in commit bf0beec0607d
("blk-mq: drain I/O when all CPUs in a hctx are offline").
However, to take advantage of that blk-mq feature, the HBA HW queuess are
required to be mapped to that of the blk-mq hctx's; to do that, the HBA HW
queues need to be exposed to the upper layer.
In making that transition, the per-SCSI command request tags are no
longer unique per Scsi host - they are just unique per hctx. As such, the
HBA LLDD would have to generate this tag internally, which has a certain
performance overhead.
However another problem is that blk-mq assumes the host may accept
(Scsi_host.can_queue * #hw queue) commands. In commit 6eb045e092ef ("scsi:
core: avoid host-wide host_busy counter for scsi_mq"), the Scsi host busy
counter was removed, which would stop the LLDD being sent more than
.can_queue commands; however, it should still be ensured that the block
layer does not issue more than .can_queue commands to the Scsi host.
To solve this problem, introduce a shared sbitmap per blk_mq_tag_set,
which may be requested at init time.
New flag BLK_MQ_F_TAG_HCTX_SHARED should be set when requesting the
tagset to indicate whether the shared sbitmap should be used.
Even when BLK_MQ_F_TAG_HCTX_SHARED is set, a full set of tags and requests
are still allocated per hctx; the reason for this is that if tags and
requests were only allocated for a single hctx - like hctx0 - it may break
block drivers which expect a request be associated with a specific hctx,
i.e. not always hctx0. This will introduce extra memory usage.
This change is based on work originally from Ming Lei in [1] and from
Bart's suggestion in [2].
[0] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
[1] https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@redhat.com/
[2] https://lore.kernel.org/linux-block/ff77beff-5fd9-9f05-12b6-826922bace1f@huawei.com/T/#m3db0a602f095cbcbff27e9c884d6b4ae826144be
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Don Brace<don.brace@microsemi.com> #SCSI resv cmds patches used
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-19 15:20:24 +00:00
|
|
|
if (blk_mq_is_sbitmap_shared(set->flags))
|
|
|
|
blk_mq_exit_shared_sbitmap(set);
|
|
|
|
|
2018-10-30 16:36:06 +00:00
|
|
|
for (j = 0; j < set->nr_maps; j++) {
|
|
|
|
kfree(set->map[j].mq_map);
|
|
|
|
set->map[j].mq_map = NULL;
|
|
|
|
}
|
2016-09-14 14:18:53 +00:00
|
|
|
|
2014-04-23 16:07:34 +00:00
|
|
|
kfree(set->tags);
|
2014-09-02 16:38:44 +00:00
|
|
|
set->tags = NULL;
|
2014-04-15 20:14:00 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_mq_free_tag_set);
|
|
|
|
|
2014-05-20 17:49:02 +00:00
|
|
|
int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
|
|
|
|
{
|
|
|
|
struct blk_mq_tag_set *set = q->tag_set;
|
|
|
|
struct blk_mq_hw_ctx *hctx;
|
|
|
|
int i, ret;
|
|
|
|
|
2017-01-17 13:03:22 +00:00
|
|
|
if (!set)
|
2014-05-20 17:49:02 +00:00
|
|
|
return -EINVAL;
|
|
|
|
|
2019-02-08 16:14:05 +00:00
|
|
|
if (q->nr_requests == nr)
|
|
|
|
return 0;
|
|
|
|
|
2017-01-19 17:59:07 +00:00
|
|
|
blk_mq_freeze_queue(q);
|
2018-01-06 08:27:38 +00:00
|
|
|
blk_mq_quiesce_queue(q);
|
2017-01-19 17:59:07 +00:00
|
|
|
|
2014-05-20 17:49:02 +00:00
|
|
|
ret = 0;
|
|
|
|
queue_for_each_hw_ctx(q, hctx, i) {
|
2016-02-18 21:56:35 +00:00
|
|
|
if (!hctx->tags)
|
|
|
|
continue;
|
2017-01-17 13:03:22 +00:00
|
|
|
/*
|
|
|
|
* If we're using an MQ scheduler, just update the scheduler
|
|
|
|
* queue depth. This is similar to what the old code would do.
|
|
|
|
*/
|
2017-01-19 17:59:07 +00:00
|
|
|
if (!hctx->sched_tags) {
|
2017-09-22 15:36:28 +00:00
|
|
|
ret = blk_mq_tag_update_depth(hctx, &hctx->tags, nr,
|
2017-01-19 17:59:07 +00:00
|
|
|
false);
|
blk-mq: Facilitate a shared sbitmap per tagset
Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3 ..) support
multiple reply queues with single hostwide tags.
In addition, these drivers want to use interrupt assignment in
pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed in [0],
CPU hotplug may cause in-flight IO completion to not be serviced when an
interrupt is shutdown. That problem is solved in commit bf0beec0607d
("blk-mq: drain I/O when all CPUs in a hctx are offline").
However, to take advantage of that blk-mq feature, the HBA HW queuess are
required to be mapped to that of the blk-mq hctx's; to do that, the HBA HW
queues need to be exposed to the upper layer.
In making that transition, the per-SCSI command request tags are no
longer unique per Scsi host - they are just unique per hctx. As such, the
HBA LLDD would have to generate this tag internally, which has a certain
performance overhead.
However another problem is that blk-mq assumes the host may accept
(Scsi_host.can_queue * #hw queue) commands. In commit 6eb045e092ef ("scsi:
core: avoid host-wide host_busy counter for scsi_mq"), the Scsi host busy
counter was removed, which would stop the LLDD being sent more than
.can_queue commands; however, it should still be ensured that the block
layer does not issue more than .can_queue commands to the Scsi host.
To solve this problem, introduce a shared sbitmap per blk_mq_tag_set,
which may be requested at init time.
New flag BLK_MQ_F_TAG_HCTX_SHARED should be set when requesting the
tagset to indicate whether the shared sbitmap should be used.
Even when BLK_MQ_F_TAG_HCTX_SHARED is set, a full set of tags and requests
are still allocated per hctx; the reason for this is that if tags and
requests were only allocated for a single hctx - like hctx0 - it may break
block drivers which expect a request be associated with a specific hctx,
i.e. not always hctx0. This will introduce extra memory usage.
This change is based on work originally from Ming Lei in [1] and from
Bart's suggestion in [2].
[0] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
[1] https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@redhat.com/
[2] https://lore.kernel.org/linux-block/ff77beff-5fd9-9f05-12b6-826922bace1f@huawei.com/T/#m3db0a602f095cbcbff27e9c884d6b4ae826144be
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Don Brace<don.brace@microsemi.com> #SCSI resv cmds patches used
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-19 15:20:24 +00:00
|
|
|
if (!ret && blk_mq_is_sbitmap_shared(set->flags))
|
|
|
|
blk_mq_tag_resize_shared_sbitmap(set, nr);
|
2017-01-19 17:59:07 +00:00
|
|
|
} else {
|
|
|
|
ret = blk_mq_tag_update_depth(hctx, &hctx->sched_tags,
|
|
|
|
nr, true);
|
blk-mq: Use request queue-wide tags for tagset-wide sbitmap
The tags used for an IO scheduler are currently per hctx.
As such, when q->nr_hw_queues grows, so does the request queue total IO
scheduler tag depth.
This may cause problems for SCSI MQ HBAs whose total driver depth is
fixed.
Ming and Yanhui report higher CPU usage and lower throughput in scenarios
where the fixed total driver tag depth is appreciably lower than the total
scheduler tag depth:
https://lore.kernel.org/linux-block/440dfcfc-1a2c-bd98-1161-cec4d78c6dfc@huawei.com/T/#mc0d6d4f95275a2743d1c8c3e4dc9ff6c9aa3a76b
In that scenario, since the scheduler tag is got first, much contention
is introduced since a driver tag may not be available after we have got
the sched tag.
Improve this scenario by introducing request queue-wide tags for when
a tagset-wide sbitmap is used. The static sched requests are still
allocated per hctx, as requests are initialised per hctx, as in
blk_mq_init_request(..., hctx_idx, ...) ->
set->ops->init_request(.., hctx_idx, ...).
For simplicity of resizing the request queue sbitmap when updating the
request queue depth, just init at the max possible size, so we don't need
to deal with the possibly with swapping out a new sbitmap for old if
we need to grow.
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1620907258-30910-3-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-13 12:00:58 +00:00
|
|
|
if (blk_mq_is_sbitmap_shared(set->flags)) {
|
|
|
|
hctx->sched_tags->bitmap_tags =
|
|
|
|
&q->sched_bitmap_tags;
|
|
|
|
hctx->sched_tags->breserved_tags =
|
|
|
|
&q->sched_breserved_tags;
|
|
|
|
}
|
2017-01-19 17:59:07 +00:00
|
|
|
}
|
2014-05-20 17:49:02 +00:00
|
|
|
if (ret)
|
|
|
|
break;
|
2019-01-18 17:34:16 +00:00
|
|
|
if (q->elevator && q->elevator->type->ops.depth_updated)
|
|
|
|
q->elevator->type->ops.depth_updated(hctx);
|
2014-05-20 17:49:02 +00:00
|
|
|
}
|
blk-mq: Use request queue-wide tags for tagset-wide sbitmap
The tags used for an IO scheduler are currently per hctx.
As such, when q->nr_hw_queues grows, so does the request queue total IO
scheduler tag depth.
This may cause problems for SCSI MQ HBAs whose total driver depth is
fixed.
Ming and Yanhui report higher CPU usage and lower throughput in scenarios
where the fixed total driver tag depth is appreciably lower than the total
scheduler tag depth:
https://lore.kernel.org/linux-block/440dfcfc-1a2c-bd98-1161-cec4d78c6dfc@huawei.com/T/#mc0d6d4f95275a2743d1c8c3e4dc9ff6c9aa3a76b
In that scenario, since the scheduler tag is got first, much contention
is introduced since a driver tag may not be available after we have got
the sched tag.
Improve this scenario by introducing request queue-wide tags for when
a tagset-wide sbitmap is used. The static sched requests are still
allocated per hctx, as requests are initialised per hctx, as in
blk_mq_init_request(..., hctx_idx, ...) ->
set->ops->init_request(.., hctx_idx, ...).
For simplicity of resizing the request queue sbitmap when updating the
request queue depth, just init at the max possible size, so we don't need
to deal with the possibly with swapping out a new sbitmap for old if
we need to grow.
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1620907258-30910-3-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-13 12:00:58 +00:00
|
|
|
if (!ret) {
|
2014-05-20 17:49:02 +00:00
|
|
|
q->nr_requests = nr;
|
blk-mq: Use request queue-wide tags for tagset-wide sbitmap
The tags used for an IO scheduler are currently per hctx.
As such, when q->nr_hw_queues grows, so does the request queue total IO
scheduler tag depth.
This may cause problems for SCSI MQ HBAs whose total driver depth is
fixed.
Ming and Yanhui report higher CPU usage and lower throughput in scenarios
where the fixed total driver tag depth is appreciably lower than the total
scheduler tag depth:
https://lore.kernel.org/linux-block/440dfcfc-1a2c-bd98-1161-cec4d78c6dfc@huawei.com/T/#mc0d6d4f95275a2743d1c8c3e4dc9ff6c9aa3a76b
In that scenario, since the scheduler tag is got first, much contention
is introduced since a driver tag may not be available after we have got
the sched tag.
Improve this scenario by introducing request queue-wide tags for when
a tagset-wide sbitmap is used. The static sched requests are still
allocated per hctx, as requests are initialised per hctx, as in
blk_mq_init_request(..., hctx_idx, ...) ->
set->ops->init_request(.., hctx_idx, ...).
For simplicity of resizing the request queue sbitmap when updating the
request queue depth, just init at the max possible size, so we don't need
to deal with the possibly with swapping out a new sbitmap for old if
we need to grow.
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1620907258-30910-3-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-13 12:00:58 +00:00
|
|
|
if (q->elevator && blk_mq_is_sbitmap_shared(set->flags))
|
|
|
|
sbitmap_queue_resize(&q->sched_bitmap_tags,
|
|
|
|
nr - set->reserved_tags);
|
|
|
|
}
|
2014-05-20 17:49:02 +00:00
|
|
|
|
2018-01-06 08:27:38 +00:00
|
|
|
blk_mq_unquiesce_queue(q);
|
2017-01-19 17:59:07 +00:00
|
|
|
blk_mq_unfreeze_queue(q);
|
|
|
|
|
2014-05-20 17:49:02 +00:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2018-08-21 07:15:03 +00:00
|
|
|
/*
|
|
|
|
* request_queue and elevator_type pair.
|
|
|
|
* It is just used by __blk_mq_update_nr_hw_queues to cache
|
|
|
|
* the elevator_type associated with a request_queue.
|
|
|
|
*/
|
|
|
|
struct blk_mq_qe_pair {
|
|
|
|
struct list_head node;
|
|
|
|
struct request_queue *q;
|
|
|
|
struct elevator_type *type;
|
|
|
|
};
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Cache the elevator_type in qe pair list and switch the
|
|
|
|
* io scheduler to 'none'
|
|
|
|
*/
|
|
|
|
static bool blk_mq_elv_switch_none(struct list_head *head,
|
|
|
|
struct request_queue *q)
|
|
|
|
{
|
|
|
|
struct blk_mq_qe_pair *qe;
|
|
|
|
|
|
|
|
if (!q->elevator)
|
|
|
|
return true;
|
|
|
|
|
|
|
|
qe = kmalloc(sizeof(*qe), GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY);
|
|
|
|
if (!qe)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
INIT_LIST_HEAD(&qe->node);
|
|
|
|
qe->q = q;
|
|
|
|
qe->type = q->elevator->type;
|
|
|
|
list_add(&qe->node, head);
|
|
|
|
|
|
|
|
mutex_lock(&q->sysfs_lock);
|
|
|
|
/*
|
|
|
|
* After elevator_switch_mq, the previous elevator_queue will be
|
|
|
|
* released by elevator_release. The reference of the io scheduler
|
|
|
|
* module get by elevator_get will also be put. So we need to get
|
|
|
|
* a reference of the io scheduler module here to prevent it to be
|
|
|
|
* removed.
|
|
|
|
*/
|
|
|
|
__module_get(qe->type->elevator_owner);
|
|
|
|
elevator_switch_mq(q, NULL);
|
|
|
|
mutex_unlock(&q->sysfs_lock);
|
|
|
|
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void blk_mq_elv_switch_back(struct list_head *head,
|
|
|
|
struct request_queue *q)
|
|
|
|
{
|
|
|
|
struct blk_mq_qe_pair *qe;
|
|
|
|
struct elevator_type *t = NULL;
|
|
|
|
|
|
|
|
list_for_each_entry(qe, head, node)
|
|
|
|
if (qe->q == q) {
|
|
|
|
t = qe->type;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!t)
|
|
|
|
return;
|
|
|
|
|
|
|
|
list_del(&qe->node);
|
|
|
|
kfree(qe);
|
|
|
|
|
|
|
|
mutex_lock(&q->sysfs_lock);
|
|
|
|
elevator_switch_mq(q, t);
|
|
|
|
mutex_unlock(&q->sysfs_lock);
|
|
|
|
}
|
|
|
|
|
2017-05-30 18:39:11 +00:00
|
|
|
static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
|
|
|
|
int nr_hw_queues)
|
2015-12-18 00:08:14 +00:00
|
|
|
{
|
|
|
|
struct request_queue *q;
|
2018-08-21 07:15:03 +00:00
|
|
|
LIST_HEAD(head);
|
blk-mq: fallback to previous nr_hw_queues when updating fails
When we try to increate the nr_hw_queues, we may fail due to
shortage of memory or other reason, then blk_mq_realloc_hw_ctxs stops
and some entries in q->queue_hw_ctx are left with NULL. However,
because queue map has been updated with new nr_hw_queues, some cpus
have been mapped to hw queue which just encounters allocation failure,
thus blk_mq_map_queue could return NULL. This will cause panic in
following blk_mq_map_swqueue.
To fix it, when increase nr_hw_queues fails, fallback to previous
nr_hw_queues and post warning. At the same time, driver's .map_queues
usually use completion irq affinity to map hw and cpu, fallback
nr_hw_queues will cause lack of some cpu's map to hw, so use default
blk_mq_map_queues to do that.
Reported-by: syzbot+83e8cbe702263932d9d4@syzkaller.appspotmail.com
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-12 10:07:28 +00:00
|
|
|
int prev_nr_hw_queues;
|
2015-12-18 00:08:14 +00:00
|
|
|
|
2017-04-07 18:16:49 +00:00
|
|
|
lockdep_assert_held(&set->tag_list_lock);
|
|
|
|
|
2018-10-29 19:25:27 +00:00
|
|
|
if (set->nr_maps == 1 && nr_hw_queues > nr_cpu_ids)
|
2015-12-18 00:08:14 +00:00
|
|
|
nr_hw_queues = nr_cpu_ids;
|
2020-06-17 06:18:37 +00:00
|
|
|
if (nr_hw_queues < 1)
|
|
|
|
return;
|
|
|
|
if (set->nr_maps == 1 && nr_hw_queues == set->nr_hw_queues)
|
2015-12-18 00:08:14 +00:00
|
|
|
return;
|
|
|
|
|
|
|
|
list_for_each_entry(q, &set->tag_list, tag_set_list)
|
|
|
|
blk_mq_freeze_queue(q);
|
2018-08-21 07:15:03 +00:00
|
|
|
/*
|
|
|
|
* Switch IO scheduler to 'none', cleaning up the data associated
|
|
|
|
* with the previous scheduler. We will switch back once we are done
|
|
|
|
* updating the new sw to hw queue mappings.
|
|
|
|
*/
|
|
|
|
list_for_each_entry(q, &set->tag_list, tag_set_list)
|
|
|
|
if (!blk_mq_elv_switch_none(&head, q))
|
|
|
|
goto switch_back;
|
2015-12-18 00:08:14 +00:00
|
|
|
|
2018-10-12 10:07:25 +00:00
|
|
|
list_for_each_entry(q, &set->tag_list, tag_set_list) {
|
|
|
|
blk_mq_debugfs_unregister_hctxs(q);
|
|
|
|
blk_mq_sysfs_unregister(q);
|
|
|
|
}
|
|
|
|
|
2020-05-07 13:03:56 +00:00
|
|
|
prev_nr_hw_queues = set->nr_hw_queues;
|
2019-10-25 16:50:10 +00:00
|
|
|
if (blk_mq_realloc_tag_set_tags(set, set->nr_hw_queues, nr_hw_queues) <
|
|
|
|
0)
|
|
|
|
goto reregister;
|
|
|
|
|
2015-12-18 00:08:14 +00:00
|
|
|
set->nr_hw_queues = nr_hw_queues;
|
blk-mq: fallback to previous nr_hw_queues when updating fails
When we try to increate the nr_hw_queues, we may fail due to
shortage of memory or other reason, then blk_mq_realloc_hw_ctxs stops
and some entries in q->queue_hw_ctx are left with NULL. However,
because queue map has been updated with new nr_hw_queues, some cpus
have been mapped to hw queue which just encounters allocation failure,
thus blk_mq_map_queue could return NULL. This will cause panic in
following blk_mq_map_swqueue.
To fix it, when increase nr_hw_queues fails, fallback to previous
nr_hw_queues and post warning. At the same time, driver's .map_queues
usually use completion irq affinity to map hw and cpu, fallback
nr_hw_queues will cause lack of some cpu's map to hw, so use default
blk_mq_map_queues to do that.
Reported-by: syzbot+83e8cbe702263932d9d4@syzkaller.appspotmail.com
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-12 10:07:28 +00:00
|
|
|
fallback:
|
2020-05-13 00:44:05 +00:00
|
|
|
blk_mq_update_queue_map(set);
|
2015-12-18 00:08:14 +00:00
|
|
|
list_for_each_entry(q, &set->tag_list, tag_set_list) {
|
|
|
|
blk_mq_realloc_hw_ctxs(set, q);
|
blk-mq: fallback to previous nr_hw_queues when updating fails
When we try to increate the nr_hw_queues, we may fail due to
shortage of memory or other reason, then blk_mq_realloc_hw_ctxs stops
and some entries in q->queue_hw_ctx are left with NULL. However,
because queue map has been updated with new nr_hw_queues, some cpus
have been mapped to hw queue which just encounters allocation failure,
thus blk_mq_map_queue could return NULL. This will cause panic in
following blk_mq_map_swqueue.
To fix it, when increase nr_hw_queues fails, fallback to previous
nr_hw_queues and post warning. At the same time, driver's .map_queues
usually use completion irq affinity to map hw and cpu, fallback
nr_hw_queues will cause lack of some cpu's map to hw, so use default
blk_mq_map_queues to do that.
Reported-by: syzbot+83e8cbe702263932d9d4@syzkaller.appspotmail.com
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-12 10:07:28 +00:00
|
|
|
if (q->nr_hw_queues != set->nr_hw_queues) {
|
|
|
|
pr_warn("Increasing nr_hw_queues to %d fails, fallback to %d\n",
|
|
|
|
nr_hw_queues, prev_nr_hw_queues);
|
|
|
|
set->nr_hw_queues = prev_nr_hw_queues;
|
2019-02-27 13:35:01 +00:00
|
|
|
blk_mq_map_queues(&set->map[HCTX_TYPE_DEFAULT]);
|
blk-mq: fallback to previous nr_hw_queues when updating fails
When we try to increate the nr_hw_queues, we may fail due to
shortage of memory or other reason, then blk_mq_realloc_hw_ctxs stops
and some entries in q->queue_hw_ctx are left with NULL. However,
because queue map has been updated with new nr_hw_queues, some cpus
have been mapped to hw queue which just encounters allocation failure,
thus blk_mq_map_queue could return NULL. This will cause panic in
following blk_mq_map_swqueue.
To fix it, when increase nr_hw_queues fails, fallback to previous
nr_hw_queues and post warning. At the same time, driver's .map_queues
usually use completion irq affinity to map hw and cpu, fallback
nr_hw_queues will cause lack of some cpu's map to hw, so use default
blk_mq_map_queues to do that.
Reported-by: syzbot+83e8cbe702263932d9d4@syzkaller.appspotmail.com
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-12 10:07:28 +00:00
|
|
|
goto fallback;
|
|
|
|
}
|
2018-10-12 10:07:25 +00:00
|
|
|
blk_mq_map_swqueue(q);
|
|
|
|
}
|
|
|
|
|
2019-10-25 16:50:10 +00:00
|
|
|
reregister:
|
2018-10-12 10:07:25 +00:00
|
|
|
list_for_each_entry(q, &set->tag_list, tag_set_list) {
|
|
|
|
blk_mq_sysfs_register(q);
|
|
|
|
blk_mq_debugfs_register_hctxs(q);
|
2015-12-18 00:08:14 +00:00
|
|
|
}
|
|
|
|
|
2018-08-21 07:15:03 +00:00
|
|
|
switch_back:
|
|
|
|
list_for_each_entry(q, &set->tag_list, tag_set_list)
|
|
|
|
blk_mq_elv_switch_back(&head, q);
|
|
|
|
|
2015-12-18 00:08:14 +00:00
|
|
|
list_for_each_entry(q, &set->tag_list, tag_set_list)
|
|
|
|
blk_mq_unfreeze_queue(q);
|
|
|
|
}
|
2017-05-30 18:39:11 +00:00
|
|
|
|
|
|
|
void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
|
|
|
|
{
|
|
|
|
mutex_lock(&set->tag_list_lock);
|
|
|
|
__blk_mq_update_nr_hw_queues(set, nr_hw_queues);
|
|
|
|
mutex_unlock(&set->tag_list_lock);
|
|
|
|
}
|
2015-12-18 00:08:14 +00:00
|
|
|
EXPORT_SYMBOL_GPL(blk_mq_update_nr_hw_queues);
|
|
|
|
|
blk-stat: convert to callback-based statistics reporting
Currently, statistics are gathered in ~0.13s windows, and users grab the
statistics whenever they need them. This is not ideal for both in-tree
users:
1. Writeback throttling wants its own dynamically sized window of
statistics. Since the blk-stats statistics are reset after every
window and the wbt windows don't line up with the blk-stats windows,
wbt doesn't see every I/O.
2. Polling currently grabs the statistics on every I/O. Again, depending
on how the window lines up, we may miss some I/Os. It's also
unnecessary overhead to get the statistics on every I/O; the hybrid
polling heuristic would be just as happy with the statistics from the
previous full window.
This reworks the blk-stats infrastructure to be callback-based: users
register a callback that they want called at a given time with all of
the statistics from the window during which the callback was active.
Users can dynamically bucketize the statistics. wbt and polling both
currently use read vs. write, but polling can be extended to further
subdivide based on request size.
The callbacks are kept on an RCU list, and each callback has percpu
stats buffers. There will only be a few users, so the overhead on the
I/O completion side is low. The stats flushing is also simplified
considerably: since the timer function is responsible for clearing the
statistics, we don't have to worry about stale statistics.
wbt is a trivial conversion. After the conversion, the windowing problem
mentioned above is fixed.
For polling, we register an extra callback that caches the previous
window's statistics in the struct request_queue for the hybrid polling
heuristic to use.
Since we no longer have a single stats buffer for the request queue,
this also removes the sysfs and debugfs stats entries. To replace those,
we add a debugfs entry for the poll statistics.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-21 15:56:08 +00:00
|
|
|
/* Enable polling stats and return whether they were already enabled. */
|
|
|
|
static bool blk_poll_stats_enable(struct request_queue *q)
|
|
|
|
{
|
|
|
|
if (test_bit(QUEUE_FLAG_POLL_STATS, &q->queue_flags) ||
|
2018-03-08 01:10:05 +00:00
|
|
|
blk_queue_flag_test_and_set(QUEUE_FLAG_POLL_STATS, q))
|
blk-stat: convert to callback-based statistics reporting
Currently, statistics are gathered in ~0.13s windows, and users grab the
statistics whenever they need them. This is not ideal for both in-tree
users:
1. Writeback throttling wants its own dynamically sized window of
statistics. Since the blk-stats statistics are reset after every
window and the wbt windows don't line up with the blk-stats windows,
wbt doesn't see every I/O.
2. Polling currently grabs the statistics on every I/O. Again, depending
on how the window lines up, we may miss some I/Os. It's also
unnecessary overhead to get the statistics on every I/O; the hybrid
polling heuristic would be just as happy with the statistics from the
previous full window.
This reworks the blk-stats infrastructure to be callback-based: users
register a callback that they want called at a given time with all of
the statistics from the window during which the callback was active.
Users can dynamically bucketize the statistics. wbt and polling both
currently use read vs. write, but polling can be extended to further
subdivide based on request size.
The callbacks are kept on an RCU list, and each callback has percpu
stats buffers. There will only be a few users, so the overhead on the
I/O completion side is low. The stats flushing is also simplified
considerably: since the timer function is responsible for clearing the
statistics, we don't have to worry about stale statistics.
wbt is a trivial conversion. After the conversion, the windowing problem
mentioned above is fixed.
For polling, we register an extra callback that caches the previous
window's statistics in the struct request_queue for the hybrid polling
heuristic to use.
Since we no longer have a single stats buffer for the request queue,
this also removes the sysfs and debugfs stats entries. To replace those,
we add a debugfs entry for the poll statistics.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-21 15:56:08 +00:00
|
|
|
return true;
|
|
|
|
blk_stat_add_callback(q, q->poll_cb);
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void blk_mq_poll_stats_start(struct request_queue *q)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* We don't arm the callback if polling stats are not enabled or the
|
|
|
|
* callback is already active.
|
|
|
|
*/
|
|
|
|
if (!test_bit(QUEUE_FLAG_POLL_STATS, &q->queue_flags) ||
|
|
|
|
blk_stat_is_active(q->poll_cb))
|
|
|
|
return;
|
|
|
|
|
|
|
|
blk_stat_activate_msecs(q->poll_cb, 100);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb)
|
|
|
|
{
|
|
|
|
struct request_queue *q = cb->data;
|
2017-04-07 12:24:03 +00:00
|
|
|
int bucket;
|
blk-stat: convert to callback-based statistics reporting
Currently, statistics are gathered in ~0.13s windows, and users grab the
statistics whenever they need them. This is not ideal for both in-tree
users:
1. Writeback throttling wants its own dynamically sized window of
statistics. Since the blk-stats statistics are reset after every
window and the wbt windows don't line up with the blk-stats windows,
wbt doesn't see every I/O.
2. Polling currently grabs the statistics on every I/O. Again, depending
on how the window lines up, we may miss some I/Os. It's also
unnecessary overhead to get the statistics on every I/O; the hybrid
polling heuristic would be just as happy with the statistics from the
previous full window.
This reworks the blk-stats infrastructure to be callback-based: users
register a callback that they want called at a given time with all of
the statistics from the window during which the callback was active.
Users can dynamically bucketize the statistics. wbt and polling both
currently use read vs. write, but polling can be extended to further
subdivide based on request size.
The callbacks are kept on an RCU list, and each callback has percpu
stats buffers. There will only be a few users, so the overhead on the
I/O completion side is low. The stats flushing is also simplified
considerably: since the timer function is responsible for clearing the
statistics, we don't have to worry about stale statistics.
wbt is a trivial conversion. After the conversion, the windowing problem
mentioned above is fixed.
For polling, we register an extra callback that caches the previous
window's statistics in the struct request_queue for the hybrid polling
heuristic to use.
Since we no longer have a single stats buffer for the request queue,
this also removes the sysfs and debugfs stats entries. To replace those,
we add a debugfs entry for the poll statistics.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-21 15:56:08 +00:00
|
|
|
|
2017-04-07 12:24:03 +00:00
|
|
|
for (bucket = 0; bucket < BLK_MQ_POLL_STATS_BKTS; bucket++) {
|
|
|
|
if (cb->stat[bucket].nr_samples)
|
|
|
|
q->poll_stat[bucket] = cb->stat[bucket];
|
|
|
|
}
|
blk-stat: convert to callback-based statistics reporting
Currently, statistics are gathered in ~0.13s windows, and users grab the
statistics whenever they need them. This is not ideal for both in-tree
users:
1. Writeback throttling wants its own dynamically sized window of
statistics. Since the blk-stats statistics are reset after every
window and the wbt windows don't line up with the blk-stats windows,
wbt doesn't see every I/O.
2. Polling currently grabs the statistics on every I/O. Again, depending
on how the window lines up, we may miss some I/Os. It's also
unnecessary overhead to get the statistics on every I/O; the hybrid
polling heuristic would be just as happy with the statistics from the
previous full window.
This reworks the blk-stats infrastructure to be callback-based: users
register a callback that they want called at a given time with all of
the statistics from the window during which the callback was active.
Users can dynamically bucketize the statistics. wbt and polling both
currently use read vs. write, but polling can be extended to further
subdivide based on request size.
The callbacks are kept on an RCU list, and each callback has percpu
stats buffers. There will only be a few users, so the overhead on the
I/O completion side is low. The stats flushing is also simplified
considerably: since the timer function is responsible for clearing the
statistics, we don't have to worry about stale statistics.
wbt is a trivial conversion. After the conversion, the windowing problem
mentioned above is fixed.
For polling, we register an extra callback that caches the previous
window's statistics in the struct request_queue for the hybrid polling
heuristic to use.
Since we no longer have a single stats buffer for the request queue,
this also removes the sysfs and debugfs stats entries. To replace those,
we add a debugfs entry for the poll statistics.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-21 15:56:08 +00:00
|
|
|
}
|
|
|
|
|
2016-11-14 20:03:03 +00:00
|
|
|
static unsigned long blk_mq_poll_nsecs(struct request_queue *q,
|
|
|
|
struct request *rq)
|
|
|
|
{
|
|
|
|
unsigned long ret = 0;
|
2017-04-07 12:24:03 +00:00
|
|
|
int bucket;
|
2016-11-14 20:03:03 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If stats collection isn't on, don't sleep but turn it on for
|
|
|
|
* future users
|
|
|
|
*/
|
blk-stat: convert to callback-based statistics reporting
Currently, statistics are gathered in ~0.13s windows, and users grab the
statistics whenever they need them. This is not ideal for both in-tree
users:
1. Writeback throttling wants its own dynamically sized window of
statistics. Since the blk-stats statistics are reset after every
window and the wbt windows don't line up with the blk-stats windows,
wbt doesn't see every I/O.
2. Polling currently grabs the statistics on every I/O. Again, depending
on how the window lines up, we may miss some I/Os. It's also
unnecessary overhead to get the statistics on every I/O; the hybrid
polling heuristic would be just as happy with the statistics from the
previous full window.
This reworks the blk-stats infrastructure to be callback-based: users
register a callback that they want called at a given time with all of
the statistics from the window during which the callback was active.
Users can dynamically bucketize the statistics. wbt and polling both
currently use read vs. write, but polling can be extended to further
subdivide based on request size.
The callbacks are kept on an RCU list, and each callback has percpu
stats buffers. There will only be a few users, so the overhead on the
I/O completion side is low. The stats flushing is also simplified
considerably: since the timer function is responsible for clearing the
statistics, we don't have to worry about stale statistics.
wbt is a trivial conversion. After the conversion, the windowing problem
mentioned above is fixed.
For polling, we register an extra callback that caches the previous
window's statistics in the struct request_queue for the hybrid polling
heuristic to use.
Since we no longer have a single stats buffer for the request queue,
this also removes the sysfs and debugfs stats entries. To replace those,
we add a debugfs entry for the poll statistics.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-21 15:56:08 +00:00
|
|
|
if (!blk_poll_stats_enable(q))
|
2016-11-14 20:03:03 +00:00
|
|
|
return 0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* As an optimistic guess, use half of the mean service time
|
|
|
|
* for this type of request. We can (and should) make this smarter.
|
|
|
|
* For instance, if the completion latencies are tight, we can
|
|
|
|
* get closer than just half the mean. This is especially
|
|
|
|
* important on devices where the completion latencies are longer
|
2017-04-07 12:24:03 +00:00
|
|
|
* than ~10 usec. We do use the stats for the relevant IO size
|
|
|
|
* if available which does lead to better estimates.
|
2016-11-14 20:03:03 +00:00
|
|
|
*/
|
2017-04-07 12:24:03 +00:00
|
|
|
bucket = blk_mq_poll_stats_bkt(rq);
|
|
|
|
if (bucket < 0)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
if (q->poll_stat[bucket].nr_samples)
|
|
|
|
ret = (q->poll_stat[bucket].mean + 1) / 2;
|
2016-11-14 20:03:03 +00:00
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2016-11-14 20:01:59 +00:00
|
|
|
static bool blk_mq_poll_hybrid_sleep(struct request_queue *q,
|
|
|
|
struct request *rq)
|
|
|
|
{
|
|
|
|
struct hrtimer_sleeper hs;
|
|
|
|
enum hrtimer_mode mode;
|
2016-11-14 20:03:03 +00:00
|
|
|
unsigned int nsecs;
|
2016-11-14 20:01:59 +00:00
|
|
|
ktime_t kt;
|
|
|
|
|
2018-01-10 18:30:56 +00:00
|
|
|
if (rq->rq_flags & RQF_MQ_POLL_SLEPT)
|
2016-11-14 20:03:03 +00:00
|
|
|
return false;
|
|
|
|
|
|
|
|
/*
|
2018-11-26 15:21:49 +00:00
|
|
|
* If we get here, hybrid polling is enabled. Hence poll_nsec can be:
|
2016-11-14 20:03:03 +00:00
|
|
|
*
|
|
|
|
* 0: use half of prev avg
|
|
|
|
* >0: use this specific value
|
|
|
|
*/
|
2018-11-26 15:21:49 +00:00
|
|
|
if (q->poll_nsec > 0)
|
2016-11-14 20:03:03 +00:00
|
|
|
nsecs = q->poll_nsec;
|
|
|
|
else
|
2020-02-26 12:10:15 +00:00
|
|
|
nsecs = blk_mq_poll_nsecs(q, rq);
|
2016-11-14 20:03:03 +00:00
|
|
|
|
|
|
|
if (!nsecs)
|
2016-11-14 20:01:59 +00:00
|
|
|
return false;
|
|
|
|
|
2018-01-10 18:30:56 +00:00
|
|
|
rq->rq_flags |= RQF_MQ_POLL_SLEPT;
|
2016-11-14 20:01:59 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* This will be replaced with the stats tracking code, using
|
|
|
|
* 'avg_completion_time / 2' as the pre-sleep target.
|
|
|
|
*/
|
2016-12-25 11:30:41 +00:00
|
|
|
kt = nsecs;
|
2016-11-14 20:01:59 +00:00
|
|
|
|
|
|
|
mode = HRTIMER_MODE_REL;
|
2019-07-26 18:30:50 +00:00
|
|
|
hrtimer_init_sleeper_on_stack(&hs, CLOCK_MONOTONIC, mode);
|
2016-11-14 20:01:59 +00:00
|
|
|
hrtimer_set_expires(&hs.timer, kt);
|
|
|
|
|
|
|
|
do {
|
2018-01-09 16:29:52 +00:00
|
|
|
if (blk_mq_rq_state(rq) == MQ_RQ_COMPLETE)
|
2016-11-14 20:01:59 +00:00
|
|
|
break;
|
|
|
|
set_current_state(TASK_UNINTERRUPTIBLE);
|
2019-07-30 19:16:55 +00:00
|
|
|
hrtimer_sleeper_start_expires(&hs, mode);
|
2016-11-14 20:01:59 +00:00
|
|
|
if (hs.task)
|
|
|
|
io_schedule();
|
|
|
|
hrtimer_cancel(&hs.timer);
|
|
|
|
mode = HRTIMER_MODE_ABS;
|
|
|
|
} while (hs.task && !signal_pending(current));
|
|
|
|
|
|
|
|
__set_current_state(TASK_RUNNING);
|
|
|
|
destroy_hrtimer_on_stack(&hs.timer);
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2018-11-26 15:21:49 +00:00
|
|
|
static bool blk_mq_poll_hybrid(struct request_queue *q,
|
|
|
|
struct blk_mq_hw_ctx *hctx, blk_qc_t cookie)
|
2016-11-04 15:34:34 +00:00
|
|
|
{
|
2018-11-26 15:21:49 +00:00
|
|
|
struct request *rq;
|
|
|
|
|
2019-03-18 14:44:41 +00:00
|
|
|
if (q->poll_nsec == BLK_MQ_POLL_CLASSIC)
|
2018-11-26 15:21:49 +00:00
|
|
|
return false;
|
|
|
|
|
|
|
|
if (!blk_qc_t_is_internal(cookie))
|
|
|
|
rq = blk_mq_tag_to_rq(hctx->tags, blk_qc_t_to_tag(cookie));
|
|
|
|
else {
|
|
|
|
rq = blk_mq_tag_to_rq(hctx->sched_tags, blk_qc_t_to_tag(cookie));
|
|
|
|
/*
|
|
|
|
* With scheduling, if the request has completed, we'll
|
|
|
|
* get a NULL return here, as we clear the sched tag when
|
|
|
|
* that happens. The request still remains valid, like always,
|
|
|
|
* so we should be safe with just the NULL check.
|
|
|
|
*/
|
|
|
|
if (!rq)
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2020-02-26 12:10:15 +00:00
|
|
|
return blk_mq_poll_hybrid_sleep(q, rq);
|
2018-11-26 15:21:49 +00:00
|
|
|
}
|
|
|
|
|
2018-12-02 16:46:26 +00:00
|
|
|
/**
|
|
|
|
* blk_poll - poll for IO completions
|
|
|
|
* @q: the queue
|
|
|
|
* @cookie: cookie passed back at IO submission time
|
|
|
|
* @spin: whether to spin for completions
|
|
|
|
*
|
|
|
|
* Description:
|
|
|
|
* Poll for completions on the passed in queue. Returns number of
|
|
|
|
* completed entries found. If @spin is true, then blk_poll will continue
|
|
|
|
* looping until at least one completion is found, unless the task is
|
|
|
|
* otherwise marked running (or we need to reschedule).
|
|
|
|
*/
|
|
|
|
int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
|
2018-11-26 15:21:49 +00:00
|
|
|
{
|
|
|
|
struct blk_mq_hw_ctx *hctx;
|
2021-06-11 08:28:17 +00:00
|
|
|
unsigned int state;
|
2016-11-04 15:34:34 +00:00
|
|
|
|
2018-12-02 16:46:26 +00:00
|
|
|
if (!blk_qc_t_valid(cookie) ||
|
|
|
|
!test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
|
2018-11-26 15:21:49 +00:00
|
|
|
return 0;
|
|
|
|
|
2018-12-02 16:46:26 +00:00
|
|
|
if (current->plug)
|
|
|
|
blk_flush_plug_list(current->plug, false);
|
|
|
|
|
2018-11-26 15:21:49 +00:00
|
|
|
hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
|
|
|
|
|
2016-11-14 20:01:59 +00:00
|
|
|
/*
|
|
|
|
* If we sleep, have the caller restart the poll loop to reset
|
|
|
|
* the state. Like for the other success return cases, the
|
|
|
|
* caller is responsible for checking if the IO completed. If
|
|
|
|
* the IO isn't complete, we'll get called again and will go
|
2020-12-06 14:04:39 +00:00
|
|
|
* straight to the busy poll loop. If specified not to spin,
|
|
|
|
* we also should not sleep.
|
2016-11-14 20:01:59 +00:00
|
|
|
*/
|
2020-12-06 14:04:39 +00:00
|
|
|
if (spin && blk_mq_poll_hybrid(q, hctx, cookie))
|
2018-11-06 20:30:55 +00:00
|
|
|
return 1;
|
2016-11-14 20:01:59 +00:00
|
|
|
|
2016-11-04 15:34:34 +00:00
|
|
|
hctx->poll_considered++;
|
|
|
|
|
2021-06-11 08:28:14 +00:00
|
|
|
state = get_current_state();
|
2018-11-14 04:32:10 +00:00
|
|
|
do {
|
2016-11-04 15:34:34 +00:00
|
|
|
int ret;
|
|
|
|
|
|
|
|
hctx->poll_invoked++;
|
|
|
|
|
2018-11-16 16:48:21 +00:00
|
|
|
ret = q->mq_ops->poll(hctx);
|
2016-11-04 15:34:34 +00:00
|
|
|
if (ret > 0) {
|
|
|
|
hctx->poll_success++;
|
2018-11-16 15:37:34 +00:00
|
|
|
__set_current_state(TASK_RUNNING);
|
2018-11-06 20:30:55 +00:00
|
|
|
return ret;
|
2016-11-04 15:34:34 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
if (signal_pending_state(state, current))
|
2018-11-16 15:37:34 +00:00
|
|
|
__set_current_state(TASK_RUNNING);
|
2016-11-04 15:34:34 +00:00
|
|
|
|
2021-06-11 08:28:12 +00:00
|
|
|
if (task_is_running(current))
|
2018-11-06 20:30:55 +00:00
|
|
|
return 1;
|
2018-11-26 15:24:43 +00:00
|
|
|
if (ret < 0 || !spin)
|
2016-11-04 15:34:34 +00:00
|
|
|
break;
|
|
|
|
cpu_relax();
|
2018-11-14 04:32:10 +00:00
|
|
|
} while (!need_resched());
|
2016-11-04 15:34:34 +00:00
|
|
|
|
2018-02-13 15:48:12 +00:00
|
|
|
__set_current_state(TASK_RUNNING);
|
2018-11-06 20:30:55 +00:00
|
|
|
return 0;
|
2016-11-04 15:34:34 +00:00
|
|
|
}
|
2018-12-02 16:46:26 +00:00
|
|
|
EXPORT_SYMBOL_GPL(blk_poll);
|
2016-11-04 15:34:34 +00:00
|
|
|
|
2018-10-31 23:01:22 +00:00
|
|
|
unsigned int blk_mq_rq_cpu(struct request *rq)
|
|
|
|
{
|
|
|
|
return rq->mq_ctx->cpu;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(blk_mq_rq_cpu);
|
|
|
|
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
static int __init blk_mq_init(void)
|
|
|
|
{
|
2020-06-11 06:44:41 +00:00
|
|
|
int i;
|
|
|
|
|
|
|
|
for_each_possible_cpu(i)
|
2021-01-23 20:10:27 +00:00
|
|
|
init_llist_head(&per_cpu(blk_cpu_done, i));
|
2020-06-11 06:44:41 +00:00
|
|
|
open_softirq(BLOCK_SOFTIRQ, blk_done_softirq);
|
|
|
|
|
|
|
|
cpuhp_setup_state_nocalls(CPUHP_BLOCK_SOFTIRQ_DEAD,
|
|
|
|
"block/softirq:dead", NULL,
|
|
|
|
blk_softirq_cpu_dead);
|
2016-09-22 14:05:17 +00:00
|
|
|
cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead", NULL,
|
|
|
|
blk_mq_hctx_notify_dead);
|
2020-05-29 13:53:15 +00:00
|
|
|
cpuhp_setup_state_multi(CPUHP_AP_BLK_MQ_ONLINE, "block/mq:online",
|
|
|
|
blk_mq_hctx_notify_online,
|
|
|
|
blk_mq_hctx_notify_offline);
|
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
driver generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24 08:20:05 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
subsys_initcall(blk_mq_init);
|