This way we don't need a block_device structure to submit I/O. The
block_device has different life time rules from the gendisk and
request_queue and is usually only available when the block device node
is open. Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).
For the actual I/O path all that we need is the gendisk, which exists
once per block device. But given that the block layer also does
partition remapping we additionally need a partition index, which is
used for said remapping in generic_make_request.
Note that all the block drivers generally want request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
No functional change in this patch, just in preparation for
basing the inflight mechanism on the queue in question.
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
No functional changes in this patch, we just use up some holes
in the bio and request structures to define a write hint that
we psas down the stack.
Ensure that we don't merge requests that have different life time
hints assigned to them, and that we inherit the write hint when
cloning a bio.
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of documenting the locking assumptions of most block layer
functions as a comment, use lockdep_assert_held() to verify locking
assumptions at runtime.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_bio_segment_split() makes sure bios have no more than
BIO_MAX_PAGES entries in the bi_io_vec.
This was done because bio_clone_bioset() (when given a
mempool bioset) could not handle larger io_vecs.
No driver uses bio_clone_bioset() any more, they all
use bio_clone_fast() if anything, and bio_clone_fast()
doesn't clone the bi_io_vec.
The main user of of bio_clone_bioset() at this level
is bounce.c, and bouncing now happens before blk_bio_segment_split(),
so that is not of concern.
So remove the big helpful comment and the code.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bio_clone() is no longer used.
Only bio_clone_bioset() or bio_clone_fast().
This is for the best, as bio_clone() used fs_bio_set,
and filesystems are unlikely to want to use bio_clone().
So remove bio_clone() and all references.
This includes a fix to some incorrect documentation.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since commit 23688bf4f8 ("block: ensure to split after potentially
bouncing a bio") blk_queue_bounce() is called *before*
blk_queue_split().
This means that:
1/ the comments blk_queue_split() about bounce buffers are
irrelevant, and
2/ a very large bio (more than BIO_MAX_PAGES) will no longer be
split before it arrives at blk_queue_bounce(), leading to the
possibility that bio_clone_bioset() will fail and a NULL
will be dereferenced.
Separately, blk_queue_bounce() shouldn't use fs_bio_set as the bio
being copied could be from the same set, and this could lead to a
deadlock.
So:
- allocate 2 private biosets for blk_queue_bounce, one for
splitting enormous bios and one for cloning bios.
- add code to split a bio that exceeds BIO_MAX_PAGES.
- Fix up the comments in blk_queue_split()
Credit-to: Ming Lei <tom.leiming@gmail.com> (suggested using single bio_for_each_segment loop)
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_queue_split() is always called with the last arg being q->bio_split,
where 'q' is the first arg.
Also blk_queue_split() sometimes uses the passed-in 'bs' and sometimes uses
q->bio_split.
This is inconsistent and unnecessary. Remove the last arg and always use
q->bio_split inside blk_queue_split()
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Credit-to: Javier González <jg@lightnvm.io> (Noticed that lightnvm was missed)
Reviewed-by: Javier González <javier@cnexlabs.com>
Tested-by: Javier González <javier@cnexlabs.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Copy and past the REQ_OP_WRITE_SAME code to prepare to implementations
that limit the write zeroes size.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Add a new merge strategy that merges discard bios into a request until the
maximum number of discard ranges (or the maximum discard size) is reached
from the plug merging code. I/O scheduler merging is not wired up yet
but might also be useful, although not for fast devices like NVMe which
are the only user for now.
Note that for now we don't support limiting the size of each discard range,
but if needed that can be added later.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Switch these constants to an enum, and make let the compiler ensure that
all callers of blk_try_merge and elv_merge handle all potential values.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
This makes it available outside of blk-merge.c, and inlining such a trivial
helper seems pretty useful to start with.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
If we end up doing a request-to-request merge when we have completed
a bio-to-request merge, we free the request from deep down in that
path. For blk-mq-sched, the merge path has to hold the appropriate
lock, but we don't need it for freeing the request. And in fact
holding the lock is problematic, since we are now calling the
mq sched put_rq_private() hook with the lock held. Other call paths
do not hold this lock.
Fix this inconsistency by ensuring that the caller frees a merged
request. Then we can do it outside of the lock, making it both more
efficient and fixing the blk-mq-sched problem of invoking parts of
the scheduler with an unknown lock state.
Reported-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
When we attempt to merge request-to-request, we return a 0/1 if we
ended up merging or not. Change that to return the pointer to the
request that we freed. We will use this to move the freeing of
that request out of the merge logic, so that callers can drop
locks before freeing the request.
There should be no functional changes in this patch.
Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
This adds a set of hooks that intercepts the blk-mq path of
allocating/inserting/issuing/completing requests, allowing
us to develop a scheduler within that framework.
We reuse the existing elevator scheduler API on the registration
side, but augment that with the scheduler flagging support for
the blk-mq interfce, and with a separate set of ops hooks for MQ
devices.
We split driver and scheduler tags, so we can run the scheduling
independently of device queue depth.
Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Prep patch for adding MQ ops as well, since doing anon unions with
named initializers doesn't work on older compilers.
Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Instead of allocating a single unused biovec for discard requests, send
them down without any payload. Instead we allow the driver to add a
"special" payload using a biovec embedded into struct request (unioned
over other fields never used while in the driver), and overloading
the number of segments for this case.
This has a couple of advantages:
- we don't have to allocate the bio_vec
- the amount of special casing for discard requests in the block
layer is significantly reduced
- using this same scheme for other request types is trivial,
which will be important for implementing the new WRITE_ZEROES
op on devices where it actually requires a payload (e.g. SCSI)
- we can get rid of playing games with the request length, as
we'll never touch it and completions will work just fine
- it will allow us to support ranged discard operations in the
future by merging non-contiguous discard bios into a single
request
- last but not least it removes a lot of code
This patch is the common base for my WIP series for ranges discards and to
remove discard_zeroes_data in favor of always using REQ_OP_WRITE_ZEROES,
so it would be good to get it in quickly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Factor out common code for setting REQ_NOMERGE flag which is being used
out at certain places and make it a helper instead, req_set_nomerge().
Signed-off-by: Ritesh Harjani <riteshh@codeaurora.org>
Get rid of the inline.
Signed-off-by: Jens Axboe <axboe@fb.com>
This adds a new block layer operation to zero out a range of
LBAs. This allows to implement zeroing for devices that don't use
either discard with a predictable zero pattern or WRITE SAME of zeroes.
The prominent example of that is NVMe with the Write Zeroes command,
but in the future, this should also help with improving the way
zeroing discards work. For this operation, suitable entry is exported in
sysfs which indicate the number of maximum bytes allowed in one
write zeroes operation by the device.
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@hgst.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
A lot of the REQ_* flags are only used on struct requests, and only of
use to the block layer and a few drivers that dig into struct request
internals.
This patch adds a new req_flags_t rq_flags field to struct request for
them, and thus dramatically shrinks the number of common requests. It
also removes the unfortunate situation where we have to fit the fields
from the same enum into 32 bits for struct bio and 64 bits for
struct request.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaun Tancheff <shaun.tancheff@seagate.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
After arbitrary bio size was introduced, the incoming bio may
be very big. We have to split the bio into small bios so that
each holds at most BIO_MAX_PAGES bvecs for safety reason, such
as bio_clone().
This patch fixes the following kernel crash:
> [ 172.660142] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
> [ 172.660229] IP: [<ffffffff811e53b4>] bio_trim+0xf/0x2a
> [ 172.660289] PGD 7faf3e067 PUD 7f9279067 PMD 0
> [ 172.660399] Oops: 0000 [#1] SMP
> [...]
> [ 172.664780] Call Trace:
> [ 172.664813] [<ffffffffa007f3be>] ? raid1_make_request+0x2e8/0xad7 [raid1]
> [ 172.664846] [<ffffffff811f07da>] ? blk_queue_split+0x377/0x3d4
> [ 172.664880] [<ffffffffa005fb5f>] ? md_make_request+0xf6/0x1e9 [md_mod]
> [ 172.664912] [<ffffffff811eb860>] ? generic_make_request+0xb5/0x155
> [ 172.664947] [<ffffffffa0445c89>] ? prio_io+0x85/0x95 [bcache]
> [ 172.664981] [<ffffffffa0448252>] ? register_cache_set+0x355/0x8d0 [bcache]
> [ 172.665016] [<ffffffffa04497d3>] ? register_bcache+0x1006/0x1174 [bcache]
The issue can be reproduced by the following steps:
- create one raid1 over two virtio-blk
- build bcache device over the above raid1 and another cache device
and bucket size is set as 2Mbytes
- set cache mode as writeback
- run random write over ext4 on the bcache device
Fixes: 54efd50(block: make generic_make_request handle arbitrarily sized bios)
Reported-by: Sebastian Roesner <sroesner-kernelorg@roesner-online.de>
Reported-by: Eric Wheeler <bcache@lists.ewheeler.net>
Cc: stable@vger.kernel.org (4.3+)
Cc: Shaohua Li <shli@fb.com>
Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Commit 288dab8a35 ("block: add a separate operation type for secure
erase") split REQ_OP_SECURE_ERASE from REQ_OP_DISCARD without considering
all the places REQ_OP_DISCARD was being used to mean either. Fix those.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Fixes: 288dab8a35 ("block: add a separate operation type for secure erase")
Signed-off-by: Jens Axboe <axboe@fb.com>
Since commit 63a4cc2486, bio->bi_rw contains flags in the lower
portion and the op code in the higher portions. This means that
old code that relies on manually setting bi_rw is most likely
going to be broken. Instead of letting that brokeness linger,
rename the member, to force old and out-of-tree code to break
at compile time instead of at runtime.
No intended functional changes in this commit.
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull block driver updates from Jens Axboe:
"This branch also contains core changes. I've come to the conclusion
that from 4.9 and forward, I'll be doing just a single branch. We
often have dependencies between core and drivers, and it's hard to
always split them up appropriately without pulling core into drivers
when that happens.
That said, this contains:
- separate secure erase type for the core block layer, from
Christoph.
- set of discard fixes, from Christoph.
- bio shrinking fixes from Christoph, as a followup up to the
op/flags change in the core branch.
- map and append request fixes from Christoph.
- NVMeF (NVMe over Fabrics) code from Christoph. This is pretty
exciting!
- nvme-loop fixes from Arnd.
- removal of ->driverfs_dev from Dan, after providing a
device_add_disk() helper.
- bcache fixes from Bhaktipriya and Yijing.
- cdrom subchannel read fix from Vchannaiah.
- set of lightnvm updates from Wenwei, Matias, Johannes, and Javier.
- set of drbd updates and fixes from Fabian, Lars, and Philipp.
- mg_disk error path fix from Bart.
- user notification for failed device add for loop, from Minfei.
- NVMe in general:
+ NVMe delay quirk from Guilherme.
+ SR-IOV support and command retry limits from Keith.
+ fix for memory-less NUMA node from Masayoshi.
+ use UINT_MAX for discard sectors, from Minfei.
+ cancel IO fixes from Ming.
+ don't allocate unused major, from Neil.
+ error code fixup from Dan.
+ use constants for PSDT/FUSE from James.
+ variable init fix from Jay.
+ fabrics fixes from Ming, Sagi, and Wei.
+ various fixes"
* 'for-4.8/drivers' of git://git.kernel.dk/linux-block: (115 commits)
nvme/pci: Provide SR-IOV support
nvme: initialize variable before logical OR'ing it
block: unexport various bio mapping helpers
scsi/osd: open code blk_make_request
target: stop using blk_make_request
block: simplify and export blk_rq_append_bio
block: ensure bios return from blk_get_request are properly initialized
virtio_blk: use blk_rq_map_kern
memstick: don't allow REQ_TYPE_BLOCK_PC requests
block: shrink bio size again
block: simplify and cleanup bvec pool handling
block: get rid of bio_rw and READA
block: don't ignore -EOPNOTSUPP blkdev_issue_write_same
block: introduce BLKDEV_DISCARD_ZERO to fix zeroout
NVMe: don't allocate unused nvme_major
nvme: avoid crashes when node 0 is memoryless node.
nvme: Limit command retries
loop: Make user notify for adding loop device failed
nvme-loop: fix nvme-loop Kconfig dependencies
nvmet: fix return value check in nvmet_subsys_alloc()
...
For a front merge, the maximum number of sectors of the
request must be checked against the front merge BIO sector,
not the current sector of the request.
Signed-off-by: Damien Le Moal <damien.lemoal@hgst.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Before merging a bio into an existing request, io scheduler is called to
get its approval first. However, the requests that come from a plug
flush may get merged by block layer without consulting with io
scheduler.
In case of CFQ, this can cause fairness problems. For instance, if a
request gets merged into a low weight cgroup's request, high weight cgroup
now will depend on low weight cgroup to get scheduled. If high weigt cgroup
needs that io request to complete before submitting more requests, then it
will also lose its timeslice.
Following script demonstrates the problem. Group g1 has a low weight, g2
and g3 have equal high weights but g2's requests are adjacent to g1's
requests so they are subject to merging. Due to these merges, g2 gets
poor disk time allocation.
cat > cfq-merge-repro.sh << "EOF"
#!/bin/bash
set -e
IO_ROOT=/mnt-cgroup/io
mkdir -p $IO_ROOT
if ! mount | grep -qw $IO_ROOT; then
mount -t cgroup none -oblkio $IO_ROOT
fi
cd $IO_ROOT
for i in g1 g2 g3; do
if [ -d $i ]; then
rmdir $i
fi
done
mkdir g1 && echo 10 > g1/blkio.weight
mkdir g2 && echo 495 > g2/blkio.weight
mkdir g3 && echo 495 > g3/blkio.weight
RUNTIME=10
(echo $BASHPID > g1/cgroup.procs &&
fio --readonly --name name1 --filename /dev/sdb \
--rw read --size 64k --bs 64k --time_based \
--runtime=$RUNTIME --offset=0k &> /dev/null)&
(echo $BASHPID > g2/cgroup.procs &&
fio --readonly --name name1 --filename /dev/sdb \
--rw read --size 64k --bs 64k --time_based \
--runtime=$RUNTIME --offset=64k &> /dev/null)&
(echo $BASHPID > g3/cgroup.procs &&
fio --readonly --name name1 --filename /dev/sdb \
--rw read --size 64k --bs 64k --time_based \
--runtime=$RUNTIME --offset=256k &> /dev/null)&
sleep $((RUNTIME+1))
for i in g1 g2 g3; do
echo ---- $i ----
cat $i/blkio.time
done
EOF
# ./cfq-merge-repro.sh
---- g1 ----
8:16 162
---- g2 ----
8:16 165
---- g3 ----
8:16 686
After applying the patch:
# ./cfq-merge-repro.sh
---- g1 ----
8:16 90
---- g2 ----
8:16 445
---- g3 ----
8:16 471
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Instead of overloading the discard support with the REQ_SECURE flag.
Use the opportunity to rename the queue flag as well, and remove the
dead checks for this flag in the RAID 1 and RAID 10 drivers that don't
claim support for secure erase.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch converts the block layer merging code to use separate variables
for the operation and flags, and to check req_op for the REQ_OP.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch converts the simple bi_rw use cases in the block,
drivers, mm and fs code to set/get the bio operation using
bio_set_op_attrs/bio_op
These should be simple one or two liner cases, so I just did them
in one patch. The next patches handle the more complicated
cases in a module per patch.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We currently set REQ_WRITE/WRITE for all non READ IOs
like discard, flush, writesame, etc. In the next patches where we
no longer set up the op as a bitmap, we will not be able to
detect a operation direction like writesame by testing if REQ_WRITE is
set.
This patch converts the drivers and cgroup to use the
op_is_write helper. This should just cover the simple
cases. I did dm, md and bcache in their own patches
because they were more involved.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch applies the two introduced helpers to
figure out the 1st and last bvec.
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
After commit e36f62042880(block: split bios to maxpossible length),
bio can be splitted in the middle of a vector entry, then it
is easy to split out one bio which size isn't aligned with block
size, especially when the block size is bigger than 512.
This patch fixes the issue by making the max io size aligned
to logical block size.
Fixes: e36f62042880(block: split bios to maxpossible length)
Reported-by: Stefan Haberland <sth@linux.vnet.ibm.com>
Cc: Keith Busch <keith.busch@intel.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull core block updates from Jens Axboe:
"We don't have a lot of core changes this time around, it's mostly in
drivers, which will come in a subsequent pull.
The cores changes include:
- blk-mq
- Prep patch from Christoph, changing blk_mq_alloc_request() to
take flags instead of just using gfp_t for sleep/nosleep.
- Doc patch from me, clarifying the difference between legacy
and blk-mq for timer usage.
- Fixes from Raghavendra for memory-less numa nodes, and a reuse
of CPU masks.
- Cleanup from Geliang Tang, using offset_in_page() instead of open
coding it.
- From Ilya, rename request_queue slab to it reflects what it holds,
and a fix for proper use of bdgrab/put.
- A real fix for the split across stripe boundaries from Keith. We
yanked a broken version of this from 4.4-rc final, this one works.
- From Mike Krinkin, emit a trace message when we split.
- From Wei Tang, two small cleanups, not explicitly clearing memory
that is already cleared"
* 'for-4.5/core' of git://git.kernel.dk/linux-block:
block: use bd{grab,put}() instead of open-coding
block: split bios to max possible length
block: add call to split trace point
blk-mq: Avoid memoryless numa node encoded in hctx numa_node
blk-mq: Reuse hardware context cpumask for tags
blk-mq: add a flags parameter to blk_mq_alloc_request
Revert "blk-flush: Queue through IO scheduler when flush not required"
block: clarify blk_add_timer() use case for blk-mq
bio: use offset_in_page macro
block: do not initialise statics to 0 or NULL
block: do not initialise globals to 0 or NULL
block: rename request_queue slab cache
This splits bio in the middle of a vector to form the largest possible
bio at the h/w's desired alignment, and guarantees the bio being split
will have some data.
The criteria for splitting is changed from the max sectors to the h/w's
optimal sector alignment if it is provided. For h/w that advertise their
block storage's underlying chunk size, it's a big performance win to not
submit commands that cross them. If sector alignment is not provided,
this patch uses the max sectors as before.
This addresses the performance issue commit d380561113 attempted to
fix, but was reverted due to splitting logic error.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Ming Lei <tom.leiming@gmail.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: <stable@vger.kernel.org> # 4.4.x-
Signed-off-by: Jens Axboe <axboe@fb.com>
This reverts commit d380561113.
If we end up splitting on the first segment, we don't adjust
the sector count. That results in hitting a BUG() with attempting
to split 0 sectors.
As this is just a performance issue and not a regression since
4.3 release, let's just rever this change. That gives us more
time to test a real fix for 4.5, which would be marked for
stable anyway.
For h/w that advertise their block storage's underlying chunk size, it's
a big performance win to not submit commands that cross them. This patch
uses that criteria if it is provided. If it is not provided, this patch
uses the max sectors as before.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
There is a split tracepoint that is supposed to be called when
bio is splitted, and it was called in bio_split function until
commit 4b1faf9316 ("block: Kill bio_pair_split()").
But now, no one reports splits, so this patch adds call to
trace_block_split back in blk_queue_split right after split.
Signed-off-by: Mike Krinkin <krinkin.m.u@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
When bio has only one physical segment, we should set bio's
bi_seg_front_size as the real(final) size of the single segment.
Fixes: 02e707424c2ea(blk-merge: fix blk_bio_segment_split)
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Tested-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We had seen lots of reports of this kind issue, so add one
warnning in blk-merge, then it can be triggered easily and
avoid to depend on warning/bug from drivers.
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Commit bdced438acd83a(block: setup bi_phys_segments after
splitting) introduces function of computing bio->bi_phys_segments
during bio splitting.
Unfortunately both bio->bi_seg_front_size and bio->bi_seg_back_size
arn't computed, so too many physical segments may be obtained
for one request since both the two are used to check if one segment
across two bios can be possible.
This patch fixes the issue by computing the two variables in
blk_bio_segment_split().
Fixes: bdced438acd83a(block: setup bi_phys_segments after splitting)
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Reported-by: Mark Salter <msalter@redhat.com>
Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Tested-by: Mark Salter <msalter@redhat.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Inside blk_bio_segment_split(), previous bvec pointer(bvprvp)
always points to the iterator local variable, which is obviously
wrong, so fix it by pointing to the local variable of 'bvprv'.
Fixes: 5014c311baa2b(block: fix bogus compiler warnings in blk-merge.c)
Cc: stable@kernel.org #4.3
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Reported-by: Mark Salter <msalter@redhat.com>
Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Tested-by: Mark Salter <msalter@redhat.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The splitted bio has been already too fat to merge, so mark it
as NOMERGE.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The number of bio->bi_phys_segments is always obtained
during bio splitting, so it is natural to setup it
just after bio splitting, then we can avoid to compute
nr_segment again during merge.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
biovecs has become immutable since v3.13, so it isn't necessary
to allocate biovecs for the new cloned bios, then we can save
one extra biovecs allocation/copy, and the allocation is often
not fixed-length and a bit more expensive.
For example, if the 'max_sectors_kb' of null blk's queue is set
as 16(32 sectors) via sysfs just for making more splits, this patch
can increase throught about ~70% in the sequential read test over
null_blk(direct io, bs: 1M).
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Ming Lin <ming.l@ssi.samsung.com>
Cc: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
This fixes a performance regression introduced by commit 54efd50bfd,
and allows us to take full advantage of the fact that we have immutable
bio_vecs. Hand applied, as it rejected violently with commit
5014c311ba.
Signed-off-by: Jens Axboe <axboe@fb.com>
If a driver sets the block queue virtual boundary mask, it means that
it cannot handle gaps so we must not allow those in the integrity
payload as well.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Fixed up by me to have duplicate integrity merge functions, depending
on whether block integrity is enabled or not. Fixes a compilations
issue with CONFIG_BLK_DEV_INTEGRITY unset.
Signed-off-by: Jens Axboe <axboe@fb.com>
We are checking for gaps to previous bio_vec, which can
only detect back merges gaps. Moreover, at the point where
we check for a gap, we don't know if we will attempt a back
or a front merge. Thus, check for gap to prev in a back merge
attempt and check for a gap to next in a front merge attempt.
Signed-off-by: Jens Axboe <axboe@fb.com>
[sagig: Minor rename change]
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
The compiler can't figure out that bvprv is initialized whenever 'prev'
is set to 1 as well. Use a pointer to bvprv instead, setting it to NULL
initially, and get rid of the 'prev' tracking. This dumbs it down
enough that gcc is happy.
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull SG updates from Jens Axboe:
"This contains a set of scatter-gather related changes/fixes for 4.3:
- Add support for limited chaining of sg tables even for
architectures that do not set ARCH_HAS_SG_CHAIN. From Christoph.
- Add sg chain support to target_rd. From Christoph.
- Fixup open coded sg->page_link in crypto/omap-sham. From
Christoph.
- Fixup open coded crypto ->page_link manipulation. From Dan.
- Also from Dan, automated fixup of manual sg_unmark_end()
manipulations.
- Also from Dan, automated fixup of open coded sg_phys()
implementations.
- From Robert Jarzmik, addition of an sg table splitting helper that
drivers can use"
* 'for-4.3/sg' of git://git.kernel.dk/linux-block:
lib: scatterlist: add sg splitting function
scatterlist: use sg_phys()
crypto/omap-sham: remove an open coded access to ->page_link
scatterlist: remove open coded sg_unmark_end instances
crypto: replace scatterwalk_sg_chain with sg_chain
target/rd: always chain S/G list
scatterlist: allow limited chaining without ARCH_HAS_SG_CHAIN
The SG_GAPS queue flag caused checks for bio vector alignment against
PAGE_SIZE, but the device may have different constraints. This patch
adds a queue limits so a driver with such constraints can set to allow
requests that would have been unnecessarily split. The new gaps check
takes the request_queue as a parameter to simplify the logic around
invoking this function.
This new limit makes the queue flag redundant, so removing it and
all usage. Device-mappers will inherit the correct settings through
blk_stack_limits().
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>