Commit Graph

119 Commits

Author SHA1 Message Date
Linus Torvalds
e75437fb93 Merge branch 'for-3.18/drivers' of git://git.kernel.dk/linux-block
Pull block layer driver update from Jens Axboe:
 "This is the block driver pull request for 3.18.  Not a lot in there
  this round, and nothing earth shattering.

   - A round of drbd fixes from the linbit team, and an improvement in
     asender performance.

   - Removal of deprecated (and unused) IRQF_DISABLED flag in rsxx and
     hd from Michael Opdenacker.

   - Disable entropy collection from flash devices by default, from Mike
     Snitzer.

   - A small collection of xen blkfront/back fixes from Roger Pau Monné
     and Vitaly Kuznetsov"

* 'for-3.18/drivers' of git://git.kernel.dk/linux-block:
  block: disable entropy contributions for nonrot devices
  xen, blkfront: factor out flush-related checks from do_blkif_request()
  xen-blkback: fix leak on grant map error path
  xen/blkback: unmap all persistent grants when frontend gets disconnected
  rsxx: Remove deprecated IRQF_DISABLED
  block: hd: remove deprecated IRQF_DISABLED
  drbd: use RB_DECLARE_CALLBACKS() to define augment callbacks
  drbd: compute the end before rb_insert_augmented()
  drbd: Add missing newline in resync progress display in /proc/drbd
  drbd: reduce lock contention in drbd_worker
  drbd: Improve asender performance
  drbd: Get rid of the WORK_PENDING macro
  drbd: Get rid of the __no_warn and __cond_lock macros
  drbd: Avoid inconsistent locking warning
  drbd: Remove superfluous newline from "resync_extents" debugfs entry.
  drbd: Use consistent names for all the bi_end_io callbacks
  drbd: Use better variable names
2014-10-18 12:12:45 -07:00
David Vrabel
95afae4814 xen: remove DEFINE_XENBUS_DRIVER() macro
The DEFINE_XENBUS_DRIVER() macro looks a bit weird and causes sparse
errors.

Replace the uses with standard structure definitions instead.  This is
similar to pci and usb device registration.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-10-06 10:27:57 +01:00
Roger Pau Monné
61cecca865 xen-blkback: fix leak on grant map error path
Fix leaking a page when a grant mapping has failed.

CC: stable@vger.kernel.org
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reported-and-Tested-by: Tao Chen <boby.chen@huawei.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-10-01 16:32:31 -04:00
Vitaly Kuznetsov
12ea729645 xen/blkback: unmap all persistent grants when frontend gets disconnected
blkback does not unmap persistent grants when frontend goes to Closed
state (e.g. when blkfront module is being removed). This leads to the
following in guest's dmesg:

[  343.243825] xen:grant_table: WARNING: g.e. 0x445 still in use!
[  343.243825] xen:grant_table: WARNING: g.e. 0x42a still in use!
...

When load module -> use device -> unload module sequence is performed multiple times
it is possible to hit BUG() condition in blkfront module:

[  343.243825] kernel BUG at drivers/block/xen-blkfront.c:954!
[  343.243825] invalid opcode: 0000 [#1] SMP
[  343.243825] Modules linked in: xen_blkfront(-) ata_generic pata_acpi [last unloaded: xen_blkfront]
...
[  343.243825] Call Trace:
[  343.243825]  [<ffffffff814111ef>] ? unregister_xenbus_watch+0x16f/0x1e0
[  343.243825]  [<ffffffffa0016fbf>] blkfront_remove+0x3f/0x140 [xen_blkfront]
...
[  343.243825] RIP  [<ffffffffa0016aae>] blkif_free+0x34e/0x360 [xen_blkfront]
[  343.243825]  RSP <ffff88001eb8fdc0>

We don't need to keep these grants if we're disconnecting as frontend might already
forgot about them. Solve the issue by moving xen_blkbk_free_caches() call from
xen_blkif_free() to xen_blkif_disconnect().

Now we can see the following:
[  928.590893] xen:grant_table: WARNING: g.e. 0x587 still in use!
[  928.591861] xen:grant_table: WARNING: g.e. 0x372 still in use!
...
[  929.592146] xen:grant_table: freeing g.e. 0x587
[  929.597174] xen:grant_table: freeing g.e. 0x372
...

Backend does not keep persistent grants any more, reconnect works fine.

CC: stable@vger.kernel.org
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-10-01 16:32:23 -04:00
Valentin Priescu
814d04e7df xen-blkback: defer freeing blkif to avoid blocking xenwatch
Currently xenwatch blocks in VBD disconnect, waiting for all pending I/O
requests to finish. If the VBD is attached to a hot-swappable disk, then
xenwatch can hang for a long period of time, stalling other watches.

 INFO: task xenwatch:39 blocked for more than 120 seconds.
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 ffff880057f01bd0 0000000000000246 ffff880057f01ac0 ffffffff810b0782
 ffff880057f01ad0 00000000000131c0 0000000000000004 ffff880057edb040
 ffff8800344c6080 0000000000000000 ffff880058c00ba0 ffff880057edb040
 Call Trace:
 [<ffffffff810b0782>] ? irq_to_desc+0x12/0x20
 [<ffffffff8128f761>] ? list_del+0x11/0x40
 [<ffffffff8147a080>] ? wait_for_common+0x60/0x160
 [<ffffffff8147bcef>] ? _raw_spin_lock_irqsave+0x2f/0x50
 [<ffffffff8147bd49>] ? _raw_spin_unlock_irqrestore+0x19/0x20
 [<ffffffff8147a26a>] schedule+0x3a/0x60
 [<ffffffffa018fe6a>] xen_blkif_disconnect+0x8a/0x100 [xen_blkback]
 [<ffffffff81079f70>] ? wake_up_bit+0x40/0x40
 [<ffffffffa018ffce>] xen_blkbk_remove+0xae/0x1e0 [xen_blkback]
 [<ffffffff8130b254>] xenbus_dev_remove+0x44/0x90
 [<ffffffff81345cb7>] __device_release_driver+0x77/0xd0
 [<ffffffff81346488>] device_release_driver+0x28/0x40
 [<ffffffff813456e8>] bus_remove_device+0x78/0xe0
 [<ffffffff81342c9f>] device_del+0x12f/0x1a0
 [<ffffffff81342d2d>] device_unregister+0x1d/0x60
 [<ffffffffa0190826>] frontend_changed+0xa6/0x4d0 [xen_blkback]
 [<ffffffffa019c252>] ? frontend_changed+0x192/0x650 [xen_netback]
 [<ffffffff8130ae50>] ? cmp_dev+0x60/0x60
 [<ffffffff81344fe4>] ? bus_for_each_dev+0x94/0xa0
 [<ffffffff8130b06e>] xenbus_otherend_changed+0xbe/0x120
 [<ffffffff8130b4cb>] frontend_changed+0xb/0x10
 [<ffffffff81309c82>] xenwatch_thread+0xf2/0x130
 [<ffffffff81079f70>] ? wake_up_bit+0x40/0x40
 [<ffffffff81309b90>] ? xenbus_directory+0x80/0x80
 [<ffffffff810799d6>] kthread+0x96/0xa0
 [<ffffffff81485934>] kernel_thread_helper+0x4/0x10
 [<ffffffff814839f3>] ? int_ret_from_sys_call+0x7/0x1b
 [<ffffffff8147c17c>] ? retint_restore_args+0x5/0x6
 [<ffffffff81485930>] ? gs_change+0x13/0x13

With this patch, when there is still pending I/O, the actual disconnect
is done by the last reference holder (last pending I/O request). In this
case, xenwatch doesn't block indefinitely.

Signed-off-by: Valentin Priescu <priescuv@amazon.com>
Reviewed-by: Steven Kady <stevkady@amazon.com>
Reviewed-by: Steven Noonan <snoonan@amazon.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-05-28 14:17:32 -04:00
Olaf Hering
c926b701fe xen/blkback: disable discard feature if requested by toolstack
Newer toolstacks may provide a boolean property "discard-enable" in the
backend node. Its purpose is to disable discard for file backed storage
to avoid fragmentation. Recognize this setting also for physical
storage.  If that property exists and is false, do not advertise
"feature-discard" to the frontend.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-05-28 14:17:28 -04:00
Roger Pau Monne
abb97b8c50 xen-blkback: init persistent_purge_work work_struct
Initialize persistent_purge_work work_struct on xen_blkif_alloc (and
remove the previous initialization done in purge_persistent_gnt). This
prevents flush_work from complaining even if purge_persistent_gnt has
not been used.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Tested-by: Sander Eikelenboom <linux@eikelenboom.it>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-02-11 20:34:03 -07:00
Jens Axboe
9d4cb8e3a5 Merge branch 'stable/for-jens-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip into for-linus
Konrad writes:

Please git pull the following branch:

 git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip.git stable/for-jens-3.14

which is based off v3.13-rc6. If you would like me to rebase it on
a different branch/tag I would be more than happy to do so.

The patches are all bug-fixes and hopefully can go in 3.14.

They deal with xen-blkback shutdown and cause memory leaks
as well as shutdown races. They should go to stable tree and if you
are OK with I will ask them to backport those fixes.

There is also a fix to xen-blkfront to deal with unexpected state
transition. And lastly a fix to the header where it was using the
__aligned__ unnecessarily.
2014-02-10 12:52:34 -07:00
Roger Pau Monne
80bfa2f6e2 xen-blkif: drop struct blkif_request_segment_aligned
This was wrongly introduced in commit 402b27f9, the only difference
between blkif_request_segment_aligned and blkif_request_segment is
that the former has a named padding, while both share the same
memory layout.

Also correct a few minor glitches in the description, including for it
to no longer assume PAGE_SIZE == 4096.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
[Description fix by Jan Beulich]
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reported-by: Jan Beulich <jbeulich@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Tested-by: Matt Rushton <mrushton@amazon.com>
Cc: Matt Wilson <msw@amazon.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-02-07 13:03:53 -05:00
Roger Pau Monne
c05f3e3c85 xen-blkback: fix shutdown race
Introduce a new variable to keep track of the number of in-flight
requests. We need to make sure that when xen_blkif_put is called the
request has already been freed and we can safely free xen_blkif, which
was not the case before.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Tested-by: Matt Rushton <mrushton@amazon.com>
Reviewed-by: Matt Rushton <mrushton@amazon.com>
Cc: Matt Wilson <msw@amazon.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-02-07 12:59:30 -05:00
Roger Pau Monne
ef75341133 xen-blkback: fix memory leaks
I've at least identified two possible memory leaks in blkback, both
related to the shutdown path of a VBD:

- blkback doesn't wait for any pending purge work to finish before
  cleaning the list of free_pages. The purge work will call
  put_free_pages and thus we might end up with pages being added to
  the free_pages list after we have emptied it. Fix this by making
  sure there's no pending purge work before exiting
  xen_blkif_schedule, and moving the free_page cleanup code to
  xen_blkif_free.
- blkback doesn't wait for pending requests to end before cleaning
  persistent grants and the list of free_pages. Again this can add
  pages to the free_pages list or persistent grants to the
  persistent_gnts red-black tree. Fixed by moving the persistent
  grants and free_pages cleanup code to xen_blkif_free.

Also, add some checks in xen_blkif_free to make sure we are cleaning
everything.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Tested-by: Matt Rushton <mrushton@amazon.com>
Reviewed-by: Matt Rushton <mrushton@amazon.com>
Cc: Matt Wilson <msw@amazon.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-02-07 12:58:46 -05:00
Matt Rushton
2ed22e3c3b xen-blkback: fix memory leak when persistent grants are used
Currently shrink_free_pagepool() is called before the pages used for
persistent grants are released via free_persistent_gnts(). This
results in a memory leak when a VBD that uses persistent grants is
torn down.

Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: "Roger Pau Monné" <roger.pau@citrix.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Cc: linux-kernel@vger.kernel.org
Cc: xen-devel@lists.xen.org
Cc: Anthony Liguori <aliguori@amazon.com>
Signed-off-by: Matt Rushton <mrushton@amazon.com>
Signed-off-by: Matt Wilson <msw@amazon.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-02-07 12:58:18 -05:00
Kent Overstreet
4f024f3797 block: Abstract out bvec iterator
Immutable biovecs are going to require an explicit iterator. To
implement immutable bvecs, a later patch is going to add a bi_bvec_done
member to this struct; for now, this patch effectively just renames
things.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@inktank.com>
Cc: ceph-devel@vger.kernel.org
Cc: Joshua Morris <josh.h.morris@us.ibm.com>
Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux390@de.ibm.com
Cc: Boaz Harrosh <bharrosh@panasas.com>
Cc: Benny Halevy <bhalevy@tonian.com>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Ben Myers <bpm@sgi.com>
Cc: xfs@oss.sgi.com
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Cc: "Roger Pau Monné" <roger.pau@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchand@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Peng Tao <tao.peng@emc.com>
Cc: Andy Adamson <andros@netapp.com>
Cc: fanchaoting <fanchaoting@cn.fujitsu.com>
Cc: Jie Liu <jeff.liu@oracle.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Pankaj Kumar <pankaj.km@samsung.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Mel Gorman <mgorman@suse.de>6
2013-11-23 22:33:47 -08:00
Vegard Nossum
ea5ec76d76 xen/blkback: fix reference counting
If the permission check fails, we drop a reference to the blkif without
having taken it in the first place. The bug was introduced in commit
604c499cbb (xen/blkback: Check device
permissions before allowing OP_DISCARD).

Cc: stable@vger.kernel.org
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:10:27 -07:00
Jingoo Han
bb8e0e84b3 block: replace strict_strtoul() with kstrtoul()
The use of strict_strtoul() is not preferred, because strict_strtoul() is
obsolete.  Thus, kstrtoul() should be used.

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:56:56 -07:00
Linus Torvalds
d4c90b1b9f Merge branch 'for-3.11/drivers' of git://git.kernel.dk/linux-block
Pull block IO driver bits from Jens Axboe:
 "As I mentioned in the core block pull request, due to real life
  circumstances the driver pull request would be late.  Now it looks
  like -rc2 late...  On the plus side, apart form the rsxx update, these
  are all things that I could argue could go in later in the cycle as
  they are fixes and not features.  So even though things are late, it's
  not ALL bad.

  The pull request contains:

   - Updates to bcache, all bug fixes, from Kent.

   - A pile of drbd bug fixes (no big features this time!).

   - xen blk front/back fixes.

   - rsxx driver updates, some of them deferred form 3.10.  So should be
     well cooked by now"

* 'for-3.11/drivers' of git://git.kernel.dk/linux-block: (63 commits)
  bcache: Allocation kthread fixes
  bcache: Fix GC_SECTORS_USED() calculation
  bcache: Journal replay fix
  bcache: Shutdown fix
  bcache: Fix a sysfs splat on shutdown
  bcache: Advertise that flushes are supported
  bcache: check for allocation failures
  bcache: Fix a dumb race
  bcache: Use standard utility code
  bcache: Update email address
  bcache: Delete fuzz tester
  bcache: Document shrinker reserve better
  bcache: FUA fixes
  drbd: Allow online change of al-stripes and al-stripe-size
  drbd: Constants should be UPPERCASE
  drbd: Ignore the exit code of a fence-peer handler if it returns too late
  drbd: Fix rcu_read_lock balance on error path
  drbd: fix error return code in drbd_init()
  drbd: Do not sleep inside rcu
  bcache: Refresh usage docs
  ...
2013-07-22 19:02:52 -07:00
Kees Cook
f170168b9a drivers: avoid parsing names as kthread_run() format strings
Calling kthread_run with a single name parameter causes it to be handled
as a format string. Many callers are passing potentially dynamic string
content, so use "%s" in those cases to avoid any potential accidents.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:41 -07:00
Roger Pau Monne
1e0f7a21b2 xen-blkback: check the number of iovecs before allocating a bios
With the introduction of indirect segments we can receive requests
with a number of segments bigger than the maximum number of allowed
iovecs in a bios, so make sure that blkback doesn't try to allocate a
bios with more iovecs than BIO_MAX_PAGES

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-06-25 10:00:58 -04:00
Roger Pau Monne
2d9105433f xen-blkback: workaround compiler bug in gcc 4.1
The code generat with gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-54)
creates an unbound loop for the second foreach_grant_safe loop in
purge_persistent_gnt.

The workaround is to avoid having this second loop and instead
perform all the work inside the first loop by adding a new variable,
clean_used, that will be set when all the desired persistent grants
have been removed and we need to iterate over the remaining ones to
remove the WAS_ACTIVE flag.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reported-by: Tom O'Neill <toneill@vmem.com>
Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-06-21 15:58:44 -04:00
Konrad Rzeszutek Wilk
8e3f875554 xen/blkback: Check for insane amounts of request on the ring (v6).
Check that the ring does not have an insane amount of requests
(more than there could fit on the ring).

If we detect this case we will stop processing the requests
and wait until the XenBus disconnects the ring.

The existing check RING_REQUEST_CONS_OVERFLOW which checks for how
many responses we have created in the past (rsp_prod_pvt) vs
requests consumed (req_cons) and whether said difference is greater or
equal to the size of the ring, does not catch this case.

Wha the condition does check if there is a need to process more
as we still have a backlog of responses to finish. Note that both
of those values (rsp_prod_pvt and req_cons) are not exposed on the
shared ring.

To understand this problem a mini crash course in ring protocol
response/request updates is in place.

There are four entries: req_prod and rsp_prod; req_event and rsp_event
to track the ring entries. We are only concerned about the first two -
which set the tone of this bug.

The req_prod is a value incremented by frontend for each request put
on the ring. Conversely the rsp_prod is a value incremented by the backend
for each response put on the ring (rsp_prod gets set by rsp_prod_pvt when
pushing the responses on the ring).  Both values can
wrap and are modulo the size of the ring (in block case that is 32).
Please see RING_GET_REQUEST and RING_GET_RESPONSE for the more details.

The culprit here is that if the difference between the
req_prod and req_cons is greater than the ring size we have a problem.
Fortunately for us, the '__do_block_io_op' loop:

	rc = blk_rings->common.req_cons;
	rp = blk_rings->common.sring->req_prod;

	while (rc != rp) {

		..
		blk_rings->common.req_cons = ++rc; /* before make_response() */

	}

will loop up to the point when rc == rp. The macros inside of the
loop (RING_GET_REQUEST) is smart and is indexing based on the modulo
of the ring size. If the frontend has provided a bogus req_prod value
we will loop until the 'rc == rp' - which means we could be processing
already processed requests (or responses) often.

The reason the RING_REQUEST_CONS_OVERFLOW is not helping here is
b/c it only tracks how many responses we have internally produced
and whether we would should process more. The astute reader will
notice that the macro RING_REQUEST_CONS_OVERFLOW provides two
arguments - more on this later.

For example, if we were to enter this function with these values:

       	blk_rings->common.sring->req_prod =  X+31415 (X is the value from
		the last time __do_block_io_op was called).
        blk_rings->common.req_cons = X
        blk_rings->common.rsp_prod_pvt = X

The RING_REQUEST_CONS_OVERFLOW(&blk_rings->common, blk_rings->common.req_cons)
is doing:

	req_cons - rsp_prod_pvt >= 32

Which is,
	X - X >= 32 or 0 >= 32

And that is false, so we continue on looping (this bug).

If we re-use said macro RING_REQUEST_CONS_OVERFLOW and pass in the rp
instead (sring->req_prod) of rc, the this macro can do the check:

     req_prod - rsp_prov_pvt >= 32

Which is,
       X + 31415 - X >= 32 , or 31415 >= 32

which is true, so we can error out and break out of the function.

Unfortunatly the difference between rsp_prov_pvt and req_prod can be
at 32 (which would error out in the macro). This condition exists when
the backend is lagging behind with the responses and still has not finished
responding to all of them (so make_response has not been called), and
the rsp_prov_pvt + 32 == req_cons. This ends up with us not being able
to use said macro.

Hence introducing a new macro called RING_REQUEST_PROD_OVERFLOW which does
a simple check of:

    req_prod - rsp_prod_pvt > RING_SIZE

And with the X values from above:

   X + 31415 - X > 32

Returns true. Also not that if the ring is full (which is where
the RING_REQUEST_CONS_OVERFLOW triggered), we would not hit the
same condition:

   X + 32 - X > 32

Which is false.

Lets use that macro.
Note that in v5 of this patchset the macro was different - we used an
earlier version.

Cc: stable@vger.kernel.org
[v1: Move the check outside the loop]
[v2: Add a pr_warn as suggested by David]
[v3: Use RING_REQUEST_CONS_OVERFLOW as suggested by Jan]
[v4: Move wake_up after kthread_stop as suggested by Jan]
[v5: Use RING_REQUEST_PROD_OVERFLOW instead]
[v6: Use RING_REQUEST_PROD_OVERFLOW - Jan's version]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

gadsa
2013-06-17 15:17:16 -04:00
Konrad Rzeszutek Wilk
604c499cbb xen/blkback: Check device permissions before allowing OP_DISCARD
We need to make sure that the device is not RO or that
the request is not past the number of sectors we want to
issue the DISCARD operation for.

This fixes CVE-2013-2140.

Cc: stable@vger.kernel.org
Acked-by: Jan Beulich <JBeulich@suse.com>
Acked-by: Ian Campbell <Ian.Campbell@citrix.com>
[v1: Made it pr_warn instead of pr_debug]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-06-07 17:05:55 -04:00
Stefan Bader
7c4d7d710f xen/blkback: Use physical sector size for setup
Currently xen-blkback passes the logical sector size over xenbus and
xen-blkfront sets up the paravirt disk with that logical block size.
But newer drives usually have the logical sector size set to 512 for
compatibility reasons and would show the actual sector size only in
physical sector size.
This results in the device being partitioned and accessed in dom0 with
the correct sector size, but the guest thinks 512 bytes is the correct
block size. And that results in poor performance.

To fix this, blkback gets modified to pass also physical-sector-size
over xenbus and blkfront to use both values to set up the paravirt
disk. I did not just change the passed in sector-size because I am
not sure having a bigger logical sector size than the physical one
is valid (and that would happen if a newer dom0 kernel hits an older
domU kernel). Also this way a domU set up before should still be
accessible (just some tools might detect the unaligned setup).

[v2: Make xenbus write failure non-fatal]
[v3: Use xenbus_scanf instead of xenbus_gather]
[v4: Rebased against segment changes]

Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-06-07 17:05:53 -04:00
Roger Pau Monne
bb642e8315 xen-blkback: allocate list of pending reqs in small chunks
Allocate pending requests in smaller chunks instead of allocating them
all at the same time.

This change also removes the global array of pending_reqs, it is no
longer necessay.

Variables related to the grant mapping have been grouped into a struct
called "grant_page", this allows to allocate them in smaller chunks,
and also improves memory locality.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Tested-by: Sander Eikelenboom <linux@eikelenboom.it>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-05-07 09:42:17 -04:00
Roger Pau Monne
402b27f9f2 xen-block: implement indirect descriptors
Indirect descriptors introduce a new block operation
(BLKIF_OP_INDIRECT) that passes grant references instead of segments
in the request. This grant references are filled with arrays of
blkif_request_segment_aligned, this way we can send more segments in a
request.

The proposed implementation sets the maximum number of indirect grefs
(frames filled with blkif_request_segment_aligned) to 256 in the
backend and 32 in the frontend. The value in the frontend has been
chosen experimentally, and the backend value has been set to a sane
value that allows expanding the maximum number of indirect descriptors
in the frontend if needed.

The migration code has changed from the previous implementation, in
which we simply remapped the segments on the shared ring. Now the
maximum number of segments allowed in a request can change depending
on the backend, so we have to requeue all the requests in the ring and
in the queue and split the bios in them if they are bigger than the
new maximum number of segments.

[v2: Fixed minor comments by Konrad.
[v1: Added padding to make the indirect request 64bit aligned.
 Added some BUGs, comments; fixed number of indirect pages in
 blkif_get_x86_{32/64}_req. Added description about the indirect operation
 in blkif.h]
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
[v3: Fixed spaces and tabs mix ups]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-04-18 14:16:00 -04:00
Roger Pau Monne
31552ee32d xen-blkback: expand map/unmap functions
Preparatory change for implementing indirect descriptors. Change
xen_blkbk_{map/unmap} in order to be able to map/unmap a random amount
of grants (previously it was limited to
BLKIF_MAX_SEGMENTS_PER_REQUEST). Also, remove the usage of pending_req
in the map/unmap functions, so we can map/unmap grants without needing
to pass a pending_req.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: xen-devel@lists.xen.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-04-18 09:29:25 -04:00
Roger Pau Monne
bf0720c48c xen-blkback: make the queue of free requests per backend
Remove the last dependency from blkbk by moving the list of free
requests to blkif. This change reduces the contention on the list of
available requests.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: xen-devel@lists.xen.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-04-18 09:29:25 -04:00
Roger Pau Monne
bb6acb289f xen-blkback: move pending handles list from blkbk to pending_req
Moving grant ref handles from blkbk to pending_req will allow us to
get rid of the shared blkbk structure.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: xen-devel@lists.xen.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-04-18 09:29:24 -04:00
Roger Pau Monne
3f3aad5e66 xen-blkback: implement LRU mechanism for persistent grants
This mechanism allows blkback to change the number of grants
persistently mapped at run time.

The algorithm uses a simple LRU mechanism that removes (if needed) the
persistent grants that have not been used since the last LRU run, or
if all grants have been used it removes the first grants in the list
(that are not in use).

The algorithm allows the user to change the maximum number of
persistent grants, by changing max_persistent_grants in sysfs.

Since we are storing the persistent grants used inside the request
struct (to be able to mark them as "unused" when unmapping), we no
longer need the bitmap (unmap_seg).

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: xen-devel@lists.xen.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-04-18 09:29:23 -04:00
Roger Pau Monne
c6cc142dac xen-blkback: use balloon pages for all mappings
Using balloon pages for all granted pages allows us to simplify the
logic in blkback, especially in the xen_blkbk_map function, since now
we can decide if we want to map a grant persistently or not after we
have actually mapped it. This could not be done before because
persistent grants used ballooned pages, whereas non-persistent grants
used pages from the kernel.

This patch also introduces several changes, the first one is that the
list of free pages is no longer global, now each blkback instance has
it's own list of free pages that can be used to map grants. Also, a
run time parameter (max_buffer_pages) has been added in order to tune
the maximum number of free pages each blkback instance will keep in
it's buffer.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: xen-devel@lists.xen.org
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-04-18 09:29:22 -04:00
Roger Pau Monne
c1a15d08f4 xen-blkback: print stats about persistent grants
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: xen-devel@lists.xen.org
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-04-18 09:29:21 -04:00
Roger Pau Monne
ffb1dabd1e xen-blkback: don't store dev_bus_addr
dev_bus_addr returned in the grant ref map operation is the mfn of the
passed page, there's no need to store it in the persistent grant
entry, since we can always get it provided that we have the page.

This reduces the memory overhead of persistent grants in blkback.

While at it, rename the 'seg[i].buf' to be 'seg[i].offset' as
it makes much more sense - as we use that value in bio_add_page
which as the fourth argument expects the offset.

We hadn't used the physical address as part of this at all.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: xen-devel@lists.xen.org
[v1: s/buf/offset/]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-03-19 12:50:00 -04:00
Roger Pau Monne
217fd5e709 xen-blkback: fix foreach_grant_safe to handle empty lists
We may use foreach_grant_safe in the future with empty lists, so make
sure we can handle them.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: xen-devel@lists.xen.org
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-03-19 08:53:15 -04:00
Jan Beulich
0e5e098ac2 xen-blkback: fix dispatch_rw_block_io() error path
Commit 7708992 ("xen/blkback: Seperate the bio allocation and the bio
submission") consolidated the pendcnt updates to just a single write,
neglecting the fact that the error path relied on it getting set to 1
up front (such that the decrement in __end_block_io_op() would actually
drop the count to zero, triggering the necessary cleanup actions).

Also remove a misleading and a stale (after said commit) comment.

CC: stable@vger.kernel.org
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-03-18 16:31:22 -04:00
Zoltan Kiss
986cacbd26 xen/blkback: Change statistics counter types to unsigned
These values shouldn't be negative, but after an overflow their value
can turn into negative, if they are signed. xentop can show bogus
values in this case.

Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
Reported-by: Ichiro Ogino <ichiro.ogino@citrix.co.jp>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-03-11 13:56:54 -04:00
David Vrabel
0e367ae465 xen/blkback: correctly respond to unknown, non-native requests
If the frontend is using a non-native protocol (e.g., a 64-bit
frontend with a 32-bit backend) and it sent an unrecognized request,
the request was not translated and the response would have the
incorrect ID.  This may cause the frontend driver to behave
incorrectly or crash.

Since the ID field in the request is always in the same place,
regardless of the request type we can get the correct ID and make a
valid response (which will report BLKIF_RSP_EOPNOTSUPP).

This bug affected 64-bit SLES 11 guests when using a 32-bit backend.
This guest does a BLKIF_OP_RESERVED_1 (BLKIF_OP_PACKET in the SLES
source) and would crash in blkif_int() as the ID in the response would
be invalid.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Cc: stable@vger.kernel.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-03-11 13:54:28 -04:00
Chen Gang
a72d9002f8 xen/xen-blkback: preq.dev is used without initialized
If call xen_vbd_translate failed, the preq.dev will be not initialized.
Use blkif->vbd.pdevice instead (still better to print relative info).

Note that before commit 01c681d4c7
(xen/blkback: Don't trust the handle from the frontend.)
the value bogus, as it was the guest provided value from req->u.rw.handle
rather than the actual device.

Signed-off-by: Chen Gang <gang.chen@asianux.com>
Acked-by: Jan Beulich <JBeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-03-01 08:41:43 -05:00
Roger Pau Monne
087ffecdaa xen-blkback: use balloon pages for persistent grants
With current persistent grants implementation we are not freeing the
persistent grants after we disconnect the device. Since grant map
operations change the mfn of the allocated page, and we can no longer
pass it to __free_page without setting the mfn to a sane value, use
balloon grant pages instead, as the gntdev device does.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: stable@vger.kernel.org
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-02-19 15:17:21 -05:00
Konrad Rzeszutek Wilk
01c681d4c7 xen/blkback: Don't trust the handle from the frontend.
The 'handle' is the device that the request is from. For the life-time
of the ring we copy it from a request to a response so that the frontend
is not surprised by it. But we do not need it - when we start processing
I/Os we have our own 'struct phys_req' which has only most essential
information about the request. In fact the 'vbd_translate' ends up
over-writing the preq.dev with a value from the backend.

This assignment of preq.dev with the 'handle' value is superfluous
so lets not do it.

Cc: stable@vger.kernel.org
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-02-19 15:17:03 -05:00
Jan Beulich
9d092603cc xen-blkback: do not leak mode property
"be->mode" is obtained from xenbus_read(), which does a kmalloc() for
the message body. The short string is never released, so do it along
with freeing "be" itself, and make sure the string isn't kept when
backend_changed() doesn't complete successfully (which made it
desirable to slightly re-structure that function, so that the error
cleanup can be done in one place).

Reported-by: Olaf Hering <olaf@aepfle.de>
CC: stable@vger.kernel.org
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-02-19 15:16:52 -05:00
Jens Axboe
b6c46cfa31 Merge branch 'stable/for-jens-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-linus
Konrad writes:

Please git pull the following branch:

 git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git stable/for-jens-3.8

which has a bug-fix to the xen-blkfront and xen-blkback driver
when using the persistent mode. An issue was discovered where LVM
disks could not be read correctly and this fixes it. There
is also a change in llist.h which has been blessed by akpm.
2012-12-19 20:37:10 +01:00
Linus Torvalds
9228ff9038 Merge branch 'for-3.8/drivers' of git://git.kernel.dk/linux-block
Pull block driver update from Jens Axboe:
 "Now that the core bits are in, here are the driver bits for 3.8.  The
  branch contains:

   - A huge pile of drbd bits that were dumped from the 3.7 merge
     window.  Following that, it was both made perfectly clear that
     there is going to be no more over-the-wall pulls and how the
     situation on individual pulls can be improved.

   - A few cleanups from Akinobu Mita for drbd and cciss.

   - Queue improvement for loop from Lukas.  This grew into adding a
     generic interface for waiting/checking an even with a specific
     lock, allowing this to be pulled out of md and now loop and drbd is
     also using it.

   - A few fixes for xen back/front block driver from Roger Pau Monne.

   - Partition improvements from Stephen Warren, allowing partiion UUID
     to be used as an identifier."

* 'for-3.8/drivers' of git://git.kernel.dk/linux-block: (609 commits)
  drbd: update Kconfig to match current dependencies
  drbd: Fix drbdsetup wait-connect, wait-sync etc... commands
  drbd: close race between drbd_set_role and drbd_connect
  drbd: respect no-md-barriers setting also when changed online via disk-options
  drbd: Remove obsolete check
  drbd: fixup after wait_even_lock_irq() addition to generic code
  loop: Limit the number of requests in the bio list
  wait: add wait_event_lock_irq() interface
  xen-blkfront: free allocated page
  xen-blkback: move free persistent grants code
  block: partition: msdos: provide UUIDs for partitions
  init: reduce PARTUUID min length to 1 from 36
  block: store partition_meta_info.uuid as a string
  cciss: use check_signature()
  cciss: cleanup bitops usage
  drbd: use copy_highpage
  drbd: if the replication link breaks during handshake, keep retrying
  drbd: check return of kmalloc in receive_uuids
  drbd: Broadcast sync progress no more often than once per second
  drbd: don't try to clear bits once the disk has failed
  ...
2012-12-17 13:39:11 -08:00
Roger Pau Monne
7dc341175a xen-blkback: implement safe iterator for the list of persistent grants
Change foreach_grant iterator to a safe version, that allows freeing
the element while iterating. Also move the free code in
free_persistent_gnts to prevent freeing the element before the rb_next
call.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad@kernel.org>
Cc: xen-devel@lists.xen.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-12-07 15:13:09 -05:00
Roger Pau Monne
4d4f270f18 xen-blkback: move free persistent grants code
Move the code that frees persistent grants from the red-black tree
to a function. This will make it easier for other consumers to move
this to a common place.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-11-26 14:58:11 -05:00
Roger Pau Monne
cb5bd4d19b xen/blkback: persistent-grants fixes
This patch contains fixes for persistent grants implementation v2:

 * handle == 0 is a valid handle, so initialize grants in blkback
   setting the handle to BLKBACK_INVALID_HANDLE instead of 0. Reported
   by Konrad Rzeszutek Wilk.

 * new_map is a boolean, use "true" or "false" instead of 1 and 0.
   Reported by Konrad Rzeszutek Wilk.

 * blkfront announces the persistent-grants feature as
   feature-persistent-grants, use feature-persistent instead which is
   consistent with blkback and the public Xen headers.

 * Add a consistency check in blkfront to make sure we don't try to
   access segments that have not been set.

Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Roger Pau Monne <roger.pau@citrix.com>
[v1: The new_map int->bool had already been changed]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-11-04 10:35:40 -05:00
Roger Pau Monne
0a8704a51f xen/blkback: Persistent grant maps for xen blk drivers
This patch implements persistent grants for the xen-blk{front,back}
mechanism. The effect of this change is to reduce the number of unmap
operations performed, since they cause a (costly) TLB shootdown. This
allows the I/O performance to scale better when a large number of VMs
are performing I/O.

Previously, the blkfront driver was supplied a bvec[] from the request
queue. This was granted to dom0; dom0 performed the I/O and wrote
directly into the grant-mapped memory and unmapped it; blkfront then
removed foreign access for that grant. The cost of unmapping scales
badly with the number of CPUs in Dom0. An experiment showed that when
Dom0 has 24 VCPUs, and guests are performing parallel I/O to a
ramdisk, the IPIs from performing unmap's is a bottleneck at 5 guests
(at which point 650,000 IOPS are being performed in total). If more
than 5 guests are used, the performance declines. By 10 guests, only
400,000 IOPS are being performed.

This patch improves performance by only unmapping when the connection
between blkfront and back is broken.

On startup blkfront notifies blkback that it is using persistent
grants, and blkback will do the same. If blkback is not capable of
persistent mapping, blkfront will still use the same grants, since it
is compatible with the previous protocol, and simplifies the code
complexity in blkfront.

To perform a read, in persistent mode, blkfront uses a separate pool
of pages that it maps to dom0. When a request comes in, blkfront
transmutes the request so that blkback will write into one of these
free pages. Blkback keeps note of which grefs it has already
mapped. When a new ring request comes to blkback, it looks to see if
it has already mapped that page. If so, it will not map it again. If
the page hasn't been previously mapped, it is mapped now, and a record
is kept of this mapping. Blkback proceeds as usual. When blkfront is
notified that blkback has completed a request, it memcpy's from the
shared memory, into the bvec supplied. A record that the {gref, page}
tuple is mapped, and not inflight is kept.

Writes are similar, except that the memcpy is peformed from the
supplied bvecs, into the shared pages, before the request is put onto
the ring.

Blkback stores a mapping of grefs=>{page mapped to by gref} in
a red-black tree. As the grefs are not known apriori, and provide no
guarantees on their ordering, we have to perform a search
through this tree to find the page, for every gref we receive. This
operation takes O(log n) time in the worst case. In blkfront grants
are stored using a single linked list.

The maximum number of grants that blkback will persistenly map is
currently set to RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, to
prevent a malicios guest from attempting a DoS, by supplying fresh
grefs, causing the Dom0 kernel to map excessively. If a guest
is using persistent grants and exceeds the maximum number of grants to
map persistenly the newly passed grefs will be mapped and unmaped.
Using this approach, we can have requests that mix persistent and
non-persistent grants, and we need to handle them correctly.
This allows us to set the maximum number of persistent grants to a
lower value than RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, although
setting it will lead to unpredictable performance.

In writing this patch, the question arrises as to if the additional
cost of performing memcpys in the guest (to/from the pool of granted
pages) outweigh the gains of not performing TLB shootdowns. The answer
to that question is `no'. There appears to be very little, if any
additional cost to the guest of using persistent grants. There is
perhaps a small saving, from the reduced number of hypercalls
performed in granting, and ending foreign access.

Signed-off-by: Oliver Chick <oliver.chick@citrix.com>
Signed-off-by: Roger Pau Monne <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[v1: Fixed up the misuse of bool as int]
2012-10-30 09:50:04 -04:00
Oliver Chick
1f999572f2 xen/blkback: Change xen_vbd's flush_support and discard_secure to have type unsigned int, rather than bool
Changing the type of bdev parameters to be unsigned int :1, rather than bool.
This is more consistent with the types of other features in the block drivers.

Signed-off-by: Oliver Chick <oliver.chick@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:37:20 +01:00
Wei Yongjun
654dbef214 xen/blkback: use kmem_cache_zalloc instead of kmem_cache_alloc/memset
Using kmem_cache_zalloc() instead of kmem_cache_alloc() and memset().

spatch with a semantic match is used to found this problem.
(http://coccinelle.lip6.fr/)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-10-30 08:36:27 +01:00
Konrad Rzeszutek Wilk
2911758f14 xen/blkback: Fix compile warning
drivers/block/xen-blkback/xenbus.c:260:5: warning: symbol 'xenvbd_sysfs_addif' was not declared. Should it be static?
drivers/block/xen-blkback/xenbus.c:284:6: warning: symbol 'xenvbd_sysfs_delif' was not declared. Should it be static?

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-10-30 08:32:43 +01:00
Linus Torvalds
f1c6872e49 Features:
* Allow a Linux guest to boot as initial domain and as normal guests
    on Xen on ARM (specifically ARMv7 with virtualized extensions).
    PV console, block and network frontend/backends are working.
 Bug-fixes:
  * Fix compile linux-next fallout.
  * Fix PVHVM bootup crashing.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQEcBAABAgAGBQJQbJELAAoJEFjIrFwIi8fJSI4H/32qrQKyF5IIkFKHTN9FYDC1
 OxEGc4y47DIQpGUd/PgZ/i6h9Iyhj+I6pb4lCevykwgd0j83noepluZlCIcJnTfL
 HVXNiRIQKqFhqKdjTANxVM4APup+7Lqrvqj6OZfUuoxaZ3tSTLhabJ/7UXf2+9xy
 g2RfZtbSdQ1sukQ/A2MeGQNT79rh7v7PrYQUYSrqytjSjSLPTqRf75HWQ+eapIAH
 X3aVz8Tn6nTixZWvZOK7rAaD4awsFxGP6E46oFekB02f4x9nWHJiCZiXwb35lORb
 tz9F9td99f6N4fPJ9LgcYTaCPwzVnceZKqE9hGfip4uT+0WrEqDxq8QmBqI5YtI=
 =gxJD
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.7-arm-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen

Pull ADM Xen support from Konrad Rzeszutek Wilk:

  Features:
   * Allow a Linux guest to boot as initial domain and as normal guests
     on Xen on ARM (specifically ARMv7 with virtualized extensions).  PV
     console, block and network frontend/backends are working.
  Bug-fixes:
   * Fix compile linux-next fallout.
   * Fix PVHVM bootup crashing.

  The Xen-unstable hypervisor (so will be 4.3 in a ~6 months), supports
  ARMv7 platforms.

  The goal in implementing this architecture is to exploit the hardware
  as much as possible.  That means use as little as possible of PV
  operations (so no PV MMU) - and use existing PV drivers for I/Os
  (network, block, console, etc).  This is similar to how PVHVM guests
  operate in X86 platform nowadays - except that on ARM there is no need
  for QEMU.  The end result is that we share a lot of the generic Xen
  drivers and infrastructure.

  Details on how to compile/boot/etc are available at this Wiki:

    http://wiki.xen.org/wiki/Xen_ARMv7_with_Virtualization_Extensions

  and this blog has links to a technical discussion/presentations on the
  overall architecture:

    http://blog.xen.org/index.php/2012/09/21/xensummit-sessions-new-pvh-virtualisation-mode-for-arm-cortex-a15arm-servers-and-x86/

* tag 'stable/for-linus-3.7-arm-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen: (21 commits)
  xen/xen_initial_domain: check that xen_start_info is initialized
  xen: mark xen_init_IRQ __init
  xen/Makefile: fix dom-y build
  arm: introduce a DTS for Xen unprivileged virtual machines
  MAINTAINERS: add myself as Xen ARM maintainer
  xen/arm: compile netback
  xen/arm: compile blkfront and blkback
  xen/arm: implement alloc/free_xenballooned_pages with alloc_pages/kfree
  xen/arm: receive Xen events on ARM
  xen/arm: initialize grant_table on ARM
  xen/arm: get privilege status
  xen/arm: introduce CONFIG_XEN on ARM
  xen: do not compile manage, balloon, pci, acpi, pcpu and cpu_hotplug on ARM
  xen/arm: Introduce xen_ulong_t for unsigned long
  xen/arm: Xen detection and shared_info page mapping
  docs: Xen ARM DT bindings
  xen/arm: empty implementation of grant_table arch specific functions
  xen/arm: sync_bitops
  xen/arm: page.h definitions
  xen/arm: hypercalls
  ...
2012-10-07 07:13:01 +09:00
Stefano Stabellini
2fc136eecd xen/m2p: do not reuse kmap_op->dev_bus_addr
If the caller passes a valid kmap_op to m2p_add_override, we use
kmap_op->dev_bus_addr to store the original mfn, but dev_bus_addr is
part of the interface with Xen and if we are batching the hypercalls it
might not have been written by the hypervisor yet. That means that later
on Xen will write to it and we'll think that the original mfn is
actually what Xen has written to it.

Rather than "stealing" struct members from kmap_op, keep using
page->index to store the original mfn and add another parameter to
m2p_remove_override to get the corresponding kmap_op instead.
It is now responsibility of the caller to keep track of which kmap_op
corresponds to a particular page in the m2p_override (gntdev, the only
user of this interface that passes a valid kmap_op, is already doing that).

CC: stable@kernel.org
Reported-and-Tested-By: Sander Eikelenboom <linux@eikelenboom.it>
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-09-12 11:21:40 -04:00