Pull block driver updates from Jens Axboe:
"Here are the driver related changes for 3.5. It contains:
- The floppy changes from Jiri. Jiri is now also marked as the
maintainer of floppy.c, I shall be publically branding his forehead
with red hot iron at the next opportune moment.
- A batch of drbd updates and fixes from the linbit crew, as well as
fixes from others.
- Two small fixes for xen-blkfront courtesy of Jan."
* 'for-3.5/drivers' of git://git.kernel.dk/linux-block: (70 commits)
floppy: take over maintainership
floppy: remove floppy-specific O_EXCL handling
floppy: convert to delayed work and single-thread wq
xen-blkfront: module exit handling adjustments
xen-blkfront: properly name all devices
drbd: grammar fix in log message
drbd: check MODULE for THIS_MODULE
drbd: Restore the request restart logic
drbd: introduce a bio_set to allocate housekeeping bios from
drbd: remove unused define
drbd: bm_page_async_io: properly initialize page->private
drbd: use the newly introduced page pool for bitmap IO
drbd: add page pool to be used for meta data IO
drbd: allow bitmap to change during writeout from resync_finished
drbd: fix race between drbdadm invalidate/verify and finishing resync
drbd: fix resend/resubmit of frozen IO
drbd: Ensure that data_size is not 0 before using data_size-1 as index
drbd: Delay/reject other state changes while establishing a connection
drbd: move put_ldev from __req_mod() to the endio callback
drbd: fix WRITE_ACKED_BY_PEER_AND_SIS to not set RQ_NET_DONE
...
In 2009 Philip Reiser notied that a few users of netlink connector
interface needed a capability check and added the idiom
cap_raised(nsp->eff_cap, CAP_SYS_ADMIN) to a few of them, on the premise
that netlink was asynchronous.
In 2011 Patrick McHardy noticed we were being silly because netlink is
synchronous and removed eff_cap from the netlink_skb_params and changed
the idiom to cap_raised(current_cap(), CAP_SYS_ADMIN).
Looking at those spots with a fresh eye we should be calling
capable(CAP_SYS_ADMIN). The only reason I can see for not calling capable
is that it once appeared we were not in the same task as the caller which
would have made calling capable() impossible.
In the initial user_namespace the only difference between between
cap_raised(current_cap(), CAP_SYS_ADMIN) and capable(CAP_SYS_ADMIN) are a
few sanity checks and the fact that capable(CAP_SYS_ADMIN) sets
PF_SUPERPRIV if we use the capability.
Since we are going to be using root privilege setting PF_SUPERPRIV seems
the right thing to do.
The motivation for this that patch is that in a child user namespace
cap_raised(current_cap(),...) tests your capabilities with respect to that
child user namespace not capabilities in the initial user namespace and
thus will allow processes that should be unprivielged to use the kernel
services that are only protected with cap_raised(current_cap(),..).
To fix possible user_namespace issues and to just clean up the code
replace cap_raised(current_cap(), CAP_SYS_ADMIN) with
capable(CAP_SYS_ADMIN).
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Acked-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Acked-by: Andrew G. Morgan <morgan@kernel.org>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: David Howells <dhowells@redhat.com>
Reviewed-by: James Morris <james.l.morris@oracle.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
THIS_MODULE is NULL only when drbd is compiled as built-in,
so the #ifdef CONFIG_MODULES should be #ifdef MODULE instead.
This fixes the warning:
drivers/block/drbd/drbd_main.c: In function ‘drbd_buildtag’:
drivers/block/drbd/drbd_main.c:4187:24: warning: the comparison will always evaluate as ‘true’ for the address of ‘__this_module’ will never be NULL [-Waddress]
Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
It got lost with the commit 5a7bbad27a
"block: remove support for bio remapping from ->make_request"
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Don't rely on availability of bios from the global fs_bio_set,
we should use our own bio_set for meta data IO.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
If bm_page_async_io is advised to use a new page for I/O
(BM_AIO_COPY_PAGES is set), it will get it from a mempool.
Once the mempool has to dip into its reserves the page is
not reinitialized, i.e. page->private contains garbage, which
will lead to various problems once the I/O completes (dereferences
of NULL pointers, the submitting thread getting stuck in D-state,
...).
Signed-off-by: Arne Redlich <arne.redlich@googlemail.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Symptom: messages similar to
"FIXME asender in bm_change_bits_to,
bitmap locked for 'write from resync_finished' by worker"
If a resync or verify is finished (or aborted), a full bitmap writeout
is triggered. If we have ongoing local IO, the bitmap may still change
during that writeout, pending and not yet processed acks may cause bits
to be cleared, while new writes may cause bits to be to be set.
To fix this, introduce the drbd_bm_write_copy_pages() variant.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
When a resync or online verify is finished or aborted,
drbd does a bulk write-out of changed bitmap pages.
If *in that very moment* a new verify or resync is triggered,
this can race:
ASSERT( !test_bit(BITMAP_IO, &mdev->flags) ) in drbd_main.c
FIXME going to queue 'set_n_write from StartingSync' but 'write from resync_finished' still pending?
and similar.
This can be observed with e.g. tight invalidate loops in test scripts,
and probably has no real-life implication.
Still, that race can be solved by first quiescen the device,
before starting a new resync or verify.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
DRBD can freeze IO, due to fencing policy (fencing resource-and-stonith),
or because we lost access to data (on-no-data-accessible suspend-io).
Resuming from there (re-connect, or re-attach, or explicit admin
intervention) should "just work".
Unfortunately, if the re-attach/re-connect did not happen within
the timeout, since the commit
drbd: Implemented real timeout checking for request processing time
if so configured, the request_timer_fn() would timeout and
detach/disconnect virtually immediately.
This change tracks the most recent attach and connect, and does not
timeout within <configured timeout interval> after attach/connect.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
This could be exploited by a peer which runs modified code.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Changes to the role and disk state should be delayed or rejected
while we establish a connection.
This is necessary, since the peer will base its resync decision
on the UUIDs and the state we sent in the drbd_connect() function.
The most prominent example for this race is becoming primary after
sending state and UUIDs and before the state changes to C_WF_CONNECTION.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
One invocation in the endio handler is good enough,
we don't need mention it for each of the different ways
it calls __req_mod().
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Just because this request happened during a resync does
not mean it may pretend to have been barrier-acked.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
READ_RETRY_REMOTE_CANCELED needs to be grouped with the other _CANCELED
cases, not with CONNECTION_LOST_WHILE_PENDING, as that would complete
(fail) the bio even if the device became suspended.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
OOS_HANDED_TO_NETWORK should not be grouped with the various
*_CANCELED/*_FAILED cases.
Also, not only clear the RQ_NET_QUEUED flag, but also mark it RQ_NET_DONE,
so it can be distinguished from a local-only request even after that.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
We used to have a barrier implementation where barrier_nr 0 was
reserved. That is long gone. Just use the full sequence space.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
We assumed only bios with bi_idx == 0 would end up
in drbd_make_request().
That is wrong.
At least device mapper, in __clone_and_map(), may submit
clones only covering a partial bio, but sharing
the original bvec, by adjusting bi_idx and relevant
other bio members of the clone.
We used __bio_for_each_segment() in various places,
even though that is documented as
* drivers should not use the __ version unless they _really_ want to
* run through the entire bio and not just pending pieces
Impact: we would send the full bio bvec, even for the clone
with bi_idx > 0, which will cause data corruption on the
peer (because we submit wrong data at the clone offset),
and will cause a DRBD protocol error, disconnect/reconnect
and resync (thus fixing the corruption),
because the next package header would be expected right
in the middle of the sent data, causing DRBD magic mismatch.
Fix: drop the assert, and use bio_for_each_segment()
instead of the __ version.
Conflicts:
drbd/drbd_tracing.c
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
If a SyncTarget node gets a P_RS_DATA_REPLY before a P_DATA packet
for the same sector, it simply submits these two IO requests.
This is be possible because on the SyncSource node, the data of the
P_RS_DATA_REPLY packet was read from disk. Immediately after that a
write request from upper layers came in.
The disk scheduler or even the "hardware" queues on the disk drive might
reorder these writes.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
When we have a write request and a state change C_WF_BITMAP_S -> C_SYNC_SOURCE
at the same time, and it happens that the line
remote = remote && drbd_should_do_remote(s);
stills sees C_WF_BITMAP_S, and
send_oos = rw == WRITE && drbd_should_send_oos(s);
already sees C_SYNC_SOURCE both are 0.
This causes the write to not be mirrored, but marked as out-of-sync on the
Sync_Source node.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Without this, iostat frequently sees bogus svctime and >= 100% "utilization".
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
drbd_accept was modelled after kernel_accept
with drbd commit 53eb779 in July 2008.
Only, kernel_accept was then broken, and only fixed later
with kernel commit 1b08534e in Dec 2008:
net: Fix module refcount leak in kernel_accept()
Impact: protocol families provided as modules, e.g. ipv6 or ib_sdp,
would soon have their reference count become negative, preventing
them from being unloaded (likely), or worse, hit zero without actually
being unused, allowing them to be unloaded while still in use (unlikely,
but if triggered, causing a kernel crash).
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
If the backing device is already frozen during attach, we failed
to recognize that. The current disk-timeout code works on top
of the drbd_request objects. During attach we do not allow IO
and therefore never generate a drbd_request object but block
before that in drbd_make_request().
This patch adds the timeout to all drbd_md_sync_page_io().
Before this patch we used to go from D_ATTACHING directly
to D_DISKLESS if IO failed during attach. We can no longer
do this since we have to stay in D_FAILED until all IO
ops issued to the backing device returned.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
I.e. in C_WF_REPORT_PARAMS or in C_WF_CONNECTION.
Sending may already work in these cstates, but the peer still expects
the HandShake / ConnectionFeatures packet.
Actually triggered by the Testuite on kugel.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
If the asender thread, or request_timer_fn(), or some other part of
the code, decided to drop the connection (because of timeout or other),
but the receiver just now was processing a P_STATE packet, there was a
chance that receive_state() would do a hard state change
"re-establishing" an already failed connection without additional handshake.
Log excerpt:
Remote failed to finish a request within ko-count * timeout
peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )
asender terminated
...
peer( Unknown -> Secondary ) conn( Timeout -> Connected ) pdsk( DUnknown -> UpToDate ) peer_isp( 0 -> 1 )
...
Connection closed
peer( Secondary -> Unknown ) conn( Connected -> Unconnected ) pdsk( UpToDate -> DUnknown ) peer_isp( 1 -> 0 )
receiver terminated
Impact:
while the connection state is erroneously "Connected",
requests may be queued and even sent,
which would never be acknowledged,
and may have been missed by the cleanup.
These requests would never be completed.
The next drbd_suspend_io() will then lock up,
waiting forever for these requests to complete.
Fixed in several code paths:
Make sure the connection state is NetworkFailure or worse
before starting the cleanup in drbd_disconnect().
This should make sure the cleanup won't miss any requests.
Disallow receive_state() to "upgrade" the connection state
from an error state. This will make sure the "illegal" state
transition won't happen.
For all connection failure states,
relax the safe-guard in sanitize_state() again
to silently mask out those state changes
(e.g. Timeout -> Connected becomes Timeout -> Timeout).
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
drbd_try_clear_on_disk_bm() has a sanity check for the number of blocks
left to be resynced (rs_left) in the current resync extent.
If it detects a mismatch, it complains, and forces a disconnect using
drbd_force_state(mdev, NS(conn, C_DISCONNECTING));
Unfortunately, this may be called while holding the req_lock,
and drbd_force_state() want's to aquire that lock itself. Deadlock.
Don't force a disconnect, but fix up rs_left by recounting and
reassigning the number of dirty blocks in that extent.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
This bug might have caused troubles if disk-barriers and the ahead-behind
more are enabled at the same time.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
DRBD state changes schedule after_state_ch() actions to a worker thread,
which decides on the old and new states of that change, whether to send
an informational state update packet (P_STATE) to the peer.
If it decides to drbd_send_state(), it would however always send the
_curent_ state, which, if a second state change happens before the
after_state_ch() of the first ran, may "fast-forward" the peer's view
about this node. In most cases that is harmless, but sometimes this can
confuse DRBD, for example into not actually starting a necessary resync
if you do a very tight detach/attach loop on a Connected Secondary.
Fix this by always sending the "new" state of the respective state
transition which scheduled this after_state_ch() work.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
When detaching, even cleanly detaching due to administrator request,
we always go through D_FAILED before we become D_DISKLESS.
Don't let that state change race with an in-flight meta data IO,
or that one might think it actually experienced an IO error.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
drbd_state_lock() is only there to serialize cluster wide state
changes. Testing the local disk state needs to happen while
holding the global_state_lock.
Otherwise you might see something like this (Oct 6 on kugel)
14:20:24 drbd0: conn( WFSyncUUID -> Connected ) disk( Inconsistent -> Failed )
14:20:24 drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
14:20:24 drbd0: conn( Connected -> SyncTarget ) disk( Failed -> Inconsistent )
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
We have one pre-allocated page to do certain synchronous meta data IO with,
using it is serialized like so:
drbd_md_get_buffer();
drbd_md_sync_page_io();
drbd_md_sync_page_io();
...
drbd_md_put_buffer();
In drbd_md_sync_page_io() there is an
ASSERT(atomic_read(&mdev->md_io_in_use) == 1);
We want to be able to timeout on unresponsive lower level devices, so we
can "detach" in that case. Inside drbd_md_sync_page_io() we grab an extra
reference, to not have a dangling pointer in case a delayed IO eventually
does still complete, even after we "detached" already.
We need to put the extra reference before we signal completion from the
completion handler, or the second drbd_md_sync_page_io() above may
trigger the assert (reference count still 2).
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
With sync-after dependencies, given "lucky" timing of pause/unpause
events, and the end of an empty (0 bits set) resync was sometimes not
detected on the SyncTarget, leading to a "stalled" SyncSource state.
Fixed this by expecting not only "Inconsistent -> UpToDate" but also
"Consistent -> UpToDate" transitions for the peer disk state
to end a resync.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
If we get into the C_BROKEN_PIPE cstate once, the state engine set the
thi->t_state of the receiver thread to restarting. But with the while loop
in drbdd_init() a new connection gets established. After the call into
drbdd() returns immediately since the thi->t_state is not RUNNING. The
restart of drbd_init() then resets thi->t_state to RUNNING.
I.e. after entering C_BROKEN_PIPE once, the next successful established
connection gets wasted.
The two parts of the fix:
* Do not cause the thread to restart if we detect the issue
with the sockets while we are in C_WF_CONNECTION.
* Make sure that all actions that would have set us to C_BROKEN_PIPE
happen before the state change to C_WF_REPORT_PARAMS.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
...when the peer has inconsistent data. In that case we failed to
clear the susp_nod flag. When the local disk was attached again
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Now, the new edition of the clause only fires if a diskless
peer gets promoted.
This is a fixup for "drbd: Delayed creation of current-UUID".
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Bitmap IO may happend in the context of an application write,
in the generic block IO path. We need to use GFP_NOIO.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
When the disk-timeout is active, and it expires for a single request,
we consider the local disk as D_FAILED. Note: With this change,
I made both timeout based state transitions HARD state transitions.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>