Return an optional extent flags field from our lookup functions and wire up
callers to treat unwritten regions as holes for the purpose of returning
zeros to the user.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Due to the size of our group bitmaps, we'll never have a leaf node extent
record with more than 16 bits worth of clusters. Split e_clusters up so that
leaf nodes can get a flags field where we can mark unwritten extents.
Interior nodes whose length references all the child nodes beneath it can't
split their e_clusters field, so we use a union to preserve sizing there.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
We need to fill holes during a splice write. Provide our own splice write
actor which can call ocfs2_file_buffered_write() with a splice-specific
callback.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Do this instead of filemap_fdatawrite() - this way we sync only the
range between i_size and the cluster boundary.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
do_sync_file_range() accepts a file * from which it takes an address_space to
sync. Abstract out the bulk of the function into do_sync_mapping_range()
which takes the address_space directly. This way callers who want to sync an
address_space directly can take advantage of the functionality provided.
do_sync_file_range() is preserved as a small wrapper around
do_sync_mapping_range().
Ocfs2 in particular would like to use this to initiate a sync of a specific
inode range during truncate, where a file * may not be available.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Since we don't zero on extend anymore, truncate needs to be fixed up to zero
the part of a file between i_size and and end of it's cluster. Otherwise a
subsequent extend could expose bad data.
This introduced a new helper, which can be used in ocfs2_write().
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
ocfs2_get_block() didn't understand sparse files, fix that. Also remove some
code that isn't really useful anymore. We can fix up
ocfs2_direct_IO_get_blocks() at the same time.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Unfortunately, ocfs2 can no longer make use of generic_file_aio_write_nlock()
because allocating writes will require zeroing of pages adjacent to the I/O
for cluster sizes greater than page size.
Implement a custom file write here, which can order page locks for zeroing.
This also has the advantage that cluster locks can easily be ordered outside
of the page locks.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Right now, file allocation for ocfs2 is done within ocfs2_extend_file(),
which is either called from ->setattr() (for an i_size change), or at the
top of ocfs2_file_aio_write().
Inodes on file systems with sparse file support will want to do their
allocation during the actual write call.
In either case the cluster locking decisions are the same. We abstract out
that code into a new function, ocfs2_lock_allocators() which will be used by
a later patch to enable writing to sparse files.
This also provides a nice cleanup of ocfs2_extend_allocation().
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
For ocfs2_truncate_file(), we eliminate the "simple" truncate case which no
longer exists since i_size is not tied to i_clusters. In
ocfs2_extend_file(), we skip the allocation / page zeroing code for file
systems which understand sparse files.
The core truncate code is changed to do a bottom up tree traversal. This
gets abstracted out into it's own function. To make things more readable,
most of the special case handling for in-inode extents from
ocfs2_do_truncate() is also removed.
Though write support for sparse files comes in a later patch, we at least
update ocfs2_prepare_inode_for_write() to skip allocation for sparse files.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
The code in extent_map.c is not prepared to deal with a subtree being
rotated between lookups. This can happen when filling holes in sparse files.
Instead of a lengthy patch to update the code (which would likely lose the
benefit of caching subtree roots), we remove most of the algorithms and
implement a simple path based lookup. A less ambitious extent caching scheme
will be added in a later patch.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Introduce tree rotations into the b-tree code. This will allow ocfs2 to
support sparse files. Much of the added code is designed to be generic (in
the ocfs2 sense) so that it can later be re-used to implement large
extended attributes.
This patch only adds the rotation code and does minimal updates to callers
of the extent api.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
There are two checks in there (one for inode newness, one for other mounted
nodes) which are unnecessary, so remove them. The DLM will allow the trylock
in either case without any messaging overhead.
Removing these makes ocfs2_request_delete() a one liner function, so just
move the trylock out one level into ocfs2_query_inode_wipe().
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Remove node messaging code that becomes unused with the delete inode vote
removal.
[Removed even more cruft which I spotted during review --Mark]
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Ocfs2 currently does cluster-wide node messaging to check the open state of
an inode during delete. This patch removes that mechanism in favor of an
inode cluster lock which is taken at shared read when an inode is first read
and dropped in clear_inode(). This allows a deleting node to test the
liveness of an inode by attempting to take an exclusive lock.
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
This brings the SAD info in sync with net-2.6.22/net-2.6
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
We don't want to print anything at all in ocfs2_lookup() when getting an
error from ocfs2_iget() - it could be something as innocuous as a signal
being detected in the dlm.
ocfs2_permission() should filter on -ENOENT which ocfs2_meta_lock() can
return if the inode was deleted on another node.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
We have noticed panic() hanging leading us to a situation in which
the node, while otherwise dead, is still disk heartbeating. This
leads to a hung cluster as the other nodes are waiting for this
node to stop disk heartbeating. This situation is only resolved
by power resetting the box.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
We don't want the extent map and uptodate cache destruction in
ocfs2_meta_lock_update() on a local mount, so skip that.
This fixes several bugs with uptodate being cleared on buffers and extent
maps being corrupted.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
In dlm_migrate_all_locks(), we currently call cond_resched_lock() after
processing each lockres in a hash bucket. Move it outside the loop so as to
call it only after the entire hash bucket has been processed.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
There is a possibility that dlm_remaster_locks could overwride node->state
with DLM_RECO_NODE_DATA_REQUESTED after dlm_reco_data_done_handler sets the
node->state to DLM_RECO_NODE_DATA_DONE. This could lead to recovery getting
stuck and requires a cluster reboot. Synchronize with dlm_reco_state_lock
spinlock.
Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
A couple of routines need their arguments to be const.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This starts bringing the PowerPC and Sparc64 implemetations back closer
together.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
It should be set to the total number of pages that the
system will really have available after things like
initmem, the bootmem map, and initrd are freed up.
Signed-off-by: David S. Miller <davem@davemloft.net>
While useful in odd circumstances to debug something, they are
normally totally unused and anyone can fetch this code out of the
history if they really need it.
And in any event, the person who needs this kind of code is usually me
:-)
Signed-off-by: David S. Miller <davem@davemloft.net>
__get_phys is only called from init.c as is prom_virt_to_phys(),
__get_iospace() is not called at all, and sun4u_get_pte() is largely
misnamed.
Privatize the implementation and helper functions of
sun4u_get_phys() to mm/init.c, and rename to
kvaddr_to_paddr().
The only used of this thing is flush_icache_range(), and thus
things can be considerably further simplified. For example,
we should only see module or PAGE_OFFSET kernel addresses here,
so we don't need the OBP firmware range handling at all.
Signed-off-by: David S. Miller <davem@davemloft.net>
Kick out empty entries as soon as we spot them, and use memmove()
instead of a silly loop to make the operation more clear.
Signed-off-by: David S. Miller <davem@davemloft.net>
Decrease the SECTION_SIZE_BITS --> MAX_PHYSADDR_BITS
range a little bit.
The cost of going to SPARSEMEM_STATIC becomes 8K of BSS space, and in
return we save a pointer dereferences on every page struct lookup.
Even better we hit the main kernel image for the base address which is
in a hugepage locked TLB entry.
Signed-off-by: David S. Miller <davem@davemloft.net>
This helps deal with the invisible bridge that sits between
the host controller and the top-most visisble PCI devices
on hypervisor systems.
For example, on T1000 the bus-range property says 2 --> 4
and so there is a PCI express bridge at bus 2, devfn 0, etc.
So if we don't force the dummy host controller to bus zero,
we'll try to create two devices with the same domain/bus/devfn
triplet.
Also, add some more log diagnostics to make debugging stuff like this
easyer.
Signed-off-by: David S. Miller <davem@davemloft.net>
We fake up a dummy one in all cases because that is the simplest
thing to do and it happens to be necessary for hypervisor systems.
Signed-off-by: David S. Miller <davem@davemloft.net>