fc0561cefc
xfs: timestamp updates cause excessive fdatasync log traffic Sage Weil reported that a ceph test workload was writing to the log on every fdatasync during an overwrite workload. Event tracing showed that the only metadata modification being made was the timestamp updates during the write(2) syscall, but fdatasync(2) is supposed to ignore them. The key observation was that the transactions in the log all looked like this: INODE: #regs: 4 ino: 0x8b flags: 0x45 dsize: 32 And contained a flags field of 0x45 or 0x85, and had data and attribute forks following the inode core. This means that the timestamp updates were triggering dirty relogging of previously logged parts of the inode that hadn't yet been flushed back to disk. There are two parts to this problem. The first is that XFS relogs dirty regions in subsequent transactions, so it carries around the fields that have been dirtied since the last time the inode was written back to disk, not since the last time the inode was forced into the log. The second part is that on v5 filesystems, the inode change count update during inode dirtying also sets the XFS_ILOG_CORE flag, so on v5 filesystems this makes a timestamp update dirty the entire inode. As a result when fdatasync is run, it looks at the dirty fields in the inode, and sees more than just the timestamp flag, even though the only metadata change since the last fdatasync was just the timestamps. Hence we force the log on every subsequent fdatasync even though it is not needed. To fix this, add a new field to the inode log item that tracks changes since the last time fsync/fdatasync forced the log to flush the changes to the journal. This flag is updated when we dirty the inode, but we do it before updating the change count so it does not carry the "core dirty" flag from timestamp updates. The fields are zeroed when the inode is marked clean (due to writeback/freeing) or when an fsync/datasync forces the log. Hence if we only dirty the timestamps on the inode between fsync/fdatasync calls, the fdatasync will not trigger another log force. Over 100 runs of the test program: Ext4 baseline: runtime: 1.63s +/- 0.24s avg lat: 1.59ms +/- 0.24ms iops: ~2000 XFS, vanilla kernel: runtime: 2.45s +/- 0.18s avg lat: 2.39ms +/- 0.18ms log forces: ~400/s iops: ~1000 XFS, patched kernel: runtime: 1.49s +/- 0.26s avg lat: 1.46ms +/- 0.25ms log forces: ~30/s iops: ~1500 Reported-by: Sage Weil <sage@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
56 lines
2.0 KiB
C
56 lines
2.0 KiB
C
/*
|
|
* Copyright (c) 2000,2005 Silicon Graphics, Inc.
|
|
* All Rights Reserved.
|
|
*
|
|
* This program is free software; you can redistribute it and/or
|
|
* modify it under the terms of the GNU General Public License as
|
|
* published by the Free Software Foundation.
|
|
*
|
|
* This program is distributed in the hope that it would be useful,
|
|
* but WITHOUT ANY WARRANTY; without even the implied warranty of
|
|
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
|
* GNU General Public License for more details.
|
|
*
|
|
* You should have received a copy of the GNU General Public License
|
|
* along with this program; if not, write the Free Software Foundation,
|
|
* Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
|
|
*/
|
|
#ifndef __XFS_INODE_ITEM_H__
|
|
#define __XFS_INODE_ITEM_H__
|
|
|
|
/* kernel only definitions */
|
|
|
|
struct xfs_buf;
|
|
struct xfs_bmbt_rec;
|
|
struct xfs_inode;
|
|
struct xfs_mount;
|
|
|
|
typedef struct xfs_inode_log_item {
|
|
xfs_log_item_t ili_item; /* common portion */
|
|
struct xfs_inode *ili_inode; /* inode ptr */
|
|
xfs_lsn_t ili_flush_lsn; /* lsn at last flush */
|
|
xfs_lsn_t ili_last_lsn; /* lsn at last transaction */
|
|
unsigned short ili_lock_flags; /* lock flags */
|
|
unsigned short ili_logged; /* flushed logged data */
|
|
unsigned int ili_last_fields; /* fields when flushed */
|
|
unsigned int ili_fields; /* fields to be logged */
|
|
unsigned int ili_fsync_fields; /* logged since last fsync */
|
|
} xfs_inode_log_item_t;
|
|
|
|
static inline int xfs_inode_clean(xfs_inode_t *ip)
|
|
{
|
|
return !ip->i_itemp || !(ip->i_itemp->ili_fields & XFS_ILOG_ALL);
|
|
}
|
|
|
|
extern void xfs_inode_item_init(struct xfs_inode *, struct xfs_mount *);
|
|
extern void xfs_inode_item_destroy(struct xfs_inode *);
|
|
extern void xfs_iflush_done(struct xfs_buf *, struct xfs_log_item *);
|
|
extern void xfs_istale_done(struct xfs_buf *, struct xfs_log_item *);
|
|
extern void xfs_iflush_abort(struct xfs_inode *, bool);
|
|
extern int xfs_inode_item_format_convert(xfs_log_iovec_t *,
|
|
xfs_inode_log_format_t *);
|
|
|
|
extern struct kmem_zone *xfs_ili_zone;
|
|
|
|
#endif /* __XFS_INODE_ITEM_H__ */
|