BTT - Block Translation Table
=============================


1. Introduction
---------------

Persistent memory based storage is able to perform IO at byte (or more
accurately, cache line) granularity. However, we often want to expose such
storage as traditional block devices. The block drivers for persistent memory
will do exactly this. However, they do not provide any atomicity guarantees.
Traditional SSDs typically provide protection against torn sectors in hardware,
using stored energy in capacitors to complete in-flight block writes, or perhaps
in firmware. We don't have this luxury with persistent memory - if a write is in
progress and we experience a power failure, the block will contain a mix of old
and new data. Applications may not be prepared to handle such a scenario.

The Block Translation Table (BTT) provides atomic sector update semantics for
persistent memory devices, so that applications that rely on sector writes not
being torn can continue to do so. The BTT manifests itself as a stacked block
device, and reserves a portion of the underlying storage for its metadata. At
the heart of it is an indirection table that re-maps all the blocks on the
volume. It can be thought of as an extremely simple file system that only
provides atomic sector updates.


2. Static Layout
----------------

The underlying storage on which a BTT can be laid out is not limited in any way.
The BTT, however, splits the available space into chunks of up to 512 GiB,
called "Arenas".

Each arena follows the same layout for its metadata, and all references in an
arena are internal to it (with the exception of one field that points to the
next arena). The following depicts the "On-disk" metadata layout:


  Backing Store     +------->  Arena
+---------------+   |     +------------------+
|               |   |     | Arena info block |
|   Arena 0     +---+     |       4K         |
|     512G      |         +------------------+
|               |         |                  |
+---------------+         |                  |
|               |         |                  |
|   Arena 1     |         |   Data Blocks    |
|     512G      |         |                  |
|               |         |                  |
+---------------+         |                  |
| .             |         |                  |
| .             |         |                  |
| .             |         |                  |
|               |         |                  |
|               |         |                  |
+---------------+         +------------------+
                          |                  |
                          |     BTT Map      |
                          |                  |
                          |                  |
                          +------------------+
                          |                  |
                          |     BTT Flog     |
                          |                  |
                          +------------------+
                          | Info block copy  |
                          |       4K         |
                          +------------------+


3. Theory of Operation
----------------------


a. The BTT Map
--------------

The map is a simple lookup/indirection table that maps an LBA to an internal
block. Each map entry is 32 bits. The two most significant bits are special
flags, and the remaining form the internal block number.

Bit      Description
31 - 30 : Error and Zero flags - Used in the following way:

          Bit 31  Bit 30     Description
          ----------------------------------------------------------------
            0       0        Initial state. Reads return zeroes; Premap = Postmap
            0       1        Zero state: Reads return zeroes
            1       0        Error state: Reads fail; Writes clear 'E' bit
            1       1        Normal Block - has valid postmap

29 - 0  : Mappings to internal 'postmap' blocks
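
As a rough illustration, decoding an entry is just masking and shifting. A
minimal sketch follows; the enum and helper names are invented here, and only
the bit positions come from the table above:

enum map_state {
        MAP_INITIAL = 0,  /* 00: never written; reads return zeroes, premap == postmap */
        MAP_ZERO    = 1,  /* 01: zero state; reads return zeroes */
        MAP_ERROR   = 2,  /* 10: error state; reads fail, a write clears 'E' */
        MAP_NORMAL  = 3,  /* 11: normal block with a valid postmap */
};

static enum map_state map_entry_state(u32 raw)
{
        return (enum map_state)(raw >> 30);     /* bits 31:30 */
}

static u32 map_entry_postmap(u32 raw)
{
        return raw & ((1U << 30) - 1);          /* bits 29:0 */
}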

Some of the terminology that will be subsequently used:

External LBA : LBA as made visible to upper layers.
ABA          : Arena Block Address - Block offset/number within an arena
Premap ABA   : The block offset into an arena, which was decided upon by range
               checking the External LBA
Postmap ABA  : The block number in the "Data Blocks" area obtained after
               indirection from the map
nfree        : The number of free blocks that are maintained at any given time.
               This is the number of concurrent writes that can happen to the
               arena.

For example, after adding a BTT, we surface a disk of 1024G. We get a read for
the external LBA at 768G. This falls into the second arena, and of the 512G
worth of blocks that this arena contributes, this block is at 256G. Thus, the
premap ABA is 256G. We now refer to the map, and find that the mapping for block
'X' (256G) points to block 'Y', say '64'. Thus the postmap ABA is 64.
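
The arithmetic of that example can be sketched as follows (illustrative helper
only; real arenas also reserve space for their metadata, which this ignores):

#define BTT_SECTOR_SIZE 512ULL
#define ARENA_DATA_SIZE (512ULL << 30)  /* 512G of data per arena */

/* Hypothetical helper: range-check the external LBA to find the arena,
 * and the offset within that arena becomes the premap ABA. */
static void lba_to_premap_aba(u64 external_lba, u32 *arena, u32 *premap_aba)
{
        u64 byte_off = external_lba * BTT_SECTOR_SIZE;

        *arena = byte_off / ARENA_DATA_SIZE;    /* 768G falls in arena 1 */
        /* e.g. the 256G point within that arena */
        *premap_aba = (byte_off % ARENA_DATA_SIZE) / BTT_SECTOR_SIZE;
}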


b. The BTT Flog
---------------

The BTT provides sector atomicity by making every write an "allocating write",
i.e. every write goes to a "free" block. A running list of free blocks is
maintained in the form of the BTT flog. 'Flog' is a combination of the words
"free list" and "log". The flog contains 'nfree' entries, and an entry contains:

lba     : The premap ABA that is being written to
old_map : The old postmap ABA - after 'this' write completes, this will be a
          free block.
new_map : The new postmap ABA. The map will be updated to reflect this
          lba->postmap_aba mapping, but we log it here in case we have to
          recover.
seq     : Sequence number to mark which of the 2 sections of this flog entry is
          valid/newest. It cycles between 01->10->11->01 (binary) under normal
          operation, with 00 indicating an uninitialized state.
lba'    : alternate lba entry
old_map': alternate old postmap entry
new_map': alternate new postmap entry
seq'    : alternate sequence number.

Each of the above fields is 32-bit, making one entry 32 bytes. Entries are also
padded to 64 bytes to avoid cache line sharing or aliasing. Flog updates are
done such that for any entry being written, it:
a. overwrites the 'old' section in the entry based on sequence numbers
b. writes the 'new' section such that the sequence number is written last.
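
Putting the entry layout and the two update rules together, a sketch in C might
look like this. The field names follow the list above; persist_range() and
persist_fence() are hypothetical stand-ins for the platform's cache-writeback
and ordering primitives, not real kernel APIs:

/* One 16-byte section of a flog entry; two sections plus padding = 64 bytes. */
struct btt_flog_section {
        u32 lba;        /* premap ABA being written */
        u32 old_map;    /* postmap ABA that becomes free once the write completes */
        u32 new_map;    /* postmap ABA the map will be updated to */
        u32 seq;        /* 01 -> 10 -> 11 -> 01; 00 = uninitialized */
};

struct btt_flog_entry {
        struct btt_flog_section sec[2]; /* primary and alternate ("primed") sections */
        u8 padding[32];                 /* pad the 32 bytes of fields out to 64 bytes */
};

/* Hypothetical persistence primitives (cache writeback + store fence). */
void persist_range(const void *addr, size_t len);
void persist_fence(void);

static void flog_update(struct btt_flog_entry *ent, int stale,
                        u32 lba, u32 old_map, u32 new_map, u32 next_seq)
{
        struct btt_flog_section *s = &ent->sec[stale];

        /* rule (a): overwrite the section holding the older sequence number */
        s->lba = lba;
        s->old_map = old_map;
        s->new_map = new_map;
        persist_range(s, offsetof(struct btt_flog_section, seq));
        persist_fence();

        /* rule (b): persist the sequence number last; it validates this section */
        s->seq = next_seq;
        persist_range(&s->seq, sizeof(s->seq));
        persist_fence();
}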


c. The concept of lanes
-----------------------

While 'nfree' describes the number of IOs an arena can process concurrently,
'nlanes' is the number of IOs the BTT device as a whole can process.

    nlanes = min(nfree, num_cpus)

A lane number is obtained at the start of any IO, and is used for indexing into
all the on-disk and in-memory data structures for the duration of the IO. If
there are more CPUs than the max number of available lanes, then lanes are
protected by spinlocks.
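
One plausible lane scheme, sketched below with invented names, derives the lane
from the current CPU and only takes a spinlock when CPUs outnumber lanes; the
actual driver's locking may differ:

/* Hypothetical per-device lane state for the sketch. */
struct btt_lanes {
        unsigned int nlanes;    /* min(nfree, num_cpus) */
        spinlock_t *locks;      /* one lock per lane */
        bool shared;            /* true when CPUs outnumber lanes */
};

static unsigned int lane_acquire(struct btt_lanes *l)
{
        /* get_cpu() also disables preemption, pinning us to this CPU's lane */
        unsigned int lane = get_cpu() % l->nlanes;

        if (l->shared)
                spin_lock(&l->locks[lane]);     /* serialize CPUs sharing a lane */
        return lane;
}

static void lane_release(struct btt_lanes *l, unsigned int lane)
{
        if (l->shared)
                spin_unlock(&l->locks[lane]);
        put_cpu();
}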


d. In-memory data structure: Read Tracking Table (RTT)
------------------------------------------------------

Consider a case where we have two threads, one doing reads and the other,
writes. We can hit a condition where the writer thread grabs a free block to do
a new IO, but the (slow) reader thread is still reading from it. In other words,
the reader consulted a map entry, and started reading the corresponding block. A
writer started writing to the same external LBA, and finished the write updating
the map for that external LBA to point to its new postmap ABA. At this point the
internal, postmap block that the reader is (still) reading has been inserted
into the list of free blocks. If another write comes in for the same LBA, it can
grab this free block, and start writing to it, causing the reader to read
incorrect data. To prevent this, we introduce the RTT.

The RTT is a simple, per-arena table with 'nfree' entries. Every reader inserts
into rtt[lane_number] the postmap ABA it is reading, and clears it after the
read is complete. Every writer thread, after grabbing a free block, checks the
RTT for its presence. If the postmap free block is in the RTT, it waits till the
reader clears the RTT entry, and only then starts writing to it.
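
The resulting reader/writer protocol could look roughly like the following
sketch. Names are invented, and real code would also need memory barriers
between the map read and the RTT update:

#define RTT_INVALID ((u32)~0U)          /* sentinel: this lane is not reading */

/* Hypothetical per-arena RTT for the sketch: one slot per lane. */
struct arena_rtt {
        u32 slot[64];                   /* illustrative fixed nfree */
};

/* Reader: publish the postmap ABA before touching the data block. */
static void rtt_begin_read(struct arena_rtt *rtt, unsigned int lane, u32 postmap)
{
        WRITE_ONCE(rtt->slot[lane], postmap);
}

static void rtt_end_read(struct arena_rtt *rtt, unsigned int lane)
{
        WRITE_ONCE(rtt->slot[lane], RTT_INVALID);
}

/* Writer: after picking a free block, wait until no lane is still reading it. */
static void rtt_wait_for_readers(struct arena_rtt *rtt, unsigned int nfree,
                                 u32 free_block)
{
        unsigned int i;

        for (i = 0; i < nfree; i++)
                while (READ_ONCE(rtt->slot[i]) == free_block)
                        cpu_relax();    /* spin until the slow reader clears its entry */
}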


e. In-memory data structure: map locks
--------------------------------------

Consider a case where two writer threads are writing to the same LBA. There can
be a race in the following sequence of steps:

    free[lane] = map[premap_aba]
    map[premap_aba] = postmap_aba

Both threads can update their respective free[lane] with the same old, freed
postmap_aba. This has made the layout inconsistent by losing a free entry, and
at the same time, duplicating another free entry for two lanes.

To solve this, we could have a single map lock (per arena) that has to be taken
before performing the above sequence, but we feel that could be too contentious.
Instead we use an array of (nfree) map_locks that is indexed by
(premap_aba modulo nfree).
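
In outline, the racy two-step sequence is simply performed under one of the
hashed locks. A sketch with hypothetical types and accessors (also reused by
the later sketches); map_read() is taken to return the postmap ABA currently
mapped for a premap ABA, with initial-state entries resolving to themselves:

/* Hypothetical in-memory arena state used by this and later sketches. */
struct arena_state {
        unsigned int nfree;
        spinlock_t *map_locks;  /* nfree locks, hashed by premap ABA */
        u32 *freelist;          /* one pending-free postmap block per lane */
};

u32 map_read(struct arena_state *a, u32 premap_aba);
void map_write(struct arena_state *a, u32 premap_aba, u32 postmap_aba);

static void map_update(struct arena_state *a, unsigned int lane,
                       u32 premap_aba, u32 postmap_aba)
{
        spinlock_t *lock = &a->map_locks[premap_aba % a->nfree];

        spin_lock(lock);
        /* the displaced postmap block becomes this lane's free block */
        a->freelist[lane] = map_read(a, premap_aba);
        map_write(a, premap_aba, postmap_aba);
        spin_unlock(lock);
}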


f. Reconstruction from the Flog
-------------------------------

On startup, we analyze the BTT flog to create our list of free blocks. We walk
through all the entries, and for each lane we look only at the most recent of
its two possible 'sections' (based on the sequence number). The reconstruction
rules/steps are simple:
- Read map[log_entry.lba].
- If log_entry.new matches the map entry, then log_entry.old is free.
- If log_entry.new does not match the map entry, then log_entry.new is free.
  (This case can only be caused by power-fails/unsafe shutdowns)
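
A compact sketch of that recovery pass, reusing the hypothetical arena_state,
map_read() and btt_flog_section from the earlier sketches; flog_read_newest()
is assumed to return the section of a lane's entry with the newer valid
sequence number:

struct btt_flog_section flog_read_newest(struct arena_state *a, unsigned int lane);

static void btt_freelist_init(struct arena_state *a)
{
        unsigned int i;

        for (i = 0; i < a->nfree; i++) {
                struct btt_flog_section ent = flog_read_newest(a, i);

                if (map_read(a, ent.lba) == ent.new_map) {
                        /* The logged write completed; its old block is free. */
                        a->freelist[i] = ent.old_map;
                } else {
                        /* The map was never updated (power failed between the
                         * flog write and the map write); new_map is still free. */
                        a->freelist[i] = ent.new_map;
                }
        }
}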


g. Summarizing - Read and Write flows
-------------------------------------

Read:

1.  Convert external LBA to arena number + pre-map ABA
2.  Get a lane (and take lane_lock)
3.  Read map to get the entry for this pre-map ABA
4.  Enter post-map ABA into RTT[lane]
5.  If TRIM (Zero) flag set in map, return zeroes, and end IO (go to step 8)
6.  If ERROR flag set in map, end IO with EIO (go to step 8)
7.  Read data from this block
8.  Remove post-map ABA entry from RTT[lane]
9.  Release lane (and lane_lock)

Write:

1.  Convert external LBA to Arena number + pre-map ABA
2.  Get a lane (and take lane_lock)
3.  Use lane to index into in-memory free list and obtain a new block, next flog
    index, next sequence number
4.  Scan the RTT to check if the free block is present, and spin/wait if it is.
5.  Write data to this free block
6.  Read map to get the existing post-map ABA entry for this pre-map ABA
7.  Write flog entry: [premap_aba / old postmap_aba / new postmap_aba / seq_num]
8.  Write new post-map ABA into map.
9.  Write old post-map entry into the free list
10. Calculate next sequence number and write into the free list entry
11. Release lane (and lane_lock)


4. Error Handling
=================

An arena would be in an error state if any of the metadata is corrupted
irrecoverably, either due to a bug or a media error. The following conditions
indicate an error:
- Info block checksum does not match (and recovering from the copy also fails)
- The sum of mapped blocks (from the map) and free blocks (from the BTT flog)
  does not uniquely and entirely address all available internal blocks
- Rebuilding the free list from the flog reveals missing/duplicate/impossible
  entries
- A map entry is out of bounds

If any of these error conditions are encountered, the arena is put into a
read-only state using a flag in the info block.
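
The second condition is essentially a coverage check: every internal block must
be claimed exactly once, either by a map entry or by a free list entry. A
sketch, again reusing the hypothetical helpers from the earlier sections:

/* Returns 0 if every internal block is claimed exactly once. */
static int arena_check_consistency(struct arena_state *a, u32 external_nlba)
{
        u32 internal_nlba = external_nlba + a->nfree;
        u32 i;
        int err = 0;
        u8 *seen = kcalloc(internal_nlba, sizeof(*seen), GFP_KERNEL);

        if (!seen)
                return -ENOMEM;

        for (i = 0; i < external_nlba && !err; i++) {
                u32 post = map_read(a, i);      /* blocks reachable via the map */

                if (post >= internal_nlba || seen[post]++)
                        err = -EIO;             /* out of bounds or claimed twice */
        }
        for (i = 0; i < a->nfree && !err; i++) {
                u32 blk = a->freelist[i];       /* blocks on the rebuilt free list */

                if (blk >= internal_nlba || seen[blk]++)
                        err = -EIO;
        }
        for (i = 0; i < internal_nlba && !err; i++)
                if (!seen[i])
                        err = -EIO;             /* a block nobody claims */

        kfree(seen);
        return err;
}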


5. In-kernel usage
==================

Any block driver that supports byte granularity IO to the storage may register
with the BTT. It will have to provide the rw_bytes interface in its
block_device_operations struct:

    int (*rw_bytes)(struct gendisk *, void *, size_t, off_t, int rw);

It may register with the BTT after it adds its own gendisk, using btt_init:

    struct btt *btt_init(struct gendisk *disk, unsigned long long rawsize,
                         u32 lbasize, u8 uuid[], int maxlane);

Note that maxlane is the maximum amount of concurrency the driver wishes to
allow the BTT to use.

The BTT 'disk' appears as a stacked block device that grabs the underlying block
device in O_EXCL mode.

When the driver wishes to remove the backing disk, it should similarly call
btt_fini using the same struct btt* handle that was provided to it by btt_init.

    void btt_fini(struct btt *btt);
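
Putting the interface together, registration from a hypothetical persistent
memory block driver might look like the sketch below. It is built only from the
prototypes above and assumes btt_init() returns NULL on failure; the driver's
own setup and error handling are elided:

/* Hypothetical driver state for the sketch. */
struct pmem_device {
        struct gendisk *disk;
        struct btt *btt;
        unsigned long long size;        /* raw capacity handed to the BTT */
        u8 uuid[16];
};

static int pmem_attach_btt(struct pmem_device *pmem)
{
        /* rw_bytes must already be wired up in pmem->disk->fops */
        pmem->btt = btt_init(pmem->disk, pmem->size, 512 /* lbasize */,
                             pmem->uuid, num_online_cpus() /* maxlane */);
        if (!pmem->btt)
                return -ENODEV;
        return 0;
}

static void pmem_detach_btt(struct pmem_device *pmem)
{
        /* Tear down the stacked BTT disk before removing the backing disk. */
        btt_fini(pmem->btt);
}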