Log Store format and compaction
Designing reliable append-only stores is hard. This blog post outlines the design of a new append-only store for MapDB4. It has the following features:
- Append-only log based files. Old records are never overwritten
- Compaction runs in a background process and does not block IO
- Lock-free snapshots and isolation
- Parallel writers
- Transactions with rollback
Store is the basic low-level storage layer in MapDB. It maps a recid (record ID) to a binary record (byte[]).
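A minimal sketch of what such an interface could look like (the method names here are illustrative; the real MapDB interface also deals with serializers and record preallocation):

```java
// Hypothetical sketch of the low-level Store contract; names are
// illustrative and do not match the actual MapDB sources.
public interface Store {

    /** Inserts a new record and returns its recid. */
    long put(byte[] record);

    /** Returns the most recent version of the record, or null if deleted. */
    byte[] get(long recid);

    /** Replaces the record stored under an existing recid. */
    void update(long recid, byte[] record);

    /** Deletes the record under the given recid. */
    void delete(long recid);
}
```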
Files
LogStore uses a directory with multiple subdirectories:
- data - contains data files
- blob - stores large records that are too big to be stored directly in a log file
- generations - file generation markers, used for log replay
Data files
- A data file has a maximal size
  - store-wide, configured at store creation (typically 128MB)
  - the maximal value is 2GB (ByteBuffer limit), since the file offset is a signed int
- New records are appended to the end of the file
- When a file becomes too big
  - it is closed and
  - a new file is started
- Each Data File has a 4-byte number
  - file numbers start from 0 and increment for each new file
  - a newer file has a higher file number
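As a rough sketch of the append path (all names here are assumptions, not the actual MapDB code), rolling over to a new file could look like this:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch: append encoded log entries to the current data file and roll
// over to a new, higher-numbered file once the maximal size is reached.
class DataFileAppender {
    static final byte EOF_HEADER = 1;     // 'eof' entry, see Log Entries below

    final Path dataDir;
    final long maxFileSize;               // store-wide, typically 128MB
    long fileNumber = 0;                  // starts at 0, increments per file
    FileChannel current;

    DataFileAppender(Path dataDir, long maxFileSize) throws IOException {
        this.dataDir = dataDir;
        this.maxFileSize = maxFileSize;
        this.current = open(fileNumber);
    }

    FileChannel open(long number) throws IOException {
        return FileChannel.open(dataDir.resolve(Long.toString(number)),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
    }

    // Appends one log entry; old data is never overwritten.
    void append(ByteBuffer entry) throws IOException {
        if (current.position() + entry.remaining() > maxFileSize) {
            current.write(ByteBuffer.wrap(new byte[]{EOF_HEADER}));
            current.close();
            current = open(++fileNumber); // newer file has a higher number
        }
        current.write(entry);
    }
}
```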
Blob files
- Some records are too big for the log
- Compaction moves data around, so large records would cause too much overhead
- Large records are stored in separate files in a dedicated 'blob' directory
- The cut-off size is configurable at store creation, typically 64KB
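A sketch of that decision in the write path (BlobWriter and its fields are made-up names):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch: records above the cut-off size are written into their own
// file under the 'blob' directory; the log then only carries a small
// 'blob record' entry (header 6, see Log Entries below).
class BlobWriter {
    final Path blobDir;
    final int blobCutoff;        // typically 64KB, set at store creation
    long nextBlobFileNumber = 0;

    BlobWriter(Path blobDir, int blobCutoff) {
        this.blobDir = blobDir;
        this.blobCutoff = blobCutoff;
    }

    // Returns the blob file number if the record was diverted to a blob
    // file, or -1 if it is small enough to live directly in the log.
    long writeIfBlob(byte[] record) throws IOException {
        if (record.length <= blobCutoff)
            return -1;
        long number = nextBlobFileNumber++;
        Files.write(blobDir.resolve(Long.toString(number)), record);
        return number;
    }
}
```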
File Generations
- The file number is an N-bit unsigned number
  - N depends on the maximal Data File size; together the file number and the file offset form an 8-byte long
- Over time this number overflows and starts again from #0
- To deal with overflow, 'file generations' are used
  - they ensure that data file #0 is deleted by the time the number overflows
  - they find the oldest log file for log replay (data file #0 might not be the oldest)
- There are 64K file generations, stored in the highest two bytes of the file number
- Generations form a cyclic buffer
  - used generations are continuously allocated
  - as new files are added and old ones deleted, this allocated window moves through the cyclic buffer
  - there is a 'hole' in the cyclic buffer; the log starts just past the upper end of this hole
  - for example, if the used generations are 0-1, the hole is 2-64K; the generation just past its upper end is 0, so the log starts at generation 0
- Each used generation has an empty marker file in the 'generations' directory
  - if a file with the given name exists, the generation is in use
- A new Data File can only be created once its generation is marked as used
- A new Data File can not be created if that would leave no hole in the generation cyclic buffer
  - without a hole it would be impossible to find the oldest Data File and the start of the log; the sketch below shows how that lookup works
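Under these rules the start of the log can be found by scanning the marker files; a sketch assuming one marker file per used generation, named by its generation number:

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch: locate the generation where log replay starts. The log starts
// just past the upper end of the hole, i.e. at the used generation whose
// cyclic predecessor is unused.
class GenerationScanner {
    static final int GENERATIONS = 1 << 16;   // 64K generations

    static boolean isUsed(Path generationsDir, int gen) {
        return Files.exists(generationsDir.resolve(Integer.toString(gen)));
    }

    // Returns the generation of the oldest data file, or -1 for an empty
    // store. If every generation were used there would be no hole and no
    // answer, which is why that state is forbidden above.
    static int logStart(Path generationsDir) {
        for (int gen = 0; gen < GENERATIONS; gen++) {
            int prev = (gen - 1 + GENERATIONS) % GENERATIONS;
            if (isUsed(generationsDir, gen) && !isUsed(generationsDir, prev))
                return gen;
        }
        return -1;
    }
}
```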
Log Entries
- Each data file is composed of Log Entries
- Entries start at the beginning of the file; log replay traverses the file from start to end
- Each Log Entry starts with a 1-byte header that identifies its type
Log Entry Headers (they will most likely change in the final version):
- 0 - invalid header, reserved to detect data corruption
- 1 - eof, ends replay of this file and skips to the next file
- 2 - skip single byte, used for padding
- 3 - commit, written by tx commit; applies changes since the last commit
- 4 - rollback, written by tx rollback; discards changes since the last commit
- 5 - record, followed by a 4-byte signed int (record size), an 8-byte recid and the byte[] that contains the record
- 6 - blob record, followed by an 8-byte recid and an 8-byte long (the blob file number)
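A replay loop over these headers might look like the sketch below (entry handling is elided; the encoding follows the tentative table above):

```java
import java.io.DataInputStream;
import java.io.IOException;

// Sketch of replaying a single data file, entry by entry.
class LogReplay {
    static void replayFile(DataInputStream in) throws IOException {
        while (true) {
            int header = in.read();
            if (header < 0) return;              // physical end of file
            switch (header) {
                case 0: throw new IOException("corrupted log entry");
                case 1: return;                  // eof: continue with next file
                case 2: break;                   // padding byte, skip it
                case 3: /* commit: apply changes since last commit */ break;
                case 4: /* rollback: discard changes since last commit */ break;
                case 5: {                        // plain record
                    int size = in.readInt();     // 4-byte signed record size
                    long recid = in.readLong();  // 8-byte recid
                    byte[] record = new byte[size];
                    in.readFully(record);
                    // here: point the Index Table at this record version
                    break;
                }
                case 6: {                        // blob record
                    long recid = in.readLong();      // 8-byte recid
                    long blobFile = in.readLong();   // 8-byte blob file number
                    // here: point the Index Table into the blob file
                    break;
                }
                default: throw new IOException("unknown header: " + header);
            }
        }
    }
}
```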
Compaction
- Compaction removes older versions of records and reclaims space.
- It takes an old Data File and moves its active records into a newer Data File; the old file is then deleted
Algorithm:
- Select file #N
  - a big part of the file should be unused; some per-file statistics are needed for this
- Replay the file, iterating over all records
  - if a record is active (its recid is in the Index Table)
    - append this record to the new file
    - FileChannel#transferTo() can be used for the copy
  - if not, ignore this record
- At this point the file should have no active records
- Unmap and delete the file
TODO: can a transaction spread over multiple files?
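In code, the compaction pass over one file might look roughly like this (LogEntry and the two helpers are stubs; the sketch uses a stricter activity check that also compares the stored file position):

```java
import java.io.IOException;
import java.util.List;
import java.util.Map;

// Sketch of compacting one old data file: copy active records into the
// newest file, then drop the old one. All names are illustrative.
class Compactor {
    record LogEntry(long recid, long filePos, byte[] record) {}

    final Map<Long, Long> indexTable;   // recid -> current File Position

    Compactor(Map<Long, Long> indexTable) {
        this.indexTable = indexTable;
    }

    void compact(long oldFileNumber, List<LogEntry> entries) throws IOException {
        for (LogEntry e : entries) {
            Long current = indexTable.get(e.recid());
            // Active = the Index Table still points at this exact version.
            if (current != null && current == e.filePos()) {
                // FileChannel#transferTo() could copy the bytes directly.
                long newPos = appendToNewestFile(e);
                indexTable.put(e.recid(), newPos);
            }
            // else: a newer version exists elsewhere, ignore this entry
        }
        deleteOldFile(oldFileNumber);   // no active records remain in it
    }

    long appendToNewestFile(LogEntry e) throws IOException {
        throw new UnsupportedOperationException("sketch only");
    }

    void deleteOldFile(long fileNumber) {
        // unmap the memory-mapped file, then delete it
    }
}
```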
Index Table
- The most recent version of a record is determined by the Index Table
- The Index Table maps a Recid (long) to a File Position (long)
- The Index Table is updated when a record is inserted, updated or deleted
- When the store is opened, all Data Files must be replayed to reconstruct the Index Table
- The Index Table can be saved every Nth file; replay then only requires the few Data Files written since the last save
- Snapshots make an immutable copy of the Index Table
- The Index Table can be stored in
  - an on-heap Map<Long,Long>
  - a memory-mapped flat temporary file, with offset = recid * 8
  - some sort of persistent (crash-resistant) file, to prevent a full log replay
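Assuming the 4-byte file number sits in the high 32 bits and the signed-int file offset in the low 32 bits (the exact split depends on the configured maximal file size), the packed File Position could be handled like this:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: pack a file number and an in-file offset into the single
// 8-byte File Position stored in the Index Table.
final class FilePos {
    static long pack(int fileNumber, int offset) {
        return ((long) fileNumber << 32) | (offset & 0xFFFF_FFFFL);
    }

    static int fileNumber(long filePos) {
        return (int) (filePos >>> 32);
    }

    static int offset(long filePos) {
        return (int) filePos;               // low 32 bits
    }

    public static void main(String[] args) {
        // On-heap variant of the Index Table.
        Map<Long, Long> indexTable = new HashMap<>();
        long recid = 42;
        indexTable.put(recid, pack(3, 1_024));   // record lives in file #3
        long pos = indexTable.get(recid);
        System.out.println(fileNumber(pos) + " @ " + offset(pos));
    }
}
```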