This chapter is the storage specification for MapDB files.
File operations (such as file create, rename or sync) must be atomic and must survive a system crash. In case of a crash, a recovery operation runs after restart. If a file operation did not finish, it reverts everything to the last stable state. That means file operations are atomic: they either succeed or fail without side effects.
To ensure crash resistance and atomicity MapDB relies on marker files. Those are empty files created and deleted using an atomic filesystem API. Marker files have the same name as the main file, but with a .$X suffix.
Empty file creation is an atomic operation, but populating a file with content is not. MapDB needs file population to be atomic, and uses the .$c marker file for that.
File creation sequence: create the .$c marker file with File.createNewFile(), then create the main file and populate it with content, sync the main file, and finally delete the .$c marker file.

In case of recovery, or when the file is being opened, follow this sequence: check whether the .$c marker file still exists; if it does, the main file was not populated and synced completely. In case of failure throw a data corruption exception.
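A minimal sketch of such a marker-protected file creation in Java could look as follows; the method names and the exact placement of the sync call are illustrative assumptions, not the actual MapDB implementation:

```java
import java.io.File;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.StandardOpenOption;

// Illustrative sketch of marker-protected file creation; not the actual MapDB code.
class FileCreateSketch {

    static void createWithMarker(File mainFile, byte[] content) throws IOException {
        File marker = new File(mainFile.getPath() + ".$c");

        // 1. atomically create the empty marker file
        if (!marker.createNewFile())
            throw new IOException("marker already exists: " + marker);

        // 2. create and populate the main file, then sync the content to disk
        try (FileChannel ch = FileChannel.open(mainFile.toPath(),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ch.write(ByteBuffer.wrap(content));
            ch.force(true);
        }

        // 3. only after the content is durable, delete the marker
        if (!marker.delete())
            throw new IOException("could not delete marker: " + marker);
    }

    // Opening / recovery: a surviving marker means the main file is incomplete.
    static void checkOnOpen(File mainFile) throws IOException {
        File marker = new File(mainFile.getPath() + ".$c");
        if (marker.exists())
            throw new IOException("data corruption: " + mainFile + " was not fully written");
    }
}
```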
A temporary file in MapDB is a writable file without crash protection (usually provided by a write-ahead log). Compared to File Create, this file stays open continuously and is only closed on system shutdown. If the file was not closed, it has most likely become corrupted and MapDB will refuse to reopen it.
The File Create sequence is also used for temporary files without crash protection. In that case the marker file stays in place while the main file is open for writing. If there is a crash, the recovery sequence will find the marker file, assume that the main file was not closed correctly, and refuse to open it. In this case the main file should be discarded and recreated from the original data source. Or the user can remove the marker file and try his luck.
File Rename is used in StoreDirect compaction. The store is recreated in a new file, and the old file is replaced with the new content. The ‘old file’ is the file being replaced; it will be deleted before File Rename. The ‘new file’ replaces the old file and has its name changed.
MapDB needs the file move to be atomic and supported on a wide range of platforms. There are the following problems: java.nio.file.Files#move is atomic, but it might fail in some cases.

File rename has the following sequence: delete the old file, then rename the new file to the old file’s name using java.nio.file.Files#move, in an atomic or non-atomic way; either way the rename operation must be finished and synced to disk.

TODO this does not work on Windows with memory-mapped files. We need plan B with Volume copy, without closing the files.
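A hedged sketch of the rename step using java.nio.file.Files#move is shown below; falling back to a non-atomic move when the platform refuses an atomic one is an assumption here, and marker handling is omitted:

```java
import java.io.IOException;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Illustrative rename step; the real MapDB sequence also manages marker files.
class RenameSketch {
    static void renameNewOverOld(Path newFile, Path oldFile) throws IOException {
        try {
            // preferred: atomic move
            Files.move(newFile, oldFile, StandardCopyOption.ATOMIC_MOVE);
        } catch (AtomicMoveNotSupportedException e) {
            // plan B: non-atomic move; must still be finished and synced before use
            Files.move(newFile, oldFile, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```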
The recovery sequence is simple. If the following files exist:

then discard the old file if present and continue the rename sequence from the step ‘delete old file’.
A rolling file is a single file that is continuously replaced with new content. To make the content replacement atomic, the new content is written into a new file and synced, and then the old file is deleted. The file name has a ‘.N’ suffix, where N is a sequential number increased with each commit. Rolling files are used in StoreTrivial.
The following sequence is used to update a rolling file with new content. There is an ‘old file’ with the original content and number N, and a ‘new file’ with number N+1.

And there is the following sequence for recovery:
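A minimal sketch of a rolling-file commit and recovery, assuming the ‘.N’ suffix convention described above; the helper names and the ‘highest valid N wins’ recovery rule are illustrative assumptions:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.*;

// Illustrative rolling-file update: write content under N+1, sync, then delete N.
class RollingFileSketch {

    static void commit(Path base, long oldN, byte[] newContent) throws IOException {
        Path newFile = base.resolveSibling(base.getFileName() + "." + (oldN + 1));
        try (FileChannel ch = FileChannel.open(newFile,
                StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE)) {
            ch.write(ByteBuffer.wrap(newContent));
            ch.force(true);            // make N+1 durable before removing N
        }
        Files.deleteIfExists(base.resolveSibling(base.getFileName() + "." + oldN));
    }

    // Recovery: the highest numbered file that passes validation is the current content.
    static Path recover(Path dir, String name) throws IOException {
        Path best = null;
        long bestN = -1;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, name + ".*")) {
            for (Path p : files) {
                String suffix = p.getFileName().toString().substring(name.length() + 1);
                long n;
                try { n = Long.parseLong(suffix); } catch (NumberFormatException e) { continue; }
                if (n > bestN) { bestN = n; best = p; }
            }
        }
        return best; // older leftovers can be deleted once this file is validated
    }
}
```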
On commit or close, the write cache needs to be flushed to disk; in MapDB this is called sync. We also need to detect corrupted files if the system crashes in the middle of a write.

There are the following ways to sync a file:
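For reference, the standard JVM primitives for forcing data to disk look roughly like this; which of them MapDB actually uses depends on the Volume implementation:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Overview of the file-sync primitives available on the JVM.
class SyncSketch {

    static void syncChannel(FileChannel ch) throws IOException {
        ch.force(true);                 // flush written data (and metadata) to disk
    }

    static void syncMapped(MappedByteBuffer buf) {
        buf.force();                    // flush a memory-mapped region to disk
    }

    static RandomAccessFile openSynchronous(File f) throws IOException {
        return new RandomAccessFile(f, "rwd");   // every write() is synced to disk
    }
}
```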
Every non-empty file created by MapDB has a 16-byte header. It contains a header value, the file version, a bitfield for optional features and an optional checksum for the entire file.
Bytes:
The file version can have the following values:

The feature bitfield has the following values. It is 8 bytes long; bit numbers here are counted from the least significant bit.

The checksum is either XXHash or CRC32. It is calculated as (checksum from the 16th byte to the end) + vol.getLong(0). If the computed checksum is 0, the value 1 is used instead; 0 indicates that the checksum is disabled.
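A hedged sketch of this checksum rule using CRC32; the byte order used to read the first 8 header bytes, and the choice of CRC32 over XXHash, are assumptions here:

```java
import java.util.zip.CRC32;

// Illustrative checksum rule: checksum(bytes 16..end) + first 8 header bytes as a long;
// zero is reserved to mean 'checksum disabled', so a computed 0 becomes 1.
class HeaderChecksumSketch {
    static long headerChecksum(byte[] file) {
        CRC32 crc = new CRC32();
        crc.update(file, 16, file.length - 16);

        long firstLong = 0;                       // big-endian read is an assumption
        for (int i = 0; i < 8; i++)
            firstLong = (firstLong << 8) | (file[i] & 0xFF);

        long sum = crc.getValue() + firstLong;
        return sum == 0 ? 1 : sum;                // 0 indicates checksum disabled
    }
}
```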
StoreDirect uses update-in-place. It keeps track of the free space released by record deletion and reuses it. It has zero protection from crashes; all updates are written directly into the store. Write operations are very fast, but data corruption is almost guaranteed when the JVM crashes. StoreDirect uses parity bits as passive protection from returning incorrect data after corruption; internal data corruption should be detected reasonably fast.

StoreDirect allocates space in ‘pages’ of size 1 MB. Operations such as readLong and readByte[] must be aligned so that they do not cross page boundaries.
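For example, a read can be checked against the page boundary with a simple test like this (an illustrative check, not MapDB’s actual code):

```java
// Illustrative check that a read of `len` bytes starting at `offset`
// stays within a single 1MB page.
class PageAlignSketch {
    static final long PAGE_SIZE = 1024 * 1024;

    static boolean crossesPageBoundary(long offset, int len) {
        return offset / PAGE_SIZE != (offset + len - 1) / PAGE_SIZE;
    }
}
```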
The header in the StoreDirect format is composed of a number of 8-byte longs. Each offset here is multiplied by 8.
This is followed by Long Stack Master Pointers. Those are used to track free space, unused recids and other information.
8 - Free recid Long Stack, unused recids are put here
9 - Free records 16 - Long Stack with offsets of free records with size 16
10 - Free records 32 - Long Stack with offsets of free records with size 32
etc…
8+4095 - Free records 65520 - Long Stack with offsets of free records with size 65520 bytes (the maximal unlinked record size). 4095 = 65520/16 is the number of Free Records Long Stacks.
8+4095+1 until 8+4095+4 - Unused long stacks - these could be used later for some other purpose.

The rest of the zero page (up to offset 1024*1024) is used as an Index Page (sometimes referred to as the Zero Index Page). The offset to the next Index Page (the First Index Page) is at 8+4095+4+1, the Zero Index Page checksum is at 8+4095+4+2, and the first recid value is at 8+4095+4+3.
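Following this layout, the byte offset of the master pointer for a given free-record size could be computed as in the sketch below (the helper names are hypothetical; slots are 8-byte longs, so slot indices are multiplied by 8):

```java
// Illustrative computation of master-pointer offsets in the zero page,
// following the slot layout described above.
class MasterPointerOffsets {
    static final long FREE_RECID_STACK_OFFSET = 8 * 8L;   // slot 8, in bytes

    // size must be a multiple of 16, between 16 and 65520
    static long freeRecordStackOffset(int size) {
        if (size < 16 || size > 65520 || size % 16 != 0)
            throw new IllegalArgumentException("bad size: " + size);
        return (8 + size / 16) * 8L;    // slot 8 + size/16, converted to a byte offset
    }
}
```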
An index page starts at N*PAGE_SIZE, except the Zero Index Page which starts at byte offset 8 * (8+4095+4+1). An index page contains at its start:
page+0 - pointer to the next index page, Parity 16
page+8 - checksum of all values on the page (add all values)

The rest of the index page is filled with index values.
An index value translates a Record ID (recid) into an offset in the file and the record size. The position and size of a record might change as data is updated; that makes the index table necessary. An Index Value is an 8-byte long with parity 1:
record size - use val>>48 to get it
record offset - use val&MOFFSET to get it
linked flag - linked && size==0 indicates a null record. Use val&MLINKED.
unused flag - val&MUNUSED
archive flag - val&MARCHIVE

The maximal record size is 64KB (16 bits). To store larger records StoreDirect links multiple records into a single one. A linked record starts with an Index Value where the Linked Record bit is set. If this bit is not set, the entire record is reserved for record data. If the Linked bit is set, then the first 8 bytes store a Record Link with the offset and size of the next part.
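A sketch of decoding an Index Value along these lines is shown below; the mask constants are placeholders chosen to match the description above and may differ from the exact values used in the StoreDirect code:

```java
// Illustrative decomposition of an 8-byte Index Value.
// Mask values are placeholders consistent with the description above,
// not necessarily the exact constants used by StoreDirect.
class IndexValueSketch {
    static final long MOFFSET  = 0x0000FFFFFFFFFFF0L; // offset bits (16-byte aligned)
    static final long MLINKED  = 0x8L;                 // linked-record flag
    static final long MUNUSED  = 0x4L;                 // unused flag
    static final long MARCHIVE = 0x2L;                 // archive flag

    static long recordSize(long val)   { return val >>> 48; }        // upper 16 bits
    static long recordOffset(long val) { return val & MOFFSET; }
    static boolean isLinked(long val)  { return (val & MLINKED) != 0; }
    static boolean isNull(long val)    { return isLinked(val) && recordSize(val) == 0; }
    static boolean isUnused(long val)  { return (val & MUNUSED) != 0; }
    static boolean isArchive(long val) { return (val & MARCHIVE) != 0; }
}
```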
The structure of a Record Link is similar to an Index Value, except the parity is 3:
size of the next part - use val>>48 to get it
offset of the next part - use val&MOFFSET to get it
bit 4 - true if the next record is linked, false if the next record is the last part and not linked (the tail of the linked record). Use val&MLINKED.

The tail of a linked record (its last part) does not have the 8-byte Record Link at its beginning.
Long Stack is a linked list of longs stored as part of the storage. It supports two operations: put and take; longs are returned in LIFO order. StoreDirect uses this structure to keep track of free space, and space allocation involves taking a long from a stack. There are multiple stacks: each record size has its own stack, and there is also a stack to keep track of free recids. For space usage there are in total 64K / 16 = 4096 Long Stacks (the maximal non-linked record size is 64K and records are aligned to 16 bytes).

A Long Stack is organized in a similar way to a linked record. It is stored in chunks; each chunk contains multiple long values and a link to the next chunk. Chunk sizes vary. Long values are stored in bidirectionally-packed form, to make unpacking possible in both directions; a single value occupies from 2 to 9 bytes. TODO explain bidi-packing, for now see the DataIO class.
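As a rough illustration of the idea (not necessarily DataIO’s exact encoding), a bidirectionally unpackable varint can mark both the first and the last byte of each value, which also explains the 2-byte minimum:

```java
// One possible bidirectional packing scheme, consistent with the 2-9 byte range
// mentioned above but not necessarily identical to DataIO's actual format:
// 7 payload bits per byte, high bit set on the first and on the last byte of a
// value, clear on interior bytes, so value boundaries are visible from both ends.
class BidiPackSketch {

    /** Packs a non-negative value at buf[pos]; returns the number of bytes written (2..9). */
    static int packBidi(byte[] buf, int pos, long value) {
        if (value < 0) throw new IllegalArgumentException("negative value");
        byte[] groups = new byte[9];
        int n = 0;
        do {
            groups[n++] = (byte) (value & 0x7F);   // 7-bit groups, least significant first
            value >>>= 7;
        } while (value != 0);
        if (n == 1) groups[n++] = 0;               // pad to the 2-byte minimum
        for (int i = 0; i < n; i++) {
            byte b = groups[n - 1 - i];            // write most significant group first
            if (i == 0 || i == n - 1) b |= 0x80;   // mark both ends of the packed value
            buf[pos + i] = b;
        }
        return n;
    }

    /** Unpacks a value reading forward from buf[pos]; reverse unpacking works symmetrically. */
    static long unpackBidi(byte[] buf, int pos) {
        long value = buf[pos] & 0x7F;              // first byte, end-marker bit stripped
        int i = 1;
        while (true) {
            byte b = buf[pos + i];
            value = (value << 7) | (b & 0x7F);
            if ((b & 0x80) != 0) return value;     // high bit on a later byte ends the value
            i++;
        }
    }
}
```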
Each Long Stack is identified by a master pointer, which points to its last chunk. The master pointer for each Long Stack is stored in the head of the storage file at its reserved offset (in the zero page). The head chunk is not linked directly; one has to fully traverse the Long Stack to get to the head.
The structure of a Long Stack Chunk is as follows:
Master Link structure:
Adding a value to a Long Stack goes as follows:

Taking a value:
The WAL protects the storage from data corruption when transactions are enabled. Technically it is a sequence of instructions written to an append-only file. Each instruction says something like: ‘write this data at this offset’. TODO explain WAL.

The WAL is stored in a sequence of files.

Each instruction starts with a single-byte header. The first 3 bits indicate the type of instruction; the last 5 bits contain a checksum to verify the instruction.
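Assuming ‘first 3 bits’ means the most significant bits of the byte, packing and reading such a header could look like this illustrative sketch:

```java
// Illustrative bit layout of a WAL instruction header byte:
// top 3 bits = instruction type, low 5 bits = per-instruction checksum.
class WalHeaderSketch {
    static byte header(int type, int checksum) {
        return (byte) (((type & 0x7) << 5) | (checksum & 31));
    }

    static int instructionType(byte header) { return (header >>> 5) & 0x7; }
    static int checksum(byte header)        { return header & 31; }
}
```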
Types of instructions, with the 5-bit checksum each type uses:

bit parity from offset & 31
(bit count from 15 bytes + 1) & 31
(bit count from size + bit count from offset + 1) & 31
(bit count from 3 bytes + 1) & 31
bit count from offset & 31
bit count from offset & 31
bit count from offset & 31

SortedTableMap uses its own file format. The file is split into pages, where the page size is a power of two and the maximal page size is 1MB.
Each page has a header. The header size is bigger for the zero page, because it also contains the file header. TODO header size.

After the header there is a series of 4-byte integers.

The first integer is the number of nodes on this page (N). It is followed by N*2 integers: the first N integers are offsets of the key arrays for each node, and the next N integers are offsets of the value arrays for each node. Offsets are relative to the page offset. The last integer points to the end of data; the rest of the page after that offset is filled with zeroes.
The offset of key array number i is stored at: PAGE_OFFSET + HEAD_SIZE + i*4.

The offset of value array number i is stored at: PAGE_OFFSET + HEAD_SIZE + (NODE_COUNT + i) * 4.
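A small sketch reading these offsets directly from the two formulas above (names are illustrative; the header size is a parameter since it is still marked TODO):

```java
import java.nio.ByteBuffer;

// Illustrative lookup of key/value array offsets inside a SortedTableMap page,
// following the two formulas above.
class SortedTablePageSketch {

    static int keyArrayOffset(ByteBuffer file, int pageOffset, int headSize, int i) {
        return file.getInt(pageOffset + headSize + i * 4);
    }

    static int valueArrayOffset(ByteBuffer file, int pageOffset, int headSize,
                                int nodeCount, int i) {
        return file.getInt(pageOffset + headSize + (nodeCount + i) * 4);
    }
}
```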