Memory mapped files, allocation and JVM crash
I have some experience with memory-mapped (mmap) files and database storage. A long time ago I added mmap storage to the H2 SQL database. MapDB was the first pure-Java database to really utilize memory-mapped files.
There is an older article about mmap files that I wrote last year. This article summarises the experience I have had with mmap files so far. It also outlines a new approach I use in IODB, a storage engine developed for blockchain applications. This work is developed for and sponsored by Input Output HK.
Common problems with mmap files
Let’s start with the common issues memory-mapped (mmap) files have. They have been part of Java for 15 years, since the 1.4 release, but the mmap implementation is far from perfect. The JVM has a number of unfixed bugs related to memory-mapped files. One of them is 15 years old now, and there is a small group of fans who celebrate its birthday every year ;-)
Those issues are:
Limited number of mmap handles
There is a limited number of mmap handles per process (64K on Linux). If you use too many handles, the JVM process will not be able to mmap new files. Usually your code will fail with an exception:
Caused by: java.lang.OutOfMemoryError: Map failed
at sun.nio.ch.FileChannelImpl.map0(Native Method)
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:937)
But this failure may happen anywhere in the JVM (when it loads a new library, for example), and could even cause a JVM crash.
You can count the number of mmap handles a process is using with this command: sudo cat /proc/$PID/maps | wc -l
You can also increase the number of mmap handles in /proc/sys/vm/max_map_count.
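The same numbers can be read from inside the JVM. This is a minimal sketch, assuming Linux with /proc mounted; the class and method names are made up for illustration. Note that /proc/self/maps lists every mapping of the process (shared libraries, heap, thread stacks), not only mapped files, so the count is never zero.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, Linux only: both methods return -1 when /proc is not available.
final class MapCount {

    /** Number of memory mappings the current process holds (one line per mapping). */
    static long currentMappings() throws IOException {
        Path maps = Paths.get("/proc/self/maps");
        if (!Files.isReadable(maps)) {
            return -1;
        }
        long count = 0;
        try (BufferedReader reader = Files.newBufferedReader(maps)) {
            while (reader.readLine() != null) {
                count++;
            }
        }
        return count;
    }

    /** The kernel limit on mappings per process. */
    static long maxMappings() throws IOException {
        Path limit = Paths.get("/proc/sys/vm/max_map_count");
        if (!Files.isReadable(limit)) {
            return -1;
        }
        try (BufferedReader reader = Files.newBufferedReader(limit)) {
            return Long.parseLong(reader.readLine().trim());
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("mappings in use: " + currentMappings() + ", limit: " + maxMappings());
    }
}
```

Logging these two values when a store is opened makes it easier to see how close a process is to the limit before "Map failed" appears.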
No unmap
Each mapped ByteBuffer uses one mmap handle, but there is no official way to release this handle.
There is no official way to close a memory-mapped buffer. The JVM releases file handles only after the ByteBuffer gets garbage collected. If GC is not invoked for some time (large free heap), you will run out of file handles.
The workaround is to use the cleaner hack. It uses reflection to access an undocumented API that releases the mmap file handle before the ByteBuffer gets garbage collected.
However, the ByteBuffer cannot be used after it has been unmapped; any access to it leads to an invalid memory operation and causes the JVM to crash. So the cleaner hack should be used together with some locking mechanism that makes the ByteBuffer inaccessible after it has been closed.
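One possible shape of the cleaner hack combined with locking is sketched below. It assumes Java 9 or newer, where the unsupported method sun.misc.Unsafe.invokeCleaner is reachable through reflection (Java 8 needs a different reflective call). The GuardedMapping class is hypothetical and is not taken from MapDB or IODB.

```java
import java.io.IOException;
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical wrapper: readers share a lock, close() takes it exclusively before unmapping.
final class GuardedMapping implements AutoCloseable {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private MappedByteBuffer buffer; // set to null once the region is unmapped

    GuardedMapping(Path file, long size) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            buffer = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
        }
    }

    byte get(int index) {
        lock.readLock().lock();
        try {
            if (buffer == null) {
                throw new IllegalStateException("mapping is closed");
            }
            return buffer.get(index);
        } finally {
            lock.readLock().unlock();
        }
    }

    @Override
    public void close() throws Exception {
        lock.writeLock().lock(); // waits until no reader is inside get()
        try {
            if (buffer == null) {
                return;
            }
            MappedByteBuffer toUnmap = buffer;
            buffer = null; // from here on no reader can touch the unmapped memory
            unmap(toUnmap);
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Cleaner hack (assumes Java 9+): calls sun.misc.Unsafe.invokeCleaner reflectively. */
    private static void unmap(ByteBuffer buf) throws Exception {
        Class<?> unsafeClass = Class.forName("sun.misc.Unsafe");
        Field theUnsafe = unsafeClass.getDeclaredField("theUnsafe");
        theUnsafe.setAccessible(true);
        Method invokeCleaner = unsafeClass.getMethod("invokeCleaner", ByteBuffer.class);
        invokeCleaner.invoke(theUnsafe.get(null), buf);
    }
}
```

The lock costs something on every read, which is the price of never handing out a reference to the raw buffer. Any slice or duplicate that escapes the wrapper would bypass the guard and could still crash the JVM.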
Windows file delete
If a file region was mmaped on Windows, it is not possible to delete that file. The file remains undeletable even after it was correctly closed and unmapped and the JVM process exits. To delete this file, one has to restart the operating system (anyone remember the deleteAfterRestart hook in Delphi? :-)). This was reported, but the JVM team cannot solve it, since it is a Windows feature.
This problem is so annoying that we recommend not using mmap files on Windows.
JVM crash when disk gets full
It is possible to crash the JVM using only the official API:
- create an empty file
- mmap a large section of this file for writing
- write into the mmaped ByteBuffer
- the operating system will keep the modified memory pages for a delayed write
- if the delayed write fails, the OS will send a SIGBUS signal to the JVM process
- the JVM process does not handle the SIGBUS signal, so it gets killed by the Linux kernel
One would think IO errors are rare, but running out of free disk space is a very common situation. Most operating systems (such as Linux) use sparse files, which means no disk space is really allocated until the first write. If the operating system fails to allocate space (no free space on disk), the delayed write fails and the JVM gets killed.
There is an old bug report opened for MapDB. It was solved by preallocating space: when the file expands, the new part is filled using FileOutputStream to ensure all file space is allocated. However, this slows down MapDB inserts significantly, and a better solution is needed.
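The preallocation idea can be sketched roughly as follows. This is an illustration written for this article, not the actual MapDB code; the class name and the 64KB block size are arbitrary choices. It also assumes a conventional file system that really allocates blocks for written zeroes.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

// Hypothetical helper illustrating preallocation by zero-filling.
final class Preallocate {

    /** Appends zeroes until the file is at least newSize bytes long. */
    static void growWithZeroes(File file, long newSize) throws IOException {
        byte[] zeroes = new byte[64 * 1024];
        long remaining = newSize - file.length();
        try (FileOutputStream out = new FileOutputStream(file, true)) { // true = append
            while (remaining > 0) {
                int n = (int) Math.min(zeroes.length, remaining);
                out.write(zeroes, 0, n);
                remaining -= n;
            }
            // Force the data to disk now, so a full disk surfaces here
            // as an IOException instead of a SIGBUS later.
            out.getFD().sync();
        }
    }
}
```

Every expansion writes the whole new region once before any real data goes in, which is where the insert slowdown comes from.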
mmap regions have fixed size, and do not expand
FileChannel#map returns a fixed-size ByteBuffer. There is no way to change its size after the file expands.
MapDB solved this problem by mapping files in small buffers (1MB). However, that uses too many file handles, and would cause a crash when large stores are opened (>64GB, since 64K handles of 1MB each cover only 64GB).
It is also very slow on large files. It takes a long time to mmap thousands of new ByteBuffers (about 20,000 for 1MB regions and a 20GB file) when the file is opened. And it takes a long time to sync such a file to disk (iterate and call force() over all ByteBuffers).
2GB limit per single region
ByteBuffer uses a 32-bit signed integer for addressing, which limits the size of a single mmap region to 2 GB. One has to mmap multiple regions to handle larger files.
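Addressing a file through several regions comes down to splitting a long offset into a region index and a position inside that region. The sketch below shows the arithmetic; RegionedFile is a hypothetical class, and it ignores values that straddle a region boundary, which a real store has to handle.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical read-only view over a file, split into fixed-size mmap regions.
final class RegionedFile {
    private final long regionSize;
    private final MappedByteBuffer[] regions;

    private RegionedFile(long regionSize, MappedByteBuffer[] regions) {
        this.regionSize = regionSize;
        this.regions = regions;
    }

    static RegionedFile open(Path file, long regionSize) throws IOException {
        if (regionSize <= 0 || regionSize > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("a region must fit into int addressing");
        }
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long fileSize = ch.size();
            int count = (int) ((fileSize + regionSize - 1) / regionSize); // round up
            MappedByteBuffer[] mapped = new MappedByteBuffer[count];
            for (int i = 0; i < count; i++) {
                long start = i * regionSize;
                long length = Math.min(regionSize, fileSize - start); // last region may be shorter
                mapped[i] = ch.map(FileChannel.MapMode.READ_ONLY, start, length);
            }
            return new RegionedFile(regionSize, mapped);
        }
    }

    int regionCount() {
        return regions.length;
    }

    /** Reads one byte at an absolute file offset. */
    byte get(long offset) {
        int region = (int) (offset / regionSize);
        int within = (int) (offset % regionSize);
        return regions[region].get(within);
    }
}
```

With 1MB regions the array gets very long and every entry costs one mmap handle. With regions close to 2GB the same file needs only a handful.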
New approach
Over time I learned a better way to handle memory-mapped files. Here are a few pieces of the puzzle:
Read-only mmap, write with FileChannel
LMDB is a key-value store which provides a B-tree on top of mmap files. It is written in C, uses a low-level API and is pretty efficient. I recently studied its source code to see how its design could be reused in IODB.
The most interesting part is that LMDB does not use mmap files for writing. Files are mmaped in read-only mode, and it uses a FileChannel-like method to write changes to disk. Once changes are written and synced, they are visible to the read-only mmaped ByteBuffer.
This approach has some interesting features:
- There is no JVM crash if the disk becomes full and a delayed write fails.
- EDIT: FileChannel (or any buffered writer) does not preserve write order unless it is append-only. LMDB does a double sync to make changes durable.
- The performance difference between writable mmap files and FileChannel is minimal if larger blocks are used.
- This works extremely well for append-only files.
So this could be a new way to write append-only files such as a Write Ahead Log.
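The read-only mmap plus FileChannel combination looks roughly like this in Java. It is a minimal sketch with made-up names. It relies on the operating system sharing one page cache between the mapping and the channel, which holds on Linux, but the Java API itself leaves this visibility unspecified.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo: write with FileChannel, read back through a read-only mapping.
final class ReadOnlyMapWriter {

    static String writeThenReadThroughMap(Path file, String text) throws IOException {
        byte[] payload = text.getBytes(StandardCharsets.UTF_8);
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Size the file first, so the mapping never covers bytes past its end.
            writeFully(ch, ByteBuffer.allocate(payload.length), 0);
            MappedByteBuffer view = ch.map(FileChannel.MapMode.READ_ONLY, 0, payload.length);

            // All modifications go through the channel. A failed write is an
            // IOException here, not a SIGBUS that kills the JVM.
            writeFully(ch, ByteBuffer.wrap(payload), 0);
            ch.force(true);

            // The read-only mapping now shows what the channel wrote.
            byte[] seen = new byte[payload.length];
            view.get(seen);
            return new String(seen, StandardCharsets.UTF_8);
        }
    }

    private static void writeFully(FileChannel ch, ByteBuffer src, long position) throws IOException {
        while (src.hasRemaining()) {
            position += ch.write(src, position);
        }
    }
}
```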
mmap beyond end of file
Here is a solution to another problem: how to increase the size of an mmaped ByteBuffer if the underlying file expands, but buffers have a fixed size? MapDB maps the file in 1MB increments, but that is very slow and requires too many resources (file handles).
The solution is very simple, and works pretty well on Linux. Let’s mmap beyond the end of the file:
- create an empty file
- open a writable FileChannel for this file
- use FileChannel#map() to map a 2GB read-only ByteBuffer over this file
- the file size grows to 2GB, but it is a sparse file, so no disk space is actually used
- call FileChannel#truncate() to shrink the file size back to zero
Now there is an empty file with a large read-only ByteBuffer mmaped over it. The file can be safely modified (and expanded) with FileChannel. File changes will be visible in the mmaped ByteBuffer.
Once the file grows over 2GB, mmap another 2GB region and use FileChannel#truncate() again to shrink the file back to its real size.
This approach does not work on all operating systems. On Windows, FileChannel.truncate() does not shrink the file; the empty file still has 2GB. This method can still be used on Windows: we just use smaller regions (128MB vs 2GB), and the file can be truncated when the store is closed (after unmap).
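The map-then-truncate steps fit into a few lines of Java. The sketch below assumes Linux and a channel opened for both reading and writing. The class name is invented, and the demo uses a 64MB region instead of 2GB. Only bytes that were already written through the channel are read back, because touching mapped pages past the current end of the file raises SIGBUS.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, assumes Linux: maps a read-only region larger than the file.
final class MapBeyondEnd {

    static MappedByteBuffer mapBeyondEnd(FileChannel ch, long regionSize) throws IOException {
        long realSize = ch.size();
        // map() on a writable channel grows the (sparse) file to cover the region...
        MappedByteBuffer view = ch.map(FileChannel.MapMode.READ_ONLY, 0, regionSize);
        // ...so shrink it back; the mapping itself stays in place.
        ch.truncate(realSize);
        return view;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("beyond-end", ".dat");
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer view = mapBeyondEnd(ch, 64L * 1024 * 1024);
            System.out.println("file size after mapping: " + ch.size());

            // Expand the file through the channel, then read through the mapping.
            ch.write(ByteBuffer.wrap(new byte[] {42}), 0);
            ch.force(true);
            System.out.println("byte seen through mmap: " + view.get(0));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

A store built this way has to track the current file length itself and never read past it, since the buffer's capacity no longer says how much data is really there.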
Fast disk space allocation
Until recently I thought that a region of a sparse file has only two states:
- Large file exists, its disk space is not allocated, data were not written yet
- Large file exists, its disk space is allocated and data were written
But apparently on Linux there is a third state, where the disk space is allocated, but no data have been written yet. There is the fallocate call which handles that.
The JVM does not provide any “legal” way to call fallocate directly. There are other options:
- A JNA call, for example in this library.
- Another option is to execute the fallocate command (typically /usr/bin/fallocate).
- And finally, FileChannel.transferFrom() from a sparse file seems to have the same effect.
Why is this excellent? If a sparse file is in this allocated state, the delayed write cannot fail when the disk is full. So when the file expands, we do not have to overwrite the new region with zeroes, but can use fallocate instead. That is much faster.
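The external-command option could be wrapped like this. It is a sketch assuming Java 9+ and a fallocate binary on the PATH; the Fallocate class and its method are hypothetical. It returns false when the command is missing or fails, so the caller can fall back to zero-filling.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.concurrent.TimeUnit;

// Hypothetical wrapper around the fallocate command-line utility (Linux).
final class Fallocate {

    /** Returns true if fallocate reserved the first `length` bytes of the file. */
    static boolean tryAllocate(Path file, long length) throws InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(
                "fallocate", "-l", Long.toString(length), file.toAbsolutePath().toString());
        pb.redirectErrorStream(true);
        pb.redirectOutput(ProcessBuilder.Redirect.DISCARD); // Java 9+
        try {
            Process p = pb.start();
            if (!p.waitFor(30, TimeUnit.SECONDS)) {
                p.destroyForcibly();
                return false;
            }
            // A non-zero exit code covers "disk full" and "file system does not support fallocate".
            return p.exitValue() == 0;
        } catch (IOException e) {
            return false; // command not installed, for example on a non-Linux system
        }
    }
}
```

Spawning a process for every expansion is not free, so this fits best when the file grows in large steps.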
Comments
András Turi • 4 years ago
Hi!
About the “Windows file delete” part, I tried to reproduce the problem on Windows 7, but it didn’t seem to be that bad; after killing the JVM process manually while having a file memory-mapped, I was able to delete the file when the process was gone. I also tried the example code from the linked bug-report, indeed the mapped file is not deletable while the process is still running. However, after the process is gone, the file can be deleted. So I agree that there is the annoying problem of undeletable files, but an OS restart is not necessary, only killing the JVM process. Or am I missing something? Maybe this was worse in older Windows versions? Thanks Andras
–
David Bridges • 4 years ago
Awesome, thanks for sharing! I just recently discovered mmap files after running across this: https://www.mindjet.com/mma… and I cannot believe no one told me before how helpful they are, and the software itself! But since I just started, I needed some helpful tips to progress and this is amazing!!!
–
Francesco Nigro • 4 years ago
Hi!! Thanks to have shared, it is full of valuable informations! Just a couple of questions.. You mentioned that “FileChannel preserves write order”: MappedByteBuffer writes do not provide the same guarantees?
The content of a read-only mapping is in sync with what FileChannel has written only after a fsync? AFAIK on >2.6 Linux there is an unified page cache so I’m expecting that FileChannel and the read-only mapping are using the same pages, hence the content will be the same..am I missing something?