May 18, 2012
Manipulating the kernel's page cache with vmtouch
I came across vmtouch the other day, and it looked rather interesting. It lets you check how much of a file or directory is currently loaded into the kernel’s virtual memory page cache. This is useful when a process runs faster or slower than you expect: seeing how much of a file is already in RAM lets you test your assumptions about why.
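To make the residency check concrete, here is a rough Python sketch of the idea. As far as I can tell, this kind of information comes from the mincore(2) system call, so the sketch maps the file and asks the kernel which pages are resident. The function name resident_pages is my own invention, the code is Linux-only, and it assumes a 64-bit libc reachable through ctypes; it is not how vmtouch itself is implemented. Newer kernels may also report less precise results for files you cannot write to.

```python
import ctypes
import ctypes.util
import mmap
import os


def resident_pages(path):
    """Return (resident, total) page counts for a file, using mincore(2).

    Hypothetical helper for illustration; Linux-only.
    """
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    libc.mmap.restype = ctypes.c_void_p
    libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                          ctypes.c_int, ctypes.c_int, ctypes.c_long]
    libc.mincore.argtypes = [ctypes.c_void_p, ctypes.c_size_t,
                             ctypes.POINTER(ctypes.c_ubyte)]
    libc.munmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t]

    size = os.path.getsize(path)
    if size == 0:
        return 0, 0  # an empty file cannot be mapped and has no pages

    total = (size + mmap.PAGESIZE - 1) // mmap.PAGESIZE
    fd = os.open(path, os.O_RDONLY)
    try:
        # Mapping the file does not read it; it only gives mincore an address range.
        addr = libc.mmap(None, size, mmap.PROT_READ, mmap.MAP_SHARED, fd, 0)
        if addr is None or addr == ctypes.c_void_p(-1).value:
            raise OSError(ctypes.get_errno(), "mmap failed")
        try:
            vec = (ctypes.c_ubyte * total)()  # one status byte per page
            if libc.mincore(addr, size, vec) != 0:
                raise OSError(ctypes.get_errno(), "mincore failed")
            # The lowest bit of each byte is set when the page is in RAM.
            resident = sum(b & 1 for b in vec)
        finally:
            libc.munmap(addr, size)
    finally:
        os.close(fd)
    return resident, total
```

Calling resident_pages on a data file before and after a slow run gives a rough idea of how much of it was served from cache.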
You can also evict particular files that are already in the cache, or tell the kernel that you would like certain files/directories to be kept in RAM if possible (AFAIK the kernel will still evict things from the file cache if it decides it needs the RAM for other processes).
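For the eviction side, a minimal sketch of the same idea in Python is below. It assumes Linux and uses os.posix_fadvise with POSIX_FADV_DONTNEED, which is only a hint: the kernel is free to ignore it, and pages that are dirty or mapped by another process generally stay put. The function name evict is mine, and I am not claiming this matches what vmtouch does internally.

```python
import os


def evict(path):
    """Ask the kernel to drop a file's pages from the page cache.

    Hypothetical helper for illustration. This is advice, not a guarantee:
    dirty pages and pages mapped elsewhere are generally not dropped.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        # Offset 0 with length 0 means "the whole file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
```

The file itself is untouched; only the cached copy of its pages is affected, so the next read has to go back to disk.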
My thoughts on its use were to:
- Preload data files before processing to avoid page faults occurring during processing. For example, to load a large pre-computed dictionary or rainbow table for brute-forcing a password file.
- Load a database into memory post-boot of a DB machine and lock it in place so it performs optimally from the get go (this assumes you have enough RAM to keep the whole thing there). I believe that this is basically how Redis works; loading itself and its data into RAM before allowing queries. As DB machines typically are dedicated machines, there’s no point waiting for page faults to occur on parts of the database that haven’t been loaded yet.
- Ensure a static file proxy (whose only job is serving images/CSS/JS for a website) has the static file directory loaded into RAM when it is brought up, so it can start serving files from cache straight away rather than needing I/O. In this instance, unless you had plenty of RAM for everything, you’d allow the kernel to evict unused files from the cache as and when needed.
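The pre-loading in the ideas above boils down to reading the files once so their pages land in the page cache. Here is a minimal Python sketch of that, assuming a directory of regular files; the function name preload and the 1 MiB chunk size are my own choices. It does not lock anything in place, so the kernel remains free to evict the pages later.

```python
import os


def preload(root, chunk_size=1 << 20):
    """Read every regular file under root once to warm the page cache.

    Hypothetical helper for illustration. Returns the number of bytes read.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue  # skip sockets, broken symlinks and the like
            with open(path, "rb") as f:
                # Read in chunks and discard the data; the side effect
                # we want is the kernel caching the pages it just read.
                while True:
                    data = f.read(chunk_size)
                    if not data:
                        break
                    total += len(data)
    return total
```

Running something like this against the static file directory when the proxy comes up means the first real requests should not have to wait on disk.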
These pre-loading use cases could be incredibly useful when dealing with virtual machines whose backing store may not even be local to the VM (it may sit on a SAN or on a “cloud” storage service). As the disk I/O actually happens on another machine and is transmitted to the “disk” over the network, I/O operations are even more expensive than on a dedicated machine.