Introducing funlinkat

April 9, 2019, 4:45 p.m.

Operating systems often carry long-standing problems. One of them was described by lcamtuf in 2002 (this is the earliest mention I was able to find, and this article motivated me to do the work). For many years we didn’t have a way to remove files in a race-free way. Let’s first take a closer look at the problem.

One of the first syscalls created in Unix-like systems is unlink. In FreeBSD this syscall is number 10 (source), and in Linux the number depends on the architecture, but for most of them it is also the tenth syscall (source). This indicates that it is one of the primary syscalls. The unlink syscall is very simple: we provide a single path to the file that we want to remove.

Removing a file

The “removing a file” process itself is very interesting, so let’s spend a moment to understand it. First, by removing the file we are removing a link from the directory to it. In Unix-like systems we can have many links to a single file (hard links). When we remove all links to the file, the file system will mark the blocks used by the file as free (different file systems behave differently here, but let’s not jump into a second digression). This is why the process is called unlinking and not “removing a file”. When we unlink a file, two or three things happen:

  1. We remove the entry with the filename from the directory.
  2. We decrease the file's link count (stored in the inode).
  3. If the link count drops to zero, the file is removed from the disk (again, this doesn't mean that the blocks on the disk will be filled with zeros, though this may happen depending on the file system and configuration. In most cases it means that the file system marks those blocks as free and uses them to write new data later).

This means that “removing a file” from a directory is mostly an operation on the directory and not on the file (inode) itself.
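The steps above can be observed from user space. Below is a minimal sketch in C, not a definitive implementation: the helper names link_count and link_demo are my own, and the link counts mentioned in the comments are what I would expect on a typical local file system. It creates a file, adds a second hard link, and then unlinks both names while keeping a descriptor open.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Return the link count of the inode behind fd, or -1 on error. */
static long link_count(int fd)
{
    struct stat sb;

    if (fstat(fd, &sb) == -1)
        return -1;
    return (long)sb.st_nlink;
}

/*
 * Hypothetical helper: create name1, add a second hard link name2,
 * then unlink both names. counts[0..3] receive the link count after
 * each step. Returns 0 on success and -1 on error (error paths may
 * leave the files behind; this is only a sketch).
 */
static int link_demo(const char *name1, const char *name2, long counts[4])
{
    int fd, rc = -1;

    assert(name1 != NULL && name2 != NULL && counts != NULL);

    fd = open(name1, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd == -1)
        return -1;
    counts[0] = link_count(fd);   /* one directory entry */

    if (link(name1, name2) == -1)
        goto out;
    counts[1] = link_count(fd);   /* two directory entries */

    if (unlink(name1) == -1)
        goto out;
    counts[2] = link_count(fd);   /* one directory entry again */

    if (unlink(name2) == -1)
        goto out;
    counts[3] = link_count(fd);   /* no names left, inode still open */
    rc = 0;
out:
    close(fd);
    return rc;
}
```

Note that the inode survives the last unlink as long as the descriptor stays open; the blocks can only be released after close.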

The situation is slightly different when removing directories, but we will not go into this topic today. 

Another interesting subject is what happens if our system performs only the first or the second step from the list. This depends on the file system, and it is also something we will leave for another time.

The race condition while unlinking

The problem with the unlink function, and even with unlinkat, is that we don’t have any guarantee about which file we are really unlinking. Let’s look at an example below:

We have gathered some stats about the file that we want to unlink and performed some checks on them. At the same time, another process removed our file and recreated it. When we finally try to remove our file, it is no longer the same file. It’s a classic race condition.

This situation occurs almost every day; for example, with process pid files. Pid files are files created by a process to indicate that it is already running and what its process identifier is. The process typically removes its pid file while it is exiting. It performs sanity checks using stat to make sure it will remove the right file, but if some other process overwrites the pid file after the checks and before the unlink, we remove the wrong pid file.

File descriptors to the rescue

In Unix-like operating systems we can get a handle for our file, called a file descriptor. A file descriptor guarantees that all the operations we perform on it are done on the same file (inode). Even if someone unlinks all of the directory entries, the operating system will not free the structures behind the file descriptor, and we can detect that the file was removed by someone and recreated (or just unlinked). So, for example, we have an alternative function, fstat, which allows us to get the file status of a given descriptor.
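A minimal sketch of this approach in C is shown here. The function name checked_write_then_unlink and the particular checks (regular file, owned by the effective user) are assumptions made for illustration; a real program would choose its own checks.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: open `path`, check the opened inode through
 * the descriptor, write to it, and finally unlink the name.
 * Returns 0 on success and -1 on error.
 */
static int checked_write_then_unlink(const char *path, const void *buf,
    size_t len)
{
    struct stat sb;
    int fd;

    assert(path != NULL && buf != NULL);

    fd = open(path, O_WRONLY);
    if (fd == -1)
        return -1;

    /* fstat describes the inode behind fd, whatever happens to the name. */
    if (fstat(fd, &sb) == -1 || !S_ISREG(sb.st_mode) ||
        sb.st_uid != geteuid()) {
        close(fd);
        return -1;
    }

    /* The write goes to the very inode we have just checked. */
    if (write(fd, buf, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }

    /*
     * Still racy: unlink takes a path, so it removes whatever the
     * name refers to now, which may no longer be our inode.
     */
    if (unlink(path) == -1) {
        close(fd);
        return -1;
    }
    close(fd);
    return 0;
}
```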

In the example above, thanks to using a file descriptor (fd), we receive the status of the file that we opened, and we can write to the file we wanted and on which we performed sanity checks, no matter whether the link was removed or recreated. Unfortunately, during unlinking we may still unlink the wrong file.

The problem with the design

We already know that a file may have many links on the disk which point to a single inode. What happens when we open the file? Simplifying: the kernel creates an in-memory representation of the inode (the inode itself is stored on the disk) called a vnode. This single representation is used by all processes to refer to the inode on the disk. If a process opens the same file (inode) using different names (for example through hard links), all those open files will be linked to the single vnode. That means that the pathname is not stored in the kernel (though this is not exactly true anymore: FreeBSD, for example, may have additional structures which contain the file name for debugging purposes, but you should not depend on them in the kernel).

This is basically the reason why we don’t have an fdunlink function that would take just the file descriptor instead of the path. If we performed an fdunlink syscall, the kernel wouldn’t know which directory entry we would like to remove. Another problem is more architectural: as we discussed earlier, unlinking is really an operation on the directory, not on the file (inode) itself, so using fdunlink(fd) may create some confusion, because we are not removing the inode corresponding to the file descriptor; we are performing an action on the directory which points to the file.

funlinkat

After some discussion we decided that the only sensible option for FreeBSD would be to create a funlinkat function. This syscall only performs an additional sanity check: that the directory entry we are removing corresponds to the inode referred to by the file descriptor.

int funlinkat(int dfd, const char *path, int fd, int flags);

The API above checks whether the path, looked up relative to dfd, points to the same vnode as fd. Thanks to that we remove the race condition, because all those sanity checks are performed in kernel mode while the file system is locked and there is no possibility to change it.

The fd parameter may be set to the FD_NONE value, which means that the sanity check should not be performed and funlinkat will behave just like unlinkat.

As you may have noticed, I often refer to the unlink syscall, but in the end the API looks like the unlinkat syscall. It is true that the unlink syscall is very old and kind of deprecated. That said, I referred to unlink because it’s just simpler. These days unlink simply uses the same code as unlinkat.

Linux

After I committed my code, FreeBSD Help started a thread on Twitter in which he pointed out to me that similar proposals have been made for Linux, but they are a few years old and still not upstreamed. lcamtuf also proposed some patches in 2002, but they were never upstreamed. That said, Linux is also working on getting such an API, and I’m planning to reach out to the authors to discuss how the API should look, because they proposed something slightly different.