The big kernel lock (BKL) is an old serialization method that we are trying to get rid of, replacing it with more fine-grained locking, in particular mutexes, spinlocks and RCU, where appropriate.
The BKL is a recursive lock, meaning that you can take it from a thread that already holds it. This may sound convenient, but it easily introduces all sorts of bugs. Another problem is that the BKL is automatically released when a thread sleeps. This avoids lock order problems with mutexes in some circumstances, but it also creates more problems because it makes it really hard to track which code is executed under the lock.
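To make the two properties above concrete, here is a rough single-threaded, user-space model of them. It is only a sketch and not kernel code: struct task_model, bkl_taken and all the model_* functions are made-up names for illustration, loosely following the idea of a per-task depth counter where -1 means "not held".

```c
#include <stdbool.h>

/* Hypothetical stand-in for a task; lock_depth == -1 means "BKL not held". */
struct task_model {
    int lock_depth;
};

/* Hypothetical stand-in for the single global lock itself. */
static bool bkl_taken;

/* Recursive acquire: only the outermost call touches the global lock. */
static void model_lock_kernel(struct task_model *t)
{
    if (++t->lock_depth == 0)
        bkl_taken = true;
}

/* Recursive release: only the outermost call drops the global lock. */
static void model_unlock_kernel(struct task_model *t)
{
    if (--t->lock_depth < 0)
        bkl_taken = false;
}

/* Going to sleep silently drops the lock but keeps the nesting depth. */
static void model_sleep_begin(struct task_model *t)
{
    if (t->lock_depth >= 0)
        bkl_taken = false;
}

/* Waking up silently retakes the lock before the task continues. */
static void model_sleep_end(struct task_model *t)
{
    if (t->lock_depth >= 0)
        bkl_taken = true;
}
```

Between model_sleep_begin() and model_sleep_end() the task still thinks it is "inside" the lock (its depth is unchanged), yet anything else may run and take the lock in the meantime. That is the tracking problem described above.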
A number of areas need to be taken care of independently:
llseek
The problem is in the llseek callback of struct file_operations, which drivers and filesystems implement to move the file pointer.
The problem arises when a driver or filesystem does not implement it. In that case, the VFS layer calls a default implementation, default_llseek(), which just changes the file pointer and does nothing else. But to protect against concurrent llseek calls on the same file, default_llseek() protects this file pointer change using the BKL.
This is rather evil: not only do we have a lot of existing drivers that don't implement llseek, but every new driver or filesystem that gets merged without an llseek implementation also falls back to default_llseek().
It means two things: the bleeding can't stop, and it keeps getting worse. And we can't remove the BKL (or at least make it modular) until we get rid of default_llseek() (or at least make it modular :-).
So the strategy is to give these drivers a sane llseek implementation. Once no driver relies on the implicit fallback anymore, we can start punching default_llseek() (i.e. making it modular: build it only if drivers depend on the BKL, or maybe with an even more granular dependency).
The sane existing implementations are the following:
- generic_file_llseek() does the same thing as default_llseek(): it moves the file pointer according to the offset given by the user. The difference is that it protects the operation using the inode mutex instead of the BKL.
If you see that the driver/filesystem uses file->f_pos, or the offset parameter in one of its file operations callbacks, then choose this. If the driver uses the file pointer for its work, it will expect llseek to behave as it did before with default_llseek(). Only the protection changes.
- ...and about this protection:
- default_llseek(), if you see that the driver uses the file offset as described above and also uses the BKL somewhere. That requires a bit of review. Try to see whether the offset is ever read or written under the BKL. If it clearly isn't, you can pick generic_file_llseek() instead, but take care to describe why it's safe in your changelog. Actually, describe why it's safe in your changelog whichever callback you choose. Often a simple "this driver doesn't use the bkl" suffices.
- noop_llseek(), if you see that the driver never uses the file pointer. This callback won't update the file pointer; in fact it won't do anything, which is exactly what we want if the file pointer is unused. Note that if you use this, you need to base your work on the -mmotm tree (in fact, for all of this, I would suggest basing your work on linux-next: http://git.kernel.org/?p=linux/kernel/git/sfr/linux-next.git;a=summary)
- no_llseek() could be the right solution sometimes, but preferably don't use it. It might waste our time. For background: a driver that doesn't implement llseek is actually seekable because it falls back to default_llseek(). So a userspace program can seek on its files, and there may be some that do, even if it has no effect for the driver. In this case it is tempting to use no_llseek(): it marks the file as non-seekable, and any userspace program that tries to seek on such a file will get an -ESPIPE error. So userspace programs that seek even though it is useless may break because of that. Most maintainers refuse such a change, which is why it's better to use noop_llseek(): it does nothing but doesn't break things either.
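The difference between these choices, as seen from userspace, can be sketched with a small user-space model. This is not kernel code: struct seek_file_model and the model_* functions are made-up names, locking is left out entirely, and default_llseek() and generic_file_llseek() are folded into one function because, as described above, they differ only in which lock protects the update.

```c
#include <errno.h>
#include <stdio.h>   /* SEEK_SET, SEEK_CUR, SEEK_END */

/* Hypothetical stand-in for struct file: just a position and a size. */
struct seek_file_model {
    long long f_pos;
    long long size;
};

/* Models default_llseek()/generic_file_llseek(): really moves f_pos. */
static long long model_updating_llseek(struct seek_file_model *f,
                                       long long offset, int whence)
{
    switch (whence) {
    case SEEK_SET:
        break;
    case SEEK_CUR:
        offset += f->f_pos;
        break;
    case SEEK_END:
        offset += f->size;
        break;
    default:
        return -EINVAL;
    }
    if (offset < 0)
        return -EINVAL;
    f->f_pos = offset;
    return offset;
}

/* Models noop_llseek(): the call succeeds but f_pos is left alone. */
static long long model_noop_llseek(struct seek_file_model *f,
                                   long long offset, int whence)
{
    (void)offset;
    (void)whence;
    return f->f_pos;
}

/* Models no_llseek(): the file is not seekable, the call fails. */
static long long model_no_llseek(struct seek_file_model *f,
                                 long long offset, int whence)
{
    (void)f;
    (void)offset;
    (void)whence;
    return -ESPIPE;
}
```

A program that blindly seeks keeps working with the first two and starts failing with the third, which is the compatibility concern above. In a real driver the change itself is usually a single line in the struct file_operations initializer, for example `.llseek = noop_llseek,`.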
ioctl
Like llseek, file_operations can contain a .ioctl callback. This is always called with the BKL held. In order to remove the BKL from the core VFS code, all file_operations should be converted to use the .unlocked_ioctl callback instead. This can be done in one of three ways:
- removing the BKL from the particular file entirely, either by proving that it's not needed, or by replacing it with a localized lock.
- adding explicit lock_kernel()/unlock_kernel() calls in the ioctl method.
- after a patch from Arnd has been applied, changing the name of the callback from .ioctl= to .locked_ioctl=, and adding .unlocked_ioctl=deprecated_ioctl. This does not change any of the code, but at least makes it possible to move the BKL usage from VFS to a separate module.
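The second option (explicit locking in the ioctl method) usually amounts to a thin wrapper around the old handler. The sketch below shows the shape of it as a user-space model, not kernel code: the foo_* names, the *_model types and the lock stubs are all made up for illustration. The one real detail it mirrors is the signature change: the old .ioctl callback takes an inode and a file and returns int, while .unlocked_ioctl takes only the file and returns long.

```c
#include <errno.h>

/* Hypothetical stand-ins for struct inode and struct file. */
struct ioctl_inode_model { int unused; };
struct ioctl_file_model { struct ioctl_inode_model *inode; };

/* Stubs standing in for lock_kernel()/unlock_kernel(): just count nesting. */
static int bkl_depth;
static void lock_kernel_model(void)   { bkl_depth++; }
static void unlock_kernel_model(void) { bkl_depth--; }

/* Records whether the old handler ran with the (model) BKL held. */
static int foo_ioctl_saw_bkl;

/* Old-style handler: written assuming the caller already holds the BKL. */
static int foo_ioctl(struct ioctl_inode_model *inode,
                     struct ioctl_file_model *file,
                     unsigned int cmd, unsigned long arg)
{
    (void)inode;
    (void)file;
    (void)arg;
    foo_ioctl_saw_bkl = (bkl_depth > 0);
    return cmd == 1 ? 0 : -ENOTTY;   /* command 1 is the only one known here */
}

/* New-style handler: takes the BKL itself, then calls the old code as is. */
static long foo_unlocked_ioctl(struct ioctl_file_model *file,
                               unsigned int cmd, unsigned long arg)
{
    long ret;

    lock_kernel_model();
    ret = foo_ioctl(file->inode, file, cmd, arg);
    unlock_kernel_model();
    return ret;
}
```

The driver behaves exactly as before, but the BKL is now taken visibly in the driver rather than in the VFS, which makes it a clear target for the first option later on.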
TTY layer
Block layer
File locking (fs/flock.c)
super block operations