The big kernel lock (BKL) is an old serialization method that we are trying to get rid of, replacing it with more fine-grained locking, in particular mutex, spinlock and RCU, where appropriate.

The BKL is a recursive lock, meaning that you can take it from a thread that already holds it. This may sound convenient, but it easily introduces all sorts of bugs. Another problem is that the BKL is automatically released when a thread sleeps. This avoids lock order problems with mutexes in some circumstances, but it also creates more problems because it makes it really hard to track what code is executed under the lock.

A number of areas need to be taken care of independently:

== llseek ==

The problem is in the {{{llseek}}} callback of {{{struct file_operations}}}. Drivers and filesystems implement it to move the file pointer. The problem arises when a driver or a filesystem does not implement it. In this case, the VFS layer calls a default implementation, {{{default_llseek()}}}, that just changes the file pointer and does nothing else. But to protect against concurrent calls to {{{llseek}}} on the same file, {{{default_llseek()}}} protects this file pointer change using the BKL.

This is rather evil, because not only do we have a lot of existing drivers that don't implement {{{llseek}}}, but every new driver or filesystem that gets merged without implementing {{{llseek}}} also falls back to {{{default_llseek()}}}. It means two things: the bleeding does not stop and actually gets worse over time, and we can't remove the BKL (or at least make it modular) until we get rid of {{{default_llseek()}}} (or at least make it modular :) ).

So the strategy is to give a sane {{{llseek}}} implementation to these drivers. Once no driver is left without an {{{llseek}}} callback, we can start punching {{{default_llseek()}}} (i.e. making it modular: build it only if drivers depend on the BKL, or maybe with an even more granular dependency).
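To make the fallback concrete, here is a minimal userspace sketch of the dispatch described above. It is not the kernel's code: every {{{mock_}}} name and {{{mydrv_llseek()}}} are hypothetical stand-ins, the "BKL" is just a flag plus a counter, and only {{{SEEK_SET}}} and {{{SEEK_CUR}}} are modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>	/* SEEK_SET, SEEK_CUR */

struct mock_file;

/* Simplified stand-in for struct file_operations (llseek only). */
struct mock_file_operations {
	long long (*llseek)(struct mock_file *, long long, int);
};

/* Simplified stand-in for struct file. */
struct mock_file {
	long long f_pos;
	const struct mock_file_operations *f_op;
};

int mock_bkl_held;		/* 1 while the mock big lock is held */
int mock_bkl_acquisitions;	/* how often the mock big lock was taken */

void mock_lock_kernel(void)
{
	mock_bkl_held = 1;
	mock_bkl_acquisitions++;
}

void mock_unlock_kernel(void)
{
	assert(mock_bkl_held);	/* catch unbalanced unlocks */
	mock_bkl_held = 0;
}

/* Mock fallback: update the file pointer under the big lock. */
long long mock_default_llseek(struct mock_file *file, long long offset,
			      int whence)
{
	long long ret = -EINVAL;

	mock_lock_kernel();
	if (whence == SEEK_CUR)
		offset += file->f_pos;
	if ((whence == SEEK_SET || whence == SEEK_CUR) && offset >= 0) {
		file->f_pos = offset;
		ret = offset;
	}
	mock_unlock_kernel();
	return ret;
}

/* Hypothetical driver callback: ignores the request, never takes the lock. */
long long mydrv_llseek(struct mock_file *file, long long offset, int whence)
{
	(void)offset;
	(void)whence;
	return file->f_pos;
}

/* Mock dispatch: use the driver's callback, else the locked fallback. */
long long mock_vfs_llseek(struct mock_file *file, long long offset, int whence)
{
	long long (*fn)(struct mock_file *, long long, int);

	fn = mock_default_llseek;
	if (file->f_op && file->f_op->llseek)
		fn = file->f_op->llseek;
	return fn(file, offset, whence);
}
```

In this model, a file whose operations leave {{{llseek}}} unset bumps {{{mock_bkl_acquisitions}}} on every seek, while a file using {{{mydrv_llseek()}}} never touches the lock. That is the dependency the conversion work is meant to make explicit.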
The sane existing implementations are the following:

 * {{{generic_file_llseek()}}} does the same thing as {{{default_llseek()}}}: it moves the file pointer according to the offset given by the user. The difference is that it protects the operation using the inode mutex instead of the BKL. If you see that the driver/filesystem uses {{{file->f_pos}}}, or the offset parameter in one of its file operations callbacks, then choose this. If the driver uses the file pointer for its work, it will expect {{{llseek}}} to behave as it did before with {{{default_llseek()}}}; only the protection changes.
 * {{{default_llseek()}}}, if the driver uses the file offset as described above and also uses the BKL somewhere. This requires a bit of review. Check whether the offset is ever read or written under the BKL. If it clearly isn't, you can pick {{{generic_file_llseek()}}} instead, but take care to describe why it's safe in your changelog. Actually, describe why it's safe in your changelog whatever callback you choose. Often, a simple "this driver doesn't use the BKL" suffices.
 * {{{noop_llseek()}}}, if the driver never uses the file pointer. This callback won't update the file pointer. In fact it won't do anything, which is exactly what we want if the file pointer is unused. Note that if you use this, you need to base your work on the -mmotm tree (in fact, for all of this work, I would suggest basing it on [http://git.kernel.org/?p=linux/kernel/git/sfr/linux-next.git;a=summary linux-next]).
 * {{{no_llseek()}}} could be the right solution sometimes, but preferably don't use it: it might waste everyone's time. For background: a driver that doesn't implement {{{llseek}}} is actually seekable, because it falls back to {{{default_llseek()}}}. So a userspace program can seek on its files, and there may be some that do, even if that has no effect for the driver. In this case it is tempting to use {{{no_llseek()}}}: it marks the file as non-seekable, and any userspace program that tries to seek on such a file gets an {{{-ESPIPE}}} error. Userspace programs that seek, even though it is useless, may therefore break. Most maintainers refuse such a change, which is why it's better to use {{{noop_llseek()}}}: it does nothing, but doesn't break things either.

== ioctl ==

Like {{{llseek}}}, {{{.ioctl}}} is a callback in {{{struct file_operations}}}. It is always called with the BKL held. In order to remove the BKL from the core VFS code, all {{{file_operations}}} should be converted to use the {{{.unlocked_ioctl}}} callback instead. This can be done in one of two ways:

 * removing the BKL from the particular file entirely, either by proving that it's not needed, or by replacing it with a localized lock.
 * adding explicit {{{lock_kernel}}}/{{{unlock_kernel}}} statements in the {{{ioctl}}} method.

A patch to do the second approach for all remaining drivers is on its way.

More info: http://kerneltrap.org/mailarchive/linux-kernel/2010/4/16/4559539.

== TTY layer ==

Probably the most central part of the BKL removal is the TTY layer, which used to rely heavily on the BKL. Alan Cox has been working for years on introducing sane locking to the TTY code that will eventually let us remove the BKL, but it may still take some time to get there.

In addition, a patch series from Arnd Bergmann takes a shortcut by separating the BKL usage in the TTY layer from the usage outside of it. This series introduces a new Big TTY Mutex that takes the place of the BKL, but follows the normal rules for mutexes:

 * no autorelease on sleep, which is what most of the series is about.
 * no recursive locking.
 * need to follow strict lock order to avoid ABBA deadlocks.
 * limited to one subsystem only.
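The difference between BKL semantics and the mutex rules listed above can be illustrated with a toy, single-threaded model. This is only a sketch under stated assumptions: all {{{toy_}}} names are hypothetical, there is no real concurrency, and a second acquisition of the mutex-style lock is reported as {{{-EDEADLK}}} rather than actually deadlocking.

```c
#include <assert.h>
#include <errno.h>

/* Toy model of one task's lock state. */
struct toy_task {
	int bkl_depth;		/* -1: BKL-style lock not held; >= 0: nesting */
	int holds_tty_mutex;	/* 0 or 1: mutex-style lock */
};

#define TOY_TASK_INIT { -1, 0 }

/* BKL-style: taking the lock again from the same task only nests. */
void toy_lock_kernel(struct toy_task *t)
{
	t->bkl_depth++;
}

void toy_unlock_kernel(struct toy_task *t)
{
	assert(t->bkl_depth >= 0);	/* must be held */
	t->bkl_depth--;
}

/* Mutex-style: a second acquisition by the same task is a bug. */
int toy_tty_lock(struct toy_task *t)
{
	if (t->holds_tty_mutex)
		return -EDEADLK;
	t->holds_tty_mutex = 1;
	return 0;
}

void toy_tty_unlock(struct toy_task *t)
{
	assert(t->holds_tty_mutex);	/* must be held */
	t->holds_tty_mutex = 0;
}

/*
 * Model of a task going to sleep. The BKL-style lock is dropped for the
 * duration of the sleep and restored afterwards; the mutex-style lock is
 * left alone. Returns 1 if the BKL-style lock was released while asleep.
 */
int toy_sleep(struct toy_task *t)
{
	int saved = t->bkl_depth;
	int dropped = (saved >= 0);

	t->bkl_depth = -1;	/* other tasks could take the lock here */
	/* ... the task would be scheduled out at this point ... */
	t->bkl_depth = saved;	/* reacquired on wakeup */
	return dropped;
}
```

The point of the model is the window inside {{{toy_sleep()}}}: code that holds the BKL-style lock across a sleep is not actually serialized for that period, which is why removing the autorelease behaviour takes up most of the series.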
The first patches convert all the code using the BKL in the TTY layer and related drivers to the new interface, while the final patch adds the real mutex implementation as an experimental configuration option. When that option is disabled, the behaviour should be basically unchanged regarding serialization against other subsystems using the BKL. The new locking rules are shown in this picture:

attachment:tty_lock.png

More info: http://kerneltrap.org/mailarchive/linux-kernel/2010/3/30/4553357, http://lwn.net/Articles/390400.

== Block layer ==

There is a patch to remove the BKL from the block layer. There, the BKL is used mainly for serializing the {{{blkdev_get}}} and {{{blkdev_put}}} functions and some {{{ioctl}}} implementations, as well as some less common open functions of related character devices (following a previous pushdown) and parts of the {{{blktrace}}} code.

The nastiest one is the {{{blkdev_get}}} function, which is actually recursive and may try to get the BKL twice. All users except the one in {{{blkdev_get}}} seem to be outermost locks, meaning we don't rely on the release-on-sleep semantics to avoid deadlocks. The {{{ctl_mutex}}} ({{{pktcdvd.ko}}}), {{{raw_mutex}}} ({{{raw.ko}}}), {{{state_mutex}}} ({{{dasd.ko}}}), {{{reconfig_mutex}}} ({{{md.ko}}}), and {{{jfs_log_mutex}}} ({{{jfs.ko}}}) may be held when {{{blkdev_get}}} is called, but as far as I can tell, these mutexes are never acquired from any of the functions that need to get converted.

Almost all open and release {{{block_device_operations}}} get called with the BKL held from {{{blkdev_get}}}, and the same is done for the majority of the {{{ioctl}}} operations. A patch exists to push down the locking into the respective device drivers. On top of that, we can remove or replace the {{{lock_kernel}}} instances in {{{blkdev_get}}}/{{{blkdev_put}}}. We hold {{{bdev->bd_mutex}}} almost all the time here, so the BKL might not actually be needed.
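The pushdown mentioned above can be pictured with a small userspace model. It is a sketch only: all {{{sim_}}} names are hypothetical, the "BKL" is a plain flag, and the driver's open routine simply reports whether it ran with that flag set.

```c
#include <assert.h>

int sim_bkl_held;	/* 1 while the mock big lock is held */
int sim_core_lock_ops;	/* how often the *core* code took the lock */

void sim_lock_kernel(void)
{
	sim_bkl_held = 1;
}

void sim_unlock_kernel(void)
{
	assert(sim_bkl_held);	/* catch unbalanced unlocks */
	sim_bkl_held = 0;
}

typedef int (*sim_open_fn)(void);

/* Hypothetical legacy driver body: only correct when serialized. */
int sim_drv_open_locked(void)
{
	return sim_bkl_held ? 0 : -1;
}

/* Before the pushdown: the core wraps every driver call in the lock. */
int sim_core_open_old(sim_open_fn open)
{
	int ret;

	sim_core_lock_ops++;
	sim_lock_kernel();
	ret = open();
	sim_unlock_kernel();
	return ret;
}

/* After the pushdown: the core calls the driver without any locking. */
int sim_core_open_new(sim_open_fn open)
{
	return open();
}

/* After the pushdown: the driver takes the lock around its old body. */
int sim_drv_open(void)
{
	int ret;

	sim_lock_kernel();
	ret = sim_drv_open_locked();
	sim_unlock_kernel();
	return ret;
}
```

In this model the legacy body still runs under the lock after the pushdown, but the lock call now sits in the driver, where it can later be reviewed and replaced by a private lock or dropped, one driver at a time.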
The {{{block_ioctl}}} function itself implements a few {{{ioctl}}} commands that currently take the BKL. It's not clear from the code why this is needed, and it may be possible to replace it with {{{bdev->bd_mutex}}}.

More info: http://kerneltrap.org/mailarchive/linux-kernel/2010/4/14/4558922.

== File locking (fs/locks.c) ==

One of the oldest BKL removal patches that still has not been merged is the one for file locking. The patch itself should be fairly stable at this point, but there is still interaction with how the BKL is used in the NFS file system, in particular {{{lockd}}}, which runs for its entire lifetime with the BKL held. The hard part here is to find out which data structures in NFS actually need to be protected by {{{lock_flock}}} instead of {{{lock_kernel}}}.

More info: http://kerneltrap.org/mailarchive/linux-kernel/2010/4/14/4558923.

== Super block operations ==

A patch series from Jan Blunck removes the BKL from the generic file system mount code by pushing it into those file systems that still need it. The patches appear to be stable, but have not made it upstream yet as of 2.6.35. As a consequence of the patches, the majority of the file systems now depend on the BKL.

More info: http://kerneltrap.com/mailarchive/linux-fsdevel/2009/11/18/6582233.

== USB layer ==

Andi Kleen has posted a patch series removing the BKL from all central parts of the USB device driver layer.

More info: http://kerneltrap.org/mailarchive/linux-kernel/2010/3/29/4552603.

== Direct rendering manager ==

The DRM code still uses the BKL in ugly ways; a solution is needed.

More info: http://kerneltrap.org/mailarchive/linux-kernel/2010/4/23/4562233.

== Video4Linux ==

The BKL gets pushed into V4L device drivers by an existing patch, but as a result most of the individual drivers now depend on the BKL. In general, it should be easy to replace with a private mutex per driver or to remove, but there are a lot of drivers.

== init/main.c ==

The initial kernel thread at boot time runs with the BKL held.
There does not seem to be any reason for this, at least not once all the other users have been removed. It will be trivial to remove this instance of the BKL.

== Remaining drivers ==

When all the above changes have been done the base kernel no longer needs the BKL, but there are still a number of modules that need it. We need to mark these as 'depends on BKL' in {{{Kconfig}}} or remove the BKL from them, one at a time. Most of them are simple enough that they can be automatically converted to a private mutex using a script from Arnd.

{{{
 * sound/soundcore.ko
 * sound/soc/snd-soc-core.ko
 * sound/oss/sound.ko
 * sound/oss/msnd_pinnacle.ko
 * sound/oss/msnd_classic.ko
 * sound/core/snd.ko
 * sound/core/snd-pcm.ko
 * sound/core/seq/snd-seq.ko
 * sound/core/oss/snd-pcm-oss.ko
 * net/x25/x25.ko
 * net/wanrouter/wanrouter.ko
 * net/sunrpc/sunrpc.ko
 * net/irda/irnet/irnet.ko
 * net/irda/irda.ko
 * net/ipx/ipx.ko
 * net/appletalk/appletalk.ko
 * fs/ufs/ufs.ko
 * fs/udf/udf.ko
 * fs/squashfs/squashfs.ko
 * fs/smbfs/smbfs.ko
 * fs/reiserfs/reiserfs.ko
 * fs/qnx4/qnx4.ko
 * fs/ocfs2/ocfs2_stack_user.ko
 * fs/ocfs2/ocfs2.ko
 * fs/nfsd/nfsd.ko
 * fs/nfs/nfs.ko
 * fs/ncpfs/ncpfs.ko
 * fs/lockd/lockd.ko
 * fs/jffs2/jffs2.ko
 * fs/isofs/isofs.ko
 * fs/hpfs/hpfs.ko
 * fs/hfsplus/hfsplus.ko
 * fs/freevxfs/freevxfs.ko
 * fs/fat/vfat.ko
 * fs/fat/msdos.ko
 * fs/fat/fat.ko
 * fs/ecryptfs/ecryptfs.ko
 * fs/coda/coda.ko
 * fs/autofs4/autofs4.ko
 * fs/autofs/autofs.ko
 * fs/afs/kafs.ko
 * fs/adfs/adfs.ko
 * drivers/usb/misc/usblcd.ko
 * drivers/usb/misc/sisusbvga/sisusbvga.ko
 * drivers/usb/misc/rio500.ko
 * drivers/usb/misc/iowarrior.ko
 * drivers/usb/misc/idmouse.ko
 * drivers/usb/gadget/gadgetfs.ko
 * drivers/usb/gadget/g_printer.ko
 * drivers/usb/class/usblp.ko
 * drivers/telephony/ixj.ko
 * drivers/scsi/st.ko
 * drivers/scsi/scsi_tgt.ko
 * drivers/scsi/pmcraid.ko
 * drivers/scsi/osst.ko
 * drivers/scsi/osd/osd.ko
 * drivers/scsi/mpt2sas/mpt2sas.ko
 * drivers/scsi/megaraid/megaraid_sas.ko
 * drivers/scsi/megaraid/megaraid_mm.ko
 * drivers/scsi/megaraid.ko
 * drivers/scsi/gdth.ko
 * drivers/scsi/dpt_i2o.ko
 * drivers/scsi/ch.ko
 * drivers/scsi/aacraid/aacraid.ko
 * drivers/scsi/3w-xxxx.ko
 * drivers/scsi/3w-sas.ko
 * drivers/scsi/3w-9xxx.ko
 * drivers/rtc/rtc-m41t80.ko
 * drivers/pci/hotplug/cpqphp.ko
 * drivers/net/wireless/ray_cs.ko
 * drivers/net/wireless/airo.ko
 * drivers/net/wan/cosa.ko
 * drivers/net/ppp_generic.ko
 * drivers/mtd/ubi/ubi.ko
 * drivers/mtd/mtdchar.ko
 * drivers/misc/phantom.ko
 * drivers/message/i2o/i2o_config.ko
 * drivers/message/fusion/mptctl.ko
 * drivers/media/video/zoran/zr36067.ko
 * drivers/media/video/videodev.ko
 * drivers/media/video/usbvision/usbvision.ko
 * drivers/media/video/usbvideo/vicam.ko
 * drivers/media/video/tlg2300/poseidon.ko
 * drivers/media/video/stv680.ko
 * drivers/media/video/stradis.ko
 * drivers/media/video/stkwebcam.ko
 * drivers/media/video/se401.ko
 * drivers/media/video/s2255drv.ko
 * drivers/media/video/pwc/pwc.ko
 * drivers/media/video/dabusb.ko
 * drivers/media/video/cx88/cx8800.ko
 * drivers/media/video/cx88/cx88-blackbird.ko
 * drivers/media/video/cx23885/cx23885.ko
 * drivers/media/video/cpia.ko
 * drivers/media/video/bt8xx/bttv.ko
 * drivers/media/radio/si470x/radio-usb-si470x.ko
 * drivers/media/dvb/ttpci/dvb-ttpci.ko
 * drivers/media/dvb/firewire/firedtv.ko
 * drivers/media/dvb/dvb-core/dvb-core.ko
 * drivers/media/dvb/bt8xx/dst_ca.ko
 * drivers/isdn/mISDN/mISDN_core.ko
 * drivers/isdn/i4l/isdn.ko
 * drivers/isdn/hysdn/hysdn.ko
 * drivers/isdn/hardware/eicon/divas.ko
 * drivers/isdn/hardware/eicon/diva_mnt.ko
 * drivers/isdn/hardware/eicon/diva_idi.ko
 * drivers/isdn/divert/dss1_divert.ko
 * drivers/isdn/capi/capifs.ko
 * drivers/isdn/capi/capi.ko
 * drivers/input/serio/serio_raw.ko
 * drivers/input/misc/uinput.ko
 * drivers/infiniband/core/rdma_ucm.ko
 * drivers/infiniband/core/ib_uverbs.ko
 * drivers/infiniband/core/ib_umad.ko
 * drivers/infiniband/core/ib_ucm.ko
 * drivers/ide/ide-tape.ko
 * drivers/hwmon/fschmd.ko
 * drivers/hid/usbhid/usbhid.ko
 * drivers/hid/hid.ko
 * drivers/gpu/drm/i830/i830.ko
 * drivers/gpu/drm/i810/i810.ko
 * drivers/gpu/drm/drm.ko
 * drivers/char/toshiba.ko
 * drivers/char/tlclk.ko
 * drivers/char/stallion.ko
 * drivers/char/raw.ko
 * drivers/char/ppdev.ko
 * drivers/char/pcmcia/cm4040_cs.ko
 * drivers/char/pcmcia/cm4000_cs.ko
 * drivers/char/mwave/mwave.ko
 * drivers/char/lp.ko
 * drivers/char/istallion.ko
 * drivers/char/ipmi/ipmi_watchdog.ko
 * drivers/char/ipmi/ipmi_devintf.ko
 * drivers/char/ip2/ip2.ko
 * drivers/char/i8k.ko
 * drivers/char/dtlk.ko
 * drivers/char/applicom.ko
 * drivers/block/pktcdvd.ko
 * drivers/block/paride/pt.ko
 * drivers/block/paride/pg.ko
 * drivers/block/DAC960.ko
}}}

----
CategoryKernelProjects