
<marcelo> The old model of high availability is "fault tolerance", usually hardware-based.
<marcelo> Expensive, proprietary.
<marcelo> The goal of this old model is to keep the hardware system running.
<andres> plas
<riel> so basically, a single computer is an unreliable piece of shit (relatively speaking) ...
<riel> ... and High Availability is the collection of methods to make the job the computer does more reliable
<riel> you can do that by better hardware structures
<riel> or by better software structures
<riel> usually a combination of both
<marcelo> the Linux model of high availability is software based.
<marcelo> Now let me explain some basic concepts of HA
<marcelo> First, it's very important that we don't rely on unique hardware components in a High Availability system
<marcelo> for example, you can have two network cards connected to a network
<marcelo> In case one of the cards fails, the system tries to use the other card.
<marcelo> A hardware component that cannot fail because the whole system depends on it is called a "Single Point of Failure"
<marcelo> SPOF, to make it short. :)
<marcelo> Another important concept which must be known before we continue is "failover"
<marcelo> Failover is the process by which one machine takes over the job of another node
<riel> "machine" in this context can be anything, btw ...
<riel> if a disk fails, another disk will take over
<riel> if a machine from a cluster fails, the other machines take over the task
<riel> but to have failover, you need to have good software support
<riel> because most of the time you will be using standard computer components
<marcelo> well, this is all the "theory" needed to explain the next parts.
<riel> so let me make a quick condensation of this introduction
<riel> 1. normal computers are not reliable enough for some people (like: an internet shop), so we need a trick .. umm, method ... to make the system more reliable
<riel> 2.
high availability is the collection of these methods
<riel> 3. you can do high availability by using special hardware (very expensive) or by using a combination of normal hardware and software
<riel> 4. if one point in the system breaks and it makes the whole system break, that point is a single point of failure .. SPOF
<riel> 5. for high availability, you should have no SPOFs ... if one part of the system breaks, another part of the system should take over
<riel> (this is called "failover")
<riel> now I think we should explain a bit about how high availability works .. the technical side
<riel> umm wait ... sorry marcelo ;)
<marcelo> ok
<marcelo> Let's talk about the basic components of HA
<marcelo> Or at least some of them.
<marcelo> A simple disk running a filesystem is clearly an SPOF
<marcelo> If the disk fails, every part of the system which depends on the data contained on it will stop.
<marcelo> To keep a disk from being an SPOF of the system, RAID can be used.
<marcelo> RAID-1, which is a feature of the Linux kernel...
<marcelo> allows "mirroring" of all data on the RAID device to a given number of disks...
<marcelo> So, when data is written to the RAID device, it's replicated between all disks which are part of the RAID-1 array.
<marcelo> This way, if one disk fails, the other disk (or disks) in the RAID-1 array will be able to continue working
<riel> because the system has a copy of the data on each disk
<riel> and can just use the other copies of the data
<riel> this is another example of "failover" ... when one component fails, another component is used to fulfill this function
<riel> and the system administrator can replace (or reformat/reboot/...) the broken component
<riel> this looks really simple when you don't look at it too much
<riel> but there is one big problem ...
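The RAID-1 mirroring just described can be sketched as a toy model in a few lines of Python (this is only an illustration of the idea, not the kernel's md driver; the class and method names are invented for the sketch):

```python
# Toy RAID-1: every write is replicated to all member disks, and a
# read can be served by any member that is still alive, so losing
# one disk does not lose the data.

class MirroredDevice:
    """A toy RAID-1 array; each 'disk' is a dict acting as a block store."""

    def __init__(self, n_disks=2):
        self.disks = [{} for _ in range(n_disks)]
        self.alive = [True] * n_disks

    def write(self, block, data):
        # Replicate the block to every disk that is still working.
        for disk, ok in zip(self.disks, self.alive):
            if ok:
                disk[block] = data

    def read(self, block):
        # Any surviving copy of the data will do.
        for disk, ok in zip(self.disks, self.alive):
            if ok and block in disk:
                return disk[block]
        raise IOError("all mirrors failed")

    def fail_disk(self, i):
        self.alive[i] = False

array = MirroredDevice()
array.write(0, b"important data")
array.fail_disk(0)        # one disk dies...
print(array.read(0))      # ...the surviving mirror still serves the data
```

The real md driver also has to rebuild a replacement disk from the surviving copy, which the sketch leaves out.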
when do you need to do failover?
<riel> in some situations, you would have _2_ machines working at the same time and corrupting all data ... when you are not careful
<riel> think for example of 2 machines which are fileservers for the same data
<riel> at any time, one of the machines is working and the other is on standby
<riel> when the main machine fails, the standby machine takes over
<riel> ... BUT ...
<riel> what if the standby machine only _thinks_ the main machine is dead, and both machines do something with the data?
<riel> which copy of the data is right, which copy of the data is wrong?
<riel> or worse ... what if _both_ copies of the data are wrong?
<riel> for this, there is a special kind of program, called a "heartbeating" program, which checks which parts of the system are alive
<riel> for Linux, one of these programs is called "heartbeat" ... marcelo and lclaudio have helped writing this program
<riel> marcelo: could you tell us some of the things "heartbeat" does?
<marcelo> sure
<marcelo> "heartbeat" is a piece of software which monitors the availability of nodes
<marcelo> it "pings" the node which it wants to monitor, and, in case this node doesn't answer the "pings", it considers it to be dead.
<marcelo> when a node is considered to be dead, we can fail over the services which it was running
<marcelo> the services which we take over are previously configured on both systems.
<marcelo> Currently heartbeat works only with 2 nodes.
<marcelo> It's been used in production environments in a lot of situations...
<riel> there is one small problem, however
<riel> what if the cleaning lady takes away the network cable between the cluster nodes by accident?
<riel> and both nodes *think* they are the only one alive?
<riel> ...
and both nodes start messing with the data...
<riel> unfortunately there is no way you can prevent this 100%
<riel> but you can increase the reliability by simply having multiple means of communication
<riel> say, 2 network cables and a serial cable
<riel> and this is reliable enough that the failure of 1 component still allows good communication between the nodes
<riel> so they can reliably tell if the other node is alive or not
<riel> this was the introduction to HA
<riel> now we will give some examples of HA software on Linux
<riel> and show you how they are used ...
<riel> ... <we will wait shortly until the people doing the translation to Español have caught up> ... ;)
<marcelo> Ok
<marcelo> Now let's talk about the available software for Linux
<riel> .. ok, the translators have caught up .. we can continue again ;)
<marcelo> Note that I'll be talking about the open-source software for Linux
<marcelo> As I said above, the "heartbeat" program provides monitoring and basic failover of services
<marcelo> for two nodes only
<marcelo> As a practical example...
<marcelo> The web server at Conectiva (www.conectiva.com.br) has a standby node running heartbeat
<marcelo> In case our primary web server fails, the standby node will detect that and start the apache daemon
<marcelo> making the service available again
<marcelo> any service can be used, in theory, with heartbeat.
<riel> so if one machine breaks, everybody can still go to our website ;)
<marcelo> It only depends on the init scripts to start the service
<marcelo> So any service which has an init script can be used with heartbeat
<marcelo> arjan asked if heartbeat takes over the IP address
<marcelo> There is a virtual IP address used by the service
<marcelo> which is the "virtual server" IP address.
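The decision rule behind heartbeating over multiple links can be sketched in Python (a conceptual model only, not heartbeat's actual implementation; the function and link names are invented, and the threshold of 3 missed beats is an assumption):

```python
# Toy heartbeat verdict: a peer node is declared dead only when it has
# missed `max_missed` consecutive heartbeats on *every* communication
# link.  This is why having several links (two NICs plus a serial
# cable) makes false "the peer is dead" verdicts much less likely.

MAX_MISSED = 3  # assumed threshold for this sketch

def peer_is_dead(missed_per_link, max_missed=MAX_MISSED):
    """missed_per_link maps link name -> consecutive missed heartbeats."""
    return all(missed >= max_missed for missed in missed_per_link.values())

# The cleaning lady unplugs eth0: the serial link still carries
# heartbeats, so no failover (and no split brain) is triggered.
assert not peer_is_dead({"eth0": 5, "eth1": 0, "serial": 0})

# Only when every link has gone silent do we fail over the services.
assert peer_is_dead({"eth0": 5, "eth1": 4, "serial": 3})
```

Even this rule cannot distinguish "all links cut" from "peer is dead", which is why the transcript says the risk can be reduced but never removed completely.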
<marcelo> So, in our webserver case...
<marcelo> the real IP address of the first node is not used by the apache daemon
<marcelo> but the virtual IP address, which will be used by the standby node in case failover happens
<marcelo> Heartbeat, however, is limited to two nodes.
<marcelo> This is a big problem for a lot of big systems.
<marcelo> SGI has ported its FailSafe HA system to Linux recently ([http://oss.sgi.com/projects/failsafe])
<marcelo> FailSafe is a complete cluster manager which supports up to 16 nodes.
<marcelo> Right now it's not ready for production environments
<marcelo> But that's being worked on by the Linux-HA project people :)
<marcelo> SGI's FailSafe is GPL.
<riel> another type of clustering is LVS ... the Linux Virtual Server project
<riel> LVS uses a very different approach to clustering
<riel> you have 1 (maybe 2) machines that receive http (www) requests
<riel> but those machines don't do anything, except send the requests to a whole bunch of machines that do the real work
<riel> so-called "working nodes"
<riel> if one (or even more) of the working nodes fail, the others will do the work
<riel> and all the routers (the machines sitting at the front) do is:
<riel> 1. keep track of which working nodes are available
<riel> 2. give the http requests to the working nodes
<riel> the kernel needs a special TCP/IP patch and a set of usermode utilities for this to work
<riel> Red Hat's "piranha" tool is a configuration tool for LVS, that people can use to set up LVS clusters in an easier way
<riel> at Conectiva, we are also working on a very nice HA project
<riel> the project marcelo and Olive are working on is called "drbd"
<riel> the distributed replicated block device
<riel> this is almost the same as RAID-1, only over the network
<riel> to go back to RAID-1 (mirroring) ...
RAID-1 is using 2 (or more) disks to store your data
<riel> with one copy of the data on every disk
<riel> drbd extends this idea to use disks on different machines on the network
<riel> so if one disk (on one machine) fails, the other machines still have the data
<riel> and if one complete machine fails, the data is on another machine ... and the system as a whole continues to run
<riel> if you use this together with ext3 or reiserfs, the machine that is still running can very quickly take over the filesystem that it has copied to its own disk
<riel> and your programs can continue to run
<riel> (with ext2, you would have to do an fsck first, which can take a long time)
<riel> this can be used for fileservers, databases, webservers, ...
<riel> everything where you need the very latest data to work
<riel> ...
<riel> this is the end of our part of the lecture, if you have any questions, you can ask them and we will try to give you a good answer ;)
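As a small appendix to the lecture, the LVS director idea described above (front machines that only keep track of working nodes and hand out requests) can be sketched as a toy model in Python; this is not LVS code, just an illustration of round-robin scheduling, one of the schedulers real LVS offers, with invented names:

```python
# Toy LVS-style director: it does no real work itself, it only tracks
# which working nodes are alive and hands each incoming request to
# the next available one (round-robin scheduling).

from itertools import count

class Director:
    def __init__(self, workers):
        self.workers = list(workers)
        self.alive = set(self.workers)
        self._turn = count()

    def mark_dead(self, worker):
        # Called when monitoring notices a working node has failed.
        self.alive.discard(worker)

    def dispatch(self, request):
        # Hand the request to the next surviving working node in turn.
        candidates = [w for w in self.workers if w in self.alive]
        if not candidates:
            raise RuntimeError("no working nodes left")
        worker = candidates[next(self._turn) % len(candidates)]
        return worker, request

lvs = Director(["node1", "node2", "node3"])
lvs.mark_dead("node2")          # a working node fails...
print(lvs.dispatch("GET /"))    # ...requests keep going to the survivors
```

The real LVS does this inside the kernel at the TCP/IP level, and also supports other schedulers (weighted round-robin, least-connections, and so on).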

See also [http://www.linux-ha.org/]


CategoryDocs

KernelNewbies: Documents/HighAvailability (last edited 2006-08-14 22:44:20 by RikvanRiel)