| marcelo | The old model of high availability is "fault tolerance", which is usually hardware-based. | |
| marcelo | Expensive, proprietary. | |
| marcelo | The goal of this old model is to keep the hardware system running | |
| andres | clap | |
| riel | so basically, a single computer is an unreliable piece of shit (relatively speaking) ... | |
| riel | ... and High Availability is the collection of methods to make the job the computer does more reliable | |
| riel | you can do that by better hardware structures | |
| riel | or by better software structures | |
| riel | usually a combination of both | |
| marcelo | the Linux model of high availability is software based. | |
| marcelo | Now let me explain some basic concepts of HA | |
| marcelo | First, it's very important that we don't rely on unique hardware components in a High Availability system | |
| marcelo | for example, you can have two network cards connected to a network | |
| marcelo | In case one of the cards fails, the system tries to use the other card. | |
| marcelo | A hardware component whose failure brings down the whole system, because everything depends on it, is called a "Single Point of Failure" | |
| marcelo | SPOF, to make it short. :) | |
| marcelo | Another important concept which must be known before we continue is "failover" | |
| marcelo | Failover is the process by which one machine takes over the job of another node | |
| riel | "machine" in this context can be anything, btw ... | |
| riel | if a disk fails, another disk will take over | |
| riel | if a machine from a cluster fails, the other machines take over the task | |
| riel | but to have failover, you need to have good software support | |
| riel | because most of the time you will be using standard computer components | |
| marcelo | well, this is all the "theory" needed to explain the next parts. | |
| riel | so let me make a quick condensation of this introduction | |
| riel | 1. normal computers are not reliable enough for some people (like: internet shop), so we need a trick .. umm method ... to make the system more reliable | |
| riel | 2. high availability is the collection of these methods | |
| riel | 3. you can do high availability by using special hardware (very expensive) or by using a combination of normal hardware and software | |
| riel | 4. if one point in the system breaks and it makes the whole system break, that point is a single point of failure .. SPOF | |
| riel | 5. for high availability, you should have no SPOFs ... if one part of the system breaks, another part of the system should take over | |
| riel | (this is called "failover") | |
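As a rough illustration of point 4 in the summary above, here is a minimal sketch that flags single points of failure in an inventory. The inventory format and the function name `find_spofs` are hypothetical, and the rule "fewer than two interchangeable units means SPOF" is a simplifying assumption.

```python
def find_spofs(components):
    """Return the component types that have no redundant spare.

    `components` maps a component type to the number of
    interchangeable units installed (hypothetical inventory format).
    """
    return sorted(name for name, count in components.items() if count < 2)


inventory = {"network card": 2, "disk": 1, "power supply": 1}
print(find_spofs(inventory))  # ['disk', 'power supply']
```

In this toy inventory the two network cards can fail over to each other, while the lone disk and power supply would each take the whole system down.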
| riel | now I think we should explain a bit about how high availability works .. the technical side | |
| riel | umm wait ... sorry marcelo ;) | |
| marcelo | ok | |
| marcelo | Let's talk about the basic components of HA | |
| marcelo | Or at least some of them. | |
| marcelo | A simple disk running a filesystem is clearly an SPOF | |
| marcelo | If the disk fails, every part of the system which depends on the data contained on it will stop. | |
| marcelo | To keep a disk from being a SPOF of a system, RAID can be used. | |
| marcelo | RAID-1, which is a feature of the Linux kernel... | |
| marcelo | Allows "mirroring" of all data on the RAID device to a given number of disks... | |
| marcelo | So, when data is written to the RAID device, it's replicated to all disks which are part of the RAID1 array. | |
| marcelo | This way, if one disk fails, the other disk (or disks) in the RAID1 array will be able to continue working | |
| riel | because the system has a copy of the data on each disk | |
| riel | and can just use the other copies of the data | |
| riel | this is another example of "failover" ... when one component fails, another component is used to fulfill this function | |
| riel | and the system administrator can replace (or reformat/reboot/...) the faulty component | |
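The mirroring idea can be sketched as a toy model. This is not how the kernel's RAID1 driver is implemented; the class `Mirror` and its methods are hypothetical names, and each "disk" is just a dictionary.

```python
class Mirror:
    """Toy RAID-1: every write goes to all healthy disks."""

    def __init__(self, n_disks=2):
        # Each "disk" is a dict mapping block number -> data.
        self.disks = [dict() for _ in range(n_disks)]
        self.failed = set()

    def fail(self, index):
        """Mark one disk as broken."""
        self.failed.add(index)

    def write(self, block, data):
        # Replicate the data to every disk that is still healthy.
        for i, disk in enumerate(self.disks):
            if i not in self.failed:
                disk[block] = data

    def read(self, block):
        # Failover: use the first healthy disk that holds the block.
        for i, disk in enumerate(self.disks):
            if i not in self.failed and block in disk:
                return disk[block]
        raise IOError("no healthy disk holds block %r" % (block,))


m = Mirror(n_disks=2)
m.write(0, b"hello")
m.fail(0)            # the first disk dies
print(m.read(0))     # b'hello', served from the second disk
```

Only when every disk holding a block has failed does the read raise an error, which is the point where the array as a whole is lost.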
| riel | this looks really simple when you don't look at it too much | |
| riel | much | |
| riel | but there is one big problem ... when do you need to do failover? | |
| riel | in some situations, you would have _2_ machines working at the same time and corrupting all data ... when you are not careful | |
| riel | think for example of 2 machines which are fileservers for the same data | |
| riel | at any time, one of the machines is working and the other is on standby | |
| riel | when the main machine fails, the standby machine takes over | |
| riel | ... BUT ... | |
| riel | what if the standby machine only _thinks_ the main machine is dead and both machines do something with the data? | |
| riel | which copy of the data is right, which copy of the data is wrong? | |
| riel | or worse ... what if _both_ copies of the data are wrong? | |
| riel | for this, there is a special kind of program, called a "heartbeating" program, which checks which parts of the system are alive | |
| riel | for Linux, one of these programs is called "heartbeat" ... marcelo and lclaudio have helped writing this program | |
| riel | marcelo: could you tell us some of the things "heartbeat" does? | |
| marcelo | sure | |
| marcelo | "heartbeat" is a piece of software which monitors the availability of nodes | |
| marcelo | it "pings" the node which it wants to monitor, and, in case this node doesn't answer the "pings", it considers it to be dead. | |
| marcelo | when a node is considered to be dead, we can fail over the services which it was running | |
| marcelo | the services which we take over are configured beforehand on both systems. | |
| marcelo | Currently heartbeat works only with 2 nodes. | |
| marcelo | It's been used in production environments in a lot of situations... | |
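The monitoring logic described here can be sketched roughly as follows. This is not the code of the real "heartbeat" program; the class, the `timeout` parameter and the callback are hypothetical, and timestamps are passed in explicitly so the example is deterministic.

```python
class HeartbeatMonitor:
    """Consider a peer dead after `timeout` seconds without a reply."""

    def __init__(self, timeout, started_at):
        self.timeout = timeout
        self.last_seen = started_at  # treat startup as the first sign of life

    def reply_received(self, now):
        """Record that the peer answered a ping at time `now`."""
        self.last_seen = now

    def is_dead(self, now):
        return now - self.last_seen > self.timeout


def check_and_failover(monitor, now, start_services):
    """Call the takeover callback if the peer is considered dead."""
    if monitor.is_dead(now):
        start_services()
        return True
    return False


monitor = HeartbeatMonitor(timeout=10.0, started_at=0.0)
monitor.reply_received(now=5.0)
started = []
take_over = lambda: started.append("apache")
check_and_failover(monitor, now=12.0, start_services=take_over)  # 7s of silence
check_and_failover(monitor, now=16.0, start_services=take_over)  # 11s of silence
print(started)  # ['apache']
```

The first check does nothing because the peer answered 7 seconds ago; the second one exceeds the timeout and triggers the takeover.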
| riel | there is one small problem, however | |
| riel | what if the cleaning lady takes away the network cable between the cluster nodes by accident? | |
| riel | and both nodes *think* they are the only one alive? | |
| riel | ... and both nodes start messing with the data... | |
| riel | unfortunately there is no way you can prevent this 100% | |
| riel | but you can increase the reliability by simply having multiple means of communication | |
| riel | say, 2 network cables and a serial cable | |
| riel | and this is reliable enough that the failure of 1 component still allows good communication between the nodes | |
| riel | so they can reliably tell if the other node is alive or not | |
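A minimal sketch of the redundant-links idea: the peer is treated as dead only if every communication channel has been silent for too long. The channel names, timestamps and the `timeout` value are made up for illustration.

```python
def peer_alive(last_seen_by_channel, now, timeout=10.0):
    """Peer is alive if ANY channel heard from it within `timeout` seconds."""
    return any(now - seen <= timeout
               for seen in last_seen_by_channel.values())


# Both network cables were unplugged at t=3, but the serial link still works.
channels = {"eth0": 3.0, "eth1": 3.0, "serial": 14.0}
print(peer_alive(channels, now=20.0))  # True
```

With a single channel, the unplugged cable alone would have made the standby node believe the main node was dead.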
| riel | this was the introduction to HA | |
| riel | now we will give some examples of HA software on Linux | |
| riel | and show you how they are used ... | |
| riel | ... <we will wait shortly until the people doing the translation to Español have caught up> ... ;) | |
| marcelo | Ok | |
| marcelo | Now let's talk about the available software for Linux | |
| riel | .. ok, the translators have caught up .. we can continue again ;) | |
| marcelo | Note that I'll be talking about the opensource software for Linux | |
| marcelo | As I said above, the "heartbeat" program provides monitoring and basic failover of services | |
| marcelo | for two nodes only | |
| marcelo | As a practical example... | |
| marcelo | The web server at Conectiva (www.conectiva.com.br) has a standby node running heartbeat | |
| marcelo | In case our primary web server fails, the standby node will detect this and start the apache daemon | |
| marcelo | making the service available again | |
| marcelo | any service can be used, in theory, with heartbeat. | |
| riel | so if one machine breaks, everybody can still go to our website ;) | |
| marcelo | It only depends on the init scripts to start the service | |
| marcelo | So any service which has an init script can be used with heartbeat | |
| marcelo | arjan asked if it takes over the IP address | |
| marcelo | There is a virtual IP address used by the service | |
| marcelo | which is the "virtual server" IP address. | |
| marcelo | So, in our webserver case... | |
| marcelo | the real IP address of the first node is not used by the apache daemon | |
| marcelo | but the virtual IP address which will be used by the standby node in case failover happens | |
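The virtual IP idea can be modelled in a few lines. This sketch only tracks which node currently owns the address; it does not configure any real network interface. The class name, node names and the address `192.0.2.10` (from a range reserved for documentation) are all hypothetical.

```python
class TwoNodeCluster:
    """Track which node currently answers on the shared virtual IP."""

    def __init__(self, virtual_ip, primary, standby):
        self.virtual_ip = virtual_ip
        self.primary = primary
        self.standby = standby
        self.owner = primary  # the primary holds the virtual IP at first

    def failover(self):
        """The standby node takes over the virtual IP."""
        self.owner = self.standby

    def serving_node(self, address):
        """Return the node that answers a client connecting to `address`."""
        if address != self.virtual_ip:
            raise ValueError("clients should only use the virtual IP")
        return self.owner


cluster = TwoNodeCluster("192.0.2.10", primary="web1", standby="web2")
print(cluster.serving_node("192.0.2.10"))  # web1
cluster.failover()
print(cluster.serving_node("192.0.2.10"))  # web2
```

Clients keep using the same address before and after the failover, which is why the service must not be bound to the real IP address of either node.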
| marcelo | Heartbeat, however, is limited to two nodes. | |
| marcelo | This is a big problem for a lot of big systems. | |
| marcelo | SGI has ported its FailSafe HA system to Linux recently (http://oss.sgi.com/projects/failsafe) | |
| marcelo | FailSafe is a complete cluster manager which supports up to 16 nodes. | |
| marcelo | Right now it's not ready for production environments | |
| marcelo | But that's being worked on by the Linux HA project people :) | |
| marcelo | SGI's FailSafe is GPL. | |
| riel | another type of clustering is LVS ... the Linux Virtual Server project | |
| riel | LVS uses a very different approach to clustering | |
| riel | you have 1 (maybe 2) machines that receive the http (www) requests | |
| riel | but those machines don't do anything, except send the requests to a whole bunch of machines that do the real work | |
| riel | so called "working nodes" | |
| riel | if one (or even more) of the working nodes fails, the others will do the work | |
| riel | and all the routers (the machines sitting at the front) do is: | |
| riel | 1. keep track of which working nodes are available | |
| riel | 2. give the http requests to the working nodes | |
| riel | the kernel needs a special TCP/IP patch and a set of usermode utilities for this to work | |
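The two jobs of the front machines can be sketched as a toy dispatcher. This is not LVS code, which works inside the kernel; the class `Director`, the node names and the choice of plain round-robin are assumptions made for the example.

```python
class Director:
    """Toy front-end router: forwards requests to available working nodes."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.available = set(self.nodes)
        self._next = 0  # index of the next node to try

    def mark_down(self, node):
        """Job 1: remember that a working node is unavailable."""
        self.available.discard(node)

    def mark_up(self, node):
        if node in self.nodes:
            self.available.add(node)

    def dispatch(self, request):
        """Job 2: pick the next available node, round-robin."""
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next]
            self._next = (self._next + 1) % len(self.nodes)
            if node in self.available:
                return node
        raise RuntimeError("no working nodes available")


d = Director(["w1", "w2", "w3"])
d.mark_down("w2")  # one working node fails
print([d.dispatch("GET /") for _ in range(4)])  # ['w1', 'w3', 'w1', 'w3']
```

The failed node is simply skipped, so the remaining working nodes share the requests between them.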
| riel | RedHat's "piranha" tool is a configuration tool for LVS that people can use to set up LVS clusters more easily | |
| riel | in Conectiva, we are also working on a very nice HA project | |
| riel | the project marcelo and Olive are working on is called "drbd" | |
| riel | the distributed redundant block device | |
| riel | this is almost the same as RAID1, only over the network | |
| riel | to go back to RAID1 (mirroring) ... RAID1 is using 2 (or more) disks to store your data | |
| riel | with one copy of the data on every disk | |
| riel | drbd extends this idea to use disks on different machines on the network | |
| riel | so if one disk (on one machine) fails, the other machines still have the data | |
| riel | and if one complete machine fails, the data is on another machine ... and the system as a whole continues to run | |
| riel | if you use this together with ext3 or reiserfs, the machine that is still running can very quickly take over the filesystem that it has copied to its own disk | |
| riel | and your programs can continue to run | |
| riel | (with ext2, you would have to do an fsck first, which can take a long time) | |
| riel | this can be used for fileservers, databases, webservers, ... | |
| riel | everything where you need the very latest data to work | |
| riel | ... | |
| riel | this is the end of our part of the lecture, if you have any questions, you can ask them and we will try to give you a good answer ;) | |
| See also http://www.linux-ha.org/ |
