Analysis of the Programa Integral de Nutrición (PIN), Ministerio de la Mujer y Desarrollo Social


In addition to environmental and access concerns, there are several things to consider when architecting a service for reliability. In Chapter 4, we explained how to build an individual server so that it is more reliable. Having reliable servers as components in your service is another part of making the service reliable as a whole.

If you have redundant hardware available, use it as effectively as you can. For example, if a system has two power supplies, plug them into different power strips and different power sources. If you have redundant machines, get the power and also the network connectivity from different places—for example, different switches—if possible. Ultimately, if this service is meant to be available to people at several sites, think about placing redundant systems at another site that will act as a fallback if the main site has a catastrophic failure.

All the components of each service, other than the redundant pieces, should be tightly coupled, sharing the same power source and network infrastructure, so that the service as a whole depends on as few components as possible. Spreading nonredundant parts across multiple pieces of infrastructure simply means that the service has more single points of failure, each of which can bring the whole service down. For example, suppose that a remote access service is deployed and that part of that service is a new, more secure authentication and authorization system. The system is designed with three components: the box that handles the remote connections, the server that makes sure that people are who they say they are (authentication), and the server that determines what areas people are allowed to access (authorization). If the three components are on different power sources, a failure of any one power source will cause the whole service to fail. Each one is a single point of failure. If they are on the same power source, the service will be unaffected by failures of the other power sources. Likewise, if they are on the same network switch, only a failure of that switch will take the service down. On the other hand, if they are spread across three networks, with many different switches and routers involved in communications between the components, many more components could fail and bring the service down.
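The remote access example can be checked with a few lines of Python. This is a minimal sketch, not part of the original design: the component names, the power source labels, and the helper functions are all hypothetical, and it assumes every component is nonredundant.

```python
# Hypothetical layouts mapping each nonredundant component of the
# remote access service to the power source that feeds it.
SPREAD = {
    "remote-access-box": "power-A",
    "authn-server": "power-B",
    "authz-server": "power-C",
}
SHARED = {
    "remote-access-box": "power-A",
    "authn-server": "power-A",
    "authz-server": "power-A",
}


def single_points_of_failure(layout):
    """Return the power sources whose failure takes the whole service down.

    Because no component is redundant, losing any source that feeds
    at least one component is fatal to the service.
    """
    return set(layout.values())


def service_survives(layout, failed_source):
    """True if no component of the service is fed by the failed source."""
    return failed_source not in layout.values()


all_sources = ["power-A", "power-B", "power-C"]
for name, layout in (("spread", SPREAD), ("shared", SHARED)):
    fatal = [s for s in all_sources if not service_survives(layout, s)]
    print(f"{name}: {len(fatal)} power source(s) can take the service down: {fatal}")
```

With the spread layout, all three power sources are fatal to the service; with the shared layout, only one is. The same reasoning applies if the values are network switches instead of power sources.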

The single most effective way to make a service as reliable as possible is to make it as simple as possible. Find the simplest solution that meets all the requirements. When considering the reliability of a service you are building, break it down into its constituent parts and look at what each of them relies on and how reliable those dependencies are, until you reach servers and services that do not rely on anything else. For example, many services rely on name service, such as DNS. How reliable is your name service? Do your name servers rely on other servers and services? Other common central services are authentication services and directory services.
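One way to carry out this decomposition is to write down the direct dependencies of each service and walk them until nothing new turns up. The sketch below assumes a hand-maintained dependency map; the service names and the `DEPENDS_ON` table are illustrative, not taken from any real site.

```python
# Hypothetical map: each service lists what it relies on directly.
DEPENDS_ON = {
    "remote-access": ["authentication", "dns"],
    "authentication": ["directory", "dns"],
    "directory": ["dns"],
    "dns": [],
}


def transitive_dependencies(service, depends_on):
    """Return everything `service` relies on, directly or indirectly."""
    seen = set()
    stack = list(depends_on.get(service, []))
    while stack:
        dep = stack.pop()
        if dep in seen:
            continue  # already visited; this also guards against cycles
        seen.add(dep)
        stack.extend(depends_on.get(dep, []))
    return seen


def leaf_services(depends_on):
    """Services that rely on nothing else; the decomposition stops here."""
    return {name for name, deps in depends_on.items() if not deps}


print(sorted(transitive_dependencies("remote-access", DEPENDS_ON)))
print(sorted(leaf_services(DEPENDS_ON)))
```

The longer the first list, the more things must be working for the service to work, which is a rough measure of how far the design is from the simplest solution.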

The network is almost certainly one of the components of your system. When you are building a service at a central location that will be accessed from remote locations, it is particularly important to take network topology into account. If connectivity to the main site is down, can the service still be made available to the remote site? Does it make sense to have that service still available to the remote site? What are the implications? Are there resynchronization issues? For example, name service should remain available on both sides when a link is severed, because many things that people at the remote site do rely only on machines at that site. But people won’t be able to do those things if they can’t resolve names. Even if their name server database isn’t getting updates, the stale database can still be useful. If you have a centralized remote access authentication service with remote access systems at other offices, those remote access systems probably still should be able to authenticate people who connect to them, even if the link to the central server is down. In both of these cases, the software should be able to provide secondary servers at remote offices and cope with resynchronizing databases when connectivity is restored. However, if you are building a large database or file service, ensuring that the service is still available in remote offices when their connectivity has been lost is probably not realistic.
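The behavior expected of a secondary server at a remote office can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the `SecondaryNameServer` class, its methods, and the host names are hypothetical, and real name servers synchronize through their own transfer mechanisms rather than a dictionary copy.

```python
class SecondaryNameServer:
    """Toy secondary at a remote site holding a local copy of the records."""

    def __init__(self):
        self.records = {}      # local copy, possibly stale
        self.in_sync = False   # whether the last refresh succeeded

    def refresh(self, primary_records, link_up):
        """Copy records from the primary if the link is up; otherwise keep the old copy."""
        if link_up:
            self.records = dict(primary_records)
            self.in_sync = True
        else:
            self.in_sync = False

    def resolve(self, name):
        """Answer from the local copy, whether or not it is current."""
        return self.records.get(name)


# Hypothetical records; addresses come from the 192.0.2.0/24 documentation range.
primary = {"printer.branch.example": "192.0.2.10"}
secondary = SecondaryNameServer()
secondary.refresh(primary, link_up=True)

# The link is severed, then the primary gains a record the branch never sees.
primary["wiki.hq.example"] = "192.0.2.20"
secondary.refresh(primary, link_up=False)
print(secondary.resolve("printer.branch.example"))  # answered from the stale copy
print(secondary.resolve("wiki.hq.example"))         # unknown until the link returns

# Connectivity is restored and the copy is resynchronized.
secondary.refresh(primary, link_up=True)
print(secondary.resolve("wiki.hq.example"))
```

The point of the model is that a failed refresh never discards the local copy, so names at the remote site keep resolving while the link is down.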

Soft outages still provide some functionality. For example, a DNS server can be down and customers can still function, though sometimes a little more slowly or without being able to perform certain functions.

Hard outages, on the other hand, disrupt all other services, making it impossible for people to get any work done. It’s better to group customers and servers/services such that hard outages disrupt only particular customer groups, not all customers. The funny thing about computers is that if one critical function, such as NFS, isn’t working, often no work can be done. Thus, being 90 percent functional can be the same as being 0 percent functional. Isolate the 10 percent outage to well-partitioned subsets.

For example, a down NFS server hangs all clients that are actively connected. Suppose that there are three customer groups and three NFS file servers. If the customers’ data is spread over the file servers randomly, an outage on one file server will affect all customers. On the other hand, if each customer group is isolated to a particular file server, only one-third of the customers, at most, will be unable to work during an outage.
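The arithmetic behind the NFS example can be made explicit. The sketch below uses hypothetical group and server names and assumes that a customer group cannot work if any of its data lives on the failed server.

```python
# Hypothetical placements of customer data on three NFS file servers.
SPREAD = {
    "group1": {"nfs1", "nfs2", "nfs3"},
    "group2": {"nfs1", "nfs2", "nfs3"},
    "group3": {"nfs1", "nfs2", "nfs3"},
}
ISOLATED = {
    "group1": {"nfs1"},
    "group2": {"nfs2"},
    "group3": {"nfs3"},
}


def affected_groups(placement, failed_server):
    """Return the customer groups with any data on the failed server."""
    return {group for group, servers in placement.items() if failed_server in servers}


for name, placement in (("spread", SPREAD), ("isolated", ISOLATED)):
    hit = affected_groups(placement, "nfs2")
    print(f"{name}: {len(hit)} of {len(placement)} groups cannot work: {sorted(hit)}")
```

In the spread placement, the loss of one server stops every group; in the isolated placement, it stops only the group assigned to that server.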

Grouped Power Cords

This same technique relates to how hardware is connected. A new SA was very proud of how neatly he wired a new set of servers. Each server had three components: a CPU, an external disk chassis, and a monitor. One power strip was for all the CPUs, one for all the disk chassis, and one for all the monitors. Every wire was neatly run and secured with wire ties—a very pretty sight. His mentor complimented him on a job well done but, realizing that the servers weren’t in use yet, took the opportunity to shut off the power strip with all the disks. All the servers crashed. The SA learned his lesson: It would be better to have each power strip supply power to all the components of a particular machine. Any single power strip failure would result in an outage of one-third of the devices. In both cases, one-third of the components were down, but in the latter case, only one-third of the service became unusable.


❖ Windows Login Scripts

Another example of reliability grouping relates to how one architects MS Windows login scripts. Everything the script needs should come from the same server as the script. That way, the script can be fairly sure that the server is alive. If users receive their login scripts from different servers, the various things that each login script needs to access should be replicated to all the servers rather than having multiple dependencies.
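A simple audit can flag login scripts that reach beyond their own server. The following is a sketch only: the server names, the UNC paths, and the `external_dependencies` helper are hypothetical, and it assumes the resources a script uses have already been collected into a list.

```python
def external_dependencies(script_server, resources):
    """Return the UNC paths that live on a server other than the script's own."""
    external = []
    for path in resources:
        # A UNC path looks like \\server\share\...; the server is the first part.
        server = path.lstrip("\\").split("\\", 1)[0]
        if server.lower() != script_server.lower():
            external.append(path)
    return external


# Hypothetical resources referenced by a login script served from SRV1.
resources = [
    r"\\SRV1\netlogon\mapdrives.cmd",
    r"\\srv1\tools\setup.exe",
    r"\\SRV2\tools\printers.vbs",
]
print(external_dependencies("SRV1", resources))
```

Any path the function returns is an extra dependency: a candidate for replication to the script's own server.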
