Although this chapter recommends paying more for server-grade hardware because the extra performance and reliability are worthwhile, a growing counterargument says that it is better to use many replicated cheap servers that will fail more often. If you are doing a good job of managing failures, this strategy is more cost-effective.
Running large web farms will entail many redundant servers, all built to be exactly the same by the automated install. If each web server can handle 500 queries per second (QPS), you might need ten servers to handle the 5,000 QPS that you expect to receive from users all over the Internet. A load-balancing mechanism can distribute the load among the servers. Best of all, load balancers have ways to automatically detect machines that are down. If one server goes down, the load balancer divides the queries among the remaining good servers, and users still receive service. The remaining servers are each about one-ninth more loaded, but that’s better than an outage.
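As a rough illustration of that redistribution, the following sketch computes the load each server carries before and after one of the ten fails. It assumes the load balancer splits traffic perfectly evenly, which real load balancers only approximate; the function name per_server_load is hypothetical.

```python
def per_server_load(total_qps: float, healthy_servers: int) -> float:
    """Evenly divide the offered load among the servers still in rotation."""
    if healthy_servers <= 0:
        raise ValueError("no healthy servers left to take traffic")
    return total_qps / healthy_servers


TOTAL_QPS = 5000  # expected load from the example above

before = per_server_load(TOTAL_QPS, 10)  # all ten servers healthy
after = per_server_load(TOTAL_QPS, 9)    # one server has failed
increase = after / before - 1            # relative extra load per survivor

print(f"before failure: {before:.1f} QPS per server")
print(f"after failure:  {after:.1f} QPS per server")
print(f"increase:       {increase:.1%}")
```

Under these assumptions, each surviving server carries roughly 556 QPS instead of 500.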
What if you used lower-quality parts that would result in ten failures? If that saved 10 percent on the purchase price, you could buy an eleventh machine to make up for the increased failures and lower performance of the slower machines. However, you spent the same amount of money, got the same number of QPS, and had the same uptime. No difference, right?
In the early 1990s, servers often cost $50,000. Desktop PCs cost around $2,000 because they were made from commodity parts that were being mass-produced in volumes orders of magnitude larger than those of server parts. If you built a server based on those commodity parts, it would not be able to provide the required QPS, and the failure rate would be much higher.
By the late 1990s, however, the economics had changed. Thanks to the continued mass-production of PC-grade parts, both prices and performance had improved dramatically. Companies such as Yahoo! and Google figured out how to manage large numbers of machines effectively, streamlining hardware installation, software updates, hardware repair management, and so on. It turns out that if you do these things on a large scale, the cost goes down significantly.
Traditional thinking says that you should never try to run a commercial service on a commodity-based server that can process only 20 QPS. However, when you can manage many of them, things start to change. Continuing the example, you would have to purchase 250 such servers to equal the performance of the 10 traditional servers mentioned previously. You would pay the same amount of money for the hardware.
As the QPS improved, this kind of solution became less expensive than buying large servers. If they provided 100 QPS of performance, you could buy the same capacity, 50 servers, at one-fifth the price or spend the same money and get five times the processing capacity.
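The arithmetic behind this comparison can be sketched as follows. The prices reuse the $50,000 server and $2,000 commodity PC figures quoted earlier and are assumed to stay fixed as QPS improves; the function name farm_cost is hypothetical.

```python
import math


def farm_cost(target_qps: int, qps_per_server: int, price_per_server: int):
    """Return (server count, total hardware cost) needed to reach target_qps."""
    count = math.ceil(target_qps / qps_per_server)
    return count, count * price_per_server


TARGET_QPS = 5000

# Traditional servers: 500 QPS each at an assumed $50,000 apiece.
big_count, big_cost = farm_cost(TARGET_QPS, 500, 50_000)

# Commodity servers at 20 QPS each, assumed $2,000 apiece.
slow_count, slow_cost = farm_cost(TARGET_QPS, 20, 2_000)

# Commodity servers once they reach 100 QPS each, same assumed price.
fast_count, fast_cost = farm_cost(TARGET_QPS, 100, 2_000)

for label, count, cost in [
    ("traditional", big_count, big_cost),
    ("commodity, 20 QPS", slow_count, slow_cost),
    ("commodity, 100 QPS", fast_count, fast_cost),
]:
    print(f"{label:>20}: {count:4d} servers, ${cost:,}")
```

With these assumed prices, the 20 QPS farm costs the same as the ten traditional servers, and the 100 QPS farm delivers the same capacity for one-fifth of the money.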
By eliminating the components that were unused in such an arrangement, such as video cards, USB connectors, and so on, the cost could be further contained. Soon, one could purchase five to ten commodity-based servers for every large server traditionally purchased and have more processing capability. Streamlining the physical hardware requirements resulted in more efficient packaging, with powerful servers slimmed down to a mere rack-unit in height.8
This kind of massive-scale cluster computing is what makes huge web services possible. Eventually, one can imagine more and more services turning to this kind of architecture.
8. The distance between the predrilled holes in a standard rack frame is referred to as a rack-unit, abbreviated as U. Thus, a system that occupies the space above or below the bolts that hold it in would be a 2U system.
Case Study: Disposable Servers
Many e-commerce sites build mammoth clusters of low-cost 1U PC servers. Racks are packed with as many servers as possible, with dozens or hundreds configured to provide each service required. One site found that when a unit died, it was more economical to power it off and leave it in the rack rather than repair the unit. Removing dead units might accidentally cause an outage if other cables were loosened in the process. The site would not need to reap the dead machines for quite a while. We presume that when it starts to run out of space, the site will adopt a monthly day of reaping, with certain people carefully watching the service-monitoring systems while others reap the dead machines.
Another way to pack a large number of machines into a small space is to use blade server technology. A single chassis contains many slots, each of which can hold a card, or blade, that contains a CPU and memory. The chassis supplies power and network and management access. Sometimes, each blade has a hard disk; others require each blade to access a centralized storage-area network. Because all the devices are similar, it is possible to create an automated system such that if one dies, a spare is configured as its replacement.
An increasingly important new technology is the use of virtual servers. Server hardware is now so powerful that justifying the cost of single-purpose machines is more difficult. The concept of a server as a set of components (hardware and software) provides security and simplicity. By running many virtual servers on a large, powerful server, the best of both worlds is achieved. Virtual servers are discussed further in Section 21.1.2.
Blade Server Management
A division of a large multinational company was planning on replacing its aging multi-CPU server with a farm of blade servers. The application would be recoded so that instead of using multiple processes on a single machine, it would use processes spread over the blade farm. Each blade would be one node of a vast compute farm that jobs could be submitted to and results consolidated on a controlling server. This had wonderful scalability, since a new blade could be added to the farm within minutes via automated build processes, if the application required it, or could be repurposed to other uses just as quickly. No direct user logins were needed, and no SA work would be needed beyond replacing faulty hardware and managing what blades were assigned to what applications. To this end, the SAs engineered a tightly locked-down minimal-access solution that could be deployed in minutes. Hundreds of blades were purchased and installed, ready to be purposed as the customer required.
The problem came when application developers found themselves unable to manage their application. They couldn’t debug issues without direct access. They demanded shell access. They required additional packages. They stored unique state on each machine, so automated builds were no longer viable. All of a sudden, the SAs found themselves managing 500 individual servers rather than a blade farm. Other divisions had also signed up for the service and made the same demands.
Two things could have prevented this problem. First, more attention to detail at the requirements-gathering stage might have foreseen the need for developer access, which could then have been included in the design. Second, management should have been more disciplined. Once the developers started requesting access, management should have set down limits that would have prevented the system from devolving into hundreds of custom machines. The original goal of a utility providing access to many similar CPUs should have been applied to the entire life cycle of the system, not just used to design it.
4.3 Conclusion
We make different decisions when purchasing servers because multiple customers depend on them, whereas a workstation client is dedicated to a single customer. Different economics drive the server hardware market versus the desktop market, and understanding those economics helps one make better purchasing decisions. Servers, like all hardware, sometimes fail, and one must therefore have some kind of maintenance contract or repair plan, as well as data backup/restore capability. Servers should be in proper machine rooms to provide a reliable environment for operation (we discuss data center requirements in Chapter 5, Services). Space in the machine room should be allocated at purchase time, not when a server arrives. Allocate power, bandwidth, and cooling at purchase time as well.
Server appliances are hardware/software systems that contain all the software that is required for a particular task preconfigured on hardware that is tuned to the particular application. Server appliances provide high-quality solutions engineered with years of experience in a canned package and are likely to be much more reliable and easier to maintain than homegrown solutions. However, they are not easily customized to unusual site requirements.
Servers need the ability to be remotely administered. Hardware/software systems allow one to simulate console access remotely. This frees up machine room space and enables SAs to work from their offices and homes. SAs can respond to maintenance needs without the overhead of traveling to the server location.
To increase reliability, servers often have redundant systems, preferably in n + 1 configurations. Having a mirrored system disk, redundant power supplies, and other redundant features enhances uptime. Being able to swap dead components while the system is running provides better MTTR and less service interruption. Although this redundancy may have been a luxury in the past, it is often a requirement in today’s environment.
This chapter illustrates our theme of completing the basics first so that later, everything else falls into place. Proper handling of the issues discussed in this chapter goes a long way toward making the system reliable, maintainable, and repairable. These issues must be considered at the beginning, not as an afterthought.
Exercises
1. What servers are used in your environment? How many different vendors are used? Do you consider this to be a lot of vendors? What would be the benefits and problems with increasing the number of vendors? Decreasing?
2. Describe your site’s strategy in purchasing maintenance and repair contracts. How could it be improved to be cheaper? How could it be improved to provide better service?
3. What are the major and minor differences between the hosts you install for servers versus clients’ workstations?
4. Why would one want hot-swap parts on a system without n + 1 redundancy?
5. Why would one want n + 1 redundancy if the system does not have hot-swap parts?
6. Which critical hosts in your environment do not have n + 1 redundancy or cannot hot-swap parts? Estimate the cost to upgrade the most critical hosts to n + 1.
7. An SA who needed to add a disk to a server that was low on disk space chose to wait until the next maintenance period to install the disk rather than do it while the system was running. Why might this be?
8. What services in your environment would be good candidates for replacing with an appliance (whether or not such an appliance is available)? Why are they good candidates?
9. What server appliances are in your environment? What engineering would you have to do if you had instead purchased a general-purpose machine to do the same function?
Chapter 5
Services
A server is hardware. A service is the function that the server provides. A service may be built on several servers that work in conjunction with one another. This chapter explains how to build a service that meets customer requirements, is reliable, and is maintainable.
Providing a service involves not only putting together the hardware and software but also making the service reliable, scaling the service’s growth, and monitoring, maintaining, and supporting it. A service is not truly a service until it meets these basic requirements.
One of the fundamental duties of an SA is to provide customers with the services they need. This work is ongoing. Customers’ needs will evolve as their jobs and technologies evolve. As a result, an SA spends a considerable amount of time designing and building new services. How well the SA builds those services determines how much time and effort will have to be spent supporting them in the future and how happy the customers will be.
A typical environment has many services. Fundamental services include DNS, email, authentication services, network connectivity, and printing.1 These services are the most critical, and they are the most visible if they fail. Other typical services are the various remote access methods, network license service, software depots, backup services, Internet access, DHCP, and file service. Those are just some of the generic services that system administration teams usually provide. On top of those are the business-specific services that serve the company or organization: accounting, manufacturing, and other business processes.
1. DNS, networking, and authentication are services on which many other services rely. Email and printing may seem less obviously critical, but if you ever do have a failure of either, you will discover that they are the lifeblood of everyone’s workflow. Communications and hardcopy are at the core of every company.
Services are what distinguish a structured computing environment that is managed by SAs from an environment in which there are one or more stand-alone computers. Homes and very small offices typically have a few stand-alone machines providing services. Larger installations are typically linked through shared services that ease communication and optimize resources. When it connects to the Internet through an Internet service provider, a home computer uses services provided by the ISP and the other people that the person connects to across the Internet. An office environment provides those same services and more.
5.1 The Basics
Building a solid, reliable service is a key role of an SA, who needs to consider many basics when performing that task. The most important thing to consider at all stages of design and deployment is the customers’ requirements. Talk to the customers and find out what their needs and expectations are for the service.2 Then build a list of other requirements, such as administrative requirements, that are visible only to the SA team. Focus on the what rather than the how. It’s easy to get bogged down in implementation details and lose sight of the purpose and goals.
We have found great success through the use of open protocols and open architectures. You may not always be able to achieve this, but it should be considered in the design.
Services should be built on server-class machines that are kept in a suitable environment and should reach reasonable levels of reliability and performance. The service and the machines that it relies on should be monitored, and failures should generate alarms or trouble tickets, as appropriate.
2. Some services, such as name service and authentication service, do not have customer requirements other than that they should always work and they should be fast and unintrusive.
Most services rely on other services. Understanding in detail how a service works will give you insight into the services on which it relies. For example, almost every service relies on DNS. If machine names or domain names are configured into the service, it relies on DNS; if its log files contain the names of hosts that used the service or were accessed by the service, it uses DNS; if the people accessing it are trying to contact other machines through the service, it uses DNS. Likewise, almost every service relies on the network, which is also a service. DNS relies on the network; therefore, anything that relies on DNS also relies on the network. Some services rely on email, which relies on DNS and the network; others rely on being able to access shared files on other computers. Many services also rely on the authentication and authorization service to be able to distinguish one person from another, particularly where different levels of access are given based on identity. The failure of some services, such as DNS, causes cascading failures of all the other services that rely on them. When building a service, it is important to know the other services on which it relies.
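One way to make these chains of reliance explicit is to record them as a graph and compute the transitive dependencies. The sketch below is only an illustration: the service names, the specific edges (for example, that authentication relies on DNS), and the functions all_dependencies and affected_by are assumptions, not a description of any particular site.

```python
from collections import deque

# Hypothetical map: each service -> the services it relies on directly.
DEPENDS_ON = {
    "network": set(),
    "dns": {"network"},
    "email": {"dns", "network"},
    "authentication": {"dns", "network"},
    "file_service": {"authentication", "network"},
    "web_app": {"email", "authentication"},
}


def all_dependencies(service: str, graph: dict) -> set:
    """Return every service that `service` relies on, directly or indirectly."""
    seen = set()
    queue = deque(graph[service])
    while queue:
        dep = queue.popleft()
        if dep not in seen:
            seen.add(dep)
            queue.extend(graph[dep])  # follow the chain one level deeper
    return seen


def affected_by(failed: str, graph: dict) -> set:
    """Return every service that would be hit by a failure of `failed`."""
    return {s for s in graph if failed in all_dependencies(s, graph)}


print(sorted(all_dependencies("web_app", DEPENDS_ON)))
print(sorted(affected_by("dns", DEPENDS_ON)))
```

In this hypothetical map, a DNS failure reaches every service except the network itself, which matches the cascading behavior described above.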
Machines and software that are part of a service should rely only on hosts and software that are built to the same standards or higher. A service can be only as reliable as the weakest link in the chain of services on which it relies. A service should not gratuitously rely on hosts that are not part of the service. Access to server machines should be restricted to SAs for reasons of reliability and security. The more people who are using a machine and the more things that are running on it, the greater the chance that bad interactions will happen. Machines that customers use also need to have more things installed on them so that the customers can access the data they need and use other network services.
Similarly, a system is only as secure as its weakest link. The security of client systems is no stronger than the weakest link in the security of the infrastructure. Someone who can subvert the authentication server can gain access to clients that rely on it; someone who can subvert the DNS servers could redirect traffic from the client and potentially gain passwords. If the security system relies on that subverted DNS, the security system is vulnerable. Restricting login and other kinds of access to machines in the security infrastructure reduces these kinds of risk.
A server should be as simple as possible. Simplicity makes machines more reliable and easier to debug when they do have problems. Servers should have the minimum that is required for the service they run; only SAs should have access to them; and the SAs should log in to them only to do maintenance. Servers are also more sensitive from a security point of view than desktops are. An intruder who can gain administrative access to a server can typically do more damage than with administrative access to a desktop machine. The fewer people who have access and the less that runs on the machine, the lower the chance that an intruder can gain access, and the greater the chance that an intruder will be spotted.
An SA has several decisions to make when building a service: from what