to run their own Java Virtual Machine. Devices that do not have a Java Virtual Machine can be adapted with ‘surrogate’ devices. A surrogate device is like a ‘ghost’ of the actual device, kept on a fixed infrastructure host; the surrogate acts as a mediator between the actual device and a Jini interface.

These auto-configuration protocols will make future computing device configuration less of a headache for system administrators, and will allow us to take advantage of short-range wireless networking protocols such as Bluetooth. Of course, they do not really remove the need for a system administrator, but they push the need for administration up a layer of abstraction. Administrators will no longer need to fiddle with device drivers on individual hosts; rather, they will be tasked with maintaining the Java infrastructure, including setting up the appropriate bindings within JNDI and similar directory services. This is a simpler and more rational approach to device management.

6.5 Creating infrastructure

Until recently, little attention was given to analyzing methodologies for the construction of efficient and stable networks from the ground up, although some case studies of large-scale installations were made [170, 112, 289, 60, 215, 179, 129, 276, 149, 212, 164, 107]. One interesting exception is the discussion of human roles and delegation in network management in refs. [207, 135]. With the explosion in the number of hosts joined into networks, several authors have begun to address the problem of defining an infrastructure model which is stable, reproducible and robust to accidents and upgrades [41, 108, 305, 44].

The term ‘bootstrapping an infrastructure’ was coined by Traugott and Huddleston in ref. [305] and nicely summarizes the basic intent. Both Evard [108] and Traugott and Huddleston have analyzed practical case studies of system infrastructures, for large networks (4000 hosts) as well as small ones (as few as 3 hosts). Interestingly, Evard’s conclusions, although reached independently of the work of Burgess [39, 41, 55, 42, 43], clearly vindicate the theoretical model used in constructing the tool cfengine.

6.5.1 Principles of stable infrastructure

The principles on which we would like to build an infrastructure are straightforward, and build upon the idea of predictability under load.

Principle 32 (Scalability). Any model of system infrastructure must be able to scale efficiently to large numbers of hosts (and perhaps subnets, depending on the local netmask).

A model which does not scale efficiently with numbers of hosts is likely to fail quickly, as networks tend to expand rapidly beyond expectations.

Principle 33 (Reliability). Any model of system infrastructure must have reliability as one of its chief goals. Down-time can often be measured in real money.


Reliability is not just about the initial quality of hardware and software, but also about the need for preventative maintenance.

Corollary to principle (Redundancy). Reliability is safeguarded by redundancy, or backup services running in parallel, ready to take over at a moment’s notice [285].

Although redundancy does not prevent problems, it aids swift recovery. Barber has discussed improved server availability through redundancy [26]. High-availability clusters and mainframes are often used to address this problem. Gomberg et al. have compared scalable software installation methods on Unix and NT [132]. A refinement of the principle of homogeneity can be stated here, in its rightful place:

Principle 34 (Homogeneity/Uniformity II). A model in which all hosts are basically similar is i) easier to understand conceptually both for users and administrators, ii) cheaper to implement and maintain, and iii) easier to repair and adapt in the event of failure.

and finally:

Corollary to principle (Reproducibility). Avoid improvised, on-the-fly system modifications which are not reproducible. It is easy to forget what was done, and this will make the functioning of the system difficult to understand and predict, for you and for others.

The issue of convergence towards a stable state is central here (see section 6.7). Basically, convergence means that a system should always get closer to an ideal configuration, rather than farther away from it. This signals the need for continual maintenance of the system. The convergence idea will return several times throughout the book.

6.5.2 Virtual machine model

Traugott and Huddleston [305] have eloquently argued that one should think of a networked system not so much as a loose association of hosts, but rather as a large virtual machine composed of associated organs. It is a small step from viewing a multitasking operating system as a collaboration between many specialized processes, to viewing the entire network as a distributed collaboration between specialized processes on different hosts. There is little or no difference in principle between an internal communication bus and an external communication bus. This would seem to suggest that the idea of peer association, described in section 6.3, is to be abandoned, but that need not be the case: there are several levels at which one can interpret the models in section 6.3. One must first specify what a node is. What Traugott and Huddleston observe is that it makes sense to treat tightly collaborating clusters as a unit.

Many sites adopt specific policies and guidelines in order to create this seamless virtual environment [58] by limiting the magnitude of the task. Institutions with a history of managing large numbers of hosts have a tradition of either adapting imperfect software to their requirements or creating their own. Tools such as make, which have been used to jury-rig configuration schemes [305], can now be replaced by more specific tools like cfengine [41, 55]. As with all things, getting started is the hard part.


6.5.3 Creating uniformity through automation

Simple, robust infrastructure is created by planning a system which is easy to understand and maintain. If we want hosts to have the same software and facilities, creating a general uniformity, we need to employ automation to keep track of changes [154, 41, 55]. To begin, we must formulate the needs of and potential threats to system availability. That means planning resources, as in the foregoing sections, and planning the actual motions required to implement and maintain the system. If we can formalize those needs by writing them in the form of a policy, program or script, then half the battle is already won, and we have automatic reproducibility.

Principle 35 (Abstraction generalizes). Expressing tasks in an operating system independent language reduces time spent debugging, promotes homogeneity and avoids unnecessary repetition.

A script implies reproducibility, since it can be rerun on any host. The only obstacle to this is that not all script languages work on all systems.

Suggestion 8 (Platform independent languages). Use languages and tools which are independent of operating system peculiarities, e.g. cfengine, Perl, Python. More importantly, use the right tool for the right job.

Perl is particularly useful, since it runs on most platforms and is about as operating-system independent as it is possible to be. The disadvantage of Perl is that it is a low-level programming language, which requires us to code with a level of detail that can obscure the purpose of the code. Cfengine was invented to address this problem: it is a very high-level interface to system administration. It is also platform independent, and runs on most systems. Its advantage is that it hides the low-level details of programming, allowing us to focus on the structural decisions. We shall discuss this further below.
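For illustration only, here is a minimal sketch of such a policy written as a plain Bourne shell script. The paths, the location of the site master copy and the seven-day threshold are invented for the example, and a tool such as cfengine or Perl would express the same intent more portably; the point is simply that the policy is now written down and can be rerun on any host:

    #!/bin/sh
    # Hypothetical policy script: illustrative paths and thresholds only.

    MASTER=/site/masterfiles/etc/motd    # assumed location of the site master copy
    TARGET=/etc/motd

    # Convergent check-then-fix: act only if the host deviates from policy.
    if ! cmp -s "$MASTER" "$TARGET"; then
        cp "$MASTER" "$TARGET"
        chmod 644 "$TARGET"
    fi

    # Garbage collection: remove temporary files untouched for more than 7 days.
    find /tmp -type f -mtime +7 -exec rm -f {} \;

Because the script checks before it acts, rerunning it on an already-correct host changes nothing; this is precisely the convergent behaviour discussed in section 6.5.1.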

6.5.4 Revision control

One approach to the configuration of hosts is to have a standard set of files in a file-base which can simply be copied into place. Several administration tools have been built on this principle, e.g. Host Factory [110]. The Revision Control System (RCS), designed by Tichy [302], was created as a repository for files, in which changes can be traced through a number of evolving versions. RCS was introduced as a tool for programmers, to track bug fixes and improvements through a string of versions; CVS is an extended front-end to it. System configuration is a similar problem, since it involves modifying the contents of many key files. Many administrators have made use of revision control systems to keep track of configuration file changes, though little has been written about it. PC management with RCS has been discussed by Rudorfer [261]. Revision control is a useful way of keeping track of text-file changes, but it does not help us with other aspects of system maintenance, such as file permissions, process management or garbage collection.
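As a concrete, if hedged, illustration of this practice, the standard RCS commands can be applied directly to configuration files. The file names and log messages below are examples only:

    # Create an RCS subdirectory so that the ,v history files are kept tidily.
    mkdir -p /etc/RCS

    # Check the file in, keeping a locked working copy in place.
    ci -l -m"initial version as installed" /etc/hosts

    # ... later, after editing /etc/hosts ...
    rcsdiff /etc/hosts                     # show changes since the last revision
    ci -l -m"added new web server entry" /etc/hosts

    # Retrieve an earlier revision if a change turns out to be a mistake.
    co -r1.1 -p /etc/hosts > /tmp/hosts.r1.1

The history recorded in this way documents who changed what and when, which goes some way towards the reproducibility demanded above.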


6.5.5 Software synchronization

In section 3.8.9 we discussed the distribution of data amongst a network community. This technique can be used to maintain a level of uniformity in the software used around the network. Software synchronization has been discussed in refs. [27, 147, 282]. Distribution by package mechanisms was pioneered by Hewlett-Packard [256] with the install program. For some software packages, Hewlett-Packard uses cfengine as a software installation tool [55]. Distribution by placement on network filesystems like AFS has been discussed in [183].

6.5.6 Push models and pull models

Revision control does not address the issue of uniformity unless the contents of the file-base can be distributed to many different hosts. There are two types of distribution mechanism, which are generally referred to as push and pull models of distribution.

Push: The push model is epitomized by the rdist program. Pushing files from a central location to a number of hosts is a way of forcing a file to be written to a group of hosts. The central repository decides when changes are to be sent, and the hosts which receive the files have no choice about receiving them [203]. In other words, control over all of the hosts is exercised by the central repository. The advantage of this approach is that it can be made efficient: a push model is more easily optimized than a pull approach. The disadvantage of a push model is that hosts have no freedom to decide their own fate. A push model forces all hosts to open themselves to a central will, which can be a security hazard. In particular, rdist requires a host to grant not just file access, but complete privilege, to the distributing host. Another problem with push models is the need to maintain a list of all the hosts to which data will be pushed. For large numbers of hosts, this can become unwieldy.

Pull: The pull model is represented by cfengine and rsync. With a pull model, each host decides to collect files from a central repository, of its own volition. The advantage of this approach is that there is no need to open a host to control from outside, other than the trust implied by accepting configuration files from the distributing host. This has significant security advantages. It was recommended as a model of centralized system administration in refs. [265, 55, 305]. The main disadvantage to this method is that optimization is harder. rsync addresses this problem by using an ingenious algorithm for transmitting only file changes, and thus achieves a significant compression of data, while cfengine uses multi-threading to increase server availability.
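As a minimal sketch of the pull arrangement, with an invented repository host, module name and schedule, each client might run something like the following from its own crontab, so that the client, not the server, decides when to fetch the current file-base:

    #!/bin/sh
    # Run from each client's crontab, e.g. once an hour:
    #   17 * * * * /usr/local/sbin/pull-config
    # The repository host, module name and destination are examples only.

    rsync -a --delete rsync://repository.example.org/config/ /usr/local/config-mirror/

No account on the client ever needs to be opened to the repository host; the only trust involved is in the files that are fetched.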

6.5.7 Reliability

One of the aims of building a sturdy infrastructure is to cope with the results of failure. Failure can encompass hardware and software. It includes downtime due to physical error (power, network cables and CPUs) and also downtime due to software crashes. The net result of any failure is loss of service. Our only defence against actual failure is parallelism, or redundancy. When one component fails, another can be ready to take over. Often it is possible to prevent failure with proactive maintenance (see the next chapter for more on this issue). For instance, it is possible to vacuum-clean hosts to prevent electrical short-circuits, and to perform garbage collection which can prevent software error. System monitors (e.g. cfengine) can ensure that crashed processes get restarted, thus minimizing loss. Reliability is clearly a multifaceted topic. We shall return to discuss reliability more quantitatively in section 13.5.10.

The effects of component failure can be masked by parallelism, or redundancy. One way to think about this is to regard a computer system as providing a service which is characterized by a flow of information. If we consider figure 6.10, it is clear that a flow of service can continue when servers work in parallel, even if one or more of them fails. In figure 6.11, it is equally clear that systems which depend on other systems are coupled in series, and a single failure prevents the flow of service. Of course, servers do not usually work truly in parallel: the normal situation is to employ a fail-over capability. This means that we provide a backup service; if the main service fails, we replace it with the backup server, which is not otherwise used. Only in a few cases does one find load-sharing by switching between (de-multiplexing) services on different hosts. Network Address Translation can be used for this purpose (see figure 2.11).

Figure 6.10: System components in parallel, implying redundancy.

Figure 6.11: System components in series, implying dependency.
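To make the intuition behind figures 6.10 and 6.11 slightly more quantitative, one can quote the standard expressions for combining the reliabilities of independently failing components (the book returns to this properly in section 13.5.10). If component i functions with probability R_i, then

    R_series   = \prod_i R_i
    R_parallel = 1 - \prod_i (1 - R_i)

For example, two servers in series, each available 99% of the time, deliver only about 98% availability, whereas the same two servers in parallel deliver roughly 99.99%: redundancy multiplies the small failure probabilities rather than the success probabilities.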

6.6 System maintenance models

Models of system maintenance have evolved by distilling locally acquired experience from many sites. In later years, attempts have been made to build software systems which apply certain principles to the problem of management. Network management has, to some extent, been likened to the process of software development in the System Administration Maturity Model by Kubicki [187]. This work was an important step in formalizing system administration. Later, a formalization was introduced by describing system administration in terms of automatable primitives.

Unix administrators have run background scripts to perform system checks and maintenance for many years. Such scripts (often called sanity checking scripts) run daily or hourly and make sure that each system is properly configured, perform garbage cleaning and report any serious problems to an administrator. In an immunological model, the aim is to minimize the involvement of a human being as far as possible.
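Concretely, such a script is normally driven from cron. The following sketch is purely illustrative (the thresholds, daemon paths and report address are invented), but it shows the typical shape: a periodic check that repairs what it safely can and reports the rest to a human.

    #!/bin/sh
    # Hypothetical hourly sanity check, run from root's crontab:
    #   0 * * * * /usr/local/sbin/sanity-check
    # Thresholds, daemon paths and the report address are examples only.

    ADMIN=sysadmin@example.org
    REPORT=/tmp/sanity-report.$$

    {
      # Warn if the root filesystem is more than 90% full.
      USED=`df -P / | awk 'NR==2 { sub("%","",$5); print $5 }'`
      [ "$USED" -gt 90 ] && echo "Root filesystem ${USED}% full"

      # Make sure the system logger is still running; restart it if not.
      if ! pgrep syslogd >/dev/null 2>&1; then
          echo "syslogd was not running; restarting it"
          /usr/sbin/syslogd
      fi
    } > "$REPORT"

    # Only trouble the administrator if something was actually found.
    [ -s "$REPORT" ] && mail -s "sanity check report" "$ADMIN" < "$REPORT"
    rm -f "$REPORT"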

Windows can be both easier and harder to administer than Unix. It can be easier because the centralized model of having a domain server run all the network services means that all configuration information can be kept in one place (on the server), and that each workstation can be made (at least to a limited degree) to configure itself from the server’s files. It is harder to administer because the tools provided for system administration tasks work mainly through the GUI (graphical user interface), which is not a suitable tool for addressing the needs of hundreds of hosts.

Several generalized approaches to the management of computers in a network have emerged.

6.6.1 Reboot

With the rapid expansion of networks, the number of local networks has outgrown the number of experienced technicians. The result is that there are many administrators who are not skilled in the systems they are forced to manage. A disturbing but common belief, which originated in the 1980s microcomputer era, is that problems with a computer can be fixed by simply rebooting the operating system. Since home computer systems tend to crash with alarming regularity, this is a habit which has been acquired from painful experience. One learns nothing from this procedure, however, and the same strategy can cause problems for machines that are part of a larger system. Just because a terminal hangs, it does not mean that the host is not working at something important.

Although rebooting or powering down can appear to remove the immediate obstacle, and in a few cases might be the only course of action, one also stands to lose data that might have been salvaged, and perhaps to interrupt the machine’s interaction with remote hosts. Rebooting a multi-user system is dangerous, since users might be logged in from remote locations and could lose data and service.

6.6.2 Manual administration

The default approach to system management is to allow qualified humans to do everything by hand. This approach suffers from a lack of scalability. It suffers from human flaws and a lack of intrinsic documentation. Humans are not well-disciplined at documenting their work, or their intended configurations. There are also issues concerned with communication and work in a team, which can interfere