Burgess M.Principles of network and system administration.2004.pdf

CHAPTER 7. CONFIGURATION AND MAINTENANCE

The Perl language (see appendix B.2) is a curious hybrid of C, Bourne shell and C-shell, together with a number of extra features which make it ideal for dealing with text files and databases. Since most system administration tasks deal with these issues, this places Perl squarely in the role of system programming. Perl is semi-compiled at runtime, rather than interpreted line-by-line like the shell, so it gains some of the advantages of compiled languages, such as syntax checking before execution. This makes it a safer and more robust language. It is also portable (something which shell scripts are not [19]). Although introduced as a scripting language, like all languages, Perl has been used for all manner of things for which it was never intended. Scripting languages have arrived on the computing scene with an alacrity which makes them a favorable choice for anyone wanting to get code running quickly. This is naturally a mixed blessing. What makes Perl a winner over many other special languages is that it is simply too convenient to ignore for a wide range of frequently required tasks. By adopting the programming idioms of well-known languages, as well as all the basic functions in the C library, Perl ingratiates itself with system administrators and becomes an essential tool.

7.9 Preventative host maintenance

In some countries, local doctors do not get paid if their patients get sick. This motivates them to practice preventative medicine, thus keeping the population healthy and functional at all times. A computer system which is healthy and functional is always equipped to perform the task it was intended for. A sick computer system is an expensive loss, in downtime and in human resources spent fixing the problem. It is surprising how effective a few simple measures can be toward stabilizing a system.

The key principle which we have to remember is that system behavior is a social phenomenon, an interaction between users’ habits and resource availability. In any social or biological system, survival is usually tied to the ability of the system to respond to threats. In biology we have immunity and repair systems; in society we have emergency services like fire, police, paramedics and the garbage collection service, combined with routines and policy (‘the law’). We scarcely notice these services until something goes wrong, but without them our society would quickly decline into chaos.

7.9.1 Policy decisions

A policy of prevention requires system managers to make several important decisions. Let’s return for a moment to the idea that users are the greatest danger to the stability of the system; we need to strike a balance between restricting their activities and allowing them freedom. Too many rules and restrictions lead to unrest and bad feelings, while too much freedom leads to anarchy. Finding a balance requires a policy decision to be made. The policy must be digested, understood and, not least, obeyed by users and system staff alike.

Determine the system policy. This is the prerequisite for all system maintenance. Know what is right and wrong and know how to respond to a crisis. Again, as we have reiterated throughout, no policy can cover every eventuality, nor should it be a substitute for thinking. A sensible policy will allow for sufficient flexibility (fault tolerance). A rigid policy is more likely to fail.

Sysadmin team agreement. The team of system administrators needs to work together, not against one another. That means that everyone must agree on the policy and enforce it.

Expect the worst. Be prepared for system failure and for rules to be broken. Some kind of police service is required to keep an eye on the system. We can use a script, or an integrated approach like cfengine for this.

Educate users in good and bad practice. Ignorance is our worst enemy. If we educate users in good practice, we reduce the problem of policy transgressions to a few ‘criminal’ users, looking to try their luck. Most users are not evil, just uninformed.

Special users. Do some users require special attention, extra resources or special assistance? An initial investment catering to their requirements can save time and effort in the long run.

7.9.2 General provisions

Damage and loss can come in many forms: by hardware failure, resource exhaustion (full disks, excessive load), by security breaches and by accidental error. General provisions for prevention mean planning ahead in order to prevent loss, but also minimizing the effects of inevitable loss.

Do not rely exclusively on service or support contracts with vendors. They can be unreliable and unhelpful, particularly in an organization with little economic weight. Vendor support helpdesks usually cannot diagnose problems over the phone and a visit can take longer than is convenient, particularly if a larger customer also has a problem at the same time. Invest in local expertise.

Educate users by posting information in a clear and friendly way.

Make rules and structure as simple as possible, but no simpler.

Keep valuable information about configuration securely, but readily, available.

Document all changes and make sure that co-workers know about them, so that the system will survive, even if the person who made the change is not available.

Do not make changes just before going away on holiday: there are almost always consequences which need to be smoothed out.

Be aware of system limitations, hardware and software capacity. Do not rely on something to do a job it was not designed for.

Work defensively and follow the pulse of the system. If something looks unusual, investigate and understand what is happening.

Avoid gratuitous changes to things which already work adequately. ‘If it ain’t broke, don’t fix it’, but still aim for continuous but cautious improvement.

Duplication of service and data gives us a fallback which can be brought to bear in a crisis.

Vendors often like to pressure sites into signing expensive service contracts. Today’s computer hardware is quite reliable: for the cost of a service contract it might be possible to buy several new machines each year, so one can ask the question: should we write off occasional hardware failure as acceptable loss, or pay the one-off repair bill? If one chooses this option, it is important to have another host which can step in and take over the role of the old one, while a replacement is being procured. Again, this is the principle of redundancy. The economics of service contracts need to be considered carefully.

7.9.3 Garbage collection

Computer systems have no natural waste disposal system. If computers were biological life, they would have perished long ago, poisoned by their own waste. No system can continue to function without waste disposal. It is a thermodynamic impossibility to go on using resources forever, without releasing some of them again. That process must come to an end.

Garbage collection in a computer system refers to two things: disk files and processes. Users seldom clear garbage of their own accord, either because they are not really aware of it, or because they have an instinctive fear of throwing things away. Administrators have to enforce and usually automate garbage collection as a matter of policy. Cfengine can be used to automate this kind of garbage collection.

Disk tidying: Many users are not even aware that they are building up junk files. Junk files are often the by-product of running a particular program. Ordinary users will often not even understand all of the files which they accumulate and will therefore be afraid to remove them. Moreover, few users are educated to think of their responsibilities as individuals to the system community of all users, when it comes to computer systems. It does not occur to them that they are doing anything wrong by filling the disk with every bit of scrap they take a shine to.

Process management: Processes, or running programs, do not always complete in a timely fashion. Some buggy processes go amok and consume CPU cycles by executing infinite loops, others simply hang and fail to disappear. On multiuser systems, terminals sometimes fail to terminate their login processes properly and will leave whole hierarchies of idle processes which do not go away by themselves. This leads to a gradual filling of the process table. In the end, the accumulation of such processes will prevent new programs from being started. Processes are killed with the kill command on Unix-like systems, or with the Windows Resource Kit’s kill command, or the Task Manager.
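As an illustration of the disk tidying described above, here is a minimal Python sketch that lists candidate junk files under a directory tree. The pattern list, the seven-day age threshold and the function name find_junk are assumptions made for this example; a real site would take them from its own policy. The sketch only reports files and deliberately does not delete anything.

```python
import fnmatch
import os
import time

# Hypothetical junk patterns; a real site would set these by policy.
JUNK_PATTERNS = ["core", "*~", "*.bak", "#*#", "a.out"]


def find_junk(root, max_age_days=7, patterns=JUNK_PATTERNS, now=None):
    """Return paths under root whose names match a junk pattern and which
    have not been modified for at least max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not any(fnmatch.fnmatch(name, p) for p in patterns):
                continue
            path = os.path.join(dirpath, name)
            try:
                mtime = os.lstat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if mtime <= cutoff:
                stale.append(path)
    return sorted(stale)
```

Calling find_junk("/home") would return a sorted list of paths for review. Whether such files are then removed automatically, or merely reported to their owners, is a policy decision of the kind discussed in section 7.9.1.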

7.9.4 Productivity or throughput

Throughput is how much real work actually gets done by a computer system. How efficiently is the system fulfilling its purpose or doing its job? The policy decisions we make can have an important bearing on this. For instance, we might think that the use of disk quotas would be beneficial to the system community because then no user would be able to consume more than his or her fair share of disk space. However, this policy can be misguided. There are many instances (during compilation, for instance) where users have to create large temporary files which can later be removed. Rigid disk quotas can prevent a user from performing legitimate work; they can get in the way of the system throughput. Limiting users’ resources can have exactly the opposite effect of that which was intended.
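One alternative to a rigid quota is to measure usage and flag heavy consumers for follow-up, without blocking anyone's work. The following Python sketch is one way this might look. It assumes that each immediate subdirectory of a root such as /home belongs to one user, and the function names and the soft limit are hypothetical. It sums apparent file sizes rather than allocated disk blocks, so the figures are approximate.

```python
import os


def usage_by_directory(root):
    """Map each immediate subdirectory of root to its total size in bytes.

    Sizes are apparent file sizes (st_size), not allocated disk blocks.
    """
    totals = {}
    for entry in os.scandir(root):
        if not entry.is_dir(follow_symlinks=False):
            continue
        total = 0
        for dirpath, _dirnames, filenames in os.walk(entry.path):
            for name in filenames:
                try:
                    total += os.lstat(os.path.join(dirpath, name)).st_size
                except OSError:
                    pass  # file removed while we were scanning
        totals[entry.name] = total
    return totals


def over_soft_limit(totals, soft_limit_bytes):
    """Return (name, bytes) pairs above the soft limit, largest first."""
    heavy = [(n, b) for n, b in totals.items() if b > soft_limit_bytes]
    return sorted(heavy, key=lambda item: item[1], reverse=True)
```

A report built from over_soft_limit(usage_by_directory("/home"), limit) tells the administrator who to talk to, while a user in the middle of a large compilation is not stopped by it.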

Another example is in process management. Some jobs require large amounts of CPU time and take a long time to run: intensive calculations are an example of this. Conventional wisdom is to reduce the process priority of such jobs so that they do not interfere with other users’ interactive activities. On Unix-like systems this means using the nice command to lower the priority of the process. However, this procedure can also be misguided. Lowering the priority of a process can lead to process starvation. Lowering the priority means that the heavy job will take even longer, and might never complete at all. An alternative strategy is to do the reverse: increasing the priority of a heavy task will get rid of it more quickly. The work will be finished and the system will be cleared of a demanding job, at the cost of some inconvenience for other users over a shorter period of time. We can summarize this in a principle:

Principle 42 (Resource chokes and drains). Moderating resource availability to key processes can lead to poor performance and low productivity. Conversely, with free access to resources, resource usage needs to be monitored to avoid the problem of runaway consumption, or the exploitation of those resources by malicious users.
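The priority adjustment discussed above can be tried out from a script. The Python sketch below starts a child process with a modified nice value on a Unix-like system; the helper name run_with_niceness is an assumption for this example. A positive increment lowers the scheduling priority. The reverse strategy, a negative increment, normally requires superuser privileges.

```python
import os
import subprocess
import sys


def run_with_niceness(cmd, increment):
    """Run cmd as a child process whose nice value is changed by increment.

    A positive increment lowers scheduling priority. Negative increments
    (raising priority) normally require superuser privileges.
    """
    def adjust():
        os.nice(increment)  # runs in the child, just before the exec

    return subprocess.run(cmd, preexec_fn=adjust,
                          capture_output=True, text=True)
```

For example, run_with_niceness([sys.executable, "-c", "import os; print(os.nice(0))"], 5) runs a child that reports its own nice value, since os.nice(0) returns the current value without changing it.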

7.10 SNMP tools

In spite of its limitations (see section 6.4.1), SNMP remains the protocol of choice for the management of most network hardware, and many tools have been written to query and manage SNMP-enabled devices.

The fact that SNMP is a simple read/write protocol has motivated programmers to design simple tools that focus more on the SNMP protocol itself than on the semantics of the data structures described in MIBs. In other words, existing tools try to be generic instead of doing something specific and useful. Typical examples are so-called MIB browsers that help users to browse and manipulate raw MIB data. Such tools usually understand only the machine-parseable parts of a MIB module, which is just adequate to shield users from the bulk of the often arcane numbers used in the protocol. Other examples are scripting language APIs which provide a ‘programmer-friendly’ view of the SNMP protocol. However, in order to realize more useful management applications, it is necessary to understand the semantics of and the relationships between MIB variables. Generic tools require that the users have this knowledge, which is not always the case.
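To make the point about generic tools concrete: a command-line walk with the Net-SNMP snmpwalk tool typically prints one line per variable in the form ‘OID = TYPE: value’. Splitting those lines is easy, as the Python sketch below suggests; knowing what the values mean is the part that a generic tool leaves to the user. The sample lines are hypothetical and were not captured from a real device, and parse_walk is a name chosen for this example.

```python
import re

# Matches lines such as: SNMPv2-MIB::sysName.0 = STRING: printer-XXX
LINE_RE = re.compile(r"^(?P<oid>\S+) = (?P<type>[\w-]+): (?P<value>.*)$")


def parse_walk(text):
    """Parse 'OID = TYPE: value' lines into a list of dicts.

    Lines that do not fit the pattern are skipped.
    """
    records = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            records.append(match.groupdict())
    return records


# Hypothetical sample output, not captured from a real device.
SAMPLE = """\
SNMPv2-MIB::sysDescr.0 = STRING: HP ETHERNET MULTI-ENVIRONMENT
SNMPv2-MIB::sysName.0 = STRING: printer-XXX
IF-MIB::ifNumber.0 = INTEGER: 2
"""
```

The result of parse_walk(SAMPLE) is a flat list of records. Nothing in it says that ifNumber counts interfaces or how it relates to other variables; that knowledge lives in the MIB text and in the user's head.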

PHP

The PHP server-side web page language (an enhanced encapsulation of C) is perhaps the simplest way of extracting MIB data from devices, but it is just a generic, low-level interface. PHP makes use of the NET-SNMP libraries. For example, here is a simple PHP web page that prints all of the SNMP variables for a device and allows the data to be viewed in a web browser:

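<?php
// Walk the device's MIB tree; an empty object ID starts at the root.
$a = snmpwalk("printer.example.org", "public", "");
if ($a === false) {
    // snmpwalk() returns false if the query fails
    die("SNMP query failed");
}
foreach ($a as $value) {
    // Escape each value so it is safe to show in a web page
    echo htmlspecialchars($value) . "<br>";
}
?>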

The community string is written here with its default value ‘public’, but it is assumed that this has been changed to something more private. PHP is well and freely documented online, in contrast with Perl. For monitoring small numbers of devices, and for demonstrating the principles of SNMP, this is an excellent tool. However, for production work, something more sophisticated will be required by most users.

Perl, Tcl etc.

There are several SNMP extensions for Perl; a widely used Perl SNMP API is based on the NET-SNMP implementation and supports SNMPv1, SNMPv2c and SNMPv3. The Perl script shown below is based on the NET-SNMP Perl extension and retrieves information from the routing table defined in the RFC1213-MIB module and displays it in a human-readable format.

The problem with Perl is that it only puts a brave face on the same problems that PHP has: namely, it provides only a low-level interface to the basic read/write operations of the protocol. There is no intelligence to the interface, and it requires a considerable amount of programming to do real management with this interface.
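To give a feeling for the programming effort involved, here is a Python sketch (not the Perl script referred to above) of the bookkeeping needed to turn a raw walk of the RFC1213-MIB routing table into readable rows. A walk returns the table column by column, so the values must be regrouped by row index. The sketch assumes the walk has already been performed and delivered as (OID, value) pairs with numeric OIDs and no leading dot; the function names and the sample data used to try it out are hypothetical.

```python
from collections import defaultdict

# ipRouteEntry in RFC1213-MIB; the columns used here are
# ipRouteDest(1), ipRouteIfIndex(2), ipRouteNextHop(7), ipRouteMask(11).
IP_ROUTE_ENTRY = "1.3.6.1.2.1.4.21.1"
COLUMNS = {1: "dest", 2: "ifindex", 7: "nexthop", 11: "mask"}


def build_route_table(varbinds):
    """Group (oid, value) pairs from a walk of ipRouteTable into rows.

    Each OID has the form <ipRouteEntry>.<column>.<a>.<b>.<c>.<d>, where
    the last four sub-identifiers are the destination address that
    indexes the row.
    """
    rows = defaultdict(dict)
    prefix = IP_ROUTE_ENTRY + "."
    for oid, value in varbinds:
        if not oid.startswith(prefix):
            continue  # not part of the routing table
        column, _, index = oid[len(prefix):].partition(".")
        name = COLUMNS.get(int(column))
        if name is not None:
            rows[index][name] = value
    # Sort rows numerically by destination address
    ordered = sorted(rows, key=lambda i: tuple(int(x) for x in i.split(".")))
    return [rows[index] for index in ordered]


def format_routes(routes):
    """Render rows as a fixed-width text table."""
    layout = "%-16s %-16s %-16s %s"
    lines = [layout % ("Destination", "Mask", "Next hop", "If")]
    for r in routes:
        lines.append(layout % (r.get("dest", "?"), r.get("mask", "?"),
                               r.get("nexthop", "?"), r.get("ifindex", "?")))
    return "\n".join(lines)
```

Even this small example needs knowledge that is not in the protocol itself: which column numbers matter, and that the trailing sub-identifiers form the row index. That is the kind of semantic knowledge a low-level interface leaves to the programmer.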

Another SNMP interface worthy of mention is the Tcl extension, Scotty.

SCLI

One of the most effective ways of interacting with any system is through a command language. With language tools a user can express his or her exact wishes, rather than filtering them through a graphical menu.

The scli package [268, 269] was written to address the need for rational command line utilities for monitoring and configuring network devices. It utilizes a MIB compiler called smidump to generate C stub code. It is easily extensible with a minimum of knowledge about SNMP.

The programs contained in the scli package are specific rather than generic. Generic SNMP tools such as MIB browsers or simple command line tools (e.g. snmpwalk) are hard to use since they expose too many protocol details for most users. Moreover, in most cases, they fail to present the information in a format that is easy to read and understand. A nice feature of scli is that it works like other familiar Unix commands, such as netstat and top, and generates a feeling of true investigative interaction.

host$ scli printer-XXX
100-scli version 0.2.12 (c) 2001-2002 Juergen Schoenwaelder
100-scli trying SNMPv2c ... timeout
100-scli trying SNMPv1 ... ok.
(printer-714) scli > show printer info
Device:            1
Description:       HP LaserJet 5M
Device Status:     running
Printer Status:    idle
Current Operator:
Service Person:
Console Display:   1 line(s) a 40 chars
Console Language:  en/US
Console Access:    operatorConsoleEnabled
Default Input:     input #2
Default Output:    output #1
Default Marker:    marker #1
Default Path:      media path #1
Config Changes:    4
(printer-XXX) scli >

Similarly, ‘top’-like continuous monitoring can be obtained with

printer-XXX> monitor printer console display
Agent:   printer-XXX:161 up 61 days 01:13:49                     13:48:49
Descr:   HP ETHERNET MULTI-ENVIRONMENT,JETDIRECT,JD24,EEPROM A.08.32
IPv4:    7 pps in    5 pps out    0 pps fwd    0 pps rasm    0 pps frag
UDP:     5 pps in    5 pps out
TCP:     0 sps in    0 sps out    2 con est    0 con aopn    0 con popn
Command: monitor printer console display
PRINTER LINE = TEXT =======================================================
11 Done: mark (STDIN):p

Now the fields are continuously updated. This is network traffic intensive, but useful for debugging devices over a short interval of time.
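The ‘pps’ figures in such a display are rates, whereas SNMP agents generally expose packet counts as ever-increasing counters (for example ipInReceives in RFC1213-MIB), so a monitor has to poll repeatedly and divide the difference by the elapsed time. How scli does this internally is not shown here; the following Python sketch is only an assumption about the general technique, including the usual allowance for a 32-bit counter wrapping around between two samples.

```python
COUNTER32_MODULUS = 2 ** 32


def counter_rate(prev_value, prev_time, curr_value, curr_time,
                 modulus=COUNTER32_MODULUS):
    """Per-second rate between two samples of a wrapping SNMP counter.

    Times are in seconds. At most one wrap-around between the two
    samples is handled correctly.
    """
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        raise ValueError("samples must be taken at increasing times")
    delta = (curr_value - prev_value) % modulus  # copes with a single wrap
    return delta / elapsed
```

Each displayed rate costs at least one request per polling interval, which is why continuous monitoring of many fields is traffic intensive.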