CHAPTER 7. CONFIGURATION AND MAINTENANCE
The Perl language (see appendix B.2) is a curious hybrid of C, the Bourne shell and the C-shell, together with a number of extra features which make it ideal for dealing with text files and databases. Since most system administration tasks deal with exactly these, Perl sits squarely in the role of a system programming language. Perl is semi-compiled at runtime, rather than interpreted line by line like the shell, so it gains some of the advantages of compiled languages, such as a syntax check before execution. This makes it a safer and more robust language. It is also portable (something which shell scripts are not [19]). Although introduced as a scripting language, Perl, like all languages, has been used for all manner of things for which it was never intended. Scripting languages let code be written and running quickly, which makes them a favorable choice to anyone in a hurry; this is naturally a mixed blessing. What makes Perl a winner over many other special-purpose languages is that it is simply too convenient to ignore for a wide range of frequently required tasks. By adopting the programming idioms of well-known languages, as well as all the basic functions of the C library, Perl ingratiates itself with system administrators and has become an essential tool.
7.9 Preventative host maintenance
In some countries, local doctors do not get paid if their patients get sick. This motivates them to practice preventative medicine, thus keeping the population healthy and functional at all times. A computer system which is healthy and functional is always equipped to perform the task it was intended for. A sick computer system is an expensive loss, in downtime and in human resources spent fixing the problem. It is surprising how effective a few simple measures can be toward stabilizing a system.
The key principle to remember is that system behavior is a social phenomenon: an interaction between users' habits and resource availability. In any social or biological system, survival is usually tied to the ability of the system to respond to threats. In biology we have immune and repair systems; in society we have emergency services such as fire, police and paramedics, and the garbage collection service, combined with routines and policy ('the law'). We scarcely notice these services until something goes wrong, but without them our society would quickly decline into chaos.
7.9.1 Policy decisions
A policy of prevention requires system managers to make several important decisions. Let's return for a moment to the idea that users are the greatest danger to the stability of the system; we need to strike a balance between restricting their activities and allowing them freedom. Too many rules and restrictions lead to unrest and bad feelings, while too much freedom leads to anarchy. Finding a balance requires a policy decision to be made. The policy must be digested, understood and, not least, obeyed by users and system staff alike.
•Determine the system policy. This is the prerequisite for all system maintenance. Know what is right and wrong and know how to respond to a crisis.
Again, as we have reiterated throughout, no policy can cover every eventuality, nor should it be a substitute for thinking. A sensible policy will allow for sufficient flexibility (fault tolerance). A rigid policy is more likely to fail.
•Sysadmin team agreement. The team of system administrators needs to work together, not against one another. That means that everyone must agree on the policy and enforce it.
•Expect the worst. Be prepared for system failure and for rules to be broken. Some kind of police service is required to keep an eye on the system. We can use a script, or an integrated approach like cfengine for this.
•Educate users in good and bad practice. Ignorance is our worst enemy. If we educate users in good practice, we reduce the problem of policy transgressions to a few ‘criminal’ users, looking to try their luck. Most users are not evil, just uninformed.
•Special users. Do some users require special attention, extra resources or special assistance? An initial investment catering to their requirements can save time and effort in the long run.
7.9.2 General provisions
Damage and loss can come in many forms: by hardware failure, resource exhaustion (full disks, excessive load), by security breaches and by accidental error. General provisions for prevention mean planning ahead in order to prevent loss, but also minimizing the effects of inevitable loss.
•Do not rely exclusively on service or support contracts with vendors. They can be unreliable and unhelpful, particularly in an organization with little economic weight. Vendor support helpdesks usually cannot diagnose problems over the phone and a visit can take longer than is convenient, particularly if a larger customer also has a problem at the same time. Invest in local expertise.
•Educate users by posting information in a clear and friendly way.
•Make rules and structure as simple as possible, but no simpler.
•Keep valuable information about configuration securely, but readily, available.
•Document all changes and make sure that co-workers know about them, so that the system will survive, even if the person who made the change is not available.
•Do not make changes just before going away on holiday: there are almost always consequences which need to be smoothed out.
•Be aware of system limitations, hardware and software capacity. Do not rely on something to do a job it was not designed for.
•Work defensively and follow the pulse of the system. If something looks unusual, investigate and understand what is happening.
•Avoid gratuitous changes to things which already work adequately. ‘If it ain’t broke, don’t fix it’, but still aim for continuous but cautious improvement.
•Duplication of service and data gives us a fallback which can be brought to bear in a crisis.
Vendors often like to pressure sites into signing expensive service contracts. Today's computer hardware is quite reliable: for the cost of a service contract it might be possible to buy several new machines each year, so one can ask: should we write off occasional hardware failure as an acceptable loss and simply pay the one-off repair bill? If one chooses this option, it is important to have another host which can step in and take over the role of the old one while a replacement is being procured. Again, this is the principle of redundancy. The economics of service contracts need to be considered carefully.
7.9.3 Garbage collection
Computer systems have no natural waste-disposal system. If computers were biological life, they would have perished long ago, poisoned by their own waste. No system can continue to function without waste disposal: it is a thermodynamic impossibility to go on consuming resources forever without releasing some of them again, so unchecked accumulation must eventually bring the process to an end.
Garbage collection in a computer system refers to two things: disk files and processes. Users seldom clear garbage of their own accord, either because they are not really aware of it, or because they have an instinctive fear of throwing things away. Administrators have to enforce and usually automate garbage collection as a matter of policy. Cfengine can be used to automate this kind of garbage collection.
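As a sketch of what an automated policy might look like, here is a cfengine-2-style tidy stanza; the paths, patterns and ages are illustrative assumptions, not a recommended configuration:

```
tidy:

   # Remove temporary files not accessed for more than a week
   /tmp     pattern=*     age=7  recurse=inf

   # Remove core dumps from home directories as soon as they appear
   home     pattern=core  age=0  recurse=inf
```

Run regularly from cron or cfexecd, a stanza like this makes garbage collection a matter of declared policy rather than ad hoc cleanup.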
•Disk tidying: Many users are not even aware that they are building up junk files, which are often the by-products of running particular programs. Ordinary users will frequently not understand all of the files which they accumulate and will therefore be afraid to remove them. Moreover, few users are educated to think of their responsibilities, as individuals, to the community of all users of the system. It does not occur to them that they are doing anything wrong by filling the disk with every bit of scrap they take a shine to.
•Process management: Processes, or running programs, do not always complete in a timely fashion. Some buggy processes run amok and consume CPU cycles in infinite loops; others simply hang and fail to disappear. On multiuser systems, terminals sometimes fail to terminate their login processes properly, leaving whole hierarchies of idle processes which do not go away by themselves. This leads to a gradual filling of the process table, and in the end the accumulation of such processes will prevent new programs from being started. Processes are killed with the kill command on Unix-like systems, or with the Windows Resource Kit's kill command or the Task Manager.
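The mechanics of removing a hung process can be demonstrated safely with a disposable stand-in; this sketch assumes only a POSIX shell and the standard signal model (SIGTERM first, SIGKILL only as a last resort):

```shell
# Start a stand-in for a hung process
sleep 1000 &
pid=$!

kill -TERM "$pid"             # polite request to exit (can be caught/ignored)
wait "$pid" 2>/dev/null       # reap the child; non-zero status is expected

# Signal 0 delivers nothing, but reports whether the process still exists
if kill -0 "$pid" 2>/dev/null; then
    kill -KILL "$pid"         # last resort: SIGKILL cannot be ignored
fi
```

A real cleanup script would select its PIDs from ps output (idle login shells, runaway loops) rather than from a shell variable, but the two-step signal sequence is the same.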
7.9.4 Productivity or throughput
Throughput is how much real work actually gets done by a computer system: how efficiently is the system fulfilling its purpose? The policy decisions we make can have an important bearing on this. For instance, we might think that disk quotas would benefit the system community, because then no user could consume more than his or her fair share of disk space. However, this policy can be misguided. There are many situations (during compilation, for example) in which users have to create large temporary files which are later removed. Rigid disk quotas can prevent a user from performing legitimate work; they get in the way of system throughput. Limiting users' resources can have exactly the opposite effect of that which was intended.
Another example is process management. Some jobs require large amounts of CPU time and take a long time to run: intensive calculations are an example. Conventional wisdom is to reduce the priority of such jobs so that they do not interfere with other users' interactive activities; on Unix-like systems this means using the nice command to lower the process's priority. However, this procedure can also be misguided. Lowering the priority can lead to process starvation: the heavy job takes even longer and might never complete at all. An alternative strategy is to do the reverse: increasing the priority of a heavy task gets rid of it more quickly. The work is finished and the system is cleared of a demanding job, at the cost of some inconvenience to other users over a shorter period of time. We can summarize this in a principle:
Principle 42 (Resource chokes and drains). Moderating resource availability to key processes can lead to poor performance and low productivity. Conversely, with free access to resources, resource usage needs to be monitored to avoid the problem of runaway consumption, or the exploitation of those resources by malicious users.
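A minimal sketch of the two strategies on a Unix-like system, using a sleep command as a stand-in for the heavy job (the renice step is shown only as a comment, since raising priority requires root privileges):

```shell
# Conventional wisdom: start the heavy job at the lowest priority (nice 19)
nice -n 19 sleep 30 &
pid=$!
ps -o pid,nice,comm -p "$pid"   # confirm the job is running at nice 19

# The alternative strategy requires root: raise the job's priority instead,
# e.g.  renice -n -5 -p "$pid"
# so that the demanding job completes and leaves the system sooner, at the
# cost of briefly inconveniencing interactive users.
kill "$pid"                     # tidy up the demonstration job
```

The choice between the two is a policy decision about whose time matters more: the interactive users' now, or everyone's once the heavy job is gone.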
7.10 SNMP tools
In spite of its limitations (see section 6.4.1), SNMP remains the protocol of choice for the management of most network hardware, and many tools have been written to query and manage SNMP enabled devices.
The fact that SNMP is a simple read/write protocol has motivated programmers to design simple tools that focus more on the SNMP protocol itself than on the semantics of the data structures described in MIBs. In other words, existing tools try to be generic instead of doing something specific and useful. Typical examples are so-called MIB browsers, which help users to browse and manipulate raw MIB data. Such tools usually understand only the machine-parseable parts of a MIB module, which is just adequate to shield users from the bulk of the often arcane numbers used in the protocol. Other examples are scripting-language APIs which provide a 'programmer-friendly' view of the SNMP protocol. However, in order to build more useful management applications, it is necessary to understand the
semantics of, and the relationships between, MIB variables. Generic tools require users to have this knowledge, which is not always the case.
PHP
The PHP server-side web-page language (an enhanced encapsulation of C) is perhaps the simplest way of extracting MIB data from devices, but it offers just a generic, low-level interface. PHP's SNMP functions make use of the NET-SNMP libraries. For example, here is a simple PHP web page that prints all of the SNMP variables for a device, allowing the data to be viewed in a web browser:
<?php
$a = snmpwalk("printer.example.org", "public", "");
for ($i = 0; $i < count($a); $i++)
{
    echo "$a[$i]<br>";
}
?>
The community string is written here with its default value 'public', but it is assumed that this has been changed to something more private. PHP is well and freely documented online, in contrast with Perl. For monitoring small numbers of devices, and for demonstrating the principles of SNMP, this is an excellent tool. However, for production work, most users will require something more sophisticated.
Perl, Tcl etc.
There are several SNMP extensions for Perl; a widely used Perl SNMP API is based on the NET-SNMP implementation and supports SNMPv1, SNMPv2c and SNMPv3. With it, a short script can, for example, retrieve the routing table defined in the RFC1213-MIB module and display it in a human-readable format.
The problem with Perl is that it only puts a brave face on the same problems that PHP has: namely, it provides only a low-level interface to the basic read/write operations of the protocol. There is no intelligence to the interface, and it requires a considerable amount of programming to do real management with this interface.
Another SNMP interface worthy of mention is the Tcl extension, Scotty.
SCLI
One of the most effective ways of interacting with any system is through a command language. With language tools a user can express his or her exact wishes, rather than filtering them through a graphical menu.
The scli package [268, 269] was written to address the need for rational command line utilities for monitoring and configuring network devices. It utilizes a MIB compiler called smidump to generate C stub code. It is easily extensible with a minimum of knowledge about SNMP.
The programs contained in the scli package are specific rather than generic. Generic SNMP tools such as MIB browsers or simple command line tools (e.g. snmpwalk) are hard to use since they expose too many protocol details for most users. Moreover, in most cases, they fail to present the information in a format that is easy to read and understand. A nice feature of scli is that it works like other familiar Unix commands, such as netstat and top, and generates a feeling of true investigative interaction.
host$ scli printer-XXX
100-scli version 0.2.12 (c) 2001-2002 Juergen Schoenwaelder
100-scli trying SNMPv2c ... timeout
100-scli trying SNMPv1 ... ok.
(printer-714) scli > show printer info
Device:            1
Description:       HP LaserJet 5M
Device Status:     running
Printer Status:    idle
Current Operator:
Service Person:
Console Display:   1 line(s) a 40 chars
Console Language:  en/US
Console Access:    operatorConsoleEnabled
Default Input:     input #2
Default Output:    output #1
Default Marker:    marker #1
Default Path:      media path #1
Config Changes:    4
(printer-XXX) scli >
Similarly, 'top'-like continuous monitoring can be obtained with:
printer-XXX> monitor printer console display

Agent:   printer-XXX:161 up 61 days 01:13:49                     13:48:49
Descr:   HP ETHERNET MULTI-ENVIRONMENT,JETDIRECT,JD24,EEPROM A.08.32
IPv4:        7 pps in    5 pps out   0 pps fwd   0 pps rasm   0 pps frag
UDP:         5 pps in    5 pps out
TCP:         0 sps in    0 sps out   2 con est   0 con aopn   0 con popn
Command: monitor printer console display

PRINTER LINE = TEXT =======================================================
11 Done: mark (STDIN):p
Now the fields are continuously updated. This generates a lot of network traffic, but is useful for debugging devices over a short interval of time.