Notes on the LISA '96 Workshop: Advanced Topics in System Administration

Tina Darmohray <tmd@iwi.com>, John Schimmel <jes@sgi.com>, and John Sellens <jmsellens@uwaterloo.ca>

To appear in ;login:, October or December 1996


Tuesday October 1 1996
Sheraton Chicago

It's 9:00am and we've already turned the tables on this workshop (or at least the tables in the meeting room, so that they'd face each other). The continental breakfast is rapidly disappearing. But all is not lost -- John Schimmel made sure that there was plenty of Diet Coke available for the break.

The workshop was attended by 36 individuals, with roughly equal representation from universities and commercial organizations, with a handful of government types thrown in for good measure. Most attendees labeled themselves as managers, and there were no students in the group.

The attendees: John Schimmel, Tina Darmohray, Brent Chapman, Lee Damon, Peter Van Epp, Bryan McDonald, Paul Evans, Laura de Leon, Paul Anderson, Strata Rose, John Sellens, David Parter, Andrew Hume, Xev Gittler, Remy Evard, Elizabeth Zwicky, Mark Allyn, Don Libes, Pat Wilson, Amy Kreiling, Helen Harrison, Mark Verber, Patrick Landry, Greg Rose, Mike Fisk, Lisa Giacchetti, Todd Atkins, Ian Reddy, Richard Chycoski, Doug Hughes, Mark Henderson, Dwayne Martin, Tim Gassaway, Jon Finke, Giray Pultar, and Bill LeFebvre.

This is the second Advanced Topics workshop held at LISA -- the first was last year at LISA 9 in Monterey. The workshop was organized by John Schimmel of SGI with the goal of providing a forum for a focused discussion on the most important current issues in system administration. Participants were expected to submit at least one topic for discussion and the workshop was designed to be wide-ranging and free-wheeling.

The Advanced Topics workshop was created because, back in the olden days, LISA was 100 people in a room throwing stones at each other. Now it's over 1,600, and one of the questions is: how do you go to a conference that size and actually learn anything outside the organized sessions? So, our hand-picked, select group (i.e. anybody that sent John mail expressing interest) got together to see what we could learn.

9:30am: John says "Let's Start". By some quirk of fate, the room ended up with almost all the university people on one side, and almost all the commercial people on the other, with a smattering of research and government people mixed in. A short review of operating system use revealed a predominance of SunOS and Solaris, with almost everything else represented (AIX, Sequent, DEC Ultrix and Digital UNIX, BSD variants, Linux, Plan 9 (1 person), NT, VMS, MacOS, HP-UX, IRIX, Windows).

John wrote down a list of every topic that had been suggested for discussion. Helen Harrison suggested that we vote for 3 (or so) topics to discuss today. John said, "You're a manager, right?".

Tina Darmohray tried to suggest to John how to run the vote on the topics, and she learned a valuable lesson: Tina was now running the vote. Very efficiently. Most topics turned out to get either exactly 1, 5, 10, or 20 votes, until Tina started getting hassled about it, and we had votes of 12, 7 and 4. Windows NT and its associated implications was the hot topic at over 25 votes. Software distribution, networking, and other topics followed further down in the voting. We decided to try to tackle some of the other topics first, before diving into Windows NT issues, and so started on the topic of software distribution.

Software Distribution

Andrew Hume started the discussion by asking "what about rdist to Windows 95?"

Paul Evans related that there had been 2 or 3 paper submissions to this year's conference with the same general approach: install dual-boot PCs with both Linux and Windows, reboot into Linux every night, use rdist and other commands to (re-)configure the Windows side of things, and then reboot back into Windows. General laughter ensued, along with the feeling that this was not a very pleasant solution.
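To make the trick concrete, a nightly job on the Linux side might look something like the following sketch. This is not code from any of the submitted papers; the partition device, mount point, and master-tree path are all invented placeholders.

    #!/usr/bin/env python
    # Hypothetical sketch of the dual-boot approach: run from cron while
    # booted into Linux, refresh the Windows partition from a master
    # configuration tree, then reboot back into Windows.
    # All paths and device names below are illustrative, not real.

    import filecmp
    import os
    import shutil
    import subprocess

    WIN_DEVICE = "/dev/hda1"       # assumed Windows partition
    WIN_MOUNT = "/mnt/windows"     # assumed mount point
    MASTER = "/export/pc-config"   # assumed master configuration tree

    def sync_tree(src, dst):
        """Copy any file that is missing or differs from the master copy."""
        for root, dirs, files in os.walk(src):
            rel = os.path.relpath(root, src)
            target = os.path.join(dst, rel)
            os.makedirs(target, exist_ok=True)
            for name in files:
                s = os.path.join(root, name)
                d = os.path.join(target, name)
                if not os.path.exists(d) or not filecmp.cmp(s, d, shallow=False):
                    shutil.copy2(s, d)

    subprocess.run(["mount", "-t", "vfat", WIN_DEVICE, WIN_MOUNT], check=True)
    try:
        sync_tree(MASTER, WIN_MOUNT)
    finally:
        subprocess.run(["umount", WIN_MOUNT], check=True)
    subprocess.run(["reboot"])     # back to Windows until tomorrow night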

Laura de Leon observed "Maybe we should have started with NT".

The discussion quickly centered on NT administration and integration in a large network (and NT and PC issues continued to surface regularly throughout the morning). Many attendees reported that their organizations are acquiring NT-based machines rapidly, mainly to run popular applications (like Word and Excel) and to provide "easy" file and print service for PC users. The problem that NT servers (and PCs themselves) create for system administrators is that large scale support is hard: the typical administration paradigm assumes that each machine is administered from its console, rather than remotely or through some sort of automated scripting language. We agreed that a command line interface for managing NT machines and PCs remotely would be a dramatic improvement over the current state of affairs, and that a well-defined API for all the administrative functions would be better still.
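To illustrate the kind of hook the group had in mind, consider the sketch below. The ntadmin command it drives is entirely invented -- no such tool existed -- as are the machine, printer, and driver names; the point is simply that a scriptable interface turns a walk-around chore into a loop.

    #!/usr/bin/env python
    # Entirely hypothetical: what remote, scriptable NT administration
    # could look like if a command-line hook existed. "ntadmin" is an
    # invented command, and the host and printer names are made up.

    import subprocess

    HOSTS = ["pc-001", "pc-002", "pc-003"]   # made-up machine names

    for host in HOSTS:
        # Push one printer definition to every machine from one script,
        # instead of visiting each console by hand.
        subprocess.run(["ntadmin", "--host", host,
                        "add-printer", "--name", "floor2-laser",
                        "--driver", "hp4si"],
                       check=True)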

There were a large number of utilities that people had come up with to handle the PC problem, but none of them was particularly earth-shattering. Probably the most useful tool discussed was Samba. The biggest problem seemed to be Lotus Notes.

Do GUI interfaces to UNIX (or NT) system administration get in the way of larger scale administration? Pat Wilson suggested that "surely there are command line alternatives for every GUI". Andrew Hume pointed out, as a counter-example, that the interface to NCR's RAID setup is through the GUI only. There was discussion of a few examples of administrative tasks that are more easily done through a GUI. Some systems have a command line alternative, but they aren't always obvious to the casual (or not so casual) administrator. The question was raised: why don't vendors make the command line alternative as easy to find as the GUI? Various reasons were suggested and postulated.

Paul Evans threw out "a fundamental philosophical issue" in the UNIX/Windows discussion: from the PC/Windows point of view, the fundamental unit is the isolated machine on a person's desk, and much of the administration of the machine is done by walking up to the person's desk, or doing it one at a time over the network. Paul Anderson pointed out that automatically administering a network of 300 or 400 UNIX workstations is doable, and adding one more new machine is not a linear increase in the required effort. But in the PC world, the effort required is much more often linear.

The question arose: how can we protect users from themselves? This is, of course, the age-old dichotomy of centralized vs distributed administration.

Why don't OS vendors provide a "large installation administration kit" as a package that could be installed on your machines? Elizabeth Zwicky mentioned SGI's "inst" which will, in the next version, have a programmatic interface, so that it will be (theoretically) much easier to automate.

Andrew Hume asked for opinions on the use of encrypted data distribution within an organization. His example was AT&T -- can AT&T really trust the 200,000 people inside the AT&T firewall with "confidential" and "need to know" information available on the internal network?

The discussion turned to standards and formal methods. What about the "sysman" standards effort?

Mark Verber asked: Is the issue data distribution, or configuration management? Do formal specifications and theoretical methods work? Can you apply theory if you don't deal with it in practice? Andrew Hume believes that formal specifications will work in system administration, and that the current action is in Europe.

Greg Rose asked: When was the last time you moved your car's clutch pedal to the middle? Is reconfiguring your machine/environment necessary or even a good idea? Xev Gittler pointed out that if machines were as standardized as the automobile interface is, we wouldn't have nearly as many problems. Paul Evans suggested that it's a function of how mature the system or industry is, and that there weren't a lot of standards in the early days of the automobile. Brent Chapman wondered whether turning on the ignition in your car and setting up your computer for your needs are comparable tasks -- is the latter inherently more complicated? Strata Rose pointed out that the utility of a car is already present in the car, but the utility of a computer is in how much you customize/modify it to suit your particular needs.

Does pressing your vendor's sales rep cause change to happen? Paul Evans observed that organizations behave in a way that makes sense internally. Sun never would have had Solaris if it was just driven by marketing or customer desires. The conclusion: pressing your sales rep may have no effect at all.

And once again we veered back into the PC arena, as indicated by Paul Evans' observation that people want things for a combination of rational and irrational reasons, stated and unstated needs. The PC "culture" thing means that there is a cultural backlash to any attempt at central control of "my desktop". The user's perfectly reasonable assumption is "I do it at home, why can't I do it here?" -- which leads to the obvious conclusion that something that is okay to do when you have 1 system is not necessarily a good thing to do when you have 10,000, spread across the globe.

What about charging for support? Will this help to cause people to toe the company line? This is quite an ironic idea, given that we are the people who set out to topple the tyrannical regime of MIS with UNIX, and now we are the tyrannical regime. Can we influence people in the desired direction by offering good, standard support for free, but charging money to fix someone's "customized" installation? Andrew Hume observed that there are two types of users -- one is happy with the standard setup, and the other needs the "latest and greatest" all the time.

Paul Evans: The cost of truly supporting people is far higher than typical perceptions suggest. It's an economic decision. We know how to do very good system administration. What we don't know is how to do whatever is the most "cost effective" for us.

One approach is to build a "baseline" distribution that is good for 80% of the users, and then make it easy to add the few extra things that people need (or want), in a standard way. Elizabeth Zwicky mentioned that SGI Europe has 93% compliance with their baseline distribution. They used convenience (it's really easy to use the baseline distribution) and bribery -- extra disk space lured in a few people, but free T-shirts lured in many more. Posting a daily list of compliance statistics also helps people want to comply -- one office increased its compliance simply so that it would have higher compliance than its rival office in another country.
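As a rough illustration of how such a daily compliance report might be generated (SGI Europe's actual tooling was not described; the manifest format, file layout, and host names here are all invented):

    #!/usr/bin/env python
    # Hypothetical sketch of a daily baseline-compliance report in the
    # spirit of the scheme described above. The manifest format (one
    # package name per line) and all file/host names are invented.

    BASELINE = "/export/admin/baseline-packages"

    def read_set(path):
        with open(path) as f:
            return {line.strip() for line in f if line.strip()}

    def compliance(host_package_file):
        """Fraction of the baseline package list installed on a host."""
        baseline = read_set(BASELINE)
        installed = read_set(host_package_file)
        return len(baseline & installed) / len(baseline)

    # Posting a ranked list like this every day was reported to nudge
    # offices into competing for higher compliance.
    for host in ["geneva", "munich", "paris"]:   # made-up host names
        pct = 100.0 * compliance("/export/admin/pkgs/%s.list" % host)
        print("%-10s %5.1f%% compliant" % (host, pct))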

The topic of software distribution traversed all the extremes. On UNIX, the rdist and track tools are still the most widely used methods for distributing software and configuration information. There are a number of commercial products now on the market that act as wrappers around these tools, but none of them seemed to fill the needs of either software distribution or configuration management cleanly. The number of large sites represented that used a commercial tool to handle configuration management was still very small. Most sites are still creating home-grown tool sets to handle their unique circumstances.

Handling file system layout in a large environment is still a significant problem. It has long been recognized that maintaining a single software repository, or at least a standard layout, is beneficial for large sites, but no one has yet worked out a clean way to do this. Many papers have been delivered at past LISA conferences on ways to control the layout of software so that it can best be managed at a large site. This is still not a solved problem in the UNIX arena; at Windows-based sites it is pure chaos.

The day's first session closed with two observations. Strata Rose: "Microsoft is just there -- it's like a glacier." Xev Gittler: "It's not a glacier, it's an avalanche."

Part 2 -- After the morning break

After the morning break, John Schimmel had written down the following list of the different things that we had talked about:

- File distribution -- UNIX has known solutions, but Windows/NT solutions are not nearly as obvious
- NT/Windows administration
- Administration models
- User customization/modification
- Baseline software installation
- User mentality (UNIX vs PC)

Is part of the PC problem that people are expected to differentiate between the consumer electronics that they have at home and the centrally administered and supported business tool that they have at work? Is that a reasonable expectation to have?

The conclusion at this point was that the biggest issue or concern that we have now is PC support, integration, and Windows NT administration and control. There was some concern that management was being unduly swayed by Microsoft's marketing power, and the idea that Microsoft is the answer to all questions.

There followed a discussion of the UNIX development environment as compared to the PC development environment. The best characterization seemed to be that UNIX provides a collection of tools that work together, while the emphasis in the PC world is on the one tool that can do it all. The conclusion was that each approach has its place, but that the "one tool" approach is more likely to place limits on the things that you can do.

Just before we broke for lunch, we agreed (unanimously!) that we wouldn't say anything at all about PCs or Windows NT in the afternoon session.

Part 3 -- After lunch

After lunch, we dove right into a discussion of

Networking

This discussion started as an informal survey of who is using what, and continued on into a few questions and hints back and forth.

Q. If you're running 100 Mb/s to the desktop, what are you running to the server?
A. This is not always a problem, since the desktop is sometimes running video or similar applications directly to another desktop, bypassing the server.

Q. Trends?
A. 100VG is dead
A. FDDI appears to have a limited future, with little expansion
A. Any organization that's growing is installing more fiber
A. Switched Ethernet is expected to be very useful and more widely deployed in the future
A. An expectation that wiring will be either category 5 UTP or fiber

Q. What do you see 3 years out? Is anyone envisioning anything other than 10/100 baseT to the desktop?
A. Not really ...

Q. Routing protocols?
A. Some RIP, almost no RIP2, some OSPF, IGRP, static routes are remarkably common

Q. Network monitoring?
A. SNMP is used for monitoring -- there is limited use of SNMP for configuration since telnet is usually more convenient
A. A little OpenView, a little Spectrum, tkined/scotty
A. Freeware tools are about as common as commercial tools (a minimal example of the home-grown flavor appears after this list)
A. Capacity monitoring is used more as a diagnostic tool, rather than an ongoing planning tool
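In the home-grown spirit, a minimal monitoring tool can be as simple as a script that polls a list of services and reports which ones answer. The sketch below uses invented host names and an arbitrary service list; real sites would of course poll their own.

    #!/usr/bin/env python
    # Minimal sketch of a home-grown monitor: try a TCP connection to
    # each listed service and report up/down. Hosts and ports here are
    # invented examples.

    import socket

    SERVICES = [("mailhub", 25), ("www", 80), ("fileserver", 2049)]

    def is_up(host, port, timeout=5.0):
        """True if a TCP connection to host:port succeeds within timeout."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    for host, port in SERVICES:
        print("%s:%d %s" % (host, port, "up" if is_up(host, port) else "DOWN"))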

There seemed to be a general consensus that networking is, by and large, a "solved" problem, i.e. we know how to deal with it.

Security

Computer and network security is a significant concern for most organizations. As we did for networking, we fell into the informal survey/short question and answer mode for our discussion of security-related issues.

Q. How many people are using ssh?
A. Over half
Q. PGP?
A. Quite popular
Q. Anyone using stel?
A. Pretty much no one
Q. Kerberos?
A. About the same as last year
Q. DCE?
A. Far fewer sites had plans for using DCE this year than last

Q. Anyone using a commercial firewall right out of the box?
A. A few people are
Q. TIS Firewall Toolkit?
A. The majority
Q. No firewall at all?
A. About 1/3 of the sites
Q. Using socks to get through the firewall?
A. A few
Q. Doing Java applet filtering?
A. A few, but it's a popular concept
Q. Doing virus scanning on the firewall?
A. 1 or 2

Q. Using one time password tokens?
A. Roughly 1/3 of the sites
Q. Running a key signing service for your users?
A. A few (Boeing, SGI, Usenix ...)

Q. What do we think about Java?
A. We're leery ...
A. Elizabeth Zwicky: "I think sendmail will be safe before Java will"

Simon Fraser University just signed a site license for Timestep, a secure, encrypted virtual private networking (SVPN) product.

Many sites are still looking at Kerberos implementations. Some attendees lamented that vendors don't offer Kerberized versions of their applications, which would help us out considerably.

The current state of the art in inter-domain system administration was discussed (e.g. file system sharing, account creation, etc.). There was some interest in SSL. Neutral zones ("shared project networks") are in use by a number of organizations to join private networks.

Part 4, after the afternoon break

Human-to-Human Communications

We started this topic with a discussion of wireless communications for support staff. Synopsys uses radios to help the 70 or so sysadmins keep in touch. There was general agreement that radios can be a very useful tool.

Problem tracking software is in much wider use today than in years past, and many of the larger sites are now using commercial packages instead of the home-grown solutions that were presented at earlier LISA conferences.

How can we deal effectively with new "short" tasks? There are typically two parts to this: finding the time to do the short task, and making it obvious that you have had to not do something else. Strata Rose suggested that "TTM lists" (task, time, money) are a good tool for dividing a large task up into sub-tasks that people can understand: assign time and monetary costs to each sub-task, and it becomes much easier to decide what to stop doing in order to accomplish this task.
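A TTM list needs nothing fancier than a table; the sketch below shows the idea with entirely invented sub-tasks and figures.

    #!/usr/bin/env python
    # Sketch of a "TTM list" (task, time, money): break a large task
    # into sub-tasks, attach time and dollar costs, and total them so
    # the trade-offs are visible. All entries below are invented.

    subtasks = [
        # (sub-task, hours, dollars)
        ("Evaluate backup software",     16,    0),
        ("Buy and install DLT drive",     4, 7000),
        ("Write and test dump scripts",  24,    0),
        ("Document restore procedure",    8,    0),
    ]

    total_hours = sum(h for _, h, _ in subtasks)
    total_cost = sum(c for _, _, c in subtasks)

    for task, hours, cost in subtasks:
        print("%-30s %3d h  $%6d" % (task, hours, cost))
    print("%-30s %3d h  $%6d" % ("TOTAL", total_hours, total_cost))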

For many of us, email is one of the primary means of communication. It is important to remember that email is an imperfect medium. It's easier (and faster, and better) to solve conflicts in person than through email. Email lacks facial expressions, tone of voice, stance, etc., and is very fast, which makes it much harder to "edit yourself". Current version humans tend to be much less "ept" at writing than at speaking.

Remy Evard pointed out that being able to write effective email is a critical skill for system administrators. There is a tendency for technical people to think that everyone else is like themselves, and that people who disagree with you are either evil or stupid. It is likely that, when it comes to email, we (as system administrators) are the weirdos, since we tend to deal with larger amounts of email, and have more experience with it. Our attitudes towards email, and the way we use it, are likely to be atypical -- it's important to remember that not everyone looks at email the same way that we do.

The discussion of email slid into a mention of voice mail, and the question was raised: who's having trouble dealing with the combination of email and voice mail? There was general agreement that we find email more effective and less disruptive/annoying than voice mail. Bill LeFebvre opined that "Voice mail is the fax of the 90's".

How do you communicate with your users? Brent Chapman makes it a habit to write explanatory documents and put them on the web -- Brent does his project plans as HTML documents. It helps if you've got strong web indexing tools on your server. Laura de Leon said that at her site the users like the web pages, but they love the bi-monthly paper newsletter (which also ends up on the web). At SGI, everything is web-based. "Silicon Junction" is the SGI internal default web site, and is essentially a daily newspaper.

Are web pages good for announcements and discussions? The consensus was that they are just about as good as the alternatives (mailing lists, newsgroups). Sometimes it is appropriate to send an email pointer to a URL with a 2 or 3 line description. Mailing lists used for announcements need to be very low volume.

One person suggested that important announcements should just "pop up on your screen". This was not a popular option -- it was felt that this would be too annoying. In addition, there would be problems with choosing which screen to pop it up on, and there is the problem of what happens to those people who aren't signed on when the announcement is posted.

Paper notices are appropriate too -- server down notices get posted on the washroom doors at Boeing, and Synopsys posts notifications of the quarterly downtime on every external door.

The important point is: determine what the appropriate method is for your site, document it, and use it. Whatever channel you choose, it must be dedicated to important announcements to be effective.

Publications

As SAGE publications editor, Tina Darmohray is always looking for printed material. She wants/expects ;login: articles from workshop attendees. Can we put together a document for vendors demonstrating the need for configuration APIs? USENIX is putting together a legal issues document, and is interested in hearing about relevant issues. One of the questions to be addressed is what do you actually do, and what are you legally obligated to do?

Legal issues were discussed very briefly. One common thread was that many sites have a policy to NOT back up email, specifically so that they cannot recover it if asked to do so.

Backups

It was not a real surprise that backups are still a big issue for workshop attendees. The average site was dealing with hundreds of gigabytes of backed-up data. The most popular backup technology this year appears to be DLT, but several sites are actually beginning to use the higher-speed DST tapes from Ampex to deal with extreme data stores. For example, Andrew Hume reported using 4 DST drives to back up a 960GB SGI server. The rapid growth in the size of disk technologies and the incredible demand for online storage continue unrestrained.

Various network backup systems are in use, the most common being Legato Networker and IBM's ADSM.

Other related questions/problems are the use of archive servers, network bandwidth, hierarchical storage management, and the use of onsite and offsite disk mirroring.

Quality, Service and Metrics

Remy Evard asked: How can you tell that the systems administration in your organization has improved over time?

There was a discussion of metrics for evaluating system administration performance. Some of the problems and issues raised were:

If you base the evaluation of system administration effectiveness on user satisfaction, you need to survey the users. Does it matter if the users are rational and know what's best for them? It's the old question of what your goal is -- to have a well-run system, or to make your users happy? Are those necessarily conflicting goals? Paul Evans pointed out that trying to make your users happy is sometimes the road to ruin.

Should our goal be to keep them at the same level of happiness?

Is user satisfaction/happiness beyond the system administrator's control? If the users want more file space, and there's no money for more disk, that's not (usually) the sysadmin's fault, or a problem that he/she can solve. Or is this a "site" problem -- the site is not being well run (for whatever reason) because the "appropriate" resources are not there?

Strata Rose asked: Do we consider users to be competent people in their own right? Or do we treat them as children who don't know what they need to get their job done? It is important to treat users as mature professionals.

Do the goals, values and satisfaction levels of a site differ between (for example) commercial sites, universities, and ISPs?

A measurement problem arose: does the system administrator get blamed for vendor problems that are beyond his/her control?

Elizabeth Zwicky suggested that a good metric might be: Do the users think that things are getting better? Her site offered users a virtual $100 and asked which part of IS they would prefer to spend it on. This kind of survey can easily be biased by the interaction that users have with particular IS groups.

A question was raised: how many sites have system administration and networking as disjoint groups? Quite a few hands went up. Do the disjoint groups typically work well together? (Apparently not)

And finally, there was some discussion of charge back for system administration services. Should system administration be a profit center, or just another component of a company's infrastructure?

Conclusions

It's 4:30pm and we're done.

The obvious questions are: Do we have any useful conclusions? Have we accomplished anything? Or is the benefit of this workshop the sharing of ideas and experiences, and meeting people with similar or different problems? Can we change the world? Can we fix anything substantial as a result of this exercise? Perhaps not.

Strata Rose mentioned that perhaps one of the primary accomplishments of this workshop is that of "morale" -- if my peers are also facing the same problems that I am, maybe I'm not incompetent after all. We networked today -- people got some good ideas about what other people are doing, and hopefully have a better understanding of which way the solutions might lie.

A final consideration is, of course, planning for next year. Is there some different structure which can really accomplish something in this kind of forum? Or is the sharing of knowledge, problems and experiences the most important benefit of this workshop? Something to think about.