Management: Creating an organization which will build and run reliable systems

I can hear all of my geeky friends screaming at me "but you promised that this would be a technical book!" Trust me, it will be. My thinking is that you and your career will be better served if you understand what "the suits" are thinking about. You want "the suits" to read the technical stuff, so it seems fair to ask you to read the managerial stuff. It's about questions you ought to know the answers to:

What do I have to do?
When do I have to do it?
Where do I do it?
How do I do it?
and Why?

Management is all about what, when, where, and why (and how much). The technical stuff is about the how.

Before any geeks can touch any computer, there are some things that have to happen:

Somebody has an idea for using computers to make money. I know we're all about open source, and I am too, but at some point somebody has to have an idea for making money.
Funding for all of the things that are going to happen has to be arranged.
A lawyer gets involved. I'm not sure why, but there will be at least one.
The geeks have to be screened and hired. Just because you can run as root on a 100 MHz Pentium running Linspire does not make you a system administrator. I'm not saying don't hire inexperienced people; rather, you should have a mix of junior and senior people, and you should know which is which.
The room to put the computers in has to be procured. Power must be laid in (with special attention to grounding). A communications infrastructure has to be built. The racks have to be purchased and installed. Air conditioning has to be installed. Fire protection, earthquake protection, and physical security all have to be worked out.
Somebody has to buy the computers, the racks, the power strips, the nuts and bolts, the tools (it's 12:30 AM Sunday morning - do you know where your #2 Phillips head screwdriver is?), the wires, and the patch panels.
They have to be received by receiving, entered into the inventory system (because taxes are frequently levied on inventory), and stored somewhere until you are ready to install them.
At that point, some sysadmin can image the system, add the IP address, install the application, test it, connect it to the monitoring system, connect it to the load balancing system, test the load balancer, and put it into production.

All of the things in the list above are things that sysadmins generally don't worry about, but they are critical things that have to happen if the business is going to stay healthy. If the business is not healthy, then you will lose your job. Guaranteed. In addition to the list above, there is another list of things that have to be worried about:

The computers will break, so they have to be repaired.
They have to be monitored, so you can tell when they break. The monitoring has to be at several levels, including but not limited to a ping test, testing the remote access mechanism, and testing the application. You have to know if the computers are being overloaded. You have to detect when your systems are under attack. You have to monitor DNS, NFS and NTP. You have to detect when your domain registration has expired or has been hijacked.
Bad Guys will try to take you out, so you have to have a computing security process, including intrusion testing.
New software will be developed and old software will be fixed, so you have to have a release process. The release process has to either stop processing, if you can tolerate that, or fail over, if you can't. Then you have to do the release, test that it went well, then fail over the other way and repeat the process.
New computers which are faster, cheaper, and (hopefully) cooler will supplant old machines. Some applications will outlive their usefulness. So you have to have a mechanism in place for sunsetting the applications and the machines when they have outlived their usefulness.
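
The layered monitoring described in the list above (a reachability test, a test of the remote access mechanism, a test of the application) can be sketched in Python. This is only an illustration under stated assumptions: remote access is assumed to be SSH on port 22, the application is assumed to answer on port 80, and a TCP connect stands in for a true ICMP ping, which usually needs elevated privileges. The function names and the host name in the usage note are hypothetical.

```python
import socket


def tcp_check(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name lookup failures.
        return False


def run_checks(checks):
    """Run each named check in order and return a dict of name -> bool.

    `checks` is a list of (name, callable) pairs, ordered from the lowest
    level (can we reach the box?) to the highest (does the application answer?).
    """
    return {name: bool(check()) for name, check in checks}


def first_failure(results):
    """Return the name of the first failed check, or None if all passed."""
    for name, ok in results.items():
        if not ok:
            return name
    return None


def default_checks(host):
    """Hypothetical check list: SSH for remote access, HTTP for the application."""
    return [
        ("remote access (ssh, port 22)", lambda: tcp_check(host, 22)),
        ("application (http, port 80)", lambda: tcp_check(host, 80)),
    ]
```

A call such as first_failure(run_checks(default_checks("server.example.com"))) would name the lowest level that is broken, which is usually the most useful thing to put in the alert.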

Cost and value: finance for sysadmins

What is the cost of failure? The answer is: it depends. Consider some scenarios:

Your company does a mail order business. People mail you orders, somebody enters them into a computer and you ship the orders the next business day. If a computer goes down, you can survive for a few hours while it's getting fixed. If worst comes to worst, you can go to the local computer store and buy a new computer, swap hard drives and boot. You can reenter orders from the paper forms. Reliability is not a critical factor here.
You have a subsystem that does statistical analysis of your order stream, and it produces reports each day to help the marketing people hone their marketing efforts. Reliability is not a critical factor here, either. If the marketing people don't get their reports for a day, the company will face an intangible loss but it probably will not kill the company.
You are an online company, taking and fulfilling orders 24x7x365. You have a billing system that collects the order data in real time, but processes the billing in a batch mode. Security is important here, because the billing is the key to your revenue stream. You can withstand a few hours of downtime, but it won't be pleasant. If this system is compromised and customer data is stolen, that could kill your corporation.
You are an online company, taking and fulfilling orders 24x7x365. Your customers access a web server, which connects to databases. One of those databases has credit card information, so not only must it be reliable but it must also be secure. Another database holds the orders and it must be reliable - it is very bad news if an order gets lost. Reliability must be assured, but you can take the system offline at, say, 2 in the morning for routine maintenance, so long as all of the orders in progress are dealt with.
You are a hospital, with a computing system that provides critical information to physicians, nurses, pharmacists, and dieticians. Reliability must be assured and you cannot take the system offline, ever. Allowable downtime is measured in tens of seconds.
You are a military aircraft "fly by wire" system. Your system must not only be reliable, but it has demanding realtime performance requirements. Furthermore, the system must work in a demanding environment which includes extremes of heat, cold, vibration, power fluctuations, and corrosion. The system must also be tolerant of damage in combat. Allowable downtime is measured in milliseconds.

Before you can design for reliability, you have to know how much failure you can tolerate. Some systems are less critical than others, and a wise manager will put more resources (money, personnel, space, redundancy) into the critical systems, while spending less on the not-so-critical systems.
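
To put numbers on how much failure you can tolerate, it can help to convert an availability target into allowed downtime per year. The sketch below is plain arithmetic; the availability targets in the loop are illustrative assumptions, not requirements taken from the scenarios above, and the function name is made up.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(availability):
    """Hours per year a system may be down and still meet `availability` (0..1)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


# Illustrative targets only.
for target in (0.99, 0.999, 0.9999):
    hours = allowed_downtime_hours(target)
    print(f"{target:.2%} available -> {hours:.2f} hours/year ({hours * 60:.0f} minutes)")
```

Each extra "nine" cuts the downtime budget by a factor of ten, which is one reason the more critical systems get more of the money.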

Profit and loss

The key measure of the success of any business is net profit. This is something you have to have at least a minimal understanding of if you are going to convince your boss of anything.

Profit = Revenue - Total costs

Anything you can (legally) do to increase your revenue is A Good Thing. Anything you can (legally) do to decrease your costs is also A Good Thing. There are a couple of problems with considering pure profit: it fails to consider investment (paying money in order to make money in the future) and risk.

Return on investment

The Return On Investment (ROI) formulas given below overstate the true economic value. The degree of overstatement depends on at least 5 factors:

length of project life (the longer, the bigger the overstatement)
capitalization policy (the smaller the fraction of total investment capitalized in the books, the greater the overstatement)
the rate at which depreciation is taken on the books (depreciation faster than straight-line will result in a higher ROI)
the lag between investment outlays and the recoupment of these outlays from cash inflows (the greater the time lag, the greater the overstatement)
the growth rate of new investment (faster growing companies will have a lower ROI)

The formula for ROI is

Net Income / Book Value of Assets = Return On Investment

However, a better formula is

(Net Income + Interest × (1 - Tax Rate)) / Book Value of Assets = Return On Investment
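
Both formulas are simple enough to check in a few lines of code. The sketch below assumes the second formula is read as (Net Income + Interest × (1 - Tax Rate)) divided by Book Value of Assets; the function names are hypothetical and the figures are invented for illustration.

```python
def simple_roi(net_income, book_value):
    """Net Income / Book Value of Assets."""
    return net_income / book_value


def adjusted_roi(net_income, interest, tax_rate, book_value):
    """(Net Income + Interest * (1 - Tax Rate)) / Book Value of Assets."""
    return (net_income + interest * (1.0 - tax_rate)) / book_value


# Invented figures, for illustration only.
net_income = 120_000.0
interest = 20_000.0
tax_rate = 0.30
book_value = 1_000_000.0

print(f"simple ROI:   {simple_roi(net_income, book_value):.1%}")
print(f"adjusted ROI: {adjusted_roi(net_income, interest, tax_rate, book_value):.1%}")
```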

Depreciation

One of the things that throws sysadmins for a loop is the concept of depreciation. Most sysadmins understand, intuitively, that equipment you buy has a finite life. Straight line depreciation, the simplest way to think about it, finds the annual cost of an item by dividing its purchase price by the lifetime in years. What is confusing is that the faster you depreciate something, the higher the ROI.
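
Straight line depreciation is easy to sketch. The example below assumes no salvage value and uses invented numbers; the function name is hypothetical.

```python
def straight_line_schedule(purchase_price, lifetime_years):
    """Return a list of (year, annual_depreciation, book_value_at_year_end)."""
    if lifetime_years <= 0:
        raise ValueError("lifetime_years must be positive")
    annual = purchase_price / lifetime_years
    schedule = []
    for year in range(1, lifetime_years + 1):
        book_value = purchase_price - annual * year
        schedule.append((year, annual, book_value))
    return schedule


# A hypothetical 6,000 dollar server depreciated over 3 years.
for year, annual, book_value in straight_line_schedule(6000.0, 3):
    print(f"year {year}: expense {annual:.2f}, book value {book_value:.2f}")
```

One way to see the confusing part: ROI divides by the book value of assets, so a schedule that shrinks book value faster makes the same income look like a higher return.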

Risk

Roles and functions

I am about to show you a minimal list of all of the functions that a company has to do. This list has been organized the way I think it ought to be done, putting organizations and suborganizations together so that as much similar functionality as possible is grouped. Your organization will vary, but this list represents the minimum of what you have to do.

Organizing all of the roles

Legal
Marketing

Customer support
Assessing customer needs
Advertising

Sales

Order fulfillment
Customer management (CRM is the buzzword)

Accounting/Finance

Accounts payable
Accounts receivable
Capital planning
Fiscal controls

Shipping and Receiving
Human Relations

Personnel security
Staffing levels
Recruiting, retention and personnel development

Facilities

Physical security
Utilities

power
phone
heat and air conditioning
Space and furniture
bathrooms

Fire protection

Technology

Design/Engineering

Application design

Requirements definition, including security analysis
Allocate requirements to subsystems
Design subsystems
Subsystem test

Programming practices

Change control
Program reviews
Release control

Programming languages
Operating systems

Quality Assurance

Problem ticket tracking
testing

Design review
Regression testing
Integration test
load test
Failure test
post production monitoring

Security analysis

Technical Operations

Networking
Servers
Databases
Monitoring
Disaster planning

Remote sites
backups
Reliable hardware

Computing Security

Internal computer support

Statistics & measures

There are some interesting wrinkles in this list.

First, security is mentioned five times: physical, computing, personnel, and two application checks. The rationale is that while the need for security is always the same, the methods you use to achieve it are very different.

Physical security can be handled by guards, possibly armed. But an armed guard can't do a thing about a 14 year old prodigy from Hoboken who's just given himself administrator rights on your MySQL database. So you have to have somebody who knows computers to deal with computing threats. In a small organization, that might be the system administrator; in a large organization, a dedicated person or team.

Personnel security is supposed to protect you from "inside jobs", where a person abuses their position of trust. That includes things such as carefully screening new hires, running ongoing security checks on your current employees (I know that sounds draconian - I'm a sysadmin myself and I resent being investigated), and having processes and procedures in place to keep sensitive information safe. For example, you may require that all of your credit card data be handled by two people at all times. Also, if you ever fire or lay off somebody, you must escort them from the building and cut off all access to the computer systems, so that you are safe from disgruntled employees.

Application security is crucial. Most modern operating systems are pretty secure (even MS-Windows has gotten better); the applications that run under them are frequently the source of holes. Applications should be secure by design, and they should be tested for security before release.

What is "Statistics and Measures", and why is it way up the corporate ladder? This is the embodiment of an idea from Tom Yourdon's book [insert citation here]. Yourdon proposes a group whose job it is to measure things - defects, productivity, reliability. This group verifies that your organization is meeting its goals, whatever those goals might be. Statistics and measures is high in the corporate ladder because it has to have as much independence as possible. So, for example, if a given project is a train wreck, and the statistics and measures group predicts it's going to be a train wreck, then the group has been successful! Having statistics and measures allows you to truly manage, as opposed to merely giving orders.

Quality assurance is in a separate organization under technology for the same reason Statistics and Measures is in a separate organization: it has to be as independent as possible. Quality assurance should be involved with the design process - it must ensure that the design is testable.

The Reliable Organization

Legal

Every business has legal requirements. Every data processing organization has additional legal requirements. In some cases, the penalties for violating those laws can be quite severe. If you have trouble falling asleep, then ask your lawyer about some of these laws and regulations:

Acronym	Common Name	Who it covers	Brief summary	Typical penalties for violations
SOX	Sarbanes-Oxley	Public corporations	Good corporate governance
HIPAA	Health Insurance Portability and Accountability Act	Hospitals, insurance companies, physicians and other private practitioners
FERPA	Family Educational Rights and Privacy Act	Schools, community colleges, 4 year colleges, universities

Your lawyer should meet with your sysadmins, DBAs, and networking people to make sure that all of the rules and regulations are followed. There are a couple of reasons why: 1) It's The Right Thing To Do; and 2) The cost of proper controls is smaller than the cost of defending, let alone losing, a lawsuit.

Marketing

Marketing is all about finding out what your customers need that you can fulfill. One of the trends in the music business in 2005 was an increase in (legal) downloading, as opposed to purchasing CDs which lost market share. Why? Because most customers want only one or two songs on a typical CD. The rest is filler. By downloading (legally) just what they want, customers

Customer support

Why is customer support a marketing function? Because it is a golden opportunity to talk with your customers, an opportunity most companies squander. Your customer support people should be noting what doesn't work, and you should give your engineers that information so they know what should be made better. You can even measure customer satisfaction, or customer dissatisfaction, and use that to tell whether the product has improved. Your customer support people should be asking questions, seeking ideas, that sort of thing.

Assessing customer needs

Advertising

Sales

Order fulfillment
Customer management (CRM is the buzzword)

Accounting/Finance

Accounts payable
Accounts receivable
Capital planning
Fiscal controls

Shipping and Receiving

Human Relations

Recruiting, retention and personnel development

It never ceases to amaze me how tolerant American business is of the cost of turnover. Most computer organizations are idiosyncratic. Their systems have been developed over years or even decades, and it is frequently cheaper to fix the old systems than it is to rebuild them new. When a computer finally does wear out, it is cheaper and less disruptive to duplicate its functionality than it is to reengineer entire processes.

Once I worked on a Solaris machine that got its inputs from a machine running Red Hat and gave its output to another machine running Debian. The Solaris machine had given us years of faithful and reliable service, but it was getting old and unreliable. I replaced it with a PC running Red Hat, recompiled all of the software, and put it into place. However, I then noticed that the other two machines and this machine were only working at 10% of their capacity. So I combined all of their functionality into one computer and that, I thought, was that. Unknown to me, or anybody else for that matter, was a little process, just a short program in a cronjob, which was a corporate critical process. And it would run only under Solaris, and it had to talk to both of the other machines - occasionally. And the source code was lost in antiquity. It turns out that the guy who wrote the program had quit in disgust the year before. We were able to track him down and find the source code on an old backup tape, so the day was saved, but at such a cost!

I like to think I am a pretty good system administrator. I've worked in places where the documentation for the sysadmins was quite good. I've worked in places where the documentation was quite bad, or even non-existent. Even with superb documentation, it takes a long time for a sysadmin to get acquainted with the environment. Systems are frequently put together as a result of mergers, buyouts, and moves. There's never enough time to properly document everything, so much time is wasted trying to find things. But it's okay, because people remember things (Under Charlie's desk is an enterprise critical workstation, but Charlie is the only guy who can make it work to do the billing). One day, the person with the memory leaves, and all that arcane site-specific knowledge goes with him or her.

So turnover has four costs associated with it:

The hiring process is very time consuming. The same technology that puts your resume in front of tens of thousands of hiring managers also puts tens of thousands of resumes in front of you. Even if you spend only 30 seconds on a resume, it can take days to sort through them all.
When people come on board, somebody has to take time out to explain how things work.
When people leave, you have to audit your systems and make sure that there are no "hidden" accounts, logic bombs, secret passageways into the systems and similar. This is doubly true for sysadmins, since they have privileges.
After people leave, the remaining people have to figure out what that person knew.
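
Part of the audit for "hidden" accounts can be automated. The sketch below is an illustration under stated assumptions: it parses text in the standard /etc/passwd format and flags accounts that are not on an approved list, plus any account other than root with UID 0. The approved list, the sample data, and the function name are hypothetical, and a real audit would also have to cover SSH keys, cron jobs, sudo rules, and application-level accounts.

```python
def audit_passwd(passwd_text, approved_users):
    """Return (unapproved_accounts, extra_uid0_accounts) from passwd-format text."""
    unapproved = []
    extra_uid0 = []
    for line in passwd_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) < 3:
            continue  # malformed line; skip rather than guess
        name = fields[0]
        try:
            uid = int(fields[2])
        except ValueError:
            continue  # malformed UID field; skip rather than guess
        if name not in approved_users:
            unapproved.append(name)
        if uid == 0 and name != "root":
            extra_uid0.append(name)
    return unapproved, extra_uid0


# Invented sample data: "toor" is a second UID 0 account nobody approved.
sample = """root:x:0:0:root:/root:/bin/bash
alice:x:1000:1000:Alice:/home/alice:/bin/bash
toor:x:0:0:backdoor:/root:/bin/sh
"""
print(audit_passwd(sample, approved_users={"root", "alice"}))
```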

New people will make mistakes, not because they are stupid or incompetent, but because that's how we all learn. The person who left may have spent six months learning how a given system works (and how it fails); when he or she leaves, those six months of experience are gone. Worse, if you discover six months after they left that you need them for something, and bring them back on an expen$ive consulting contract, you run the risk that they will have forgotten everything.

Clearly, the solution is to reduce turnover. How do you do that?

Pay scales should be competitive.
Schedules should be as flexible as possible.
Don't shortchange people on benefits.
Be scrupulously honest and upfront.

Age discrimination

It happens, despite laws against it. The "sweet spot" of most people's careers occurs in their late 20s or early 30s. At that time, you've finally gained enough experience to be useful, but you haven't had so much salary growth that you're priced out of the market. The perception is that, as we grow older, we start slowing down, get set in our ways, have more health problems, and have families which are a distraction. Like all stereotypes, there is some truth in these perceptions, and the occasional truths tend to reinforce what we believe.

Personnel security

Staffing levels

Things to look for

Obviously, you want technically qualified people, so you should look for things like education, certifications, and experience. But there are some other things you ought to look for:

Can they write clearly without much difficulty? If they cannot easily write well, then they probably won't write documentation. They may be very smart, but nobody else will be able to take advantage of what they know.
Can they communicate well orally, especially in a noisy environment (think about making a phone call from a noisy server room)?
How do they handle stress?

Facilities

Physical security & fire protection

Utilities

Power

Phone

heat and air conditioning

Space and furniture

Bathrooms

I was chatting with a sysadmin who had turned down a job offer because the bathroom was dirty and they were out of toilet paper. She was a very qualified sysadmin and would have been a dynamite addition to the organization. But they stinted on the cans! The cost of recruiting somebody else probably swamped the cost of making nice bathrooms.

Similarly, system administration is a high stress job. Sysadmins are smart people, and they understand that exercise is a good relief for stress. Exercise also lowers your health care costs. Provide a shower facility if you possibly can.

Janitorial

Technology

This section focuses on the "what" and the "why" of building a reliable system. There are other parts of the book that are devoted to "how" to do it.

Design/Engineering

The key to making reliable systems is in the design stage. It is axiomatic that it is cheaper to fix design flaws in the design stage than in coding, and it is cheaper to fix problems in coding than it is in testing.

Application design

Requirements definition, including security analysis
Allocate requirements to subsystems
Design subsystems
Subsystem test

Programming practices

The programming industry is fairly mature at this point, and we know what works and what doesn't. There are also some things that, remarkably enough, we still don't know. A classic example is "what is the best programming language?" Another example is "What is the best operating system?"

Change control
Program reviews
Release control

Programming languages

As I write this in 2005, there seems to be a relatively small set of programming languages in common use. Some of them are safer than others.

To my astonishment, there is still demand for FORTRAN programmers, because there is a large body of software written in FORTRAN and it is cheaper to maintain it than it is to rewrite it. FORTRAN is a dangerous programming language unless you use compiler options to force strong type checking and other checks. One of the FORTRAN options is the TBD switch, which allows runtime array boundary checking. Use it.
There is still a lot of demand for COBOL programmers for business applications for the same reason that FORTRAN is popular. COBOL was designed with the idea that a non-programmer could pick up a COBOL program and understand it. I've never heard of a non-programmer who did so.
A lot of software is still written in C, in part because the constructors and destructors in C++ don't work deterministically. Linux is written almost entirely in C (there are a few sections of the kernel that are written in assembler - very few). C is a very dangerous programming language because of the danger of buffer overflows, mangling the heap (with malloc and free calls), and misplaced pointers.
C++ is popular. C++ is dangerous for the same reasons C is. Worse, C++ allows multiple inheritance, which means that if a class inherits the same identifier from more than one ancestor, it isn't obvious which ancestor's version you get.
Java is popular and it is a very safe programming language. I highly recommend it.
I don't know how popular C# is. Microsoft released C# after they lost their lawsuit from Sun over corrupting Java. It's been ported to Linux by the Mono project.
Perl is very popular and is very good for small programming projects. I do most of my programming in Perl, but then I'm a sysadmin and not a developer or a DBA. Perl can be made relatively safe with the -W switch, which I use religiously.
PHP is very popular for developing web pages.
Python is something of a cult language. I think that it would be more popular than Perl is today if it had come out before Perl. In particular, I like that the bounds of loops and conditionals (if-then-else) are delimited by indentation. It makes for a very intuitive understanding of what the code does - it looks like what it does.

What makes a programming language safe or dangerous?

A language that is very picky about syntax is safer than one that will accept most anything. The thinking here is that it is cheaper for the compiler to find bugs than for a human to have to tediously look for them.
Two of the most common programming errors are overrunning the boundaries of an array and pointers that point to the wrong things. In C and C++, these are really the same thing because of the close relationship between arrays and pointers. Java guarantees that an out-of-bounds array access will throw an exception, so that is safe. Perl and PHP will increase the array size to accommodate a write beyond the end - I'm not sure if that is a Good Thing or not; it depends on what your program is trying to do. In Perl, if you attempt to read beyond the end of an array, the result is the undefined value. You can test for definedness, but you can't really use an undefined value for anything, and Perl will complain if you try.
Another common error is referencing a variable before assigning it a value. Again, FORTRAN, C and C++ will happily do that. Perl and Java will not. That makes Perl and Java safer programming languages.
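
Python, mentioned in the list of languages above, falls on the safe side of both tests. The short sketch below shows what happens on an out-of-bounds list access and on a read of a name that was never assigned; the helper names are made up for the example.

```python
def read_past_end(values, index):
    """Return values[index], or the exception name if the access is rejected."""
    try:
        return values[index]
    except IndexError as exc:
        return type(exc).__name__


def write_past_end(values, index, item):
    """Try to assign values[index]; return the exception name if rejected."""
    try:
        values[index] = item
        return "ok"
    except IndexError as exc:
        return type(exc).__name__


def use_before_assignment():
    """Try to read a name that was never assigned anywhere."""
    try:
        return never_assigned_name  # deliberately undefined
    except NameError as exc:
        return type(exc).__name__


numbers = [10, 20, 30]
print(read_past_end(numbers, 1))       # a valid index
print(read_past_end(numbers, 3))       # one past the end
print(write_past_end(numbers, 3, 40))  # lists do not grow on assignment
print(use_before_assignment())
```

One caveat worth knowing: Python accepts negative indexes and counts them from the end of the list, so an off-by-one that goes below zero is not caught this way.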

Operating systems

My critics accuse me of "Microsoft Bashing" and, to a certain extent, they are correct: I do. The problem is that, as a computing expert with decades of experience, I see how Microsoft has utterly botched the job of designing for security. For example, the Microsoft system has a single data structure, the registry, and if you corrupt it, then your system can become unbootable. I am unaware of any data structure like that in the UNIX or Linux world. I suppose /etc/inittab could do it, if, for example, you made the default run level 0 or 6. But you seldom touch /etc/inittab, and the only account that can is root, whereas anything and everything touches the registry.

Quality Assurance

Problem ticket tracking

In all likelihood, you will have several sets of problems. Your software engineers will have bugs, your operations staff will have things that break and need fixing. However, your facilities people also have problems. So do your purchasing people.

The solution, of course, is a problem tracking system that implements business rules.

testing

Design review

Regression testing

Integration test

failure test

There are several ways of failure testing. I've seen (been the victim of) somebody pulling the power cord in the middle of a load test. I've also seen a client test script that locked a record, read the record, modified the record, did a kill -9 on the database PID, and then tried to write the record. When a component of the system fails, and it will, can the system recover fast enough to meet the requirement? When part of the system has failed, will performance be adequate? Is the MTTR acceptable?
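
A very small version of the kill -9 test can be scripted. The sketch below assumes a Unix-like system and uses a sleeping Python child process as a stand-in for the real service; the function name, the stand-in process, and treating "recovery" as a plain restart are all assumptions made for illustration.

```python
import signal
import subprocess
import sys
import time

# Stand-in for the real service: a child process that just sleeps.
SERVICE_CMD = [sys.executable, "-c", "import time; time.sleep(60)"]


def kill_and_recover(cmd):
    """Start cmd, kill it with SIGKILL, restart it, and time the recovery.

    Returns (exit_code_of_killed_process, seconds_from_kill_to_restart).
    """
    victim = subprocess.Popen(cmd)
    time.sleep(0.2)                      # give the process a moment to start
    started = time.monotonic()
    victim.send_signal(signal.SIGKILL)   # the equivalent of kill -9
    exit_code = victim.wait(timeout=10)
    replacement = subprocess.Popen(cmd)  # "recovery" here is just a restart
    try:
        alive = replacement.poll() is None
        elapsed = time.monotonic() - started
    finally:
        replacement.kill()               # clean up the stand-in process
        replacement.wait(timeout=10)
    if not alive:
        raise RuntimeError("replacement process did not stay up")
    return exit_code, elapsed


if __name__ == "__main__":
    code, seconds = kill_and_recover(SERVICE_CMD)
    print(f"killed process exit code {code}, recovered in {seconds:.3f} s")
```

Comparing the measured recovery time against the requirement turns "can the system recover fast enough?" into a pass or fail result that can run with the rest of the test suite.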

load test

post production monitoring

Security analysis

Operations

Networking

Servers

Databases

Monitoring

Disaster planning

Reliable hardware

Backups

Remote sites

Computing security

Internal computer support

Statistics & measures

How well are you doing? One way to measure that is by looking at your financials. If your revenue is greater than your costs, then you are doing well indeed. Your shareholders, VCs, and your employees are all interested in this measure.

But there are other metrics you might use, and those have a bearing on your profitability.

What is the MTBF (mean time between failures) of your servers? The UNIX uptime command reports how long a server has been running since its last boot, which gives you a rough starting point.
What is the MTBF of your services? This is a little harder to measure; you will have to examine your application log files.
What is the MTTR (mean time to repair)?
How well are the load balancers balancing loads?
How busy are your servers? If the servers are busy all of the time, then your customers will experience poor response time and take their business elsewhere. If the servers are never busy, then you are wasting money - consolidate them onto fewer physical machines. Virtual machines are an interesting technology that might make this easier. On the other hand, the more servers you have, the more failure you can tolerate.
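
Given a log of outages, MTBF, MTTR, and availability follow from simple arithmetic. The sketch below assumes you can extract (failure time, restore time) pairs from your logs; the record format, the function name, and the numbers are invented.

```python
def reliability_stats(observation_hours, outages):
    """Compute (MTBF, MTTR, availability) for one measurement window.

    observation_hours: length of the measurement window, in hours.
    outages: list of (failed_at_hour, restored_at_hour) pairs inside the window.
    """
    downtime = sum(restored - failed for failed, restored in outages)
    uptime = observation_hours - downtime
    failures = len(outages)
    mtbf = uptime / failures if failures else float("inf")
    mttr = downtime / failures if failures else 0.0
    availability = uptime / observation_hours
    return mtbf, mttr, availability


# Invented example: a 720-hour month with two outages of 2 and 4 hours.
mtbf, mttr, availability = reliability_stats(720.0, [(100.0, 102.0), (500.0, 504.0)])
print(f"MTBF {mtbf:.1f} h, MTTR {mttr:.1f} h, availability {availability:.3%}")
```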

You can't improve reliability if you don't measure it. Otherwise, you have no idea if you are making things better or worse (or having no impact at all).

Ethics

Why is Ethics in a book about reliable systems? Because Ethics is key to making systems reliable. In order for your systems to work reliably, you have to know what the problems are. Your people have to have confidence that they can come to you with a problem, and you won't shoot the messenger. You only have to shoot one or two, and the word will get out that you don't want to hear bad news. So consequently, when bad things happen (and they will), then you won't know about them until they are too big to ignore, and perhaps too big to do something about.

One day, an engine fell off of a Boeing 747. The FAA was concerned that a part which holds the engine on, called a "fuse pin", may be defective. So I was put on a team to test the fuse pins. I had been given some software to run the test with (the test was almost completely automated) and all I had to do was run the computer. So we put a fuse pin in the test machine and tested it. It was fine. We put another fuse pin in the test machine, and it was fine. We put another fuse pin in the test machine, and it was fine. After a while, I suggested that we try testing a fuse pin to destruction, just to see what would happen. I was told this had been done decades ago, we don't need to do that. I tried again, pointing out that it would be an interesting test of my software. So we took an old fuse pin that we had already tested and put it back in the machine. I gave instructions to the program to increase the load on the pin to 110% of "worst case" load. Then 120% of "worst case" load. Then 130% of "worst case" load. Then the fuse pin broke. It turned out that my software had a bug in it.
I had to go to my manager and tell him that the software had a bug in it, and we might have to redo all of the testing we had done. Fortunately, my manager was an ethical man, and he listened carefully to my story and then asked me what I was going to do about it. The two of us came up with a plan. He reported to his higher ups that there was a problem and we were working on it. I worked with my fellow engineers to figure out the problem, develop a fix, test it by destroying another fuse pin, and reprocess the old data to make it right.
My manager took a risk that his managers would be mad at him - this was a highly visible issue. I took a risk by going to my manager. We could have covered up the problem, and the world would never know. But Boeing had a reputation as a quality organization. That day, I tested that reputation, and it passed that test.

Earlier, I mentioned personnel security. I discussed investigating people before you hire them, and investigating your key people again while they are working for you. That is to protect you against abuse of trust for financial gain. But people are motivated by other things than money, e.g. revenge. So while you should not be afraid of your people, you do have to treat them with respect, dignity and understanding. Remarkably enough, you don't have to pay them very well if you can motivate them in other ways (consider, for example, Boy Scouts and Girl Scouts - nobody is making money but there are a lot of scouts). One way is an equitable profit sharing plan. After all, they are sharing risk with you, even if they are not aware of it. Another way is a relaxed atmosphere, especially when there are no customers around. Frequent parties, recognition for work well done, respecting people's wishes concerning overtime and schedule (to the extent that you can) are all ways to keep your workforce loyal.

You should have an ethics policy. It should be administered either by your legal department or your statistics and measurement department. You must have protection for "whistle blowers". That will tend to keep people honest because they know that they can't get rid of their consciences.

How to keep your ethical standards in an unethical world

It ain't easy.

While I was out of work, I was offered a ludicrous sum of money to come in and help a pornographer who wanted to take over his own hosting. I have friends who go "war driving", find unsecured wireless transceivers and send hundreds of thousands of spams. With a little luck and skill, the owner of the access point will never know. There are phishing sites. I found one that was registered in Taiwan, but running traceroute suggested it was in Fullerton, California. There are times when I really want to write a virus.

The first great decision: do it in house or outsource it

One of my managers once asked me how small I could make an operations department and still make it work 24x7. After some thought, I decided that the answer was zero: you could farm out the whole thing. The manager said that he wanted to know how small to make the operations department but still have it under his control. Again, the answer was: zero. Just because you've outsourced doesn't necessarily mean that you've lost control of it. Now that manager, clearly frustrated, told me that he wanted to have an operations staff of direct reports, how small could it be and still run 24x7? I decided that the answer was anywhere from 4 to 16, depending on how failure tolerant he wanted the operation to be. His response was that he wanted an organization of between 8 and 10 people - how could we organize it to provide 24x7 operation?

Most people want to work the day shift, because the rest of the world does. Computer system failures seem to be uniformly distributed over time, especially for 24x7 systems. Even if your systems are failure tolerant, they will break. A lot of outfits do batch processing at night, to get the next day's billings out the door. If you are selling to the global economy, then your load will be fairly constant over 24 hours. So you have to organize to provide a human being, 24x7.

However, you need more than one human being. The system administrators have a different skill set than the networking guys, who in turn have different skills than the database administrators. So you need one person from each of these three groups available. Then, everybody needs a backup person, for when things go really bad, or if a question arises that requires more research or thought. People also go on vacation, get sick, go to conferences, etc., so you need a third person from each of these groups. Finally, you will need a cadre of developers to analyze and possibly correct software failures. This discussion assumes that you have automatic monitoring that will alert your people at the earliest sign of trouble and will at least minimally diagnose the problem, which may or may not be a valid assumption.
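
The staffing arithmetic above can be sketched as a weekly rotation: three skill groups with three people in each, so that every week each group has one person on call, one backup, and one person free for vacation, illness, or conferences. The names and the one-week rotation period are assumptions; developers and shift scheduling within the week are left out.

```python
ROLES = ("on call", "backup", "off")


def weekly_roster(groups, week):
    """Return {group: {role: person}} for the given week number.

    groups maps a skill group name to a list of exactly three people.
    Each week the roles rotate by one position within every group.
    """
    roster = {}
    for group, people in groups.items():
        if len(people) != len(ROLES):
            raise ValueError(f"{group} needs exactly {len(ROLES)} people")
        shift = week % len(people)
        rotated = people[shift:] + people[:shift]
        roster[group] = dict(zip(ROLES, rotated))
    return roster


# Hypothetical staff: 3 groups x 3 people = 9 people.
staff = {
    "sysadmin": ["Ann", "Bob", "Cy"],
    "network": ["Dee", "Eli", "Flo"],
    "dba": ["Gus", "Hal", "Ivy"],
}
for week in range(3):
    print(week, weekly_roster(staff, week)["sysadmin"])
```

Nine people falls inside the range of 8 to 10 that the manager asked for, before any developers are counted.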

If you have a sysadmin, a DBA, a network administrator, and a developer, then you have the kernel of a technical operations organization. However, if anything happens to any of them, then you have a major hole in your organization, one that is impossible to fill.

This is expensive. If your organization is very small, then it might make sense to outsource the system administration. In this day and age, you have a lot of options. You can outsource to an outfit in town, a company elsewhere in the country, or a company elsewhere in the world.

$Log: management.html,v $
Revision 1.1.1.1  2006/10/01 23:36:20  cvsuser
Initial checkin to CVS

Revision 1.1  2006/01/05 06:02:19  jeffs
Initial revision