Management: Creating an organization that will build and run reliable systems

I can hear all of my geeky friends screaming at me "but you promised that this would be a technical book!"  Trust me, it will be.  My thinking is that you and your career will be better served if you understand what "the suits" are thinking about.  You want "the suits" to read the technical stuff, so it seems fair to ask you  to read the managerial stuff.  It's about questions you ought to know the answers to:

Management is all about what, why, where, and when (and how much).  The technical stuff is about the how.

Before any geeks can touch any computer, there are some things that have to happen:

  1. Somebody has an idea for using computers to make money.  I know we're all about open source, and I am too, but at some point somebody has to have an idea for making money.
  2. Funding for all of the things that are going to happen has to be arranged.
  3. A lawyer gets involved.  I'm not sure why, but there will be at least one.
  4. The geeks have to be screened and hired.  Just because you can run as root on a 100 MHz Pentium on Linspire does not make you a system administrator.  I'm not saying don't hire inexperienced people, but rather that you should have a mix of junior and senior people, and you should know which is which.
  5. The room to put the computers in has to be procured.  Power must be laid in (with special attention to grounding).  A communications infrastructure has to be built.  The racks have to be purchased and installed.  Air conditioning has to be installed.  Fire protection, earthquake protection, and physical security all have to be worked out.
  6. Somebody has to buy the computers, the racks, the power strips, the nuts and bolts, the tools (it's 12:30 AM Sunday morning - do you know where your #2 Phillips head screwdriver is?), the wires, and the patch panels.
  7. They have to be received by receiving, entered into the inventory system (because taxes are frequently levied on inventory), and stored somewhere until you are ready to install them.
  8. At that point, some sysadmin can image the system, add the IP address, install the application, test it, connect it to the monitoring system, connect it to the load balancing system, test the load balancer, and put it into production.

All of the things in the list above are things that sysadmins generally don't worry about, but they are critical things that have to happen if the business is going to stay healthy.  If the business is not healthy, then you will lose your job.  Guaranteed.  In addition to the list above, there is another list of things that have to be worried about:


  1. The computers will break, so they have to be repaired.
  2. They have to be monitored, so you can tell when they break.  The monitoring has to be at several levels, including but not limited to a ping test, testing the remote access mechanism, and testing the application.  You have to know if the computers are being overloaded.  You have to detect when your systems are under attack.  You have to monitor DNS, NFS and NTP.  You have to detect when your domain registration has expired or has been hijacked.
  3. Bad Guys will try to take you out, so you have to have a computing security process, including intrusion testing.
  4. New software will be developed and old software will be fixed, so you have to have a release process.  The release process has to either stop processing, if you can tolerate that, or fail over, if you can't.  Then you have to do the release, test that it went well, then fail over the other way and repeat the process.
  5. New computers which are faster, cheaper, and (hopefully) cooler will supplant old machines.  Some applications will outlive their usefulness.  So you have to have a mechanism in place for sunsetting the applications and the machines when they have outlived their usefulness.
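
The monitoring levels in item 2 can be sketched in a few lines of Python.  This is a minimal sketch under stated assumptions, not a monitoring system: the host name and ports are made up, the function names are mine, and a plain TCP connect stands in for a real ping test (ICMP usually needs elevated privileges).

```python
import socket
from typing import Callable, List, Tuple


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_checks(checks: List[Tuple[str, Callable[[], bool]]]) -> List[str]:
    """Run every named probe and return the names of the ones that failed."""
    failed = []
    for name, probe in checks:
        try:
            ok = probe()
        except Exception:
            ok = False  # a probe that blows up counts as a failure
        if not ok:
            failed.append(name)
    return failed


# Hypothetical host and ports: one probe per monitoring level.
# Nothing touches the network until run_checks() is called.
EXAMPLE_CHECKS = [
    ("dns", lambda: bool(socket.gethostbyname("app1.example.com"))),
    ("remote access (ssh)", lambda: tcp_port_open("app1.example.com", 22)),
    ("application (http)", lambda: tcp_port_open("app1.example.com", 80)),
]
```

Calling run_checks(EXAMPLE_CHECKS) returns the names of the levels that failed, which is what you would hand to whatever pages your people.  A real application test would go further than opening the port and would check that the answer is correct.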

Cost and value: finance for sysadmins

What is the cost of failure?  The answer is: it depends.  Consider some scenarios:

Before you can design for reliability, you have to know how much failure you can tolerate.  Some systems are less critical than others, and a wise manager will put more resources (money, personnel, space, redundancy) into the critical systems, while spending less on the not-so-critical systems.

Profit and loss

The key measure of the success of any business is net profit.  This is something you have to have at least a minimal understanding of if you are going to convince your boss of anything.

    Profit = Revenue - Total Costs

Anything you can (legally) do to increase your revenue is A Good Thing.  Anything you can (legally) do to decrease your costs is also A Good Thing.  There are a couple of problems with considering pure profit: it fails to consider investment (paying money in order to make money in the future) and risk.

Return on investment

Return On Investment (ROI) compares what you earn with what you have invested.  However, the formulas below overstate the true economic value.  The degree to which ROI overstates the economic value depends on at least 5 factors:

  1. Length of project life (the longer, the bigger the overstatement)
  2. Capitalization policy (the smaller the fraction of total investment capitalized in the books, the greater will be the overstatement)
  3. The rate at which depreciation is taken on the books (depreciation rates faster than straight-line will result in a higher ROI)
  4. The lag between investment outlays and the recoupment of these outlays from cash inflows (the greater the time lag, the greater the degree of overstatement)
  5. The growth rate of new investment (faster growing companies will have a lower ROI)

The formula for ROI is

    Return On Investment = Net Income / Book Value of Assets

However, a better formula is

    Return On Investment = (Net Income + Interest x (1 - Tax Rate)) / Book Value of Assets
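
To make the two ROI formulas concrete, here is a minimal Python sketch.  The function names are mine, and the dollar figures and the 30% tax rate are invented purely for illustration.

```python
def roi_simple(net_income: float, book_value_of_assets: float) -> float:
    """ROI as net income divided by the book value of assets."""
    return net_income / book_value_of_assets


def roi_with_interest(net_income: float, interest: float,
                      tax_rate: float, book_value_of_assets: float) -> float:
    """ROI with after-tax interest added back to net income."""
    return (net_income + interest * (1 - tax_rate)) / book_value_of_assets


# Invented example: $120,000 net income, $20,000 interest,
# 30% tax rate, $1,000,000 of assets on the books.
print(f"simple ROI:        {roi_simple(120_000, 1_000_000):.1%}")
print(f"ROI with interest: {roi_with_interest(120_000, 20_000, 0.30, 1_000_000):.1%}")
```

With these made-up numbers the second formula comes out higher, because the after-tax interest is added back before dividing by the same book value.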

Depreciation

One of the things that throws sysadmins for a loop is the concept of depreciation.  Most sysadmins understand, intuitively, that equipment you buy has a finite life.  Straight line depreciation, the simplest way to think about it, finds the annual cost of an item by dividing its purchase price by the lifetime in years.  What is confusing is that the faster you depreciate something, the higher the ROI.
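
Straight line depreciation is easy to work through in code.  The sketch below is only an illustration: the function name is mine, and the $6,000 server with a 3 year life is a made-up example.

```python
def straight_line_schedule(purchase_price: float, lifetime_years: int):
    """Yield (year, annual_cost, book_value_at_year_end) for each year of life."""
    annual_cost = purchase_price / lifetime_years
    for year in range(1, lifetime_years + 1):
        book_value = purchase_price - annual_cost * year
        yield year, annual_cost, book_value


# Made-up example: a $6,000 server depreciated over 3 years.
for year, cost, book_value in straight_line_schedule(6000, 3):
    print(f"year {year}: annual cost ${cost:,.0f}, book value ${book_value:,.0f}")
```

The server costs $2,000 a year, and its book value drops to $4,000, then $2,000, then zero.  Book value sits in the denominator of the ROI formula, so, holding net income fixed (a simplifying assumption), a shorter lifetime shrinks the denominator sooner.  That is one way to see why faster depreciation pushes book ROI up.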


Risk



Roles and functions

I am about to show you a minimal list of all of the functions that a company has to do.  This list has been organized the way I think it ought to be done, putting organizations and suborganizations together so that as much similar functionality as possible is grouped.  Your organization will vary, but this list represents the minimum of what you have to do.

Organizing all of the roles

  1. Legal
  2. Marketing
    1. Customer support
    2. Assessing customer needs
    3. Advertising
  3. Sales
    1. Order fulfillment
    2. Customer management (CRM is the buzzword)
  4. Accounting/Finance
    1. Accounts payable
    2. Accounts receivable
    3. Capital planning
    4. Fiscal controls
  5. Shipping and Receiving
  6. Human Relations
    1. Personnel security
    2. Staffing levels
    3. Recruiting, retention and personnel development
  7. Facilities
    1. Physical security
    2. Utilities
      1. power
      2. phone
      3. heat and air conditioning
      4. Space and furniture
      5. bathrooms
    3. Fire protection
  8. Technology
    1. Design/Engineering
      1. Application design
        1. Requirements definition, including security analysis
        2. Allocate requirements to subsystems
        3. Design subsystems
        4. Subsystem test
      2. Programming practices
        1. Change control
        2. Program reviews
        3. Release control
      3. Programming languages
      4. Operating systems
    2. Quality Assurance
      1. Problem ticket tracking
      2. testing
        1. Design review
        2. Regression testing
        3. Integration test
        4. load test
        5. Failure test
        6. post production monitoring
      3. Security analysis
    3. Technical Operations
      1. Networking
      2. Servers
      3. Databases
      4. Monitoring
      5. Disaster planning
        1. Remote sites
        2. backups
        3. Reliable hardware
      6. Computing Security
    4. Internal computer support
  9. Statistics & measures

There are some interesting wrinkles in this list.

First, security is mentioned five times: physical, computing, personnel, and two application checks.  The rationale for this is that while the need for security is always the same, the methods you use to achieve it are very different.

Physical security can be handled by guards, possibly armed.  But an armed guard can't do a thing about a 14-year-old prodigy from Hoboken who's just given himself administrator rights on your MySQL database.  So you have to have somebody who knows computers to deal with computing threats.  In a small organization, that might be the system administrator; in a large organization, a dedicated person or team.

Personnel security is supposed to protect you from "inside jobs", where a person abuses their position of trust.  That includes things such as carefully screening new hires, running ongoing security checks on your current employees (I know that sounds draconian - I'm a sysadmin myself and I resent being investigated), and having processes and procedures in place to keep sensitive information safe.  For example, you may require that all of your credit card data be handled by two people at all times.  Also, if you ever fire or lay off somebody, you must escort them from the building and cut off all access to the computer systems, so that you are safe from disgruntled employees.

Application security is crucial: most modern operating systems are pretty secure (even MS-Windows has gotten better), so the applications that run under them are frequently the source of holes.  Applications should be secure by design, and they should be tested for security before release.

What is "Statistics and Measures", and why is it way up the corporate ladder?  This is the embodiment of an idea from Tom Yourdon's book [insert citation here].  Yourdon proposes a group whose job it is to measure things - defects, productivity, reliability.  This group verifies that your organization is meeting its goals, whatever those goals might be.  Statistics and measures is high on the corporate ladder because it has to have as much independence as possible.  So, for example, if a given project is a train wreck, and the statistics and measures group predicts it's going to be a train wreck, then the group has been successful!  Having statistics and measures allows you to truly manage, as opposed to merely giving orders.

Quality assurance is in a separate organization under technology for the same reason Statistics and Measures is in a separate organization: it has to be as independent as possible.  Quality assurance should be involved with the design process - it must ensure that the design is testable.


The Reliable Organization

Legal

Every business has legal requirements.  Every data processing organization has additional legal requirements.  In some cases, the penalties for violating those laws can be quite severe.  If you have trouble staying awake, then ask your lawyer about some of these laws and regulations:

Acronym  Common name                                          Who it covers                                                               Brief summary              Typical penalties for violations
SOX      Sarbanes-Oxley                                       Public corporations                                                         Good corporate governance
HIPAA    Health Insurance Portability and Accountability Act  Hospitals, insurance companies, physicians and other private practitioners
FERPA    Family Educational Rights and Privacy Act            Schools, community colleges, 4-year colleges, universities

Your lawyer should meet with your sysadmins, DBAs, and networking people to make sure that all of the rules and regulations are followed.  There are a couple of reasons why: 1) It's The Right Thing To Do; and 2) The cost of proper controls is smaller than the cost of defending, let alone losing, a lawsuit.

Marketing

Marketing is all about finding out what your customers need that you can fulfill.  One of the trends in the music business in 2005 was an increase in (legal) downloading, as opposed to purchasing CDs, which lost market share.  Why?  Because most customers want only one or two songs on a typical CD.  The rest is filler.  By downloading (legally) just what they want, customers avoid paying for the filler.


Customer support

Why is customer support a marketing function?  Because it is a golden opportunity to talk with your customers, an opportunity most companies squander.  Your customer support people should be noting what doesn't work, and you should give your engineers that information so they know what should be made better.  You can even measure customer satisfaction, or customer dissatisfaction, and use that to tell whether the product has improved.  Your customer support people should be asking questions, seeking ideas, that sort of thing.

Assessing customer needs


Advertising



Sales


  1. Order fulfillment
  2. Customer management (CRM is the buzzword)

Accounting/Finance

  1. Accounts payable
  2. Accounts receivable
  3. Capital planning
  4. Fiscal controls

Shipping and Receiving



Human Relations

Recruiting, retention and personnel development

It never ceases to amaze me how tolerant American business is of the cost of turnover.  Most computer organizations are idiosyncratic.  Their systems have been developed over years or even decades, and it is frequently cheaper to fix the old systems than it is to rebuild them new.  When a computer finally does wear out, it is cheaper and less disruptive to duplicate its functionality than it is to reengineer entire processes.

Once I worked on a Solaris machine that got its inputs from a machine running Red Hat and gave its output to another machine running Debian.  The Solaris machine had given us years of faithful and reliable service, but it was getting old and unreliable.  I replaced it with a PC running Red Hat, recompiled all of the software, and put it into place.  However, I then noticed that the other two machines and this machine were only working at 10% of their capacity.  So I combined all of their functionality into one computer and that, I thought, was that.  Unknown to me, or anybody else for that matter, was a little process, just a short program in a cronjob, which was a corporate-critical process.  And it would run only under Solaris, and it had to talk to both of the other machines - occasionally.  And the source code was lost in antiquity.  It turns out that the guy who wrote the program had quit in disgust the year before.  We were able to track him down and find the source code on an old backup tape, so the day was saved, but at such a cost!

I like to think I am a pretty good system administrator.  I've worked in places where the documentation for the sysadmins was quite good.  I've worked in places where the documentation was quite bad, or even nonexistent.  Even with superb documentation, it takes a long time for a sysadmin to get acquainted with the environment.  Systems are frequently put together as a result of mergers, buyouts, and moves.  There's never enough time to properly document everything, so much time is wasted trying to find things.  But it's okay, because people remember things (under Charlie's desk is an enterprise-critical workstation, but Charlie is the only guy who can make it work to do the billing).  One day, the person with the memory leaves, and all that arcane site-specific knowledge goes with him or her.

So turnover has four costs associated with it:

  1. The hiring process is very time consuming.  The same technology that puts your resume in front of tens of thousands of hiring managers also puts tens of thousands of resumes in front of you.  Even if you spend only 30 seconds on a resume, it can take days to sort through them all.
  2. When people come on board, somebody has to take time out to explain how things work.
  3. When people leave, you have to audit your systems and make sure that there are no "hidden" accounts, logic bombs, secret passageways into the systems and similar.  This is doubly true for sysadmins, since they have privileges.
  4. After people leave, the remaining people have to figure out what that person knew.
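
The "it can take days" claim in item 1 is simple arithmetic.  The sketch below assumes 10,000 resumes (the low end of "tens of thousands") and an 8 hour working day with nothing else to do; both numbers and the function name are my assumptions.

```python
def screening_days(resumes: int, seconds_each: float = 30,
                   hours_per_day: float = 8) -> float:
    """Working days needed to give every resume a quick look."""
    total_hours = resumes * seconds_each / 3600
    return total_hours / hours_per_day


# Assumed pile of 10,000 resumes at 30 seconds apiece.
print(f"{screening_days(10_000):.1f} working days")
```

That works out to a little over ten working days of doing nothing but skimming resumes.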

New people will make mistakes, not because they are stupid or incompetent, but because that's how we all learn.  The person who left may have spent six months learning how a given system works (and how it fails); when he or she leaves, those six months of experience are gone.  Worse, if you discover 6 months after he or she left that you need him or her for something, and bring them back on an expen$ive consulting contract, you run the risk that they will have forgotten everything.

Clearly, the solution is to reduce turnover.  How do you do that?

Age discrimination

It happens, despite laws against it.  The "sweet spot" of most people's careers occurs in their late 20s or early 30s.  At that time, you've finally gained enough experience so that you know enough to be useful, but you haven't had so much salary growth that you're priced out of the market.  The perception is that, as we grow older, we start slowing down, get set in our ways, have more health problems, and have families which are a distraction.  Like all stereotypes, there is some truth in all of these perceptions, and the occasional truths tend to reinforce what we believe.


Personnel security


Staffing levels


Things to look for

Obviously, you want technically qualified people, so you should look for things like education, certifications, and experience.  But there are some other things you ought to look for:


Facilities

Physical security & fire protection

Utilities

Power
Phone
heat and air conditioning
Space and furniture
Bathrooms

I was chatting with a sysadmin who had turned down a job offer because the bathroom was dirty and they were out of toilet paper.  She was a very qualified sysadmin and would have been a dynamite addition to the organization.  But they stinted on the cans!  The cost of recruiting somebody else probably swamped the cost of making nice bathrooms.

Similarly, system administration is a high stress job.  Sysadmins are smart people, and they understand that exercise is a good relief for stress.  Exercise also lowers your health care costs.  Provide a shower facility if you possibly can. 


Janitorial

Technology

This section focuses on the "what" and the "why" of building a reliable system.  There are other parts of the book that are devoted to "how" to do it.


Design/Engineering

The key to making reliable systems is in the design stage.  It is axiomatic that it is cheaper to fix design flaws in the design stage than in coding, and it is cheaper to fix problems in coding than it is in testing.


Application design

  1. Requirements definition, including security analysis
  2. Allocate requirements to subsystems
  3. Design subsystems
  4. Subsystem test

Programming practices

The programming industry is fairly mature at this point, and we know what works and what doesn't.  There are also some things that, remarkably enough, we still don't know.  A classic example is "what is the best programming language?"  Another example is "What is the best operating system?"

  1. Change control
  2. Program reviews
  3. Release control

Programming languages

As I write this in 2005, there seems to be a relatively small set of programming languages in common use.  Some of them are safer than others.

What makes a programming language safe or dangerous?

Operating systems

My critics accuse me of "Microsoft Bashing" and to a certain extent, they are correct: I do.  The problem is that, as a computing expert with decades of experience, I see how Microsoft has utterly botched the job of designing for security.  For example, the Microsoft system has a single data structure, the registry, and if you corrupt it, then your system can become unbootable.  I am unaware of any data structure like that in the UNIX or Linux world.  I suppose /etc/inittab could do it, if, for example, you made the default run level 0 or 6.  But you seldom touch /etc/inittab, and the only account that can is root, whereas anything and everything touches the registry.

Quality Assurance

Problem ticket tracking

In all likelihood, you will have several sets of problems.  Your software engineers will have bugs, your operations staff will have things that break and need fixing.  However, your facilities people also have problems.  So do your purchasing people.

The solution, of course, is a problem tracking system that implements business rules.


testing
Design review
Regression testing
Integration test
failure test

There are several ways of failure testing. I've seen (been the victim of) somebody pulling the power cord in the middle of a load test. I've also seen a client test script that locked a record, read the record, modified the record, did a kill -9 on the database PID, and then tried to write the record.  When a component of the system fails, and it will, can the system recover fast enough to meet the requirement?  When part of the system has failed, will performance be adequate?  Is the MTTR acceptable?
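
The MTTR question can be tied to a number.  A common steady-state approximation is availability = MTBF / (MTBF + MTTR).  The sketch below uses that formula; the function names, the 1,000 hour MTBF, and the 99.9% requirement are made-up illustrations, not figures from any real system.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def meets_requirement(mtbf_hours: float, mttr_hours: float,
                      required_availability: float) -> bool:
    """True if the predicted availability is at least the required availability."""
    return availability(mtbf_hours, mttr_hours) >= required_availability


# Made-up example: a component that fails about every 1,000 hours,
# checked against a 99.9% availability requirement.
for mttr in (1, 2):
    ok = meets_requirement(1000, mttr, 0.999)
    print(f"MTTR {mttr} h: availability {availability(1000, mttr):.4%}, acceptable: {ok}")
```

With these numbers a one hour repair time squeaks past the requirement and a two hour repair time does not, which is the kind of answer a failure test is supposed to give you before production does.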

load test
post production monitoring
Security analysis

Operations


Networking

Servers

Databases



Monitoring


Disaster planning
Reliable hardware
Backups

Remote sites
Computing security



Internal computer support



Statistics & measures

How well are you doing?  One way to measure that is by looking at your financials.  If your revenue is greater than your costs, then you are doing well indeed.  Your shareholders, VCs, and your employees are all interested in this measure.

But there are other metrics you might use, and those have a bearing on your profitability.

You can't improve reliability if you don't measure it.  Otherwise, you have no idea if you are making things better or worse (or having no impact at all).
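
As one possible starting point, availability and repair time can be measured straight from an outage log.  This is a hedged sketch: the function names are mine, and the 30 day window and the two outages are invented data, not measurements from anywhere.

```python
from datetime import datetime, timedelta


def measured_availability(window_start, window_end, outages):
    """Fraction of the window during which the system was up.

    outages is a list of (start, end) datetime pairs inside the window.
    """
    window = (window_end - window_start).total_seconds()
    down = sum((end - start).total_seconds() for start, end in outages)
    return 1 - down / window


def mean_time_to_repair(outages):
    """Average outage length as a timedelta (zero if there were no outages)."""
    if not outages:
        return timedelta(0)
    total = sum((end - start for start, end in outages), timedelta(0))
    return total / len(outages)


# Invented outage log for a 30 day window: one 90 minute and one 30 minute outage.
outage_log = [
    (datetime(2006, 1, 5, 2, 0), datetime(2006, 1, 5, 3, 30)),
    (datetime(2006, 1, 20, 14, 0), datetime(2006, 1, 20, 14, 30)),
]
start, end = datetime(2006, 1, 1), datetime(2006, 1, 31)
print(f"availability: {measured_availability(start, end, outage_log):.3%}")
print(f"mean time to repair: {mean_time_to_repair(outage_log)}")
```

Run the same calculation every month and you can tell whether a change made things better, worse, or no different.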

Ethics

Why is Ethics in a book about reliable systems?  Because Ethics is key to making systems reliable.  In order for your systems to work reliably, you have to know what the problems are.  Your people have to have confidence that they can come to you with a problem, and you won't shoot the messenger.  You only have to shoot one or two, and the word will get out that you don't want to hear bad news.  So consequently, when bad things happen (and they will), then you won't know about them until they are too big to ignore, and perhaps too big to do something about.

One day, an engine fell off of a Boeing 747.  The FAA was concerned that a part which holds the engine on, called a "fuse pin", might be defective.  So I was put on a team to test the fuse pins.  I had been given some software to run the test with (the test was almost completely automated) and all I had to do was run the computer.  So we put a fuse pin in the test machine and tested it.  It was fine.  We put another fuse pin in the test machine, and it was fine.  We put another fuse pin in the test machine, and it was fine.  After a while, I suggested that we try testing a fuse pin to destruction, just to see what would happen.  I was told this had been done decades ago, we don't need to do that.  I tried again, pointing out that it would be an interesting test of my software.  So we took an old fuse pin that we had already tested and put it back in the machine.  I gave instructions to the program to increase the load on the pin to 110% of "worst case" load.  Then 120% of "worst case" load.  Then 130% of "worst case" load.  Then the fuse pin broke.  It turned out that my software had a bug in it.

I had to go to my manager and tell him that the software had a bug in it, and we might have to redo all of the testing we had done.  Fortunately, my manager was an ethical man, and he listened carefully to my story and then asked me what I was going to do about it.  The two of us came up with a plan.  He reported to his higher ups that there was a problem and we were working on it.  I worked with my fellow engineers to figure out the problem, develop a fix, test it by destroying another fuse pin, and reprocess the old data to make it right.

My manager took a risk that his managers would be mad at him - this was a highly visible issue.  I took a risk by going to my manager.  We could have covered up the problem, and the world would never know.  But Boeing had a reputation as a quality organization.  That day, I tested that reputation, and it passed that test.

Earlier, I mentioned personnel security.  I discussed investigating people before you hire them, and investigating your key people again while they are working for you.  That is to protect you against abuse of trust for financial gain.  But people are motivated by other things than money, e.g. revenge.  So while you should not be afraid of your people, you do have to treat them with respect, dignity and understanding.  Remarkably enough, you don't have to pay them very well if you can motivate them in other ways (consider, for example, Boy Scouts and Girl Scouts - nobody is making money but there are a lot of scouts).  One way is an equitable profit sharing plan.  After all, they are sharing risk with you, even if they are not aware of it.  Another way is a relaxed atmosphere, especially when there are no customers around.  Frequent parties, recognition for work well done, respecting people's wishes concerning overtime and schedule (to the extent that you can) are all ways to keep your workforce loyal.

You should have an ethics policy.  It should be administered either by your legal department or your statistics and measurement department.  You must have protection for "whistle blowers".  That will tend to keep people honest because they know that they can't get rid of their consciences.

How to keep your ethical standards in an unethical world

It ain't easy.

While I was out of work, I was offered a ludicrous sum of money to come in and help a pornographer who wanted to take over his own hosting.  I have friends who go "war driving", find unsecured wireless transceivers and send hundreds of thousands of spams.  With a little luck and skill, the owner of the access point will never know.  There are phishing sites.  I found one that was registered in Taiwan, but running traceroute suggested it was in Fullerton, California.  There are times when I really want to write a virus.

The first great decision: do it in house or outsource it

One of my managers once asked me how small I could make an operations department and still make it work 24x7.  After some thought, I decided that the answer was zero: you could farm out the whole thing.  The manager said that he wanted to know how small to make the operations department but still have it under his control.  Again, the answer was: zero.  Just because you've outsourced doesn't necessarily mean that you've lost control of it.  Now that manager, clearly frustrated, told me that he wanted to have an operations staff of direct reports, how small could it be and still run 24x7?  I decided that the answer was anywhere from 4 to 16, depending on how failure tolerant he wanted the operation to be.  His response was that he wanted an organization of between 8 and 10 people - how could we organize it to provide 24x7 operation?

Most people want to work the day shift, because the rest of the world does.  Computer system failures seem to be uniformly distributed over time, especially for 24x7 systems.  Even if your systems are failure tolerant, they will break.  A lot of outfits do batch processing at night, to get the next day's billings out the door.  If you are selling to the global economy, then your load will be fairly constant over 24 hours.  So you have to organize to provide a human being, 24x7.

However, you need more than one human being.  The system administrators have a different skill set than the networking guys, who in turn have different skills than the database administrators.  So you need one person from each of these three groups available.  Then, everybody needs a backup person, for when things go really bad, or if a question arises that requires more research or thought.  On top of that, people go on vacation, get sick, go to conferences, etc., so you need a third person from each of these groups.  Finally, you will need a cadre of developers to analyze and possibly correct software failures.  This discussion assumes that you have automatic monitoring that will alert your people at the earliest sign of trouble and will at least minimally diagnose the problem, which may or may not be a valid assumption.
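
The head count in the paragraph above can be tallied in a few lines.  This is a back-of-the-envelope sketch: the function names and the 40 hour week are my assumptions, and the developers are left out because the text does not say how many you need.

```python
import math

HOURS_PER_WEEK = 24 * 7  # 168 hours to cover


def people_per_seat(hours_per_person_per_week: float = 40) -> int:
    """Minimum people needed to keep one seat staffed around the clock."""
    return math.ceil(HOURS_PER_WEEK / hours_per_person_per_week)


def on_call_roster(specialties, depth: int = 3) -> int:
    """Head count when every specialty has a primary, a backup, and vacation cover."""
    return len(specialties) * depth


print(people_per_seat())                                  # one awake human, 24x7
print(on_call_roster(["sysadmin", "network", "dba"]))     # three deep in each group
```

Under these assumptions, one seat staffed around the clock takes five people at 40 hours a week, and three specialties at three deep comes to nine people before you count any developers.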

If you have a sysadmin, a DBA, a network administrator, and a developer, then you have the kernel of a technical operations organization.  However, if anything happens to any of them, then you have a major hole in your organization, one that is impossible to fill.

This is expensive.  If your organization is very small, then it might make sense to outsource the system administration.  In this day and age, you have a lot of options.  You can outsource to an outfit in town, a company elsewhere in the country, or a company elsewhere in the world.



$Log: management.html,v $
Revision 1.1.1.1  2006/10/01 23:36:20  cvsuser
Initial checkin to CVS

Revision 1.1  2006/01/05 06:02:19  jeffs
Initial revision