A case example: a highly reliable acronym server

I'm going to create a fairly simple, highly reliable application to demonstrate some of these ideas: an acronym server.

Requirements

We always start these kinds of documents with a description of what we want the system to do.  There are two kinds of requirements: business requirements and technical requirements.  The business people haven't a clue about the problems that the technical people are going to face.  So we take the business requirements and create technical requirements.  In some cases, these requirements border on design.  In the case of computer systems, there frequently isn't a nice dividing line between requirements and design, because one has to beat what one wants against what can be done.

Business requirements

The acronym server has the following business requirements, in order of decreasing priority:

  1. Given an acronym, return the meaning, references to the meaning, and who entered the information.
  2. Do not allow a user to enter or change an entry without identification, authentication, and authorization.
  3. Allow users to create accounts for themselves in a secure and reliable way.
  4. Provide a mechanism for advertising.
  5. Security
  6. 99.97% reliability, which means no more than 158 minutes of outage per year(2).  Disaster recovery in 2 hours from incident to operational.

One of the interesting things about this exercise was what happened to the reliability figure.  In my rough draft, I wanted 99.99% reliability, which is a number I just dreamed up (don't be surprised, a lot of managers do that).  After beating that reliability figure against what I felt could be done at a reasonable cost, I decided to relax the reliability requirement by a factor of 3, from 0.01% allowed downtime to 0.03%.  In an academic exercise where I am both the technician and the manager, I can get away with this sort of thing.
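
To make the factor of 3 concrete, here is a small sketch that turns an availability percentage into a yearly downtime budget.  It uses the same 365.25-day year as footnote 2; the function name is my own invention for illustration.

```python
HOURS_PER_YEAR = 365.25 * 24  # 8766 hours, the figure used in footnote 2


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given availability."""
    failure_fraction = (100.0 - availability_percent) / 100.0
    return HOURS_PER_YEAR * failure_fraction * 60.0


# Compare the rough-draft target with the relaxed one.
for target in (99.99, 99.97):
    print(f"{target}% -> {downtime_minutes_per_year(target):.1f} minutes/year")
```

Going from 99.99% to 99.97% triples the allowed downtime, from roughly 53 minutes to roughly 158 minutes per year.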

Technical requirements

In addition to the business requirements, there are some technical requirements:

  1. The system must be monitored for failure conditions (MON). Failures include
    1. Excessive database connect time (may indicate a DB problem)
    2. Query rate too low (may indicate an upstream problem, perhaps with web or network)
    3. Ping failure (either a network problem or an OS crash)
    4. App not listening on port (may indicate that the app has crashed)
  2. There should be environmental monitoring as follows:
    1. Electrical: over voltage, under voltage, outage on any phase, excessive current on any phase
    2. Temperature: above the acceptable limit
    3. Water on floor
    4. Link to fire alarm
  3. There should be a log processing system (LPS) that gives the following reports:
    1. Usage overall (for capacity planning)
    2. Advertising hits (to bill our advertisers)
    3. List of 404s (may indicate deep linking from the outside)
  4. There should be a customer support subsystem (CSS) that allows operators to change user passwords, remove entries from the database, and create accounts.
  5. The systems must have no single point of failure except possibly in the CSS, LPS, and MON subsystems. There should be 99.97% uptime or better.
  6. There should be the following environments:
    1. Development (dev)
      Usage and characteristics: Used for development.  Software includes compilers, debuggers, documentation tools.
      Reliability and repair: Systems may crash at will, repair SLA is next business day.
      Configuration control: No formal configuration control, or control by developers.
      Hardware: Virtual machines are acceptable.  Whatever is lying around.
    2. Test (test)
      Usage and characteristics: Test systems, including simulated load generators.
      Reliability and repair: Systems may crash at will, repair SLA is next business day.
      Configuration control: Informal configuration control by software leads.
      Hardware: Virtual machines are acceptable.  Whatever is lying around.  Must be fast enough to generate a proper load.
    3. Integration (int)
      Usage and characteristics: Used to simulate a production environment, and test software release procedures.
      Reliability and repair: Systems ought not to crash.  Repair SLA is next business day.
      Configuration control: Informal configuration control by QA leads.
      Hardware: Virtual machines are acceptable.
    4. Load (load)
      Usage and characteristics: Used to simulate a production environment and check that performance is what is desired.
      Reliability and repair: Systems ought not to crash.  Repair SLA is next business day.
      Configuration control: Informal configuration control by QA leads.
      Hardware: Production hardware.  Virtual machines are acceptable if and only if production machines with the same function are virtual machines.
    5. Production (prod)
      Usage and characteristics: Production environment.
      Reliability and repair: Repair SLA is 1 hour 24x7x365.
      Configuration control: Formal configuration control by management.
      Hardware: Production hardware.  Virtual machines are acceptable if approved by the system architect.
    6. Ancillary (an)
      Usage and characteristics: Advertising content management, log processing, system monitoring, security monitoring, release system.
      Reliability and repair: Repair SLA is defined on a per machine basis.
      Configuration control: Configuration controls as appropriate for the application.
      Hardware: Production hardware.  Virtual machines are acceptable if approved by the system architect.
    7. Customer facing
      Usage and characteristics: Rewrite rules go to other machines.
      Reliability and repair: Repair SLA is 1 hour 24x7x365.
      Configuration control: Formal configuration control by management.
      Hardware: Production hardware.  Virtual machines are acceptable if approved by the system architect.
  7. Remote management.
  8. Racking requirements(1): The systems shall be installed in racks as follows:
    1. Requirement: Prevent water damage in case of a flood in the computer room.
      Rule: No machine installed lower than 8" above the floor.
      Rationale: Why 8"?  So that if somebody is wading through the water, the waves won't touch the machines and possibly start a fire or kill somebody.
    2. Requirement: Any power supply might fail.
      Rule: All production computers have hot swappable dual power supplies.
      Rationale: If a PS fails, then the machine won't go down, and the PS can be repaired without interrupting service.
    3. Rule: Dual power supplies on different phases.
      Rationale: A phase might fail (it's more likely that two phases will fail, but a single phase failure is still a possibility).
    4. Rule: Machines in a farm must be spread out over several racks.
      Rationale: If a PDU in a rack fails, it won't take out an entire farm.
    5. Rule: Current draw must be balanced across all phases in a rack; no more than 40% of nominal max current may be drawn under normal conditions.
      Rationale: If one of the PDUs fails, then all of the load goes to the other PDU.  If any PDU is loaded more than 50% when the failover occurs, the surviving PDU will be overloaded.
    6. Requirement: SOX and FACTA.
      Rule: Financial processing machines must have special security mounting with limited access controls.
    7. Requirement: Good airflow.
      Rule: Alternating aisles between racks should be "hot side" "cold side".  Computers should draw air from the front, which is the cold side, and exhaust it on the back, which is the hot side.  The fronts of the computers should face the fronts of the computers on the next row; similarly the backs of the computers should face the backs of the computers.
      Rationale: The computers will stay cooler if they draw in cool air.  Once the air has been heated, it will tend to rise out the hot side and be captured by the HVAC.

  9. Environmental
    1. Sufficient cooling
    2. Sufficient electrical
    3. Power reliability: batteries, diesel generators
  10. Physical Security
  11. Computing Security
  12. Disaster recovery (DR)
    1. The DR site should be at some distance from the primary site
    2. The DR site need not replicate the primary site.  In particular,
      1. Degraded response time is acceptable
      2. Software releases will not be done while in DR mode
      3. Log processing can be delayed and not all log processing needs to be done.
    3. Failover mechanism should be practiced every 3 months.
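
The "app not listening on port" and "excessive connect time" failure conditions in technical requirement 1 can be sketched roughly as follows.  This is only an illustration, not the monitoring system itself: the function names, the 3 second timeout, and the 1 second "slow" threshold are all assumptions of mine, and a real MON subsystem would add ping and query-rate checks.

```python
import socket
import time


def check_port(host: str, port: int, timeout_s: float = 3.0) -> tuple[bool, float]:
    """Attempt a TCP connect; return (is_listening, elapsed_seconds)."""
    start = time.monotonic()
    try:
        # create_connection raises OSError on refusal, timeout, or unreachable host.
        with socket.create_connection((host, port), timeout=timeout_s):
            listening = True
    except OSError:
        listening = False
    return listening, time.monotonic() - start


def classify(listening: bool, elapsed_s: float, slow_threshold_s: float = 1.0) -> str:
    """Map one probe result onto a monitoring status (threshold is hypothetical)."""
    if not listening:
        return "DOWN"   # app may have crashed, or the host is unreachable
    if elapsed_s > slow_threshold_s:
        return "SLOW"   # excessive connect time, possibly a struggling service
    return "OK"
```

A monitor would call something like `classify(*check_port(host, port))` on a schedule for each production host and raise an alert on anything other than "OK".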
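
The 40% rule in the racking requirements is simple arithmetic, and a sketch may help.  I am assuming a rack fed by exactly two PDUs with the same nominal rating; the function names are made up for illustration.

```python
def failover_load_fraction(pdu_a_amps: float, pdu_b_amps: float,
                           nominal_max_amps: float) -> float:
    """Load on the surviving PDU, as a fraction of nominal max, if the other fails."""
    return (pdu_a_amps + pdu_b_amps) / nominal_max_amps


def check_rack(pdu_a_amps: float, pdu_b_amps: float, nominal_max_amps: float,
               normal_limit: float = 0.40) -> list[str]:
    """Return a list of problems; an empty list means the rack follows the rule."""
    problems = []
    for name, amps in (("A", pdu_a_amps), ("B", pdu_b_amps)):
        if amps > normal_limit * nominal_max_amps:
            problems.append(
                f"PDU {name} draws {amps} A, over the {normal_limit:.0%} limit")
    # After a failover, the survivor carries the sum of both loads.
    if failover_load_fraction(pdu_a_amps, pdu_b_amps, nominal_max_amps) > 1.0:
        problems.append("surviving PDU would be overloaded after a failover")
    return problems
```

With two PDUs each held under 40%, the survivor ends up under 80% after a failover, which leaves some headroom; at more than 50% each, the survivor would be asked for more than its nominal maximum.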

Footnotes:
  1. I am deeply indebted to James Littlefield of Real Networks for these ideas.
  2. This ought to be obvious from the chapter on statistics, but in case you missed it:  365.25 days/year * 24 hours/day = 8766 hours/year.  99.97% reliability = 0.03% failure.  8766 hours * 0.03% = 2.63 hours = 158 minutes.