CS 449/549

Fault Tolerant Systems

Catalog Description: Design, modeling, analysis and integration of hardware and software to achieve dependable computing systems employing on-line fault tolerance; theory and fundamental concepts of designing reliable systems; analytical evaluation techniques, faults and advances in ultra-reliable distributed systems, fault-tolerant software systems; case studies include the Space Shuttle, Airbus, and Boeing fly-by-wire primary flight computers as well as systems in reliable databases and financial markets. Additional projects and assignments required for graduate credit.

Total Credits: 3

Course Coordinator: Axel Krings

URL: http://www2.cs.uidaho.edu/~krings/CS449


Prerequisites by Topic:

  • Basic knowledge of computer architecture and computer organization, e.g., memory system organization and architecture, interfacing and communication, functional organization, multiprocessing and alternative architectures. (CS 150)
  • Basic topics of operating systems, e.g., operating system principles, memory management, file systems, concurrency, input/output, OS security. (CS 240)

Major Topics Covered

  1. Introduction to fault-tolerance, safety-critical systems, and top challenges (2 hours)
  2. Standard definitions, e.g., reliability, fault-error-failure (1 hour)
  3. Redundancy concepts (spatial, information, time), error detection / correction (2 hours)
  4. Reliability analysis: math background, bathtub curve, MTTF (2 hours)
  5. Reliability block diagrams and fault tree analysis (2 hours)
  6. Reliability analysis using Markov analysis (3 hours)
  7. Reliability analysis using Petri nets (2 hours)
  8. Distributed systems: ordering / synchronizing, reliable / atomic / causal broadcast (5 hours)
  9. Fault-tolerant agreement, consensus, fault models (5 hours)
  10. Clock synchronization (3 hours)
  11. Recovery strategies, checkpointing, message logging (2 hours)
  12. RAID systems, fail-stop processes (3 hours)
  13. Diagnosability (2 hours)
  14. Case studies: Space Shuttle, Boeing 777, SIFT, Tandem, NonStop System Cyclone, Himalaya, MAFT (6 hours)

Course Outcomes

Upon completion of this course, students should:

  1. Understand the terminology of fault-tolerant system design, e.g., dependability, reliability, safety, and maintainability,
  2. Understand the basic concepts of redundancy and redundancy management,
  3. Be able to analyze systems using reliability block diagrams, fault-tree analysis, Markov chains and Petri nets,
  4. Be able to reason about failure rates and hazard functions as they relate to the bathtub curve,
  5. Understand the function and analysis of series/parallel systems, standby redundancy, M-of-N systems, as well as recovery strategies,
  6. Have a solid understanding of faults, errors and failures in the context of fault models,
  7. Understand agreement algorithms, including exact and approximate agreement, e.g., Byzantine agreement or clock synchronization,
  8. Understand basic concepts such as fail-stop processes and the issues involved in PMC-derived system diagnosis,
  9. Be familiar with and understand common case studies.