CS 449/549
Fault Tolerant Systems
Catalog Description: Design, modeling, analysis and integration of hardware and software to achieve dependable computing systems employing on-line fault tolerance; theory and fundamental concepts of designing reliable systems; analytical evaluation techniques, faults and advances in ultra-reliable distributed systems, fault-tolerant software systems; case studies include the Space Shuttle, Airbus, and Boeing fly-by-wire primary flight computers as well as systems in reliable databases and financial markets. Additional projects and assignments required for graduate credit.
Total Credits: 3
Course Coordinator: Axel Krings
URL: http://www2.cs.uidaho.edu/~krings/CS449
Textbook:
Prerequisites by Topic:
- Basic knowledge of computer architecture and computer organization, e.g., memory system organization and architecture, interfacing and communication, functional organization, multiprocessing and alternative architectures. (CS 150)
- Basic topics of operating systems, e.g., OS system principles, memory management, file system, concurrency, input/output, OS security. (CS 240)
Major Topics Covered
- Introduction to fault-tolerance, safety-critical systems, and top challenges (2 hours)
- Standard definitions, e.g., reliability, fault-error-failure (1 hour)
- Redundancy concepts (spatial, information, time), error detection / correction (2 hours)
- Reliability analysis: math background, bathtub curve, MTTF (2 hours)
- Reliability block diagrams and fault tree analysis (2 hours)
- Reliability analysis using Markov analysis (3 hours)
- Reliability analysis using Petri nets (2 hours)
- Distributed systems: ordering / synchronizing, reliable / atomic / causal broadcast (5 hours)
- Fault-tolerant agreement, consensus, fault models (5 hours)
- Clock synchronization (3 hours)
- Recovery strategies, checkpointing, message logging (2 hours)
- RAID systems, fail-stop processes (3 hours)
- Diagnosability (2 hours)
- Case studies: Space Shuttle, Boeing 777, SIFT, Tandem NonStop systems (Cyclone, Himalaya), MAFT (6 hours)
Course Outcomes
Upon completion of this course, students should:
- Understand the terminology of fault-tolerant system design, e.g., dependability, reliability, safety, and maintainability,
- Understand the basic concepts of redundancy and redundancy management,
- Be able to analyze systems using reliability block diagrams, fault-tree analysis, Markov chains and Petri nets,
- Be able to reason about failure rates and hazard functions as they relate to the bathtub curve,
- Understand the function and analysis of series/parallel systems, standby redundancy, M-of-N systems, as well as recovery strategies,
- Have a solid understanding of faults, errors and failures in the context of fault models,
- Understand agreement algorithms, both exact and approximate, e.g., Byzantine agreement or clock synchronization,
- Understand basic concepts such as fail-stop processes and the issues involved in PMC-derived system diagnosis,
- Be familiar with and understand common case studies.