Service-based Fault Tolerance for Cyber-Physical Systems: A Systems Engineering Approach
Alho, Pekka (2015)
Alho, Pekka
Tampere University of Technology
2015
Rakennetun ympäristön tiedekunta - Faculty of Built Environment
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Julkaisun pysyvä osoite on
https://urn.fi/URN:ISBN:978-952-15-3660-1
https://urn.fi/URN:ISBN:978-952-15-3660-1
Tiivistelmä
Cyber-physical systems (CPSs) comprise networked computing units that monitor and control physical processes in feedback loops. CPSs have potential to change the ways people and computers interact with the physical world by enabling new ways to control and optimize systems through improved connectivity and computing capabilities. Compared to classical control theory, these systems involve greater unpredictability which may affect the stability and dynamics of the physical subsystems. Further uncertainty is introduced by the dynamic and open computing environments with rapidly changing connections and system configurations. However, due to interactions with the physical world, the dependable operation and tolerance of failures in both cyber and physical components are essential requirements for these systems.
The problem of achieving dependable operations for open and networked control systems is approached using a systems engineering process to gain an understanding of the problem domain, since fault tolerance cannot be solved only as a software problem due to the nature of CPSs, which includes close coordination among hardware, software and physical objects. The research methodology consists of developing a concept design, implementing prototypes, and empirically testing the prototypes. Even though modularity has been acknowledged as a key element of fault tolerance, the fault tolerance of highly modular service-oriented architectures (SOAs) has been sparsely researched, especially in distributed real-time systems. This thesis proposes and implements an approach based on using loosely coupled real-time SOA to implement fault tolerance for a teleoperation system.
Based on empirical experiments, modularity on a service level can be used to support fault tolerance (i.e., the isolation and recovery of faults). Fault recovery can be achieved for certain categories of faults (i.e., non-deterministic and aging-related) based on loose coupling and diverse operation modes. The proposed architecture also supports the straightforward integration of fault tolerance patterns, such as FAIL-SAFE, HEARTBEAT, ESCALATION and SERVICE MANAGER, which are used in the prototype systems to support dependability requirements. For service failures, systems rely on fail-safe behaviours, diverse modes of operation and fault escalation to backup services. Instead of using time-bounded reconfiguration, services operate in best-effort capabilities, providing resilience for the system. This enables, for example, on-the-fly service changes, smooth recoveries from service failures and adaptations to new computing environments, which are essential requirements for CPSs.
The results are combined into a systems engineering approach to dependability, which includes an analysis of the role of safety-critical requirements for control system software architecture design, architectural design, a dependability-case development approach for CPSs and domain-specific fault taxonomies, which support dependability case development and system reliability analyses. Other contributions of this work include three new patterns for fault tolerance in CPSs: DATA-CENTRIC ARCHITECTURE, LET IT CRASH and SERVICE MANAGER. These are presented together with a pattern language that shows how they relate to other patterns available for the domain.
The problem of achieving dependable operations for open and networked control systems is approached using a systems engineering process to gain an understanding of the problem domain, since fault tolerance cannot be solved only as a software problem due to the nature of CPSs, which includes close coordination among hardware, software and physical objects. The research methodology consists of developing a concept design, implementing prototypes, and empirically testing the prototypes. Even though modularity has been acknowledged as a key element of fault tolerance, the fault tolerance of highly modular service-oriented architectures (SOAs) has been sparsely researched, especially in distributed real-time systems. This thesis proposes and implements an approach based on using loosely coupled real-time SOA to implement fault tolerance for a teleoperation system.
Based on empirical experiments, modularity on a service level can be used to support fault tolerance (i.e., the isolation and recovery of faults). Fault recovery can be achieved for certain categories of faults (i.e., non-deterministic and aging-related) based on loose coupling and diverse operation modes. The proposed architecture also supports the straightforward integration of fault tolerance patterns, such as FAIL-SAFE, HEARTBEAT, ESCALATION and SERVICE MANAGER, which are used in the prototype systems to support dependability requirements. For service failures, systems rely on fail-safe behaviours, diverse modes of operation and fault escalation to backup services. Instead of using time-bounded reconfiguration, services operate in best-effort capabilities, providing resilience for the system. This enables, for example, on-the-fly service changes, smooth recoveries from service failures and adaptations to new computing environments, which are essential requirements for CPSs.
The results are combined into a systems engineering approach to dependability, which includes an analysis of the role of safety-critical requirements for control system software architecture design, architectural design, a dependability-case development approach for CPSs and domain-specific fault taxonomies, which support dependability case development and system reliability analyses. Other contributions of this work include three new patterns for fault tolerance in CPSs: DATA-CENTRIC ARCHITECTURE, LET IT CRASH and SERVICE MANAGER. These are presented together with a pattern language that shows how they relate to other patterns available for the domain.
Kokoelmat
- Väitöskirjat [4865]