Fault Injection in Virtual Platforms
When building virtual platforms, the focus is usually on correct system operation, especially for digital twins in the automotive and aerospace industries. Software can only be run reliably on a virtual platform if the platform itself models the system correctly.
To ensure that a system operates properly, it is also necessary to test how it behaves under anomalies and failures. This is commonly done with fault injection: deliberately introducing faults into the target system to accelerate the occurrence of failures. Observing how the system responds to the faults makes it possible to evaluate its reliability.
Fault injection is generally categorized into the following types:
Hardware-based Fault Injection: This method uses additional hardware to introduce faults into the target system’s hardware.
Simulation-based Fault Injection: At the system design stage, a hardware simulation model of the target system is created using a standard hardware description language, and fault injection units are inserted into the model to simulate faults.
Software-based Fault Injection: This method involves modifying the memory image of the target system according to a fault model to simulate hardware or software faults in the system.
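The software-based approach can be illustrated with a minimal sketch. It assumes the memory image is available as a mutable byte buffer and uses a single bit flip as the fault model. The helper name flip_bit and the 16-byte image are hypothetical and chosen only for illustration.

```python
def flip_bit(image: bytearray, address: int, bit: int) -> int:
    """Flip one bit of the image in place and return the new byte value."""
    if not 0 <= address < len(image):
        raise IndexError(f"address {address:#x} is outside the memory image")
    if not 0 <= bit < 8:
        raise ValueError("bit must be in the range 0..7")
    image[address] ^= 1 << bit
    return image[address]


# A 16-byte stand-in for the target system's memory image (hypothetical).
memory_image = bytearray(16)
memory_image[4] = 0b0000_1111

new_value = flip_bit(memory_image, address=4, bit=7)
print(f"{new_value:#010b}")  # 0b10001111
```

Flipping the same bit a second time restores the original value, which makes it easy to return the image to a known state between experiments.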
It is well known that performing fault injection directly on hardware is both complex and expensive. Additionally, it is difficult to precisely control the timing and location of the injection, and it can potentially damage the target system's hardware. Virtual platforms offer a solution to circumvent these issues.
As digital twins of hardware systems, virtual platforms allow faults, such as hard drive failures or Denial of Service (DoS) attacks, to be injected in any system state and at any location. This enables cost-effective, highly reliable, and infinitely repeatable fault injection for software testing. The advantages of performing fault injection on virtual platforms include, but are not limited to:
Deterministic Environment: The virtual platform environment is fully deterministic, simplifying fault analysis as environmental factors are not a concern.
Automated Fault Injection: Fault injection can be automated with scripts, so that many fault combinations can be tested more efficiently than on hardware.
Cost Efficiency: Using virtual platforms for fault injection can reduce costs by approximately 90% compared to direct hardware-based fault injection.
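A scripted, repeatable campaign of the kind described in the list above might look like the following sketch. It is not tied to any particular virtual platform: the target is a toy stand-in that only checks a CRC over a 64-byte image, and the names run_target and campaign are hypothetical. A seeded random generator stands in for the deterministic environment, so the same seed always yields the same fault sequence.

```python
import random
import zlib


def run_target(image: bytes, expected_crc: int) -> str:
    """Toy system under test: it notices a fault only if the CRC changes."""
    return "detected" if zlib.crc32(image) != expected_crc else "undetected"


def campaign(seed: int, runs: int, faults_per_run: int) -> list:
    """Inject random bit flips into a fresh image for each run."""
    rng = random.Random(seed)      # seeded, so the campaign is repeatable
    golden = bytes(range(64))      # fault-free reference image
    expected_crc = zlib.crc32(golden)
    results = []
    for _ in range(runs):
        image = bytearray(golden)  # every run starts from the same state
        faults = []
        for _ in range(faults_per_run):
            address = rng.randrange(len(image))
            bit = rng.randrange(8)
            image[address] ^= 1 << bit
            faults.append((address, bit))
        results.append((faults, run_target(bytes(image), expected_crc)))
    return results


first = campaign(seed=1, runs=5, faults_per_run=2)
second = campaign(seed=1, runs=5, faults_per_run=2)
print(first == second)  # True: same seed, same faults, same outcomes
```

Recording the seed with each result is enough to replay any individual run later, which is difficult to achieve when injecting faults into physical hardware.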
In general, fault injection on virtual platforms requires extending the system model. Some faults can be introduced through simple changes to model state, such as modifying register values or memory content, while other types require dedicated fault models. In practice, besides deliberately designed fault injections, unexpected problems also arise that were never explicitly defined as "faults" in advance.
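The difference between a simple state change and a dedicated fault model can be sketched with a toy register file. The class RegisterModel and its methods are hypothetical and do not correspond to the API of any real simulator. Here poke overwrites a stored value once, while inject_stuck_at extends the model so that a bit keeps its forced level across later writes.

```python
class RegisterModel:
    """Toy 32-bit register file with optional stuck-at faults."""

    def __init__(self, count: int) -> None:
        self._values = [0] * count
        self._stuck = {}  # register index -> (mask, forced_bits)

    def write(self, index: int, value: int) -> None:
        """Normal write performed by the simulated software."""
        self._values[index] = value & 0xFFFFFFFF

    def read(self, index: int) -> int:
        """Return the stored value with any stuck-at bits applied."""
        mask, forced = self._stuck.get(index, (0, 0))
        return (self._values[index] & ~mask) | (forced & mask)

    def poke(self, index: int, value: int) -> None:
        """Simple state change: overwrite the stored value once."""
        self._values[index] = value & 0xFFFFFFFF

    def inject_stuck_at(self, index: int, bit: int, level: int) -> None:
        """Fault model: force one bit to a fixed level on every read."""
        mask, forced = self._stuck.get(index, (0, 0))
        mask |= 1 << bit
        forced = forced | (1 << bit) if level else forced & ~(1 << bit)
        self._stuck[index] = (mask, forced)


regs = RegisterModel(4)

regs.poke(0, 0xFF)        # transient fault: lost on the next write
regs.write(0, 0x10)
print(hex(regs.read(0)))  # 0x10

regs.inject_stuck_at(1, bit=0, level=1)  # persistent fault
regs.write(1, 0x20)
print(hex(regs.read(1)))  # 0x21
```

The transient fault needs no change to the model beyond access to its state, whereas the stuck-at fault requires the read path itself to be aware of the fault.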
Unexpected "Clock Drift" Incident
An operating system running on the virtual platform crashed unexpectedly with a "divide by zero" error. The likely cause was related to clock drift. On real hardware, processor cores on different chips usually do not run in exact synchrony, because each clock deviates slightly from its nominal frequency. This sustained deviation, caused by physical design factors, is commonly referred to as "clock drift."
In virtual platforms, time is tightly coordinated, and the typical clock drift found in real hardware is not modeled. This led to an issue that would not occur in physical hardware but manifested on the virtual platform.
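The text does not describe the code that crashed, so the following is only a hypothetical reconstruction of how missing clock drift could produce a division by zero. It assumes a routine, here called ticks_until_resync, that divides by the measured offset between the counters of two cores. The function names, the 20 ppm drift, and the tolerance value are all invented for illustration.

```python
def counter(nominal_ticks: int, drift_ppm: int) -> int:
    """Counter value after nominal_ticks on a clock running drift_ppm fast."""
    return nominal_ticks + nominal_ticks * drift_ppm // 1_000_000


def ticks_until_resync(elapsed: int, count_a: int, count_b: int, tolerance: int) -> int:
    """Hypothetical routine: estimate when the offset will exceed tolerance.

    It divides by the measured offset, implicitly assuming that two
    independent clocks never agree exactly.
    """
    offset = abs(count_b - count_a)
    return tolerance * elapsed // offset


elapsed = 1_000_000

# Real hardware (assumed 20 ppm drift on core B): the offset is nonzero.
on_hardware = ticks_until_resync(
    elapsed, counter(elapsed, 0), counter(elapsed, 20), tolerance=100
)
print(on_hardware)  # 5000000

# Virtual platform without a drift model: both counters are identical.
try:
    ticks_until_resync(elapsed, counter(elapsed, 0), counter(elapsed, 0), tolerance=100)
except ZeroDivisionError as error:
    print(f"crash on the virtual platform: {error}")
```

Under these assumptions, the defect is a latent one in the software, and the perfectly synchronized virtual clocks merely expose a case that physical hardware would almost never produce.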
Unexpected "Simultaneous Board Insertion" Incident
During a test of whether the system software could correctly detect and activate newly inserted circuit boards (specifically, boards hot-plugged into a live backplane), the system crashed when two boards were added simultaneously on the virtual platform. To determine whether the bug was in the virtual platform or in the system itself, the natural next step is to reproduce the scenario on real hardware by hot-plugging two boards into the backplane at the same time and comparing the stack traces.
On real hardware, however, the backplane burned out the moment both boards were inserted. Although such occurrences are rare, the fault could have been caused by a short circuit or overload, and the root cause is that this scenario had not been considered during the early design phase. This highlights an essential point for developers: plan for unforeseen events.
SkyEye: A Fault Injection Utility
SkyEye is a behavior-level hardware simulation environment based on visual modeling. Engineers can inject faults at will within the virtual test environment, pause or reverse execution at any moment, and reproduce defects consistently. This helps overcome the limitations of hardware availability and configuration when tracking down software issues.