Chaos Engineering for PCF
Modern Internet-scale microservice architectures exhibit complex communication patterns and failure scenarios. Their behavior can be chaotic (a.k.a. the Butterfly Effect): a small fault may cascade into a large-scale disruptive event. This complexity comes from the PCF components, the services running on them, and the underlying infrastructure needed to provide highly available compute, network, security, storage, and persistence services. For a distributed microservice architecture to function well, these elements must all work in tandem and tolerate failure. Verifying systematically that a system can tolerate failure requires a disciplined approach. One such approach is Chaos Engineering. This proposal demonstrates the approach and the custom tools T-Mobile is building to purposefully break systems, identify weaknesses, and take corrective action. The tooling is an enhanced API on top of the ChaosLemur project for introducing more complex failure scenarios into the PCF environment.
Sr. Engineer, T-Mobile