Modern, internet-scale microservice architectures exhibit complex communication patterns and failure modes. Small faults can cascade chaotically (the Butterfly Effect) into large-scale, disruptive events. This complexity comes from the Pivotal Cloud Foundry (PCF) components, the services running on them, and the underlying infrastructure that provides highly available compute, network, security, storage, and persistence services. For a distributed microservice architecture to function reliably, these elements must all work in tandem and tolerate failure. Systematically verifying that a system can tolerate failure requires a disciplined approach. One such approach is “Chaos Engineering.”
This proposal describes the continued progress T-Mobile has made with Chaos Engineering. We'll focus both on our development of open-source tools designed to inject failures into Cloud Foundry and on our first attempts at Game Days, where we sit down with developers and help them discover weaknesses in their apps.