Monitoring critical Pivotal Cloud Foundry (PCF) components at large scale can be tremendously challenging. At T-Mobile, with more than 40,000 application instances (AIs) and service instances (SIs) hosted across 15+ foundations, it's business-critical to identify unhealthy components and remedy the underlying problems in real time. Beyond PCF KPIs, dedicated monitoring tools are necessary to keep the core components and workflows healthy. The smoke tests offered by the CF community are useful but have a few limitations: cleanup is unreliable, status is not reported for granular operations, and nonessential operations run with admin privileges.
In this talk, we'll present a comprehensive suite of smoke tests that we developed with a plug-and-play architecture, keeping reliability, efficiency, and maintainability at the center. We'll discuss our monitoring scope, our solution architecture, and benefits such as less time spent identifying unhealthy components and workflows, troubleshooting them, and restoring the system to its normal state.