System Resiliency
Category: infrastructure
The ability of a system to maintain acceptable service levels in the face of faults and challenges.
Resiliency is about "bouncing back." The goal is not to prevent every error, which is impossible, but to ensure that when errors occur, the system keeps serving requests at an acceptable level and recovers quickly. Chaos engineering, automated retries, and circuit breakers are all practices and patterns used to achieve resiliency.
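Two of the patterns named above, automated retries and circuit breakers, can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the names `CircuitBreaker`, `CircuitOpenError`, and `retry`, along with the thresholds and delays, are hypothetical choices made for this example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and a call is rejected without being attempted."""


class CircuitBreaker:
    """Hypothetical minimal breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so the breaker can be tested without waiting
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of piling load onto a struggling dependency.
                raise CircuitOpenError("circuit open; call rejected")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def retry(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call func up to `attempts` times, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise                   # retrying an open circuit would defeat its purpose
        except Exception:
            if attempt == attempts - 1:
                raise               # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

A caller might combine the two as `retry(lambda: breaker.call(fetch))`, where `fetch` stands in for any call to a remote dependency. Transient faults are then absorbed by the retries, while a sustained outage trips the breaker and is rejected immediately.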
Common Examples
- We invested heavily in system resiliency, ensuring that an outage in one region automatically redirects broker traffic to a functioning node.
- True system resiliency is achieved only when the architecture is designed to assume that every component will eventually fail.