Future of Reliability Engineering (Part 2)

In early May, I gave a presentation on the ‘Future of Reliability Engineering’. I wanted to break down the five new trends that I see emerging in a blog-post series: Blog Post Series: Evolution of the Network Engineer Failure is the new Normal (move towards Chaos-Engineering) Automation as a Service Cloud is King Observe & Measure Failure is the new Normal (move towards Chaos-Engineering) a) Breaking down Silo’s Let’s be real, software will always have bugs, infrastructure will also eventually fail. This isn’t solely a engineering or operations problem, this is everyones problem Chaos-Engineering as a practice is actually a good example of breaking down silo’s. Everyone reading this is probably well versed in this meme about “it’s ops problems now” (link). Chaos Engineering forces that pattern to be broken by requiring developers to be invovled in the process. As a result of your chaos-engineering testing, engineering teams should be fixing weak points in your applications/ infrastructure. b) Failure Management Chaos-Engineering is a great way to test various processes around “failure”. This goes from engineering training to monitoring and automation, all the way to incident response. A continuing chaos-engineering practice should be a continual loop of learning and improvement beyond just the code or infrastructure; it’s also about improving processes. c) Testing Chaos-Engineering is obviously a form of testing in itself, but more generally speaking, it should be used as a way to improve your general testing posture. The feedback loop in a chaos-engineering practice should invovlve engineers writing more resilient code that makes it harder for systems to fail. Ofcourse these fixes also need to be continually tested. d) Automation As SRE’s, we aim reduce manual toil as much as possible and have tools/ systems to do the work for us; in a reliable, predictable manner. This same principal applies for chaos-engineering particularly because your aim is to break the system. Since you are trying to ‘break’ the system, you need to do this in a reliable, repeatable manner. If you are going to perform chaos-engineering at all, you should at the very least have a source-controlled script that everyone can find and understand. e) Measure Everything Within a Chaos Engineering practice, the aim is to alter the normal state of service, observe & measure everything that happens in a controlled environment. Before you perform any chaos-engineering test, you should have a solid understanding of what parts of the system you need to observe, closely observe what changes during the chaos test and write up your observations. Over time, you should be able to tell a story of higher availability/ reliability of the service as the chaos-engineering tests and results improve the software. I would strongly recommend a template similar to this to record your findings. In the next post, we’ll look at ‘Automation as a Service’.

Future of Reliability Engineering (Part 1)

Last month at Interop ITX, I gave a presentation on the ‘Future of Reliability Engineering’. I wanted to break down the five new trends that I see emerging in a blog-post series: Evolution of the Network Engineer (towards Network Reliability Engineers) a) Breaking down Silo’s The network is no longer a silo. Applications run over the network in a distributed fashion requiring low-latency and large data-pipes. With these requirements, network engineers must understand these requirements, understand how applications are generally deployed for troubleshooting purposes and ensure that they have models to plan for capacity management. b) Failure management “It’s not a matter of avodiing failure, it’s preparing for what to do when things fail.” Failures are going to happen, the problem is managing them. Theoretically protocols like OSPF, BGP, HSRP give us redundancy, but how does that play in practice. Have you tested it? Equally, a large number of failures in the network come from L1 grey failures. How to you plan to detect and mitigate against these faults? c) Testing Testing has been an area that network engineering has lacked in forever. In the past 10 or so years, tools like GNS3 have moved the needle. Similarly, as Linux Network Operating Systems become more common, the ability to stage configuration and run regression tests. In the future, we will get to the point where you can run tests of new configuration before it is deployed to production. d) Network programmability & Automation Over the past few years, network device management has finally evolved (slightly) away from SSH/ Telnet/ SNMP into programmable models where we can get away from copy-pasting configuration everywhere and polling devices to death (literally). This is not dissimilar from the evolution of server management, where tools like Puppet/ Chef/ CFEngine/ Salt came along and made device management extremely simple and orchestratable. In a number of ways (depending on the organization), this displaced the traditional system administration role. Coming back to the network again, as the network becomes programmable, the traditional network engineer role will need to evolve to know how to program and orchestrate these systems en-masse. e) Measure everything! Long live the days of only having SNMP available to measure network performance. The rise of streaming telemetry, and the use of network agents to validate network performance is a must. Further to that, triaging network issues is generally extremely difficult, so you should have some form of monitoring on most layers of the network L1, L2, L3, L4 and L7. There should be ways to measure availability of your network The next post in this series will be on Chaos Engineering.