# networking

### Complete guide to iptables implementation

I've been wanting to put this article together for some time: a complete guide to implementing iptables on a Linux server.

Firstly, my assumptions:

- You have a reasonable grasp of Linux and iptables
- You want to use iptables to secure a Linux server

**The Basics**

By default, iptables has three chains in the FILTER table:

- INPUT
- OUTPUT
- FORWARD

In this case, we're going to focus on the INPUT chain (incoming to the firewall, i.e. packets destined for the local server).

**Implementation & Automation**

I implement these rules using the puppet-iptables module. The module is regularly updated and has a very large feature set.

References:

- https://gist.github.com/jirutka/3742890
- http://www.cyberciti.biz/tips/linux-iptables-10-how-to-block-common-attack.html
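To make the INPUT-chain focus concrete, here is a minimal baseline ruleset. This is an illustrative sketch only, not the exact rules the puppet-iptables module generates; the SSH port and logging choices are assumptions you should adapt to your environment.

```shell
# Flush the INPUT chain and default-deny anything we don't explicitly allow
iptables -F INPUT
iptables -P INPUT DROP

# Allow loopback traffic and packets belonging to established connections
iptables -A INPUT -i lo -j ACCEPT
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT

# Allow new inbound SSH sessions (tighten with a source address if you can)
iptables -A INPUT -p tcp --dport 22 -m state --state NEW -j ACCEPT

# Allow ICMP echo for basic reachability checks
iptables -A INPUT -p icmp --icmp-type echo-request -j ACCEPT

# Log, then drop, everything else
iptables -A INPUT -j LOG --log-prefix "INPUT-DROP: " --log-level 4
iptables -A INPUT -j DROP
```

The final DROP is technically redundant given the default-deny policy, but making it explicit keeps the logged and dropped traffic adjacent in the chain.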

# podcast

### My SRE path on Full Stack Journey Podcast

I recently caught up with Scott Lowe of the Full Stack Journey Podcast to talk about my SRE career. You can find the episode here.

# postmortem

### A Chaos Engineering Gameday Report Template

Following on from my Postmortem Template Document, I thought it would be prudent to write a similar template for Chaos Engineering Gamedays. You can find the empty template here; I'll create a filled-out example template in the coming weeks. Please feel free to tweet at me with any feedback.

### A Postmortem Template

I've been thinking about this for a while and really wanted to publish my own Postmortem Template. You can find the empty template here; I'll create a filled-out example template in the coming weeks. Please feel free to tweet at me with any feedback.

# publications

### Publication Updates (Jul 22 2018)

In the past month, I have had the pleasure of recording a few podcasts and having some other work published. You can find it all here:

- Devops.com: The Importance of Soft Skills in Engineering
- PyBay: Meet Michael Kehoe: Building Production Ready Python Applications
- Fullstack Journey (PacketPushers): Michael Kehoe
- NetworkCollective: Michael Kehoe

### Future of Reliability Engineering (Part 2)

In early May, I gave a presentation on the 'Future of Reliability Engineering'. I wanted to break down the five new trends that I see emerging in a blog-post series:

- Evolution of the Network Engineer
- Failure is the new Normal (move towards Chaos Engineering)
- Automation as a Service
- Cloud is King
- Observe & Measure

This post covers: Failure is the new Normal (move towards Chaos Engineering).

**a) Breaking down silos**

Let's be real: software will always have bugs, and infrastructure will eventually fail. This isn't solely an engineering or operations problem; it's everyone's problem. Chaos engineering as a practice is actually a good example of breaking down silos. Everyone reading this is probably well versed in the meme about "it's ops' problem now" (link). Chaos engineering forces that pattern to be broken by requiring developers to be involved in the process. As a result of your chaos-engineering testing, engineering teams should be fixing weak points in your applications and infrastructure.

**b) Failure management**

Chaos engineering is a great way to test various processes around "failure", from engineering training to monitoring and automation, all the way to incident response. A continuing chaos-engineering practice should be a continual loop of learning and improvement beyond just the code or infrastructure; it's also about improving processes.

**c) Testing**

Chaos engineering is obviously a form of testing in itself, but more generally it should be used as a way to improve your overall testing posture. The feedback loop in a chaos-engineering practice should involve engineers writing more resilient code that makes it harder for systems to fail. Of course, these fixes also need to be continually tested.

**d) Automation**

As SREs, we aim to reduce manual toil as much as possible and have tools and systems do the work for us in a reliable, predictable manner. The same principle applies to chaos engineering, particularly because your aim is to break the system. Since you are trying to 'break' the system, you need to do this in a reliable, repeatable manner. If you are going to perform chaos engineering at all, you should at the very least have a source-controlled script that everyone can find and understand.

**e) Measure everything**

Within a chaos-engineering practice, the aim is to alter the normal state of the service and observe & measure everything that happens in a controlled environment. Before you perform any chaos-engineering test, you should have a solid understanding of which parts of the system you need to observe; closely observe what changes during the chaos test and write up your observations. Over time, you should be able to tell a story of higher availability and reliability of the service as the chaos-engineering tests and results improve the software. I would strongly recommend a template similar to this to record your findings.

In the next post, we'll look at 'Automation as a Service'.
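The "source-controlled script" point above can be sketched as a tiny experiment harness. This is my own illustration, not the API of any particular chaos tool; the class and method names are invented, and the "system" in the demo is a toy stand-in for a real health check.

```python
import datetime


class ChaosExperiment:
    """A minimal, repeatable chaos experiment: verify steady state,
    inject a fault, observe, roll back, and record every step."""

    def __init__(self, name, steady_state_check, inject_fault, rollback):
        self.name = name
        self.steady_state_check = steady_state_check  # returns True when healthy
        self.inject_fault = inject_fault
        self.rollback = rollback
        self.log = []  # (timestamp, event) pairs: the written-up observations

    def _record(self, event):
        self.log.append((datetime.datetime.now().isoformat(), event))

    def run(self):
        self._record(f"start: {self.name}")
        if not self.steady_state_check():
            self._record("abort: steady state not met before injection")
            return False
        self.inject_fault()
        self._record("fault injected")
        recovered = self.steady_state_check()
        self._record(f"steady state under fault: {recovered}")
        self.rollback()
        self._record("rolled back")
        healthy = self.steady_state_check()
        self._record(f"steady state after rollback: {healthy}")
        return recovered and healthy


if __name__ == "__main__":
    # Toy "system": a dict standing in for a real service's health endpoint
    system = {"healthy": True}
    experiment = ChaosExperiment(
        name="kill-primary",
        steady_state_check=lambda: system["healthy"],
        inject_fault=lambda: system.update(healthy=False),
        rollback=lambda: system.update(healthy=True),
    )
    experiment.run()
    for timestamp, event in experiment.log:
        print(timestamp, event)
```

Because the script lives in source control and logs every step, anyone on the team can find it, understand exactly what was injected, and re-run the same experiment later.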

### Future of Reliability Engineering (Part 1)

Last month at Interop ITX, I gave a presentation on the 'Future of Reliability Engineering'. I wanted to break down the five new trends that I see emerging in a blog-post series. This post covers: Evolution of the Network Engineer (towards Network Reliability Engineers).

**a) Breaking down silos**

The network is no longer a silo. Applications run over the network in a distributed fashion, requiring low latency and large data-pipes. Network engineers must understand these requirements, understand how applications are generally deployed for troubleshooting purposes, and ensure that they have models to plan for capacity management.

**b) Failure management**

"It's not a matter of avoiding failure, it's preparing for what to do when things fail." Failures are going to happen; the problem is managing them. Theoretically, protocols like OSPF, BGP and HSRP give us redundancy, but how does that play out in practice? Have you tested it? Equally, a large number of failures in the network come from L1 grey failures. How do you plan to detect and mitigate these faults?

**c) Testing**

Testing has long been a weak area for network engineering. In the past 10 or so years, tools like GNS3 have moved the needle. Similarly, as Linux network operating systems become more common, so does the ability to stage configuration and run regression tests. In the future, we will get to the point where you can run tests of new configuration before it is deployed to production.

**d) Network programmability & automation**

Over the past few years, network device management has finally evolved (slightly) away from SSH/Telnet/SNMP towards programmable models, where we can get away from copy-pasting configuration everywhere and polling devices to death (literally). This is not dissimilar from the evolution of server management, where tools like Puppet/Chef/CFEngine/Salt came along and made device management extremely simple and orchestratable. In a number of ways (depending on the organization), this displaced the traditional system administration role. Coming back to the network: as the network becomes programmable, the traditional network engineer role will need to evolve to program and orchestrate these systems en masse.

**e) Measure everything!**

Gone are the days of only having SNMP available to measure network performance. The rise of streaming telemetry and the use of network agents to validate network performance are a must. Further, triaging network issues is generally extremely difficult, so you should have some form of monitoring on most layers of the network: L1, L2, L3, L4 and L7. There should be ways to measure the availability of your network.

The next post in this series will be on Chaos Engineering.
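As a tiny illustration of the "network agents" idea, an L4 probe can be written in a few lines of Python. This is my own sketch, not any specific telemetry product: repeated TCP connects approximate handshake round-trip time to a target, which a fleet of agents could report centrally. The demo probes a local listener so it runs anywhere.

```python
import socket
import statistics
import threading
import time


def tcp_connect_ms(host, port, samples=5, timeout=2.0):
    """Measure TCP connection-establishment latency (ms) to host:port.

    A crude stand-in for a network measurement agent: each sample times
    one full TCP handshake, giving an L4 view of path latency.
    """
    results = []
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=timeout):
            pass  # connection established; we only care about the handshake
        results.append((time.perf_counter() - start) * 1000.0)
    return {
        "min_ms": min(results),
        "median_ms": statistics.median(results),
        "max_ms": max(results),
    }


if __name__ == "__main__":
    # Spin up a throwaway local listener so the demo is self-contained
    server = socket.socket()
    server.bind(("127.0.0.1", 0))
    server.listen(8)
    port = server.getsockname()[1]

    def accept_loop():
        while True:
            try:
                conn, _ = server.accept()
                conn.close()
            except OSError:
                return  # server socket closed; stop accepting

    threading.Thread(target=accept_loop, daemon=True).start()
    print(tcp_connect_ms("127.0.0.1", port))
    server.close()
```

Running a probe like this continuously from multiple vantage points, and alerting on the median and max, is the kind of active measurement that SNMP polling alone never gave us.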

### Publication Updates (June 05 2018)

Hi all, I've recently updated my publications page with my latest presentations from:

- Interop ITX 2018: The Future of Reliability Engineering
- Velocity New York 2018: How to Monitor Containers Correctly
- SF Reliability Engineering - May Talks
- Devops Exchange SF April 2018: How to Build Production-Ready Microservices
- Information Week: 3 Myths about the Site Reliability Engineer, Debunked

You can also find me later in the year at:

- PyBay 2018: Building Production-Ready Python Microservices
- Velocity New York 2018: How to Monitor Containers Correctly

### San Francisco Chaos Engineering Meetup Slides

Tonight I have the privilege of speaking alongside Russ Miles and Kolton Andrus at the San Francisco Chaos Engineering Meetup. You can find my slides from the event here.

### Publication Updates (May 27 2017)

Hi all, I just updated my publications page with links to my SRECon17 Americas talks and my new LinkedIn engineering blog post. It was announced this week that I will also have the privilege of speaking at SRECon17 EMEA in Dublin later this year. You can find me talking about:

- Networks for SREs: What do I need to know for troubleshooting applications
- Reducing MTTR and false escalations: Event Correlation at LinkedIn

### Publication Updates (March 11 2017)

Hi all, I just updated my publications page with my APRICOT presentation from earlier in the month. If you're coming to SRECon Americas 2017 this coming week, come and check out my presentations:

- Traffic shift: Avoiding disasters at scale
- Reducing MTTR and false escalations: Event Correlation at LinkedIn


# srecon

### SRECon US 2018 Day 3: What I'm seeing

The talks I'm watching today are:

- Containerization War Stories
- Resolving Outages Faster with Better Debugging Strategies
- Monitoring DNS with Open-Source Solutions
- "Capacity Prediction" instead of "Capacity Planning": How Uber Uses ML to Accurately Forecast Resource Utilization
- Distributed Tracing, Lessons Learned
- Whispers in Chaos: Searching for Weak Signals in Incidents
- Architecting a Technical Post Mortem
- Your System Has Recovered from an Incident, but Have Your Developers?

The Day 3 plenary sessions are:

- The History of Fire Escapes
- Leaping from Mainframes to AWS: Technology Time Travel in the Government
- Operational Excellence in April Fools' Pranks

Come and say hi if you see me!

### SRECon Americas 2018 Day 1 Review

Hi all,

This year marks my 3rd year at SRECon Americas. This year brings a 3-day format, with the first day exclusively dedicated to workshops. Hooray! The workshops included:

- Containers from Scratch
- SRE Classroom, or How to Build a Distributed System in 3 Hours
- Profiling JVM Applications in Production
- Incident Command for IT - What We've Learned from the Fire Department
- Kubernetes 101
- Chaos Engineering Bootcamp
- Ansible for SRE Teams
- Tech Writing 101 for SREs

For the first session, I attended Containers from Scratch. As someone who understands the practical implementation of containers, I really appreciated seeing all the details behind it. You can find the following resources from the presentation: Tutorial material, Linux Primitives.

I unfortunately didn't get a chance to see any of Brent Chapman's session today on Incident Management, but after going to his BayLISA presentation two weeks back, I know it would have been great. You can find his presentation materials here.

Bridget Kromhout did a detailed Kubernetes 101 session. From all accounts, it was awesome. You can find relevant materials here: SRECon Slides, container.training GitHub.

You can find James Meickle's presentation on 'Ansible for SRE' here.

Update (March 28th, 8am): Tammy Butow posted her materials from her Chaos Engineering Bootcamp workshop: GitHub, Speaker Deck.

Update (March 29th, 12pm): Dan Luedtke did his own version of the containers workshop in Go. See the post here.

Finally, I spent a little bit of time at the LinkedIn Engineering booth; thanks to everyone who stopped by and said hi to us!