SRE

On Bastion Hosts

I was at a meetup the other night and a student mentioned that they were learning about bastion hosts and wanted to learn more, so I thought I would do a deep dive on what they are and why to use them.

What

"Bastion hosts are instances that sit within your public subnet and are typically accessed using SSH or RDP. Once remote connectivity has been established with the bastion host, it then acts as a 'jump' server, allowing you to use SSH or RDP to log in to other instances." (https://cloudacademy.com/blog/aws-bastion-host-nat-instances-vpc-peering-security/)

Why

Bastion hosts act as a gateway or 'jump' host into a secure network. The servers in the secure network will ONLY accept SSH connections from the bastion hosts. This limits the points from which you can SSH into those servers to a small, trusted set of hosts, and it makes auditing SSH connections into the secure network significantly easier. Bastion hosts also typically have a more stringent security posture, including more regular patching and more detailed logging and auditing.

How

Bastion setups are quite simple; here are the basic steps to set one up:

1. Provision a new server (or servers) dedicated ONLY to bastion access.
2. Install any additional security measures (see the cyberciti reference below for specific recommendations).
3. Ensure that all servers in the secure network ONLY accept SSH connections from the bastion server(s). A sketch of one way to enforce this follows the references below.
4. Configure your SSH client to talk to hosts in your private network. Replace the IdentityFile and domain names to suit your network (a usage example also follows the references):

$ cat ~/.ssh/config
Host *.secure.example.com
    IdentityFile %d/.ssh/keyname.extension
    ProxyCommand ssh bastion.corp.example.com -W %h:%p

Host bastion.corp.example.com
    IdentityFile %d/.ssh/keyname.extension

Host *
    PubkeyAuthentication yes

References

https://www.cyberciti.biz/faq/linux-bastion-host/
https://cloudacademy.com/blog/aws-bastion-host-nat-instances-vpc-peering-security/
https://www.sans.org/reading-room/whitepapers/basics/hardening-bastion-hosts-420
https://blog.scottlowe.org/2017/05/26/bastion-hosts-custom-ssh-configs/
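As mentioned in step 3 above, every server in the secure network should only accept SSH from the bastion. Here is a minimal sketch using iptables; the address 10.0.1.5 is a made-up placeholder for the bastion's private IP, and in a cloud environment you might use security groups or another firewall instead:

# Run on each server in the secure network (not on the bastion itself).
# 10.0.1.5 is a hypothetical bastion private IP; substitute your own.
iptables -A INPUT -p tcp --dport 22 -s 10.0.1.5 -j ACCEPT   # allow SSH from the bastion
iptables -A INPUT -p tcp --dport 22 -j DROP                  # drop SSH from everywhere else

Note that these rules are not persistent across reboots on their own; persist them with whatever mechanism your distribution provides.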
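With the client config from step 4 in place, connecting to a host behind the bastion is transparent; the ProxyCommand tunnels the session through the bastion automatically. The hostname below is a made-up example that matches the *.secure.example.com pattern:

# ssh matches the Host block, connects to bastion.corp.example.com first,
# then forwards the session to the target host via -W.
$ ssh app01.secure.example.com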

Future of Reliability Engineering (Part 2)

In early May, I gave a presentation on the ‘Future of Reliability Engineering’. I wanted to break down the five new trends that I see emerging in a blog-post series:

Blog Post Series:
1. Evolution of the Network Engineer
2. Failure is the new Normal (move towards Chaos Engineering)
3. Automation as a Service
4. Cloud is King
5. Observe & Measure

Failure is the new Normal (move towards Chaos Engineering)

a) Breaking down silos

Let's be real: software will always have bugs, and infrastructure will eventually fail. This isn't solely an engineering or an operations problem; it is everyone's problem. Chaos Engineering as a practice is actually a good example of breaking down silos. Everyone reading this is probably well versed in the meme about "it's ops' problem now" (link). Chaos Engineering breaks that pattern by requiring developers to be involved in the process. As a result of your chaos-engineering testing, engineering teams should be fixing weak points in your applications and infrastructure.

b) Failure management

Chaos Engineering is a great way to test the various processes around "failure", from engineering training to monitoring and automation, all the way to incident response. A continuing chaos-engineering practice should be a continual loop of learning and improvement beyond just the code or infrastructure; it's also about improving processes.

c) Testing

Chaos Engineering is obviously a form of testing in itself, but more generally it should be used as a way to improve your overall testing posture. The feedback loop in a chaos-engineering practice should involve engineers writing more resilient code that makes it harder for systems to fail. Of course, these fixes also need to be continually tested.

d) Automation

As SREs, we aim to reduce manual toil as much as possible and have tools and systems do the work for us in a reliable, predictable manner. The same principle applies to Chaos Engineering, particularly because your aim is to break the system: since you are deliberately trying to 'break' it, you need to do so in a reliable, repeatable manner. If you are going to perform Chaos Engineering at all, you should at the very least have a source-controlled script that everyone can find and understand (a minimal sketch of such a script follows at the end of this post).

e) Measure everything

Within a Chaos Engineering practice, the aim is to alter the normal state of the service and to observe and measure everything that happens, in a controlled environment. Before you perform any chaos-engineering test, you should have a solid understanding of which parts of the system you need to observe; closely observe what changes during the chaos test, and write up your observations. Over time, you should be able to tell a story of higher availability and reliability of the service as the chaos-engineering tests and results improve the software. I would strongly recommend a template similar to this to record your findings.

In the next post, we'll look at ‘Automation as a Service’.
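As a concrete example of the source-controlled script mentioned in section d), here is a minimal, hypothetical sketch of a latency-injection experiment using tc/netem. The interface name, delay and duration are assumptions; adapt them, and add your own safety checks and measurement hooks, before running anything like this against a real service:

#!/usr/bin/env bash
# chaos-latency.sh - inject latency on one host, hold it, then clean up.
# Assumes eth0 is the service interface and that something is already
# observing the service while the experiment runs.
set -euo pipefail

IFACE="${1:-eth0}"      # interface to degrade
DELAY="${2:-200ms}"     # artificial latency to add
DURATION="${3:-120}"    # seconds to keep the fault active

echo "$(date -u +%FT%TZ) injecting ${DELAY} latency on ${IFACE} for ${DURATION}s"
tc qdisc add dev "${IFACE}" root netem delay "${DELAY}"

# Remove the fault when the script finishes (or is interrupted).
trap 'tc qdisc del dev "${IFACE}" root netem; echo "$(date -u +%FT%TZ) fault cleared"' EXIT

sleep "${DURATION}"

Because the script lives in source control, everyone can see exactly what the experiment does, and the cleanup trap means the fault is removed even if the run is aborted early.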

Future of Reliability Engineering (Part 1)

Last month at Interop ITX, I gave a presentation on the ‘Future of Reliability Engineering’. I wanted to break down the five new trends that I see emerging in a blog-post series, starting with the first trend:

Evolution of the Network Engineer (towards Network Reliability Engineers)

a) Breaking down silos

The network is no longer a silo. Applications run over the network in a distributed fashion, requiring low latency and large data pipes. Network engineers must understand these requirements, understand how applications are generally deployed (for troubleshooting purposes), and ensure that they have models to plan for capacity management.

b) Failure management

"It's not a matter of avoiding failure, it's preparing for what to do when things fail." Failures are going to happen; the problem is managing them. Theoretically, protocols like OSPF, BGP and HSRP give us redundancy, but how does that play out in practice? Have you tested it? Equally, a large number of network failures come from L1 grey failures. How do you plan to detect and mitigate those faults?

c) Testing

Testing has been an area where network engineering has lagged forever. In the past 10 or so years, tools like GNS3 have moved the needle. Similarly, as Linux Network Operating Systems become more common, it is now feasible to stage configuration and run regression tests against it. In the future, we will get to the point where you can run tests of new configuration before it is deployed to production.

d) Network programmability & automation

Over the past few years, network device management has finally evolved (slightly) away from SSH/Telnet/SNMP towards programmable models, so we can get away from copy-pasting configuration everywhere and polling devices to death (literally). This is not dissimilar from the evolution of server management, where tools like Puppet/Chef/CFEngine/Salt came along and made device management simple and orchestratable; in a number of ways (depending on the organization), this displaced the traditional system administration role. Coming back to the network: as it becomes programmable, the traditional network engineer role will need to evolve to know how to program and orchestrate these systems en masse.

e) Measure everything!

Gone are the days of only having SNMP available to measure network performance. The rise of streaming telemetry, and the use of network agents to validate network performance, is a must. Further, triaging network issues is generally extremely difficult, so you should have some form of monitoring on most layers of the network: L1, L2, L3, L4 and L7. There should be ways to measure the availability of your network end to end; a minimal probe sketch follows below.

The next post in this series will be on Chaos Engineering.
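To make section e) concrete, here is a minimal sketch of the kind of probe a network agent could run on a schedule. The target hostname, port and URL are hypothetical placeholders, and in practice you would ship these measurements to your monitoring system rather than just printing them:

#!/usr/bin/env bash
# probe.sh - tiny multi-layer availability probe.
# TARGET and URL are made-up placeholders; replace with real endpoints.
set -u

TARGET="${1:-app01.example.com}"
URL="${2:-https://app01.example.com/healthz}"

# L3: reachability and round-trip time.
ping -c 3 -q "${TARGET}" | tail -n 1

# L4: can we open a TCP connection to the service port?
if timeout 3 bash -c "cat < /dev/null > /dev/tcp/${TARGET}/443"; then
  echo "tcp/443 open"
else
  echo "tcp/443 closed or filtered"
fi

# L7: HTTP status code and total request time.
curl -s -o /dev/null -w "http_code=%{http_code} time_total=%{time_total}s\n" "${URL}"

Even a probe this small, run regularly from a few vantage points, gives you an availability signal at L3, L4 and L7 that SNMP polling alone cannot provide.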