My favorite LinkedIn Engineering Blog Posts of FY20

One of the hats I wear at LinkedIn is technical editor of the LinkedIn Engineering Blog. In FY2020, I edited 17 of our blog posts (and wrote one). I want to highlight my top 5 favorite posts of the fiscal year.

The Top Five

- A look at our biggest SRE[in]con yet
- Analyzing anomalies with ThirdEye
- Eliminating toil with fully automated load testing
- Open sourcing DataHub: LinkedIn’s metadata search and discovery platform
- How LinkedIn customizes Apache Kafka for 7 trillion messages per day

Other posts I’ve edited (in order of date posted):

- DataHub: A generalized metadata search & discovery tool
- LinkedIn NYC Tech Talk series: Engineering Excellence Meetup
- Solving manageability challenges at scale with Nuage
- Empowering our developers with the Nuage SDK
- Upgrading to RHEL7 with minimal interruptions
- Coding Conversations: Four teams, three tracks, two offices
- Advanced schema management for Spark applications at scale
- Open sourcing Kube2Hadoop: Secure access to HDFS from Kubernetes
- The impact of slow NFS on data systems
- Scaling LinkedIn’s Edge with Azure Front Door
- Monitoring business performance data with ThirdEye smart alerts
- Faster testing on Android with Mobile Test Orchestrator
- Building LinkedIn Talent Insights

While I can’t give too much away, in FY21 the LinkedIn Engineering Blog will feature some of our largest data systems, as well as more deep dives on how we built our features.

Enabling Yubikey for SSH 2FA

In the past I wrote about setting up bastion hosts and why they are important. I thought I’d take the time to explain how to use a Yubikey as a mechanism to perform 2FA when you SSH into bastion hosts. There are a few key components here:

1. Buying a Yubikey
2. Enabling PAM authentication in SSH
3. Installing the Yubikey PAM module
4. Getting a Yubico API Key
5. Setting up authorized Yubikeys
6. Configuring PAM to use Yubikey for 2FA
7. Testing it out

Buying a Yubikey

Yubico sells a number of different Yubikeys for specific purposes. At the time of writing, the Yubikey 5 is the flagship device and is perfect for general 2FA. Yubico usually ships within 2 days and their customer service has been great to me in the past.

Enabling PAM authentication in SSH

You will need to configure /etc/ssh/sshd_config to have the following parameters:

    PubkeyAuthentication yes
    PasswordAuthentication no
    UsePAM yes
    ChallengeResponseAuthentication yes
    AuthenticationMethods publickey,keyboard-interactive:pam

Installing the Yubikey PAM module

If you’re on EPEL/ RHEL/ CentOS, you can install it via yum:

    sudo yum install pam_yubico

Otherwise, if you’re on a Debian-based distribution:

    sudo add-apt-repository ppa:yubico/stable
    sudo apt-get update
    sudo apt-get install libpam-yubico

Getting a Yubico API Key

In order to use the PAM module, assuming you’re not running your own Yubico validation servers, you’ll need to register for an API key with Yubico. Simply provide an email address and press your Yubikey, and you’ll get an id and a key (which ends with an =).

Setting up authorized Yubikeys

The Yubico PAM module allows you to configure the authorized Yubikeys in one of two ways:

1. A file in the user’s home directory, ~/.yubico/authorized_yubikeys. This file is formatted in the following manner:

    <username>:<Token-ID-1>:<Token-ID-2>:<Token-ID-n>

   So it looks something like this:

    michael:ccccccrjzigk:ccccccirfkl

2. A file in a central location (e.g. /etc/yubikeys). This file follows the same format, with one username per line:

    michael:ccccccrjzigk:ccccccirfkl
    bob:ccccccirkdkx

The way to get the token ID is to press your Yubikey in a text editor and copy the first 12 characters of the string that is produced. I usually do this in Python just to be sure:

    michael@laptop~ % python
    >>> a = "eibeeliibcjkcljuiieijhcckeerejbikvhhcbchkgin"
    >>> b = a[0:12]
    >>> b
    'eibeeliibcjk'
    >>> len(b)
    12

Neither method has a security advantage over the other, but depending on if and how you’re using configuration management, one may be preferable. You can find more information at the Yubico site.

Configuring PAM to use Yubikey for 2FA

In /etc/pam.d/sshd, add the following string as the top line, replacing the id and key with the values from your Yubico API key.

If you’re using option 1:

    auth required id=<id> key=<key> debug

If you’re using option 2:

    auth required id=<id> key=<key> debug authfile=/path/to/mapping/file

You will need to restart sshd to pick up these changes.

Testing it out

Note: If you’re using pihole, make sure that the api* is not being blocked.

I recommend that you keep the terminal you’re using to configure your bastion host open, and then try to SSH to the bastion in a new tab/ window. That way, a mistake in the configuration cannot lock you out. When you SSH, you should be prompted for your Yubikey:

    michael@laptop ~ % ssh bastion
    YubiKey for `michael`:
    Last Login: Thu June 10 20:22:53 2020 from 192.168.x.x
    [michael@bastion ~]$

Credit to Yubico for the cover image
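The token-ID step in this post can be scripted. What follows is a minimal sketch, not part of the Yubico tooling: the function names `token_id` and `authfile_line` are made up for illustration, and the check assumes the Yubikey emits its OTP in the standard modhex alphabet with a 12-character public ID.

```python
from typing import Iterable

# Assumption: OTPs are typed in Yubico's modhex alphabet.
MODHEX = set("cbdefghijklnrtuv")
TOKEN_ID_LEN = 12  # the post takes the first 12 characters as the token ID


def token_id(otp: str) -> str:
    """Return the token ID (first 12 characters) of a Yubikey OTP."""
    otp = otp.strip()
    if len(otp) <= TOKEN_ID_LEN:
        raise ValueError("OTP is too short to contain a token ID")
    if not set(otp) <= MODHEX:
        raise ValueError("OTP contains characters outside the modhex alphabet")
    return otp[:TOKEN_ID_LEN]


def authfile_line(username: str, otps: Iterable[str]) -> str:
    """Build one '<username>:<Token-ID-1>:<Token-ID-n>' mapping line."""
    return ":".join([username] + [token_id(otp) for otp in otps])


if __name__ == "__main__":
    # The OTP below is the example string from the post.
    print(authfile_line("michael", ["eibeeliibcjkcljuiieijhcckeerejbikvhhcbchkgin"]))
```

The resulting line can be appended to either ~/.yubico/authorized_yubikeys or the central mapping file, since both use the same format.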

Hello World: Creating this blog

Hi, welcome to the new iteration of this blog. It’s been a while since I’ve touched this, and this time I really want to do better. Instead of a boring “Hello World” initial blog post, I wanted to write about the process I went through to create this site. This iteration is also powered by Hugo and hosted by Netlify.

0. Installing Homebrew

I am creating this website on an older Macbook Pro (Catalina). Before we begin, you need to ensure you have Homebrew installed. You can run brew to see if it’s installed. If it is not, you can install it by running the following command:

    /bin/bash -c "$(curl -fsSL"

1. Installing Hugo

Now we need to install Hugo. You can do this by running:

    brew install hugo

and then check the install by running:

    hugo version

I’m running v0.69.0. Note: Later in this series, I will be using features from v0.67.0, so it is important to have a recent version.

2. Creating a new git repo

The next step is to create a git repo. Netlify (the hosting provider I’m going to use) supports Github, Gitlab & BitBucket. I’m going to use Github with a private repository. I’m now going to create a new folder on my Macbook and clone the repository:

    mkdir Sites
    cd Sites
    git clone<username>/<repo>.git
    cd <repo>

e.g.

    $ mkdir Sites
    $ cd Sites
    build@machine Sites % git clone
    Cloning into ''...
    remote: Enumerating objects: 3, done.
    remote: Counting objects: 100% (3/3), done.
    remote: Total 3 (delta 0), reused 0 (delta 0), pack-reused 0
    Receiving objects: 100% (3/3), done.
    build@machine Sites % ls
    old_mk
    build@machine Sites % cd
    build@silicon %

3. Creating a new site

Now create the skeleton site. You do this by running:

    hugo new site . --force

For my site, I ran:

    build@silicon % hugo new site . --force

4. Picking a theme

Browse the list of Hugo themes to find one you like. There is nothing stopping you from creating your own, or modifying one (if the license permits). The feature set of a theme may vary (e.g. social media button support). For this site, I’m using hugo-refresh.

NOTE: The rest of this tutorial assumes that you are using the hugo-refresh theme.

5. Installing the theme

Now we want to install the theme into our repository. To do this you run:

    git submodule add <git repo of theme> themes/<theme-name>

e.g.

    git submodule add themes/hugo-refresh

I’m going to make modifications to the theme, so I’m going to fork the repo (in Github) and modify it shortly:

    git submodule add themes/hugo-refresh

6. Create some basic configuration

We’re going to do two things here to allow the server to start:

- Delete the config.toml file that was auto-generated
- Copy the example configuration:

    cp themes/hugo-refresh/exampleSite/config.yaml .

At this time, I also want to make some small configuration changes:

- Change the baseURL to
- Change the title to Michael Kehoe
- Comment out disableKinds: ["sitemap","RSS"]

    baseURL:
    title: Michael Kehoe
    # disableKinds: ["sitemap","RSS"]

7. Running a local server

Hugo provides a simple server to build the content and serve it (very quickly). You can do this by running this in a new tab:

    hugo server

It may be useful to render draft posts that you’re writing. In that case, you will want to run:

    hugo server -D

If you go to http://localhost:1313/ in a browser, you should see your website load. If it doesn’t, the tab that’s running hugo server -D should have some error information.

8. Tweaking the parameters

In config.yaml, we’re going to tweak some settings under the params section.

8.1 General Settings

- I want to change articlesPerRow: 3 to articlesPerRow: 4. This is mostly personal preference.
- I am also updating my favicon to be an avatar. I made it online, downloaded the SVG and placed it in themes/hugo-refresh/assets/images/avataaars.svg. We will use this same avatar on the homepage later in this section.
- I’m setting jsMinify: true and cssMinify: true. This may lead to slightly slower build times, but it gives people viewing the page a slightly better experience in terms of load times.
- I’m setting mainColour: "#284ed8". This is the same color as the eyes in my avatar.

8.2 Page not found

Under pagNotFound: (yes, there is a typo, but it’s in the theme), I’m modifying my avatar to show a frowny face. I’ve saved it in themes/hugo-refresh/assets/images/avataaars_404.svg and updated the image config to image: "/images/avataaars_404.svg". You can test this by going to http://localhost:1313/404.html

8.3 Homepage settings

In here, I make a few changes, firstly setting linkPosition: "menu+footer". This gives me a link to the homepage at both the top and bottom of the page. I set the title of the page to be Michael Kehoe and I’ve also updated my avatar.

    # homepage options
    homepage:
      # this options let you specify if you want a link to the homepage
      # it can be: "none", "menu", "footer" or "menu+footer"
      linkPosition: "menu+footer"
      # this options let you specify the text of the link to the homepage
      linkText: "Home"
      # option to specify the title in the homepage
      title: Michael Kehoe
      # option to specify the subtitle in the homepage
      subtitle: Leader of engineers who architect reliable, scalable infrastructure
      # option to specify image in the homepage
      image: "/images/avataaars.svg" # worker.svg
      # option to specify the width of the image in the homepage
      imageWidth: 500px

If you go back to your browser window, you should see the word Home in the top navbar, updated text in the center of the page and a new avatar.

8.4 Footer settings

Under footer:, we’re going to make some simple changes and then some more complicated ones.

Disable the email option, as we do not want our email address to be spammed:

    #email:
    #  link:
    #  title: My Email

Set my LinkedIn:

    linkedin:
      link: michaelkkehoe
      title: LinkedIn

Set my Github:

    github:
      link: michael-kehoe
      title: Github

Set my Twitter:

    twitter:
      link: michaelkkehoe
      title: Twitter

Set the copyright:

    copyright: Michael Kehoe - 2020

Adding Slideshare and Speakerdeck

This one is a little harder. After the instagram section, I’m adding the following configuration:

    slideshare:
      link: MichaelKehoe3
      title: slideshare

In themes/hugo-refresh/layouts/partials/footer.html, we’re going to add the following after the {{ if isset .Site.Params.footer.instagram "link" }} section:

    {{ if isset .Site.Params.footer.slideshare "link" }}
    <li>
      <a href="{{ }}" target="_blank">
        <span class="icon"><i class="fa fa-slideshare"></i></span>
        {{ if isset .Site.Params.footer.slideshare "title" }}
          {{ .Site.Params.footer.slideshare.title }}
        {{ else }}
          SlideShare
        {{ end }}
      </a>
    </li>
    {{ end }}

9. Organizing Content

OK, time to make some content!

9.1 Adding a credits page

Create a folder content/credits and, inside that folder, create the page file. When your server reloads, you’ll see a credits link in the footer of the page. Edit the file under content/credits/:

    ---
    title: "Credits"
    date: 2020-04-24T00:00:00+00:00
    draft: false
    hideLastModified: true
    ---
    The images used in the site comes from

Add your credits, and then click on the credits link in the footer.

9.2 Adding an About page

Now I want to add a page about myself that has my biography. Create a file under content/ and add the following content:

    ---
    title: "About"
    date: 2020-04-24T00:00:00+00:00
    draft: false
    hideLastModified: false
    keepImageRatio: true
    tags: []
    summary: "About Michael Kehoe"
    showInMenu: true
    ---
    <Biography content>

When the hugo server reloads, you’ll see a link called About in the navbar.

This concludes this blog post. We’ll talk more in the next blog post about styling and content organization.


security.txt is a draft IETF standard for websites (webmasters) to communicate their security vulnerability/ research policy. The purpose of the standard is to give organizations a standardized way to tell security researchers how to report vulnerabilities. The abstract of the RFC states:

    When security vulnerabilities are discovered by independent security researchers, they often lack the channels to report them properly. As a result, security vulnerabilities may be left unreported. This document defines a format (“security.txt”) to help organizations describe the process for security researchers to follow in order to report security vulnerabilities.

Specification

The security.txt file contains seven fields:

- Acknowledgements: This directive allows you to link to a page where security researchers are recognized for their reports. The link MUST begin with "https://".
- Canonical: This directive indicates the canonical URI where the security.txt file is located. The link MUST begin with "https://".
- Contact: This directive allows you to provide an address that researchers SHOULD use for reporting security vulnerabilities. The value MAY be an email address, a phone number and/or a web page with contact information. This directive MUST be present in a security.txt file. If the value is a web link, it MUST begin with "https://".
- Encryption: This directive allows you to point to an encryption key that security researchers SHOULD use for encrypted communication. The link MUST begin with "https://".
- Hiring: The “Hiring” directive is used for linking to the vendor’s security-related job positions.
- Policy: This directive allows you to link to where your security policy and/ or disclosure policy is located. The link MUST begin with "https://".
- Preferred-Languages: This directive can be used to indicate a set of natural languages that are preferred when submitting security reports. This set MAY list multiple values, separated by commas.

Web-based services should place the security.txt file under the /.well-known/ path.

You can find the current RFC draft here
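To make the field rules concrete, here is a minimal sketch that renders a security.txt body from a dictionary and enforces only the rules described in this post (Contact is mandatory; the link-only fields must start with "https://"). The function name `render_security_txt` and all example.com values are hypothetical, and the sketch does not attempt to cover everything in the draft.

```python
from typing import Dict, List

# The seven fields listed in the post, in the order they will be written.
FIELD_ORDER = [
    "Acknowledgements", "Canonical", "Contact", "Encryption",
    "Hiring", "Policy", "Preferred-Languages",
]
# Fields the post describes as links that MUST begin with "https://".
HTTPS_ONLY = {"Acknowledgements", "Canonical", "Encryption", "Policy"}


def render_security_txt(fields: Dict[str, List[str]]) -> str:
    """Render 'Field: value' lines, one per value, in FIELD_ORDER."""
    if not fields.get("Contact"):
        raise ValueError("the Contact field MUST be present")
    unknown = set(fields) - set(FIELD_ORDER)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    lines = []
    for name in FIELD_ORDER:
        for value in fields.get(name, []):
            if name in HTTPS_ONLY and not value.startswith("https://"):
                raise ValueError(f"{name} MUST begin with https://")
            lines.append(f"{name}: {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    print(render_security_txt({
        "Contact": ["mailto:security@example.com"],
        "Policy": ["https://example.com/security-policy"],
        "Preferred-Languages": ["en, es"],
    }), end="")
```

The output would then be served from the /.well-known/ path mentioned above.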

2019 Devops Conferences

I decided to make a list of 2019 conferences again. Feel free to ping me on Twitter (@matrixtek) and I can add anything I missed.

January
- DevOpsDays New York City (24th-25th)

February
- DevOpsDays Charlotte (7th-8th)
- DevOpsDays Geneva (21st-22nd)

March
- QCon London (4th-8th)
- ScaleConf Cape Town (6th-8th)
- DevOpsDays Los Angeles (8th)
- IETF 104 (23rd-29th)
- Usenix SRECon Americas (25th-27th)
- DevOpsDays Vancouver (29th-30th)

April
- QCon Beijing (25th-27th)
- DevOpsDays Tokyo (9th-10th)
- DevOpsDays Sao Paulo (10th-11th)
- DevOpsDays Seattle (23rd-24th)

May
- DevOpsDays Austin (2nd-3rd)
- QCon Sao Paulo (6th-8th)
- DevOpsDays Zurich (14th-15th)
- DevOpsDays Salt Lake City (14th-15th)
- DevOpsDays Kyiv (17th-18th)
- Interop ITX (20th-23rd)
- KubeCon/ CloudNativeCon Barcelona (20th-23rd)

June
- Monitorama Portland (3rd-5th)
- O’Reilly Velocity San Jose (10th-13th)
- Usenix SRECon Asia/ Australia (12th-14th)
- KubeCon/ CloudNativeCon China (24th-26th)
- QCon New York (24th-28th)
- DevOpsDays Amsterdam (26th-28th)

July
- O’Reilly OSCon (15th-18th)
- IETF 105 (20th-26th)

August
- DevOpsDays Minneapolis (6th-7th)

September
- DevOpsDays Cairo (9th)

October
- Usenix SRECon EMEA (2nd-4th)
- QCon Shanghai (17th-19th)
- Usenix LISA (28th-30th)

November
- O’Reilly Velocity Berlin (4th-7th)
- QCon San Francisco (11th-15th)
- IETF 106 (16th-22nd)
- KubeCon/ CloudNativeCon San Diego (18th-21st)

On Bastion Hosts

I was at a meetup the other night and a student mentioned that they were learning about bastion hosts and wanted to learn more. So I thought I would do a deep dive on what they are and why to use them.

What

Bastion hosts are instances that sit within your public subnet and are typically accessed using SSH or RDP. Once remote connectivity has been established with the bastion host, it then acts as a ‘jump’ server, allowing you to use SSH or RDP to log in to other instances.

Why

Bastion hosts act as a gateway or ‘jump’ host into a secure network. The servers in the secure network will ONLY accept SSH connections from bastion hosts. This limits the points from which you can SSH into servers to a trusted set of hosts, and it also makes auditing SSH connections in the secure network significantly easier. Bastion hosts typically have a more stringent security posture, including more regular patching and more detailed logging and auditing.

How

Bastion setups are rather simple. Here are a few steps to set one up:

1. Provision one or more new servers that are dedicated ONLY to bastion access
2. Install any additional security measures (see the cyberciti reference below for specific recommendations)
3. Ensure that all servers in the secure network ONLY accept SSH connections from the bastion server(s)
4. Configure your SSH client to talk to hosts in your private network. Replace the IdentityFile and domain names to suit your network:

    $ cat ~/.ssh/config
    Host *
      IdentityFile %d/.ssh/keyname.extension
      ProxyCommand ssh -W %h:%p
    Host
      IdentityFile %d/.ssh/keyname.extension
    Host *
      PubkeyAuthentication yes

References
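If you manage more than one environment, the client configuration from the last step can be generated rather than hand-edited. The following is a sketch under stated assumptions: the helper name `bastion_ssh_config` and the hostnames `*.internal.example.com` and `bastion.example.com` are hypothetical placeholders, not values from this post. It relies only on the standard `ProxyCommand ssh -W %h:%p <bastion>` pattern.

```python
def bastion_ssh_config(private_pattern: str, bastion_host: str, identity_file: str) -> str:
    """Return an ~/.ssh/config snippet that reaches private hosts via a bastion."""
    lines = [
        # Hosts in the private network are reached by tunnelling through the bastion.
        f"Host {private_pattern}",
        f"    IdentityFile {identity_file}",
        f"    ProxyCommand ssh -W %h:%p {bastion_host}",
        "",
        # The bastion itself is reached directly.
        f"Host {bastion_host}",
        f"    IdentityFile {identity_file}",
        "",
        "Host *",
        "    PubkeyAuthentication yes",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical names for illustration only.
    print(bastion_ssh_config(
        private_pattern="*.internal.example.com",
        bastion_host="bastion.example.com",
        identity_file="%d/.ssh/keyname.extension",
    ), end="")
```

Keep the bastion's own hostname outside the private pattern; otherwise the ProxyCommand would also apply to the bastion and SSH would try to proxy to it through itself.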

Publication Updates (Jul 22 2018)

In the past month, I have had the pleasure of recording a few podcasts and having some other work published. You can find it all here:

- The Importance of Soft Skills in Engineering
- PyBay: Meet Michael Kehoe: Building Production Ready Python Applications (PyBay Medium)
- Fullstack Journey (PacketPushers): Michael Kehoe (PacketPushers)
- NetworkCollective: Michael Kehoe (Network Collective)

Future of Reliability Engineering (Part 2)

In early May, I gave a presentation on the ‘Future of Reliability Engineering’. I wanted to break down the five new trends that I see emerging in a blog-post series:

Blog Post Series:
- Evolution of the Network Engineer
- Failure is the new Normal (move towards Chaos-Engineering)
- Automation as a Service
- Cloud is King
- Observe & Measure

Failure is the new Normal (move towards Chaos-Engineering)

a) Breaking down Silos

Let’s be real: software will always have bugs, and infrastructure will eventually fail. This isn’t solely an engineering or operations problem; it is everyone’s problem. Chaos-Engineering as a practice is actually a good example of breaking down silos. Everyone reading this is probably well versed in the meme about “it’s ops problems now”. Chaos Engineering forces that pattern to be broken by requiring developers to be involved in the process. As a result of your chaos-engineering testing, engineering teams should be fixing weak points in your applications/ infrastructure.

b) Failure Management

Chaos-Engineering is a great way to test the various processes around “failure”. This ranges from engineering training to monitoring and automation, all the way to incident response. An ongoing chaos-engineering practice should be a continual loop of learning and improvement that goes beyond the code or infrastructure; it’s also about improving processes.

c) Testing

Chaos-Engineering is obviously a form of testing in itself, but more generally speaking, it should be used as a way to improve your general testing posture. The feedback loop in a chaos-engineering practice should involve engineers writing more resilient code that makes it harder for systems to fail. Of course, these fixes also need to be continually tested.

d) Automation

As SREs, we aim to reduce manual toil as much as possible and have tools/ systems do the work for us in a reliable, predictable manner. The same principle applies to chaos-engineering: since you are trying to ‘break’ the system, you need to do this in a reliable, repeatable manner. If you are going to perform chaos-engineering at all, you should at the very least have a source-controlled script that everyone can find and understand.

e) Measure Everything

Within a Chaos Engineering practice, the aim is to alter the normal state of a service, then observe & measure everything that happens in a controlled environment. Before you perform any chaos-engineering test, you should have a solid understanding of which parts of the system you need to observe. Then closely observe what changes during the chaos test and write up your observations. Over time, you should be able to tell a story of higher availability/ reliability of the service as the chaos-engineering tests and results improve the software. I would strongly recommend a template similar to this to record your findings.

In the next post, we’ll look at ‘Automation as a Service’.

LLDP on Linux

Link Layer Discovery Protocol (LLDP) is an independent IEEE protocol (IEEE 802.1AB) that helps with gathering/ advertising a device’s identity, capabilities, and neighbors. LLDP is a Layer 2 protocol. It is usually used on network devices (switches/ routers) to find ‘neighbor’ (connected) devices, but it is equally useful on a server to find details of the switch it’s connected to. This is not enabled by default on Linux, but here’s a quick guide to get it working.

Install the package:

    $ sudo yum install lldpd

Start the daemon:

    $ sudo service lldpd start

Find your neighbor device:

    $ sudo lldpcli show neighbors
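If you want to feed neighbor data into scripts or inventory tooling, a small parser helps. This is a sketch under assumptions: it expects the key-value output that `lldpcli show neighbors -f keyvalue` produces, with lines shaped like `lldp.<interface>.<field>=<value>`, and the sample text below is made up for illustration rather than captured from a real switch. The function name `parse_keyvalue` is also hypothetical.

```python
from collections import defaultdict
from typing import Dict

# Made-up sample in the assumed 'lldp.<interface>.<field>=<value>' shape.
SAMPLE_OUTPUT = """\
lldp.eth0.via=LLDP
lldp.eth0.chassis.name=switch01.example.com
lldp.eth0.port.ifname=Ethernet12
lldp.eth1.chassis.name=switch02.example.com
lldp.eth1.port.ifname=Ethernet7
"""


def parse_keyvalue(text: str) -> Dict[str, Dict[str, str]]:
    """Group key-value lines by local interface name."""
    neighbors: Dict[str, Dict[str, str]] = defaultdict(dict)
    for line in text.splitlines():
        line = line.strip()
        if "=" not in line:
            continue  # skip blank or unexpected lines
        key, value = line.split("=", 1)
        # Split into at most 3 parts: 'lldp', interface, remaining field path.
        # Note: interface names that contain dots (e.g. VLAN sub-interfaces)
        # would be split incorrectly by this simple approach.
        parts = key.split(".", 2)
        if len(parts) != 3 or parts[0] != "lldp":
            continue
        _, interface, field = parts
        neighbors[interface][field] = value
    return dict(neighbors)


if __name__ == "__main__":
    for interface, fields in parse_keyvalue(SAMPLE_OUTPUT).items():
        print(interface, "->", fields.get("chassis.name"), fields.get("port.ifname"))
```

In practice you would pass the command's real output to `parse_keyvalue` instead of the sample string.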

A Chaos Engineering Gameday Report Template

Following on from my Postmortem Template Document, I thought it would be prudent to write a similar template for Chaos Engineering Gamedays. You can find the empty template here. I’ll create a filled-out example template in the coming weeks. Please feel free to tweet at me with any feedback.

My SRE path on Full Stack Journey Podcast

I recently caught up with Scott Lowe of the Full Stack Journey Podcast to talk about my SRE career. You can find the podcast release here

Future of Reliability Engineering (Part 1)

Last month at Interop ITX, I gave a presentation on the ‘Future of Reliability Engineering’. I wanted to break down the five new trends that I see emerging in a blog-post series:

Evolution of the Network Engineer (towards Network Reliability Engineers)

a) Breaking down Silos

The network is no longer a silo. Applications run over the network in a distributed fashion, requiring low latency and large data pipes. Network engineers must understand these requirements, understand how applications are generally deployed (for troubleshooting purposes) and ensure that they have models to plan for capacity management.

b) Failure management

“It’s not a matter of avoiding failure, it’s preparing for what to do when things fail.” Failures are going to happen; the problem is managing them. Theoretically, protocols like OSPF, BGP and HSRP give us redundancy, but how does that play out in practice? Have you tested it? Equally, a large number of failures in the network come from L1 grey failures. How do you plan to detect and mitigate these faults?

c) Testing

Testing is an area where network engineering has always been lacking. In the past 10 or so years, tools like GNS3 have moved the needle. Similarly, as Linux Network Operating Systems become more common, so does the ability to stage configuration and run regression tests. In the future, we will get to the point where you can run tests of new configuration before it is deployed to production.

d) Network programmability & Automation

Over the past few years, network device management has finally evolved (slightly) away from SSH/ Telnet/ SNMP into programmable models where we can get away from copy-pasting configuration everywhere and polling devices to death (literally). This is not dissimilar to the evolution of server management, where tools like Puppet/ Chef/ CFEngine/ Salt came along and made device management extremely simple and orchestratable. In a number of ways (depending on the organization), this displaced the traditional system administration role. Coming back to the network: as it becomes programmable, the traditional network engineer role will need to evolve to know how to program and orchestrate these systems en masse.

e) Measure everything!

Long gone are the days of only having SNMP available to measure network performance. Streaming telemetry and the use of network agents to validate network performance are a must. Further to that, triaging network issues is generally extremely difficult, so you should have some form of monitoring on most layers of the network: L1, L2, L3, L4 and L7. There should also be ways to measure the availability of your network.

The next post in this series will be on Chaos Engineering.

A Postmortem Template

I’ve been thinking about this for a while and really wanted to publish my own Postmortem Template. You can find the empty template here. I’ll create a filled-out example template in the coming weeks. Please feel free to tweet at me with any feedback.

Publication Updates (June 05 2018)

Hi all, I’ve recently updated my publications page with my latest presentations from:

- Interop ITX 2018: The future of Reliability Engineering
- Velocity New York 2018: How to Monitor Containers Correctly
- SF Reliability Engineering - May Talks
- Devops Exchange SF April 2018: How to Build Production-Ready Microservices
- Information Week: 3 Myths about the Site Reliability Engineer, Debunked

You can also find me later in the year at:

- PyBay 2018: Building Production-Ready Python Microservices
- Velocity New York 2018: How to Monitor Containers Correctly

35 Questions to ask in your job interview

I was digging through some old material I created for interviewing and thought I would share. So here they are: the best questions to ask your interviewers.

1. What are the day-to-day responsibilities of this job?
2. What does your average day look like?
3. What are the most important qualities for someone to be successful in this role?
4. What are your expectations for the first month, 2 months, 6 months?
5. Where do you think the company is headed in the next 5 years?
6. Who is your top competitor and why?
7. What are the biggest opportunities facing the company?
8. What do you like best about working for this company?
9. What is the typical career path for someone in this role?
10. How do you measure the success of the person in this position?
11. What are some of the challenges you expect the person in the role to face?
12. What is the key to succeeding in the role?
13. What is the onboarding process for new hires?
14. Do you expect the responsibilities for this role to change in the near future?
15. What are the most immediate projects that need to be addressed?
16. What training programs are available to your employees?
17. What is the performance review process here?
18. How long have you been with the company?
19. How has your role changed since you’ve been here?
20. How will the person hired into this role help the department?
21. What is the most fun part about the job?
22. What are the company’s senior leaders like? What do they care about and talk about the most?
23. What type of employee tends to succeed here?
24. Where do you see the company in three years, and how would the person in this role contribute to this vision?
25. What excites you about coming to work?
26. Can you explain the culture to me, with examples of how the company upholds it?
27. What are the company’s core business goals?
28. I like to collaborate with team members and brainstorm ideas to help reach communal goals. Can you give me examples of collaboration within the company?
29. Why did you choose to work here?
30. What is the preferred style of working to solve problems in the company?
31. As an employee, how could I exceed your expectations?
32. What excites you most about your job, and what do you like most about this company?
33. If I got the position, would I be working on any projects in addition to my day-to-day duties?
34. What are the core values of the company?
35. What have you learned working here?

SRECon Americas 2018 Day 2 Notes

Day 2 of SRECon started with an introduction from Kurt and Betsy (the Program Chairs) and then three set of plenary’s talks. The following is a set of notes I put together from the talks I went to today. If you Don’t Know WHere You’re Going, It Doesn’t Matter How Fast You Get There Nicole Forsgren & Jez Humble Slides found here. The talk generally was about strategic planning and measuring your performance/ success. This part of IT is actually very under-valued. These Tweets sums up the presentation well: - Where am I going? - Why do we care? - Improving performance/quality - Measuring performance - Culture & how to measure it#SREcon — Murali Suriar (@msuriar) March 28, 2018 Key point - it's the direction, not the destination that matters. #SREcon — Murali Suriar (@msuriar) March 28, 2018 Some nice points about the mis-use of Velocity and Utilization as Key Performance Indicators (KPI). Security and SRE: Natural Force Multipliers Cory Scott Slides here Heirarchy of Needs in SRE: Monitoring & Inciddent Response Postmortem & Analysis Testing & Release Procedures Capacity Planning Product Problem Statement: High Rate of Change Trust but verify Embrace the Error Budget Inject Engineering Discipline Testing in Production is the new normal Dark Canaries Security Challenges are similar to SRE Latency & perf impact Cascading failure scenarios Service Discovery Security Challenges Authentation Authorization Access Control Logic Data center technologies can be all controlled with a single web page application. 
Start with a known-good state Asset management Ensure visibility Validate consistently and constantly Takeaways or Giveaways Your data pipeline is your security lifeblood Human-in-the-loop is your last resort, not your first option All security solutions must be scalable Remove single points of security failure like you do for availability Assume that an attacker can be anywhere in your system or flow Capture and measure meaningful security telemetry What it Really Means to Be an Effective Engineer Edmond Lau See Effort <> Impact Impact = Hours spent x (impact produced/hours spent) Leverage = impact produced/ hours spent What are the high-leverage activities for engineers? Do the simple thing first Effective engineers invest in iteration speed Effictive engineers validate their ideas early and often What are the best practices for building good infrastructure for relationships Effective engineers explicitly design their alliances Effective engineers explicitly share their assumptions Work hard and get things done and focus on high-leverage activities and build infrastructure for their relationships The Day the DNS died Jeremy Blosser Slides here Impact Sending mail Application traffic Metrics Diagnosing blind (without metrics) is difficult! 
resolv.conf is fiddly Can only use first 3 entries Diagnosis Assymetric DNS packet flow (94% packet loss) The Cause [Undocumented] Connection tracking Response Incident response was functional Ability to respond was compromised New DNS design required New Design Dedicated VPC for isolation Open security groups with ACLs Seperate clusters for app/db vs MTA Use DNSMasq for local caching Lessons learnt Not all cloud limits are apparent Instrument your support services and protect It’a always a DNS problem…excpet when it’s a firewall problem resolf.conf is not agile Stable and Accurate Health-Checking of Horizontally-Scaled Services Lorenzo Salno See new Fastly paper on load-balancing: Balancing of the edge: transport affinity without network state - NSDI paper PoP Deployments Space and power are at a premium Architecture: Building a smarter load balancer Methods: machine learning - classifier signal processing - filter control theoery - controller Design Multiple stages system Denoising - remove noise from input signal Anomaly detection - identify misbehaving instance Hysteresis filter - stabilize output Implementation Host signals go into a filter which makes a decision about global state of host Don’t Ever Change! Are Imutable Deployments Really Simplier Faster, and Safer? 
Rob Hirschfeld Immutable Patterns Baseline + Config Live Boot + Config Image Deploy Image creation Do the configuration and capture the image into a portable format This sounds like a lot of work and really slow Yes, but it’s faster, safer and more scalable Lessons Learned from Our Main Database Migrations at Facebook Yoshinori Matsunobu User Database Stores Social Graph Massively Sharded Low latency Automated Operations Pure Flash Storage What is MyRocks MySQL on top of RocksDB Open Source, distributed from MariaDB and Percona MyRocks Features Clustered Index Bloom filter and Column family Transactions, including consistency between binlog and RocksDB Faster data loading, deletes and replication Dynamic Options TTL Online logical and binary backup MyRocks pros vs InnoDB Much smaller space (half compared to compressed InnoDB) Writes are faster Much smaller bytes written MyRocks cons vs InnoDB Lacks several features No FK, Fulltext index, spatial index Must use Row based binary logging format Reads are slower than InnoDB Too many tuning options MyRocks migration - technical challenges Migration Creating MyRocks instances without downtime Creating second MyRocks instance without downtime Shadow traffic tests Promoting new master InnoDB vs MyRocks, from SRE PoV Server is busier because of double density RocksDB is a much newer database that changes rapidly MyRocks/ RocksDB relies on Buffered IO For large transactions, COMMIT needs more work than InnoDB There are too many tuning options Faster writes means replication slaves lag less often Issue: Mitigating stalls We upgraded kernel to 4.6 Changed data loading queries (schema changes) to use MyRocks bulk loading feature Commit stalls every few mins -> now nearly zero Issue: A few deleted rows re-appeared Some of our secondary indexes had extra rows Turned out to be a bug in RocksDB compactions that in rare cases on heavy deletions, tombstones might not have been handled correctly Issue: Scanning outbound delete-markers 
Counting from one of the empty tables started taking a few minutes Lessons Learned Learn how core components work RocksDB depends on Linux more than InnoDB Understanding how Linux works helps to fix issues faster Do not ignore outliers Many of our issues happened on only a handful of instances Leveraging Multiple Regions to Improve Site Reliability: Lessons learnt from Andrew Duch Lesson 1: Sorry I missed this Lesson 2: Not everything has to be active-active Lesson 3: Three is cheaper than two - you waste 50% of your capacity in an active-active model Lesson 4: Practice, Practice, Practice Lesson 5: Failover Automation needs to scale Unfortunately I had to skip the final set of sessions this afternoon due to a conflict. From all accounts, the sessions this afternoon were great. See everyone tomorrow for day 3!
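A quick back-of-the-envelope check on Lesson 3 ("three is cheaper than two"): this is a minimal sketch of my own, not something shown in the talk. It assumes traffic is spread evenly across regions and that the surviving regions must absorb the full peak load when any single region fails. The function name and the normalised peak load of 1.0 are made up for illustration.

```python
def required_capacity(regions: int, peak_load: float = 1.0) -> dict:
    """Capacity needed so the surviving regions can absorb one region's failure.

    Assumes load is evenly balanced and only one region fails at a time.
    """
    if regions < 2:
        raise ValueError("need at least two regions to survive a regional failure")
    # After losing one region, the remaining (regions - 1) must carry everything.
    per_region = peak_load / (regions - 1)
    total = per_region * regions
    # Fraction of provisioned capacity that sits idle while all regions are healthy.
    idle_fraction = (total - peak_load) / total
    return {"per_region": per_region, "total": total, "idle_fraction": idle_fraction}


if __name__ == "__main__":
    for n in (2, 3, 4):
        plan = required_capacity(n)
        print(f"{n} regions: provision {plan['total']:.2f}x peak, "
              f"{plan['idle_fraction']:.0%} idle when healthy")
```

Under these assumptions the idle share works out to 1/N: half of the fleet with two regions, a third with three. That matches the 50% figure quoted above.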

SRECon US 2018 Day 3: What I'm seeing

The talks I’m watching today are: Containerization War Stories Resolving Outages Faster with Better Debugging Strategies Monitoring DNS with Open-Source Solutions “Capacity Prediction” instead of “Capacity Planning”: How Uber Uses ML to Accurately Forecast Resource Utilization Distributed Tracing, Lessons Learned Whispers in Chaos: Searching for Weak Signals in Incidents Architecting a Technical Post Mortem Your System has recovered from an Incident, but have your Developers? The Day 3 Plenary sessions are: The History of Fire Escapes Leaping from mainframes to AWS: Technology Time Travel in the Government Operational Excellence in April Fools’ Pranks Come and say Hi if you see me!

SRECon Americas 2018 Day 1 Review

Hi all, This year marks my 3rd year at SRECon Americas. This year brings a 3-day format with the first day being exclusively dedicated to workshops. Hooray! The workshops included: Containers from Scratch SRE Classroom, or How to Build a Distributed System in 3 Hours Profiling JVM Applications in Production Incident Command for IT - What We’ve Learned from the Fire Department Kubernetes 101 Chaos Engineering Bootcamp Ansible for SRE Teams Tech Writing 101 for SREs For the first session, I attended the Containers from Scratch session. As someone who understands the practical implementation of containers, I really appreciated seeing all the details behind it. You can find the following resources from the presentation: Tutorial material Linux Primitives I unfortunately didn’t get a chance to see any of Brent Chapman’s session today on Incident Management, but after going to his BayLISA presentation two weeks back, I know it would have been great. You can find his presentation materials here Bridget Kromhout did a detailed Kubernetes 101 session. From all accounts, it was awesome. You can find relevant materials here: SRECon Slides GitHub You can find James Meickle’s presentation on ‘Ansible for SRE’ here Update (March 28th, 8am): Tammy Butow posted her materials from her Chaos Engineering Bootcamp workshop: GitHub Speaker Deck Update (March 29th, 12pm): Dan Luedtke did his own version of the Containers Workshop in Go. See the post here Finally, I spent a little bit of time in the LinkedIn Engineering booth. Thanks to everyone who stopped by to say Hi!

San Francisco Chaos Engineering Meetup Slides

Tonight I have the privilege of speaking alongside Russ Miles and Kolton Andrus at the San Francisco Chaos Engineering Meetup. You can find my slides from the event here

Michael's Morning Wrap - 11 March 2018

News Google talk about their Canary Analysis Service (CAS) here Baron Schwartz on what is observability here Microservices 101 here GitHub publish their incident report from their recent 1.35Tbps DDoS attack here Events March 15th - BayLisa - Learning from the Fire Department: Experiences with Incident Command for IT March 21st - Performance Engineering Meetup @ LinkedIn March 22nd - Docker Birthday #5 March 29th - Papers We Love (San Francisco) April 4th - San Francisco meetups Meetup April 30 - May 4th Interop ITX Paper of the Week On Designing and Deploying Internet-Scale Services here RFC of the Week RFC-7540: Hypertext Transfer Protocol Version 2 (HTTP/2)

Michael's Morning Wrap - 5 March 2018

Hi everyone, welcome to this week’s wrap. I am now including a list of Meetups/ Events that may be of interest to my audience. These are all Bay Area based. News The Evolution of Distributed Systems Management here A brilliant guide from Red Hat on container lingo here GitHub survives the largest DDoS ever here How to manage feature flag tech debt here Events March 6th - Big Data Meetup @ LinkedIn - Tuning Spark & Hadoop Jobs with Dr Elephant March 7th - San Francisco Metrics Meetup March 15th - BayLisa - Learning from the Fire Department: Experiences with Incident Command for IT March 21st - Performance Engineering Meetup @ LinkedIn March 22nd - Docker Birthday #5 March 29th - Papers We Love (San Francisco) April 4th - San Francisco meetups Meetup Paper of the Week B4: Experience with a Globally-Deployed Software Defined WAN here RFC of the Week RFC-8203: BGP Administrative Shutdown communication

Michael's Morning Wrap - 26 February 2018

Welcome to this week’s wrap. ThousandEyes wrote a nice post on their new ‘Network Intelligence’ product, showing it in action during the Dow Jones drop here A great look at how software engineering is more than just writing code here A good introduction to Prometheus here A great piece on why engineering managers should do oncall here Mike Julian from Monitoring Weekly has started a new project called ‘Your Next Hire’, a site based around hiring engineers here Have a great week!

Michael's Morning Wrap - 19 February 2018

Welcome to this week’s wrap. Facebook wrote an excellent post about planning, scaling and load-testing their live-video services for New Years (here) A good article that cuts through the hyperbole of observability and gives a solid analysis of the space (here) Real-time streaming ETL with Oracle Transactional data (here) Building your own CDN for fun and profit (here) Nginx now supports HTTP/2 server push (here) Great writeup of great Devopsy resources (here)

Michael's Morning Wrap - 5 June 2017

Welcome to this week’s wrap. Unfortunately a bit late due to a busy week. Fonseca: “An empirical study on the correctness of formally verified distributed systems” Erin Atwater: Netsim is a simulator game intended to teach you the basics of how computer networks function, with an emphasis on security. You will learn how to perform attacks that real hackers use, and see how they work in our simulator! Brandon Rhodes: Tutorial on Sphinx from Pycon Adrian Colyer: The Morning Paper on Operability Ethan Banks: Slides from Interop ITX: The Future of Networking Sachin Malhotra: How we fine-tuned HAProxy to achieve 2M concurrent SSL connections Argo from Cloudflare

Michael's Tuesday Morning Wrap - 30 May 2017

Some of my colleagues have mentioned to me that I share some really good articles on LinkedIn, so I thought I would try doing a weekly post with a wrap of the best things I read. I’m going to start on a Tuesday due to the Memorial Day public holiday. HighScalability: “The Always On Architecture - Moving Beyond Legacy Disaster Recovery” Bilgin Ibryam: “It Takes More Than a Circuit Breaker to Create a Resilient Application” Ben Treynor, Mike Dahlin, Vivek Rau, Betsy Beyer: “The Calculus of Service Availability” Manas Gupta: “Monitorama 2017: My Impressions” Yuval Bachar: “Taking Open19 from Concept to Industry Standard” Nick Babich: “4 Ways to Use Functional Animation in UI Design” Geoff Huston: “BBR TCP” Lisa N Roach: “Exploring Network Programmability with Python & Yang” Bruno Connelly & Bhaskaran Devaraj: “Building the SRE Culture at LinkedIn” See you next week!

Publication Updates (May 27 2017)

Hi all, I just updated my publications page with links to my SRECon17 Americas talks and my new LinkedIn engineering blog post. It was announced this week that I will also have the privilege of speaking at SRECon17 EMEA in Dublin later this year. You can find me talking about: Networks for SREs: What do I need to know for troubleshooting applications Reducing MTTR and false escalations: Event Correlation at LinkedIn

Monitorama 2017 Summary

The past few days, I’ve been in Portland for the 2017 Monitorama conference. The conference had to literally fail-over between venues Monday night due to a large power-outage across the city. Monitorama brought together a diverse crowd of engineers and vendors to spend 3 days discussing on call, logging, metrics, tracing and the philosophy of it all. You can find the schedule here And the videos for each day: Day 1 Day 2 Day 3 Content Summary For some reason, there was a large amount of content dedicated to distributed tracing. It was actually a theme that dominated the conference. The amount of open-source content that was inspired by the original Google Dapper (2010) paper seems to be becoming mainstream. There was another dominant theme of fixing oncall. This was partially set by Alice Goldfuss’s talk on Day 1 and continued throughout the conference. To be honest, I had no idea how bad some people’s on-call shifts are. I’ve certainly done very well during my time at LinkedIn. It does seem that we need to get smarter about what we alert on. There were also a number of talks that boiled down to: “This is how my company monitors”. It was definitely interesting to see the use of open-source stacks at larger companies and a general tendency to dislike paying large sums of money to vendors. Given my position (and privilege), I’ve been able to learn most of the content during my time at LinkedIn. There were however some talks that I walked away from thinking about how I/ LinkedIn can do a better job. Below are some of my favorite talks (in order of presentation). Day 1: The Tidyverse and the Future of the Monitoring Toolchain - John Rauser John gave a great overview of the Tidyverse toolset and the power of the R language. The visualizations he used in his presentation definitely inspired my team on how we can present some of our incident data in a more meaningful way. 
Day 1: Martyrs on Film: Learning to hate the #oncallselfie - Alice Goldfuss Alice gave a very real presentation on the state of on call. Cleverly using #oncallselfies on Twitter, she created a narrative on how disruptive oncall can be to our lives and how we shouldn’t tolerate it (for the most part). For anyone who is in a team that gets paged more than 10 times a week, I’d recommend watching. Day 1: Linux debugging tools you’ll love - Julia Evans Julia ran through a number of great Linux debugging techniques and tools that can be used to find problems in your applications. Definitely a lot of tricks for everyone to pick up. Don’t forget to check out her zines as well. Day 2: Real-time packet analysis at scale - Douglas Creager Douglas (from Google) ran through some interesting techniques for troubleshooting a hypothetical music streaming issue using packet analysis. Google created a tool called ‘TraceGraph’ which plots the number of packets (by-type)/ window vs time, to show interruptions in data-flow. Unfortunately he didn’t deep-dive into much ‘at-scale’ detail. Day 3: UX Design and Education for Effective Monitoring tools - Amy Nguyen Amy deep-dived on how you build a body of work that creates an engaging monitoring tool. She did a great job of highlighting anti-patterns in monitoring tools. She went on to give tips on how you build effective UIs for monitoring systems. Final words Firstly, Kudos to the Monitorama team for running the conference so smoothly given what they had to deal with. Unfortunately, the conference had some competing threads on how you should create a monitoring philosophy which probably didn’t help the smaller companies in attendance. The idea that monitoring is broken is a half-truth at best. We have the best tools we’ve ever had, we just haven’t been able to put a coherent strategy together (this is something I’ll try to blog about next week). 
My key take-aways are: Provide metrics/ logging/ tracing functionality in frameworks so they are free for developers We need a better way to ingest monitoring data in a sensible/ low-cost manner Need to make it easy to take all of this data and make it explorable/ usable to everyone. Also, make it as consistent as possible! Alert sensibly, don’t get paged for something that can wait 12 hours. You should care about how oncall affects your work and your life outside of work

Monitorama Review Day 3

Hi again, These are today’s notes for Monitorama Day 3. Link to the video is here Today’s Schedule Monitoring in a world where you can’t “fix” most of your systems Errors - Brandon Burton UX Design and Education for Effective Monitoring Tools - Amy Nguyen Automating Dashboard Displays with ASAP - Kexin Rong Monitoring That Cares (The End of User Based Monitoring) - Francois Concil Consistency in Monitoring with Microservices at Lyft - Yann Ramin Critical to Calm: Debugging Distributed Systems - Ian Bennett Managing Logs with a Serverless Cloud - Paul Fisher Distributed Tracing at Uber scale: Creating a treasure map for your monitoring data - Yuri Shkuro Kubernetes-defined monitoring - Gianluca Borello Monitoring in a world where you can’t “fix” most of your systems Errors - Brandon Burton Challenge Git clone failures in Mac environment…was a DNS issue Third party service outages - pipit, rubygems, launchpad PPAs Stuff changed somewhere…leftpad Can’t always look at logs due to privacy concerns Lots of security/ privacy challenges So where are we: Adding metrics on jobs as trends UX Design and Education for Effective Monitoring Tools - Amy Nguyen Recent projects Tracing D3 Cache for OpenTSDB Documentation Why should we care about user experience Prevent misunderstandings - not everyone should be an expert at interpreting monitoring data Developer velocity - help people reach conclusions faster Data democracy - you don’t know what questions people want to answer with their own data UX and your situation (pyramid) Team Documentation Tools (this talk) UX and your situation Sharing what you know Education vs intuition Best practices - Use your expertise to determine the most helpful default behavior Potential pitfalls Performance: Low hanging fruit Backend Roll-up data over long time ranges Store latest data in memory (e.g. 
FB Gorilla paper and Beringei project) Add a cache layer Frontend Don’t reload existing data if user changes time window Prevent the user from requesting the data incessantly Lazy-load graphs Designing what your users want Performance exploration Simplicity Automating Dashboard Displays with ASAP - Kexin Rong Talk outline motivation observation our research going fast Problem: Noisy Dashboards How to smooth plots automatically: More informative dashboard visualization Big idea: Smooth your dashboards Why: 38% more accurate, 44% faster response What do my dashboards tell me today Is plotting raw data always the best idea Q: What’s distracting about raw data? A: In many cases, spikes dominate the plot Q: What smoothing function should we use? A: Moving average works Constraint: preserve deviations in plots metric: measure kurtosis of the plot Use: scipy ASAP - As smooth as possible - while preserving long-term deviations Use ASAP.js library Going fast: Q: Finding optimal window size: A: Use grid search Monitoring That Cares (The End of User Based Monitoring) - Francois Concil Doesn’t matter what the monitoring system says, the experience is broken for the user “There are three types of lies: Lies, damned lies, and service status pages” You need to talk about monitoring early in the development cycle “The key to not being woken up by monitoring alerts is to fix the cause for alerts” - Someone on the internet, probably Consistency in Monitoring with Microservices at Lyft - Yann Ramin Approaches and techniques to avoid production incidents with hundreds of micro services and diverse teams What are we trying to solve: when developers are oncall with micro services OR scaling operational mindfulness I clicked production deploy and Jenkins went green - Opportunity to grow operational mindfulness No one set up a PagerDuty list before going to production We need alarms on things! 
Let’s copy and paste them from my last service We routinely approach monitoring as operations We don’t have the SRE role - we hire people who understand operations We have system metrics (CollectD, custom scripts) What do we get (with consistency in monitoring) Consistent measurement Consistent visibility Point-to-Point debugging Unified tracing Salt module for orchestration (orca) provisions resources interacts with PagerDuty, ensures a PD service is created makes sure there’s an oncall schedule blocks deploys if these are missing Dashboards: git monorepo ties in with salt dashboards defined in salt every service gets a default dashboard on deploy! Add extra dashboards via Salt Benefits consistent look and feel always got alarms flexibility Critical to Calm: Debugging Distributed Systems - Ian Bennett 3bn metrics emitted per minute Twitter uses YourKit for profiling Peeling the onion - debugging methodology Metrics/ Alerting Tuning Tracing/ Logs Profiling Instrumentation/ Code change When to peel make a single change, rinse, repeat avoid the urge to make too many changes Performance tips keep your code abstracted fixes should be as isolated as possible critical issues, pager angry: don’t panic Gut can be wrong You will get tired You will get frustrated May take days to come to correct fix Some examples of troubleshooting Managing Logs with a Serverless Cloud - Paul Fisher Monitoring the monolith Logs seem like a good best practice Doesn’t scale well Literally burn money via vendors Move from logs to metrics - you shouldn’t need to log into boxes to debug Lyft constrains AWS Avoid vendor lock-in Lyft’s logging pipeline Heka on all hosts Kinesis Firehose Kibana proxy to auth Elastalert PagerDuty Detailed walkthrough of pipeline Distributed Tracing at Uber scale: Creating a treasure map for your monitoring data - Yuri Shkuro Why Use tracing for dependency tracing Root Cause analysis distributed transaction monitoring Demo Adding context propagation (tracing) is hard 
in existing code-bases Solution - Add to frameworks (simple config changes) They must want to use your product - or sticks and carrots Each organization is different - find the best way to implement it measure adoption - trace quality scores ensure requests are being properly traced Kubernetes-defined monitoring - Gianluca Borello Monitoring Kubernetes Best practices ideas proposals 4 things that are harder now (with microservices and kubernetes) Getting the data You (dev) should not be involved in monitoring instrumentation You (dev) should not be involved in producing anything that’s not a custom metric Collect everything making sense of the data troubleshooting people A lot of tools we have now are not container-aware
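To make the ASAP notes above a little more concrete, here is a minimal sketch of the idea as I understood it from Kexin's talk: smooth with a moving average, use kurtosis as the "did we keep the big deviations" check, and grid-search the window size. This is my own simplified reading in plain Python, not the ASAP.js implementation. The roughness measure (standard deviation of point-to-point differences) and all function names are assumptions made for illustration.

```python
from statistics import mean, pstdev


def moving_average(series, window):
    """Simple moving average; the output is shorter than the input by window - 1."""
    return [mean(series[i:i + window]) for i in range(len(series) - window + 1)]


def kurtosis(series):
    """Fourth standardised moment: higher values mean more pronounced outliers."""
    mu = mean(series)
    variance = mean([(x - mu) ** 2 for x in series])
    if variance == 0:
        return 0.0
    return mean([(x - mu) ** 4 for x in series]) / variance ** 2


def roughness(series):
    """Assumed smoothness measure: spread of point-to-point differences."""
    return pstdev([b - a for a, b in zip(series, series[1:])])


def pick_window(series, max_window):
    """Grid search: smoothest window whose kurtosis is at least the raw plot's."""
    target = kurtosis(series)
    best_window, best_roughness = 1, roughness(series)
    for window in range(2, max_window + 1):
        smoothed = moving_average(series, window)
        candidate = roughness(smoothed)
        if kurtosis(smoothed) >= target and candidate < best_roughness:
            best_window, best_roughness = window, candidate
    return best_window


if __name__ == "__main__":
    # Synthetic metric: a noisy sawtooth with a sustained bump in the middle.
    raw = [(i % 7) - 3 + (10 if 90 <= i < 110 else 0) for i in range(200)]
    window = pick_window(raw, max_window=20)
    print("chosen window:", window)
```

A window of 1 means the raw plot was kept as-is because no candidate both reduced roughness and preserved kurtosis. The real library is considerably smarter about pruning the search space, which is the "going fast" part of the talk.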

Monitorama Review Day 2

Hi all, Continuing yesterday’s notes. Last night there was actually a large power outage in downtown Portland which caused us to change venues today. These notes are somewhat incomplete, I’ll try to fix them up in the coming days. Thanks again to Sarah Huffman for her notes. Video can be found here Anomalies != Alerts - Betsy Nichols Sarah Huffman notes Now, Pain, Relief, Bonus: Use Case Detection & action need to be separated Because they aren’t: Anomalies = Alerts Pain 1. Alert Fatigue 1. Alerts -> Alert Fatigue Alerts = Anomalies Anomalies -> Alert Fatigue How many anomalies can we reasonably expect? Step 1. Population distribution Step 2. Population Density Step 3. Compute Anomalies/ Day To alert or not to alert decision required for each anomaly anomalies = TP (true positive) union FP (false positive) likely #FP >> #TP 2. Seeking needles Difficult to find a strategy to work out when to alert and when not to Relief Basic monitoring pattern Data -> Engine -> Alert Basic semantic context Streaming Async Sync Semantic Model (with analytics) Attribute Discovery Build data model using extra attributes Action Policy Works off what data we have and makes a decision Conditions Scope Actions Takeaways Best monitoring = Math + context preferred strategy anomalies + context = alerts Distributed Tracing: How we got here and where we’re going - Ben Sigelman Sarah Huffman notes Why are we here Chapter 1: What is the purpose of monitoring must tell stories get to ‘why’ (ASAP) storytelling got a lot harder recently One story, N storytellers microservices may be here to stay but they broke our old tools transactions are not independent the simple thing basic concurrency async concurrency Measuring symptoms metrics model symptoms well measure what the end user actually experiences aside: get raw timing from OpenTracing instrumentation There is such a thing as too many metrics Chapter 2: Where Introducing Donut Zone: DaaS Microservice-oriented Use OpenTracing Demo of open Trace All sampling must 
be all/ nothing per transaction Not all latency issues are due to contention A new way to diagnose Infer or assign contention IDs (mutexes, DB tables, network links) Tag Spans with each contention ID they encounter automated root-cause for contention More demos of open trace Science! Science Harder: How we reinvented ourselves to be data literate, experiment driven and ship faster - Laura Thomson & Rebecca Weiss Sarah Huffman notes Decision-making - without evidence How do you find information about how your releases change the use of the product Browser = App? Not quite. Need to test against various OS/ Languages/ Failure to abstract Build a system to answer questions Resulted in different data collection systems (blocklist ping vs telemetry) Working around privacy can be hard Unified telemetry: One infrastructure, many types of probes and pings Many types of events Unify mental model Transparency: Results must be reproducible as a URL Audit a number all the way down to the code Open science, open methodology Push more data you can Experimenting with experiments Update daily on any release channel real sampling flip a preference, install a feature, collect data multiple experiments in flight at once Data will shape the web. 
What kind of web do you want Real-time packet analysis at scale - Douglas Creager Sarah Huffman notes 2 Goals: Packet captures are useful You don’t have to make fundamental changes to your monitoring stack Example scenario Streaming a song via application You get drops Infer throughput problems Logs aren’t usually enough to help solve the problem In-app RUM can help Need to look at the network itself Flow can sometimes be helpful, but unlikely in this case Need to see inside the connections - Packet capture Don’t need to look at the actual payload Tool at Google - TraceGraph Time vs packets graphs packets (by type) and windows to show problems Ok so the graph visualizes the problem - How do we solve it Looks like we have buffer bloat - can’t fix that problem in ISPs TCPDump to the rescue Google streams packet-captures to a central server for processing Instrumenting The Rest Of the Company: Hunting for Useful Metrics - Eric Sigler Sarah Huffman’s notes We have problem $foo, we are going to do $bar What data did you use to understand $foo? And how will you know if $bar improved anything “Without data, you’re just another person with an opinion” Example: We have a chatbot to do everything that we don’t understand Takeaway: Look for ways to reverse engineer existing metrics Useful metrics are everywhere. You aren’t alone in digging for metrics. 
Existing tools can be repurposed Whispers in the Chaos: Monitoring Weak Signals - J Paul Reed Monitoring @ CERN - Pedro Andrade Sarah Huffman’s notes Introduction to CERN 40GB data/s x 4 to collect during testing Where is the data stored: Main site Extension site (with 3x100Gb/s link) - Budapest use standard commodity software/ hardware 14k servers, 200k cores 190PB data stored Pedro’s team provides monitoring for the data storage/ computation Use Open Source tools Collectd Kafka as transport Spark for processing HDFS long term storage Some data in ES/ InfluxDB Kibana/ Grafana for visualizing OpenStack VMs - All monitoring services run on this Config done with Puppet I volunteer as Tribute: The future of Oncall - Bridget Kromhout Sarah Huffman’s notes How many people dread phone ringing Change is the only constant
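Going back to the "how many anomalies can we reasonably expect?" steps from the Anomalies != Alerts talk, here is a rough sketch of that arithmetic. The assumptions are entirely mine, not from the slides: I assume each metric is roughly normally distributed and that anything beyond k standard deviations from the mean counts as an anomaly. The fleet size and sample rate in the example are made-up numbers.

```python
import math


def tail_probability(k_sigma: float) -> float:
    """Two-sided probability that a normal sample lies beyond k standard deviations."""
    return math.erfc(k_sigma / math.sqrt(2))


def expected_anomalies_per_day(metrics: int, samples_per_day: int, k_sigma: float) -> float:
    """Expected count of 'anomalous' points per day from noise alone."""
    return metrics * samples_per_day * tail_probability(k_sigma)


if __name__ == "__main__":
    # Hypothetical fleet: 10,000 metrics sampled once a minute.
    for k in (2, 3, 4):
        count = expected_anomalies_per_day(10_000, 24 * 60, k)
        print(f"{k}-sigma threshold: ~{count:,.0f} anomalies/day")
```

With these made-up numbers a 3-sigma rule flags roughly 39,000 points a day without anything actually being wrong, which is the talk's point: anomalies on their own are far too numerous to page on, so they need context before they become alerts.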

Monitorama Review Day 1

Hi all, I wanted to write some super rough notes of the various Monitorama talks for those (especially my peers) who weren’t able to attend this year. I’d like to give a shout-out to Sarah Huffman who drew notes from the presentations today Note: You can watch the stream here Note: I’ve done my best to put the key take-aways into each presenter’s talk (with my own opinions mixed in where noted). If you feel like I’ve made an error in representing your talk, please let me know and I’ll edit it. Today’s Schedule: The Tidyverse and the Future of the Monitoring toolchain - John Rauser Martyrs on Film: Learning to hate the #oncallselfie - Alice Goldfuss Monitoring in the Enterprise - Bryan Liles Yo Dawg: Monitoring Monitoring Systems at Netflix - Roy Rapoport Our Many Monsters - Megan Anctil Tracing Production Services at Stripe - Aditya Mukerjee Linux debugging tools you’ll love - Julia Evans Instrumenting SmartTV’s and Smartphones in the Netflix app for modeling the Internet - Guy Cirino Monitoring: A Post Mortem - Charity Majors The Vasa: Redux - Pete Cheslock The Tidyverse and the Future of the Monitoring toolchain - John Rauser Sarah Huffman Notes R-language Tidyverse - “set of shared principles” The ideas in the tidyverse are going to transform everything having to do with data manipulation and visualization ggplot2 - compact and expressive (vs D3 lib) way to draw plots Dataframe - Tibble (nested data frame) flexible, uniform data container R language - Can pipe datasets and chain operations together dplyr - will displace SQL like languages for data-analytics work. 
DSL for data manipulation How to get started - RStudio Goal: Inspire tool makers - programming as a way of thinking “Toolmakers should look to the tidyverse for inspiration” Martyrs on Film: Learning to hate the #oncallselfie - Alice Goldfuss Sarah Huffman Notes Benefits of oncall Hones troubleshooting Forces you to identify the weak points in your systems Teaches you what is and isn’t production-ready Team bonding Learn to hate the on call selfie - people complained on Twitter I get paged a lot (noted via #oncallselfie) We use oncall outages as war-stories - and be heroes Action scenes stop the plot Red flags (from Alice’s survey) Too few owning too much Symptoms of larger problems: bumping thresholds snooze pages delays Poor Systems visibility/ Team visibility Too many pages 17% of people said 100+ (worst case) 1.1% people got 25-50 (best case) How do we get there Cleanup - actionable alerts Something breaks Customers notice I am the best person to fix it I need to fix it immediately (side note) Cluster alerts - Get 1 alert for 50 servers rather than 50 alerts for 50 servers Devs oncall - More obligated to fix issues Companies who actively look at oncall numbers Heroic Etsy Github Monitoring things at your day job (Monitoring in the enterprise) - Bryan Liles Sarah Huffman Notes Steps 1. Pick a tool 2. Pick another tool 3. 
Complain How do they know what to monitor How do they know when changes happen New problem: what should you monitor New problem: what should you alert on New problem: who should you alert New problem: what tools should I use New problem: how do you monitor your monitoring tools Step back and answer: How do you know if your stack works How do you know if your stack works well for others SLI - Service level indicator - measurement of some aspect of your service SLO - service level objective - target value SLA - service level agreement - what level of service have you and your consumers agreed to White-box vs black box monitoring Black box: Garbage in -> service -> garbage out White box: service (memory/ cpu/ secret sauce) How do you know if you’re meeting SLAs/ SLOs? Logs Structured log (json logs) Aggregate (send them somewhere centrally) Tell a story Metrics One or more numbers give details about something (SLI) Metrics are combined to create time-series Tracing: Single activity in your stack touches multiple resources MK Note: Bryan is talking on open-tracing at Velocity Health endpoints E.g. GET /healthz {"database": "ok", "foo": "ok", "queue_length": "ok", "updated at": <datetime>} do you know what’s going on Logs Metrics Tracing Other things e.g. 
what happened at 3pm yesterday logs, metrics, tracing, other things paint a picture How do we ensure greater visibility: Central point of contact for alerts Research tooling practices for teams What types of monitoring tools do we need Philosophies: USE: utilization, saturation and errors - Brendan Gregg RED: rate, errors (rate), duration (distribution) Four golden signals: (latency, traffic, errors and saturation) - Google Yo Dawg: Monitoring Monitoring systems at Netflix - Roy Rapoport Sarah Huffman Notes A hero’s journey - product development lifecycle This will scale for at least a month Monitoring ain’t alerting Alerting - outputs decisions and opinions “everything counts in large amounts” “the graph on the wall tells the story….” 20-25k alerts a day at Netflix Have another monitoring system to monitor your monitoring system (Hot/ Cold) watcher Question “Is one tv show/ movie responsible for more Netflix Outages” - Alice Goldfuss Our Many Monsters - Megan Anctil Sarah Huffman Notes Why, metrics, logging, alerting Vendor vs Non-vendor Business need Cost!!!! vizOps at Slack - 1-5 FTE Deep-dive into Slack implementations of: Monitoring: Graphite/ Grafana Logging: ELK Alerting: Icinga Cost analysis for above platforms Lessons learnt Usability - escalation info must be valuable Creation - must be easy Key takeaway: $$$ “Is it worth it?” Is the time worth it? Tracing Production Services at Stripe - Aditya Mukerjee Sarah Huffman Notes Tracing is about more than HTTP requests Veneur - “If you need to look at logs, there’s a gap in your observability tools” Metrics - no context Logs - hard to aggregate Request traces - require planning What’s the difference between metrics/ logs/ tracing (if you squint, it’s hard to tell them apart) What if we could have all three, all the time??? 
Standard sensor format - Easier to do all three Intelligent metric pipelines (before the monitoring applications) Linux debugging tools you’ll love - Julia Evans Sarah Huffman Notes Accompanying Zine Starting off: read code, add print statements, know language Wizard tools strace tcpdump etc gdb perf eBPF ftrace Ask your OS what your programs are doing strace can make your applications run 50x slower MK Note: Julia walked through some examples where time/ strace/ tcpdump/ ngrep were all helpful Instrumenting Smart TVs and smartphones in the Netflix app for modeling the internet - Guy Cirino Sarah Huffman Notes Making the internet fast is slow faster - better networking slower - broader reach/ congestion Don’t wait for it, measure it and deal Working app > feature rich app We need to know what the internet looks like, without averages Logging anti-patterns Averages - can’t see the distribution, outliers heavily distort Sampling - missed data, rare events RUM data Don’t guess what the network is doing - measure it! Monitoring: A Post Mortem - Charity Majors Sarah Huffman Notes The Vasa: Redux - Pete Cheslock Sarah Huffman Notes Sponsor talks (only calling out what I choose to) Netsil Application maps Gives you visibility of your topology Techniques APM Tracking (zipkin) proxies OS tracing (pcap/ eBPF) MK Note: Not sure how this works for encrypted data streams Datadog They are hiring Apparently so is everyone else What do you look for Knowledge Tools Experience Suggestions Knowledge Write blog pieces Meetups (Knowledge) Tools Open source Experience Internships Share your knowledge Share your tools Share your experience
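One of the logging anti-patterns noted above, that averages hide the distribution and get distorted by outliers, is easy to see with a few numbers. The following is a minimal sketch only: the latency values are made up for illustration, and the helper name percentile is my own, not something from the talk.

```python
import statistics


def percentile(samples, p):
    """Nearest-rank percentile for a whole-number p in 1..100.

    Returns the smallest sample that has at least p% of the data at or below it.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, which avoids floating-point rounding.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds:
# 98 fast requests and 2 very slow ones.
latencies_ms = [20] * 98 + [2000, 5000]

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

if __name__ == "__main__":
    print(f"mean={mean_ms}ms p50={p50_ms}ms p99={p99_ms}ms")
```

With this made-up data the mean lands well above what a typical request sees and well below what the slowest requests see, so it describes nobody's experience. Reporting a median alongside a high percentile keeps both ends of the distribution visible.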

Publication Updates (March 11 2017)

Hi all, I just updated my publications page with my APRICOT presentation from earlier in the month. If you’re coming to SRECon Americas 2017 this coming week, come and check out my presentations: Traffic shift: Avoiding disasters at scale Reducing MTTR and false escalations: Event Correlation at LinkedIn

Command of the Day 2: time

Simple explanation Prints the time it took to execute a command whatis time(1) - time command execution --help None man man 1 time Usage time executable - Time the execution of the program executable time -p executable - Time the execution of the program executable and print the result in POSIX.2 format
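If you want the same three numbers from inside a script rather than at the shell, the idea can be approximated in Python. This is a rough sketch, not how time itself is implemented: the function name time_command is my own, and the user/sys figures come from os.times(), which reports child CPU time as zero on some platforms (for example Windows).

```python
import os
import subprocess
import sys
import time


def time_command(argv):
    """Run argv and return wall-clock, user CPU and system CPU seconds."""
    before = os.times()
    start = time.perf_counter()
    completed = subprocess.run(argv, check=False)
    real = time.perf_counter() - start
    after = os.times()
    return {
        "real": real,
        # CPU time consumed by child processes that have been waited for.
        "user": after.children_user - before.children_user,
        "sys": after.children_system - before.children_system,
        "returncode": completed.returncode,
    }


if __name__ == "__main__":
    # Time a trivial child process: the current Python interpreter doing nothing.
    result = time_command([sys.executable, "-c", "pass"])
    # Mirrors the layout that `time -p` prints.
    print(f"real {result['real']:.2f}")
    print(f"user {result['user']:.2f}")
    print(f"sys {result['sys']:.2f}")
```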

Reminders about using traceroute in multi-path networks

Introduction First off, I would like to give a plug for this presentation about how to use/ interpret traceroute. Traceroute was written before the times of MPLS, ECMP and Clos network designs. Since the rise of those technologies/ topologies, the way that engineers troubleshoot latency/ packet-loss hasn’t really evolved. If you look at companies like Google/ Facebook/ LinkedIn, they all run Clos topologies within their production datacenters. So what does this mean to engineers troubleshooting production issues? Firstly, the number of paths from Host A to Host B is usually going to be > 1. This means MTR (and some default uses of traceroute) are largely obsolete. Example Systems engineer A complains about higher latency from his application (spread across the datacenter) to the Oracle database that his application uses. Systems engineer B also complains that his (different) application is seeing higher latency to a different Oracle database (both databases are located in the same segment of the network). The go-to troubleshooting step here is to ask the DBA to verify whether they are seeing higher execution times on the database that correlate with the perceived rise in latency. In this case, the DBA says there is no issue. The systems engineers then go to the network engineers and ask for help diagnosing the issue, confident that it isn’t an application-level issue. The network engineers ask for a traceroute and an mtr print-out. This would be perfect in an older-style network topology, but is not good for Clos style networks. In this particular case, there are ~20 distinct paths between Host A (application) and Host B (database). In a best case scenario, traceroute might show you 2-3 paths (by default), while mtr will only show you one (the first one it discovers). You are unlikely to find the bad path this way. What to do next? Linux traceroute (with a few arguments) can actually help you here.
traceroute allows you to send multiple queries per hop to help discover all the interfaces you’d be routed through and potentially see interfaces with packet-loss or higher latency. e.g. [root@hostA ~]$ traceroute -q 10 hostB This may not discover all paths, but it is a start. Of course, you can run it multiple times to build a bigger picture of the network topology. Tools Facebook wrote this great article on troubleshooting networks and released fbtracert, which is designed to fix this exact problem. Paris Traceroute is another open-source tool aiming to address this space. I haven’t used it, so I cannot vouch for its usefulness.
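To see why sending more probes helps, it can be useful to model how ECMP picks a path. The sketch below is a toy simulation under stated assumptions: real routers use vendor-specific hash functions over some subset of the packet headers, whereas here a SHA-256 of the 5-tuple stands in for that hash, and the addresses, source port and function names are all made up for illustration. The only real-world detail is 33434, the traditional base UDP destination port for traceroute, which is incremented for each probe.

```python
import hashlib

PATH_COUNT = 20      # matches the ~20 distinct paths in the example above
BASE_PORT = 33434    # traditional traceroute base UDP destination port


def pick_path(src_ip, dst_ip, src_port, dst_port, protocol, path_count):
    """Toy ECMP: hash the flow's 5-tuple and map it onto one of the paths."""
    flow = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{protocol}".encode()
    digest = hashlib.sha256(flow).digest()
    return int.from_bytes(digest[:8], "big") % path_count


def paths_seen(probes, vary_port):
    """Return the set of path indexes that a series of probes would traverse."""
    seen = set()
    for i in range(probes):
        # A fixed 5-tuple always hashes to the same path. Changing the
        # destination port per probe changes the hash input.
        dst_port = BASE_PORT + i if vary_port else BASE_PORT
        seen.add(pick_path("10.0.0.1", "10.0.1.1", 40000, dst_port, "udp", PATH_COUNT))
    return seen


if __name__ == "__main__":
    print("fixed flow:    ", len(paths_seen(200, vary_port=False)), "path(s)")
    print("varying ports: ", len(paths_seen(200, vary_port=True)), "path(s)")
```

In this model a tool that keeps the same flow identifier for every probe only ever exercises one path, which is the behaviour described for mtr above, while probes with differing ports spread across many of the available paths.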

Command of the Day 1: compgen

Simple explanation Show all available commands, aliases and functions whatis compgen [builtins] (1) - bash built-in commands, see bash(1) help compgen: usage: compgen [-abcdefgjksuv] [-o option] [-A action] [-G globpat] [-W wordlist] [-F function] [-C command] [-X filterpat] [-P prefix] [-S suffix] [word] man man 1 bash Usage How are we going to create a list of commands for ‘Command of the Day’? Compgen! compgen -a: List of user aliases -b: List of built-in shell commands -c: List of all commands you can run -e: List of exported shell variables -k: List of shell reserved words (keywords) -A function: List of available bash functions Tip Create an alias for compgen to show everything at once: alias compgen='compgen -abckA function'. This prints, one per line, every alias, built-in, command, keyword and function available References
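compgen is a bash built-in, so it only works inside a bash session. If you need a similar list from a script, the external-command part of compgen -c can be approximated by scanning the directories on PATH. This is a sketch under that assumption: the function name list_path_commands is my own, and it cannot see aliases, built-ins, keywords or shell functions, because those only exist inside the running shell.

```python
import os


def list_path_commands(path=None):
    """Return the sorted names of executable files found in the PATH directories."""
    if path is None:
        path = os.environ.get("PATH", "")
    names = set()
    for directory in path.split(os.pathsep):
        # Skip empty entries and directories that do not exist.
        if not directory or not os.path.isdir(directory):
            continue
        try:
            entries = os.listdir(directory)
        except OSError:
            # Unreadable directory: ignore it and carry on.
            continue
        for name in entries:
            full = os.path.join(directory, name)
            if os.path.isfile(full) and os.access(full, os.X_OK):
                names.add(name)
    return sorted(names)


if __name__ == "__main__":
    commands = list_path_commands()
    print(f"{len(commands)} commands found on PATH")
```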

Command of the Day

There are lots of really interesting commands on Unix/ Linux systems that are either poorly documented or plainly forgotten about. So in an effort to educate myself and others, I thought I would try to do a command of the day. I’ll aim to do 5 posts a week. Sometimes I’ll group a couple of commands into one post if they’re related.

2017 Devops Conferences

Hat-tip to Sarah Drasner who came up with a list of 2017 Front-end conferences that inspired this list. If I have missed any, please tweet me at @matrixtek and I’ll consider adding it to the list. Note: There is a larger list over here that lists a number of the smaller conferences January None listed February Devops Days Charlotte March Devops Days Baltimore Devops Days Vancouver Elasticon 2017 SRECon17 Americas Strata + Hadoop World 2017 April Devops Days Atlanta Devops Days Seattle May ApacheCon Devops Days Austin Devops Days Salt Lake City Devops Days Toronto Devops Days Zurich Monitorama PyCon17 US SRECon17 Asia/ Australia June Devops Days Amsterdam Velocity San Jose July Devops Days Minneapolis GopherCon August SRECon17 Europe/ Middle East/ Africa September Devops Days Detroit October LISA17 Velocity London Velocity New York November None listed December None listed No Date Listed SaltConf Couchbase Connect 17

Keeping a NTP server secure

Introduction Over the years, bad actors have used the Network Time Protocol (NTP) as a successful DDoS attack vector. Generally speaking, these attacks are made possible by NTP misconfiguration. This post will look at how to build and configure an NTP server and provide insight to help keep your NTP server safe. Assumptions 1. You are not exposing this server to the public internet 2. You are running several NTP servers within your network to keep it highly-available. Tip: Using Anycast is a good way to create a highly-available set of NTP servers for a larger-sized network 3. If you really need very accurate time, do not run NTP servers on Virtual Machines or Containers. 4. You are running a firewall that blocks packets that come from outside the network on UDP port 123 Installation Debian based systems: apt-get install ntp Red Hat based systems: yum install ntp Configuration This is an example configuration for a Linux Server with upstream time servers and NTP clients connecting to the server from the network. iptables rules: -A INPUT -s 0/0 -d 0/0 -p udp --source-port 123:123 -m state --state ESTABLISHED -j ACCEPT -A OUTPUT -s 0/0 -d 0/0 -p udp --destination-port 123:123 -m state --state NEW,ESTABLISHED -j ACCEPT Automation I strongly suggest you use some type of configuration management system to manage NTP configuration over a fleet of systems. I recommend: Puppet - puppetlabs-ntp References

Complete guide to iptables implementation

I’ve been wanting to put this article together for some time: a complete guide to implementing iptables on a Linux server. Firstly, my assumptions: You have a reasonable grasp of Linux and iptables You want to use iptables to secure a Linux server The Basics By default, iptables has three chains in the filter table: INPUT, OUTPUT and FORWARD. In this case, we’re going to focus on the INPUT chain (incoming to the firewall, i.e. packets destined for the local server). Implementation Automation I implement these rules using the puppet-iptables module. The module is regularly updated and has a very large feature-set. References:

Anatomy of the cmdline: id Tool References [Man Page] Key components of the tool Arguments -g/ --group -G/ --groups -n/ --name -r/ --real -u/ --user --version C Source Code GitHub