2019 Devops Conferences

Decided to make a list again of 2019 Conferences. Feel free to ping me on twitter (@matrixtek and I can add anything I missed. January DevOpsDays New York City (24th-25th) February DevOpsDays Charlotte (7th-8th) DevOpsDays Geneva (21st-22nd) March QCon London (4th-8th) ScaleConf Cape Town (6th-8th) DevOpsDays Los Angeles (8th) IETF 104 (23rd-29th) Usenix SRECon Americas (25th-27th) DevOpsDays Vancouver (29th-30th) April QCon Beijing (25th-27th) DevOpsDays Tokyo (9th-10th) DevOpsDays Sao Paulo (10th-11th) DevOpsDays Seattle (23rd-24th) May DevOpsDays Austin (2nd-3rd) QCon Sao Paulo (6th-8th) DevOpsDays Zurich (14th-15th) DevOpsDays Salt Lake City (14th-15th) DevOpsDays Kyiv (17th-18th) Interop ITX (20th-23rd) KubeCon/ CloudNativeCon Barcelona (20th-23rd) June Monitorama Portland (3rd-5th) O’Reilly Velocity San Jose (10th-13th) Usenix SRECon Asia/ Australia (12th-14th) KubeCon/ CloudNativeCon China (24th-26th) QCon New York (24th-28th) DevOpsDays Amsterdam (26th-28th) July O’Reilly OSCon (15th-18th) IETF 105 (20th-26th) August DevOpsDays Minneapolis (6th-7th) September DevOpsDays Cairo (9th) October Usenix SRECon EMEA (2nd-4th) QCon Shanghai (17th-19th) Usenix LISA (28th-30th) November O’Reilly Velocity Berlin (4th-7th) QCon San Francisco (11th-15th) IETF 106 (16th-22nd) KubeCon/ CloudNativeCon San Diego (18th-21st)

Publication Updates (Jul 22 2018)

In the past month, I have had the pleasure to be able to record a few podcasts and have some other work published. You can find it all here: The Importance of Soft Skills in Engineering PyBay: Meet Michael Kehoe: Building Production Ready Python Applications PyBay Medium Fullstack Journey (PacketPushers): Michael Kehoe PacketPushers NetworkCollective: Michael Kehoe Network Collective

Publication Updates (June 05 2018)

Hi all, I’ve recently updated my publications page with my latest presentations from: Interop ITX 2018: The future of Reliability Engineering Velocity New York 2018: How to Monitor Containers Correctly SF Reliability Engineering - May Talks Devops Exchange SF April 2018: How to Build Production-Ready Microservices Information Week: 3 Myths about the Site Reliability Engineer, Debunked You can also find me later in the year at: PyBay 2018: Building Production-Ready Python Microservices Velocity New York 2018: How to Monitor Containers Correctly

SRECon Americas 2018 Day 2 Notes

Day 2 of SRECon started with an introduction from Kurt and Betsy (the Program Chairs) and then three set of plenary’s talks. The following is a set of notes I put together from the talks I went to today. If you Don’t Know WHere You’re Going, It Doesn’t Matter How Fast You Get There Nicole Forsgren & Jez Humble Slides found here. The talk generally was about strategic planning and measuring your performance/ success. This part of IT is actually very under-valued. These Tweets sums up the presentation well: - Where am I going? - Why do we care? - Improving performance/quality - Measuring performance - Culture & how to measure it#SREcon — Murali Suriar (@msuriar) March 28, 2018 Key point - it's the direction, not the destination that matters. #SREcon — Murali Suriar (@msuriar) March 28, 2018 Some nice points about the mis-use of Velocity and Utilization as Key Performance Indicators (KPI). Security and SRE: Natural Force Multipliers Cory Scott Slides here Heirarchy of Needs in SRE: Monitoring & Inciddent Response Postmortem & Analysis Testing & Release Procedures Capacity Planning Product Problem Statement: High Rate of Change Trust but verify Embrace the Error Budget Inject Engineering Discipline Testing in Production is the new normal Dark Canaries Security Challenges are similar to SRE Latency & perf impact Cascading failure scenarios Service Discovery Security Challenges Authentation Authorization Access Control Logic Data center technologies can be all controlled with a single web page application. Start with a known-good state Asset management Ensure visibility Validate consistently and constantly Takeaways or Giveaways Your data pipeline is your security lifeblood Human-in-the-loop is your last resort, not your first option All security solutions must be scalable Remove single points of security failure like you do for availability Assume that an attacker can be anywhere in your system or flow Capture and measure meaningful security telemetry What it Really Means to Be an Effective Engineer Edmond Lau See Effort <> Impact Impact = Hours spent x (impact produced/hours spent) Leverage = impact produced/ hours spent What are the high-leverage activities for engineers? Do the simple thing first Effective engineers invest in iteration speed Effictive engineers validate their ideas early and often What are the best practices for building good infrastructure for relationships Effective engineers explicitly design their alliances Effective engineers explicitly share their assumptions Work hard and get things done and focus on high-leverage activities and build infrastructure for their relationships The Day the DNS died Jeremy Blosser Slides here Impact Sending mail Application traffic Metrics Diagnosing blind (without metrics) is difficult! resolv.conf is fiddly Can only use first 3 entries Diagnosis Assymetric DNS packet flow (94% packet loss) The Cause [Undocumented] Connection tracking Response Incident response was functional Ability to respond was compromised New DNS design required New Design Dedicated VPC for isolation Open security groups with ACLs Seperate clusters for app/db vs MTA Use DNSMasq for local caching Lessons learnt Not all cloud limits are apparent Instrument your support services and protect It’a always a DNS problem…excpet when it’s a firewall problem resolf.conf is not agile Stable and Accurate Health-Checking of Horizontally-Scaled Services Lorenzo Salno See new Fastly paper on load-balancing: Balancing of the edge: transport affinity without network state - NSDI paper PoP Deployments Space and power are at a premium Architecture: Building a smarter load balancer Methods: machine learning - classifier signal processing - filter control theoery - controller Design Multiple stages system Denoising - remove noise from input signal Anomaly detection - identify misbehaving instance Hysteresis filter - stabilize output Implementation Host signals go into a filter which makes a decision about global state of host Don’t Ever Change! Are Imutable Deployments Really Simplier Faster, and Safer? Rob Hirschfelf Immutable Patterns Baseline + Config Live Boot + Config Image Deploy Image creation Do the configuration and capture the immage into a portable format This sounds like a lot of work and really slow Yes, but it’s faster, safer and more scalable Lessons Learned from Our Main Database Migrations at Facebook Yoshinori Matsunobu User Database Stores Social Graph Massively Sharded Low latency Automated Operations Pure Flash Storage What is MyRocks MySQL on top of RocksDB Open Source, distributed from MariaDB and Percona MyRocks Features Clustered Index Bloom filter and Column family Transactions, including consistency betweenbinlog and RocksDB Faster data loading, deletes and replication Dynamic Options TTL Online logical and binary backup MyRocks pros vs InnoDB Much smaller space (half compared to compressed InnoDB) Writes are faster Much smaller bytes written MyRocks cons vs InnoDB Lack several features No FK, Fulltext index, spactial index Must use Row based binary logging format Reads are slower than InnoDB Too many tuning options MyRocks migration - technical challenges Migration Creating MyRocks instances without downtime Creating second MyRocks instance without downtime Shadow traffic tests Promoting new master InnoDB vs MyRocks, from SRE PoV Server is busier because of double density Rocksdb is much newer database that changes rapidly MyRocks/ RocksDB relies on BufferIO For large transactions, COMMIT needs more work than InnoDB There are too amny tuning options Faster writes means replication slaves lag less often Issue: Mitigating stalls We upgraded kernel to 4.6 Changed data loading queries (schema changes) to use MyRocks bulk loading feature COmmit stalls every few mins –> now nearly zero Issue: A few deleted rows re-appeared Some of our secondary indexes had extra rows Turned out to be a bug in RocksDB compactions that in rare cases on heavy deletions, tombstones might now have been handled correctly Issue: Scanning outbound delete-markers Counting from one of the empty tables started taking a few minutes Lesons Learned Learn how core components work RocksDB depends on linux more than innodb Understanding how Linux works helps to fix issues faster Do not ignore outliers Many of our issues happened on only handful instances Leveraging Multiple Regions to Improve Site Reliability: Lessons learnt from Andrew Duch Lesson 1: Sorry I missed this Lesson 2: Not everything has to be active-active Lesson 3: Three is cheaper than two - you waste 50% of you capacity in a active-active model Lesson 4: Practie, Practice, Practice Lesson 5: Failover Automation needs to scale Unfortunately I had to skip the final set of sessions this afternoon due to a conflict. From all acounts, the sessions this afternoon were great. See everyone tomorrow for day 3!

SRECon US 2018 Day 3: What I'm seeing

The talk’s I’m wathing today are: Containerization War Stories Resolving Outages Faster with Better Debugging Strategies Monitoring DNS with Open-Source Solutions “Capacity Prediction” instead of “Capacity Planning”: How Uber Uses ML to Accurately Forecast Resource Utilization DIstributed Tracing, Lessons Learned Whispers in Chaos: Searching for Weak Signals in Incidents Architecting a Technical Post Mortem Your System has recovered from an Incident, but have your Developers The Day 3 Plenary sessions are: The History of Fire Escapes Leaping form mainframes to AWS: Technology Time Travel in the Government Operational Excellence in Aprils Fools’ Pranks Come and say Hi if you see me!

SRECon Americas 2018 Day 1 Review

Hi all, This year marks my 3rd year at SRECon Americas. This year brings a 3-day format with the first day being exclusively dedicated to workshops. Hooray! The workshops included: Containers from Scratch SRE Classroom, or How to Build a Distributed System in 3 Hours Profiling JVM Applications in Production Incident Command for IT - What We’ve Learned from the Fire Department Kubernetes 101 Chaos Engineering Bootcamp Ansible for SRE Teams Tech Writing 101 for SREs For the first session, I attended the Containers from Scratch session. As someone who understands the practical implementation of containers, I really appreciated seeing all the details behind it. You can find the following resources from the presentation: Tutorial material Linux Primitives I unfortunately didn’t get a chance to see any of Brent Chapman’s session today on Incident Management, but after going to his BayLISA presentation two weeks back, I know it would have been great. You can find his presentation materials here Bridget Kromhout did a detailed Kubernetes 101 session. From all accounts, it was awesome. You can find relevant materials here: SRECon Slides GitHub You can find James Meickle’s presentation on ‘Ansible for SRE’ here Update (March 28th, 8am): Tammy Butow posted her materials from her Chaos Engineering Bootcamp workshop: Github Speaker Deck Update (March 29th, 12pm): Dan Luedtke did his own version of the Containers Workshop in Go. See the post here Finally, I spent a little bit of time in the LinkedIn Engineering booth, thanks for everyone who stopped by and say Hi! to us.

San Francisco Chaos Engineering Meetup Slides

Tonight I have the priviledge of speaking alongside Russ Miles and Kolton Andrus at the San Francisco Chaos Engineering Meetup. You can find my slides from the event here

Publication Updates (May 27 2017)

Hi all, I just updated my publications page with links to my SRECon17 Americas talks, my new LinkedIn engineering blog post. It was announced this week I will also have the privilege of speaking at SRECon17 EMEA in Dublin later this year. You can find me talking about: Networks for SRE’s: What do I need to know for troubleshooting applications Reducing MTTR and false escalations: Event Correlation at LinkedIn

Monitorama 2017 Summary

The past few days, I’ve been in Portland for the 2017 Monitorama conference. The conference had to literally fail-over between venues Monday night due to a large power-outage across the city. Monitorama brought together a a diverse crowd of engineers and vendors to spend 3 days discussing on call, logging, metrics, tracing and the philosophy of it all. You can find the schedule here And the video’s for each day: Day 1 Day 2 Day 3 ** ** Content Summary For some reason, there was a large amount of content dedicated to distributed tracing. It was actually a theme that dominated the conference. The amount of of open-source content that was inspired by the original Google Dapper (2010) paper seems to be coming mainstream. There was another dominant theme of fixing oncall. This was partially set by Alice Goldfuss’s talk on Day 1 and continued throughout the conference. To be honest, I had no idea how bad some people’s on-call shifts are. I’ve certainly done very well during my time at LinkedIn. It does seem that we need to get smarter about what we alert on. There was also a number of talks that boiled down to: “This is how my company monitors”. It was definitely interesting to see the use of open-source stack’s at larger companies and a general tendancy to dislike of paying large sums of money to vendors. Given my position (and privilege), I’ve been able to learn most of the content during my time at LinkedIn. There were however some talks that I walked away from thinking about how I/ LinkedIn can do a better job. Below are some of my favorite talks (in order of presentation). Day 1: The Tidyverse and the Future of the Monitoring Toolchain - John Rauser John gave a great overview of the Tidyverse toolset and the power of the R language. The visualizations he used in his presentation definitely inspired my team on how we can present some of our incident data in a more meaningful way. Day 1: Martyrs on Film: Learning to hate the #oncallselfie - Alice Goldfuss Alice gave a very real presentation on the state of on call and how we shouldn’t tolerate it. Cleverly using #oncallselfie’s on Twitter, she created a narrative on how disruptive oncall can be to our lives and how we shouldn’t tolerate it (for the most part). For anyone who is in a team that gets paged more than 10 times a week, I’d recommend watching. Day 1: Linux debugging tools you’ll love - Julia Evans Julia ran through a number of great Linux debugging techniques and tools that can be used to find problems in your applications. Definitely a lot of tricks for everyone to pick up. Don’t forget to check out her Zines as well at Day 2: Real-time packet analysis at scale - Douglas Creager Douglas (from Google) ran through some interesting techniques for troubleshooting a hypothetical music streaming issues via doing packet analysis. Google created a tool called ‘TraceGraph’ which plots the number of packets (by-type)/ window vs time, to show interruptions in data-flow. Unfortunately he didn’t deep-dive into much ‘at-scale’ detail. Day 3: UX Design and Education for Effective Monitoring tools - Amy Nguyen Amy deep-dived on how you build a body of work that creates an engaging monitoring tool. She did a great job of highlighting anti-patterns in monitoring tools. She went on to give tips on how you build effective UI’s for monitoring systems. Final words Firstly, Kudos to the Monitorama team for running the conference so smoothly given what they had to deal with. Unfortunately, the conference had some competing threads on how you should create a monitoring philosophy which probably didn’t help the smaller companies in attendence. The idea that monitoring is broken is a half-truth at best. We have the best tools we ever have, we just haven’t been able to put a coherent strategy together (this is something I’ll try to blog about next week). My key take-aways are: Provide metrics/ logging/ tracing functionality in frameworks so they are free for developers We need a better way to ingest monitoring data in a sensible/ low-cost manner Need to make it easy to take all of this data and make it explorable/ use-able to everyone. Also, make it consistent as possible!!!! Alert sensibly, don’t get paged for something that can wait 12 hours. You should care about how oncall affects your work and your life outside of work

Monitorama Review Day 3

Hi again, This is today’s notes for Monitorama Day 3. Link to the video is here Today’s Schedule Monitoring in a world where you can’t “fix” most of your systems Errors - Brandon Burton UX Design and Education for Effective Monitoring Tools - Amy Nguyen Automating Dashboard Displays with ASAP - Kexin Rong Monitoring That Cares (The End of User Based Monitoring) - Francois Concil Consistency in Monitoring with Microservices at Lyft - Yann Ramin Critical to Calm: Debugging Distributed Systems - Ian Bennett Managing Logs with a Serverless Cloud - Paul Fisher Distributed Tracing at Uber scale: Creating a treasure map for your monitoring data - Yuri Shkuro Kubernetes-defined monitoring - Gianluca Borello Monitoring in a world where you can’t “fix” most of your systems Errors - Brandon Burton Challenge Git clone failures in Mac environment…was a DNS issue Third party service outages - pipit, rubygems, launchpad PPA’s Stuff changed somewhere…leftpad Can’t always look at logs due to privacy concerns Lots of security/ privacy challenge So where are we: Adding metrics on jobs as trends UX Design and Education for Effective Monitoring Tools - Amy Nguyen Recent projects Tracing D3 Cache for openTSDB Documentation Why should we care about user experience Prevent misunderstandings - not everyone should be an expert at interpreting monitoring data Developer velocity - help people reach conclusions faster Data democracy - you don’t know what questions people want to answer with their own data UX and your situation (pyramid) Team Documentation Tools (this talk) UX and your situation Sharing what you know Education vs intuition Best practices - Use your expertise to determine the most helpful default behavior Potential pitfalls Performance: Low hanging fruit Backend Roll-up data over long time ranges Store latest data in memory (e..g. FB Gorilla paper and Beringei project Add a cache layer Frontend Don’t reload existing data if user changes time window Prevent the user from requesting the data incessantly Lazy-load graphs Designing what your users want Performance exploration Simplicity Automating Dashboard Displays with ASAP - Kexin Rong Talk outline motivation observation our research going fast Problem: Noisy Dashboards How to smooth plots automatically: More informative dashboard visualization Big idea: Smooth your dashboards Why: 38% more accurate, 44% faster response What do my dashboards tell me today Is plotting raw data always the best idea Q: What’s distracting abotu raw data? A: In many cases, spikes dominate the plot Q: What smoothing function should we use A: Moving average works Contstraint: preserve deviations in plots metric: measure kurtosis of the plot Use: scipy ASAP - As smooth as possible - while preserving long-term deviations Use ASAP.js library Going fast: Q: Finding optimal window size: A: Use grid search Monitoring That Cares (The End of User Based Monitoring) - Francois Concil Doesn’t matter what the monitoring system says, the experience is broken for the user “There are three types of lies: Lies, damned lies, and service status pages” You need to talk about monitoring early in the development cycle “The key to not being woken up by monitoring alerts is to fix the cause for alerts” - Somethong on the internet, probably Consistency in Monitoring with Microservices at Lyft - Yann Ramin Approches and techniques to avoid production incidents with hundreds of micro services and diverse teams What are we trying to solve: when developers are oncall with micro services OR scaling operational mindfulness I clicked production deploy and Jenkins went green - Opepertunity to grow operational mindfulness No-one setup a pager duty list before going to production We need alarms on things! Lets copy and paste them from my last service We routinely approach monitoring as operations We don’t have the SRE role - we hire people who understand operations We have system metrics (CollectD, custom scripts) What do we get (with consistency in monitoring) Consistent measurement Consistent visibility Point-to-Point debugging Unified tracing Salt module for orchestratration (orca) provisions resources interacts with pagerduty, ensures a PD service is created makes sure there’s an oncall schedule blocks deploys if these are missing Dashboards: git monorepo ties in with salt dashboards defined in salt every service gets a default dashboard on deploy! Add extra dashboards via Salt Benefits consistent look at feel always got alarms flexibility Critical to Calm: Debugging Distributed Systems - Ian Bennett 3bn metrics emitted per minute Twitter uses YourKit for profiling Peeling the onion - debugging methodology Metrics/ Alerting Tuning Tracing/ Logs Profiling Instrumentation/ Code change When to peel make a single change, rinse, repeat avoid the urge to make too many changes Performance tips keep your code abstracted fixes should be as isolated as possible critical issues, pager angry: don’t panic Gut can be wrong You will get tired You will get frustrated May take days to come to correct fix Some examples of troubleshooting Managing Logs with a Serverless Cloud - Paul Fisher Monitoring the monolith Logs seems like a good best practice Doesn’t scale well Literally burn money via vendors Move from logs to metrics - you shouldn’t need to log into boxes to debug Lyft contrains AWS Avoid vendor lock-in Lyft’s logging pipeline Heka on all hosts Kinesis firehouse kibana proxy to auth Elastalert pagerduty Detailed walkthrough of pipeline Distributed Tracing at Uber scale: Creating a treasure map for your monitoring data - Yuri Shkuro Why Use tracing for dependency tracing Root Cause analysis distibuted transaction monitoring Demo Adding context propogation (tracing) is hard in existing code-bases Solution - Add to frameworks (simple config changes) They must want to use your product - or sticks and carrots Each organization is different - find the best way to implement it measure adoption - trace quality scores ensure requests are being properly traced Kubernetes-defined monitoring - Gianluca Borello Monitoring Kubernetes Best practices ideas proposals 4 things that are harder now (with microservices and kubernetes) Getting the data You (dev) should not be invovled in monitoring instrumentation You (dev) should not be involved in producing anything that’s not a custom metric Collect everything making sense of the data troubleshooting people A lot of tools we have now are not container-aware

Monitorama Review Day 2

Hi all, Continuing yesterday’s notes. Last night there was actually a large power outage in downtown Portland which caused us to changes venues today. These notes are somewhat incomplete, I’ll try to fix them up in the coming days. Thanks again to Sarah Huffman for her notes. Video can be found here Anomalies != Alerts - Betsy Nicols Sarah Huffman notes Now, Pain, Relief, Bonus: Use Case Detection & action need to be separated Because they aren’t: Anomalies = Alerts Pain 1. Alert Fatigue 1. Alerts —> Alert Fatigue Alerts = Anomalies Anomalies —> Alert Fatigue How many anomalies can we reasonably expect? Step 1. Population distribution Step 2. Population Density Step 3. Compute Anomalies/ Day To alert or not to alert decision required for reach anomaly anomalies = TP (true positive) union TN likely #FP >> #TP 2. Seeking needles Difficult to find a strategy to work out when to alert and when not to Relief Basic monitoring pattern Data —> Engine —> Alert Basic semantic context Streaming Async Sync Semantic Model (with analytics) Attribute Discovery Build data model using extra attributes Action Policy Works off what data we have and makes a decision Conditions Scope Actions Takeaways Best monitoring = Math + context preferred strategy anomalies + context = alerts Distributed Tracing: How we got here and wehere we’re going - Ben Sigelman Sarah Huffman notes Why are we here Chapter 1: What is the purpose of monitoring must tell stories get to ‘why’ (ASAP) storytelling got a lot harder recently One story, N storytellers microservices may be here to stay but they broke our old tools transactions are not independent the simple thing basic concurrency async concurrency Measuring symptoms metrics model symtoms well measure what the end user actually experiences aside: get raw timing from opentracing instrumentation There is such thing as too many metrics Chapter 2: Where Introducing Donut Zone: DaaS Microservice-oriented Use Open tracing Demo of open Trace All sampling must be all/ nothing per transaction Not all latency issues are due to contention A new way to diagnose Infer or assign contention ID’s (mutex’s, db tables, network links) Tag Spans with each contention ID they encounter automated root-cause for contention More demo’s of open trace Science! Science Harder: How we reinvented ourselves to be data literate, experiment driven and ship faster - Laura Thomson & Rebecca Weiss Sarah Huffman notes Decision-making - without evidence How do you find information about how your releases change the use of the product Browser = App? Not quite. Need to test against various OS/ Languages/ Failure to abstract Build a system to answer questions Resulted in different data collection systems (blocklist ping vs telemetry) Working around privacy can be hard Unified telemetry: One infrastructure, many types of probes and pings Many type of events Unify mental model Transparency: Results must be reproducible as a URL Audit a number all the way down to the code Open science, open methodology Push more data you can Experimenting with experiments Update daily on any release channel real sampling flip a preference, install a fature, collect data multiple experiments in flight at once Data will shape the web. What kind of web do you want Real-time packet analysis at scale - Douglas Creager Sarah Huffman notes 2 Goals: Packet captures are useful You don’t make to make fundemental changes to your monitoring stack Example scenario Streaming a song via application You get drops Infer throughput problems Logs aren’t usually enough to help solve the problem In-app RUM can help Need to look at the network itself Flow can sometimes be helpful, but unlikely in this case Need to see inside the connections - Packet capture Don’t need the look at the actual payload Tool at Google - TraceGraph Time vs packets graphs packets (by type) and windows to show problems Ok so the graph visualizes the problem - How do we solve it Looks like we have buffer bloat - can’t fix that problem in ISPs TCPDump to the rescue Google streams packet-captures to a central server for processing Instrumenting The Rest Of the Company: Hunting for Useful Metrics - Eric Sigler Sarah Huffman’s notes We have problem $foo, we are going to do $bar What data did you use to understand $foo? And how will you know if $bar improved anything “Without data, you’re just another person with an opinion” Example: We have a chatbot to do everything that we don’t understand Takeaway: Look for ways to reverse engineer existing metrics Useful metrics are everywhere. You aren’t alone in digging for metrics. Existing tools can be repurposed Whispers in the Chaos: Monitoring Weak Signals - J Paul Reed Monitoring @ CERN - Pedro Andrade Sarah Huffman’s notes Introduction to CERN 40GB data/s x 4 to collect during testing Where is the data stored: Main site Extension site (with 3x100Gb/s link) - Budapest use standard commodity software/ hardware 14k servers, 200k cores 190PB data stored Pedro’s team provides monitoring for the data storage/ computation Use Open Source tools Collectd Kafka as transport Spark for processing HDFS long term storage Some data in ES/ InfluxDB Kibana/ Grafana for visualizing openstack VM’s - All monitoring services run on this Config done with Puppet I volunteer as Tribute: The future of Oncall: Bridget Kromohout Sarah Huffman’s notes How many people dread phone ringing Change is the only constant

Monitorama Review Day 1

Hi all, I wanted to write some super rough notes of the various Monitorama talks for those (especially my peers) who weren’t able to attend this year. I’d like to give a shout-out to Sarah Huffman who drew notes from the presentations today Note: You can watch the stream here Note: I’ve done my best to put the key take-aways into each presenters talk (with my own opinions mixed in where noted). If you feel like I’ve made an error in representing your talk, please let me know and I’ll edit it. Today’s Schedule: The Tidyverse and the Future of the Monitoring toolchains - John Rauser Martyrs on Film: Learning to hate the #oncallselfie - Alice Goldfuss Monitoring in the Enterprise - Bryan Liles Yo Dawg: Monitoring Monitoring Sytems at Netflix - Roy Rapoport Our Many Monsters - Megan Actil Tracing Production Services at Stripe - Aditya Mukerjee Linux debugging tools you’ll love - Julia Evans Instrumenting SmartTV’s and Smartphones in the Netflix app for modeling the Internet - Guy Cirino Monitoring: A Post Mortem - Charity Majors The Vasa: Redux - Pete Cheslock The Tidyverse and the Future of the Monitoring toolchain - John Rauser Sarah Huffman Notes R-language Tidyverse - “set of shared principles” The ideas in the tidyverse are going to transsform everything having to do with data manipulation and visualization ggplot2 - compact and expressive (vs D3 lib) way to draw plots Dataframe - Tibble (nested data frame) flexible, uniform data container R language - Can pipe datasets and chain operations together DPLYR - will displace SQL like languages for data-analytics work. DSL for data manipulation How to get started - RStudio Goal: Inspire tool makers - programming as a way of thinking “Toolmakers should look to the tidyverse for inspiration” **Martyrs of Film: Learning to hate the #oncallselfie - Alice Goldfuss ** Sarah Huffman Notes Benfits of oncall Hones troubleshooting Forces you to identify the weak points in your systems Teaches you what is and isn’t production-ready Team bonding Learn to hate the on call selfie - people complained on Twitter I get paged alot (noted via #oncallselfie) We use oncall outages as war-stories - and be hero’s Action scenes stop the plot Red flags (from alice’s survey) Too few owning too much Symptoms of larger problems: bumping thresholds snooze pages delays Poor Systems visibility/ Team visibility Too many pages 17% of people said 100+ (worst case) 1.1% people got 25-50 (best case) How do we get there Cleanup - actionable alerts Something breaks Customers notice I am I the best person to fix it I need to fix it immediately (side note) Cluster alerts - Get 1 alert for 50 servers rather than 50 alerts for 50 servers Devs oncall - More obligated to fix issues Companies who actively look at oncall numbers Heroic Etsy Github Monitorings things at your day job (Monitoring int he enterprise ) - Bryan Liles Sarah Huffman Notes Steps 1. Pick a tool 2. Pick another tool 3. Complain How do they know what to monitor How do they know when changes happen New problem: what should you monitor New problem: what should you alert on New problem: who should you alert New problem: what tools should I use New problem: how do you monitor your monitoring tools Step back and answer: Jow do you know if your stack works How do you know if your stack works well for others SLI - Service level indicator - measurement of some aspect of your service SLO - service level objective - target value SLA - service level agreement - what level of service have you and your consumers agreed to White-box vs black box monitoring Black box: Garabage in —> service —> garbage out White box: service (memory/ cpu/ secret sauce) How do you know if you’re meeting SLA’s/ SLO’s? Logs Structured log (json logs) Aggregate (send them somewhere centrally) Tell a story Metrics One or more numbers give details about something (SLI) Metrics are combined to create time-series Tracing: Single activity in your stack touches multiple resources MK Note: Brian is talking on open-tracing at Velocity Health endpoints E.g. GET /healthz {“database”: “ok”, “foo”: “ok”, “queue_length” :”ok”, “updated at”: <datetime>} do you know what’s going on Logs Metrics Tracing Other things e.g. what happened at 3pm yesterday logs, metrics, tracing, other things paint a picture How do we ensure greater visibility: Central point of contact for alerts Research tooling practices for teams What types of monitoring tools do we need Philosophies: USE: utilization, saturation and errors RED: Rate, error (date), durations (distribution) - Brendan Gregg Four golden signals: (latency, traffic, errs and saturation) - Google Yo Dawg: Monitoring Monitoring systems at Netflix - Roy Rapoport Sarah Huffman Notes A hero’s journey - product development lifecycle This will scale for atleast a month Monitoring ain’t alerting Alerting - output’s decisions and opinions “everything counts in large amounts” “the graph on the wall tells the story….” 20-25k alerts a day at netflix Have another monitoring system to monitor your monitoring system (Hot/ Cold) watcher Question “Is one tv show/ movie responsible for more Netflix Outages” - Alice Goldfish Our Many Monsters - Megan Anctil Sarah Huffman Notes Why, metrics, logging, alerting Vendor vs Non-vendor Business need Cost!!!! vizOps at Slack - 1-5 FTE Deep-dive into Slack implementations of: Monitoring: Graphite/ Granfana Logging: ELK Alerting: Icigna Cost analysis for above platforms Lessons leant Usability - escalation info must be valuable Creation - must be easy Key takeway: $$$"Is it worth it” is the time worth it Tracing Production Services at Stripe - Aditya Mukerjee Sarah Huffman Notes Tracing is about more than HTTP requests Venuer - “If you need to look at logs, there’s a gap in your observability tools” Metrics - no context Logs - hard to aggregate Request traces - require planning What’s the differennce between metrics/ logs/ tracing (if you squint, it’s hard to tell them apart) What if we could have all three, all the time??? Standard sensor format - Easier to do all three Intelligent metric pipelines (before the monitoring applications) Linux debugging tools you’ll love - Julia Evans Sarah Huffman Notes Accompanying Zine Starting off: read code, add print statements, know language Wizard tools strace tcpdump etc gdb perf ebpf ftrace Ask your OS what your progreams are doing strace can make your applications run 50x slower MK Note: Julia walked though some examples where time/ strace/ tcpdump/ ngrep were all helpful Instrumenting SmartTV’s and smartphones in the netflix app for modeling the internet - Guy cirino Sarah Huffman Notes Making the internet fast is slow faster - better networking slower - broader reach/ congestion Don’t wait for it, measure it and deal Working app > feature rich app We need to know what the internet looks like, without averages Logging anti-patterns Averages - can’t see the distribution, outliers heavily distort Sampling missed data rare events RUM data Don’t guess what the network is doing - measure it! Monitoring: A Post Mortem - Charity Majors Sarah Huffman Notes The Vasa: Redux - Pete Cheslock Sarah Huffman Notes Sponsor talks (only calling out what I choose to) Netsil Application maps Gives you visibility of your topology Techniques APM Tracking (zipkin) proxies OS tracing (pcap/ ePBF) MK Note: Not sure how this works for encrypted data streams Datadog They are hiring Apparently is everyone else What do you look for Knowledge Tools Experience Suggestions Knowledge Write blog peices Meetups (Knowledge) Tools Open source Experience Internships Share your knowledge Share your tools share your experience

Publication Updates (March 11 2017)

Hi all, I just updated my publications page with my APRICOT presentation from earlier in the month. If you’re coming to SRECon Americas 2017 this coming week, come and check out my presentations: Traffic shift: Avoiding disasters at scale Reducing MTTR and false escalations: Event Correlation at LinkedIn

2017 Devops Conferences

Hat-tip to Sarah Drasner who came up with a list of 2017 Front-end conferences that inspired this list. If I have missed any, please tweet me at @matrixtek and I’ll review it being added to the list. Note: There is a larger list over here that lists a number of the smaller conferences January None listed February Devops Days Charlotte March Devops Days Baltimore Devops Days Vancouver Elasticon 2017 SRECon17 Americas Strata + Hadoop World 2017 April Devops Days Atlanta Devops Days Seattle May ApacheCon Devops Days Austin Devops Days Salt Lake City Devops Days Toronto Devops Days Zurich Monitorama PyCon17 US SRECon17 Asia/ Australia June Devops Days Amsterdam Velocity San Jose July Devops Days Minneapolis GopherCon August SRECon17 Europe/ Middle East/ Africa September Devops Days Detroit October LISA17 Velocity London Velocity New York November None listed December None listed No Date Listed SaltConf Couchbase Connect 17