Monitorama 2017 Summary

The past few days, I’ve been in Portland for the 2017 Monitorama conference. The conference had to literally fail-over between venues Monday night due to a large power-outage across the city. Monitorama brought together a a diverse crowd of engineers and vendors to spend 3 days discussing on call, logging, metrics, tracing and the philosophy of it all. You can find the schedule here And the video’s for each day: Day 1 Day 2 Day 3 ** ** Content Summary For some reason, there was a large amount of content dedicated to distributed tracing. It was actually a theme that dominated the conference. The amount of of open-source content that was inspired by the original Google Dapper (2010) paper seems to be coming mainstream. There was another dominant theme of fixing oncall. This was partially set by Alice Goldfuss’s talk on Day 1 and continued throughout the conference. To be honest, I had no idea how bad some people’s on-call shifts are. I’ve certainly done very well during my time at LinkedIn. It does seem that we need to get smarter about what we alert on. There was also a number of talks that boiled down to: “This is how my company monitors”. It was definitely interesting to see the use of open-source stack’s at larger companies and a general tendancy to dislike of paying large sums of money to vendors. Given my position (and privilege), I’ve been able to learn most of the content during my time at LinkedIn. There were however some talks that I walked away from thinking about how I/ LinkedIn can do a better job. Below are some of my favorite talks (in order of presentation). Day 1: The Tidyverse and the Future of the Monitoring Toolchain - John Rauser John gave a great overview of the Tidyverse toolset and the power of the R language. The visualizations he used in his presentation definitely inspired my team on how we can present some of our incident data in a more meaningful way. Day 1: Martyrs on Film: Learning to hate the #oncallselfie - Alice Goldfuss Alice gave a very real presentation on the state of on call and how we shouldn’t tolerate it. Cleverly using #oncallselfie’s on Twitter, she created a narrative on how disruptive oncall can be to our lives and how we shouldn’t tolerate it (for the most part). For anyone who is in a team that gets paged more than 10 times a week, I’d recommend watching. Day 1: Linux debugging tools you’ll love - Julia Evans Julia ran through a number of great Linux debugging techniques and tools that can be used to find problems in your applications. Definitely a lot of tricks for everyone to pick up. Don’t forget to check out her Zines as well at Day 2: Real-time packet analysis at scale - Douglas Creager Douglas (from Google) ran through some interesting techniques for troubleshooting a hypothetical music streaming issues via doing packet analysis. Google created a tool called ‘TraceGraph’ which plots the number of packets (by-type)/ window vs time, to show interruptions in data-flow. Unfortunately he didn’t deep-dive into much ‘at-scale’ detail. Day 3: UX Design and Education for Effective Monitoring tools - Amy Nguyen Amy deep-dived on how you build a body of work that creates an engaging monitoring tool. She did a great job of highlighting anti-patterns in monitoring tools. She went on to give tips on how you build effective UI’s for monitoring systems. Final words Firstly, Kudos to the Monitorama team for running the conference so smoothly given what they had to deal with. Unfortunately, the conference had some competing threads on how you should create a monitoring philosophy which probably didn’t help the smaller companies in attendence. The idea that monitoring is broken is a half-truth at best. We have the best tools we ever have, we just haven’t been able to put a coherent strategy together (this is something I’ll try to blog about next week). My key take-aways are: Provide metrics/ logging/ tracing functionality in frameworks so they are free for developers We need a better way to ingest monitoring data in a sensible/ low-cost manner Need to make it easy to take all of this data and make it explorable/ use-able to everyone. Also, make it consistent as possible!!!! Alert sensibly, don’t get paged for something that can wait 12 hours. You should care about how oncall affects your work and your life outside of work

Monitorama Review Day 3

Hi again, This is today’s notes for Monitorama Day 3. Link to the video is here Today’s Schedule Monitoring in a world where you can’t “fix” most of your systems Errors - Brandon Burton UX Design and Education for Effective Monitoring Tools - Amy Nguyen Automating Dashboard Displays with ASAP - Kexin Rong Monitoring That Cares (The End of User Based Monitoring) - Francois Concil Consistency in Monitoring with Microservices at Lyft - Yann Ramin Critical to Calm: Debugging Distributed Systems - Ian Bennett Managing Logs with a Serverless Cloud - Paul Fisher Distributed Tracing at Uber scale: Creating a treasure map for your monitoring data - Yuri Shkuro Kubernetes-defined monitoring - Gianluca Borello Monitoring in a world where you can’t “fix” most of your systems Errors - Brandon Burton Challenge Git clone failures in Mac environment…was a DNS issue Third party service outages - pipit, rubygems, launchpad PPA’s Stuff changed somewhere…leftpad Can’t always look at logs due to privacy concerns Lots of security/ privacy challenge So where are we: Adding metrics on jobs as trends UX Design and Education for Effective Monitoring Tools - Amy Nguyen Recent projects Tracing D3 Cache for openTSDB Documentation Why should we care about user experience Prevent misunderstandings - not everyone should be an expert at interpreting monitoring data Developer velocity - help people reach conclusions faster Data democracy - you don’t know what questions people want to answer with their own data UX and your situation (pyramid) Team Documentation Tools (this talk) UX and your situation Sharing what you know Education vs intuition Best practices - Use your expertise to determine the most helpful default behavior Potential pitfalls Performance: Low hanging fruit Backend Roll-up data over long time ranges Store latest data in memory (e..g. FB Gorilla paper and Beringei project Add a cache layer Frontend Don’t reload existing data if user changes time window Prevent the user from requesting the data incessantly Lazy-load graphs Designing what your users want Performance exploration Simplicity Automating Dashboard Displays with ASAP - Kexin Rong Talk outline motivation observation our research going fast Problem: Noisy Dashboards How to smooth plots automatically: More informative dashboard visualization Big idea: Smooth your dashboards Why: 38% more accurate, 44% faster response What do my dashboards tell me today Is plotting raw data always the best idea Q: What’s distracting abotu raw data? A: In many cases, spikes dominate the plot Q: What smoothing function should we use A: Moving average works Contstraint: preserve deviations in plots metric: measure kurtosis of the plot Use: scipy ASAP - As smooth as possible - while preserving long-term deviations Use ASAP.js library Going fast: Q: Finding optimal window size: A: Use grid search Monitoring That Cares (The End of User Based Monitoring) - Francois Concil Doesn’t matter what the monitoring system says, the experience is broken for the user “There are three types of lies: Lies, damned lies, and service status pages” You need to talk about monitoring early in the development cycle “The key to not being woken up by monitoring alerts is to fix the cause for alerts” - Somethong on the internet, probably Consistency in Monitoring with Microservices at Lyft - Yann Ramin Approches and techniques to avoid production incidents with hundreds of micro services and diverse teams What are we trying to solve: when developers are oncall with micro services OR scaling operational mindfulness I clicked production deploy and Jenkins went green - Opepertunity to grow operational mindfulness No-one setup a pager duty list before going to production We need alarms on things! Lets copy and paste them from my last service We routinely approach monitoring as operations We don’t have the SRE role - we hire people who understand operations We have system metrics (CollectD, custom scripts) What do we get (with consistency in monitoring) Consistent measurement Consistent visibility Point-to-Point debugging Unified tracing Salt module for orchestratration (orca) provisions resources interacts with pagerduty, ensures a PD service is created makes sure there’s an oncall schedule blocks deploys if these are missing Dashboards: git monorepo ties in with salt dashboards defined in salt every service gets a default dashboard on deploy! Add extra dashboards via Salt Benefits consistent look at feel always got alarms flexibility Critical to Calm: Debugging Distributed Systems - Ian Bennett 3bn metrics emitted per minute Twitter uses YourKit for profiling Peeling the onion - debugging methodology Metrics/ Alerting Tuning Tracing/ Logs Profiling Instrumentation/ Code change When to peel make a single change, rinse, repeat avoid the urge to make too many changes Performance tips keep your code abstracted fixes should be as isolated as possible critical issues, pager angry: don’t panic Gut can be wrong You will get tired You will get frustrated May take days to come to correct fix Some examples of troubleshooting Managing Logs with a Serverless Cloud - Paul Fisher Monitoring the monolith Logs seems like a good best practice Doesn’t scale well Literally burn money via vendors Move from logs to metrics - you shouldn’t need to log into boxes to debug Lyft contrains AWS Avoid vendor lock-in Lyft’s logging pipeline Heka on all hosts Kinesis firehouse kibana proxy to auth Elastalert pagerduty Detailed walkthrough of pipeline Distributed Tracing at Uber scale: Creating a treasure map for your monitoring data - Yuri Shkuro Why Use tracing for dependency tracing Root Cause analysis distibuted transaction monitoring Demo Adding context propogation (tracing) is hard in existing code-bases Solution - Add to frameworks (simple config changes) They must want to use your product - or sticks and carrots Each organization is different - find the best way to implement it measure adoption - trace quality scores ensure requests are being properly traced Kubernetes-defined monitoring - Gianluca Borello Monitoring Kubernetes Best practices ideas proposals 4 things that are harder now (with microservices and kubernetes) Getting the data You (dev) should not be invovled in monitoring instrumentation You (dev) should not be involved in producing anything that’s not a custom metric Collect everything making sense of the data troubleshooting people A lot of tools we have now are not container-aware

Monitorama Review Day 2

Hi all, Continuing yesterday’s notes. Last night there was actually a large power outage in downtown Portland which caused us to changes venues today. These notes are somewhat incomplete, I’ll try to fix them up in the coming days. Thanks again to Sarah Huffman for her notes. Video can be found here Anomalies != Alerts - Betsy Nicols Sarah Huffman notes Now, Pain, Relief, Bonus: Use Case Detection & action need to be separated Because they aren’t: Anomalies = Alerts Pain 1. Alert Fatigue 1. Alerts —> Alert Fatigue Alerts = Anomalies Anomalies —> Alert Fatigue How many anomalies can we reasonably expect? Step 1. Population distribution Step 2. Population Density Step 3. Compute Anomalies/ Day To alert or not to alert decision required for reach anomaly anomalies = TP (true positive) union TN likely #FP >> #TP 2. Seeking needles Difficult to find a strategy to work out when to alert and when not to Relief Basic monitoring pattern Data —> Engine —> Alert Basic semantic context Streaming Async Sync Semantic Model (with analytics) Attribute Discovery Build data model using extra attributes Action Policy Works off what data we have and makes a decision Conditions Scope Actions Takeaways Best monitoring = Math + context preferred strategy anomalies + context = alerts Distributed Tracing: How we got here and wehere we’re going - Ben Sigelman Sarah Huffman notes Why are we here Chapter 1: What is the purpose of monitoring must tell stories get to ‘why’ (ASAP) storytelling got a lot harder recently One story, N storytellers microservices may be here to stay but they broke our old tools transactions are not independent the simple thing basic concurrency async concurrency Measuring symptoms metrics model symtoms well measure what the end user actually experiences aside: get raw timing from opentracing instrumentation There is such thing as too many metrics Chapter 2: Where Introducing Donut Zone: DaaS Microservice-oriented Use Open tracing Demo of open Trace All sampling must be all/ nothing per transaction Not all latency issues are due to contention A new way to diagnose Infer or assign contention ID’s (mutex’s, db tables, network links) Tag Spans with each contention ID they encounter automated root-cause for contention More demo’s of open trace Science! Science Harder: How we reinvented ourselves to be data literate, experiment driven and ship faster - Laura Thomson & Rebecca Weiss Sarah Huffman notes Decision-making - without evidence How do you find information about how your releases change the use of the product Browser = App? Not quite. Need to test against various OS/ Languages/ Failure to abstract Build a system to answer questions Resulted in different data collection systems (blocklist ping vs telemetry) Working around privacy can be hard Unified telemetry: One infrastructure, many types of probes and pings Many type of events Unify mental model Transparency: Results must be reproducible as a URL Audit a number all the way down to the code Open science, open methodology Push more data you can Experimenting with experiments Update daily on any release channel real sampling flip a preference, install a fature, collect data multiple experiments in flight at once Data will shape the web. What kind of web do you want Real-time packet analysis at scale - Douglas Creager Sarah Huffman notes 2 Goals: Packet captures are useful You don’t make to make fundemental changes to your monitoring stack Example scenario Streaming a song via application You get drops Infer throughput problems Logs aren’t usually enough to help solve the problem In-app RUM can help Need to look at the network itself Flow can sometimes be helpful, but unlikely in this case Need to see inside the connections - Packet capture Don’t need the look at the actual payload Tool at Google - TraceGraph Time vs packets graphs packets (by type) and windows to show problems Ok so the graph visualizes the problem - How do we solve it Looks like we have buffer bloat - can’t fix that problem in ISPs TCPDump to the rescue Google streams packet-captures to a central server for processing Instrumenting The Rest Of the Company: Hunting for Useful Metrics - Eric Sigler Sarah Huffman’s notes We have problem $foo, we are going to do $bar What data did you use to understand $foo? And how will you know if $bar improved anything “Without data, you’re just another person with an opinion” Example: We have a chatbot to do everything that we don’t understand Takeaway: Look for ways to reverse engineer existing metrics Useful metrics are everywhere. You aren’t alone in digging for metrics. Existing tools can be repurposed Whispers in the Chaos: Monitoring Weak Signals - J Paul Reed Monitoring @ CERN - Pedro Andrade Sarah Huffman’s notes Introduction to CERN 40GB data/s x 4 to collect during testing Where is the data stored: Main site Extension site (with 3x100Gb/s link) - Budapest use standard commodity software/ hardware 14k servers, 200k cores 190PB data stored Pedro’s team provides monitoring for the data storage/ computation Use Open Source tools Collectd Kafka as transport Spark for processing HDFS long term storage Some data in ES/ InfluxDB Kibana/ Grafana for visualizing openstack VM’s - All monitoring services run on this Config done with Puppet I volunteer as Tribute: The future of Oncall: Bridget Kromohout Sarah Huffman’s notes How many people dread phone ringing Change is the only constant

Monitorama Review Day 1

Hi all, I wanted to write some super rough notes of the various Monitorama talks for those (especially my peers) who weren’t able to attend this year. I’d like to give a shout-out to Sarah Huffman who drew notes from the presentations today Note: You can watch the stream here Note: I’ve done my best to put the key take-aways into each presenters talk (with my own opinions mixed in where noted). If you feel like I’ve made an error in representing your talk, please let me know and I’ll edit it. Today’s Schedule: The Tidyverse and the Future of the Monitoring toolchains - John Rauser Martyrs on Film: Learning to hate the #oncallselfie - Alice Goldfuss Monitoring in the Enterprise - Bryan Liles Yo Dawg: Monitoring Monitoring Sytems at Netflix - Roy Rapoport Our Many Monsters - Megan Actil Tracing Production Services at Stripe - Aditya Mukerjee Linux debugging tools you’ll love - Julia Evans Instrumenting SmartTV’s and Smartphones in the Netflix app for modeling the Internet - Guy Cirino Monitoring: A Post Mortem - Charity Majors The Vasa: Redux - Pete Cheslock The Tidyverse and the Future of the Monitoring toolchain - John Rauser Sarah Huffman Notes R-language Tidyverse - “set of shared principles” The ideas in the tidyverse are going to transsform everything having to do with data manipulation and visualization ggplot2 - compact and expressive (vs D3 lib) way to draw plots Dataframe - Tibble (nested data frame) flexible, uniform data container R language - Can pipe datasets and chain operations together DPLYR - will displace SQL like languages for data-analytics work. DSL for data manipulation How to get started - RStudio Goal: Inspire tool makers - programming as a way of thinking “Toolmakers should look to the tidyverse for inspiration” **Martyrs of Film: Learning to hate the #oncallselfie - Alice Goldfuss ** Sarah Huffman Notes Benfits of oncall Hones troubleshooting Forces you to identify the weak points in your systems Teaches you what is and isn’t production-ready Team bonding Learn to hate the on call selfie - people complained on Twitter I get paged alot (noted via #oncallselfie) We use oncall outages as war-stories - and be hero’s Action scenes stop the plot Red flags (from alice’s survey) Too few owning too much Symptoms of larger problems: bumping thresholds snooze pages delays Poor Systems visibility/ Team visibility Too many pages 17% of people said 100+ (worst case) 1.1% people got 25-50 (best case) How do we get there Cleanup - actionable alerts Something breaks Customers notice I am I the best person to fix it I need to fix it immediately (side note) Cluster alerts - Get 1 alert for 50 servers rather than 50 alerts for 50 servers Devs oncall - More obligated to fix issues Companies who actively look at oncall numbers Heroic Etsy Github Monitorings things at your day job (Monitoring int he enterprise ) - Bryan Liles Sarah Huffman Notes Steps 1. Pick a tool 2. Pick another tool 3. Complain How do they know what to monitor How do they know when changes happen New problem: what should you monitor New problem: what should you alert on New problem: who should you alert New problem: what tools should I use New problem: how do you monitor your monitoring tools Step back and answer: Jow do you know if your stack works How do you know if your stack works well for others SLI - Service level indicator - measurement of some aspect of your service SLO - service level objective - target value SLA - service level agreement - what level of service have you and your consumers agreed to White-box vs black box monitoring Black box: Garabage in —> service —> garbage out White box: service (memory/ cpu/ secret sauce) How do you know if you’re meeting SLA’s/ SLO’s? Logs Structured log (json logs) Aggregate (send them somewhere centrally) Tell a story Metrics One or more numbers give details about something (SLI) Metrics are combined to create time-series Tracing: Single activity in your stack touches multiple resources MK Note: Brian is talking on open-tracing at Velocity Health endpoints E.g. GET /healthz {“database”: “ok”, “foo”: “ok”, “queue_length” :”ok”, “updated at”: <datetime>} do you know what’s going on Logs Metrics Tracing Other things e.g. what happened at 3pm yesterday logs, metrics, tracing, other things paint a picture How do we ensure greater visibility: Central point of contact for alerts Research tooling practices for teams What types of monitoring tools do we need Philosophies: USE: utilization, saturation and errors RED: Rate, error (date), durations (distribution) - Brendan Gregg Four golden signals: (latency, traffic, errs and saturation) - Google Yo Dawg: Monitoring Monitoring systems at Netflix - Roy Rapoport Sarah Huffman Notes A hero’s journey - product development lifecycle This will scale for atleast a month Monitoring ain’t alerting Alerting - output’s decisions and opinions “everything counts in large amounts” “the graph on the wall tells the story….” 20-25k alerts a day at netflix Have another monitoring system to monitor your monitoring system (Hot/ Cold) watcher Question “Is one tv show/ movie responsible for more Netflix Outages” - Alice Goldfish Our Many Monsters - Megan Anctil Sarah Huffman Notes Why, metrics, logging, alerting Vendor vs Non-vendor Business need Cost!!!! vizOps at Slack - 1-5 FTE Deep-dive into Slack implementations of: Monitoring: Graphite/ Granfana Logging: ELK Alerting: Icigna Cost analysis for above platforms Lessons leant Usability - escalation info must be valuable Creation - must be easy Key takeway: $$$"Is it worth it” is the time worth it Tracing Production Services at Stripe - Aditya Mukerjee Sarah Huffman Notes Tracing is about more than HTTP requests Venuer - “If you need to look at logs, there’s a gap in your observability tools” Metrics - no context Logs - hard to aggregate Request traces - require planning What’s the differennce between metrics/ logs/ tracing (if you squint, it’s hard to tell them apart) What if we could have all three, all the time??? Standard sensor format - Easier to do all three Intelligent metric pipelines (before the monitoring applications) Linux debugging tools you’ll love - Julia Evans Sarah Huffman Notes Accompanying Zine Starting off: read code, add print statements, know language Wizard tools strace tcpdump etc gdb perf ebpf ftrace Ask your OS what your progreams are doing strace can make your applications run 50x slower MK Note: Julia walked though some examples where time/ strace/ tcpdump/ ngrep were all helpful Instrumenting SmartTV’s and smartphones in the netflix app for modeling the internet - Guy cirino Sarah Huffman Notes Making the internet fast is slow faster - better networking slower - broader reach/ congestion Don’t wait for it, measure it and deal Working app > feature rich app We need to know what the internet looks like, without averages Logging anti-patterns Averages - can’t see the distribution, outliers heavily distort Sampling missed data rare events RUM data Don’t guess what the network is doing - measure it! Monitoring: A Post Mortem - Charity Majors Sarah Huffman Notes The Vasa: Redux - Pete Cheslock Sarah Huffman Notes Sponsor talks (only calling out what I choose to) Netsil Application maps Gives you visibility of your topology Techniques APM Tracking (zipkin) proxies OS tracing (pcap/ ePBF) MK Note: Not sure how this works for encrypted data streams Datadog They are hiring Apparently is everyone else What do you look for Knowledge Tools Experience Suggestions Knowledge Write blog peices Meetups (Knowledge) Tools Open source Experience Internships Share your knowledge Share your tools share your experience