Monitorama Review Day 1

Hi all,

I wanted to write up some super rough notes on the various Monitorama talks for those (especially my peers) who weren’t able to attend this year. I’d like to give a shout-out to Sarah Huffman, who drew notes from the presentations today.

Note: You can watch the stream here

Note: I’ve done my best to capture the key takeaways from each presenter’s talk (with my own opinions mixed in where noted). If you feel I’ve made an error in representing your talk, please let me know and I’ll edit it.

Today’s Schedule:

  • The Tidyverse and the Future of the Monitoring toolchain - John Rauser
  • Martyrs on Film: Learning to hate the #oncallselfie - Alice Goldfuss
  • Monitoring in the Enterprise - Bryan Liles
  • Yo Dawg: Monitoring Monitoring Systems at Netflix - Roy Rapoport
  • Our Many Monsters - Megan Anctil
  • Tracing Production Services at Stripe - Aditya Mukerjee
  • Linux debugging tools you’ll love - Julia Evans
  • Instrumenting SmartTV’s and Smartphones in the Netflix app for modeling the Internet - Guy Cirino
  • Monitoring: A Post Mortem - Charity Majors
  • The Vasa: Redux - Pete Cheslock

The Tidyverse and the Future of the Monitoring toolchain - John Rauser

  • Sarah Huffman Notes
  • R-language
  • Tidyverse - “set of shared principles”
  • The ideas in the tidyverse are going to transform everything having to do with data manipulation and visualization
  • ggplot2 - compact and expressive (vs D3 lib) way to draw plots
  • Dataframe - Tibble (nested data frame)
    • flexible, uniform data container
  • R language - Can pipe datasets and chain operations together
  • dplyr - will displace SQL-like languages for data-analytics work.
    • DSL for data manipulation
  • How to get started - RStudio
  • Goal: Inspire tool makers - programming as a way of thinking
  • “Toolmakers should look to the tidyverse for inspiration”
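
To make the "pipe datasets and chain operations together" idea concrete, here is a minimal sketch in plain Python rather than R. It is not dplyr: the `pipe`, `filter_rows` and `group_mean` helpers and the sample rows are all made up for illustration, and only mimic the shape of a filter, group and summarise chain.

```python
from collections import defaultdict
from functools import reduce
from statistics import mean


def pipe(data, *steps):
    """Feed data through each step in order (hypothetical stand-in for a pipe operator)."""
    return reduce(lambda acc, step: step(acc), steps, data)


def filter_rows(predicate):
    """Return a step that keeps only the rows matching the predicate."""
    return lambda rows: [row for row in rows if predicate(row)]


def group_mean(key, value):
    """Return a step that groups rows by `key` and averages `value` per group."""
    def step(rows):
        groups = defaultdict(list)
        for row in rows:
            groups[row[key]].append(row[value])
        return {group: mean(values) for group, values in groups.items()}
    return step


# Made-up request samples: one dict per row, like a tiny data frame.
samples = [
    {"host": "web1", "status": 200, "latency_ms": 120},
    {"host": "web1", "status": 500, "latency_ms": 900},
    {"host": "web2", "status": 200, "latency_ms": 80},
    {"host": "web1", "status": 200, "latency_ms": 100},
]

mean_ok_latency = pipe(
    samples,
    filter_rows(lambda row: row["status"] == 200),
    group_mean("host", "latency_ms"),
)
print(mean_ok_latency)
```

Each step takes a dataset and returns a dataset, so steps can be reordered or reused without touching the others. On the sample rows, web1 averages 110 ms and web2 averages 80 ms once the failed request is filtered out.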

Martyrs on Film: Learning to hate the #oncallselfie - Alice Goldfuss

  • Sarah Huffman Notes

  • Benefits of oncall

    • Hones troubleshooting
    • Forces you to identify the weak points in your systems
    • Teaches you what is and isn’t production-ready
    • Team bonding
  • Learn to hate the oncall selfie - people complain on Twitter about getting paged a lot (tagged #oncallselfie)

  • We use oncall outages as war stories - and get to be heroes

    • Action scenes stop the plot
  • Red flags (from Alice’s survey)

    • Too few owning too much
    • Symptoms of larger problems:
      • bumping thresholds
      • snooze pages
      • delays
      • Poor Systems visibility/ Team visibility
  • Too many pages

    • 17% of people said 100+ (worst case)
    • 1.1% of people got 25-50 (best case)
  • How do we get there

    • Cleanup - actionable alerts

      • Something breaks
      • Customers notice
      • I am the best person to fix it
      • I need to fix it immediately
    • (side note) Cluster alerts - Get 1 alert for 50 servers rather than 50 alerts for 50 servers

  • Devs oncall - More obligated to fix issues

  • Companies who actively look at oncall numbers

    • Heroku
    • Etsy
    • GitHub
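
The "actionable alerts" checklist and the cluster-alert side note above can be sketched together. This is a rough illustration only, not anything shown in the talk: the `Alert` fields, check names and host names are made up, and real paging tools will model this differently.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Alert:
    """A fired alert (something broke), with hypothetical triage flags."""
    check: str              # e.g. "disk_full"
    host: str
    customers_notice: bool  # do customers notice?
    owner_can_fix: bool     # is the person paged the best one to fix it?
    urgent: bool            # does it need fixing immediately?


def is_actionable(alert):
    """Page only when every criterion from the checklist holds."""
    return alert.customers_notice and alert.owner_can_fix and alert.urgent


def cluster_pages(alerts):
    """Collapse actionable alerts into one page per check, listing affected hosts."""
    hosts_by_check = defaultdict(list)
    for alert in alerts:
        if is_actionable(alert):
            hosts_by_check[alert.check].append(alert.host)
    return {check: sorted(hosts) for check, hosts in hosts_by_check.items()}


# 50 servers hit the same problem, plus one alert that can wait until morning.
alerts = [Alert("disk_full", f"web{i:02d}", True, True, True) for i in range(50)]
alerts.append(Alert("cert_expires_in_30d", "web01", False, True, False))

pages = cluster_pages(alerts)
print(len(pages), "page(s) for", len(alerts), "alerts")
```

Here 51 alerts turn into a single page that lists the 50 affected hosts, and the certificate warning never wakes anyone up.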

Monitoring things at your day job (Monitoring in the Enterprise) - Bryan Liles

  • Sarah Huffman Notes
  • Steps
    • 1. Pick a tool
    • 2. Pick another tool
    • 3. Complain
  • How do they know what to monitor
  • How do they know when changes happen
  • New problem: what should you monitor
  • New problem: what should you alert on
  • New problem: who should you alert
  • New problem: what tools should I use
  • New problem: how do you monitor your monitoring tools
  • Step back and answer:
    • How do you know if your stack works
    • How do you know if your stack works well for others
  • SLI - Service level indicator - measurement of some aspect of your service
  • SLO - service level objective - target value
  • SLA - service level agreement - what level of service have you and your consumers agreed to
  • White-box vs black box monitoring
  • Black box: Garbage in -> service -> garbage out
  • White box: service (memory/ cpu/ secret sauce)
  • How do you know if you’re meeting SLA’s/ SLO’s?
  • Logs
    • Structured log (json logs)
    • Aggregate (send them somewhere centrally)
    • Tell a story
  • Metrics
    • One or more numbers give details about something (SLI)
    • Metrics are combined to create time-series
  • Tracing:
    • Single activity in your stack touches multiple resources
    • MK Note: Bryan is talking on OpenTracing at Velocity
  • Health endpoints
    • E.g. GET /healthz
    • {"database": "ok", "foo": "ok", "queue_length": "ok", "updated at": <datetime>}
  • do you know what’s going on
    • Logs
    • Metrics
    • Tracing
    • Other things
    • e.g. what happened at 3pm yesterday
    • logs, metrics, tracing, other things paint a picture
  • How do we ensure greater visibility:
    • Central point of contact for alerts
    • Research tooling practices for teams
    • What types of monitoring tools do we need
  • Philosophies:
    • USE: utilization, saturation and errors - Brendan Gregg
    • RED: rate, errors (rate), duration (distribution) - Tom Wilkie
    • Four golden signals: (latency, traffic, errors and saturation) - Google
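
To tie the SLI/SLO definitions and the health endpoint together, here is a small sketch. The request counts, the 99.9% target, the check names and the `updated_at` key are all assumptions for illustration. It only builds the response body and does not run a web server.

```python
import json
from datetime import datetime, timezone


def availability_sli(good_requests, total_requests):
    """SLI: the measured fraction of requests that succeeded."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as meeting the objective
    return good_requests / total_requests


def meets_slo(sli, slo_target):
    """SLO: the target value the SLI is compared against."""
    return sli >= slo_target


def health_payload(checks):
    """Build a /healthz-style JSON body from named check results (True means ok)."""
    body = {name: ("ok" if passed else "failing") for name, passed in checks.items()}
    body["updated_at"] = datetime.now(timezone.utc).isoformat()
    return json.dumps(body)


# Made-up numbers: 99,950 good requests out of 100,000.
sli = availability_sli(good_requests=99_950, total_requests=100_000)
print(f"SLI={sli:.4f}, meets a 99.9% SLO: {meets_slo(sli, 0.999)}")
print(health_payload({"database": True, "foo": True, "queue_length": True}))
```

The SLA is then the contract written around numbers like these, which is why it helps to have the SLI computed the same way everywhere it is reported.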

Yo Dawg: Monitoring Monitoring systems at Netflix - Roy Rapoport

  • Sarah Huffman Notes
  • A hero’s journey - product development lifecycle
    • This will scale for at least a month
  • Monitoring ain’t alerting
    • Alerting - outputs decisions and opinions
  • “everything counts in large amounts”
  • “the graph on the wall tells the story….”
  • 20-25k alerts a day at Netflix
  • Have another monitoring system to monitor your monitoring system (Hot/ Cold) watcher
  • Question: “Is one TV show/ movie responsible for more Netflix outages?” - Alice Goldfuss
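
One common way to "monitor the monitoring system" is a heartbeat watcher that raises a flag when the primary system goes quiet. The sketch below is my own illustration of that idea, not Netflix's implementation. The `HeartbeatWatcher` class and the 60-second threshold are made up, and the clock is injected so the example runs without sleeping.

```python
import time


class HeartbeatWatcher:
    """Tracks the last heartbeat from a monitoring system and flags silence."""

    def __init__(self, max_silence_seconds, clock=time.monotonic):
        self.max_silence_seconds = max_silence_seconds
        self._clock = clock
        self._last_beat = clock()

    def beat(self):
        """Called whenever the watched monitoring system reports in."""
        self._last_beat = self._clock()

    def is_stale(self):
        """True when the watched system has been silent for too long."""
        return self._clock() - self._last_beat > self.max_silence_seconds


# Simulated clock so the example is deterministic.
now = [0.0]
watcher = HeartbeatWatcher(max_silence_seconds=60, clock=lambda: now[0])

now[0] = 30.0
watcher.beat()                          # monitoring system checks in at t=30s

now[0] = 80.0
stale_after_50s = watcher.is_stale()    # 50s of silence, within the limit

now[0] = 100.0
stale_after_70s = watcher.is_stale()    # 70s of silence, over the limit

print(stale_after_50s, stale_after_70s)
```

The watcher should run somewhere that does not share failure modes with the system it watches, otherwise both go dark together.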

Our Many Monsters - Megan Anctil

  • Sarah Huffman Notes

  • Why, metrics, logging, alerting

  • Vendor vs Non-vendor

    • Business need
    • Cost!!!!
  • vizOps at Slack - 1-5 FTE

  • Deep-dive into Slack implementations of:

    • Monitoring: Graphite/ Grafana
    • Logging: ELK
    • Alerting: Icinga
  • Cost analysis for above platforms

  • Lessons learnt

    • Usability - escalation info must be valuable
    • Creation - must be easy
  • Key takeaway:

    • $$$ - “Is it worth it”
    • is the time worth it

Tracing Production Services at Stripe - Aditya Mukerjee

  • Sarah Huffman Notes
  • Tracing is about more than HTTP requests
  • Veneur - https://veneur.org
  • “If you need to look at logs, there’s a gap in your observability tools”
  • Metrics - no context
  • Logs - hard to aggregate
  • Request traces - require planning
  • What’s the difference between metrics/ logs/ tracing (if you squint, it’s hard to tell them apart)
  • What if we could have all three, all the time???
  • Standard sensor format - Easier to do all three
  • Intelligent metric pipelines (before the monitoring applications)
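
The "standard sensor format" point is easier to see with a toy example: if every unit of work is recorded once as a span, a metric and a log line can both be derived from it. This sketch is only my illustration of that idea. The `Span` fields and helper names are hypothetical and are not Veneur's actual wire format.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Span:
    """One unit of work, recorded once (hypothetical fields)."""
    trace_id: str
    name: str
    start_s: float
    end_s: float
    error: bool = False
    tags: dict = field(default_factory=dict)

    @property
    def duration_ms(self):
        return (self.end_s - self.start_s) * 1000.0


def to_metric(span):
    """Metric view: a timer value with tags, but no per-request context."""
    return {
        "name": f"{span.name}.duration_ms",
        "value": span.duration_ms,
        "tags": dict(span.tags),
    }


def to_log_line(span):
    """Log view: the same event with full context, as one structured JSON line."""
    record = {
        "trace_id": span.trace_id,
        "event": span.name,
        "duration_ms": span.duration_ms,
        "error": span.error,
    }
    record.update(span.tags)
    return json.dumps(record, sort_keys=True)


span = Span("trace-123", "charge_card", start_s=10.0, end_s=10.25,
            tags={"service": "payments"})
metric = to_metric(span)
log_line = to_log_line(span)
print(metric)
print(log_line)
```

The trace view is the span itself, linked to its siblings by `trace_id`, so all three signals come from a single instrumentation point.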

Linux debugging tools you’ll love - Julia Evans

  • Sarah Huffman Notes
  • Accompanying Zine
  • Starting off: read code, add print statements, know language
  • Wizard tools
    • strace
    • tcpdump etc
    • gdb
    • perf
    • eBPF
    • ftrace
  • Ask your OS what your programs are doing
  • strace can make your applications run 50x slower
  • MK Note: Julia walked through some examples where time/ strace/ tcpdump/ ngrep were all helpful

Instrumenting SmartTV’s and Smartphones in the Netflix app for modeling the Internet - Guy Cirino

  • Sarah Huffman Notes
  • Making the internet fast is slow
    • faster - better networking
    • slower - broader reach/ congestion
  • Don’t wait for it, measure it and deal
  • Working app > feature rich app
  • We need to know what the internet looks like, without averages
  • Logging anti-patterns
  • Averages - can’t see the distribution, outliers heavily distort
  • Sampling
    • missed data
    • rare events
  • RUM data
  • Don’t guess what the network is doing - measure it!
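
The point about averages hiding the distribution is easy to show with made-up numbers. In this sketch 98 requests take 50 ms and 2 take 5 seconds. The data and the nearest-rank `percentile` helper are assumptions for illustration, not anything from the talk.

```python
from statistics import mean, median


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100   # integer ceiling of pct * n / 100
    return ordered[max(rank, 1) - 1]


# Made-up latencies: most requests are fast, a couple are very slow.
latencies_ms = [50] * 98 + [5000] * 2

avg = mean(latencies_ms)
med = median(latencies_ms)
p99 = percentile(latencies_ms, 99)

print(f"mean={avg} ms, median={med} ms, p99={p99} ms")
```

On this data the mean is 149 ms, a latency no request actually saw, while the median is 50 ms and the 99th percentile is 5000 ms. Looking at percentiles (or the full histogram) shows both the typical case and the outliers.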

Monitoring: A Post Mortem - Charity Majors

The Vasa: Redux - Pete Cheslock

Sponsor talks (only calling out what I choose to)

  • Netsil
    • Application maps
    • Gives you visibility of your topology
    • Techniques
      • APM
      • Tracing (Zipkin)
      • proxies
      • OS tracing (pcap/ eBPF)
    • MK Note: Not sure how this works for encrypted data streams
  • Datadog
    • They are hiring
      • Apparently so is everyone else
    • What do you look for
      • Knowledge
      • Tools
      • Experience
    • Suggestions
      • Knowledge
        • Write blog pieces
        • Meetups (Knowledge)
      • Tools
        • Open source
        • studentpack.datadoghq.com
      • Experience
        • Internships
      • Share your knowledge
      • Share your tools
      • Share your experience
Last modified: 25 April 2020