Hi all,
Continuing yesterday’s notes. Last night there was actually a large power outage in downtown Portland, which caused us to change venues today. These notes are somewhat incomplete; I’ll try to fix them up in the coming days.
Thanks again to Sarah Huffman for her notes.
Video can be found here
Anomalies != Alerts - Betsy Nichols
- Sarah Huffman notes
- Now, Pain, Relief, Bonus: Use Case
- Detection & action need to be separated
- Because when they aren’t separated, Anomalies = Alerts
- Pain
- 1. Alerts —> Alert Fatigue
- Alerts = Anomalies
- Anomalies —> Alert Fatigue
- How many anomalies can we reasonably expect?
- Step 1. Population distribution
- Step 2. Population Density
- Step 3. Compute Anomalies/ Day
- To alert or not to alert
- decision required for each anomaly
- anomalies = TP (true positives) union FP (false positives)
- likely #FP >> #TP
- 2. Seeking needles
- Difficult to find a strategy for deciding when to alert and when not to
- Relief
- Basic monitoring pattern
- Basic semantic context
- Semantic Model (with analytics)
- Attribute Discovery
- Build data model using extra attributes
- Action Policy
- Works off what data we have and makes a decision
- Conditions
- Takeaways
- Best monitoring = Math + context
- preferred strategy
- anomalies + context = alerts (see the sketch below)
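As an aside, here’s a minimal sketch of what “anomalies + context = alerts” can look like. The names and thresholds are hypothetical (this isn’t code from the talk): an action policy only promotes an anomaly to an alert when the semantic context about its source meets the policy’s conditions.

```python
# Minimal sketch: detection and action are separated. A detector produced
# the anomaly; this action policy decides whether it becomes an alert,
# using discovered attributes (the "semantic context") about the source.
from dataclasses import dataclass

@dataclass
class Anomaly:
    metric: str
    host: str
    score: float        # how far outside the expected distribution

@dataclass
class Context:
    tier: str           # e.g. "prod" or "staging"
    customer_facing: bool

def should_alert(anomaly: Anomaly, ctx: Context) -> bool:
    if ctx.tier != "prod":
        return False                  # staging noise never pages anyone
    if ctx.customer_facing:
        return anomaly.score > 2.0    # lower bar for user-visible services
    return anomaly.score > 4.0        # internal services need a stronger signal

if __name__ == "__main__":
    a = Anomaly(metric="latency_p99", host="web-7", score=2.5)
    print(should_alert(a, Context(tier="prod", customer_facing=True)))     # True
    print(should_alert(a, Context(tier="staging", customer_facing=True)))  # False
```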
Distributed Tracing: How we got here and where we’re going - Ben Sigelman
- Sarah Huffman notes
- Why are we here
- Chapter 1: What is the purpose of monitoring
- must tell stories
- get to ‘why’ (ASAP)
- storytelling got a lot harder recently
- One story, N storytellers
- microservices may be here to stay
- but they broke our old tools
- transactions are not independent
- the simple thing
- basic concurrency
- Measuring symptoms
- metrics model symptoms well
- measure what the end user actually experiences
- aside: get raw timing from OpenTracing instrumentation
- There is such thing as too many metrics
- Chapter 2: Where
- Introducing Donut Zone: DaaS
- Microservice-oriented
- Uses OpenTracing
- Demo of OpenTracing
- All sampling must be all-or-nothing per transaction
- Not all latency issues are due to contention
- A new way to diagnose
- Infer or assign contention IDs (mutexes, DB tables, network links)
- Tag spans with each contention ID they encounter (see the sketch after these notes)
- automated root-cause for contention
- More demos of OpenTracing
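A tiny sketch of the span-tagging idea, using the OpenTracing Python API (`opentracing.global_tracer()` is a no-op unless a real tracer such as Jaeger or LightStep is registered). The `contention.*` tag names are my own illustration, not standard OpenTracing tags.

```python
# Tag spans with an ID for each contended resource they touch (a mutex, a
# DB table, a network link) so the tracing backend can group slow spans
# by shared resource and point at the contention automatically.
import threading
import opentracing

db_lock = threading.Lock()
tracer = opentracing.global_tracer()

def write_order(order_id: str) -> None:
    with tracer.start_active_span("write_order") as scope:
        span = scope.span
        # Assign stable IDs to the contended resources and tag the span.
        span.set_tag("contention.id", "mutex:db_lock")
        span.set_tag("contention.table", "db:orders")
        with db_lock:
            pass  # ... the actual write would happen here ...

write_order("1234")
```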
Science! Science Harder: How we reinvented ourselves to be data literate, experiment driven and ship faster - Laura Thomson & Rebecca Weiss
- Sarah Huffman notes
- Decision-making - without evidence
- How do you find out how your releases change the use of the product?
- Browser = App? Not quite.
- Need to test against various OSes/languages/…
- Failure to abstract
- Build a system to answer questions
- Resulted in different data collection systems (blocklist ping vs telemetry)
- Working within privacy constraints can be hard
- Unified telemetry:
- One infrastructure, many types of probes and pings
- Many type of events
- Unify mental model
- Transparency:
- Results must be reproducible as a URL
- Audit a number all the way down to the code
- Open science, open methodology
- Push as much data as you can
- Experimenting with experiments
- Update daily on any release channel
- real sampling
- flip a preference, install a feature, collect data (see the sampling sketch below)
- multiple experiments in flight at once
- Data will shape the web. What kind of web do you want
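For the “real sampling” point, here’s a rough sketch of deterministic client bucketing: hash a stable client id so a fixed fraction gets the preference flip and assignment is reproducible. This is an illustration, not Mozilla’s actual experiment tooling.

```python
# Deterministically bucket clients into an experiment: the same client id
# and experiment name always hash to the same bucket, so assignment is
# stable across runs and a fixed fraction of the population is sampled.
import hashlib

def in_sample(client_id: str, experiment: str, fraction: float) -> bool:
    digest = hashlib.sha256(f"{experiment}:{client_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF   # uniform value in [0, 1]
    return bucket < fraction

prefs = {"new.feature.enabled": False}           # hypothetical preference name
if in_sample("client-abc-123", "new-feature-2017", fraction=0.05):
    prefs["new.feature.enabled"] = True          # flip the preference, then collect data
```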
Real-time packet analysis at scale - Douglas Creager
- Sarah Huffman notes
- 2 Goals:
- Packet captures are useful
- You don’t need to make fundamental changes to your monitoring stack
- Example scenario
- Streaming a song via application
- You get drops
- Infer throughput problems
- Logs aren’t usually enough to help solve the problem
- In-app RUM can help
- Need to look at the network itself
- Flow can sometimes be helpful, but unlikely in this case
- Need to see inside the connections - Packet capture
- Don’t need to look at the actual payload (see the header-only capture sketch after these notes)
- Tool at Google - TraceGraph
- Time vs packets
- graphs packets (by type) and windows to show problems
- OK, so the graph visualizes the problem - how do we solve it?
- Looks like we have buffer bloat - can’t fix that problem in ISPs
- TCPDump to the rescue
- Google streams packet-captures to a central server for processing
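A rough header-only capture sketch in the spirit of the talk (illustrative only, not Google’s TraceGraph): bucket TCP packets per second by type without ever touching the payload. It needs scapy and the privileges to sniff on an interface.

```python
# Count TCP packets per second, bucketed by type - roughly the "time vs
# packets" view described above. Only headers are inspected, never payload.
from collections import Counter, defaultdict
from scapy.all import sniff, TCP   # pip install scapy; run with root privileges

buckets = defaultdict(Counter)     # second -> {packet type: count}

def classify(pkt):
    if TCP not in pkt:
        return
    second = int(pkt.time)
    flags = pkt[TCP].flags
    if flags & 0x04:
        kind = "RST"
    elif flags & 0x02:
        kind = "SYN"
    else:
        kind = "DATA/ACK"
    buckets[second][kind] += 1

sniff(filter="tcp", prn=classify, store=False, timeout=10)
for second in sorted(buckets):
    print(second, dict(buckets[second]))
```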
Instrumenting The Rest Of the Company: Hunting for Useful Metrics - Eric Sigler
- Sarah Huffman’s notes
- We have problem $foo, we are going to do $bar
- What data did you use to understand $foo? And how will you know if $bar improved anything
- “Without data, you’re just another person with an opinion”
- Example: We have a chatbot to do everything that we don’t understand
- Takeaway:
- Look for ways to reverse engineer existing metrics
- Useful metrics are everywhere; you aren’t alone in digging for them, and existing tools can be repurposed (see the sketch below)
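A toy example of that takeaway (the log format and command names are made up): reverse-engineer a metric out of something you already have, like chatbot command usage in an exported chat log.

```python
# Turn an existing tool's exhaust (a chat log) into a metric: how often is
# each chatbot command actually used?
import re
from collections import Counter

COMMAND = re.compile(r"!(\w+)")   # e.g. "!deploy web", "!status db"

def command_counts(log_lines):
    counts = Counter()
    for line in log_lines:
        match = COMMAND.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

log = [
    "09:01 alice: !deploy web",
    "09:03 bob: !status db",
    "09:10 alice: !deploy api",
]
print(command_counts(log))   # Counter({'deploy': 2, 'status': 1})
```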
Whispers in the Chaos: Monitoring Weak Signals - J Paul Reed
Monitoring @ CERN - Pedro Andrade
- Sarah Huffman’s notes
- Introduction to CERN
- 40 GB/s of data x 4 to collect during testing
- Where is the data stored:
- Main site
- Extension site (with a 3 x 100 Gb/s link) - Budapest
- uses standard commodity software/hardware
- 14k servers, 200k cores
- 190PB data stored
- Pedro’s team provides monitoring for the data storage/ computation
- Use Open Source tools
- Collectd
- Kafka as transport (see the consumer sketch after these notes)
- Spark for processing
- HDFS long term storage
- Some data in ES/InfluxDB
- Kibana/Grafana for visualizing
- OpenStack VMs - all monitoring services run on these
- Config done with Puppet
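To make the transport leg concrete, here is a minimal consumer sketch with kafka-python. The topic name, broker address, and message shape are assumptions for illustration, not CERN’s actual configuration.

```python
# Consume collectd-style metric JSON from Kafka for downstream processing.
import json
from kafka import KafkaConsumer   # pip install kafka-python

consumer = KafkaConsumer(
    "collectd-metrics",                      # assumed topic name
    bootstrap_servers=["localhost:9092"],    # assumed broker address
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    metric = message.value
    # A real pipeline would hand off to Spark jobs, HDFS for long-term
    # storage, or ES/InfluxDB feeding Kibana/Grafana dashboards.
    print(metric.get("host"), metric.get("plugin"), metric.get("values"))
```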
I Volunteer as Tribute: The Future of Oncall - Bridget Kromhout
- Sarah Huffman’s notes
- How many people dread the phone ringing
- Change is the only constant