SRECon Americas 2018 Day 2 Notes

Day 2 of SRECon started with an introduction from Kurt and Betsy (the Program Chairs), followed by three sets of plenary talks. The following are the notes I put together from the talks I went to today.

If You Don’t Know Where You’re Going, It Doesn’t Matter How Fast You Get There
Nicole Forsgren & Jez Humble
Slides found here.

The talk was generally about strategic planning and measuring your performance and success. This part of IT is very under-valued. These tweets sum up the presentation well:

- Where am I going?
- Why do we care?
- Improving performance/quality
- Measuring performance
- Culture & how to measure it
#SREcon (Murali Suriar, @msuriar, March 28, 2018)

Key point - it's the direction, not the destination that matters. #SREcon (Murali Suriar, @msuriar, March 28, 2018)

The talk also made some nice points about the misuse of Velocity and Utilization as Key Performance Indicators (KPIs).

Security and SRE: Natural Force Multipliers
Cory Scott
Slides here

Hierarchy of Needs in SRE:
- Monitoring & Incident Response
- Postmortem & Analysis
- Testing & Release Procedures
- Capacity Planning
- Product

Problem Statement: High Rate of Change
- Trust but verify
- Embrace the Error Budget
- Inject Engineering Discipline
- Testing in Production is the new normal
- Dark Canaries

Security Challenges are similar to SRE:
- Latency & perf impact
- Cascading failure scenarios
- Service Discovery

Security Challenges:
- Authentication
- Authorization
- Access Control Logic

Data center technologies can all be controlled from a single web page application.
Start with a known-good state:
- Asset management
- Ensure visibility
- Validate consistently and constantly

Takeaways or Giveaways:
- Your data pipeline is your security lifeblood
- Human-in-the-loop is your last resort, not your first option
- All security solutions must be scalable
- Remove single points of security failure like you do for availability
- Assume that an attacker can be anywhere in your system or flow
- Capture and measure meaningful security telemetry

What it Really Means to Be an Effective Engineer
Edmond Lau

- Effort <> Impact
- Impact = hours spent x (impact produced / hours spent)
- Leverage = impact produced / hours spent

What are the high-leverage activities for engineers?
- Do the simple thing first
- Effective engineers invest in iteration speed
- Effective engineers validate their ideas early and often

What are the best practices for building good infrastructure for relationships?
- Effective engineers explicitly design their alliances
- Effective engineers explicitly share their assumptions

Effective engineers work hard and get things done, focus on high-leverage activities, and build infrastructure for their relationships.

The Day the DNS Died
Jeremy Blosser
Slides here

Impact:
- Sending mail
- Application traffic
- Metrics
- Diagnosing blind (without metrics) is difficult!
- resolv.conf is fiddly: it can only use the first 3 entries

Diagnosis:
- Asymmetric DNS packet flow (94% packet loss)

The Cause:
- [Undocumented] Connection tracking

Response:
- Incident response was functional
- Ability to respond was compromised
- New DNS design required

New Design:
- Dedicated VPC for isolation
- Open security groups with ACLs
- Separate clusters for app/db vs MTA
- Use DNSMasq for local caching

Lessons learnt:
- Not all cloud limits are apparent
- Instrument your support services and protect them
- It’s always a DNS problem… except when it’s a firewall problem
- resolv.conf is not agile

Stable and Accurate Health-Checking of Horizontally-Scaled Services
Lorenzo Saino

See the new Fastly paper on load balancing: Balancing on the edge: transport affinity without network state (NSDI paper).

PoP Deployments:
- Space and power are at a premium

Architecture:
- Building a smarter load balancer

Methods:
- Machine learning - classifier
- Signal processing - filter
- Control theory - controller

Design (a multi-stage system):
- Denoising - remove noise from input signal
- Anomaly detection - identify misbehaving instance
- Hysteresis filter - stabilize output

Implementation:
- Host signals go into a filter, which makes a decision about the global state of the host

Don’t Ever Change! Are Immutable Deployments Really Simpler, Faster, and Safer?
Rob Hirschfeld

Immutable Patterns:
- Baseline + Config
- Live Boot + Config
- Image Deploy

Image creation:
- Do the configuration and capture the image into a portable format
- This sounds like a lot of work and really slow
- Yes, but it’s faster, safer and more scalable

Lessons Learned from Our Main Database Migrations at Facebook
Yoshinori Matsunobu

User Database:
- Stores Social Graph
- Massively Sharded
- Low latency
- Automated Operations
- Pure Flash Storage

What is MyRocks:
- MySQL on top of RocksDB
- Open Source, distributed from MariaDB and Percona

MyRocks Features:
- Clustered Index
- Bloom filter and Column family
- Transactions, including consistency between binlog and RocksDB
- Faster data loading, deletes and replication
- Dynamic Options
- TTL
- Online logical and binary backup

MyRocks pros vs InnoDB:
- Much smaller space (half compared to compressed InnoDB)
- Writes are faster
- Much smaller bytes written

MyRocks cons vs InnoDB:
- Lacks several features: no FK, fulltext index, or spatial index
- Must use row-based binary logging format
- Reads are slower than InnoDB
- Too many tuning options

MyRocks migration - technical challenges:
- Migration
- Creating MyRocks instances without downtime
- Creating second MyRocks instance without downtime
- Shadow traffic tests
- Promoting new master

InnoDB vs MyRocks, from SRE PoV:
- Server is busier because of double density
- RocksDB is a much newer database that changes rapidly
- MyRocks/RocksDB relies on buffered I/O
- For large transactions, COMMIT needs more work than InnoDB
- There are too many tuning options
- Faster writes mean replication slaves lag less often

Issue: Mitigating stalls
- We upgraded kernel to 4.6
- Changed data loading queries (schema changes) to use MyRocks bulk loading feature
- Commit stalls every few mins -> now nearly zero

Issue: A few deleted rows re-appeared
- Some of our secondary indexes had extra rows
- Turned out to be a bug in RocksDB compactions: in rare cases with heavy deletions, tombstones might not have been handled correctly

Issue: Scanning outbound delete-markers
- Counting from one of the empty tables started taking a few minutes

Lessons Learned:
- Learn how core components work
- RocksDB depends on Linux more than InnoDB does
- Understanding how Linux works helps to fix issues faster
- Do not ignore outliers
- Many of our issues happened on only a handful of instances

Leveraging Multiple Regions to Improve Site Reliability: Lessons learnt from
Andrew Duch

- Lesson 1: Sorry, I missed this
- Lesson 2: Not everything has to be active-active
- Lesson 3: Three is cheaper than two - you waste 50% of your capacity in an active-active model
- Lesson 4: Practice, Practice, Practice
- Lesson 5: Failover Automation needs to scale

Unfortunately I had to skip the final set of sessions this afternoon due to a conflict. From all accounts, the sessions this afternoon were great.

See everyone tomorrow for day 3!
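Lesson 3 from the multi-region talk above ("three is cheaper than two") comes down to simple arithmetic. Here is a minimal sketch of it, assuming traffic is spread evenly across regions and that the goal is to keep serving peak load after losing any single region. The function name and the load figure are my own illustration, not something from the talk:

```python
def provisioning(peak_load: float, regions: int) -> tuple[float, float]:
    """Return (total capacity to provision, fraction of it idle at peak).

    Assumption (mine, not the speaker's): load is balanced evenly and the
    surviving regions must absorb all of peak_load when one region fails.
    """
    if regions < 2:
        raise ValueError("need at least two regions to survive a region loss")
    # After one failure, (regions - 1) regions must carry the full peak.
    per_region = peak_load / (regions - 1)
    total = per_region * regions
    # Whatever is provisioned beyond peak_load sits idle in normal operation.
    idle_fraction = (total - peak_load) / total
    return total, idle_fraction


for n in (2, 3, 4):
    total, idle = provisioning(100, n)
    print(f"{n} regions: provision {total:.0f} units, {idle:.0%} idle")
```

Under these assumptions, two regions need 200 units of capacity to serve a peak of 100 (50% idle), while three regions need only 150 (about 33% idle), which is the sense in which three is cheaper than two.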

SRECon US 2018 Day 3: What I'm seeing

The talks I’m watching today are:
- Containerization War Stories
- Resolving Outages Faster with Better Debugging Strategies
- Monitoring DNS with Open-Source Solutions
- “Capacity Prediction” instead of “Capacity Planning”: How Uber Uses ML to Accurately Forecast Resource Utilization
- Distributed Tracing, Lessons Learned
- Whispers in Chaos: Searching for Weak Signals in Incidents
- Architecting a Technical Post Mortem
- Your System has recovered from an Incident, but have your Developers?

The Day 3 plenary sessions are:
- The History of Fire Escapes
- Leaping from mainframes to AWS: Technology Time Travel in the Government
- Operational Excellence in April Fools’ Pranks

Come and say hi if you see me!

SRECon Americas 2018 Day 1 Review

Hi all,

This year marks my 3rd year at SRECon Americas. This year brings a 3-day format, with the first day dedicated exclusively to workshops. Hooray! The workshops included:
- Containers from Scratch
- SRE Classroom, or How to Build a Distributed System in 3 Hours
- Profiling JVM Applications in Production
- Incident Command for IT - What We’ve Learned from the Fire Department
- Kubernetes 101
- Chaos Engineering Bootcamp
- Ansible for SRE Teams
- Tech Writing 101 for SREs

For the first session, I attended Containers from Scratch. As someone who understands the practical implementation of containers, I really appreciated seeing all the details behind it. You can find the following resources from the presentation:
- Tutorial material
- Linux Primitives

I unfortunately didn’t get a chance to see any of Brent Chapman’s session today on Incident Management, but after going to his BayLISA presentation two weeks back, I know it would have been great. You can find his presentation materials here.

Bridget Kromhout did a detailed Kubernetes 101 session. From all accounts, it was awesome. You can find relevant materials here:
- SRECon Slides
- GitHub

You can find James Meickle’s presentation on ‘Ansible for SRE’ here.

Update (March 28th, 8am): Tammy Butow posted the materials from her Chaos Engineering Bootcamp workshop:
- GitHub
- Speaker Deck

Update (March 29th, 12pm): Dan Luedtke did his own version of the Containers Workshop in Go. See the post here.

Finally, I spent a little bit of time in the LinkedIn Engineering booth. Thanks to everyone who stopped by to say hi to us!