In this episode of Alexa’s Input (AI), I sit down with Sal Furino to explore the hidden engineering work that keeps modern systems reliable.
We break down what Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets actually mean in practice, why reliability is as much a cultural problem as a technical one, and how teams can better measure real user experience instead of just infrastructure health.
Sal also explains reliability engineering and the challenges of reliability at scale, like:
- Why latency and correctness become harder to measure with GenAI
- The difference between a bad incident and a fundamentally bad system
- How observability and telemetry shape modern engineering organizations
- Why most teams focus too much on infrastructure metrics and not enough on user happiness
- Why “the best systems are the ones nobody notices”
If you work in AI infrastructure, distributed systems, platform engineering, observability, or SRE, this episode is a must-listen!
SREcon Talk: "Dashboards & Dragons: Reliability Magic for AI Platforms" by Alexa Griffith and Sal Furino: https://youtu.be/aWMB_7ksbkc?si=S49nPyAl_hCUIH7y
General Podcast Links
Watch: https://www.youtube.com/@alexa_griffith
Read: https://alexasinput.substack.com/
Listen: https://creators.spotify.com/pod/profile/alexagriffith/
More: https://linktr.ee/alexagriffith
Learn more about the host at:
Website: https://alexagriffith.com/
LinkedIn: https://www.linkedin.com/in/alexa-griffith/
Find out more about the guest at:
LinkedIn: https://www.linkedin.com/in/salvatore-furino/
Rootly Interview: https://rootly.com/humans-of-reliability/salvatore-furino
Reliability at Scale Talk: https://youtu.be/J-VrU5JHPlk?si=8aV8acy57NWX30KA
Bloomberg Careers: https://bloomberg.avature.net/careers/SearchJobs
Chapters
00:00 - Introduction: Reliability in a world reshaped by generative AI
02:22 - The importance of seamless, background system design
04:41 - Becoming a Customer Reliability Engineer at Bloomberg
05:17 - Clarifying the CRE role and its customer focus
08:02 - The importance of observability and high-scale performance in finance
09:00 - Balancing technical and cultural aspects of reliability
10:19 - Coaching teams to be proactive using error budgets and SLIs
12:21 - The sociotechnical system: People, processes, and tools
13:06 - Mediation of differing opinions on reliability practices
15:06 - The nuanced approach to alerting and incident response
17:08 - The significance of tiered SLOs and the concept of error budgets
21:08 - Using signals like latency, correctness, availability, saturation in system measurement
22:53 - The impact of service level "nines" on system design and resilience
28:00 - Handling non-determinism and trust in AI responses
33:01 - Error budgets and their role in managing deployments
34:10 - The challenge of achieving five nines and data durability considerations
40:03 - Adapting SLOs for GenAI systems: core principles remain intact
42:23 - Measuring non-deterministic AI responses and quality proxies
44:41 - The ongoing importance of reliability even in AI/ML contexts
47:25 - Reacting to error budget exhaustion and proactive mitigation
50:42 - The significance of involving cross-functional teams during outages
55:36 - Advocating reliability investment to leadership
56:24 - The customer perspective: reliability as a fundamental feature
58:42 - Connecting with Sal Furino: where to follow his work and learn more about Bloomberg's engineering culture
59:20 - Final advice: Focus on user happiness to avoid common pitfalls in adopting SLOs
