Building Reliable Systems at Bloomberg with Sal Furino

In this episode of Alexa’s Input (AI), I sit down with Sal Furino to explore the hidden engineering work that keeps modern systems reliable.

We break down what Service Level Objectives and Indicators (SLOs/SLIs) and error budgets actually mean in practice, why reliability is as much a cultural problem as a technical one, and how teams can better measure real user experience instead of just infrastructure health.

Sal also explains reliability engineering and the challenges of reliability at scale, including:

Why latency and correctness become harder to measure with GenAI
The difference between a bad incident and a fundamentally bad system
How observability and telemetry shape modern engineering organizations
Why most teams focus too much on infrastructure metrics and not enough on user happiness
Why “the best systems are the ones nobody notices”

If you work in AI infrastructure, distributed systems, platform engineering, observability, or SRE, this episode is a must-listen!

SREcon Talk, "Dashboards & Dragons: Reliability Magic for AI Platforms" by Alexa Griffith and Sal Furino: https://youtu.be/aWMB_7ksbkc?si=S49nPyAl_hCUIH7y

General Podcast Links

Watch: https://www.youtube.com/@alexa_griffith

Read: https://alexasinput.substack.com/

Listen: https://creators.spotify.com/pod/profile/alexagriffith/

More: https://linktr.ee/alexagriffith

Learn more about the host at:

Website: https://alexagriffith.com/

LinkedIn: https://www.linkedin.com/in/alexa-griffith/

Find out more about the guest at:

LinkedIn: https://www.linkedin.com/in/salvatore-furino/

Rootly Interview: https://rootly.com/humans-of-reliability/salvatore-furino

Reliability at Scale Talk: https://youtu.be/J-VrU5JHPlk?si=8aV8acy57NWX30KA

Bloomberg Careers: https://bloomberg.avature.net/careers/SearchJobs

Chapters

00:00 - Introduction: Reliability in a world reshaped by generative AI
02:22 - The importance of seamless, background system design
04:41 - Becoming a Customer Reliability Engineer at Bloomberg
05:17 - Clarifying the CRE role and its customer focus
08:02 - The importance of observability and high-scale performance in finance
09:00 - Balancing technical and cultural aspects of reliability
10:19 - Coaching teams to be proactive using error budgets and SLIs
12:21 - The sociotechnical system: People, processes, and tools
13:06 - Mediation of differing opinions on reliability practices
15:06 - The nuanced approach to alerting and incident response
17:08 - The significance of tiered SLOs and the concept of error budgets
21:08 - Using signals like latency, correctness, availability, saturation in system measurement
22:53 - The impact of service level "nines" on system design and resilience
28:00 - Handling non-determinism and trust in AI responses
33:01 - Error budgets and their role in managing deployments
34:10 - The challenge of achieving five nines and data durability considerations
40:03 - Adapting SLOs for GenAI systems: core principles remain intact
42:23 - Measuring non-deterministic AI responses and quality proxies
44:41 - The ongoing importance of reliability even in AI/ML contexts
47:25 - Reacting to error budget exhaustion and proactive mitigation
50:42 - The significance of involving cross-functional teams during outages
55:36 - Advocating reliability investment to leadership
56:24 - The customer perspective: reliability as a fundamental feature
58:42 - Connecting with Sal Furino: where to follow his work and learn more about Bloomberg's engineering culture
59:20 - Final advice: Focus on user happiness to avoid common pitfalls in adopting SLOs
