Gimlet Labs runs an inference cloud built on heterogeneous silicon. Their software traces a PyTorch workload, segments it into its component parts, and schedules each piece onto the best-suited hardware — connecting chips from different vendors on a single high-speed fabric.
In this interview, Gimlet co-founder Natalie Serrino and former Intel executive Beltir walk through the architecture (graph trace, optimal split points, lowering each segment to TensorRT on NVIDIA and equivalents elsewhere), the three customer segments they sell into (frontier labs, sovereign clouds, AI natives), and a concrete demo. On GPT-OSS 120B at 8K input / 1K output tokens, running the speculative decoder on a d-Matrix Corsair card while NVIDIA B200s run the verifier shifts the throughput-vs-interactivity Pareto frontier by roughly 4× relative to GPU-only speculative decode.
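For readers new to the draft/verifier split mentioned above, here is a minimal Python sketch of the generic greedy form of speculative decoding. It is not Gimlet's implementation, and it does not model placing the two roles on different chips. The names `speculative_decode`, `greedy_decode`, `toy_target`, and `toy_draft` are hypothetical, and the "models" are toy functions that map a token list to the next token.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # hypothetical model interface


def greedy_decode(model: NextToken, prompt: List[Token], max_new: int) -> List[Token]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


def speculative_decode(draft: NextToken, target: NextToken,
                       prompt: List[Token], max_new: int, k: int = 4) -> List[Token]:
    """Greedy speculative decoding: a cheap draft proposes, the target verifies."""
    out = list(prompt)
    goal = len(prompt) + max_new
    while len(out) < goal:
        # 1. The draft model proposes up to k tokens autoregressively.
        proposal: List[Token] = []
        ctx = list(out)
        for _ in range(min(k, goal - len(out))):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)

        # 2. The target (verifier) checks each proposed position. A real system
        #    would score all positions in one batched forward pass.
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # first mismatch: keep the target's token
                break
        else:
            # Every proposal was accepted; the target adds one more token if there is room.
            if len(out) + len(accepted) < goal:
                accepted.append(target(out + accepted))
        out.extend(accepted)
    return out


# Toy stand-ins (assumptions for illustration only).
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] * 3 + 1) % 7


def toy_draft(ctx: List[Token]) -> Token:
    # Agrees with the target except after token 5, where it guesses wrong.
    return 0 if ctx[-1] == 5 else toy_target(ctx)


if __name__ == "__main__":
    print(speculative_decode(toy_draft, toy_target, [1], max_new=10))
```

In this greedy variant the output always matches what the target model would have produced on its own. Any speedup comes from how many draft tokens the verifier accepts per step, which is why the two roles can have different hardware profiles.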
The most surprising takeaway: most Neoclouds gave significant equity to a single silicon vendor in exchange for capacity. Hardware amortization is around 70% of their annual costs, and the equity terms prevent them from diversifying their silicon. So the only software innovation they can ship is disaggregation on top of one vendor's stack — never across vendors. Gimlet's two-track model (deploying orchestration software inside customer data centers, plus running their own Neocloud built on mixed silicon) is the answer to that constraint.
Read the full transcript on Chipstrat.
Chapters:
0:00 Intro and the chips no one's connected before
0:33 Inference cloud for agents
1:02 From Intel to Gimlet
2:14 The case for heterogeneous inference
4:03 Disaggregating inference by resource profile
6:24 Tracing PyTorch into a schedulable graph
8:08 Connecting chips never connected before
10:52 CPUs as the agentic workhorse
12:01 Tool calls in the same data center as the LLM
13:21 Latency vs throughput on a shared fabric
14:57 Three customer buckets
15:54 Sovereigns: make an API call, not a porting project
19:37 "Cracked software is the platform"
22:24 Why merchant silicon vendors need partners
25:18 Hyperscalers outsourcing CapEx, not just kernels
28:49 AI natives: latency budgets, not just price
32:06 The d-Matrix partnership
33:31 The Pareto frontier chart
35:56 Speculative decode on Corsair: 4× shift
37:27 4× faster, or 3× more customers?
41:22 Why most Neoclouds can't follow this model
42:34 Gimlet's two-track business model
44:30 CoreWeave vs Together vs Gimlet
45:15 Series A and hiring
Relevant reading:
The Information on Gimlet helping OpenAI optimize for Cerebras: https://www.theinformation.com/newsletters/ai-agenda/startup-helping-openai-optimize-ai-cerebras-chips
Sachin Katti and Zain Asgar coauthored research at Stanford: https://arxiv.org/abs/2507.19635
Follow Chipstrat:
Newsletter: https://www.chipstrat.com
X: https://x.com/chipstrat
Semi Doped with Vikram Sekar and Austin Lyons is available on multiple platforms.
