The day agents got opinionated
Yoni Leitersdorf recently hosted Tomasz Tunguz on his podcast to discuss building real agent systems, why “unified tools” beat tiny ones, and the investment theses emerging from the friction.
I picked Tomasz Tunguz because he’s the rare investor who ships his own tools and writes down what breaks. He runs Theory Ventures, a concentrated, data-first fund, and he still publishes on his blog several times a week. If you don’t read him yet, start at his homepage and wander from there.
For the full podcast episode - go here.
We opened with the meta. Podcasts are full of signal but low on “information density,” so Tomasz built a system that ingests shows and newsletters, transcribes, summarizes, enriches, and auto-updates his firm’s CRM with new companies. It runs in production at Theory. Then we spent most of our time on the guts: chunking, vector databases, prompt programming, evals, tool calling, MCP, and what all of that means for builders and for investors.
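To make the shape of that system concrete, here is a minimal sketch of the ingestion loop. Every helper below is a hypothetical placeholder, not Tomasz’s actual code; the point is only the ingest → transcribe → summarize → enrich → CRM composition.

```ts
// Sketch of the ingest → transcribe → summarize → enrich → CRM loop.
// All helpers are hypothetical placeholders standing in for real services.
type Episode = { title: string; audioUrl: string };
type Company = { name: string; domain?: string };

// Placeholder implementations; a real system would call a speech-to-text
// service, an LLM for summaries, an enrichment API, and the CRM's own API.
async function transcribe(url: string): Promise<string> { return `transcript of ${url}`; }
async function summarize(text: string): Promise<string> { return text.slice(0, 200); }
async function extractCompanies(text: string): Promise<Company[]> { return []; }
async function upsertToCrm(companies: Company[], note: string): Promise<void> { console.log(companies, note); }

async function ingestEpisode(ep: Episode): Promise<void> {
  const transcript = await transcribe(ep.audioUrl);
  const summary = await summarize(transcript);
  const companies = await extractCompanies(transcript);
  await upsertToCrm(companies, `${ep.title}: ${summary}`);
}
```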
Claude Code
A recurring theme in our conversation was Tomasz’s use of Claude Code for much of his work and experiments. “I live in Claude Code. I use the terminal more than I use my browser.”
That line captures where Tomasz is right now. Less browsing. More building. His stack starts with ingestion, chunking, and a vector database. He has tried different stores but likes LanceDB for multimodal data and speed. He wrote publicly about why he partnered with the company, which also tells you where his head is as an investor.
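For the retrieval layer, a chunk-and-search pass over transcripts might look like the sketch below. It assumes the @lancedb/lancedb Node client and a placeholder embed() function; exact method names vary between client versions, so treat this as the shape of the thing rather than a drop-in snippet.

```ts
import * as lancedb from "@lancedb/lancedb";

// Placeholder embedder; in practice this calls an embedding model.
async function embed(text: string): Promise<number[]> {
  return Array.from({ length: 768 }, () => Math.random());
}

// Naive fixed-size chunking; real pipelines usually split on semantic boundaries.
function chunk(text: string, size = 1000): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += size) chunks.push(text.slice(i, i + size));
  return chunks;
}

async function indexAndQuery(transcript: string) {
  const db = await lancedb.connect("./data/podcasts");
  const rows = await Promise.all(
    chunk(transcript).map(async (text) => ({ text, vector: await embed(text) }))
  );
  const table = await db.createTable("chunks", rows);

  // Query: embed the question, return the nearest chunks.
  return table.search(await embed("who did they invest in?")).limit(5).toArray();
}
```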
From there, he programs prompts with DSPy, optimizes them along several dimensions using GEPA (a multi-objective method), and runs results through evals before anything is admitted into a workflow. The point is simple: automate like an engineer, not like a magician.
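DSPy and GEPA are Python frameworks, so I won’t pretend to reproduce them here, but the “evals as an admission gate” part of the loop is easy to sketch with hypothetical names: score a candidate prompt against a labeled set and only promote it into the workflow if it clears a threshold.

```ts
// Hypothetical eval gate: a candidate prompt is only admitted into a
// workflow if it clears a pass-rate threshold on a labeled test set.
type EvalCase = { input: string; expected: string };

async function passRate(
  runPrompt: (input: string) => Promise<string>,
  cases: EvalCase[]
): Promise<number> {
  let passed = 0;
  for (const c of cases) {
    const output = await runPrompt(c.input);
    if (output.trim().toLowerCase() === c.expected.trim().toLowerCase()) passed++;
  }
  return passed / cases.length;
}

async function admitIfGood(
  runPrompt: (input: string) => Promise<string>,
  cases: EvalCase[],
  threshold = 0.9
): Promise<boolean> {
  const rate = await passRate(runPrompt, cases);
  console.log(`pass rate: ${(rate * 100).toFixed(1)}%`);
  return rate >= threshold; // only promote prompts that clear the bar
}
```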
The email humbler
Tomasz tried to automate email replies. He mirrored years of messages into a vector DB, added style cues, wrote local model pipelines, even tried fine-tuning. The result surprised both of us.
I asked him - what are the chances that an email I send him will get answered by AI?
“Zero.”
“It is really hard … there is so much context associated with answering an email that is not immediately addressable to a robot.”
He walked through the pitfalls: social nuance, thread etiquette, non-deterministic tool calls, who to BCC on intros, and why he eventually replaced cleverness with deterministic rules for certain cases. If Google struggles to make “smart reply” feel smart with the world’s largest corpus, maybe your weekend agent isn’t going to nail it either.
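One way to read “deterministic rules for certain cases” - sketched here with hypothetical categories, not his actual rules - is a router that only automates the narrow, unambiguous cases and leaves everything with real social nuance to a human.

```ts
// Hypothetical router: handle only narrow, unambiguous cases with fixed
// templates; everything with real social nuance goes to a human.
type Email = { from: string; subject: string; body: string };
type Action = { kind: "auto-draft"; draft: string } | { kind: "human" };

function routeEmail(email: Email): Action {
  const subject = email.subject.toLowerCase();

  if (subject.includes("unsubscribe")) {
    return { kind: "auto-draft", draft: "Done — you've been removed from the list." };
  }
  if (subject.startsWith("re: scheduling")) {
    return { kind: "auto-draft", draft: "Here are three slots that work on my end: …" };
  }
  // Intros, thread etiquette, who to BCC: too much context for a rule or a model.
  return { kind: "human" };
}
```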
From 35 tiny tools to a few “unified” ones
Over July 4th weekend he built about 35 single-purpose tools. Classic Unix vibes: each tool does one thing, does it well, piped together. Failure rates were brutal. “30 to 40%” of calls failed, and the longer the session, the worse it got. A Salesforce action model helped with selection, but lacked the world knowledge to carry the task end to end.
Then he tried Anthropic’s guidance: fewer tools, each one larger, with lots of flags. Think FFmpeg (a million different capabilities in one tool) rather than a bag of tiny scripts. Failures dropped to roughly “10 or 15%,” speed improved, and the system behaved more predictably.
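The shift shows up directly in the tool signatures. Instead of dozens of single-purpose tools, one “FFmpeg-style” tool takes a rich options object. A hypothetical email tool (illustrative names only) might look like this:

```ts
// Before: many tiny tools (searchInbox, getThread, draftReply, sendEmail, ...).
// After: one unified tool with flags, so the model makes fewer, richer calls.
type EmailToolOptions = {
  action: "search" | "read" | "draft" | "send";
  query?: string;      // for search
  threadId?: string;   // for read / draft / send
  body?: string;       // for draft / send
  maxResults?: number; // for search
  dryRun?: boolean;    // draft without sending
};

async function emailTool(opts: EmailToolOptions): Promise<unknown> {
  switch (opts.action) {
    case "search": return searchInbox(opts.query ?? "", opts.maxResults ?? 10);
    case "read":   return getThread(opts.threadId!);
    case "draft":  return draftReply(opts.threadId!, opts.body ?? "");
    case "send":   return opts.dryRun
      ? draftReply(opts.threadId!, opts.body ?? "")
      : sendReply(opts.threadId!, opts.body ?? "");
  }
}

// Placeholder backends; a real tool would call the mail provider's API.
async function searchInbox(q: string, n: number) { return []; }
async function getThread(id: string) { return { id, messages: [] }; }
async function draftReply(id: string, body: string) { return { id, body, sent: false }; }
async function sendReply(id: string, body: string) { return { id, body, sent: true }; }
```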
One more turn of the screw came from Cloudflare’s research: don’t expose tools like shell commands. Present them as a TypeScript API and ask the model to write code that calls that API. Tomasz wrapped his capabilities that way and saw accuracy gains and big token-caching wins. Cloudflare’s write-up is great reading, and the community has started to reproduce the approach.
Today his agent stack centers on four or five unified tools around big domains: email, CRM, notes, content feeds, social. Each exposes documented TypeScript function definitions the model can compose at runtime. The docs live in a CLAUDE.md file the model reads on start. It is less human-legible than a chain of tiny tools, but it works.
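To make that concrete, here’s a hedged sketch of the shape - illustrative names, not his actual API: a small, documented TypeScript module per domain, whose docs get summarized into CLAUDE.md, and which the model uses by writing code rather than issuing one tool call per step.

```ts
// tools/crm.ts — one of a handful of "unified" domain APIs the model codes against.
// The doc comments below are what gets summarized into CLAUDE.md.

export type Company = { name: string; domain: string; stage?: string };

/** Search the CRM by free-text query. Returns at most `limit` companies. */
export async function searchCompanies(query: string, limit = 20): Promise<Company[]> {
  // Placeholder; the real implementation would hit the CRM's API.
  return [];
}

/** Create or update a company record, keyed by domain. */
export async function upsertCompany(company: Company): Promise<void> {
  // Placeholder.
}

// The model then composes these in generated code, e.g.:
//   const hits = await searchCompanies("vector databases");
//   for (const c of hits) await upsertCompany({ ...c, stage: "tracking" });
```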
Planning model, execution model
One pattern we both see inside real products: split “plan” from “do.” Use a small, focused model to elicit assumptions, draft a PRD-like plan, ask “what haven’t we covered,” set tests and guardrails, then hand the plan to a larger model to execute. Keep the contexts separate. He discovered this experimentally; we’ve seen the same in production.
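A minimal sketch of that split, assuming the official @anthropic-ai/sdk client and placeholder model IDs (this is the pattern, not anyone’s production orchestration):

```ts
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Two-stage loop: a smaller model drafts the plan (assumptions, open questions,
// tests), a larger model executes it. Model IDs below are placeholders.
async function planThenExecute(task: string): Promise<string> {
  const plan = await client.messages.create({
    model: "small-planner-model-placeholder",
    max_tokens: 1024,
    messages: [{
      role: "user",
      content: `Draft a short PRD-style plan for: ${task}\n` +
               `List assumptions, what we haven't covered, and tests/guardrails.`,
    }],
  });

  const planText = plan.content
    .map((block) => (block.type === "text" ? block.text : ""))
    .join("\n");

  // Fresh context for execution: the executor sees the plan, not the planning chatter.
  const result = await client.messages.create({
    model: "large-executor-model-placeholder",
    max_tokens: 4096,
    messages: [{ role: "user", content: `Execute this plan step by step:\n${planText}` }],
  });

  return result.content
    .map((block) => (block.type === "text" ? block.text : ""))
    .join("\n");
}
```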
MCP is early, but the curve is steep
MCP, Anthropic’s Model Context Protocol, is becoming the de facto standard for connecting assistants to data and tools. Tomasz has been publishing MCP analytics and found most servers are still infra-oriented rather than app-level. That lines up with our experience too. The platform moment is arriving, and the rough edges for MCP are security, auth, sandboxes, and data-loss prevention.
The next wave is making MCP usable by code, not by chat. The TypeScript API pattern plus unified tools looks like the path of least resistance right now, and Cloudflare’s “Code Mode” explains why.
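If you haven’t touched MCP yet, here’s roughly what an app-level server looks like - a minimal sketch assuming the @modelcontextprotocol/sdk TypeScript package and the zod schema helper; method names can differ between SDK versions.

```ts
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

// A minimal app-level (not infra-level) MCP server exposing one unified CRM tool.
const server = new McpServer({ name: "crm", version: "0.1.0" });

server.tool(
  "search_companies",
  { query: z.string(), limit: z.number().optional() },
  async ({ query, limit }) => ({
    // Placeholder result; a real server would query the CRM here.
    content: [{ type: "text", text: JSON.stringify({ query, limit: limit ?? 20, hits: [] }) }],
  })
);

// Clients (Claude Code, other MCP hosts) connect over stdio.
const transport = new StdioServerTransport();
await server.connect(transport);
```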
Where the founder hat meets the fund hat
The summer’s experiments changed how Theory invests. One result: a bet on LanceDB for multimodal data at scale. Another: a growing thesis around observability and debugging for agentic systems. Tool chains call dozens of functions. When results go sideways, you are “wiggling one side of a rope” and hoping the other side wiggles correctly. The market needs breakpoints, traces, and interpretable states for plans and tools. Whether that comes from model vendors or startups is an open question, but the problem is real.
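Today most of that observability gets bolted on by hand. A hedged sketch of the kind of tracing wrapper builders end up writing themselves, with hypothetical names:

```ts
// Hypothetical tracing wrapper: record every tool call's inputs, outputs,
// duration, and errors so a failed run can be replayed and inspected.
type TraceEvent = {
  tool: string;
  args: unknown;
  result?: unknown;
  error?: string;
  ms: number;
};

const trace: TraceEvent[] = [];

function traced<A extends unknown[], R>(
  tool: string,
  fn: (...args: A) => Promise<R>
): (...args: A) => Promise<R> {
  return async (...args: A) => {
    const start = Date.now();
    try {
      const result = await fn(...args);
      trace.push({ tool, args, result, ms: Date.now() - start });
      return result;
    } catch (err) {
      trace.push({ tool, args, error: String(err), ms: Date.now() - start });
      throw err;
    }
  };
}

// Usage: wrap each unified tool once, then dump `trace` when a run goes sideways.
// const searchCompaniesTraced = traced("crm.searchCompanies", searchCompanies);
```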
Security is hot too. Prompt injection through innocuous surfaces like calendar invites is not a thought experiment anymore. Sandboxes shrink blast radius. Non-human identities and least-privilege policies need first-class treatment in MCP land. Builders should assume these concerns are part of any enterprise proof-of-concept.
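The least-privilege piece, at its simplest, is an allowlist checked on every dispatch - sketched here with hypothetical agent and tool names:

```ts
// Hypothetical least-privilege gate: each non-human identity gets an explicit
// tool allowlist, and every dispatch is checked against it before it runs.
type AgentIdentity = { id: string; allowedTools: Set<string> };

const podcastAgent: AgentIdentity = {
  id: "agent:podcast-ingest",
  allowedTools: new Set(["crm.searchCompanies", "crm.upsertCompany", "notes.append"]),
};

function authorize(agent: AgentIdentity, tool: string): void {
  if (!agent.allowedTools.has(tool)) {
    // Deny by default; an email or calendar tool is simply not reachable from
    // this agent, which shrinks the blast radius of a prompt injection.
    throw new Error(`${agent.id} is not allowed to call ${tool}`);
  }
}

authorize(podcastAgent, "crm.upsertCompany"); // ok
// authorize(podcastAgent, "email.send");     // throws
```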
Last bit - the future will reward generalists
Towards the end of the episode we discussed how AI will change what it means to be highly productive. Our conclusion: we will all need to become strong generalists, able to command a fleet of agents.
If you’re new to Tomasz’s writing
Start on his blog. It’s short, frequent, and grounded in actual experiments and data. He has been documenting MCP adoption, the rise of unified tools, and broader market surveys for years, first at Redpoint and now at Theory Ventures.
And now - hop over to listen to the full episode, and subscribe to the podcast.