Build less, ship more: Inference engineering is just-in-time context generation
Inference-time optimization now leads to better outcomes than model-quality improvements. Just don't call it prompt engineering...
This is the first of a series of posts explaining Product Command.
Product command is a coordination model that centralizes intent (the why, the success conditions, and the constraints), decentralizes execution (letting coworkers decide the less crucial details), and aligns the two through explicit verification steps.
This contrasts with a command-and-control model, in which a commander makes the decisions and subordinates execute them.
The result is outputs (products) that reach higher quality with less effort.
This post focuses on building the right environment for AI to find its own context.
LLMs are a communication problem, not a computer science problem.
I haven’t written any code in 6 months. In the last month, I’ve been able to let Codex run unattended for 20–60 minutes and have it get a feature right the first time.
The key? Quality communication. I’m a startup founder and software engineer, but I started my career as a journalist. That means I actually studied Communication in college, not computer science.
CS and Comm might not be as divergent as you think. Weirdly, these two fields have a shared heritage and patron saint: Claude Shannon.
Shannon’s 1948 paper, A Mathematical Theory of Communication, gave communication a scientific definition. CS focuses on the math; communication studies focuses on the mechanisms.
Computers are obviously a computer science problem. From 1950 to 2015, communication mechanisms were literal: feedback, signal-to-noise, and transmission channels all meant wires transmitting electrons. Machine learning isn’t about understanding words; it’s about Bayesian regression. That means math, not communication.
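The shared vocabulary really is that literal. The "signal-to-noise" framing, for instance, traces to Shannon's channel-capacity result, shown here in its Shannon–Hartley form for a bandlimited analog channel:

$$C = B \log_2\left(1 + \frac{S}{N}\right)$$

where $C$ is channel capacity in bits per second, $B$ is bandwidth in hertz, and $S/N$ is the signal-to-noise power ratio. Pure math on one side; on the other, a whole field studying how real senders and receivers lose meaning along the way.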
LLMs are different. Training time is still all math and builds the model's instincts. But the intelligence part of artificial intelligence happens at inference time, not training time.
And inference time is communicated in words, not numbers.
Inference engineering
I’d like to introduce a new term: inference engineer. An inference engineer is someone who manages both sides of Claude Shannon’s model... both the math and the mechanics. That means, yes, writing great prompts, but more importantly, it means clear communication between the person using the AI and the AI doing the implementing.
A few years back, I left Facebook to build a startup with my best friends. Our startup helped take long videos and turn them into short ones using LLMs.
As we started to scale, the quality of our AI implementation didn’t. Our company’s biggest hurdle ended up being humans reviewing the quality of clips, not building our product or marketing it.
If AI is supposed to save time, it has to actually save time. When it doesn’t, humans-in-the-loop have to cover for it behind the scenes. Our LLM wasn’t good enough, so our business couldn’t scale without people. That made our team ask more fundamental questions.
How can we actually know something is reliable? How do we communicate our preferences to the AI? How can we communicate our users’ preferences to the AI?
Sean Grove is one of the smartest people in this area, and he turned me on to Model Spec, Constitutional AI, and other concepts for aligning and verifying model behavior.
He gave me the words to communicate what I was feeling: communication with models defines success with AI. Specs are permanent; code is an artifact of a spec.
Programming will go away, specs will not.
Learning -> building
We took this idea and tried to find a practical application. For the last year, our team has been working as consultants of sorts... circling ideas like fine-tuning, evals, execution frameworks, and communication theory to close a growing confidence gap between model performance and consistent outputs.
We’ve worked with some great teams, and were always able to close their trust gaps in surprising ways. (I’ll be posting about those ways! Like and subscribe!)
Finally, we’ve figured out how to explain it. We call it product command.
Product command is a coordination methodology designed to create an environment for an agent to discover the right information at precisely the right moment.
That means agents, rather than humans, tell agents precisely what to do. Humans instead describe their intent and leave the AI to build a context window sufficient for the task.
People (or parent agents) are responsible for specifying the goal of the next interaction, and the agent is responsible for execution. People shouldn’t be digging in to see which tool calls were made; they should be specifying an intent and then verifying that intent was met.
The most basic implementation: create artifacts that carry precisely the right amount of information to execute a plan, let agents execute that plan mostly unattended, and then learn from the execution to improve the loop.
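As a concrete illustration, here is a minimal sketch of that loop in Python. Everything in it is hypothetical (the `Intent` structure, the stubbed `run_agent`, the check names); a real system would hand the intent to an LLM agent rather than a stub, but the shape of the loop is the point.

```python
from dataclasses import dataclass, field

@dataclass
class Intent:
    """Centralized intent: the why, success conditions, and constraints."""
    goal: str
    success_checks: list          # callables: output -> bool
    constraints: list = field(default_factory=list)

def run_agent(intent: Intent) -> str:
    """Stand-in for an agent that builds its own context and executes.
    A real implementation would call an LLM with the intent, not a plan."""
    return f"feature implementing: {intent.goal}"

def product_command(intent: Intent, max_attempts: int = 3):
    """Specify intent, let the agent execute unattended, verify, repeat."""
    for attempt in range(1, max_attempts + 1):
        output = run_agent(intent)
        failures = [c.__name__ for c in intent.success_checks if not c(output)]
        if not failures:
            # Intent met: accept without inspecting individual tool calls.
            return output, attempt
        # Feed failed checks back so the next attempt can self-correct.
        intent = Intent(intent.goal + f" (fix: {failures})",
                        intent.success_checks, intent.constraints)
    raise RuntimeError("intent not met within attempt budget")

def mentions_goal(output: str) -> bool:
    return "short clips" in output

result, attempts = product_command(
    Intent(goal="turn long videos into short clips",
           success_checks=[mentions_goal]))
```

The key design choice is that verification lives inside the intent: the human writes `success_checks` once, and from then on judges outcomes, not execution traces.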
We’ll have more tactics and info as we go. In the next post, I’ll explain why this works and why human institutions don’t fundamentally get more effective as they scale, but AI can.
