Context engineering is the way

We’ve…been working on this for over a year…and…he just…he tweeted it out.

Context engineering is the new hotness, and we’re so excited there’s a term for this now!

Karpathy describes context engineering as giving the model "just the right information": not too much, not too little. Threading that needle is the key to building a reliable, successful LLM application. And it happens to be exactly what we’ve been working on at Bolt Foundry.

What is context?

Context is everything your model sees before it sends a response.

It’s your prompt, user messages, tool calls, prior conversation turns, data samples, and graders.
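
To make that concrete: flatten a chat API call and most of this context is literally the message array the model receives. Here’s a rough sketch, with illustrative names and a shape that loosely follows common chat APIs (it is not any particular provider’s schema):

```typescript
// Illustrative only: the context a chat model sees, laid out as messages.
const context = [
  // The prompt: standing instructions for the model.
  { role: "system", content: "You write one-paragraph game recaps." },
  // The user's message for this turn.
  { role: "user", content: "Recap last night's Mets game." },
  // A prior assistant turn that called a tool.
  {
    role: "assistant",
    content: "",
    tool_calls: [{ name: "getBoxScore", arguments: { team: "NYM" } }],
  },
  // The tool's result, which the model also sees before responding.
  { role: "tool", content: '{"home": "NYM", "away": "ATL", "score": "4-2"}' },
];
```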

With too much context (like prompt stuffing), you divert the LLM’s attention and reliability plummets. With too little, LLMs are left guessing and filling in gaps, which is just as unreliable.

Why does this matter?

In human-to-human communication, context is key. We get it through dozens of verbal and nonverbal cues: body language, tone of voice, eye contact, pitch, and more. Notably, remote work sucks because we lose most of these cues on Zoom.

LLMs also need context to perform reliably.

We’ve found the best way to provide this context is to:

  1. Create data samples from examples of success and failure
  2. Build graders from those samples that reinforce what you want the LLM to do
  3. Structure your prompt with clear information hierarchy
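
For example, a data sample can be as small as a record pairing what the model saw with what it produced and a human judgment. Here’s a minimal sketch; the field names and scoring scale are illustrative, not a fixed schema:

```typescript
// Illustrative only: a success and a failure example captured as data samples.
interface GradedSample {
  input: string; // the context the model saw
  output: string; // the response it produced
  score: number; // human judgment, e.g. +1 for success, -1 for failure
  reason: string; // why the human scored it that way
}

const samples: GradedSample[] = [
  {
    input: "Summarize this game recap: ...",
    output: "The Mets beat the Braves 4-2 behind a strong start.",
    score: 1,
    reason: "Accurate and concise; leads with the result.",
  },
  {
    input: "Summarize this game recap: ...",
    output: "Baseball is a sport played with a bat and ball.",
    score: -1,
    reason: "Generic filler that ignores the source material.",
  },
];
```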

Our work with samples, graders, and evals is context engineering at its core. We’re structuring feedback and examples to optimize LLM performance, which is exactly what Karpathy is describing.

What does this look like in practice?

We recently implemented this approach with Fastpitch, an AI-generated sports newsletter. We wrote about it here, but the highlights are:

  1. We started by creating Ground Truth samples of stories collected by the LLM, scored by a human
  2. We created an additional collection of synthetic samples to reinforce the learning
  3. We used those data samples to build a Grader that evaluates story data
  4. We iterated on that Grader until it agreed with the Ground Truth samples
  5. We then used that Grader to adjust our prompt
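
Step 4 is the crux, so here’s a hypothetical sketch of what “agreeing with the Ground Truth samples” can mean as a measurement, reusing the GradedSample shape from the sketch above. Grader and agreementRate are our illustrative names, not Bolt Foundry’s API:

```typescript
// Hypothetical: measure how often the Grader lands on the same side as the
// human score. Iterate on the Grader prompt until this rate is acceptable.
type Grader = (sample: GradedSample) => Promise<number>;

async function agreementRate(
  grade: Grader,
  groundTruth: GradedSample[],
): Promise<number> {
  let agreed = 0;
  for (const sample of groundTruth) {
    const graderScore = await grade(sample);
    if (Math.sign(graderScore) === Math.sign(sample.score)) agreed++;
  }
  return agreed / groundTruth.length;
}
```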

Proper information hierarchy also helps LLMs perform better. We've seen this over and over with customers.

We took one customer from 86% reliability on XML output to 100% in less than an hour with some basic prompt tweaks.
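
To illustrate the kind of information hierarchy we mean (this is not the customer’s actual prompt): role, output format, rules, and data each live in a clearly labeled section, so no instruction is buried mid-paragraph.

```typescript
// Illustrative only: a prompt with an explicit information hierarchy.
function buildPrompt(recap: string): string {
  return `# Role
You extract game results from recaps and return them as XML.

# Output format
Respond with exactly one <game> element and nothing else:
<game><home>TEAM</home><away>TEAM</away><score>HOME-AWAY</score></game>

# Rules
- Use only facts stated in the recap.
- If a field is missing, leave its element empty.

# Recap
${recap}`;
}
```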

This approach to giving the model "just the right information" with human-graded samples, a calibrated Grader, and a correctly formatted prompt is the heart of context engineering.

We're thrilled to see more people talking about this.

If you're interested in learning more about context engineering, and making LLM development more science than art, join our community on Discord.