Context engineering is the way

We’ve…been working on this for over a year…and…he just…he tweeted it out.

Context engineering is the new hotness, and we’re so excited there’s a term for this now!

Karpathy describes context engineering as giving the model "just the right information": not too much, not too little. Threading that needle is the key to building a reliable, successful LLM application. And it happens to be exactly what we’ve been working on at Bolt Foundry.

What is context?

Context is everything your model sees before it sends a response.

It’s your prompt, user messages, tool calls, prior conversation turns, data samples, and graders.
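
To make that concrete: flatten a chat API call and most of this context is literally the message array the model receives. Here’s a rough sketch, with illustrative names and a shape that loosely follows common chat APIs (it is not any particular provider’s schema):

```typescript
// Illustrative only: the context a chat model sees, laid out as messages.
const context = [
  // The prompt: standing instructions for the model.
  { role: "system", content: "You write one-paragraph game recaps." },
  // The user's message for this turn.
  { role: "user", content: "Recap last night's Mets game." },
  // A prior assistant turn that called a tool.
  {
    role: "assistant",
    content: "",
    tool_calls: [{ name: "getBoxScore", arguments: { team: "NYM" } }],
  },
  // The tool's result, which the model also sees before responding.
  { role: "tool", content: '{"home": "NYM", "away": "ATL", "score": "4-2"}' },
];
```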

With too much context (like prompt stuffing), you divert the LLM’s attention and reliability plummets. With too little, LLMs are left guessing and filling in gaps, which is just as unreliable.

Why does this matter?

In human-to-human communication, context is key. We get it through dozens of verbal and nonverbal cues: body language, tone of voice, eye contact, pitch, and more. Notably, remote work sucks because we lose most of these cues on Zoom.

LLMs also need context to perform reliably.

We’ve found the best way to provide this context is to:

  1. Create data samples from examples of success and failure
  2. Build graders from those samples that reinforce what you want the LLM to do
  3. Structure your prompt with clear information hierarchy
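
For example, a data sample can be as small as a record pairing what the model saw with what it produced and a human judgment. Here’s a minimal sketch; the field names and scoring scale are illustrative, not a fixed schema:

```typescript
// Illustrative only: a success and a failure example captured as data samples.
interface GradedSample {
  input: string; // the context the model saw
  output: string; // the response it produced
  score: number; // human judgment, e.g. +1 for success, -1 for failure
  reason: string; // why the human scored it that way
}

const samples: GradedSample[] = [
  {
    input: "Summarize this game recap: ...",
    output: "The Mets beat the Braves 4-2 behind a strong start.",
    score: 1,
    reason: "Accurate and concise; leads with the result.",
  },
  {
    input: "Summarize this game recap: ...",
    output: "Baseball is a sport played with a bat and ball.",
    score: -1,
    reason: "Generic filler that ignores the source material.",
  },
];
```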

Our work with samples, graders, and evals is context engineering at its core. We’re structuring feedback and examples to optimize LLM performance, which is exactly what Karpathy is describing.

What does this look like in practice?

We recently implemented this approach with Fastpitch, an AI-generated sports newsletter. We wrote about it here, but the highlights are:

  1. We started by creating Ground Truth samples of stories collected by the LLM, scored by a human
  2. We created an additional collection of synthetic samples to reinforce the learning
  3. We used those data samples to build a Grader that evaluates story data
  4. We iterated on that Grader until it agreed with the Ground Truth samples
  5. We then used that Grader to adjust our prompt
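
Step 4 is the crux, so here’s a hypothetical sketch of what “agreeing with the Ground Truth samples” can mean as a measurement, reusing the GradedSample shape from the sketch above. Grader and agreementRate are our illustrative names, not Bolt Foundry’s API:

```typescript
// Hypothetical: measure how often the Grader lands on the same side as the
// human score. Iterate on the Grader prompt until this rate is acceptable.
type Grader = (sample: GradedSample) => Promise<number>;

async function agreementRate(
  grade: Grader,
  groundTruth: GradedSample[],
): Promise<number> {
  let agreed = 0;
  for (const sample of groundTruth) {
    const graderScore = await grade(sample);
    if (Math.sign(graderScore) === Math.sign(sample.score)) agreed++;
  }
  return agreed / groundTruth.length;
}
```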

Proper information hierarchy also helps LLMs perform better. We've seen this over and over with customers.

We took one customer from 86% reliability on XML output to 100% in less than an hour with some basic prompt tweaks.
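
To illustrate the kind of information hierarchy we mean (this is not the customer’s actual prompt): role, output format, rules, and data each live in a clearly labeled section, so no instruction is buried mid-paragraph.

```typescript
// Illustrative only: a prompt with an explicit information hierarchy.
function buildPrompt(recap: string): string {
  return `# Role
You extract game results from recaps and return them as XML.

# Output format
Respond with exactly one <game> element and nothing else:
<game><home>TEAM</home><away>TEAM</away><score>HOME-AWAY</score></game>

# Rules
- Use only facts stated in the recap.
- If a field is missing, leave its element empty.

# Recap
${recap}`;
}
```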

This approach to giving the model "just the right information" with human-graded samples, a calibrated Grader, and a correctly formatted prompt is the heart of context engineering.

We're thrilled to see more people talking about this.

If you're interested in learning more about context engineering, and making LLM development more science than art, join our community on Discord.