Most teams manage prompts the way they managed code before version control — scattered across Slack messages, buried in notebooks, copy-pasted between services. This doesn’t scale.
Prompts Are Code
A prompt is a function: it takes inputs, produces outputs, has edge cases, and breaks in production. Treat it accordingly.
SUMMARY_PROMPT_V2 = """
You are a technical writer. Summarize the following article.
Rules:
- Maximum 3 sentences
- Lead with the key insight
- Preserve technical accuracy
- Do not add information not present in the source
Article:
{article_text}
Summary:
"""
# v1: Basic summarization — produced vague outputs
# v2: Added rules and constraints — 40% improvement in user ratings
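If a prompt is a function, it helps to wrap it in one. A minimal sketch of rendering a template with fail-fast input checks; the `render_prompt` helper is hypothetical, not from any library:

```python
def render_prompt(template: str, **fields: str) -> str:
    """Fill a prompt template, failing fast on empty fields.

    Raising here keeps a blank article from silently producing
    a prompt that asks the model to summarize nothing.
    """
    for name, value in fields.items():
        if not value.strip():
            raise ValueError(f"prompt field {name!r} is empty")
    return template.format(**fields)

prompt = render_prompt(
    "Summarize:\n{article_text}\nSummary:",
    article_text="Version control changed how teams ship code.",
)
```

The same guard logic works for `SUMMARY_PROMPT_V2` above, since `str.format` fills `{article_text}` the same way.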
Prompt regression is silent and deadly. A seemingly innocent change to a system prompt can degrade output quality for specific edge cases without affecting average performance. Without automated evals, you won’t notice until users complain — and by then, you’ve already shipped the regression.
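Catching those silent regressions means scoring each golden example individually rather than averaging. A sketch of that idea; `eval_regression`, `run_prompt`, and `score` are illustrative names, and the model call is stubbed:

```python
from typing import Callable

def eval_regression(
    golden: list[dict],                   # [{"input": ..., "expected": ...}]
    run_prompt: Callable[[str], str],     # calls the model (stubbed below)
    score: Callable[[str, str], float],   # 1.0 = perfect
    threshold: float = 0.7,
) -> list[dict]:
    """Return every golden example scoring below threshold.

    Reporting per-example failures surfaces edge-case regressions
    that a healthy average score would hide.
    """
    failures = []
    for ex in golden:
        output = run_prompt(ex["input"])
        s = score(output, ex["expected"])
        if s < threshold:
            failures.append({"input": ex["input"], "score": s})
    return failures

# Deterministic stub in place of a real model call:
golden = [{"input": "a", "expected": "A"}, {"input": "b", "expected": "B"}]
failures = eval_regression(
    golden,
    run_prompt=str.upper,
    score=lambda out, exp: 1.0 if out == exp else 0.0,
)
# every example passes, so failures is empty
```

Run this in CI on every prompt change, and a regression on even one edge case blocks the merge instead of waiting for a user complaint.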
Building a Prompt Workflow
Every prompt in our system goes through this lifecycle:
1. Draft — Write the initial prompt with clear instructions and constraints
2. Test — Run against a golden dataset of 50+ examples
3. Review — Peer review for ambiguity and edge cases
4. Version — Tag and store with metadata (author, intent, eval scores)
5. Deploy — Roll out with feature flags, monitor eval metrics
6. Iterate — Improve based on production data, loop back to step 2
What Gets Measured Gets Managed
Track these metrics for every prompt version:
- Task completion rate
- Output consistency across runs
- Latency and token usage
- User satisfaction signals
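Two of those metrics, consistency and latency, can be measured with nothing but repeated calls. A sketch under the assumption that the model call is exposed as a plain function (stubbed here with a deterministic one); `measure` is an illustrative name:

```python
import time
from collections import Counter
from typing import Callable

def measure(run_prompt: Callable[[str], str], prompt: str, n: int = 5) -> dict:
    """Measure output consistency and mean latency over n runs.

    Consistency is the fraction of runs agreeing with the most
    common output; 1.0 means fully deterministic.
    """
    outputs, latencies = [], []
    for _ in range(n):
        start = time.perf_counter()
        outputs.append(run_prompt(prompt))
        latencies.append(time.perf_counter() - start)
    _, count = Counter(outputs).most_common(1)[0]
    return {
        "consistency": count / n,
        "mean_latency_s": sum(latencies) / n,
    }

# Deterministic stub in place of a real model call:
metrics = measure(str.upper, "hello")   # consistency is 1.0
```

Logged per prompt version, these numbers turn "v2 feels better" into a comparison you can defend in review.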
The discipline isn’t new. We’re just applying it to a new interface.