Implement the following plan:
The benchmark API currently uses a single LLM call per batch. Manu provided 5 production prompts (data/benchmark/prompts/) that define a multi-step matching funnel. We need to replace the single-call approach with this full pipeline: Enhance → Subject Score → Area/Standard Score → Rerank.
No client-side changes needed — the API contract ({ results: BenchmarkMatch[] }) stays the same.
| Action | File |
|---|---|
| Create | lib/benchmark.ts — Pipeline orchestrator, prompt loader, schemas, step functions |
| Rewrite | app/api/benchmark/route.ts — Delegate to runBenchmarkPipeline() |
| Cleanup | lib/prompts.ts — Remove now-unused buildBenchmarkMatchingPrompt() |
(lib/benchmark.ts)
- loadBenchmarkPrompts() reads all 5 .txt files from data/benchmark/prompts/, caches at module level
- interpolatePrompt(template, vars) replaces {var} and {{var}} placeholders
- enhance_comment_prompt.txt with {comment} = batch listing
- { enhanced_comments: { id, text }[] }
- CurriculumStandard[] via extractSubjects() (group by subject, combine summary values)
- subject_prompt_v3.txt with {subjects} = formatted subject list, {comment} = batch, {{feedback}} = ""
- { scores: { observationId, subject, score (0-4), explanation }[] }
- [id] Area, Standard via formatAreasStandardsForPrompt()
- area_standard_prompt_v3.txt with {areas_standards}, {comment}, {{feedback}}
- { scores: { observationId, standardId, score (0-4), explanation }[] }
- Subject > Area > Standard per observation
- rerank_prompt_v3.txt with {comment}, {matches}, {{feedback}}
- { results: { observationId, standardId, score (0-10), reasoning }[] }
- BenchmarkMatch[]: same format as current API
(app/api/benchmark/route.ts)
- generateObject / Zod schema code
- CurriculumStandard[] objects (instead of context string)
- runBenchmarkPipeline(openrouter, observations, standards, feedback="")
- { results } unchanged
- loadTemplates(), crewMatchesLevels(): existing functions
- generateObject() + Zod + openrouter("google/gemini-2.5-flash"): same pattern as proposals route
- fs.readFile for prompt loading: same pattern as loadTemplate() in parse-curriculum.ts
- system_prompt.txt used as system message for all 4 generateObject() calls
- npx tsc --noEmit: no new errors
- npx next build: passes, /api/benchmark still registered

If you need specific details from before exiting plan mode (like exact code snippets, error messages, or content you generated), read the full transcript at: /Users/gabrielduarte/.claude/projects/-Users-gabrielduarte-Documents-GitHub-Volantis-create-curriculum/d16ef590-ffca-4050-ab83-3457efad18c0.jsonl
I'll start by reading the key files I need to understand before implementing.
Now let me also look at the proposals route to understand the generateObject pattern used there.
Now I have a clear picture of everything. Let me also check what buildBenchmarkMatchingPrompt is referenced from, to ensure I clean up all imports.
Good — only used in the benchmark route. Now let me create the task list and start implementing.
Now let me create the main pipeline orchestrator. This is the largest piece of work.
Now rewrite the benchmark route to delegate to the pipeline.
Now clean up lib/prompts.ts — remove buildBenchmarkMatchingPrompt().
Let me run the TypeScript check first.
The pre-existing errors are still there (chat-shell.tsx and chat/route.ts). I only need to fix the one new error in lib/benchmark.ts — LanguageModelV1 should be LanguageModel.
Now replace all usages of LanguageModelV1 with LanguageModel.
All remaining errors are the pre-existing ones (chat-shell.tsx and chat/route.ts). No new errors from our changes. Now let me run the build.
Build passes cleanly. /api/benchmark is still registered as a dynamic route.
All done. Here's a summary of what was implemented:
1. Created lib/benchmark.ts — Full 4-step pipeline orchestrator:
- .txt files from data/benchmark/prompts/, with interpolatePrompt() for {var}/{{var}} substitution
- enhance_comment_prompt.txt. Falls back to originals on failure.
- subject_prompt_v3.txt. Filters to score >= 3 (RELEVANT). Falls back to treating all subjects as relevant.
- area_standard_prompt_v3.txt. Falls back to empty matches.
- rerank_prompt_v3.txt. Falls back to normalized step 3 scores.
- BenchmarkMatch[]
2. Rewrote app/api/benchmark/route.ts: Removed all inline generateObject/Zod schema code. Custom curriculum JSON is now converted into CurriculumStandard[] objects (not a context string). Delegates to runBenchmarkPipeline(). API contract ({ results: BenchmarkMatch[] }) is unchanged.
3. Cleaned lib/prompts.ts — Removed buildBenchmarkMatchingPrompt() (now unused).
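For illustration, the {var}/{{var}} substitution mentioned in the summary could look roughly like the sketch below. This is not the code that was written to lib/benchmark.ts; the `PromptVars` type and the choice to leave unknown placeholders untouched are assumptions made for the example.

```typescript
// Hypothetical sketch, not the actual lib/benchmark.ts implementation.
type PromptVars = Record<string, string>;

/**
 * Replace {{name}} and {name} placeholders with values from `vars`.
 * Placeholders without a matching key are left as-is (an assumption),
 * so a missing variable stays visible in the rendered prompt.
 */
function interpolatePrompt(template: string, vars: PromptVars): string {
  // The double-brace alternative comes first so {{feedback}} is consumed
  // whole instead of being read as {feedback} inside stray braces.
  const placeholder = /\{\{(\w+)\}\}|\{(\w+)\}/g;
  return template.replace(
    placeholder,
    (match: string, doubleKey?: string, singleKey?: string): string => {
      const key = doubleKey ?? singleKey;
      if (key === undefined) return match;
      const value = Object.prototype.hasOwnProperty.call(vars, key)
        ? vars[key]
        : undefined;
      return value ?? match;
    },
  );
}

// Sample values below are made up for the example.
const renderedPrompt = interpolatePrompt(
  "Subjects:\n{subjects}\n\nObservations:\n{comment}\n\nFeedback: {{feedback}}",
  {
    subjects: "Meta Skills Innovation",
    comment: "[obs-1]: Led the group discussion.",
    feedback: "",
  },
);
console.log(renderedPrompt);
```

Because the pattern only matches word characters between braces, JSON-like snippets inside a prompt (for example `{ "id": 1 }`) would pass through unchanged.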
- npx tsc --noEmit: no new errors (only pre-existing chat-shell.tsx/chat/route.ts ones)
- npx next build: passes, all routes registered correctly

Base directory for this skill: /Users/gabrielduarte/.claude/skills/prompt-engineering
Advanced prompt engineering techniques to maximize LLM performance, reliability, and controllability.
Teach the model by showing examples instead of explaining rules. Include 2-5 input-output pairs that demonstrate the desired behavior. Use when you need consistent formatting, specific reasoning patterns, or handling of edge cases. More examples improve accuracy but consume tokens—balance based on task complexity.
Example:
Request step-by-step reasoning before the final answer. Add "Let's think step by step" (zero-shot) or include example reasoning traces (few-shot). Use for complex problems requiring multi-step logic, mathematical reasoning, or when you need to verify the model's thought process. Improves accuracy on analytical tasks by 30-50%.
Example:
Systematically improve prompts through testing and refinement. Start simple, measure performance (accuracy, consistency, token usage), then iterate. Test on diverse inputs including edge cases. Use A/B testing to compare variations. Critical for production prompts where consistency and cost matter.
Example:
Build reusable prompt structures with variables, conditional sections, and modular components. Use for multi-turn conversations, role-based interactions, or when the same pattern applies to different inputs. Reduces duplication and ensures consistency across similar tasks.
Example:
Set global behavior and constraints that persist across the conversation. Define the model's role, expertise level, output format, and safety guidelines. Use system prompts for stable instructions that shouldn't change turn-to-turn, freeing up user message tokens for variable content.
Example:
Start with simple prompts, add complexity only when needed:
Level 1: Direct instruction
Level 2: Add constraints
Level 3: Add reasoning
Level 4: Add examples
[System Context] → [Task Instruction] → [Examples] → [Input Data] → [Output Format]
Build prompts that gracefully handle failures:
ARGUMENTS: analyze the prompts of the benchmark
I'll analyze the benchmark prompts thoroughly — reading each one and evaluating them against prompt engineering best practices.
Now let me do a deep analysis of each prompt.
The 4-step funnel (Enhance → Subject → Area/Standard → Rerank) is a strong pattern — it progressively narrows the search space, reducing token cost and improving precision. However, several prompts have issues worth addressing.
system_prompt.txt (Grade: C)
Issues:
- [id] Area, Standard, match list). This is misleading context.
- generateObject() with a schema, so the model can't deviate anyway.
Recommendation: Rewrite as a focused 3-line role + task definition. Drop the security theater and the input format description (each step prompt already specifies its own format).
enhance_comment_prompt.txt (Grade: B-)
Issues:
- [id]: text pairs. The prompt doesn't tell the model to process multiple comments or preserve IDs.
- { enhanced_comments: [{ id, text }] } but the prompt never mentions returning IDs or an array. The model has to figure this out from the schema alone.
Recommendation: Add 2 few-shot examples (one that needs fixing, one that doesn't). Explicitly describe the batch input format and the expected output structure.
subject_prompt_v3.txt (Grade: A-)
Strengths:
- ⭑ section headers
Issues:
- explanation fields. There's no top-level summary field. This instruction goes nowhere.
- {comment} placeholder contains multiple [id]: text pairs; the prompt should say "For each observation, evaluate every subject."
Recommendation: Remove instruction #4 (or add a summary field to the schema). Clarify batch processing. Add 1 few-shot example.
area_standard_prompt_v3.txt (Grade: B+)
Strengths:
Issues:
Recommendation: Fix "subject" → "area standard" in RELEVANT description. Add filtering context. Add 1 example.
rerank_prompt_v3.txt (Grade: A-)
Strengths:
Issues:
Recommendation: Add 1 compact few-shot example. Add "END OF MATCHES" delimiter.
| Issue | Impact | Fix Effort |
|---|---|---|
| System prompt is generic/misleading for step-specific tasks | Medium — adds noise, wastes tokens | Low |
| No few-shot examples in any prompt | High — inconsistent scoring calibration | Medium |
| Batch vs. singular mismatch in enhance + subject prompts | Medium — model infers from schema but may hallucinate IDs | Low |
| Copy-paste error in area_standard "subject" label | Low — model likely infers correct intent | Trivial |
| Subject prompt instruction #4 has no schema backing | Low — model ignores or jams it into explanation | Trivial |
Want me to implement any of these improvements?
implement the changes
Let me read all prompts to make precise edits.
Good, I have all prompts. Now let me implement all 5 prompt rewrites in parallel.
All 5 prompts rewritten. Now let me mark tasks and verify the build.
Only pre-existing errors. Let me also verify the build.
Build passes cleanly.
All prompts updated. Here's a summary of every change:
1. system_prompt.txt — Rewrote from scratch
- generateObject schema enforcement)
2. enhance_comment_prompt.txt: Restructured + added examples
- [id]: observation text
3. subject_prompt_v3.txt: Fixed + added example
4. area_standard_prompt_v3.txt — Fixed copy-paste error + added context + example
5. rerank_prompt_v3.txt — Added example + delimiter
- END OF MATCHES delimiter to close the variable-length matches section

will these prompts work with the curriculum structure that it needs to make the matches?
Good question — let me trace the actual data through each step and verify the formats match.
Now let me look at real curriculum data to see the actual field values.
I found a real issue. Let me trace the exact data through each step.
The code in Step 2 does exact string matching between the model's returned subject field and real CurriculumStandard.subject values:
Real subject names are: "Meta Skills Innovation", "IB Learner Profile Communication", "Approaches to Teaching and Learning" — but the few-shot examples in subject and rerank prompts use shortened names like "Innovation", "Self-Management". If the model follows the example pattern and returns short names, Step 3 filtering breaks silently — no standards pass through, zero matches.
Same issue exists in the rerank prompt where the example shows Innovation > Critical thinking > ... but real data is Meta Skills Innovation > Critical thinking > ....
Let me fix the prompts and add an explicit exactness instruction.
Good. Now I have the full picture. Three fixes needed:
- subject_prompt_v3.txt: Use real full subject names in example + add instruction to return names exactly
- rerank_prompt_v3.txt: Use full subject names in example matches
- area_standard_prompt_v3.txt: Example is fine (uses IDs and area/standard format correctly, no subject names involved)

Now let me verify the build still passes.
Build passes. Here's what I found and fixed:
The pipeline does exact string matching on subject names returned by the model:
Step 2: model returns subject: "Innovation" → stored in Set
Step 3: code checks allRelevantSubjects.has("Meta Skills Innovation") → FALSE → 0 standards pass
Real subject names are compound like "Meta Skills Innovation", "IB Learner Profile Communication", "Approaches to Teaching and Learning" — but the few-shot examples showed short names like "Innovation", teaching the model to shorten them.
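The failure mode above can be reproduced with a small sketch. The field names on `CurriculumStandard` and the helper `selectRelevantStandards` are hypothetical, and the standards data is made up; the point is only to show how an exact-match filter silently drops everything when the model shortens a subject name, and how collecting unrecognised names makes that visible.

```typescript
// Hypothetical shapes: the real CurriculumStandard type may carry more fields.
interface CurriculumStandard {
  id: string;
  subject: string;
  area: string;
  standard: string;
}

interface SubjectScore {
  observationId: string;
  subject: string;
  score: number; // 0-4; 3 and above is treated as relevant
}

const RELEVANT_THRESHOLD = 3;

/**
 * Keep only standards whose subject was scored as relevant, using exact
 * string comparison. Subject names the model returned that do not exist
 * in the curriculum are reported instead of being dropped silently.
 */
function selectRelevantStandards(
  scores: SubjectScore[],
  standards: CurriculumStandard[],
): { standards: CurriculumStandard[]; unknownSubjects: string[] } {
  const knownSubjects = new Set(standards.map((s) => s.subject));
  const relevantSubjects = new Set<string>();
  const unknownSubjects = new Set<string>();

  for (const entry of scores) {
    if (entry.score < RELEVANT_THRESHOLD) continue;
    if (knownSubjects.has(entry.subject)) {
      relevantSubjects.add(entry.subject);
    } else {
      unknownSubjects.add(entry.subject);
    }
  }

  return {
    standards: standards.filter((s) => relevantSubjects.has(s.subject)),
    unknownSubjects: Array.from(unknownSubjects),
  };
}

// Made-up sample data for the example.
const sampleStandards: CurriculumStandard[] = [
  {
    id: "std-1",
    subject: "Meta Skills Innovation",
    area: "Critical thinking",
    standard: "Sample standard text",
  },
];

// Shortened name: nothing passes the filter, and "Innovation" is reported.
const shortened = selectRelevantStandards(
  [{ observationId: "obs-1", subject: "Innovation", score: 4 }],
  sampleStandards,
);
console.log(shortened.standards.length, shortened.unknownSubjects);
```

Logging or asserting on `unknownSubjects` would be one way to catch a regression of this kind early, in addition to the prompt-level fix.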
subject_prompt_v3.txt:
- Meta Skills Innovation, Meta Skills Self-Management, Meta Skills Social Intelligence with realistic summary text

rerank_prompt_v3.txt:
- Meta Skills Innovation > Critical thinking > ... instead of Innovation > Critical thinking > ...

area_standard_prompt_v3.txt: no change needed. This step uses [id] Area, Standard format (no subject names) and returns standardId which maps directly to real IDs. The example already matches the real data format.
Great, so now I will create a repo to select these 200 observations to benchmark the curriculum. How should I do it?
This is a strategy question more than a code one, so let me think through what makes a good benchmark set.
Core structure: A curated set of real (or realistic) teacher observations, each tagged with expected matches so you can measure pipeline accuracy.
observations/
  observations.json            ← the 200 entries (what your app consumes)
  observations-annotated.json  ← same entries + expected_standards for scoring
  README.md                    ← criteria, coverage goals, contribution guide
Each annotated entry:
The plain observations.json (without annotations) is what gets copied into your app's data/benchmark/ for actual runs. The annotated version lets you score precision/recall.
| Category | Target | Why |
|---|---|---|
| Each subject (8 across templates) | ~15-20 obs each | Ensures no subject is untested |
| Each education level | ~50 per level | Language/complexity varies drastically |
| Multi-skill observations | ~30 | "She led the group and came up with a creative solution" — tests whether pipeline picks up 2+ standards |
| Vague/ambiguous observations | ~20 | "He did great today" — should return low confidence or no match |
| No-match observations | ~15 | Observations outside curriculum scope — tests precision |
| Keyword traps | ~15 | "The student was creative in avoiding work" — tests semantic understanding vs keyword matching |
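One way to keep track of the targets in the table is a small coverage check over tagged observations. The sketch below assumes each annotated entry carries a `tags` array; the tag names and the `coverageGaps` helper are hypothetical and only illustrate the idea.

```typescript
// Hypothetical: assumes annotated observations carry free-form tags.
interface TaggedObservation {
  id: string;
  tags: string[];
}

type CoverageTargets = Record<string, number>;

/**
 * Count observations per tag and return, for every tag that is still
 * below its target, how many more observations are needed.
 */
function coverageGaps(
  observations: TaggedObservation[],
  targets: CoverageTargets,
): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const observation of observations) {
    for (const tag of observation.tags) {
      counts[tag] = (counts[tag] ?? 0) + 1;
    }
  }

  const gaps: Record<string, number> = {};
  for (const tag of Object.keys(targets)) {
    const missing = (targets[tag] ?? 0) - (counts[tag] ?? 0);
    if (missing > 0) gaps[tag] = missing;
  }
  return gaps;
}

// Targets taken from the table above; the tag spellings are assumptions.
const exampleTargets: CoverageTargets = {
  "multi-skill": 30,
  vague: 20,
  "no-match": 15,
  "keyword-trap": 15,
};

console.log(coverageGaps([{ id: "obs-1", tags: ["vague"] }], exampleTargets));
```

Running this against the annotated file before each benchmark run would show which categories still need real or generated observations.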
The biggest question. Three options:
From Manu/teachers directly — Ask for anonymized real observations from existing Volantis users. Best quality, most realistic. Even 50 real ones are worth more than 200 synthetic ones.
LLM-generated with human review — Generate candidate observations per subject/level/tag, then have a teacher review and edit. Fast to produce, decent quality if reviewed.
Hybrid — Start with 50-80 real ones, fill gaps with generated ones for underrepresented subjects/levels.
The repo should include a simple scoring script that:
- expected_standards

This lets you measure whether prompt changes actually improve accuracy.
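A minimal sketch of such a scoring step, assuming the annotated file stores `expected_standards` as an array of standard IDs and that pipeline results expose `observationId` and `standardId`. The `scoreMatches` helper and the convention of returning 1 when a denominator is zero are assumptions for the example, not part of the existing code.

```typescript
// Hypothetical shapes for the external benchmark repo.
interface AnnotatedObservation {
  id: string;
  text: string;
  expected_standards: string[];
}

interface PredictedMatch {
  observationId: string;
  standardId: string;
}

interface PipelineScore {
  precision: number;
  recall: number;
  truePositives: number;
  predictedCount: number;
  expectedCount: number;
}

/**
 * Micro-averaged precision and recall over all annotated observations.
 * Predictions for observation IDs that are not annotated are ignored.
 */
function scoreMatches(
  annotated: AnnotatedObservation[],
  predicted: PredictedMatch[],
): PipelineScore {
  const predictedByObservation = new Map<string, Set<string>>();
  for (const match of predicted) {
    const ids = predictedByObservation.get(match.observationId) ?? new Set<string>();
    ids.add(match.standardId);
    predictedByObservation.set(match.observationId, ids);
  }

  let truePositives = 0;
  let predictedCount = 0;
  let expectedCount = 0;
  for (const observation of annotated) {
    const expected = new Set(observation.expected_standards);
    const got = predictedByObservation.get(observation.id) ?? new Set<string>();
    expectedCount += expected.size;
    predictedCount += got.size;
    got.forEach((standardId) => {
      if (expected.has(standardId)) truePositives += 1;
    });
  }

  return {
    // Assumption: an empty denominator counts as a perfect score.
    precision: predictedCount === 0 ? 1 : truePositives / predictedCount,
    recall: expectedCount === 0 ? 1 : truePositives / expectedCount,
    truePositives,
    predictedCount,
    expectedCount,
  };
}
```

No-match observations (empty `expected_standards`) only affect precision here: any standard the pipeline returns for them counts as a false positive.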
Want me to scaffold the repo structure with the annotated schema, a scoring script, and a generation helper to fill coverage gaps?
what's the structure that observations.json should have?
Let me check what the app actually consumes.
Now let me check how the shell consumes it.
The contract is simple. The app expects:
That's it — just id and text. The BenchmarkObservation type also has an optional source field but nothing reads it currently:
So the file your external repo produces just needs to be a JSON file with that { version, observations: [{ id, text }] } shape. Any extra fields (like expected_standards or tags for your scoring) would live only in the annotated version in the external repo and get stripped when you copy the file over.
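As a rough sketch of that copy step, the types below mirror the { version, observations: [{ id, text }] } shape described above. The type of `version`, the `tags` field, and the `stripAnnotations` helper are assumptions; only `id`, `text`, the optional `source`, and `expected_standards` are named in this conversation.

```typescript
// What the app consumes.
interface BenchmarkObservation {
  id: string;
  text: string;
  source?: string; // optional, currently unused by the app
}

interface ObservationsFile {
  version: string | number; // assumption: the exact type is not shown here
  observations: BenchmarkObservation[];
}

// Hypothetical annotated shape kept only in the external repo.
interface AnnotatedObservationEntry extends BenchmarkObservation {
  expected_standards: string[];
  tags?: string[];
}

interface AnnotatedObservationsFile {
  version: string | number;
  observations: AnnotatedObservationEntry[];
}

/** Drop every annotation field, keeping only what the app reads. */
function stripAnnotations(file: AnnotatedObservationsFile): ObservationsFile {
  return {
    version: file.version,
    observations: file.observations.map(({ id, text }) => ({ id, text })),
  };
}

// Made-up sample entry for the example.
const strippedFile = stripAnnotations({
  version: 1,
  observations: [
    {
      id: "obs-1",
      text: "She led the group and came up with a creative solution.",
      expected_standards: ["std-1"],
      tags: ["multi-skill"],
    },
  ],
});
console.log(JSON.stringify(strippedFile, null, 2));
```

Writing the result of `stripAnnotations` to observations.json keeps the two files in sync from a single annotated source.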
create an .md file describing how the observations need to be formatted
Created data/benchmark/OBSERVATIONS.md with the file structure, field definitions, examples of good vs weak observations, and a brief explanation of how the pipeline consumes them.