Good, I have all prompts. Now let me implement all 5 prompt rewrites in parallel.
All 5 prompts rewritten. Now let me mark tasks and verify the build.
Only pre-existing errors. Let me also verify the build.
Build passes cleanly.
All prompts updated. Here's a summary of every change:
1. system_prompt.txt — Rewrote from scratch
generateObject schema enforcement)
2. enhance_comment_prompt.txt — Restructured + added examples
[id]: observation text
3. subject_prompt_v3.txt — Fixed + added example
4. area_standard_prompt_v3.txt — Fixed copy-paste error + added context + example
5. rerank_prompt_v3.txt — Added example + delimiter
END OF MATCHES delimiter to close the variable-length matches section
Now let me also look at the proposals route to understand the generateObject pattern used there.
Good question — let me trace the actual data through each step and verify the formats match.
Now let me look at real curriculum data to see the actual field values.
I found a real issue. Let me trace the exact data through each step.
The code in Step 2 does exact string matching between the model's returned subject field and real CurriculumStandard.subject values:
Real subject names are: "Meta Skills Innovation", "IB Learner Profile Communication", "Approaches to Teaching and Learning" — but the few-shot examples in subject and rerank prompts use shortened names like "Innovation", "Self-Management". If the model follows the example pattern and returns short names, Step 3 filtering breaks silently — no standards pass through, zero matches.
Same issue exists in the rerank prompt where the example shows Innovation > Critical thinking > ... but real data is Meta Skills Innovation > Critical thinking > ....
Let me fix the prompts and add an explicit exactness instruction.
Good. Now I have the full picture. Three fixes needed:
subject_prompt_v3.txt: Use real full subject names in example + add instruction to return names exactly
rerank_prompt_v3.txt: Use full subject names in example matches
area_standard_prompt_v3.txt: Example is fine (uses IDs and area/standard format correctly, no subject names involved)
Now let me verify the build still passes.
Build passes. Here's what I found and fixed:
The pipeline does exact string matching on subject names returned by the model:
Step 2: model returns subject: "Innovation" → stored in Set
Step 3: code checks allRelevantSubjects.has("Meta Skills Innovation") → FALSE → 0 standards pass
Real subject names are compound like "Meta Skills Innovation", "IB Learner Profile Communication", "Approaches to Teaching and Learning" — but the few-shot examples showed short names like "Innovation", teaching the model to shorten them.
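The failure mode above can be reproduced in a few lines. This is a minimal sketch, not the pipeline's actual code: the `CurriculumStandard` shape, the `filterBySubjects` helper, and the sample data are assumptions for illustration. Only the `subject` field and the `allRelevantSubjects` set come from the trace above.

```typescript
// Hypothetical minimal shape; the real CurriculumStandard type likely has more fields.
interface CurriculumStandard {
  id: string;
  subject: string;
  area: string;
}

// Assumed helper: keep only standards whose subject was marked relevant in Step 2.
function filterBySubjects(
  standards: CurriculumStandard[],
  allRelevantSubjects: Set<string>,
): CurriculumStandard[] {
  // Exact, case-sensitive comparison: "Innovation" !== "Meta Skills Innovation".
  return standards.filter((s) => allRelevantSubjects.has(s.subject));
}

// Illustrative data only; IDs and areas are placeholders.
const sampleStandards: CurriculumStandard[] = [
  { id: "s1", subject: "Meta Skills Innovation", area: "Critical thinking" },
  { id: "s2", subject: "IB Learner Profile Communication", area: "placeholder area" },
];

// The model followed the old few-shot example and shortened the name.
const withShortName = filterBySubjects(sampleStandards, new Set(["Innovation"]));

// The model returned the name exactly as it appears in the curriculum data.
const withExactName = filterBySubjects(
  sampleStandards,
  new Set(["Meta Skills Innovation"]),
);
```

With the shortened name nothing passes the filter and no error is raised, which is why the break is silent. With the exact name, the one matching standard passes through.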
subject_prompt_v3.txt:
Meta Skills Innovation, Meta Skills Self-Management, Meta Skills Social Intelligence with realistic summary text
rerank_prompt_v3.txt:
Meta Skills Innovation > Critical thinking > ... instead of Innovation > Critical thinking > ...
area_standard_prompt_v3.txt — no change needed. This step uses [id] Area, Standard format (no subject names) and returns standardId which maps directly to real IDs. The example already matches the real data format.
This is a strategy question more than a code one, so let me think through what makes a good benchmark set.
Core structure: A curated set of real (or realistic) teacher observations, each tagged with expected matches so you can measure pipeline accuracy.
observations/
  observations.json            ← the 200 entries (what your app consumes)
  observations-annotated.json  ← same entries + expected_standards for scoring
  README.md                    ← criteria, coverage goals, contribution guide
Each annotated entry:
The plain observations.json (without annotations) is what gets copied into your app's data/benchmark/ for actual runs. The annotated version lets you score precision/recall.
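One possible shape for an annotated entry is sketched here. Treat it as an assumption rather than a fixed schema: only the field names `id`, `text`, `expected_standards`, and `tags` appear in this conversation. Their types, the `isAnnotatedObservation` guard, and the example IDs are hypothetical.

```typescript
// Fields the app reads.
interface BenchmarkObservation {
  id: string;
  text: string;
}

// Assumed annotation layout; the field types are a guess.
interface AnnotatedObservation extends BenchmarkObservation {
  expected_standards: string[]; // IDs a correct run should return; [] means no match expected
  tags?: string[]; // e.g. coverage categories
}

// Runtime check for entries loaded from observations-annotated.json.
function isAnnotatedObservation(value: unknown): value is AnnotatedObservation {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const isStringArray = (x: unknown): boolean =>
    Array.isArray(x) && x.every((item) => typeof item === "string");
  return (
    typeof v["id"] === "string" &&
    typeof v["text"] === "string" &&
    isStringArray(v["expected_standards"]) &&
    (v["tags"] === undefined || isStringArray(v["tags"]))
  );
}

// Placeholder IDs: "std-001" and "std-002" are invented for illustration.
const exampleEntry: AnnotatedObservation = {
  id: "obs-001",
  text: "She led the group and came up with a creative solution",
  expected_standards: ["std-001", "std-002"],
  tags: ["multi-skill"],
};
```

An empty `expected_standards` array is a convenient way to encode the vague and no-match cases, since the scorer can then check that the pipeline returns nothing.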
| Category | Target | Why |
|---|---|---|
| Each subject (8 across templates) | ~15-20 obs each | Ensures no subject is untested |
| Each education level | ~50 per level | Language/complexity varies drastically |
| Multi-skill observations | ~30 | "She led the group and came up with a creative solution" — tests whether pipeline picks up 2+ standards |
| Vague/ambiguous observations | ~20 | "He did great today" — should return low confidence or no match |
| No-match observations | ~15 | Observations outside curriculum scope — tests precision |
| Keyword traps | ~15 | "The student was creative in avoiding work" — tests semantic understanding vs keyword matching |
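A small helper can report how far the current set is from these targets. This is a sketch under assumptions: the tag names (`multi-skill`, `vague`, `no-match`, `keyword-trap`) and the `coverageGaps` function are hypothetical, and the numbers are the approximate targets from the table.

```typescript
interface TaggedObservation {
  id: string;
  tags: string[];
}

// Approximate targets from the table above; the tag names are invented here.
const coverageTargets: Record<string, number> = {
  "multi-skill": 30,
  "vague": 20,
  "no-match": 15,
  "keyword-trap": 15,
};

// Returns, per tag, how many more observations are needed to reach the target.
// Tags that already meet their target are omitted from the result.
function coverageGaps(
  observations: TaggedObservation[],
  targets: Record<string, number>,
): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const obs of observations) {
    for (const tag of obs.tags) {
      counts[tag] = (counts[tag] ?? 0) + 1;
    }
  }
  const gaps: Record<string, number> = {};
  for (const tag of Object.keys(targets)) {
    const missing = (targets[tag] ?? 0) - (counts[tag] ?? 0);
    if (missing > 0) gaps[tag] = missing;
  }
  return gaps;
}
```

Running it after each batch of new observations shows which categories still need generated or teacher-supplied entries.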
The biggest question. Three options:
From Manu/teachers directly — Ask for anonymized real observations from existing Volantis users. Best quality, most realistic. Even 50 real ones are worth more than 200 synthetic ones.
LLM-generated with human review — Generate candidate observations per subject/level/tag, then have a teacher review and edit. Fast to produce, decent quality if reviewed.
Hybrid — Start with 50-80 real ones, fill gaps with generated ones for underrepresented subjects/levels.
The repo should include a simple scoring script that:
expected_standards
This lets you measure whether prompt changes actually improve accuracy.
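The core of such a scorer could look like the sketch below, which compares predicted standard IDs against `expected_standards` for one observation. The `scoreObservation` name and the convention for empty sets are assumptions, not something the pipeline defines.

```typescript
interface Score {
  precision: number;
  recall: number;
}

// Compares the standard IDs the pipeline returned with the annotated ones.
function scoreObservation(expected: string[], predicted: string[]): Score {
  const expectedSet = new Set(expected);
  const predictedSet = new Set(predicted);

  let truePositives = 0;
  predictedSet.forEach((id) => {
    if (expectedSet.has(id)) truePositives += 1;
  });

  // Assumed convention: an empty denominator scores 1, so a no-match
  // observation with no predictions counts as a perfect result.
  const precision =
    predictedSet.size === 0 ? 1 : truePositives / predictedSet.size;
  const recall = expectedSet.size === 0 ? 1 : truePositives / expectedSet.size;
  return { precision, recall };
}
```

Averaging these per-observation scores across the whole set, and separately per tag, would show whether a prompt change helps overall or only for certain categories.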
Want me to scaffold the repo structure with the annotated schema, a scoring script, and a generation helper to fill coverage gaps?
Let me check what the app actually consumes.
Now let me check how the shell consumes it.
The contract is simple. The app expects:
That's it — just id and text. The BenchmarkObservation type also has an optional source field but nothing reads it currently:
So the file your external repo produces just needs to be a JSON file with that { version, observations: [{ id, text }] } shape. Any extra fields (like expected_standards or tags for your scoring) would live only in the annotated version in the external repo and get stripped when you copy the file over.
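The copy step can be sketched as a pure function that drops everything except `id` and `text`. This is an illustration under assumptions: `toAppFile` is a hypothetical name, and `version` is typed as a string only because its real type is not stated here.

```typescript
interface BenchmarkObservation {
  id: string;
  text: string;
}

// Assumed file shape: { version, observations: [{ id, text }] }.
interface BenchmarkFile {
  version: string;
  observations: BenchmarkObservation[];
}

// Annotated entries may carry any extra fields (expected_standards, tags, ...).
type AnnotatedEntry = BenchmarkObservation & Record<string, unknown>;

// Builds the plain file the app consumes, keeping only id and text per entry.
function toAppFile(version: string, annotated: AnnotatedEntry[]): BenchmarkFile {
  return {
    version,
    observations: annotated.map((entry) => ({ id: entry.id, text: entry.text })),
  };
}
```

Serializing the result with `JSON.stringify` gives a file that matches the contract regardless of which annotation fields the external repo adds later.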
Created data/benchmark/OBSERVATIONS.md with the file structure, field definitions, examples of good vs weak observations, and a brief explanation of how the pipeline consumes them.
Now I have a clear picture of everything. Let me also check what buildBenchmarkMatchingPrompt is referenced from, to ensure I clean up all imports.
Good — only used in the benchmark route. Now let me create the task list and start implementing.
I'll start by reading the key files I need to understand before implementing.
Now let me create the main pipeline orchestrator. This is the largest piece of work.
Now rewrite the benchmark route to delegate to the pipeline.
Now clean up lib/prompts.ts — remove buildBenchmarkMatchingPrompt().
Let me run the TypeScript check first.
The pre-existing errors are still there (chat-shell.tsx and chat/route.ts). I only need to fix the one new error in lib/benchmark.ts — LanguageModelV1 should be LanguageModel.
Now replace all usages of LanguageModelV1 with LanguageModel.
All remaining errors are the pre-existing ones (chat-shell.tsx and chat/route.ts). No new errors from our changes. Now let me run the build.
Build passes cleanly. /api/benchmark is still registered as a dynamic route.
All done. Here's a summary of what was implemented:
1. Created lib/benchmark.ts — Full 4-step pipeline orchestrator:
.txt files from data/benchmark/prompts/, with interpolatePrompt() for {var}/{{var}} substitution
enhance_comment_prompt.txt. Falls back to originals on failure.
subject_prompt_v3.txt. Filters to score >= 3 (RELEVANT). Falls back to treating all subjects as relevant.
area_standard_prompt_v3.txt. Falls back to empty matches.
rerank_prompt_v3.txt. Falls back to normalized step 3 scores.
BenchmarkMatch[]
2. Rewrote app/api/benchmark/route.ts — Removed all inline generateObject/Zod schema code. Custom curriculum JSON is now converted into CurriculumStandard[] objects (not a context string). Delegates to runBenchmarkPipeline(). API contract ({ results: BenchmarkMatch[] }) is unchanged.
3. Cleaned lib/prompts.ts — Removed buildBenchmarkMatchingPrompt() (now unused).
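For reference, `{var}`/`{{var}}` substitution can be done with a single regular expression. The sketch below is one plausible way to write it and is not the code in lib/benchmark.ts. The signature and the choice to leave unknown placeholders untouched are assumptions.

```typescript
// Replaces {{name}} and {name} placeholders with values from vars.
// Placeholders without a matching key are left as-is (an assumed policy).
function interpolatePrompt(
  template: string,
  vars: Record<string, string>,
): string {
  // The double-brace alternative comes first so {{name}} is consumed whole.
  return template.replace(
    /\{\{(\w+)\}\}|\{(\w+)\}/g,
    (match: string, doubleName?: string, singleName?: string) => {
      const name = doubleName ?? singleName;
      if (name === undefined) return match;
      // Own-property check avoids picking up inherited keys like "constructor".
      if (!Object.prototype.hasOwnProperty.call(vars, name)) return match;
      return vars[name] ?? match;
    },
  );
}

const filledPrompt = interpolatePrompt(
  "Observations:\n{comment}\nSubjects:\n{{subjects}}",
  { comment: "[1]: He did great today", subjects: "Meta Skills Innovation" },
);
```

Leaving unknown placeholders in place makes a missing variable visible in the rendered prompt. Throwing on unknown names would be a stricter alternative.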
npx tsc --noEmit — no new errors (only pre-existing chat-shell.tsx/chat/route.ts ones)
npx next build — passes, all routes registered correctly
I'll analyze the benchmark prompts thoroughly — reading each one and evaluating them against prompt engineering best practices.
Now let me do a deep analysis of each prompt.
The 4-step funnel (Enhance → Subject → Area/Standard → Rerank) is a strong pattern — it progressively narrows the search space, reducing token cost and improving precision. However, several prompts have issues worth addressing.
system_prompt.txt — Grade: C
Issues:
[id] Area, Standard, match list). This is misleading context.
generateObject() with a schema, so the model can't deviate anyway.
Recommendation: Rewrite as a focused 3-line role + task definition. Drop the security theater and the input format description (each step prompt already specifies its own format).
enhance_comment_prompt.txt — Grade: B-
Issues:
[id]: text pairs. The prompt doesn't tell the model to process multiple comments or preserve IDs.
{ enhanced_comments: [{ id, text }] } but the prompt never mentions returning IDs or an array. The model has to figure this out from the schema alone.
Recommendation: Add 2 few-shot examples (one that needs fixing, one that doesn't). Explicitly describe the batch input format and the expected output structure.
subject_prompt_v3.txt — Grade: A-
Strengths:
⭑ section headers
Issues:
explanation fields. There's no top-level summary field. This instruction goes nowhere.
{comment} placeholder contains multiple [id]: text pairs — the prompt should say "For each observation, evaluate every subject."
Recommendation: Remove instruction #4 (or add a summary field to the schema). Clarify batch processing. Add 1 few-shot example.
area_standard_prompt_v3.txt — Grade: B+
Strengths:
Issues:
Recommendation: Fix "subject" → "area standard" in RELEVANT description. Add filtering context. Add 1 example.
rerank_prompt_v3.txt — Grade: A-
Strengths:
Issues:
Recommendation: Add 1 compact few-shot example. Add "END OF MATCHES" delimiter.
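To make the delimiter idea concrete, here is a sketch of how the variable-length matches section could be rendered before it is substituted into the rerank prompt. The `formatMatchesSection` helper and the one-match-per-line layout are assumptions. Only the "END OF MATCHES" marker comes from the recommendation.

```typescript
const MATCHES_END = "END OF MATCHES";

// Joins one match per line and closes the section with an explicit marker,
// so the model can tell where the list stops and the instructions resume.
function formatMatchesSection(matchLines: string[]): string {
  return matchLines.concat([MATCHES_END]).join("\n");
}

// Placeholder match lines; real entries would follow the prompt's own format.
const matchesSection = formatMatchesSection([
  "first candidate match",
  "second candidate match",
]);
```

The marker is emitted even for an empty list, which keeps the prompt structure identical whether Step 3 returned many matches or none.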
| Issue | Impact | Fix Effort |
|---|---|---|
| System prompt is generic/misleading for step-specific tasks | Medium — adds noise, wastes tokens | Low |
| No few-shot examples in any prompt | High — inconsistent scoring calibration | Medium |
| Batch vs. singular mismatch in enhance + subject prompts | Medium — model infers from schema but may hallucinate IDs | Low |
| Copy-paste error in area_standard "subject" label | Low — model likely infers correct intent | Trivial |
| Subject prompt instruction #4 has no schema backing | Low — model ignores or jams it into explanation | Trivial |
Want me to implement any of these improvements?
Let me read all prompts to make precise edits.