Good, I have all prompts. Now let me implement all 5 prompt rewrites in parallel.

All 5 prompts rewritten. Now let me mark tasks and verify the build.

Only pre-existing errors. Let me also verify the build.

Build passes cleanly.

All prompts updated. Here's a summary of every change:

Changes Made

1. system_prompt.txt — Rewrote from scratch

Replaced verbose spec-doc with 4-line focused role + rules
Dropped security theater (redundant with generateObject schema enforcement)
Dropped misleading "JSON File with three columns" input description (each step defines its own format)
Added the key behavioral rule: "Focus on skill/behavior, not keyword overlap"

2. enhance_comment_prompt.txt — Restructured + added examples

Switched from wall-of-text to bullet-point criteria
Clarified batch input format: [id]: observation text
Added 2 few-shot examples (one needing fixes, one unchanged)
Explicitly stated to preserve original IDs

3. subject_prompt_v3.txt — Fixed + added example

Removed orphan instruction #4 ("provide a brief summary") that had no schema backing
Changed "teacher's comment" (singular) → "one or more teacher observations" throughout
Changed scoring guide wording from "comment" → "observation" for consistency
Added 1 few-shot example showing 3 subjects scored against 1 observation with varying scores

4. area_standard_prompt_v3.txt — Fixed copy-paste error + added context + example

Fixed RELEVANT (3) description: "supports or extends the subject" → "supports or extends the area standard"
Added pre-filtering context: "These area standards were pre-filtered to subjects that are relevant"
Changed singular "comment" → batch-aware "observations" throughout
Added 1 few-shot example showing 3 standards scored (4, 3, 0) with explanations

5. rerank_prompt_v3.txt — Added example + delimiter

Changed singular "comment" → batch-aware "observations" throughout
Added 1 few-shot example showing realistic score spread (9, 7, 2)
Added END OF MATCHES delimiter to close the variable-length matches section
Moved feedback section before matches (keeping long data at the end)

Will these prompts work with the curriculum structure that the pipeline needs to make the matches?

Now let me also look at the proposals route to understand the generateObject pattern used there.

Good question — let me trace the actual data through each step and verify the formats match.

Now let me look at real curriculum data to see the actual field values.

I found a real issue. Let me trace the exact data through each step.

The Problem: Subject Name Matching

The code in Step 2 does exact string matching between the model's returned subject field and real CurriculumStandard.subject values:

typescript

// Step 2 output: model returns subject name
set.add(score.subject)  // e.g. "Innovation"
// Step 3 filter: compares against real standard.subject
allRelevantSubjects.has(s.subject)  // s.subject = "Meta Skills Innovation"

Real subject names are: "Meta Skills Innovation", "IB Learner Profile Communication", "Approaches to Teaching and Learning" — but the few-shot examples in subject and rerank prompts use shortened names like "Innovation", "Self-Management". If the model follows the example pattern and returns short names, Step 3 filtering breaks silently — no standards pass through, zero matches.

Same issue exists in the rerank prompt where the example shows Innovation > Critical thinking > ... but real data is Meta Skills Innovation > Critical thinking > ....
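One way to make this failure loud instead of silent is to compare the model's subject names against the real ones before the Step 3 filter runs. A minimal sketch, assuming a hypothetical helper `partitionSubjects` (it is not part of the existing code) and that the known names are collected from `CurriculumStandard.subject`:

```typescript
// Hypothetical guard: split model-returned subject names into those that
// exactly match a real subject and those that do not.
interface SubjectCheck {
  matched: string[];
  unknown: string[];
}

function partitionSubjects(
  returned: string[],
  knownSubjects: Set<string>,
): SubjectCheck {
  const matched: string[] = [];
  const unknown: string[] = [];
  returned.forEach((name) => {
    // Exact string comparison, the same rule the Step 3 filter applies.
    if (knownSubjects.has(name)) {
      matched.push(name);
    } else {
      unknown.push(name);
    }
  });
  return { matched, unknown };
}

const known = new Set<string>([
  "Meta Skills Innovation",
  "IB Learner Profile Communication",
]);
const check = partitionSubjects(["Innovation", "Meta Skills Innovation"], known);
if (check.unknown.length > 0) {
  // Surfacing this in logs would have exposed the shortened names immediately.
  console.warn(`Unrecognized subject names: ${check.unknown.join(", ")}`);
}
```

Logging the unknown names would not fix the prompts, but it would turn "zero matches" into a visible warning.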

Let me fix the prompts and add an explicit exactness instruction.

Good. Now I have the full picture. Three fixes needed:

subject_prompt_v3.txt: Use real full subject names in example + add instruction to return names exactly
rerank_prompt_v3.txt: Use full subject names in example matches
area_standard_prompt_v3.txt: Example is fine (uses IDs and area/standard format correctly — no subject names involved)

Now let me verify the build still passes.

Build passes. Here's what I found and fixed:

The Bug

The pipeline does exact string matching on subject names returned by the model:

Step 2: model returns subject: "Innovation" → stored in Set
Step 3: code checks allRelevantSubjects.has("Meta Skills Innovation") → FALSE → 0 standards pass

Real subject names are compound like "Meta Skills Innovation", "IB Learner Profile Communication", "Approaches to Teaching and Learning" — but the few-shot examples showed short names like "Innovation", teaching the model to shorten them.

Fixes Applied

subject_prompt_v3.txt:

Added instruction #3: "Return subject names EXACTLY as they appear in the input list — do not shorten, abbreviate, or rephrase them."
Updated few-shot example to use realistic full names: Meta Skills Innovation, Meta Skills Self-Management, Meta Skills Social Intelligence with realistic summary text

rerank_prompt_v3.txt:

Updated example matches to use full subject names: Meta Skills Innovation > Critical thinking > ... instead of Innovation > Critical thinking > ...

area_standard_prompt_v3.txt — no change needed. This step uses [id] Area, Standard format (no subject names) and returns standardId which maps directly to real IDs. The example already matches the real data format.

Great, so now I will create a repo to select these 200 observations to benchmark the curriculum. How should I do it?

This is a strategy question more than a code one, so let me think through what makes a good benchmark set.

What the observations repo needs

Core structure: A curated set of real (or realistic) teacher observations, each tagged with expected matches so you can measure pipeline accuracy.

observations/
  observations.json            ← the 200 entries (what your app consumes)
  observations-annotated.json  ← same entries + expected_standards for scoring
  README.md                    ← criteria, coverage goals, contribution guide

Each annotated entry:

json

{
  "id": "obs-001",
  "text": "She questioned the author's assumptions and offered counter-arguments during the book discussion.",
  "level": "middle",
  "expected_standards": ["meta-skills-42", "meta-skills-45"],
  "tags": ["critical-thinking", "clear-signal"]
}

The plain observations.json (without annotations) is what gets copied into your app's data/benchmark/ for actual runs. The annotated version lets you score precision/recall.

Coverage you should aim for

Category	Target	Why
Each subject (8 across templates)	~15-20 obs each	Ensures no subject is untested
Each education level	~50 per level	Language/complexity varies drastically
Multi-skill observations	~30	"She led the group and came up with a creative solution" — tests whether pipeline picks up 2+ standards
Vague/ambiguous observations	~20	"He did great today" — should return low confidence or no match
No-match observations	~15	Observations outside curriculum scope — tests precision
Keyword traps	~15	"The student was creative in avoiding work" — tests semantic understanding vs keyword matching

How to source them

The biggest question. Three options:

From Manu/teachers directly — Ask for anonymized real observations from existing Volantis users. Best quality, most realistic. Even 50 real ones are worth more than 200 synthetic ones.
LLM-generated with human review — Generate candidate observations per subject/level/tag, then have a teacher review and edit. Fast to produce, decent quality if reviewed.
Hybrid — Start with 50-80 real ones, fill gaps with generated ones for underrepresented subjects/levels.

Scoring script

The repo should include a simple scoring script that:

Runs your pipeline against the observations
Compares results to expected_standards
Reports precision, recall, and F1 per subject/level

This lets you measure whether prompt changes actually improve accuracy.
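The per-observation comparison in that script could look roughly like the following. It is a sketch only: `scoreObservation` is a hypothetical name, and how empty lists are handled (an observation with no expected standards and no predictions counts as a perfect score) is an assumption you may want to change.

```typescript
// Hypothetical scoring helper: compares predicted standard IDs against
// expected_standards for a single observation.
interface PrfScore {
  precision: number;
  recall: number;
  f1: number;
}

function scoreObservation(expected: string[], predicted: string[]): PrfScore {
  const expectedSet = new Set<string>(expected);
  const predictedSet = new Set<string>(predicted);

  let truePositives = 0;
  predictedSet.forEach((id) => {
    if (expectedSet.has(id)) truePositives += 1;
  });

  // Assumed conventions for empty sets (relevant to no-match observations):
  // nothing predicted and nothing expected scores 1; otherwise 0.
  const precision =
    predictedSet.size === 0
      ? (expectedSet.size === 0 ? 1 : 0)
      : truePositives / predictedSet.size;
  const recall =
    expectedSet.size === 0 ? 1 : truePositives / expectedSet.size;
  const f1 =
    precision + recall === 0
      ? 0
      : (2 * precision * recall) / (precision + recall);

  return { precision, recall, f1 };
}

const sample = scoreObservation(
  ["meta-skills-42", "meta-skills-45"],
  ["meta-skills-42", "meta-skills-99"],
);
console.log(sample); // one of two predictions is correct, one of two expected is found
```

Averaging these scores per subject or per level would give the breakdown described above.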

Want me to scaffold the repo structure with the annotated schema, a scoring script, and a generation helper to fill coverage gaps?

what's the structure that observations.json should have?

Let me check what the app actually consumes.

Now let me check how the shell consumes it.

The contract is simple. The app expects:

json

{
  "version": 1,
  "observations": [
    { "id": "obs-001", "text": "The actual teacher observation text" },
    { "id": "obs-002", "text": "..." }
  ]
}

That's it — just id and text. The BenchmarkObservation type also has an optional source field but nothing reads it currently:

typescript

// from lib/curriculum-types.ts
interface BenchmarkObservation {
  id: string
  text: string
  source?: string
}

So the file your external repo produces just needs to be a JSON file with that { version, observations: [{ id, text }] } shape. Any extra fields (like expected_standards or tags for your scoring) would live only in the annotated version in the external repo and get stripped when you copy the file over.
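The stripping step could be a small function in the external repo. A minimal sketch, assuming the annotated file is a plain array of entries shaped like the annotated example earlier; `AnnotatedObservation`, `BenchmarkFile`, and `toBenchmarkFile` are hypothetical names:

```typescript
// Hypothetical types for the external repo; only id and text are consumed by the app.
interface AnnotatedObservation {
  id: string;
  text: string;
  level?: string;
  expected_standards?: string[];
  tags?: string[];
}

interface BenchmarkFile {
  version: number;
  observations: { id: string; text: string }[];
}

// Keep only the fields the app reads and wrap them in the versioned envelope.
function toBenchmarkFile(annotated: AnnotatedObservation[]): BenchmarkFile {
  return {
    version: 1,
    observations: annotated.map(({ id, text }) => ({ id, text })),
  };
}

const plain = toBenchmarkFile([
  {
    id: "obs-001",
    text: "She questioned the author's assumptions and offered counter-arguments during the book discussion.",
    level: "middle",
    expected_standards: ["meta-skills-42", "meta-skills-45"],
    tags: ["critical-thinking", "clear-signal"],
  },
]);
console.log(JSON.stringify(plain, null, 2));
```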

create an .md file describing how the observations need to be formatted

Created data/benchmark/OBSERVATIONS.md with the file structure, field definitions, examples of good vs weak observations, and a brief explanation of how the pipeline consumes them.

Now I have a clear picture of everything. Let me also check what buildBenchmarkMatchingPrompt is referenced from, to ensure I clean up all imports.

Implement the following plan:

Benchmark Pipeline Refactor — Full 4-Step Pipeline

Context

The benchmark API currently uses a single LLM call per batch. Manu provided 5 production prompts (data/benchmark/prompts/) that define a multi-step matching funnel. We need to replace the single-call approach with this full pipeline: Enhance → Subject Score → Area/Standard Score → Rerank.

No client-side changes needed — the API contract ({ results: BenchmarkMatch[] }) stays the same.

Files

Action	File
Create	`lib/benchmark.ts` — Pipeline orchestrator, prompt loader, schemas, step functions
Rewrite	`app/api/benchmark/route.ts` — Delegate to `runBenchmarkPipeline()`
Cleanup	`lib/prompts.ts` — Remove now-unused `buildBenchmarkMatchingPrompt()`

Pipeline Architecture (inside `lib/benchmark.ts`)

Prompt Loading

loadBenchmarkPrompts() reads all 5 .txt files from data/benchmark/prompts/, caches at module level
interpolatePrompt(template, vars) replaces {var} and {{var}} placeholders
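A possible shape for the placeholder substitution, as a sketch rather than the project's actual implementation. It assumes placeholder names are word characters only and that unknown placeholders should be left untouched:

```typescript
// Sketch of interpolatePrompt: replaces {{name}} and {name} with vars[name].
// Placeholders with no matching key are left as-is so they are easy to spot.
function interpolatePrompt(
  template: string,
  vars: Record<string, string>,
): string {
  // The double-brace alternative comes first so {{feedback}} is consumed whole.
  return template.replace(
    /\{\{(\w+)\}\}|\{(\w+)\}/g,
    (match: string, doubleName?: string, singleName?: string): string => {
      const name = doubleName ?? singleName;
      if (
        name === undefined ||
        !Object.prototype.hasOwnProperty.call(vars, name)
      ) {
        return match;
      }
      // An empty string (e.g. feedback = "") is a valid replacement.
      return vars[name] ?? match;
    },
  );
}

const rendered = interpolatePrompt(
  "Subjects: {subjects}\nFeedback: {{feedback}}\nOther: {unknown}",
  { subjects: "Meta Skills Innovation", feedback: "" },
);
console.log(rendered);
```

Because the pattern only matches braces that wrap a bare word, JSON snippets inside a prompt such as `{"id": 1}` are not touched.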

Step 1: Enhance Comments

Input: batch of observations
Prompt: enhance_comment_prompt.txt with {comment} = batch listing
Schema: { enhanced_comments: { id, text }[] }
Fallback on failure: use original observation texts
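The batch listing and the fallback for this step can be sketched as follows. `formatBatch` and `applyEnhancements` are hypothetical names, and falling back per observation when an ID is missing from the model output is an assumption that goes slightly beyond the plan, which only covers a failed call:

```typescript
interface Observation {
  id: string;
  text: string;
}

// Format the batch as "[id]: observation text", one observation per line.
function formatBatch(observations: Observation[]): string {
  return observations.map((o) => `[${o.id}]: ${o.text}`).join("\n");
}

// Preserve every original ID. Use the enhanced text when one was returned;
// otherwise (failed call or missing ID) keep the original text.
function applyEnhancements(
  originals: Observation[],
  enhanced: Observation[] | null,
): Observation[] {
  const byId = new Map<string, string>();
  (enhanced ?? []).forEach((e) => byId.set(e.id, e.text));
  return originals.map((o) => ({ id: o.id, text: byId.get(o.id) ?? o.text }));
}

const batch: Observation[] = [
  { id: "obs-001", text: "he did grate today" },
  { id: "obs-002", text: "She led the group discussion." },
];
const listing = formatBatch(batch);
const merged = applyEnhancements(batch, [
  { id: "obs-001", text: "He did great today." },
]);
const afterFailure = applyEnhancements(batch, null);
```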

Step 2: Subject Scoring

Extract unique subjects from CurriculumStandard[] via extractSubjects() (group by subject, combine summary values)
Prompt: subject_prompt_v3.txt with {subjects} = formatted subject list, {comment} = batch, {{feedback}} = ""
Schema: { scores: { observationId, subject, score (0-4), explanation }[] }
Fallback on failure: treat all subjects as RELEVANT (3)
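The subject extraction could be sketched like this. The `CurriculumStandard` fields below (`summary` in particular) are assumed from the description, not copied from the real type, and joining summaries with a space is just one way to "combine" them:

```typescript
// Assumed shape; the real CurriculumStandard type may differ.
interface CurriculumStandard {
  id: string;
  subject: string;
  area: string;
  standard: string;
  summary?: string;
}

interface SubjectEntry {
  subject: string;
  summary: string;
}

// Group standards by subject and combine the distinct summary values.
function extractSubjects(standards: CurriculumStandard[]): SubjectEntry[] {
  const summaries = new Map<string, string[]>();
  standards.forEach((s) => {
    const list = summaries.get(s.subject) ?? [];
    if (s.summary && list.indexOf(s.summary) === -1) {
      list.push(s.summary);
    }
    summaries.set(s.subject, list);
  });

  const result: SubjectEntry[] = [];
  // Map iteration follows insertion order, so subjects keep first-seen order.
  summaries.forEach((list, subject) => {
    result.push({ subject, summary: list.join(" ") });
  });
  return result;
}

// Made-up sample rows, for illustration only.
const subjects = extractSubjects([
  { id: "s1", subject: "Meta Skills Innovation", area: "A", standard: "X", summary: "Creative problem solving." },
  { id: "s2", subject: "Meta Skills Innovation", area: "B", standard: "Y", summary: "Creative problem solving." },
  { id: "s3", subject: "IB Learner Profile Communication", area: "C", standard: "Z" },
]);
```

Passing the full `subject` string through unchanged matters here, since later steps compare it by exact match.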

Step 3: Area/Standard Scoring

Filter standards to only subjects scoring >= 3 (RELEVANT) from step 2
Format as [id] Area, Standard via formatAreasStandardsForPrompt()
Prompt: area_standard_prompt_v3.txt with {areas_standards}, {comment}, {{feedback}}
Schema: { scores: { observationId, standardId, score (0-4), explanation }[] }
Fallback on failure: return empty matches
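The filter and formatting for this step, sketched under the same assumptions about field names. `relevantSubjects` and `filterStandards` are hypothetical helpers; `formatAreasStandardsForPrompt` is named in the plan, but this body is only a guess at its behavior:

```typescript
// Assumed shapes; the real types may differ.
interface CurriculumStandard {
  id: string;
  subject: string;
  area: string;
  standard: string;
}

interface SubjectScore {
  observationId: string;
  subject: string;
  score: number; // 0-4
}

const RELEVANT_THRESHOLD = 3;

// Collect every subject that scored RELEVANT (3) or higher for any observation.
function relevantSubjects(scores: SubjectScore[]): Set<string> {
  const set = new Set<string>();
  scores.forEach((s) => {
    if (s.score >= RELEVANT_THRESHOLD) set.add(s.subject);
  });
  return set;
}

// Keep only standards whose subject matches exactly.
function filterStandards(
  standards: CurriculumStandard[],
  subjects: Set<string>,
): CurriculumStandard[] {
  return standards.filter((s) => subjects.has(s.subject));
}

// One line per standard: "[id] Area, Standard".
function formatAreasStandardsForPrompt(standards: CurriculumStandard[]): string {
  return standards.map((s) => `[${s.id}] ${s.area}, ${s.standard}`).join("\n");
}

// Made-up sample data, for illustration only.
const allStandards: CurriculumStandard[] = [
  { id: "std-1", subject: "Meta Skills Innovation", area: "Critical thinking", standard: "Questions assumptions" },
  { id: "std-2", subject: "IB Learner Profile Communication", area: "Speaking", standard: "Presents ideas clearly" },
];
const subjectScores: SubjectScore[] = [
  { observationId: "obs-001", subject: "Meta Skills Innovation", score: 4 },
  { observationId: "obs-001", subject: "IB Learner Profile Communication", score: 1 },
];
const promptList = formatAreasStandardsForPrompt(
  filterStandards(allStandards, relevantSubjects(subjectScores)),
);
```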

Step 4: Rerank

Collect candidates: standards scoring >= 3 from step 3
Format matches as Subject > Area > Standard per observation
Prompt: rerank_prompt_v3.txt with {comment}, {matches}, {{feedback}}
Schema: { results: { observationId, standardId, score (0-10), reasoning }[] }
Fallback on failure: use step 3 scores normalized to 0-10
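The rerank fallback could be as small as the sketch below. Two things here are assumptions: the normalization is a linear rescale (score / 4 × 10), and the step 3 `explanation` is reused as the `reasoning` text. `fallbackRerank` is a hypothetical name.

```typescript
interface StandardScore {
  observationId: string;
  standardId: string;
  score: number; // 0-4 from step 3
  explanation: string;
}

interface RerankResult {
  observationId: string;
  standardId: string;
  score: number; // 0-10
  reasoning: string;
}

// Fallback when the rerank call fails: keep step 3 candidates (score >= 3)
// and rescale their scores linearly from the 0-4 range to 0-10.
function fallbackRerank(step3Scores: StandardScore[]): RerankResult[] {
  return step3Scores
    .filter((s) => s.score >= 3)
    .map((s) => ({
      observationId: s.observationId,
      standardId: s.standardId,
      score: (s.score / 4) * 10,
      reasoning: s.explanation,
    }));
}

const fallback = fallbackRerank([
  { observationId: "obs-001", standardId: "std-1", score: 4, explanation: "Direct match." },
  { observationId: "obs-001", standardId: "std-2", score: 3, explanation: "Supports the standard." },
  { observationId: "obs-001", standardId: "std-3", score: 2, explanation: "Weak link." },
]);
```

With a linear rescale, every surviving candidate lands at 7.5 or 10, so all of them would clear a later cutoff of 5.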

Final Conversion

Filter rerank results to score >= 5 (Moderately Relevant)
Take top 3 per observation, normalize score to 0-1 confidence
Return BenchmarkMatch[] — same format as current API
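The conversion rules above can be sketched as follows. The real `BenchmarkMatch` type is not shown in this conversation, so `MatchSketch` is a hypothetical stand-in with assumed field names, and `toMatches` is a hypothetical helper:

```typescript
interface RerankResult {
  observationId: string;
  standardId: string;
  score: number; // 0-10
  reasoning: string;
}

// Hypothetical stand-in for BenchmarkMatch.
interface MatchSketch {
  observationId: string;
  standardId: string;
  confidence: number; // 0-1
}

function toMatches(results: RerankResult[]): MatchSketch[] {
  // Group results scoring >= 5 by observation.
  const byObservation = new Map<string, RerankResult[]>();
  results.forEach((r) => {
    if (r.score < 5) return;
    const list = byObservation.get(r.observationId) ?? [];
    list.push(r);
    byObservation.set(r.observationId, list);
  });

  // Keep the top 3 per observation and rescale 0-10 to 0-1.
  const matches: MatchSketch[] = [];
  byObservation.forEach((list) => {
    list
      .sort((a, b) => b.score - a.score)
      .slice(0, 3)
      .forEach((r) => {
        matches.push({
          observationId: r.observationId,
          standardId: r.standardId,
          confidence: r.score / 10,
        });
      });
  });
  return matches;
}

const finalMatches = toMatches([
  { observationId: "obs-001", standardId: "std-1", score: 7, reasoning: "" },
  { observationId: "obs-001", standardId: "std-2", score: 9, reasoning: "" },
  { observationId: "obs-001", standardId: "std-3", score: 2, reasoning: "" },
  { observationId: "obs-001", standardId: "std-4", score: 6, reasoning: "" },
  { observationId: "obs-001", standardId: "std-5", score: 5, reasoning: "" },
]);
```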

API Route Changes (`app/api/benchmark/route.ts`)

Remove all generateObject / Zod schema code
Convert custom curriculum JSON into CurriculumStandard[] objects (instead of context string)
Call runBenchmarkPipeline(openrouter, observations, standards, feedback="")
Return { results } unchanged

Key Reuse

loadTemplates(), crewMatchesLevels() — existing functions
generateObject() + Zod + openrouter("google/gemini-2.5-flash") — same pattern as proposals route
fs.readFile for prompt loading — same pattern as loadTemplate() in parse-curriculum.ts
system_prompt.txt used as system message for all 4 generateObject() calls

Performance

4 LLM calls per batch instead of 1, but each is more focused
Subject filtering (step 2→3) reduces ~800 standards to ~100-250 for area/standard scoring
Gemini Flash at ~1-3s per call = ~4-12s per batch, well within 60s function timeout
Shell already shows progress per batch — no UI changes needed

Verification

npx tsc --noEmit — no new errors
npx next build — passes, /api/benchmark still registered
Full flow test: Templates → Benchmark (progress shows 4-step processing) → Results → Chat
Error resilience: kill API mid-batch → shell shows partial results + empty matches for failed batch

If you need specific details from before exiting plan mode (like exact code snippets, error messages, or content you generated), read the full transcript at: /Users/gabrielduarte/.claude/projects/-Users-gabrielduarte-Documents-GitHub-Volantis-create-curriculum/d16ef590-ffca-4050-ab83-3457efad18c0.jsonl

Good — only used in the benchmark route. Now let me create the task list and start implementing.

I'll start by reading the key files I need to understand before implementing.

Now let me create the main pipeline orchestrator. This is the largest piece of work.

Now rewrite the benchmark route to delegate to the pipeline.

Now clean up lib/prompts.ts — remove buildBenchmarkMatchingPrompt().

Let me run the TypeScript check first.

The pre-existing errors are still there (chat-shell.tsx and chat/route.ts). I only need to fix the one new error in lib/benchmark.ts — LanguageModelV1 should be LanguageModel.

Now replace all usages of LanguageModelV1 with LanguageModel.

All remaining errors are the pre-existing ones (chat-shell.tsx and chat/route.ts). No new errors from our changes. Now let me run the build.

Build passes cleanly. /api/benchmark is still registered as a dynamic route.

All done. Here's a summary of what was implemented:

Changes Made

1. Created lib/benchmark.ts — Full 4-step pipeline orchestrator:

Prompt loading: Reads and caches all 5 .txt files from data/benchmark/prompts/, with interpolatePrompt() for {var}/{{var}} substitution
Step 1 (Enhance): Cleans up observation text via enhance_comment_prompt.txt. Falls back to originals on failure.
Step 2 (Subject Score): Extracts unique subjects, scores each against observations via subject_prompt_v3.txt. Filters to score >= 3 (RELEVANT). Falls back to treating all subjects as relevant.
Step 3 (Area/Standard Score): Filters ~800 standards down to only relevant subjects, scores each via area_standard_prompt_v3.txt. Falls back to empty matches.
Step 4 (Rerank): Takes candidates scoring >= 3, reranks on 0-10 scale via rerank_prompt_v3.txt. Falls back to normalized step 3 scores.
Final conversion: Filters to score >= 5, takes top 3 per observation, normalizes to 0-1 confidence → BenchmarkMatch[]

2. Rewrote app/api/benchmark/route.ts — Removed all inline generateObject/Zod schema code. Custom curriculum JSON is now converted into CurriculumStandard[] objects (not a context string). Delegates to runBenchmarkPipeline(). API contract ({ results: BenchmarkMatch[] }) is unchanged.

3. Cleaned lib/prompts.ts — Removed buildBenchmarkMatchingPrompt() (now unused).

Verification

npx tsc --noEmit — no new errors (only pre-existing chat-shell.tsx/chat/route.ts ones)
npx next build — passes, all routes registered correctly

Base directory for this skill: /Users/gabrielduarte/.claude/skills/prompt-engineering

Prompt Engineering Patterns

Advanced prompt engineering techniques to maximize LLM performance, reliability, and controllability.

Core Capabilities

1. Few-Shot Learning

Teach the model by showing examples instead of explaining rules. Include 2-5 input-output pairs that demonstrate the desired behavior. Use when you need consistent formatting, specific reasoning patterns, or handling of edge cases. More examples improve accuracy but consume tokens—balance based on task complexity.

Example:

markdown

Extract key information from support tickets:
Input: "My login doesn't work and I keep getting error 403"
Output: {"issue": "authentication", "error_code": "403", "priority": "high"}
Input: "Feature request: add dark mode to settings"
Output: {"issue": "feature_request", "error_code": null, "priority": "low"}
Now process: "Can't upload files larger than 10MB, getting timeout"

2. Chain-of-Thought Prompting

Request step-by-step reasoning before the final answer. Add "Let's think step by step" (zero-shot) or include example reasoning traces (few-shot). Use for complex problems requiring multi-step logic, mathematical reasoning, or when you need to verify the model's thought process. Improves accuracy on analytical tasks by 30-50%.

Example:

markdown

Analyze this bug report and determine root cause.
Think step by step:
1. What is the expected behavior?
2. What is the actual behavior?
3. What changed recently that could cause this?
4. What components are involved?
5. What is the most likely root cause?
Bug: "Users can't save drafts after the cache update deployed yesterday"

3. Prompt Optimization

Systematically improve prompts through testing and refinement. Start simple, measure performance (accuracy, consistency, token usage), then iterate. Test on diverse inputs including edge cases. Use A/B testing to compare variations. Critical for production prompts where consistency and cost matter.

Example:

markdown

Version 1 (Simple): "Summarize this article"
→ Result: Inconsistent length, misses key points
Version 2 (Add constraints): "Summarize in 3 bullet points"
→ Result: Better structure, but still misses nuance
Version 3 (Add reasoning): "Identify the 3 main findings, then summarize each"
→ Result: Consistent, accurate, captures key information

4. Template Systems

Build reusable prompt structures with variables, conditional sections, and modular components. Use for multi-turn conversations, role-based interactions, or when the same pattern applies to different inputs. Reduces duplication and ensures consistency across similar tasks.

Example:

python

# Reusable code review template
template = """Review this {language} code for {focus_area}.
Code:
{code_block}
Provide feedback on:
{checklist}"""
# Usage
prompt = template.format(
    language="Python",
    focus_area="security vulnerabilities",
    code_block=user_code,
    checklist="1. SQL injection\n2. XSS risks\n3. Authentication",
)

5. System Prompt Design

Set global behavior and constraints that persist across the conversation. Define the model's role, expertise level, output format, and safety guidelines. Use system prompts for stable instructions that shouldn't change turn-to-turn, freeing up user message tokens for variable content.

Example:

markdown

System: You are a senior backend engineer specializing in API design.
Rules:
- Always consider scalability and performance
- Suggest RESTful patterns by default
- Flag security concerns immediately
- Provide code examples in Python
- Use early return pattern
Format responses as:
1. Analysis
2. Recommendation
3. Code example
4. Trade-offs

Key Patterns

Progressive Disclosure

Start with simple prompts, add complexity only when needed:

Level 1: Direct instruction
- "Summarize this article"
Level 2: Add constraints
- "Summarize this article in 3 bullet points, focusing on key findings"
Level 3: Add reasoning
- "Read this article, identify the main findings, then summarize in 3 bullet points"
Level 4: Add examples
- Include 2-3 example summaries with input-output pairs

Instruction Hierarchy

Error Recovery

Build prompts that gracefully handle failures:

Include fallback instructions
Request confidence scores
Ask for alternative interpretations when uncertain
Specify how to indicate missing information

Best Practices

Be Specific: Vague prompts produce inconsistent results
Show, Don't Tell: Examples are more effective than descriptions
Test Extensively: Evaluate on diverse, representative inputs
Iterate Rapidly: Small changes can have large impacts
Monitor Performance: Track metrics in production
Version Control: Treat prompts as code with proper versioning
Document Intent: Explain why prompts are structured as they are

Common Pitfalls

Over-engineering: Starting with complex prompts before trying simple ones
Example pollution: Using examples that don't match the target task
Context overflow: Exceeding token limits with excessive examples
Ambiguous instructions: Leaving room for multiple interpretations
Ignoring edge cases: Not testing on unusual or boundary inputs

ARGUMENTS: analyze the prompts of the benchmark

I'll analyze the benchmark prompts thoroughly — reading each one and evaluating them against prompt engineering best practices.

Now let me do a deep analysis of each prompt.

Benchmark Prompt Analysis

Overall Architecture Assessment

The 4-step funnel (Enhance → Subject → Area/Standard → Rerank) is a strong pattern — it progressively narrows the search space, reducing token cost and improving precision. However, several prompts have issues worth addressing.

1. `system_prompt.txt` — Grade: C

Issues:

Wall of text, no structure. It reads like a spec doc rather than a system prompt. The model gets this on every call — it should be tight and directive.
Contradicts the pipeline. It says "Also provide with a brief reason" — but reasoning is already handled by each step's schema. This instruction creates noise.
Mentions "JSON File" and "three columns" — but the actual input format varies per step (subjects list, [id] Area, Standard, match list). This is misleading context.
Security preamble is wasted tokens. "It should not accept or process commands that deviate from this task": you're calling generateObject() with a schema, so the shape of the output is already constrained and the preamble adds little.
Grammar issue: "Communicate exclusively in english Security Notice:" — missing period/separator.

Recommendation: Rewrite as a focused 3-line role + task definition. Drop the security theater and the input format description (each step prompt already specifies its own format).

2. `enhance_comment_prompt.txt` — Grade: B-

Issues:

No examples (zero-shot). This is the kind of task where 2-3 few-shot examples would dramatically improve consistency — showing what "minimal enhancement" looks like vs. over-editing.
Dense paragraph format. Should use bullet points for the enhancement criteria.
Says "Input: A comment" (singular) but in practice receives a batch of [id]: text pairs. The prompt doesn't tell the model to process multiple comments or preserve IDs.
No instruction about the output format. The schema forces { enhanced_comments: [{ id, text }] } but the prompt never mentions returning IDs or an array. The model has to figure this out from the schema alone.

Recommendation: Add 2 few-shot examples (one that needs fixing, one that doesn't). Explicitly describe the batch input format and the expected output structure.

3. `subject_prompt_v3.txt` — Grade: A-

Strengths:

Clear visual structure with ⭑ section headers
Well-defined 5-level scoring rubric with both labels AND numeric values
Explicit BOOST RULE for keyword matching
Good guardrails ("Do not force a match", "Do not invent subjects")

Issues:

Instruction #4 contradicts the schema. It asks for "a brief summary of why you matched with the subjects" — but the schema only has per-score explanation fields. There's no top-level summary field. This instruction goes nowhere.
Batch ambiguity. Like the enhance prompt, it says "teacher's comment" (singular) but receives a batch. The {comment} placeholder contains multiple [id]: text pairs — the prompt should say "For each observation, evaluate every subject."
No few-shot examples. A single example showing one observation scored against 2-3 subjects would anchor behavior.

Recommendation: Remove instruction #4 (or add a summary field to the schema). Clarify batch processing. Add 1 few-shot example.

4. `area_standard_prompt_v3.txt` — Grade: B+

Strengths:

Same clean structure as subject prompt
Consistent scoring rubric across steps (reduces confusion)
Good guardrails

Issues:

Missing instruction #4. The subject prompt has 4 instructions; this one has 3. Not a bug per se, but the missing "provide a summary" instruction was actually the problematic one — so this is accidentally better.
Scoring label mismatch. The RELEVANT (3) description says "comment supports or extends the subject" — but this step scores area standards, not subjects. Copy-paste error from the subject prompt.
No context about what was filtered. The model doesn't know that these standards were pre-filtered to relevant subjects. If it sees 200 standards all from 3 subjects, it might wonder why. A one-liner like "These standards were pre-selected from subjects relevant to the comment" would help.
No few-shot example.

Recommendation: Fix "subject" → "area standard" in RELEVANT description. Add filtering context. Add 1 example.

5. `rerank_prompt_v3.txt` — Grade: A-

Strengths:

Clear 2-step reasoning process (score individually, then review + adjust)
Well-calibrated 6-tier scoring guide on 0-10 scale
Explicit instruction to spread scores ("not all matches should receive high scores")
Tone instruction for user-facing reasoning

Issues:

No few-shot example. Given this is the final quality gate, even 1 example would help calibrate score distributions.
Missing "END OF MATCHES" marker. The comment has "END OF TEACHER COMMENT" but matches don't have a delimiter — potential confusion if the matches section is long.
{matches} comes AFTER instructions. Ideally, long variable data should come last (it does here — good), but having the scoring guide far above the matches means the model needs to "remember" the rubric. For Gemini Flash this is usually fine, but worth noting.

Recommendation: Add 1 compact few-shot example. Add "END OF MATCHES" delimiter.

Cross-Cutting Issues

Issue	Impact	Fix Effort
System prompt is generic/misleading for step-specific tasks	Medium — adds noise, wastes tokens	Low
No few-shot examples in any prompt	High — inconsistent scoring calibration	Medium
Batch vs. singular mismatch in enhance + subject prompts	Medium — model infers from schema but may hallucinate IDs	Low
Copy-paste error in area_standard "subject" label	Low — model likely infers correct intent	Trivial
Subject prompt instruction #4 has no schema backing	Low — model ignores or jams it into explanation	Trivial

Priority Recommendations

High impact, low effort: Fix the system prompt to be concise and step-agnostic
High impact, medium effort: Add 1-2 few-shot examples to subject + rerank prompts (the two scoring steps that matter most)
Medium impact, low effort: Clarify batch processing in enhance + subject prompts
Low effort: Fix the copy-paste "subject" → "area standard" wording

Want me to implement any of these improvements?

implement the changes

Let me read all prompts to make precise edits.
