Starting Model Test on Art Theory Category

Model: GPT-5.3 Codex
Shared: Apr 11, 4:49 AM
Messages: 60
Downloads: 18

@setup.txt

we're gonna start with the art_theory category of MMMU. Let's start by running the model against the dev set

Done — I ran LFM2.5-VL-1.6B on MMMU / Art_Theory / dev against your local llama-server.

What I verified first

Art_Theory split sizes:
- dev: 5
- validation: 30
- test: 429

Dev-set results (Art_Theory)

dev_Art_Theory_1 → pred A, gold D ❌
dev_Art_Theory_2 → pred A, gold A ✅
dev_Art_Theory_3 → pred B, gold A ❌
dev_Art_Theory_4 → pred A, gold B ❌
dev_Art_Theory_5 → pred C, gold C ✅

Accuracy: 2/5 = 40.0%
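
For reference, the accuracy figure can be recomputed from the per-question lines above. This is only a minimal sketch: the `results` mapping and the `accuracy` helper are illustrative names and are not taken from `run_mmmu_eval.py`.

```python
# (pred, gold) pairs copied from the dev-set results listed above.
results = {
    "dev_Art_Theory_1": ("A", "D"),
    "dev_Art_Theory_2": ("A", "A"),
    "dev_Art_Theory_3": ("B", "A"),
    "dev_Art_Theory_4": ("A", "B"),
    "dev_Art_Theory_5": ("C", "C"),
}


def accuracy(results):
    """Return (correct, total, percent) for a {question_id: (pred, gold)} mapping."""
    correct = sum(pred == gold for pred, gold in results.values())
    total = len(results)
    return correct, total, 100.0 * correct / total


correct, total, pct = accuracy(results)
print(f"Accuracy: {correct}/{total} = {pct:.1f}%")  # prints: Accuracy: 2/5 = 40.0%
```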

I also saved a reusable script at:

run_mmmu_eval.py

Run it with:

If you want, I can next run the same on validation for this category and give a fuller breakdown.

can you write a requirements file

Added requirements.txt at:

requirements.txt

Contents:

datasets
pyarrow
pillow
requests

If you want, I can also add a short README snippet with venv setup/install/run commands.

git commit and continue

Implemented and committed ✅

Commit: 43cdacb
Message: Log MMMU eval runs, subject results, and predictions to SQLite

What’s now in `run_mmmu_eval.py`

Auto-creates/uses SQLite DB (works whether file exists or not).
New CLI arg:
- --db-path (default: mmmu_runs.db)
Logs:
1. runs (one row per script invocation)
2. subject_results (per-subject aggregate)
3. predictions (per-question detail)
Marks run status as:
- running → completed or failed
Stores failure message if run errors out.
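
A rough sketch of how this logging could be laid out with the standard-library `sqlite3` module follows. Only the three table names, the `mmmu_runs.db` default, and the running/completed/failed statuses come from the description above; every column name and the `open_db` / `log_run` helpers are assumptions made for illustration, and the committed script may differ.

```python
import sqlite3

# Table names match the description above; all columns are assumed.
SCHEMA = """
CREATE TABLE IF NOT EXISTS runs (
    run_id INTEGER PRIMARY KEY AUTOINCREMENT,
    status TEXT NOT NULL,
    error  TEXT
);
CREATE TABLE IF NOT EXISTS subject_results (
    run_id  INTEGER NOT NULL REFERENCES runs(run_id),
    subject TEXT NOT NULL,
    correct INTEGER NOT NULL,
    total   INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS predictions (
    run_id      INTEGER NOT NULL REFERENCES runs(run_id),
    question_id TEXT NOT NULL,
    pred        TEXT,
    gold        TEXT
);
"""


def open_db(db_path="mmmu_runs.db"):
    """Open the DB, creating the file and tables if they do not exist yet."""
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    return conn


def log_run(conn, subject, run_eval):
    """Record one run; run_eval() returns a list of (question_id, pred, gold)."""
    run_id = conn.execute("INSERT INTO runs (status) VALUES ('running')").lastrowid
    conn.commit()
    try:
        rows = run_eval()
        conn.executemany(
            "INSERT INTO predictions VALUES (?, ?, ?, ?)",
            [(run_id, q, p, g) for q, p, g in rows],
        )
        correct = sum(p == g for _, p, g in rows)
        conn.execute(
            "INSERT INTO subject_results VALUES (?, ?, ?, ?)",
            (run_id, subject, correct, len(rows)),
        )
        conn.execute("UPDATE runs SET status = 'completed' WHERE run_id = ?", (run_id,))
    except Exception as exc:
        conn.rollback()  # drop partial rows, keep the already committed runs row
        conn.execute(
            "UPDATE runs SET status = 'failed', error = ? WHERE run_id = ?",
            (str(exc), run_id),
        )
        conn.commit()
        raise
    conn.commit()
    return run_id
```

Committing the `runs` row before evaluation starts means a crashed run still leaves a record that can be marked as failed.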

User-facing DB summary added

At the end of each run it prints:

db path
run_id
rows added to each table (runs, subject_results, predictions)
rows logged by this run

So you get immediate feedback on how much data was written.
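
One way such a summary could be produced is to count rows in each table before and after the run and report the difference. The sketch below assumes that approach; the `table_counts` and `summarize` helpers and the exact output wording are hypothetical, not the script's actual code.

```python
import sqlite3

# Table names from the description above.
TABLES = ("runs", "subject_results", "predictions")


def table_counts(conn):
    """Return {table: row count} for the three logging tables."""
    return {t: conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0] for t in TABLES}


def summarize(db_path, run_id, before, after):
    """Build the end-of-run summary text from two table_counts() snapshots."""
    added = {t: after[t] - before[t] for t in TABLES}
    lines = [f"db path: {db_path}", f"run_id: {run_id}"]
    lines += [f"{t}: +{n} rows" for t, n in added.items()]
    lines.append(f"rows logged by this run: {sum(added.values())}")
    return "\n".join(lines)
```

Taking `before = table_counts(conn)` at startup and `after = table_counts(conn)` once the run finishes keeps the summary independent of how the rows were inserted.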

add better error handling if the backend server is not reachable. keep it concise tho

don't bloat my code dawg
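
A concise check along these lines might look like the sketch below. It is an illustration under stated assumptions, not the change that was made to the script: it uses the standard-library `urllib` so it runs without extra packages (the script itself lists `requests`), and the default base URL, the `server_reachable` and `require_server` names, and the error message are all hypothetical.

```python
import urllib.error
import urllib.request


def server_reachable(base_url="http://127.0.0.1:8080", timeout=3.0):
    """Return True if anything answers HTTP at base_url, else False."""
    try:
        urllib.request.urlopen(base_url, timeout=timeout).close()
        return True
    except urllib.error.HTTPError:
        return True  # an HTTP error status still means a server answered
    except OSError:
        return False  # connection refused, DNS failure, timeout, ...


def require_server(base_url):
    """Exit early with one readable line instead of a traceback."""
    if not server_reachable(base_url):
        raise SystemExit(f"Cannot reach backend at {base_url}. Is llama-server running?")
```

Calling `require_server(...)` once before the evaluation loop keeps the handling in a single place rather than wrapping every request.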
