@setup.txt
we're gonna start with the art_theory category of MMMU. Let's start by running the model against the dev set
Done — I ran LFM2.5-VL-1.6B on MMMU / Art_Theory / dev against your local llama-server.
Art_Theory split sizes:
dev: 5
validation: 30
test: 429

dev_Art_Theory_1 → pred A, gold D ❌
dev_Art_Theory_2 → pred A, gold A ✅
dev_Art_Theory_3 → pred B, gold A ❌
dev_Art_Theory_4 → pred A, gold B ❌
dev_Art_Theory_5 → pred C, gold C ✅

Accuracy: 2/5 = 40.0%
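The accuracy figure is the number of matching letters divided by the number of questions. A minimal sketch of that tally, using the five pred/gold pairs from the dev results; the `results` dict and the `accuracy` helper are illustrative names and are not taken from run_mmmu_eval.py:

```python
# Illustrative only: (pred, gold) pairs copied from the dev results above.
results = {
    "dev_Art_Theory_1": ("A", "D"),
    "dev_Art_Theory_2": ("A", "A"),
    "dev_Art_Theory_3": ("B", "A"),
    "dev_Art_Theory_4": ("A", "B"),
    "dev_Art_Theory_5": ("C", "C"),
}


def accuracy(pairs):
    """Return (correct, total, percent) for an iterable of (pred, gold) pairs."""
    pairs = list(pairs)
    correct = sum(pred == gold for pred, gold in pairs)
    total = len(pairs)
    percent = 100.0 * correct / total if total else 0.0
    return correct, total, percent


correct, total, pct = accuracy(results.values())
print(f"Accuracy: {correct}/{total} = {pct:.1f}%")
```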
I also saved a reusable script at:
run_mmmu_eval.py
Run it with:
If you want, I can next run the same on validation for this category and give a fuller breakdown.
can you write a requirements file
Added requirements.txt at:
requirements.txt
Contents:
datasets
pyarrow
pillow
requests
If you want, I can also add a short README snippet with venv setup/install/run commands.
git commit and continue
Implemented and committed ✅
43cdacb
Log MMMU eval runs, subject results, and predictions to SQLite
run_mmmu_eval.py
--db-path (default: mmmu_runs.db)
runs (one row per script invocation)
subject_results (per-subject aggregate)
predictions (per-question detail)
running → completed or failed
At the end of each run it prints:
runs, subject_results, predictions)
So you get immediate feedback on how much data was written.
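For a picture of how those three tables could fit together, here is a rough sketch. Only the table names, the running/completed/failed status values, and the mmmu_runs.db default come from the commit summary. The column names, the foreign keys, and the `init_db` and `row_counts` helpers are assumptions for illustration, and the actual schema in run_mmmu_eval.py may differ.

```python
import sqlite3

# Hypothetical schema: column names are guesses, table names are from the summary.
SCHEMA = """
CREATE TABLE IF NOT EXISTS runs (
    id INTEGER PRIMARY KEY,
    started_at TEXT DEFAULT CURRENT_TIMESTAMP,
    status TEXT NOT NULL DEFAULT 'running'
        CHECK (status IN ('running', 'completed', 'failed'))
);
CREATE TABLE IF NOT EXISTS subject_results (
    id INTEGER PRIMARY KEY,
    run_id INTEGER NOT NULL REFERENCES runs(id),
    subject TEXT NOT NULL,
    split TEXT NOT NULL,
    correct INTEGER NOT NULL,
    total INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS predictions (
    id INTEGER PRIMARY KEY,
    run_id INTEGER NOT NULL REFERENCES runs(id),
    question_id TEXT NOT NULL,
    pred TEXT,
    gold TEXT NOT NULL
);
"""


def init_db(db_path="mmmu_runs.db"):
    """Open the database and create the three tables if they are missing."""
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    return conn


def row_counts(conn):
    """Return the number of rows in each table, keyed by table name."""
    tables = ("runs", "subject_results", "predictions")
    return {t: conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0] for t in tables}


# Usage: an in-memory database, one run, one prediction, then mark it completed.
conn = init_db(":memory:")
run_id = conn.execute("INSERT INTO runs DEFAULT VALUES").lastrowid
conn.execute(
    "INSERT INTO predictions (run_id, question_id, pred, gold) VALUES (?, ?, ?, ?)",
    (run_id, "dev_Art_Theory_1", "A", "D"),
)
conn.execute("UPDATE runs SET status = 'completed' WHERE id = ?", (run_id,))
conn.commit()
print(row_counts(conn))
```

Keeping the status on the runs row means an interrupted invocation stays visible as running instead of disappearing.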
add better error handling if the backend server is not reachable. keep it concise tho
don't bloat my code dawg
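One concise shape this could take, sketched under assumptions: a single up-front check against the server before the eval loop starts, returning a one-line message instead of a traceback. The `check_server` name and the http://localhost:8080 base URL are hypothetical, and the /health path is assumed to be exposed by the local llama-server. The sketch uses the standard library so it runs without extra dependencies; nothing here is taken from run_mmmu_eval.py.

```python
import urllib.request


def check_server(base_url="http://localhost:8080", timeout=3.0):
    """Return None if the server answers /health, else a one-line error message."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout):
            return None
    except OSError as exc:
        # URLError, HTTPError, refused connections and timeouts are all OSError subclasses.
        return f"Cannot reach backend at {base_url}: {exc}"
```

Calling `check_server()` once at startup and passing any returned message to `sys.exit` keeps the handling to a few lines and leaves the request code itself untouched.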