@setup.txt
we're gonna start with the art_theory category of MMMU. Let's start by running the model against the dev set
ls -la
total 8
drwxr-xr-x@ 3 mukund staff 96 Apr 11 00:46 .
drwxr-xr-x@ 27 mukund staff 864 Apr 11 00:40 ..
-rw-r--r--@ 1 mukund staff 432 Apr 11 00:46 setup.txt
find .. -maxdepth 4 -iname '*mmmu*' -print
../2026-04-11-lfm2.5-VL-1.6B-testing-on-mmmu
curl -s http://127.0.0.1:8080/health || true
{"status":"ok"}
python - <<'PY'
from datasets import load_dataset
# try loading specific config
try:
    ds=load_dataset('MMMU/MMMU','Art_Theory',split='dev')
except Exception as e:
    print('err1',e)
    ds=load_dataset('MMMU/MMMU',split='dev')
    print(ds)
    raise
print(ds)
print(ds.features)
print('len',len(ds))
print(ds[0].keys())
for k,v in ds[0].items():
    if k.startswith('image'):
        print(k, type(v), getattr(v,'size',None))
    else:
        print(k,repr(v)[:200])
PY
/bin/bash: python: command not found
Command exited with code 127
python3 - <<'PY'
from datasets import load_dataset
try:
    ds=load_dataset('MMMU/MMMU','Art_Theory',split='dev')
except Exception as e:
    print('err1',e)
    ds=load_dataset('MMMU/MMMU',split='dev')
    print(ds)
    raise
print(ds)
print(ds.features)
print('len',len(ds))
print(ds[0].keys())
for k,v in ds[0].items():
    if k.startswith('image'):
        print(k, type(v), getattr(v,'size',None))
    else:
        print(k,repr(v)[:200])
PY
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'datasets'
Command exited with code 1
hf --help | head
Usage: hf [OPTIONS] COMMAND [ARGS]...
Hugging Face Hub CLI
Options:
  --install-completion  Install completion for the current shell.
  --show-completion     Show completion for the current shell, to copy it or
                        customize the installation.
  -h, --help            Show this message and exit.
hf repo-files MMMU/MMMU --repo-type dataset | head -n 80
Usage: hf repo-files [OPTIONS] COMMAND [ARGS]...
Try 'hf repo-files -h' for help.
Error: No such command 'MMMU/MMMU'.
hf repo-files -h
Usage: hf repo-files [OPTIONS] COMMAND [ARGS]...
(Deprecated) Manage files in a repo on the Hub. Use `hf repos delete-files` instead.
Options:
  -h, --help  Show this message and exit.
Main commands:
  delete
hf download MMMU/MMMU --repo-type dataset --dry-run | head -n 120
[dry-run] Fetching 94 files: 0%| | 0/94 [00:00<?, ?it/s]
[dry-run] Fetching 94 files: 100%|██████████| 94/94 [00:00<00:00, 9573.25it/s]
[dry-run] Will download 0 files (out of 94) totalling 0.0.
File Bytes to download
--------------------------------------------------------------------- -----------------
.gitattributes -
Accounting/dev-00000-of-00001.parquet -
Accounting/test-00000-of-00001.parquet -
Accounting/validation-00000-of-00001.parquet -
Agriculture/dev-00000-of-00001.parquet -
Agriculture/test-00000-of-00002.parquet -
Agriculture/test-00001-of-00002.parquet -
Agriculture/validation-00000-of-00001.parquet -
Architecture_and_Engineering/dev-00000-of-00001.parquet -
Architecture_and_Engineering/test-00000-of-00001.parquet -
Architecture_and_Engineering/validation-00000-of-00001.parquet -
Art/dev-00000-of-00001.parquet -
Art/test-00000-of-00001.parquet -
Art/validation-00000-of-00001.parquet -
Art_Theory/dev-00000-of-00001.parquet -
Art_Theory/test-00000-of-00002.parquet -
Art_Theory/test-00001-of-00002.parquet -
Art_Theory/validation-00000-of-00001.parquet -
Basic_Medical_Science/dev-00000-of-00001.parquet -
Basic_Medical_Science/test-00000-of-00001.parquet -
Basic_Medical_Science/validation-00000-of-00001.parquet -
Biology/dev-00000-of-00001.parquet -
Biology/test-00000-of-00001.parquet -
Biology/validation-00000-of-00001.parquet -
Chemistry/dev-00000-of-00001.parquet -
Chemistry/test-00000-of-00001.parquet -
Chemistry/validation-00000-of-00001.parquet -
Clinical_Medicine/dev-00000-of-00001.parquet -
Clinical_Medicine/test-00000-of-00001.parquet -
Clinical_Medicine/validation-00000-of-00001.parquet -
Computer_Science/dev-00000-of-00001.parquet -
Computer_Science/test-00000-of-00001.parquet -
Computer_Science/validation-00000-of-00001.parquet -
Design/dev-00000-of-00001.parquet -
Design/test-00000-of-00001.parquet -
Design/validation-00000-of-00001.parquet -
Diagnostics_and_Laboratory_Medicine/dev-00000-of-00001.parquet -
Diagnostics_and_Laboratory_Medicine/test-00000-of-00001.parquet -
Diagnostics_and_Laboratory_Medicine/validation-00000-of-00001.parquet -
Economics/dev-00000-of-00001.parquet -
Economics/test-00000-of-00001.parquet -
Economics/validation-00000-of-00001.parquet -
Electronics/dev-00000-of-00001.parquet -
Electronics/test-00000-of-00001.parquet -
Electronics/validation-00000-of-00001.parquet -
Energy_and_Power/dev-00000-of-00001.parquet -
Energy_and_Power/test-00000-of-00001.parquet -
Energy_and_Power/validation-00000-of-00001.parquet -
Finance/dev-00000-of-00001.parquet -
Finance/test-00000-of-00001.parquet -
Finance/validation-00000-of-00001.parquet -
Geography/dev-00000-of-00001.parquet -
Geography/test-00000-of-00001.parquet -
Geography/validation-00000-of-00001.parquet -
History/dev-00000-of-00001.parquet -
History/test-00000-of-00001.parquet -
History/validation-00000-of-00001.parquet -
Literature/dev-00000-of-00001.parquet -
Literature/test-00000-of-00001.parquet -
Literature/validation-00000-of-00001.parquet -
Manage/dev-00000-of-00001.parquet -
Manage/test-00000-of-00001.parquet -
Manage/validation-00000-of-00001.parquet -
Marketing/dev-00000-of-00001.parquet -
Marketing/test-00000-of-00001.parquet -
Marketing/validation-00000-of-00001.parquet -
Materials/dev-00000-of-00001.parquet -
Materials/test-00000-of-00001.parquet -
Materials/validation-00000-of-00001.parquet -
Math/dev-00000-of-00001.parquet -
Math/test-00000-of-00001.parquet -
Math/validation-00000-of-00001.parquet -
Mechanical_Engineering/dev-00000-of-00001.parquet -
Mechanical_Engineering/test-00000-of-00001.parquet -
Mechanical_Engineering/validation-00000-of-00001.parquet -
Music/dev-00000-of-00001.parquet -
Music/test-00000-of-00001.parquet -
Music/validation-00000-of-00001.parquet -
Pharmacy/dev-00000-of-00001.parquet -
Pharmacy/test-00000-of-00001.parquet -
Pharmacy/validation-00000-of-00001.parquet -
Physics/dev-00000-of-00001.parquet -
Physics/test-00000-of-00001.parquet -
Physics/validation-00000-of-00001.parquet -
Psychology/dev-00000-of-00001.parquet -
Psychology/test-00000-of-00001.parquet -
Psychology/validation-00000-of-00001.parquet -
Public_Health/dev-00000-of-00001.parquet -
Public_Health/test-00000-of-00001.parquet -
Public_Health/validation-00000-of-00001.parquet -
README.md -
Sociology/dev-00000-of-00001.parquet -
Sociology/test-00000-of-00001.parquet -
Sociology/validation-00000-of-00001.parquet -
hf download MMMU/MMMU --repo-type dataset --include 'Art_Theory/dev-00000-of-00001.parquet'
Fetching 1 files: 0%| | 0/1 [00:00<?, ?it/s]
Fetching 1 files: 100%|██████████| 1/1 [00:00<00:00, 3258.98it/s]
/Users/mukund/.cache/huggingface/hub/datasets--MMMU--MMMU/snapshots/21d1d90a93c7450d30bddb579d7b510c00b8a9ab
python3 - <<'PY'
import pyarrow.parquet as pq
path='/Users/mukund/.cache/huggingface/hub/datasets--MMMU--MMMU/snapshots/21d1d90a93c7450d30bddb579d7b510c00b8a9ab/Art_Theory/dev-00000-of-00001.parquet'
t=pq.read_table(path)
print(t.schema)
print('rows',t.num_rows)
print('cols',t.num_columns)
row=t.slice(0,1).to_pylist()[0]
print(row.keys())
for k,v in row.items():
    if isinstance(v,(bytes,bytearray)):
        print(k,'bytes',len(v))
    elif isinstance(v,dict):
        print(k,'dict',v.keys())
    elif isinstance(v,list):
        print(k,'list',len(v), type(v[0]).__name__ if v else None)
    else:
        print(k,type(v).__name__,str(v)[:140])
PY
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pyarrow'
Command exited with code 1
which duckdb || which parquet-tools || which sqlite3
/usr/bin/sqlite3
python3 -m pip install --user datasets pyarrow pillow requests -q
error: externally-managed-environment
× This environment is externally managed
╰─> To install Python packages system-wide, try brew install
xyz, where xyz is the package you are trying to
install.
If you wish to install a Python library that isn't in Homebrew,
use a virtual environment:
python3 -m venv path/to/venv
source path/to/venv/bin/activate
python3 -m pip install xyz
If you wish to install a Python application that isn't in Homebrew,
it may be easiest to use 'pipx install xyz', which will manage a
virtual environment for you. You can install pipx with
brew install pipx
You may restore the old behavior of pip by passing
the '--break-system-packages' flag to pip, or by adding
'break-system-packages = true' to your pip.conf file. The latter
will permanently disable this error.
If you disable this error, we STRONGLY recommend that you additionally
pass the '--user' flag to pip, or set 'user = true' in your pip.conf
file. Failure to do this can result in a broken Homebrew installation.
Read more about this behavior here: <https://peps.python.org/pep-0668/>
note: If you believe this is a mistake, please contact your Python installation or OS distribution provider. You can override this, at the risk of breaking your Python installation or OS, by passing --break-system-packages.
hint: See PEP 668 for the detailed specification.
Command exited with code 1
python3 -m venv .venv && . .venv/bin/activate && pip install -q datasets pyarrow pillow requests
[notice] A new release of pip is available: 26.0 -> 26.0.1
[notice] To update, run: pip install --upgrade pip
. .venv/bin/activate && python - <<'PY'
from datasets import load_dataset
ds=load_dataset('MMMU/MMMU','Art_Theory',split='dev')
print(ds)
print('len',len(ds))
print(ds.features)
r=ds[0]
print('keys',r.keys())
for k,v in r.items():
    if k.startswith('image'):
        print(k, type(v), getattr(v,'size',None))
    else:
        s=str(v)
        print(k, s[:200].replace('\n',' '))
PY
Generating dev split: 0%| | 0/5 [00:00<?, ? examples/s]
Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 348.34 examples/s]
Generating validation split: 0%| | 0/30 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 30/30 [00:00<00:00, 740.72 examples/s]
Generating test split: 0%| | 0/429 [00:00<?, ? examples/s]
Generating test split: 18%|█▊ | 77/429 [00:00<00:01, 347.88 examples/s]
Generating test split: 68%|██████▊ | 292/429 [00:00<00:00, 536.69 examples/s]
Generating test split: 100%|██████████| 429/429 [00:00<00:00, 674.26 examples/s]
Generating test split: 100%|██████████| 429/429 [00:00<00:00, 603.65 examples/s]
Dataset({
features: ['id', 'question', 'options', 'explanation', 'image_1', 'image_2', 'image_3', 'image_4', 'image_5', 'image_6', 'image_7', 'img_type', 'answer', 'topic_difficulty', 'question_type', 'subfield'],
num_rows: 5
})
len 5
{'id': Value('string'), 'question': Value('string'), 'options': Value('string'), 'explanation': Value('string'), 'image_1': Image(mode=None, decode=True), 'image_2': Image(mode=None, decode=True), 'image_3': Image(mode=None, decode=True), 'image_4': Image(mode=None, decode=True), 'image_5': Image(mode=None, decode=True), 'image_6': Image(mode=None, decode=True), 'image_7': Image(mode=None, decode=True), 'img_type': Value('string'), 'answer': Value('string'), 'topic_difficulty': Value('string'), 'question_type': Value('string'), 'subfield': Value('string')}
keys dict_keys(['id', 'question', 'options', 'explanation', 'image_1', 'image_2', 'image_3', 'image_4', 'image_5', 'image_6', 'image_7', 'img_type', 'answer', 'topic_difficulty', 'question_type', 'subfield'])
id dev_Art_Theory_1
question The painting on the right focuses on the <image 1>
options ['contribution of Native Americans to landscape preservation', 'implementation of the Homestead Act', 'impact of the gold rush on landscape development', 'idea of Manifest Destiny']
explanation
image_1 <class 'PIL.PngImagePlugin.PngImageFile'> (938, 574)
image_2 <class 'NoneType'> None
image_3 <class 'NoneType'> None
image_4 <class 'NoneType'> None
image_5 <class 'NoneType'> None
image_6 <class 'NoneType'> None
image_7 <class 'NoneType'> None
img_type ['Paintings', 'Landscapes']
answer D
topic_difficulty Easy
question_type multiple-choice
subfield Art History
. .venv/bin/activate && python - <<'PY'
import base64, io, json, requests
from datasets import load_dataset
def pil_to_data_url(img):
    buf=io.BytesIO(); img.save(buf, format='PNG'); b=base64.b64encode(buf.getvalue()).decode('utf-8')
    return f'data:image/png;base64,{b}'
ds=load_dataset('MMMU/MMMU','Art_Theory',split='dev')
r=ds[0]
question=r['question']
opts=r['options']
text=question+'\n'+'\n'.join([f"{chr(65+i)}. {o}" for i,o in enumerate(opts)])+'\nAnswer with only the letter (A, B, C, or D).'
content=[{"type":"text","text":text}]
for i in range(1,8):
    img=r[f'image_{i}']
    if img is not None:
        content.append({"type":"image_url","image_url":{"url":pil_to_data_url(img)}})
payload={"model":"local","messages":[{"role":"user","content":content}],"temperature":0}
resp=requests.post('http://127.0.0.1:8080/v1/chat/completions',json=payload,timeout=120)
print(resp.status_code)
print(resp.text[:1000])
PY
200
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"E"}}],"created":1775882962,"model":"LiquidAI/LFM2.5-VL-1.6B-GGUF:F16","system_fingerprint":"b8680-15f786e65","object":"chat.completion","usage":{"completion_tokens":2,"prompt_tokens":1018,"total_tokens":1020,"prompt_tokens_details":{"cached_tokens":0}},"id":"chatcmpl-yHOPKdo7HdJPFF8P1215EDN2gedlZv07","timings":{"cache_n":0,"prompt_n":1018,"prompt_ms":2274.558,"prompt_per_token_ms":2.2343398821218075,"prompt_per_second":447.55948188615105,"predicted_n":2,"predicted_ms":34.497,"predicted_per_token_ms":17.2485,"predicted_per_second":57.97605588891788}}
. .venv/bin/activate && python - <<'PY'
import base64, io, re, requests
from datasets import load_dataset
SERVER='http://127.0.0.1:8080/v1/chat/completions'
# verify split sizes
for split in ['dev','validation','test']:
    ds_split=load_dataset('MMMU/MMMU','Art_Theory',split=split)
    print(f'Art_Theory {split}: {len(ds_split)}')
ds=load_dataset('MMMU/MMMU','Art_Theory',split='dev')
pat=re.compile(r'\b([A-Z])\b')
letters='ABCDEFGHIJKLMNOPQRSTUVWXYZ'
def pil_to_data_url(img):
    buf=io.BytesIO(); img.save(buf, format='PNG')
    return 'data:image/png;base64,'+base64.b64encode(buf.getvalue()).decode('utf-8')
correct=0
for idx,row in enumerate(ds):
    options=row['options']
    opt_lines=[f"{letters[i]}. {opt}" for i,opt in enumerate(options)]
    allowed=', '.join(letters[i] for i in range(len(options)))
    text=(
        "Answer the multiple-choice question based on the provided image(s).\n"
        "Return ONLY the single option letter and nothing else.\n\n"
        f"Question: {row['question']}\n"
        "Options:\n" + "\n".join(opt_lines) + "\n\n"
        f"Valid letters: {allowed}."
    )
    content=[{"type":"text","text":text}]
    for i in range(1,8):
        img=row[f'image_{i}']
        if img is not None:
            content.append({"type":"image_url","image_url":{"url":pil_to_data_url(img)}})
    payload={"model":"local","messages":[{"role":"user","content":content}],"temperature":0,"max_tokens":8}
    r=requests.post(SERVER,json=payload,timeout=240)
    r.raise_for_status()
    out=r.json()['choices'][0]['message']['content'].strip()
    m=pat.search(out)
    pred=m.group(1) if m else out[:1].upper()
    gold=row['answer'].strip().upper()
    ok=(pred==gold)
    correct+=ok
    print(f"{idx+1}/{len(ds)} id={row['id']} pred={pred!r} raw={out!r} gold={gold!r} {'OK' if ok else 'WRONG'}")
print(f"Accuracy: {correct}/{len(ds)} = {correct/len(ds):.1%}")
PY
Art_Theory dev: 5
Art_Theory validation: 30
Art_Theory test: 429
Traceback (most recent call last):
  File "<stdin>", line 23, in <module>
IndexError: string index out of range
Command exited with code 1
. .venv/bin/activate && python - <<'PY'
from datasets import load_dataset
letters='ABCDEFGHIJKLMNOPQRSTUVWXYZ'
ds=load_dataset('MMMU/MMMU','Art_Theory',split='dev')
for row in ds:
    opts=row['options']
    print(row['id'],type(opts),opts, 'len',len(opts))
    if len(opts)>26:
        print('too many')
    for i,opt in enumerate(opts):
        if i>=len(letters):
            print('i too high',i)
PY
dev_Art_Theory_1 <class 'str'> ['contribution of Native Americans to landscape preservation', 'implementation of the Homestead Act', 'impact of the gold rush on landscape development', 'idea of Manifest Destiny'] len 181
too many
i too high 26
[one "i too high N" line for each N from 27 through 179]
i too high 180
dev_Art_Theory_2 <class 'str'> ['Roman emperor', 'Greek philosopher', 'Carolingian king', 'Romanesque abbot'] len 78
too many
i too high 26
[one "i too high N" line for each N from 27 through 76]
i too high 77
dev_Art_Theory_3 <class 'str'> ['timber roofs', 'quadripartite vaults', 'pendentive domes', 'masonry barrel vaults'] len 85
too many
i too high 26
[one "i too high N" line for each N from 27 through 83]
i too high 84
dev_Art_Theory_4 <class 'str'> ['accurately copying figures and subject matter from nature', 'aligning the Old Testament and the New Testament through typology', 'emphasizing the role of the saints in the church', 'asserting imperial authority over the church'] len 230
too many
i too high 26
[one "i too high N" line for each N from 27 through 228]
i too high 229
dev_Art_Theory_5 <class 'str'> ['friezes', 'pediments', 'registers', 'polyptychs'] len 51
too many
i too high 26
[one "i too high N" line for each N from 27 through 49]
i too high 50
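The debug output above shows why the loop overran the alphabet: `options` arrives as a `str` holding a Python-style list literal, so `enumerate` walks its characters rather than its items. A minimal sketch of that diagnosis, using the `dev_Art_Theory_2` value copied from the output (the variable names here are illustrative, not taken from `run_mmmu_eval.py`):

```python
import ast

# Raw `options` value for dev_Art_Theory_2, copied from the debug output above.
raw = "['Roman emperor', 'Greek philosopher', 'Carolingian king', 'Romanesque abbot']"

# A str is iterated character by character, which is where "len 78" came from.
print(type(raw).__name__, len(raw))

# ast.literal_eval parses the literal text into a real list without running code.
options = ast.literal_eval(raw)
print(type(options).__name__, len(options))

# With a real list, the letter labels stay within A..D.
letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
labelled = [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
print(labelled)
```

`ast.literal_eval` only accepts literals, so it seems a safer fit here than `eval` for text that comes from a downloaded dataset.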
. .venv/bin/activate && python - <<'PY'
import base64, io, re, ast, requests
from datasets import load_dataset
SERVER='http://127.0.0.1:8080/v1/chat/completions'
letters='ABCDEFGHIJKLMNOPQRSTUVWXYZ'
pat=re.compile(r'\b([A-Z])\b')
def parse_options(o):
    if isinstance(o,list):
        return o
    if isinstance(o,str):
        try:
            v=ast.literal_eval(o)
            if isinstance(v,list):
                return [str(x) for x in v]
        except Exception:
            pass
    raise ValueError(f'Unexpected options format: {type(o)} {o!r}')
def pil_to_data_url(img):
    buf=io.BytesIO(); img.save(buf, format='PNG')
    return 'data:image/png;base64,'+base64.b64encode(buf.getvalue()).decode('utf-8')
for split in ['dev','validation','test']:
    print(f'Art_Theory {split}:', len(load_dataset('MMMU/MMMU','Art_Theory',split=split)))
ds=load_dataset('MMMU/MMMU','Art_Theory',split='dev')
correct=0
for idx,row in enumerate(ds):
    options=parse_options(row['options'])
    opt_lines=[f"{letters[i]}. {opt}" for i,opt in enumerate(options)]
    allowed=', '.join(letters[i] for i in range(len(options)))
    text=(
        "Answer the multiple-choice question based on the provided image(s).\n"
        "Return ONLY the single option letter and nothing else.\n\n"
        f"Question: {row['question']}\n"
        "Options:\n" + "\n".join(opt_lines) + "\n\n"
        f"Valid letters: {allowed}."
    )
    content=[{"type":"text","text":text}]
    for i in range(1,8):
        img=row[f'image_{i}']
        if img is not None:
            content.append({"type":"image_url","image_url":{"url":pil_to_data_url(img)}})
    payload={"model":"local","messages":[{"role":"user","content":content}],"temperature":0,"max_tokens":8}
    resp=requests.post(SERVER,json=payload,timeout=240)
    resp.raise_for_status()
    out=resp.json()['choices'][0]['message']['content'].strip()
    m=pat.search(out)
    pred=(m.group(1) if m else out[:1].upper())
    gold=row['answer'].strip().upper()
    ok=(pred==gold)
    correct+=ok
    print(f"{idx+1}/{len(ds)} {row['id']} pred={pred} gold={gold} raw={out!r} {'OK' if ok else 'WRONG'}")
print(f"Accuracy: {correct}/{len(ds)} = {correct/len(ds):.1%}")
PY
Art_Theory dev: 5
Art_Theory validation: 30
Art_Theory test: 429
1/5 dev_Art_Theory_1 pred=A gold=D raw='A' WRONG
2/5 dev_Art_Theory_2 pred=A gold=A raw='A' OK
3/5 dev_Art_Theory_3 pred=B gold=A raw='B' WRONG
4/5 dev_Art_Theory_4 pred=A gold=B raw='A' WRONG
5/5 dev_Art_Theory_5 pred=C gold=C raw='C' OK
Accuracy: 2/5 = 40.0%
. .venv/bin/activate && python run_mmmu_eval.py --subject Art_Theory --split dev
Art_Theory dev: 5 rows
1/5 dev_Art_Theory_1 pred=A gold=D raw='A' WRONG
2/5 dev_Art_Theory_2 pred=A gold=A raw='A' OK
3/5 dev_Art_Theory_3 pred=B gold=A raw='B' WRONG
4/5 dev_Art_Theory_4 pred=A gold=B raw='A' WRONG
5/5 dev_Art_Theory_5 pred=C gold=C raw='C' OK
Accuracy: 2/5 = 40.0%
Done — I ran LFM2.5-VL-1.6B on MMMU / Art_Theory / dev against your local llama-server.
Art_Theory split sizes:
dev: 5
validation: 30
test: 429
dev_Art_Theory_1 → pred A, gold D ❌
dev_Art_Theory_2 → pred A, gold A ✅
dev_Art_Theory_3 → pred B, gold A ❌
dev_Art_Theory_4 → pred A, gold B ❌
dev_Art_Theory_5 → pred C, gold C ✅
Accuracy: 2/5 = 40.0%
I also saved a reusable script at:
run_mmmu_eval.py
Run it with:
If you want, I can next run the same on validation for this category and give a fuller breakdown.
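One thing the runs above leave open: the first single-question probe returned `E` for a four-option question, and the scoring pattern `\b([A-Z])\b` accepts any capital letter, so a reply such as "I think B" would be scored as `I`. A hedged sketch of a stricter extractor follows; `extract_choice` is a hypothetical helper, not something confirmed to exist in `run_mmmu_eval.py`:

```python
import re
from typing import Optional

def extract_choice(raw: str, n_options: int) -> Optional[str]:
    """Return the first standalone valid option letter in `raw`, or None.

    Only the first `n_options` letters (A, B, ...) are accepted, so an
    out-of-range reply like "E" on a four-option question yields None.
    """
    if n_options < 1:
        return None
    valid = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"[:n_options]
    # \b keeps the match off letters that sit inside longer words.
    m = re.search(rf"\b([{valid}])\b", raw.strip())
    return m.group(1) if m else None

print(extract_choice("A", 4))
print(extract_choice("E", 4))
print(extract_choice("I think B", 4))
```

Returning `None` lets the caller count invalid replies separately from wrong ones. The match is deliberately case-sensitive; a reply that opens with the article "A" would still be read as option A, so this narrows the problem rather than removing it.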
can you write a requirements file
Added requirements.txt at:
requirements.txt
Contents:
datasets
pyarrow
pillow
requests
If you want, I can also add a short README snippet with venv setup/install/run commands.
git commit and continue
git status --short
?? requirements.txt
?? run_mmmu_eval.py
?? setup.txt
git add requirements.txt run_mmmu_eval.py && git commit -m "Add MMMU Art_Theory eval script and Python requirements"
[master (root-commit) 190690a] Add MMMU Art_Theory eval script and Python requirements
2 files changed, 92 insertions(+)
create mode 100644 requirements.txt
create mode 100644 run_mmmu_eval.py
uv run python run_mmmu_eval.py --subject Art_Theory --split dev
Implemented and committed ✅
43cdacb
Log MMMU eval runs, subject results, and predictions to SQLite
run_mmmu_eval.py
--db-path (default: mmmu_runs.db)
runs (one row per script invocation)
subject_results (per-subject aggregate)
predictions (per-question detail)
running → completed or failed
At the end of each run it prints:
runs, subject_results, predictions)
So you get immediate feedback on how much data was written.
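The actual schema in `run_mmmu_eval.py` is not shown in this log, so the following is only a sketch of how the three tables and the `running` → `completed` / `failed` status could fit together. All column names and the `log_run` helper are assumptions made for illustration; only the table names and status values come from the summary above.

```python
import sqlite3

# Hypothetical schema: table names follow the summary, columns are assumed.
SCHEMA = """
CREATE TABLE IF NOT EXISTS runs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    started_at TEXT DEFAULT CURRENT_TIMESTAMP,
    status TEXT NOT NULL DEFAULT 'running'
);
CREATE TABLE IF NOT EXISTS subject_results (
    run_id INTEGER NOT NULL REFERENCES runs(id),
    subject TEXT NOT NULL,
    split TEXT NOT NULL,
    correct INTEGER NOT NULL,
    total INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS predictions (
    run_id INTEGER NOT NULL REFERENCES runs(id),
    question_id TEXT NOT NULL,
    pred TEXT,
    gold TEXT NOT NULL,
    is_correct INTEGER NOT NULL
);
"""

def log_run(conn, subject, split, rows):
    """Record one run; rows is an iterable of (question_id, pred, gold)."""
    conn.executescript(SCHEMA)
    # The run row is committed first so a crash still leaves a 'running' record.
    run_id = conn.execute("INSERT INTO runs DEFAULT VALUES").lastrowid
    conn.commit()
    try:
        correct = total = 0
        for qid, pred, gold in rows:
            ok = int(pred == gold)
            conn.execute(
                "INSERT INTO predictions VALUES (?, ?, ?, ?, ?)",
                (run_id, qid, pred, gold, ok),
            )
            correct += ok
            total += 1
        conn.execute(
            "INSERT INTO subject_results VALUES (?, ?, ?, ?, ?)",
            (run_id, subject, split, correct, total),
        )
        conn.execute("UPDATE runs SET status = 'completed' WHERE id = ?", (run_id,))
        conn.commit()
    except Exception:
        # Drop partial rows, then mark the run as failed before re-raising.
        conn.rollback()
        conn.execute("UPDATE runs SET status = 'failed' WHERE id = ?", (run_id,))
        conn.commit()
        raise
    return run_id

# Usage with the five dev results reported earlier in this log.
conn = sqlite3.connect(":memory:")
dev_rows = [
    ("dev_Art_Theory_1", "A", "D"),
    ("dev_Art_Theory_2", "A", "A"),
    ("dev_Art_Theory_3", "B", "A"),
    ("dev_Art_Theory_4", "A", "B"),
    ("dev_Art_Theory_5", "C", "C"),
]
run_id = log_run(conn, "Art_Theory", "dev", dev_rows)
for table in ("runs", "subject_results", "predictions"):
    count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    print(table, count)
```

Committing the `runs` row before any predictions are written is one way to make an interrupted run visible afterwards; whether the real script does this is not confirmed by the log.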
add better error handling if the backend server is not reachable. keep it concise tho
don't bloat my code dawg
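A concise shape for that request might look like the sketch below: one up-front check against the `/health` endpoint that the earlier `curl` call hit, and a one-line exit instead of a traceback. It uses the standard library's `urllib` rather than `requests` so that it runs without extra packages, which is a substitution, not what `run_mmmu_eval.py` necessarily does; both function names are hypothetical.

```python
import sys
import urllib.request

def server_reachable(health_url: str, timeout: float = 3.0) -> bool:
    """True if a GET on health_url answers with HTTP 200 within `timeout` seconds."""
    try:
        with urllib.request.urlopen(health_url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, refused connections and timeouts all derive from OSError.
        return False

def require_server(health_url: str) -> None:
    """Exit with a short message when the backend is down."""
    if not server_reachable(health_url):
        sys.exit(f"Backend not reachable at {health_url}; is llama-server running?")
```

Calling `require_server("http://127.0.0.1:8080/health")` once before the evaluation loop keeps the handling to a single call site, which may suit the "keep it concise" constraint better than wrapping every request.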