Starting Model Test on Art Theory Category

Shared by: maujim (Public)
Model: GPT-5.3 Codex
Shared: Apr 11, 4:49 AM
Messages: 60

@setup.txt

we're gonna start with the art_theory category of MMMU. Let's start by running the model against the dev set.

python - <<'PY'
from datasets import load_dataset

# try loading specific config
try:
    ds=load_dataset('MMMU/MMMU','Art_Theory',split='dev')
except Exception as e:
    print('err1',e)
    ds=load_dataset('MMMU/MMMU',split='dev')
    print(ds)
    raise

print(ds)
print(ds.features)
print('len',len(ds))
print(ds[0].keys())
for k,v in ds[0].items():
    if k.startswith('image'):
        print(k, type(v), getattr(v,'size',None))
    else:
        print(k,repr(v)[:200])
PY

/bin/bash: python: command not found


Command exited with code 127

python3 - <<'PY'
from datasets import load_dataset

try:
    ds=load_dataset('MMMU/MMMU','Art_Theory',split='dev')
except Exception as e:
    print('err1',e)
    ds=load_dataset('MMMU/MMMU',split='dev')
    print(ds)
    raise

print(ds)
print(ds.features)
print('len',len(ds))
print(ds[0].keys())
for k,v in ds[0].items():
    if k.startswith('image'):
        print(k, type(v), getattr(v,'size',None))
    else:
        print(k,repr(v)[:200])
PY

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'datasets'


Command exited with code 1

hf download MMMU/MMMU --repo-type dataset --dry-run | head -n 120

[dry-run] Fetching 94 files: 100%|██████████| 94/94 [00:00<00:00, 9573.25it/s]
[dry-run] Will download 0 files (out of 94) totalling 0.0.
File                                                                  Bytes to download
--------------------------------------------------------------------- -----------------
.gitattributes                                                        -                
Accounting/dev-00000-of-00001.parquet                                 -                
Accounting/test-00000-of-00001.parquet                                -                
Accounting/validation-00000-of-00001.parquet                          -                
Agriculture/dev-00000-of-00001.parquet                                -                
Agriculture/test-00000-of-00002.parquet                               -                
Agriculture/test-00001-of-00002.parquet                               -                
Agriculture/validation-00000-of-00001.parquet                         -                
Architecture_and_Engineering/dev-00000-of-00001.parquet               -                
Architecture_and_Engineering/test-00000-of-00001.parquet              -                
Architecture_and_Engineering/validation-00000-of-00001.parquet        -                
Art/dev-00000-of-00001.parquet                                        -                
Art/test-00000-of-00001.parquet                                       -                
Art/validation-00000-of-00001.parquet                                 -                
Art_Theory/dev-00000-of-00001.parquet                                 -                
Art_Theory/test-00000-of-00002.parquet                                -                
Art_Theory/test-00001-of-00002.parquet                                -                
Art_Theory/validation-00000-of-00001.parquet                          -                
Basic_Medical_Science/dev-00000-of-00001.parquet                      -                
Basic_Medical_Science/test-00000-of-00001.parquet                     -                
Basic_Medical_Science/validation-00000-of-00001.parquet               -                
Biology/dev-00000-of-00001.parquet                                    -                
Biology/test-00000-of-00001.parquet                                   -                
Biology/validation-00000-of-00001.parquet                             -                
Chemistry/dev-00000-of-00001.parquet                                  -                
Chemistry/test-00000-of-00001.parquet                                 -                
Chemistry/validation-00000-of-00001.parquet                           -                
Clinical_Medicine/dev-00000-of-00001.parquet                          -                
Clinical_Medicine/test-00000-of-00001.parquet                         -                
Clinical_Medicine/validation-00000-of-00001.parquet                   -                
Computer_Science/dev-00000-of-00001.parquet                           -                
Computer_Science/test-00000-of-00001.parquet                          -                
Computer_Science/validation-00000-of-00001.parquet                    -                
Design/dev-00000-of-00001.parquet                                     -                
Design/test-00000-of-00001.parquet                                    -                
Design/validation-00000-of-00001.parquet                              -                
Diagnostics_and_Laboratory_Medicine/dev-00000-of-00001.parquet        -                
Diagnostics_and_Laboratory_Medicine/test-00000-of-00001.parquet       -                
Diagnostics_and_Laboratory_Medicine/validation-00000-of-00001.parquet -                
Economics/dev-00000-of-00001.parquet                                  -                
Economics/test-00000-of-00001.parquet                                 -                
Economics/validation-00000-of-00001.parquet                           -                
Electronics/dev-00000-of-00001.parquet                                -                
Electronics/test-00000-of-00001.parquet                               -                
Electronics/validation-00000-of-00001.parquet                         -                
Energy_and_Power/dev-00000-of-00001.parquet                           -                
Energy_and_Power/test-00000-of-00001.parquet                          -                
Energy_and_Power/validation-00000-of-00001.parquet                    -                
Finance/dev-00000-of-00001.parquet                                    -                
Finance/test-00000-of-00001.parquet                                   -                
Finance/validation-00000-of-00001.parquet                             -                
Geography/dev-00000-of-00001.parquet                                  -                
Geography/test-00000-of-00001.parquet                                 -                
Geography/validation-00000-of-00001.parquet                           -                
History/dev-00000-of-00001.parquet                                    -                
History/test-00000-of-00001.parquet                                   -                
History/validation-00000-of-00001.parquet                             -                
Literature/dev-00000-of-00001.parquet                                 -                
Literature/test-00000-of-00001.parquet                                -                
Literature/validation-00000-of-00001.parquet                          -                
Manage/dev-00000-of-00001.parquet                                     -                
Manage/test-00000-of-00001.parquet                                    -                
Manage/validation-00000-of-00001.parquet                              -                
Marketing/dev-00000-of-00001.parquet                                  -                
Marketing/test-00000-of-00001.parquet                                 -                
Marketing/validation-00000-of-00001.parquet                           -                
Materials/dev-00000-of-00001.parquet                                  -                
Materials/test-00000-of-00001.parquet                                 -                
Materials/validation-00000-of-00001.parquet                           -                
Math/dev-00000-of-00001.parquet                                       -                
Math/test-00000-of-00001.parquet                                      -                
Math/validation-00000-of-00001.parquet                                -                
Mechanical_Engineering/dev-00000-of-00001.parquet                     -                
Mechanical_Engineering/test-00000-of-00001.parquet                    -                
Mechanical_Engineering/validation-00000-of-00001.parquet              -                
Music/dev-00000-of-00001.parquet                                      -                
Music/test-00000-of-00001.parquet                                     -                
Music/validation-00000-of-00001.parquet                               -                
Pharmacy/dev-00000-of-00001.parquet                                   -                
Pharmacy/test-00000-of-00001.parquet                                  -                
Pharmacy/validation-00000-of-00001.parquet                            -                
Physics/dev-00000-of-00001.parquet                                    -                
Physics/test-00000-of-00001.parquet                                   -                
Physics/validation-00000-of-00001.parquet                             -                
Psychology/dev-00000-of-00001.parquet                                 -                
Psychology/test-00000-of-00001.parquet                                -                
Psychology/validation-00000-of-00001.parquet                          -                
Public_Health/dev-00000-of-00001.parquet                              -                
Public_Health/test-00000-of-00001.parquet                             -                
Public_Health/validation-00000-of-00001.parquet                       -                
README.md                                                             -                
Sociology/dev-00000-of-00001.parquet                                  -                
Sociology/test-00000-of-00001.parquet                                 -                
Sociology/validation-00000-of-00001.parquet                           -

python3 - <<'PY'
import pyarrow.parquet as pq
path='/Users/mukund/.cache/huggingface/hub/datasets--MMMU--MMMU/snapshots/21d1d90a93c7450d30bddb579d7b510c00b8a9ab/Art_Theory/dev-00000-of-00001.parquet'
t=pq.read_table(path)
print(t.schema)
print('rows',t.num_rows)
print('cols',t.num_columns)
row=t.slice(0,1).to_pylist()[0]
print(row.keys())
for k,v in row.items():
    if isinstance(v,(bytes,bytearray)):
        print(k,'bytes',len(v))
    elif isinstance(v,dict):
        print(k,'dict',v.keys())
    elif isinstance(v,list):
        print(k,'list',len(v), type(v[0]).__name__ if v else None)
    else:
        print(k,type(v).__name__,str(v)[:140])
PY

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pyarrow'


Command exited with code 1

python3 -m pip install --user datasets pyarrow pillow requests -q

error: externally-managed-environment

× This environment is externally managed
╰─> To install Python packages system-wide, try brew install
    xyz, where xyz is the package you are trying to
    install.
    
    If you wish to install a Python library that isn't in Homebrew,
    use a virtual environment:
    
    python3 -m venv path/to/venv
    source path/to/venv/bin/activate
    python3 -m pip install xyz
    
    If you wish to install a Python application that isn't in Homebrew,
    it may be easiest to use 'pipx install xyz', which will manage a
    virtual environment for you. You can install pipx with
    
    brew install pipx
    
    You may restore the old behavior of pip by passing
    the '--break-system-packages' flag to pip, or by adding
    'break-system-packages = true' to your pip.conf file. The latter
    will permanently disable this error.
    
    If you disable this error, we STRONGLY recommend that you additionally
    pass the '--user' flag to pip, or set 'user = true' in your pip.conf
    file. Failure to do this can result in a broken Homebrew installation.
    
    Read more about this behavior here: <https://peps.python.org/pep-0668/>

note: If you believe this is a mistake, please contact your Python installation or OS distribution provider. You can override this, at the risk of breaking your Python installation or OS, by passing --break-system-packages.
hint: See PEP 668 for the detailed specification.


Command exited with code 1
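
The PEP 668 error above means pip will not install into the Homebrew-managed interpreter. The commands that follow activate a `.venv`, although the transcript does not show how it was created. As a minimal sketch, a preflight check like the one below could report whether the interpreter is inside a virtual environment and which imports are still missing. The helper names `in_virtualenv` and `missing_modules` are hypothetical and are not part of the session's scripts.

```python
import importlib.util
import sys


def in_virtualenv() -> bool:
    """True when the interpreter runs inside a venv (prefix differs from base)."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def missing_modules(names):
    """Return the top-level module names that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]


if __name__ == "__main__":
    # Import names used by the scripts in this session (pillow imports as PIL).
    needed = ["datasets", "pyarrow", "PIL", "requests"]
    print("inside venv:", in_virtualenv())
    print("missing:", missing_modules(needed))
```

Running this before and after activating the environment shows whether the `ModuleNotFoundError` failures above would recur.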

. .venv/bin/activate && python - <<'PY'
from datasets import load_dataset

ds=load_dataset('MMMU/MMMU','Art_Theory',split='dev')
print(ds)
print('len',len(ds))
print(ds.features)
r=ds[0]
print('keys',r.keys())
for k,v in r.items():
    if k.startswith('image'):
        print(k, type(v), getattr(v,'size',None))
    else:
        s=str(v)
        print(k, s[:200].replace('\n',' '))
PY

Generating dev split: 100%|██████████| 5/5 [00:00<00:00, 348.34 examples/s]
Generating validation split: 100%|██████████| 30/30 [00:00<00:00, 740.72 examples/s]
Generating test split: 100%|██████████| 429/429 [00:00<00:00, 603.65 examples/s]
Dataset({
    features: ['id', 'question', 'options', 'explanation', 'image_1', 'image_2', 'image_3', 'image_4', 'image_5', 'image_6', 'image_7', 'img_type', 'answer', 'topic_difficulty', 'question_type', 'subfield'],
    num_rows: 5
})
len 5
{'id': Value('string'), 'question': Value('string'), 'options': Value('string'), 'explanation': Value('string'), 'image_1': Image(mode=None, decode=True), 'image_2': Image(mode=None, decode=True), 'image_3': Image(mode=None, decode=True), 'image_4': Image(mode=None, decode=True), 'image_5': Image(mode=None, decode=True), 'image_6': Image(mode=None, decode=True), 'image_7': Image(mode=None, decode=True), 'img_type': Value('string'), 'answer': Value('string'), 'topic_difficulty': Value('string'), 'question_type': Value('string'), 'subfield': Value('string')}
keys dict_keys(['id', 'question', 'options', 'explanation', 'image_1', 'image_2', 'image_3', 'image_4', 'image_5', 'image_6', 'image_7', 'img_type', 'answer', 'topic_difficulty', 'question_type', 'subfield'])
id dev_Art_Theory_1
question The painting on the right focuses on the <image 1>
options ['contribution of Native Americans to landscape preservation', 'implementation of the Homestead Act', 'impact of the gold rush on landscape development', 'idea of Manifest Destiny']
explanation 
image_1 <class 'PIL.PngImagePlugin.PngImageFile'> (938, 574)
image_2 <class 'NoneType'> None
image_3 <class 'NoneType'> None
image_4 <class 'NoneType'> None
image_5 <class 'NoneType'> None
image_6 <class 'NoneType'> None
image_7 <class 'NoneType'> None
img_type ['Paintings', 'Landscapes']
answer D
topic_difficulty Easy
question_type multiple-choice
subfield Art History

. .venv/bin/activate && python - <<'PY'
import base64, io, json, requests
from datasets import load_dataset

def pil_to_data_url(img):
    buf=io.BytesIO(); img.save(buf, format='PNG'); b=base64.b64encode(buf.getvalue()).decode('utf-8')
    return f'data:image/png;base64,{b}'

ds=load_dataset('MMMU/MMMU','Art_Theory',split='dev')
r=ds[0]
question=r['question']
opts=r['options']
text=question+'\n'+'\n'.join([f"{chr(65+i)}. {o}" for i,o in enumerate(opts)])+'\nAnswer with only the letter (A, B, C, or D).'
content=[{"type":"text","text":text}]
for i in range(1,8):
    img=r[f'image_{i}']
    if img is not None:
        content.append({"type":"image_url","image_url":{"url":pil_to_data_url(img)}})

payload={"model":"local","messages":[{"role":"user","content":content}],"temperature":0}
resp=requests.post('http://127.0.0.1:8080/v1/chat/completions',json=payload,timeout=120)
print(resp.status_code)
print(resp.text[:1000])
PY

200
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"E"}}],"created":1775882962,"model":"LiquidAI/LFM2.5-VL-1.6B-GGUF:F16","system_fingerprint":"b8680-15f786e65","object":"chat.completion","usage":{"completion_tokens":2,"prompt_tokens":1018,"total_tokens":1020,"prompt_tokens_details":{"cached_tokens":0}},"id":"chatcmpl-yHOPKdo7HdJPFF8P1215EDN2gedlZv07","timings":{"cache_n":0,"prompt_n":1018,"prompt_ms":2274.558,"prompt_per_token_ms":2.2343398821218075,"prompt_per_second":447.55948188615105,"predicted_n":2,"predicted_ms":34.497,"predicted_per_token_ms":17.2485,"predicted_per_second":57.97605588891788}}
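
Note that the model answered "E" even though the prompt offered only A through D. One way to catch this is to accept a letter only if it falls within the number of options actually presented. The sketch below is an illustration under that assumption; `extract_choice` is a hypothetical helper, not a function from the session's script.

```python
import re
import string


def extract_choice(raw: str, n_options: int):
    """Return the first standalone valid option letter in raw, else None."""
    if not 1 <= n_options <= 26:
        raise ValueError("n_options must be between 1 and 26")
    allowed = string.ascii_uppercase[:n_options]  # e.g. "ABCD" for 4 options
    # \b keeps the match to a letter standing on its own, not part of a word.
    match = re.search(rf"\b([{allowed}])\b", raw.strip())
    return match.group(1) if match else None


if __name__ == "__main__":
    print(extract_choice("E", 4))           # None: E is outside A-D
    print(extract_choice("D", 4))           # D
    print(extract_choice("Answer: B.", 4))  # B
```

A caveat: a reply that opens with the capitalised article "A" would still be read as option A, so the single-letter instruction in the prompt remains important.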

. .venv/bin/activate && python - <<'PY'
import base64, io, re, requests
from datasets import load_dataset

SERVER='http://127.0.0.1:8080/v1/chat/completions'

# verify split sizes
for split in ['dev','validation','test']:
    ds_split=load_dataset('MMMU/MMMU','Art_Theory',split=split)
    print(f'Art_Theory {split}: {len(ds_split)}')

ds=load_dataset('MMMU/MMMU','Art_Theory',split='dev')

pat=re.compile(r'\b([A-Z])\b')
letters='ABCDEFGHIJKLMNOPQRSTUVWXYZ'

def pil_to_data_url(img):
    buf=io.BytesIO(); img.save(buf, format='PNG')
    return 'data:image/png;base64,'+base64.b64encode(buf.getvalue()).decode('utf-8')

correct=0
for idx,row in enumerate(ds):
    options=row['options']
    opt_lines=[f"{letters[i]}. {opt}" for i,opt in enumerate(options)]
    allowed=', '.join(letters[i] for i in range(len(options)))
    text=(
        "Answer the multiple-choice question based on the provided image(s).\n"
        "Return ONLY the single option letter and nothing else.\n\n"
        f"Question: {row['question']}\n"
        "Options:\n" + "\n".join(opt_lines) + "\n\n"
        f"Valid letters: {allowed}."
    )

    content=[{"type":"text","text":text}]
    for i in range(1,8):
        img=row[f'image_{i}']
        if img is not None:
            content.append({"type":"image_url","image_url":{"url":pil_to_data_url(img)}})

    payload={"model":"local","messages":[{"role":"user","content":content}],"temperature":0,"max_tokens":8}
    r=requests.post(SERVER,json=payload,timeout=240)
    r.raise_for_status()
    out=r.json()['choices'][0]['message']['content'].strip()

    m=pat.search(out)
    pred=m.group(1) if m else out[:1].upper()
    gold=row['answer'].strip().upper()
    ok=(pred==gold)
    correct+=ok
    print(f"{idx+1}/{len(ds)} id={row['id']} pred={pred!r} raw={out!r} gold={gold!r} {'OK' if ok else 'WRONG'}")

print(f"Accuracy: {correct}/{len(ds)} = {correct/len(ds):.1%}")
PY

Art_Theory dev: 5
Art_Theory validation: 30
Art_Theory test: 429
Traceback (most recent call last):
  File "<stdin>", line 23, in <module>
IndexError: string index out of range


Command exited with code 1

. .venv/bin/activate && python - <<'PY'
from datasets import load_dataset
letters='ABCDEFGHIJKLMNOPQRSTUVWXYZ'
ds=load_dataset('MMMU/MMMU','Art_Theory',split='dev')
for row in ds:
    opts=row['options']
    print(row['id'],type(opts),opts, 'len',len(opts))
    if len(opts)>26:
        print('too many')
    for i,opt in enumerate(opts):
        if i>=len(letters):
            print('i too high',i)
PY

dev_Art_Theory_1 <class 'str'> ['contribution of Native Americans to landscape preservation', 'implementation of the Homestead Act', 'impact of the gold rush on landscape development', 'idea of Manifest Destiny'] len 181
too many
i too high 26
...
i too high 180
dev_Art_Theory_2 <class 'str'> ['Roman emperor', 'Greek philosopher', 'Carolingian king', 'Romanesque abbot'] len 78
too many
i too high 26
...
i too high 77
dev_Art_Theory_3 <class 'str'> ['timber roofs', 'quadripartite vaults', 'pendentive domes', 'masonry barrel vaults'] len 85
too many
i too high 26
...
i too high 84
dev_Art_Theory_4 <class 'str'> ['accurately copying figures and subject matter from nature', 'aligning the Old Testament and the New Testament through typology', 'emphasizing the role of the saints in the church', 'asserting imperial authority over the church'] len 230
too many
i too high 26
...
i too high 229
dev_Art_Theory_5 <class 'str'> ['friezes', 'pediments', 'registers', 'polyptychs'] len 51
too many
i too high 26
...
i too high 50
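
The output above explains the earlier IndexError: `options` is a `str` holding the text of a Python list, so `enumerate` walks it character by character and runs past the 26 available letters. A small worked example, using the value printed for dev_Art_Theory_5, shows the difference between the raw string and the parsed list:

```python
import ast

# Raw value of the 'options' column for dev_Art_Theory_5, as printed above.
raw = "['friezes', 'pediments', 'registers', 'polyptychs']"

# len() on a str counts characters, not options.
print(len(raw))  # 51, matching the log

# ast.literal_eval parses the literal without executing arbitrary code.
options = ast.literal_eval(raw)
print(len(options))  # 4

labelled = [f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options)]
print(labelled)
```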

. .venv/bin/activate && python - <<'PY'
import base64, io, re, ast, requests
from datasets import load_dataset

SERVER='http://127.0.0.1:8080/v1/chat/completions'
letters='ABCDEFGHIJKLMNOPQRSTUVWXYZ'
pat=re.compile(r'\b([A-Z])\b')

def parse_options(o):
    if isinstance(o,list):
        return o
    if isinstance(o,str):
        try:
            v=ast.literal_eval(o)
            if isinstance(v,list):
                return [str(x) for x in v]
        except Exception:
            pass
    raise ValueError(f'Unexpected options format: {type(o)} {o!r}')

def pil_to_data_url(img):
    buf=io.BytesIO(); img.save(buf, format='PNG')
    return 'data:image/png;base64,'+base64.b64encode(buf.getvalue()).decode('utf-8')

for split in ['dev','validation','test']:
    print(f'Art_Theory {split}:', len(load_dataset('MMMU/MMMU','Art_Theory',split=split)))

ds=load_dataset('MMMU/MMMU','Art_Theory',split='dev')
correct=0
for idx,row in enumerate(ds):
    options=parse_options(row['options'])
    opt_lines=[f"{letters[i]}. {opt}" for i,opt in enumerate(options)]
    allowed=', '.join(letters[i] for i in range(len(options)))
    text=(
        "Answer the multiple-choice question based on the provided image(s).\n"
        "Return ONLY the single option letter and nothing else.\n\n"
        f"Question: {row['question']}\n"
        "Options:\n" + "\n".join(opt_lines) + "\n\n"
        f"Valid letters: {allowed}."
    )
    content=[{"type":"text","text":text}]
    for i in range(1,8):
        img=row[f'image_{i}']
        if img is not None:
            content.append({"type":"image_url","image_url":{"url":pil_to_data_url(img)}})

    payload={"model":"local","messages":[{"role":"user","content":content}],"temperature":0,"max_tokens":8}
    resp=requests.post(SERVER,json=payload,timeout=240)
    resp.raise_for_status()
    out=resp.json()['choices'][0]['message']['content'].strip()

    m=pat.search(out)
    pred=(m.group(1) if m else out[:1].upper())
    gold=row['answer'].strip().upper()
    ok=(pred==gold)
    correct+=ok
    print(f"{idx+1}/{len(ds)} {row['id']} pred={pred} gold={gold} raw={out!r} {'OK' if ok else 'WRONG'}")

print(f"Accuracy: {correct}/{len(ds)} = {correct/len(ds):.1%}")
PY

Art_Theory dev: 5
Art_Theory validation: 30
Art_Theory test: 429
1/5 dev_Art_Theory_1 pred=A gold=D raw='A' WRONG
2/5 dev_Art_Theory_2 pred=A gold=A raw='A' OK
3/5 dev_Art_Theory_3 pred=B gold=A raw='B' WRONG
4/5 dev_Art_Theory_4 pred=A gold=B raw='A' WRONG
5/5 dev_Art_Theory_5 pred=C gold=C raw='C' OK
Accuracy: 2/5 = 40.0%

Done — I ran LFM2.5-VL-1.6B on MMMU / Art_Theory / dev against your local llama-server.

What I verified first

Art_Theory split sizes:
- dev: 5
- validation: 30
- test: 429

Dev-set results (Art_Theory)

dev_Art_Theory_1 → pred A, gold D ❌
dev_Art_Theory_2 → pred A, gold A ✅
dev_Art_Theory_3 → pred B, gold A ❌
dev_Art_Theory_4 → pred A, gold B ❌
dev_Art_Theory_5 → pred C, gold C ✅

Accuracy: 2/5 = 40.0%

I also saved a reusable script at:

run_mmmu_eval.py

Run it with:

bash

If you want, I can next run the same on validation for this category and give a fuller breakdown.

can you write a requirements file

Added requirements.txt at:

requirements.txt

Contents:

datasets
pyarrow
pillow
requests

If you want, I can also add a short README snippet with venv setup/install/run commands.

git commit and continue

Implemented and committed ✅

Commit: 43cdacb
Message: Log MMMU eval runs, subject results, and predictions to SQLite

What’s now in `run_mmmu_eval.py`

- Auto-creates/uses a SQLite DB (works whether or not the file already exists).
- New CLI arg: --db-path (default: mmmu_runs.db)
- Logs to three tables:
  1. runs (one row per script invocation)
  2. subject_results (per-subject aggregate)
  3. predictions (per-question detail)
- Marks run status as running → completed or failed.
- Stores the failure message if the run errors out.
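
The transcript does not show the committed code, so the following is only a sketch of how logging like this might look with the standard `sqlite3` module. The table names come from the summary above; every column name, the `logged_run` helper, and the shape of the `evaluate` callback are assumptions made for illustration.

```python
import sqlite3

# Column layout is assumed; only the three table names come from the summary.
SCHEMA = """
CREATE TABLE IF NOT EXISTS runs (
    run_id INTEGER PRIMARY KEY AUTOINCREMENT,
    status TEXT NOT NULL,   -- running | completed | failed
    error  TEXT             -- failure message, if any
);
CREATE TABLE IF NOT EXISTS subject_results (
    run_id INTEGER, subject TEXT, correct INTEGER, total INTEGER
);
CREATE TABLE IF NOT EXISTS predictions (
    run_id INTEGER, question_id TEXT, pred TEXT, gold TEXT
);
"""


def logged_run(conn, subject, evaluate):
    """Run evaluate() and log it; evaluate returns [(question_id, pred, gold), ...]."""
    conn.executescript(SCHEMA)  # safe to repeat thanks to IF NOT EXISTS
    run_id = conn.execute("INSERT INTO runs (status) VALUES ('running')").lastrowid
    conn.commit()
    try:
        rows = evaluate()
        conn.executemany(
            "INSERT INTO predictions VALUES (?, ?, ?, ?)",
            [(run_id, q, p, g) for q, p, g in rows],
        )
        correct = sum(p == g for _, p, g in rows)
        conn.execute(
            "INSERT INTO subject_results VALUES (?, ?, ?, ?)",
            (run_id, subject, correct, len(rows)),
        )
        conn.execute("UPDATE runs SET status = 'completed' WHERE run_id = ?", (run_id,))
    except Exception as exc:
        conn.rollback()  # drop partial rows, keep the already-committed run record
        conn.execute(
            "UPDATE runs SET status = 'failed', error = ? WHERE run_id = ?",
            (str(exc), run_id),
        )
        conn.commit()
        raise
    conn.commit()
    return run_id


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # the script's default is mmmu_runs.db
    rid = logged_run(conn, "Art_Theory", lambda: [("dev_Art_Theory_2", "A", "A")])
    print(conn.execute("SELECT status FROM runs WHERE run_id = ?", (rid,)).fetchone())
```

Committing the `running` row before the evaluation starts means an interrupted run still leaves a trace, which is one plausible reason for tracking status at all.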

User-facing DB summary added

At the end of each run it prints:

- db path
- run_id
- rows added to each table (runs, subject_results, predictions)
- rows logged by this run

So you get immediate feedback on how much data was written.
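
A summary like this can be computed by snapshotting row counts before and after the run. The sketch below assumes that approach; `table_counts` and `rows_added` are hypothetical names, and the single-column tables exist only to make the example runnable.

```python
import sqlite3

TABLES = ("runs", "subject_results", "predictions")  # names from the summary above


def table_counts(conn):
    """Return {table: row count} for the three log tables."""
    return {t: conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0] for t in TABLES}


def rows_added(before, after):
    """Per-table difference between two table_counts() snapshots."""
    return {t: after[t] - before[t] for t in TABLES}


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    for t in TABLES:
        conn.execute(f"CREATE TABLE {t} (run_id INTEGER)")  # placeholder columns
    before = table_counts(conn)
    conn.execute("INSERT INTO runs VALUES (1)")
    conn.executemany("INSERT INTO predictions VALUES (?)", [(1,)] * 5)
    print(rows_added(before, table_counts(conn)))
```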

add better error handling if the backend server is not reachable. keep it concise tho

don't bloat my code dawg

Edited run_mmmu_eval.py
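
The edit itself is not shown in the transcript. As a rough sketch of what concise handling for an unreachable backend could look like, the helper below wraps the request and exits with a one-line message. It uses the standard-library `urllib` so it runs without extra packages; the session's script uses `requests`, where the analogous exception would be `requests.exceptions.ConnectionError`. The name `post_json` is hypothetical.

```python
import json
import urllib.error
import urllib.request


def post_json(url, payload, timeout=240):
    """POST payload as JSON; exit with a one-line message if the server is down."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as exc:  # server answered with an error status
        raise SystemExit(f"Server at {url} returned HTTP {exc.code}")
    except OSError as exc:  # URLError, refused connection, or timeout
        raise SystemExit(f"Backend not reachable at {url}: {exc}")
```

Raising `SystemExit` with a message keeps the failure to a single readable line instead of a full traceback, which fits the request to keep the handling short.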
