# Benchmarks

Raw notes

WARNING: THIS IS NOT A BENCHMARK YET

Summary of a run limited to 50 tests:

Note to self: Gemma's poor GSM8K score is most likely due to bad prompt formatting or bad parameters; the results are not consistent otherwise.

| Model            | Quantization | GSM8K | Winogrande | MMLU  |
|------------------|--------------|-------|------------|-------|
| Qwen 3.5 35B A3B | Q4_K_S       | 88/86 | 56         | 68.32 |
| Gemma 4 26B A4B  | Q4_K_XL      | 38    | 52         | 76.88 |

## GSM8K

### Qwen 3.5 35B A3B in Q4_K_S

GSM8K test with Qwen 3.5 35B A3B in Q4_K_S:

```
(lm-evaluation-harness) yves@desk:/data/benches$ lm_eval --model local-completions \
     --model_args "base_url=http://localhost:8050/v1/completions,api_key=EMPTY,pretrained=Qwen/Qwen3.5-35B-A3B" \
     --tasks "gsm8k" \
     --num_fewshot 8 \
     --batch_size 1 \
     --limit 50
[...]
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     8|exact_match|↑  | 0.88|±  |0.0464|
|     |       |strict-match    |     8|exact_match|↑  | 0.86|±  |0.0496|
```
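As a sanity check on these numbers: the reported Stderr values match the sample standard error of a Bernoulli accuracy over 50 items, using the n−1 denominator (a hypothetical helper, not part of the harness):

```python
import math

def stderr(acc: float, n: int) -> float:
    """Sample standard error of a Bernoulli accuracy (n-1 denominator)."""
    return math.sqrt(acc * (1 - acc) / (n - 1))

# GSM8K flexible-extract on 50 samples: 0.88 -> stderr 0.0464
print(round(stderr(0.88, 50), 4))  # → 0.0464
# GSM8K strict-match: 0.86 -> stderr 0.0496
print(round(stderr(0.86, 50), 4))  # → 0.0496
```

With only 50 samples the confidence intervals are wide (roughly ±9 points at 95%), which is why none of these runs qualify as a real benchmark yet.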

### gemma-4-26B-A4B-it-UD-Q4_K_XL

Test with gemma-4-26B-A4B-it-UD-Q4_K_XL:

```
(lm-evaluation-harness) yves@desk:/data/benches$ lm_eval --model local-completions \
    --model_args "base_url=http://localhost:8050/v1/completions,api_key=EMPTY,pretrained=google/gemma-4-26B-A4B" \
    --tasks "gsm8k" \
    --num_fewshot 8 \
    --batch_size 1 \
    --limit 50
[...]
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     8|exact_match|↑  | 0.38|±  |0.0693|
|     |       |strict-match    |     8|exact_match|↑  | 0.38|±  |0.0693|
```


## Winogrande


### Qwen 3.5 35B A3B in Q4_K_S

Winogrande test with Qwen 3.5 35B A3B in Q4_K_S:

```
(lm-evaluation-harness) yves@desk:/data/benches$ lm_eval --model gguf \
    --model_args "base_url=http://localhost:8050" \
    --tasks "winogrande" \
    --num_fewshot 8 \
    --batch_size 1 \
    --limit 50
[...]
|   Tasks  |Version|Filter|n-shot|Metric|Value|   |Stderr|
|----------|------:|------|-----:|------|----:|---|-----:|
|winogrande|      1|none  |     8|acc   | 0.56|±  |0.0709|
```

```
(lm-evaluation-harness) yves@desk:/data/benches$ curl http://localhost:8050/v1/models
{"models":[{"name":"unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-Q4_K_S.gguf","model":"unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-Q4_K_S.gguf","modified_at":"","size":"","digest":"","type":"model","description":"","tags":[""],"capabilities":["completion"],"parameters":"","details":{"parent_model":"","format":"gguf","family":"","families":[""],"parameter_size":"","quantization_level":""}}],"object":"list","data":[{"id":"unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-Q4_K_S.gguf","aliases":[],"tags":[],"object":"model","created":1775653409,"owned_by":"llamacpp","meta":{"vocab_type":2,"n_vocab":248320,"n_ctx_train":262144,"n_embd":2048,"n_params":34660610688,"size":20662856192}}]}
```
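The `meta` block above is enough to cross-check the quantization level: dividing the file size by the parameter count gives the effective bits per weight. This is a rough estimate, since `size` is the whole GGUF file on disk (including embeddings and metadata), so it slightly overstates the per-weight cost:

```python
# Values taken from the /v1/models response above (Qwen3.5-35B-A3B Q4_K_S).
n_params = 34_660_610_688    # meta.n_params
size_bytes = 20_662_856_192  # meta.size

bits_per_weight = size_bytes * 8 / n_params
print(round(bits_per_weight, 2))  # → 4.77
```

About 4.77 bits per weight, which is plausible for a Q4_K mix (4-bit blocks plus scales and some higher-precision tensors).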


### gemma-4-26B-A4B-it-UD-Q4_K_XL

Winogrande test with gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf:

```
(lm-evaluation-harness) yves@desk:/data/benches$ lm_eval --model gguf \
    --model_args "base_url=http://localhost:8050" \
    --tasks "winogrande" \
    --num_fewshot 8 \
    --batch_size 1 \
    --limit 50
[...]
|   Tasks  |Version|Filter|n-shot|Metric|Value|   |Stderr|
|----------|------:|------|-----:|------|----:|---|-----:|
|winogrande|      1|none  |     8|acc   | 0.52|±  |0.0714|
```

```
(lm-evaluation-harness) yves@desk:/data/benches$ curl http://localhost:8050/v1/models
{"models":[{"name":"gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf","model":"gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf","modified_at":"","size":"","digest":"","type":"model","description":"","tags":[""],"capabilities":["completion"],"parameters":"","details":{"parent_model":"","format":"gguf","family":"","families":[""],"parameter_size":"","quantization_level":""}}],"object":"list","data":[{"id":"gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf","aliases":[],"tags":[],"object":"model","created":1775653237,"owned_by":"llamacpp","meta":{"vocab_type":2,"n_vocab":262144,"n_ctx_train":262144,"n_embd":2816,"n_params":25233142046,"size":17074453624}}]}
```
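Worth noting before reading anything into the Winogrande gap: with only 50 items per model, 0.56 vs 0.52 is well within noise. A back-of-the-envelope two-sample z-score using the standard errors reported by the harness:

```python
import math

# Winogrande: Qwen 0.56 ± 0.0709 vs Gemma 0.52 ± 0.0714, 50 items each.
se_gap = math.sqrt(0.0709**2 + 0.0714**2)
z = (0.56 - 0.52) / se_gap
print(round(z, 2))  # → 0.4
```

A z-score of about 0.4 is nowhere near the ~1.96 threshold for 95% significance, so these two Winogrande scores are statistically indistinguishable at this sample size.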


## MMLU

### Qwen 3.5 35B A3B in Q4_K_S

```
(lm-evaluation-harness) yves@desk:/data/benches$ lm_eval --model gguf \
    --model_args "base_url=http://localhost:8050" \
    --tasks "mmlu" \
    --num_fewshot 5 \
    --batch_size 1 \
    --limit 50
[...]
Tasks Version Filter n-shot Metric Value Stderr
mmlu 2 none acc 0.6832 ± 0.0081
- humanities 2 none 5 acc 0.6969 ± 0.0163
- formal_logic 1 none 5 acc 0.3200 ± 0.0666
- high_school_european_history 1 none 5 acc 0.8400 ± 0.0524
- high_school_us_history 1 none 5 acc 0.9000 ± 0.0429
- high_school_world_history 1 none 5 acc 0.9200 ± 0.0388
- international_law 1 none 5 acc 0.9200 ± 0.0388
- jurisprudence 1 none 5 acc 0.8000 ± 0.0571
- logical_fallacies 1 none 5 acc 0.6800 ± 0.0666
- moral_disputes 1 none 5 acc 0.5600 ± 0.0709
- moral_scenarios 1 none 5 acc 0.3200 ± 0.0666
- philosophy 1 none 5 acc 0.8400 ± 0.0524
- prehistory 1 none 5 acc 0.5000 ± 0.0714
- professional_law 1 none 5 acc 0.6800 ± 0.0666
- world_religions 1 none 5 acc 0.7800 ± 0.0592
- other 2 none 5 acc 0.6723 ± 0.0173
- business_ethics 1 none 5 acc 0.7600 ± 0.0610
- clinical_knowledge 1 none 5 acc 0.7400 ± 0.0627
- college_medicine 1 none 5 acc 0.7600 ± 0.0610
- global_facts 1 none 5 acc 0.6000 ± 0.0700
- human_aging 1 none 5 acc 0.5600 ± 0.0709
- management 1 none 5 acc 0.7400 ± 0.0627
- marketing 1 none 5 acc 0.3800 ± 0.0693
- medical_genetics 1 none 5 acc 0.8800 ± 0.0464
- miscellaneous 1 none 5 acc 0.8600 ± 0.0496
- nutrition 1 none 5 acc 0.7200 ± 0.0641
- professional_accounting 1 none 5 acc 0.6400 ± 0.0686
- professional_medicine 1 none 5 acc 0.8200 ± 0.0549
- virology 1 none 5 acc 0.2800 ± 0.0641
- social sciences 2 none 5 acc 0.7333 ± 0.0165
- econometrics 1 none 5 acc 0.4400 ± 0.0709
- high_school_geography 1 none 5 acc 0.9200 ± 0.0388
- high_school_government_and_politics 1 none 5 acc 0.9200 ± 0.0388
- high_school_macroeconomics 1 none 5 acc 0.8000 ± 0.0571
- high_school_microeconomics 1 none 5 acc 0.9400 ± 0.0339
- high_school_psychology 1 none 5 acc 0.9400 ± 0.0339
- human_sexuality 1 none 5 acc 0.8000 ± 0.0571
- professional_psychology 1 none 5 acc 0.7000 ± 0.0655
- public_relations 1 none 5 acc 0.4600 ± 0.0712
- security_studies 1 none 5 acc 0.4200 ± 0.0705
- sociology 1 none 5 acc 0.7400 ± 0.0627
- us_foreign_policy 1 none 5 acc 0.7200 ± 0.0641
- stem 2 none 5 acc 0.6495 ± 0.0147
- abstract_algebra 1 none 5 acc 0.6600 ± 0.0677
- anatomy 1 none 5 acc 0.5600 ± 0.0709
- astronomy 1 none 5 acc 0.6800 ± 0.0666
- college_biology 1 none 5 acc 0.9200 ± 0.0388
- college_chemistry 1 none 5 acc 0.5200 ± 0.0714
- college_computer_science 1 none 5 acc 0.4600 ± 0.0712
- college_mathematics 1 none 5 acc 0.5200 ± 0.0714
- college_physics 1 none 5 acc 0.7400 ± 0.0627
- computer_security 1 none 5 acc 0.8000 ± 0.0571
- conceptual_physics 1 none 5 acc 0.8200 ± 0.0549
- electrical_engineering 1 none 5 acc 0.5800 ± 0.0705
- elementary_mathematics 1 none 5 acc 0.7200 ± 0.0641
- high_school_biology 1 none 5 acc 0.7400 ± 0.0627
- high_school_chemistry 1 none 5 acc 0.8200 ± 0.0549
- high_school_computer_science 1 none 5 acc 0.2200 ± 0.0592
- high_school_mathematics 1 none 5 acc 0.4800 ± 0.0714
- high_school_physics 1 none 5 acc 0.7400 ± 0.0627
- high_school_statistics 1 none 5 acc 0.7600 ± 0.0610
- machine_learning 1 none 5 acc 0.6000 ± 0.0700

Groups Version Filter n-shot Metric Value Stderr
mmlu 2 none acc 0.6832 ± 0.0081
- humanities 2 none 5 acc 0.6969 ± 0.0163
- other 2 none 5 acc 0.6723 ± 0.0173
- social sciences 2 none 5 acc 0.7333 ± 0.0165
- stem 2 none 5 acc 0.6495 ± 0.0147
```

```
(lm-evaluation-harness) yves@desk:/data/benches$ curl http://localhost:8050/v1/models
{"models":[{"name":"unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-Q4_K_S.gguf","model":"unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-Q4_K_S.gguf","modified_at":"","size":"","digest":"","type":"model","description":"","tags":[""],"capabilities":["completion"],"parameters":"","details":{"parent_model":"","format":"gguf","family":"","families":[""],"parameter_size":"","quantization_level":""}}],"object":"list","data":[{"id":"unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-Q4_K_S.gguf","aliases":[],"tags":[],"object":"model","created":1775655458,"owned_by":"llamacpp","meta":{"vocab_type":2,"n_vocab":248320,"n_ctx_train":262144,"n_embd":2048,"n_params":34660610688,"size":20662856192}}]}
```


### gemma-4-26B-A4B-it-UD-Q4_K_XL

```
(lm-evaluation-harness) yves@desk:/data/benches$ lm_eval --model gguf \
    --model_args "base_url=http://localhost:8050" \
    --tasks "mmlu" \
    --num_fewshot 5 \
    --batch_size 1 \
    --limit 50
[...]
Tasks Version Filter n-shot Metric Value Stderr
mmlu 2 none acc 0.7688 ± 0.0076
- humanities 2 none 5 acc 0.7954 ± 0.0156
- formal_logic 1 none 5 acc 0.7200 ± 0.0641
- high_school_european_history 1 none 5 acc 0.7800 ± 0.0592
- high_school_us_history 1 none 5 acc 0.9000 ± 0.0429
- high_school_world_history 1 none 5 acc 0.8600 ± 0.0496
- international_law 1 none 5 acc 0.9000 ± 0.0429
- jurisprudence 1 none 5 acc 0.8600 ± 0.0496
- logical_fallacies 1 none 5 acc 0.8200 ± 0.0549
- moral_disputes 1 none 5 acc 0.7000 ± 0.0655
- moral_scenarios 1 none 5 acc 0.6800 ± 0.0666
- philosophy 1 none 5 acc 0.9000 ± 0.0429
- prehistory 1 none 5 acc 0.7400 ± 0.0627
- professional_law 1 none 5 acc 0.6400 ± 0.0686
- world_religions 1 none 5 acc 0.8400 ± 0.0524
- other 2 none 5 acc 0.7554 ± 0.0162
- business_ethics 1 none 5 acc 0.9000 ± 0.0429
- clinical_knowledge 1 none 5 acc 0.8000 ± 0.0571
- college_medicine 1 none 5 acc 0.7800 ± 0.0592
- global_facts 1 none 5 acc 0.4600 ± 0.0712
- human_aging 1 none 5 acc 0.7200 ± 0.0641
- management 1 none 5 acc 0.8600 ± 0.0496
- marketing 1 none 5 acc 0.9200 ± 0.0388
- medical_genetics 1 none 5 acc 0.8000 ± 0.0571
- miscellaneous 1 none 5 acc 0.8400 ± 0.0524
- nutrition 1 none 5 acc 0.8400 ± 0.0524
- professional_accounting 1 none 5 acc 0.5800 ± 0.0705
- professional_medicine 1 none 5 acc 0.7600 ± 0.0610
- virology 1 none 5 acc 0.5600 ± 0.0709
- social sciences 2 none 5 acc 0.8283 ± 0.0150
- econometrics 1 none 5 acc 0.7200 ± 0.0641
- high_school_geography 1 none 5 acc 0.8400 ± 0.0524
- high_school_government_and_politics 1 none 5 acc 1.0000 ± 0.0000
- high_school_macroeconomics 1 none 5 acc 0.7400 ± 0.0627
- high_school_microeconomics 1 none 5 acc 0.9400 ± 0.0339
- high_school_psychology 1 none 5 acc 0.9400 ± 0.0339
- human_sexuality 1 none 5 acc 0.8200 ± 0.0549
- professional_psychology 1 none 5 acc 0.7600 ± 0.0610
- public_relations 1 none 5 acc 0.6800 ± 0.0666
- security_studies 1 none 5 acc 0.7600 ± 0.0610
- sociology 1 none 5 acc 0.8400 ± 0.0524
- us_foreign_policy 1 none 5 acc 0.9000 ± 0.0429
- stem 2 none 5 acc 0.7221 ± 0.0140
- abstract_algebra 1 none 5 acc 0.6200 ± 0.0693
- anatomy 1 none 5 acc 0.7000 ± 0.0655
- astronomy 1 none 5 acc 0.9600 ± 0.0280
- college_biology 1 none 5 acc 0.9200 ± 0.0388
- college_chemistry 1 none 5 acc 0.6000 ± 0.0700
- college_computer_science 1 none 5 acc 0.7400 ± 0.0627
- college_mathematics 1 none 5 acc 0.4400 ± 0.0709
- college_physics 1 none 5 acc 0.6400 ± 0.0686
- computer_security 1 none 5 acc 0.8000 ± 0.0571
- conceptual_physics 1 none 5 acc 0.7200 ± 0.0641
- electrical_engineering 1 none 5 acc 0.7800 ± 0.0592
- elementary_mathematics 1 none 5 acc 0.6800 ± 0.0666
- high_school_biology 1 none 5 acc 0.9400 ± 0.0339
- high_school_chemistry 1 none 5 acc 0.7400 ± 0.0627
- high_school_computer_science 1 none 5 acc 0.9000 ± 0.0429
- high_school_mathematics 1 none 5 acc 0.5400 ± 0.0712
- high_school_physics 1 none 5 acc 0.5600 ± 0.0709
- high_school_statistics 1 none 5 acc 0.7400 ± 0.0627
- machine_learning 1 none 5 acc 0.7000 ± 0.0655

Groups Version Filter n-shot Metric Value Stderr
mmlu 2 none acc 0.7688 ± 0.0076
- humanities 2 none 5 acc 0.7954 ± 0.0156
- other 2 none 5 acc 0.7554 ± 0.0162
- social sciences 2 none 5 acc 0.8283 ± 0.0150
- stem 2 none 5 acc 0.7221 ± 0.0140
```

```
yves@desk:/data/models/unsloth$ curl http://localhost:8050/v1/models
{"models":[{"name":"gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf","model":"gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf","modified_at":"","size":"","digest":"","type":"model","description":"","tags":[""],"capabilities":["completion"],"parameters":"","details":{"parent_model":"","format":"gguf","family":"","families":[""],"parameter_size":"","quantization_level":""}}],"object":"list","data":[{"id":"gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf","aliases":[],"tags":[],"object":"model","created":1775659610,"owned_by":"llamacpp","meta":{"vocab_type":2,"n_vocab":262144,"n_ctx_train":262144,"n_embd":2816,"n_params":25233142046,"size":17074453624}}]}
```