# Benchmarks

Raw notes

WARNING: THIS IS NOT A BENCHMARK YET

Summary of a run limited to 50 tests:

Note to self: Gemma's poor GSM8K score is most likely due to bad prompt formatting or bad parameters; the results are not consistent otherwise.

| Model            | Quantization | GSM8K | Winogrande | MMLU  |
|------------------|--------------|-------|------------|-------|
| Qwen 3.5 35B A3B | Q4_K_S       | 88/86 | 56         | 68.32 |
| Gemma 4 26B A4B  | Q4_K_XL      | 38    | 52         | 76.88 |

## GSM8K

### Qwen 3.5 35B A3B in Q4_K_S

GSM8K test with Qwen 3.5 35B A3B in Q4_K_S:

```
(lm-evaluation-harness) yves@desk:/data/benches$ lm_eval --model local-completions \
     --model_args "base_url=http://localhost:8050/v1/completions,api_key=EMPTY,pretrained=Qwen/Qwen3.5-35B-A3B" \
     --tasks "gsm8k" \
     --num_fewshot 8 \
     --batch_size 1 \
     --limit 50
[...]
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     8|exact_match|↑  | 0.88|±  |0.0464|
|     |       |strict-match    |     8|exact_match|↑  | 0.86|±  |0.0496|
```
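As a sanity check on these numbers: the reported Stderr values match the sample standard error of a Bernoulli accuracy over 50 items, using the n−1 denominator (a hypothetical helper, not part of the harness):

```python
import math

def stderr(acc: float, n: int) -> float:
    """Sample standard error of a Bernoulli accuracy (n-1 denominator)."""
    return math.sqrt(acc * (1 - acc) / (n - 1))

# GSM8K flexible-extract on 50 samples: 0.88 -> stderr 0.0464
print(round(stderr(0.88, 50), 4))  # → 0.0464
# GSM8K strict-match: 0.86 -> stderr 0.0496
print(round(stderr(0.86, 50), 4))  # → 0.0496
```

With only 50 samples the confidence intervals are wide (roughly ±9 points at 95%), which is why none of these runs qualify as a real benchmark yet.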

### gemma-4-26B-A4B-it-UD-Q4_K_XL

Test with gemma-4-26B-A4B-it-UD-Q4_K_XL:

```
(lm-evaluation-harness) yves@desk:/data/benches$ lm_eval --model local-completions \
    --model_args "base_url=http://localhost:8050/v1/completions,api_key=EMPTY,pretrained=google/gemma-4-26B-A4B" \
    --tasks "gsm8k" \
    --num_fewshot 8 \
    --batch_size 1 \
    --limit 50
[...]
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     8|exact_match|↑  | 0.38|±  |0.0693|
|     |       |strict-match    |     8|exact_match|↑  | 0.38|±  |0.0693|
```


## Winogrande


### Qwen 3.5 35B A3B in Q4_K_S

Winogrande test with Qwen 3.5 35B A3B in Q4_K_S:

```
(lm-evaluation-harness) yves@desk:/data/benches$ lm_eval --model gguf \
    --model_args "base_url=http://localhost:8050" \
    --tasks "winogrande" \
    --num_fewshot 8 \
    --batch_size 1 \
    --limit 50
[...]
|   Tasks  |Version|Filter|n-shot|Metric|Value|   |Stderr|
|----------|------:|------|-----:|------|----:|---|-----:|
|winogrande|      1|none  |     8|acc   | 0.56|±  |0.0709|
```

```
(lm-evaluation-harness) yves@desk:/data/benches$ curl http://localhost:8050/v1/models
{"models":[{"name":"unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-Q4_K_S.gguf","model":"unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-Q4_K_S.gguf","modified_at":"","size":"","digest":"","type":"model","description":"","tags":[""],"capabilities":["completion"],"parameters":"","details":{"parent_model":"","format":"gguf","family":"","families":[""],"parameter_size":"","quantization_level":""}}],"object":"list","data":[{"id":"unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-Q4_K_S.gguf","aliases":[],"tags":[],"object":"model","created":1775653409,"owned_by":"llamacpp","meta":{"vocab_type":2,"n_vocab":248320,"n_ctx_train":262144,"n_embd":2048,"n_params":34660610688,"size":20662856192}}]}
```
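The `meta` block above is enough to cross-check the quantization level: dividing the file size by the parameter count gives the effective bits per weight. This is a rough estimate, since `size` is the whole GGUF file on disk (including embeddings and metadata), so it slightly overstates the per-weight cost:

```python
# Values taken from the /v1/models response above (Qwen3.5-35B-A3B Q4_K_S).
n_params = 34_660_610_688    # meta.n_params
size_bytes = 20_662_856_192  # meta.size

bits_per_weight = size_bytes * 8 / n_params
print(round(bits_per_weight, 2))  # → 4.77
```

About 4.77 bits per weight, which is plausible for a Q4_K mix (4-bit blocks plus scales and some higher-precision tensors).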


### gemma-4-26B-A4B-it-UD-Q4_K_XL

Winogrande test with gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf:

```
(lm-evaluation-harness) yves@desk:/data/benches$ lm_eval --model gguf \
    --model_args "base_url=http://localhost:8050" \
    --tasks "winogrande" \
    --num_fewshot 8 \
    --batch_size 1 \
    --limit 50
[...]
|   Tasks  |Version|Filter|n-shot|Metric|Value|   |Stderr|
|----------|------:|------|-----:|------|----:|---|-----:|
|winogrande|      1|none  |     8|acc   | 0.52|±  |0.0714|
```

```
(lm-evaluation-harness) yves@desk:/data/benches$ curl http://localhost:8050/v1/models
{"models":[{"name":"gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf","model":"gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf","modified_at":"","size":"","digest":"","type":"model","description":"","tags":[""],"capabilities":["completion"],"parameters":"","details":{"parent_model":"","format":"gguf","family":"","families":[""],"parameter_size":"","quantization_level":""}}],"object":"list","data":[{"id":"gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf","aliases":[],"tags":[],"object":"model","created":1775653237,"owned_by":"llamacpp","meta":{"vocab_type":2,"n_vocab":262144,"n_ctx_train":262144,"n_embd":2816,"n_params":25233142046,"size":17074453624}}]}
```
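Worth noting before reading anything into the Winogrande gap: with only 50 items per model, 0.56 vs 0.52 is well within noise. A back-of-the-envelope two-sample z-score using the standard errors reported by the harness:

```python
import math

# Winogrande: Qwen 0.56 ± 0.0709 vs Gemma 0.52 ± 0.0714, 50 items each.
se_gap = math.sqrt(0.0709**2 + 0.0714**2)
z = (0.56 - 0.52) / se_gap
print(round(z, 2))  # → 0.4
```

A z-score of about 0.4 is nowhere near the ~1.96 threshold for 95% significance, so these two Winogrande scores are statistically indistinguishable at this sample size.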


## MMLU

### Qwen 3.5 35B A3B in Q4_K_S

```
(lm-evaluation-harness) yves@desk:/data/benches$ lm_eval --model gguf \
    --model_args "base_url=http://localhost:8050" \
    --tasks "mmlu" \
    --num_fewshot 5 \
    --batch_size 1 \
    --limit 50
[...]
Tasks Version Filter n-shot Metric Value Stderr
mmlu 2 none acc 0.6832 ± 0.0081
- humanities 2 none 5 acc 0.6969 ± 0.0163
- formal_logic 1 none 5 acc 0.3200 ± 0.0666
- high_school_european_history 1 none 5 acc 0.8400 ± 0.0524
- high_school_us_history 1 none 5 acc 0.9000 ± 0.0429
- high_school_world_history 1 none 5 acc 0.9200 ± 0.0388
- international_law 1 none 5 acc 0.9200 ± 0.0388
- jurisprudence 1 none 5 acc 0.8000 ± 0.0571
- logical_fallacies 1 none 5 acc 0.6800 ± 0.0666
- moral_disputes 1 none 5 acc 0.5600 ± 0.0709
- moral_scenarios 1 none 5 acc 0.3200 ± 0.0666
- philosophy 1 none 5 acc 0.8400 ± 0.0524
- prehistory 1 none 5 acc 0.5000 ± 0.0714
- professional_law 1 none 5 acc 0.6800 ± 0.0666
- world_religions 1 none 5 acc 0.7800 ± 0.0592
- other 2 none 5 acc 0.6723 ± 0.0173
- business_ethics 1 none 5 acc 0.7600 ± 0.0610
- clinical_knowledge 1 none 5 acc 0.7400 ± 0.0627
- college_medicine 1 none 5 acc 0.7600 ± 0.0610
- global_facts 1 none 5 acc 0.6000 ± 0.0700
- human_aging 1 none 5 acc 0.5600 ± 0.0709
- management 1 none 5 acc 0.7400 ± 0.0627
- marketing 1 none 5 acc 0.3800 ± 0.0693
- medical_genetics 1 none 5 acc 0.8800 ± 0.0464
- miscellaneous 1 none 5 acc 0.8600 ± 0.0496
- nutrition 1 none 5 acc 0.7200 ± 0.0641
- professional_accounting 1 none 5 acc 0.6400 ± 0.0686
- professional_medicine 1 none 5 acc 0.8200 ± 0.0549
- virology 1 none 5 acc 0.2800 ± 0.0641
- social sciences 2 none 5 acc 0.7333 ± 0.0165
- econometrics 1 none 5 acc 0.4400 ± 0.0709
- high_school_geography 1 none 5 acc 0.9200 ± 0.0388
- high_school_government_and_politics 1 none 5 acc 0.9200 ± 0.0388
- high_school_macroeconomics 1 none 5 acc 0.8000 ± 0.0571
- high_school_microeconomics 1 none 5 acc 0.9400 ± 0.0339
- high_school_psychology 1 none 5 acc 0.9400 ± 0.0339
- human_sexuality 1 none 5 acc 0.8000 ± 0.0571
- professional_psychology 1 none 5 acc 0.7000 ± 0.0655
- public_relations 1 none 5 acc 0.4600 ± 0.0712
- security_studies 1 none 5 acc 0.4200 ± 0.0705
- sociology 1 none 5 acc 0.7400 ± 0.0627
- us_foreign_policy 1 none 5 acc 0.7200 ± 0.0641
- stem 2 none 5 acc 0.6495 ± 0.0147
- abstract_algebra 1 none 5 acc 0.6600 ± 0.0677
- anatomy 1 none 5 acc 0.5600 ± 0.0709
- astronomy 1 none 5 acc 0.6800 ± 0.0666
- college_biology 1 none 5 acc 0.9200 ± 0.0388
- college_chemistry 1 none 5 acc 0.5200 ± 0.0714
- college_computer_science 1 none 5 acc 0.4600 ± 0.0712
- college_mathematics 1 none 5 acc 0.5200 ± 0.0714
- college_physics 1 none 5 acc 0.7400 ± 0.0627
- computer_security 1 none 5 acc 0.8000 ± 0.0571
- conceptual_physics 1 none 5 acc 0.8200 ± 0.0549
- electrical_engineering 1 none 5 acc 0.5800 ± 0.0705
- elementary_mathematics 1 none 5 acc 0.7200 ± 0.0641
- high_school_biology 1 none 5 acc 0.7400 ± 0.0627
- high_school_chemistry 1 none 5 acc 0.8200 ± 0.0549
- high_school_computer_science 1 none 5 acc 0.2200 ± 0.0592
- high_school_mathematics 1 none 5 acc 0.4800 ± 0.0714
- high_school_physics 1 none 5 acc 0.7400 ± 0.0627
- high_school_statistics 1 none 5 acc 0.7600 ± 0.0610
- machine_learning 1 none 5 acc 0.6000 ± 0.0700

Groups Version Filter n-shot Metric Value Stderr
mmlu 2 none acc 0.6832 ± 0.0081
- humanities 2 none 5 acc 0.6969 ± 0.0163
- other 2 none 5 acc 0.6723 ± 0.0173
- social sciences 2 none 5 acc 0.7333 ± 0.0165
- stem 2 none 5 acc 0.6495 ± 0.0147
```

```
(lm-evaluation-harness) yves@desk:/data/benches$ curl http://localhost:8050/v1/models
{"models":[{"name":"unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-Q4_K_S.gguf","model":"unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-Q4_K_S.gguf","modified_at":"","size":"","digest":"","type":"model","description":"","tags":[""],"capabilities":["completion"],"parameters":"","details":{"parent_model":"","format":"gguf","family":"","families":[""],"parameter_size":"","quantization_level":""}}],"object":"list","data":[{"id":"unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-Q4_K_S.gguf","aliases":[],"tags":[],"object":"model","created":1775655458,"owned_by":"llamacpp","meta":{"vocab_type":2,"n_vocab":248320,"n_ctx_train":262144,"n_embd":2048,"n_params":34660610688,"size":20662856192}}]}
```


### gemma-4-26B-A4B-it-UD-Q4_K_XL

```
(lm-evaluation-harness) yves@desk:/data/benches$ lm_eval --model gguf \
    --model_args "base_url=http://localhost:8050" \
    --tasks "mmlu" \
    --num_fewshot 5 \
    --batch_size 1 \
    --limit 50
[...]
Tasks Version Filter n-shot Metric Value Stderr
mmlu 2 none acc 0.7688 ± 0.0076
- humanities 2 none 5 acc 0.7954 ± 0.0156
- formal_logic 1 none 5 acc 0.7200 ± 0.0641
- high_school_european_history 1 none 5 acc 0.7800 ± 0.0592
- high_school_us_history 1 none 5 acc 0.9000 ± 0.0429
- high_school_world_history 1 none 5 acc 0.8600 ± 0.0496
- international_law 1 none 5 acc 0.9000 ± 0.0429
- jurisprudence 1 none 5 acc 0.8600 ± 0.0496
- logical_fallacies 1 none 5 acc 0.8200 ± 0.0549
- moral_disputes 1 none 5 acc 0.7000 ± 0.0655
- moral_scenarios 1 none 5 acc 0.6800 ± 0.0666
- philosophy 1 none 5 acc 0.9000 ± 0.0429
- prehistory 1 none 5 acc 0.7400 ± 0.0627
- professional_law 1 none 5 acc 0.6400 ± 0.0686
- world_religions 1 none 5 acc 0.8400 ± 0.0524
- other 2 none 5 acc 0.7554 ± 0.0162
- business_ethics 1 none 5 acc 0.9000 ± 0.0429
- clinical_knowledge 1 none 5 acc 0.8000 ± 0.0571
- college_medicine 1 none 5 acc 0.7800 ± 0.0592
- global_facts 1 none 5 acc 0.4600 ± 0.0712
- human_aging 1 none 5 acc 0.7200 ± 0.0641
- management 1 none 5 acc 0.8600 ± 0.0496
- marketing 1 none 5 acc 0.9200 ± 0.0388
- medical_genetics 1 none 5 acc 0.8000 ± 0.0571
- miscellaneous 1 none 5 acc 0.8400 ± 0.0524
- nutrition 1 none 5 acc 0.8400 ± 0.0524
- professional_accounting 1 none 5 acc 0.5800 ± 0.0705
- professional_medicine 1 none 5 acc 0.7600 ± 0.0610
- virology 1 none 5 acc 0.5600 ± 0.0709
- social sciences 2 none 5 acc 0.8283 ± 0.0150
- econometrics 1 none 5 acc 0.7200 ± 0.0641
- high_school_geography 1 none 5 acc 0.8400 ± 0.0524
- high_school_government_and_politics 1 none 5 acc 1.0000 ± 0.0000
- high_school_macroeconomics 1 none 5 acc 0.7400 ± 0.0627
- high_school_microeconomics 1 none 5 acc 0.9400 ± 0.0339
- high_school_psychology 1 none 5 acc 0.9400 ± 0.0339
- human_sexuality 1 none 5 acc 0.8200 ± 0.0549
- professional_psychology 1 none 5 acc 0.7600 ± 0.0610
- public_relations 1 none 5 acc 0.6800 ± 0.0666
- security_studies 1 none 5 acc 0.7600 ± 0.0610
- sociology 1 none 5 acc 0.8400 ± 0.0524
- us_foreign_policy 1 none 5 acc 0.9000 ± 0.0429
- stem 2 none 5 acc 0.7221 ± 0.0140
- abstract_algebra 1 none 5 acc 0.6200 ± 0.0693
- anatomy 1 none 5 acc 0.7000 ± 0.0655
- astronomy 1 none 5 acc 0.9600 ± 0.0280
- college_biology 1 none 5 acc 0.9200 ± 0.0388
- college_chemistry 1 none 5 acc 0.6000 ± 0.0700
- college_computer_science 1 none 5 acc 0.7400 ± 0.0627
- college_mathematics 1 none 5 acc 0.4400 ± 0.0709
- college_physics 1 none 5 acc 0.6400 ± 0.0686
- computer_security 1 none 5 acc 0.8000 ± 0.0571
- conceptual_physics 1 none 5 acc 0.7200 ± 0.0641
- electrical_engineering 1 none 5 acc 0.7800 ± 0.0592
- elementary_mathematics 1 none 5 acc 0.6800 ± 0.0666
- high_school_biology 1 none 5 acc 0.9400 ± 0.0339
- high_school_chemistry 1 none 5 acc 0.7400 ± 0.0627
- high_school_computer_science 1 none 5 acc 0.9000 ± 0.0429
- high_school_mathematics 1 none 5 acc 0.5400 ± 0.0712
- high_school_physics 1 none 5 acc 0.5600 ± 0.0709
- high_school_statistics 1 none 5 acc 0.7400 ± 0.0627
- machine_learning 1 none 5 acc 0.7000 ± 0.0655

Groups Version Filter n-shot Metric Value Stderr
mmlu 2 none acc 0.7688 ± 0.0076
- humanities 2 none 5 acc 0.7954 ± 0.0156
- other 2 none 5 acc 0.7554 ± 0.0162
- social sciences 2 none 5 acc 0.8283 ± 0.0150
- stem 2 none 5 acc 0.7221 ± 0.0140
```

```
yves@desk:/data/models/unsloth$ curl http://localhost:8050/v1/models
{"models":[{"name":"gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf","model":"gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf","modified_at":"","size":"","digest":"","type":"model","description":"","tags":[""],"capabilities":["completion"],"parameters":"","details":{"parent_model":"","format":"gguf","family":"","families":[""],"parameter_size":"","quantization_level":""}}],"object":"list","data":[{"id":"gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf","aliases":[],"tags":[],"object":"model","created":1775659610,"owned_by":"llamacpp","meta":{"vocab_type":2,"n_vocab":262144,"n_ctx_train":262144,"n_embd":2816,"n_params":25233142046,"size":17074453624}}]}
```