# Raw notes
WARNING: THIS IS NOT A BENCHMARK YET
Summary of a run limited to 50 samples per task:
Note to self: Gemma's poor GSM8K score must be due to bad prompt formatting or bad parameters; it is not consistent with its other results otherwise
| Model | Quantization | GSM8K (flex/strict) | Winogrande | MMLU |
|---|---|---|---|---|
| Qwen 3.5 35B A3B | Q4_K_S | 88/86 | 56 | 68.32 |
| Gemma 4 26 A4B | Q4_K_XL | 38 | 52 | 76.88 |
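For context on the warning above: with only 50 samples per task, the per-task standard errors are large. The stderr values reported in the detailed tables below happen to match the usual sample-proportion formula with an n−1 denominator (an observation from the numbers, not a claim about lm-eval internals):

```python
import math

def acc_stderr(p, n):
    """Sample standard error of an accuracy p measured over n items (n-1 denominator)."""
    return math.sqrt(p * (1 - p) / (n - 1))

# With 50 samples, even a several-point accuracy gap sits within one standard error:
print(round(acc_stderr(0.88, 50), 4))  # 0.0464, matching the GSM8K stderr below
print(round(acc_stderr(0.56, 50), 4))  # 0.0709, matching the Winogrande stderr below
```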
## GSM8K
### Qwen 3.5 35B A3B in Q4_K_S
GSM8K test with Qwen 3.5 35B A3B in Q4_K_S
(lm-evaluation-harness) yves@desk:/data/benches$ lm_eval --model local-completions \
--model_args "base_url=http://localhost:8050/v1/completions,api_key=EMPTY,pretrained=Qwen/Qwen3.5-35B-A3B" \
--tasks "gsm8k" \
--num_fewshot 8 \
--batch_size 1 \
--limit 50
[...]
|Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k| 3|flexible-extract| 8|exact_match|↑ | 0.88|± |0.0464|
| | |strict-match | 8|exact_match|↑ | 0.86|± |0.0496|
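The flexible-extract vs strict-match gap (0.88 vs 0.86) comes from how the numeric answer is pulled out of the completion. Roughly, as simplified approximations of the harness filters (not the exact regexes it ships):

```python
import re

def flexible_extract(completion):
    # take the last number appearing anywhere in the completion
    numbers = re.findall(r"-?[\d,]*\.?\d+", completion)
    return numbers[-1].replace(",", "") if numbers else None

def strict_match(completion):
    # only accept an answer given in GSM8K's "#### <number>" format
    m = re.search(r"####\s*(-?[0-9.,]+)", completion)
    return m.group(1).replace(",", "") if m else None

print(flexible_extract("6 * 7 = 42, so the answer is 42"))  # 42
print(strict_match("6 * 7 = 42\n#### 42"))                  # 42
print(strict_match("6 * 7 = 42, so the answer is 42"))      # None
```

So a model that reasons correctly but omits the `####` terminator scores under flexible-extract only, which is why the two rows can differ.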
### gemma-4-26B-A4B-it-UD-Q4_K_XL
Test with gemma-4-26B-A4B-it-UD-Q4_K_XL
(lm-evaluation-harness) yves@desk:/data/benches$ lm_eval --model local-completions \
--model_args "base_url=http://localhost:8050/v1/completions,api_key=EMPTY,pretrained=google/gemma-4-26B-A4B" \
--tasks "gsm8k" \
--num_fewshot 8 \
--batch_size 1 \
--limit 50
[...]
|Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k| 3|flexible-extract| 8|exact_match|↑ | 0.38|± |0.0693|
| | |strict-match | 8|exact_match|↑ | 0.38|± |0.0693|
## Winogrande
### Qwen 3.5 35B A3B in Q4_K_S
Winogrande test with Qwen 3.5 35B A3B in Q4_K_S
(lm-evaluation-harness) yves@desk:/data/benches$ lm_eval --model gguf \
--model_args "base_url=http://localhost:8050" \
--tasks "winogrande" \
--num_fewshot 8 \
--batch_size 1 \
--limit 50
[...]
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| winogrande | 1 | none | 8 | acc | ↑ | 0.56 | ± | 0.0709 |
(lm-evaluation-harness) yves@desk:/data/benches$ curl http://localhost:8050/v1/models {"models":[{"name":"unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-Q4_K_S.gguf","model":"unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-Q4_K_S.gguf","modified_at":"","size":"","digest":"","type":"model","description":"","tags":[""],"capabilities":["completion"],"parameters":"","details":{"parent_model":"","format":"gguf","family":"","families":[""],"parameter_size":"","quantization_level":""}}],"object":"list","data":[{"id":"unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-Q4_K_S.gguf","aliases":[],"tags":[],"object":"model","created":1775653409,"owned_by":"llamacpp","meta":{"vocab_type":2,"n_vocab":248320,"n_ctx_train":262144,"n_embd":2048,"n_params":34660610688,"size":20662856192}}]}
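The `size` and `n_params` fields in the `/v1/models` metadata give a quick effective bits-per-weight figure for each quant (values copied from the two curl responses in these notes):

```python
def bits_per_weight(size_bytes, n_params):
    """Effective bits per weight of a quantized model file."""
    return size_bytes * 8 / n_params

# Qwen3.5-35B-A3B Q4_K_S: size=20662856192, n_params=34660610688
print(round(bits_per_weight(20662856192, 34660610688), 2))  # 4.77
# gemma-4-26B-A4B Q4_K_XL: size=17074453624, n_params=25233142046
print(round(bits_per_weight(17074453624, 25233142046), 2))  # 5.41
```

So the Gemma Q4_K_XL file actually keeps noticeably more bits per weight than the Qwen Q4_K_S file, which is worth remembering when comparing the two quants head to head.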
### gemma-4-26B-A4B-it-UD-Q4_K_XL
Test with gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf
(lm-evaluation-harness) yves@desk:/data/benches$ lm_eval --model gguf \
--model_args "base_url=http://localhost:8050" \
--tasks "winogrande" \
--num_fewshot 8 \
--batch_size 1 \
--limit 50
[...]
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| winogrande | 1 | none | 8 | acc | ↑ | 0.52 | ± | 0.0714 |
(lm-evaluation-harness) yves@desk:/data/benches$ curl http://localhost:8050/v1/models {"models":[{"name":"gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf","model":"gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf","modified_at":"","size":"","digest":"","type":"model","description":"","tags":[""],"capabilities":["completion"],"parameters":"","details":{"parent_model":"","format":"gguf","family":"","families":[""],"parameter_size":"","quantization_level":""}}],"object":"list","data":[{"id":"gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf","aliases":[],"tags":[],"object":"model","created":1775653237,"owned_by":"llamacpp","meta":{"vocab_type":2,"n_vocab":262144,"n_ctx_train":262144,"n_embd":2816,"n_params":25233142046,"size":17074453624}}]}
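A sanity check on the Winogrande gap: 0.56 vs 0.52 on 50 samples each is nowhere near significant. A standard two-proportion z-test (my own quick check, not harness output):

```python
import math

def two_proportion_z(p1, p2, n1, n2):
    """Two-proportion z-test statistic using a pooled success rate."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

z = two_proportion_z(0.56, 0.52, 50, 50)
print(round(z, 2))  # 0.4 -> far below the ~1.96 needed for p < 0.05
```

At this sample size the two models are statistically indistinguishable on Winogrande.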
## MMLU
### Qwen 3.5 35B A3B in Q4_K_S
(lm-evaluation-harness) yves@desk:/data/benches$ lm_eval --model gguf \
--model_args "base_url=http://localhost:8050" \
--tasks "mmlu" \
--num_fewshot 5 \
--batch_size 1 \
--limit 50
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| mmlu | 2 | none | | acc | ↑ | 0.6832 | ± | 0.0081 |
| - humanities | 2 | none | 5 | acc | ↑ | 0.6969 | ± | 0.0163 |
| - formal_logic | 1 | none | 5 | acc | ↑ | 0.3200 | ± | 0.0666 |
| - high_school_european_history | 1 | none | 5 | acc | ↑ | 0.8400 | ± | 0.0524 |
| - high_school_us_history | 1 | none | 5 | acc | ↑ | 0.9000 | ± | 0.0429 |
| - high_school_world_history | 1 | none | 5 | acc | ↑ | 0.9200 | ± | 0.0388 |
| - international_law | 1 | none | 5 | acc | ↑ | 0.9200 | ± | 0.0388 |
| - jurisprudence | 1 | none | 5 | acc | ↑ | 0.8000 | ± | 0.0571 |
| - logical_fallacies | 1 | none | 5 | acc | ↑ | 0.6800 | ± | 0.0666 |
| - moral_disputes | 1 | none | 5 | acc | ↑ | 0.5600 | ± | 0.0709 |
| - moral_scenarios | 1 | none | 5 | acc | ↑ | 0.3200 | ± | 0.0666 |
| - philosophy | 1 | none | 5 | acc | ↑ | 0.8400 | ± | 0.0524 |
| - prehistory | 1 | none | 5 | acc | ↑ | 0.5000 | ± | 0.0714 |
| - professional_law | 1 | none | 5 | acc | ↑ | 0.6800 | ± | 0.0666 |
| - world_religions | 1 | none | 5 | acc | ↑ | 0.7800 | ± | 0.0592 |
| - other | 2 | none | 5 | acc | ↑ | 0.6723 | ± | 0.0173 |
| - business_ethics | 1 | none | 5 | acc | ↑ | 0.7600 | ± | 0.0610 |
| - clinical_knowledge | 1 | none | 5 | acc | ↑ | 0.7400 | ± | 0.0627 |
| - college_medicine | 1 | none | 5 | acc | ↑ | 0.7600 | ± | 0.0610 |
| - global_facts | 1 | none | 5 | acc | ↑ | 0.6000 | ± | 0.0700 |
| - human_aging | 1 | none | 5 | acc | ↑ | 0.5600 | ± | 0.0709 |
| - management | 1 | none | 5 | acc | ↑ | 0.7400 | ± | 0.0627 |
| - marketing | 1 | none | 5 | acc | ↑ | 0.3800 | ± | 0.0693 |
| - medical_genetics | 1 | none | 5 | acc | ↑ | 0.8800 | ± | 0.0464 |
| - miscellaneous | 1 | none | 5 | acc | ↑ | 0.8600 | ± | 0.0496 |
| - nutrition | 1 | none | 5 | acc | ↑ | 0.7200 | ± | 0.0641 |
| - professional_accounting | 1 | none | 5 | acc | ↑ | 0.6400 | ± | 0.0686 |
| - professional_medicine | 1 | none | 5 | acc | ↑ | 0.8200 | ± | 0.0549 |
| - virology | 1 | none | 5 | acc | ↑ | 0.2800 | ± | 0.0641 |
| - social sciences | 2 | none | 5 | acc | ↑ | 0.7333 | ± | 0.0165 |
| - econometrics | 1 | none | 5 | acc | ↑ | 0.4400 | ± | 0.0709 |
| - high_school_geography | 1 | none | 5 | acc | ↑ | 0.9200 | ± | 0.0388 |
| - high_school_government_and_politics | 1 | none | 5 | acc | ↑ | 0.9200 | ± | 0.0388 |
| - high_school_macroeconomics | 1 | none | 5 | acc | ↑ | 0.8000 | ± | 0.0571 |
| - high_school_microeconomics | 1 | none | 5 | acc | ↑ | 0.9400 | ± | 0.0339 |
| - high_school_psychology | 1 | none | 5 | acc | ↑ | 0.9400 | ± | 0.0339 |
| - human_sexuality | 1 | none | 5 | acc | ↑ | 0.8000 | ± | 0.0571 |
| - professional_psychology | 1 | none | 5 | acc | ↑ | 0.7000 | ± | 0.0655 |
| - public_relations | 1 | none | 5 | acc | ↑ | 0.4600 | ± | 0.0712 |
| - security_studies | 1 | none | 5 | acc | ↑ | 0.4200 | ± | 0.0705 |
| - sociology | 1 | none | 5 | acc | ↑ | 0.7400 | ± | 0.0627 |
| - us_foreign_policy | 1 | none | 5 | acc | ↑ | 0.7200 | ± | 0.0641 |
| - stem | 2 | none | 5 | acc | ↑ | 0.6495 | ± | 0.0147 |
| - abstract_algebra | 1 | none | 5 | acc | ↑ | 0.6600 | ± | 0.0677 |
| - anatomy | 1 | none | 5 | acc | ↑ | 0.5600 | ± | 0.0709 |
| - astronomy | 1 | none | 5 | acc | ↑ | 0.6800 | ± | 0.0666 |
| - college_biology | 1 | none | 5 | acc | ↑ | 0.9200 | ± | 0.0388 |
| - college_chemistry | 1 | none | 5 | acc | ↑ | 0.5200 | ± | 0.0714 |
| - college_computer_science | 1 | none | 5 | acc | ↑ | 0.4600 | ± | 0.0712 |
| - college_mathematics | 1 | none | 5 | acc | ↑ | 0.5200 | ± | 0.0714 |
| - college_physics | 1 | none | 5 | acc | ↑ | 0.7400 | ± | 0.0627 |
| - computer_security | 1 | none | 5 | acc | ↑ | 0.8000 | ± | 0.0571 |
| - conceptual_physics | 1 | none | 5 | acc | ↑ | 0.8200 | ± | 0.0549 |
| - electrical_engineering | 1 | none | 5 | acc | ↑ | 0.5800 | ± | 0.0705 |
| - elementary_mathematics | 1 | none | 5 | acc | ↑ | 0.7200 | ± | 0.0641 |
| - high_school_biology | 1 | none | 5 | acc | ↑ | 0.7400 | ± | 0.0627 |
| - high_school_chemistry | 1 | none | 5 | acc | ↑ | 0.8200 | ± | 0.0549 |
| - high_school_computer_science | 1 | none | 5 | acc | ↑ | 0.2200 | ± | 0.0592 |
| - high_school_mathematics | 1 | none | 5 | acc | ↑ | 0.4800 | ± | 0.0714 |
| - high_school_physics | 1 | none | 5 | acc | ↑ | 0.7400 | ± | 0.0627 |
| - high_school_statistics | 1 | none | 5 | acc | ↑ | 0.7600 | ± | 0.0610 |
| - machine_learning | 1 | none | 5 | acc | ↑ | 0.6000 | ± | 0.0700 |
| Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| mmlu | 2 | none | | acc | ↑ | 0.6832 | ± | 0.0081 |
| - humanities | 2 | none | 5 | acc | ↑ | 0.6969 | ± | 0.0163 |
| - other | 2 | none | 5 | acc | ↑ | 0.6723 | ± | 0.0173 |
| - social sciences | 2 | none | 5 | acc | ↑ | 0.7333 | ± | 0.0165 |
| - stem | 2 | none | 5 | acc | ↑ | 0.6495 | ± | 0.0147 |
(lm-evaluation-harness) yves@desk:/data/benches$ curl http://localhost:8050/v1/models {"models":[{"name":"unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-Q4_K_S.gguf","model":"unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-Q4_K_S.gguf","modified_at":"","size":"","digest":"","type":"model","description":"","tags":[""],"capabilities":["completion"],"parameters":"","details":{"parent_model":"","format":"gguf","family":"","families":[""],"parameter_size":"","quantization_level":""}}],"object":"list","data":[{"id":"unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-Q4_K_S.gguf","aliases":[],"tags":[],"object":"model","created":1775655458,"owned_by":"llamacpp","meta":{"vocab_type":2,"n_vocab":248320,"n_ctx_train":262144,"n_embd":2048,"n_params":34660610688,"size":20662856192}}]}
### gemma-4-26B-A4B-it-UD-Q4_K_XL
(lm-evaluation-harness) yves@desk:/data/benches$ lm_eval --model gguf \
--model_args "base_url=http://localhost:8050" \
--tasks "mmlu" \
--num_fewshot 5 \
--batch_size 1 \
--limit 50
[...]
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| mmlu | 2 | none | | acc | ↑ | 0.7688 | ± | 0.0076 |
| - humanities | 2 | none | 5 | acc | ↑ | 0.7954 | ± | 0.0156 |
| - formal_logic | 1 | none | 5 | acc | ↑ | 0.7200 | ± | 0.0641 |
| - high_school_european_history | 1 | none | 5 | acc | ↑ | 0.7800 | ± | 0.0592 |
| - high_school_us_history | 1 | none | 5 | acc | ↑ | 0.9000 | ± | 0.0429 |
| - high_school_world_history | 1 | none | 5 | acc | ↑ | 0.8600 | ± | 0.0496 |
| - international_law | 1 | none | 5 | acc | ↑ | 0.9000 | ± | 0.0429 |
| - jurisprudence | 1 | none | 5 | acc | ↑ | 0.8600 | ± | 0.0496 |
| - logical_fallacies | 1 | none | 5 | acc | ↑ | 0.8200 | ± | 0.0549 |
| - moral_disputes | 1 | none | 5 | acc | ↑ | 0.7000 | ± | 0.0655 |
| - moral_scenarios | 1 | none | 5 | acc | ↑ | 0.6800 | ± | 0.0666 |
| - philosophy | 1 | none | 5 | acc | ↑ | 0.9000 | ± | 0.0429 |
| - prehistory | 1 | none | 5 | acc | ↑ | 0.7400 | ± | 0.0627 |
| - professional_law | 1 | none | 5 | acc | ↑ | 0.6400 | ± | 0.0686 |
| - world_religions | 1 | none | 5 | acc | ↑ | 0.8400 | ± | 0.0524 |
| - other | 2 | none | 5 | acc | ↑ | 0.7554 | ± | 0.0162 |
| - business_ethics | 1 | none | 5 | acc | ↑ | 0.9000 | ± | 0.0429 |
| - clinical_knowledge | 1 | none | 5 | acc | ↑ | 0.8000 | ± | 0.0571 |
| - college_medicine | 1 | none | 5 | acc | ↑ | 0.7800 | ± | 0.0592 |
| - global_facts | 1 | none | 5 | acc | ↑ | 0.4600 | ± | 0.0712 |
| - human_aging | 1 | none | 5 | acc | ↑ | 0.7200 | ± | 0.0641 |
| - management | 1 | none | 5 | acc | ↑ | 0.8600 | ± | 0.0496 |
| - marketing | 1 | none | 5 | acc | ↑ | 0.9200 | ± | 0.0388 |
| - medical_genetics | 1 | none | 5 | acc | ↑ | 0.8000 | ± | 0.0571 |
| - miscellaneous | 1 | none | 5 | acc | ↑ | 0.8400 | ± | 0.0524 |
| - nutrition | 1 | none | 5 | acc | ↑ | 0.8400 | ± | 0.0524 |
| - professional_accounting | 1 | none | 5 | acc | ↑ | 0.5800 | ± | 0.0705 |
| - professional_medicine | 1 | none | 5 | acc | ↑ | 0.7600 | ± | 0.0610 |
| - virology | 1 | none | 5 | acc | ↑ | 0.5600 | ± | 0.0709 |
| - social sciences | 2 | none | 5 | acc | ↑ | 0.8283 | ± | 0.0150 |
| - econometrics | 1 | none | 5 | acc | ↑ | 0.7200 | ± | 0.0641 |
| - high_school_geography | 1 | none | 5 | acc | ↑ | 0.8400 | ± | 0.0524 |
| - high_school_government_and_politics | 1 | none | 5 | acc | ↑ | 1.0000 | ± | 0.0000 |
| - high_school_macroeconomics | 1 | none | 5 | acc | ↑ | 0.7400 | ± | 0.0627 |
| - high_school_microeconomics | 1 | none | 5 | acc | ↑ | 0.9400 | ± | 0.0339 |
| - high_school_psychology | 1 | none | 5 | acc | ↑ | 0.9400 | ± | 0.0339 |
| - human_sexuality | 1 | none | 5 | acc | ↑ | 0.8200 | ± | 0.0549 |
| - professional_psychology | 1 | none | 5 | acc | ↑ | 0.7600 | ± | 0.0610 |
| - public_relations | 1 | none | 5 | acc | ↑ | 0.6800 | ± | 0.0666 |
| - security_studies | 1 | none | 5 | acc | ↑ | 0.7600 | ± | 0.0610 |
| - sociology | 1 | none | 5 | acc | ↑ | 0.8400 | ± | 0.0524 |
| - us_foreign_policy | 1 | none | 5 | acc | ↑ | 0.9000 | ± | 0.0429 |
| - stem | 2 | none | 5 | acc | ↑ | 0.7221 | ± | 0.0140 |
| - abstract_algebra | 1 | none | 5 | acc | ↑ | 0.6200 | ± | 0.0693 |
| - anatomy | 1 | none | 5 | acc | ↑ | 0.7000 | ± | 0.0655 |
| - astronomy | 1 | none | 5 | acc | ↑ | 0.9600 | ± | 0.0280 |
| - college_biology | 1 | none | 5 | acc | ↑ | 0.9200 | ± | 0.0388 |
| - college_chemistry | 1 | none | 5 | acc | ↑ | 0.6000 | ± | 0.0700 |
| - college_computer_science | 1 | none | 5 | acc | ↑ | 0.7400 | ± | 0.0627 |
| - college_mathematics | 1 | none | 5 | acc | ↑ | 0.4400 | ± | 0.0709 |
| - college_physics | 1 | none | 5 | acc | ↑ | 0.6400 | ± | 0.0686 |
| - computer_security | 1 | none | 5 | acc | ↑ | 0.8000 | ± | 0.0571 |
| - conceptual_physics | 1 | none | 5 | acc | ↑ | 0.7200 | ± | 0.0641 |
| - electrical_engineering | 1 | none | 5 | acc | ↑ | 0.7800 | ± | 0.0592 |
| - elementary_mathematics | 1 | none | 5 | acc | ↑ | 0.6800 | ± | 0.0666 |
| - high_school_biology | 1 | none | 5 | acc | ↑ | 0.9400 | ± | 0.0339 |
| - high_school_chemistry | 1 | none | 5 | acc | ↑ | 0.7400 | ± | 0.0627 |
| - high_school_computer_science | 1 | none | 5 | acc | ↑ | 0.9000 | ± | 0.0429 |
| - high_school_mathematics | 1 | none | 5 | acc | ↑ | 0.5400 | ± | 0.0712 |
| - high_school_physics | 1 | none | 5 | acc | ↑ | 0.5600 | ± | 0.0709 |
| - high_school_statistics | 1 | none | 5 | acc | ↑ | 0.7400 | ± | 0.0627 |
| - machine_learning | 1 | none | 5 | acc | ↑ | 0.7000 | ± | 0.0655 |
| Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| mmlu | 2 | none | | acc | ↑ | 0.7688 | ± | 0.0076 |
| - humanities | 2 | none | 5 | acc | ↑ | 0.7954 | ± | 0.0156 |
| - other | 2 | none | 5 | acc | ↑ | 0.7554 | ± | 0.0162 |
| - social sciences | 2 | none | 5 | acc | ↑ | 0.8283 | ± | 0.0150 |
| - stem | 2 | none | 5 | acc | ↑ | 0.7221 | ± | 0.0140 |
yves@desk:/data/models/unsloth$ curl http://localhost:8050/v1/models {"models":[{"name":"gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf","model":"gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf","modified_at":"","size":"","digest":"","type":"model","description":"","tags":[""],"capabilities":["completion"],"parameters":"","details":{"parent_model":"","format":"gguf","family":"","families":[""],"parameter_size":"","quantization_level":""}}],"object":"list","data":[{"id":"gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf","aliases":[],"tags":[],"object":"model","created":1775659610,"owned_by":"llamacpp","meta":{"vocab_type":2,"n_vocab":262144,"n_ctx_train":262144,"n_embd":2816,"n_params":25233142046,"size":17074453624}}]}
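Unlike Winogrande, the MMLU gap looks real even at limit 50, because MMLU aggregates 57 subtasks (~2850 samples total here), so the aggregate stderr is small. Using the two harness-reported aggregate scores and stderrs above:

```python
import math

# aggregate MMLU accuracies and reported stderrs from the two runs in these notes
qwen_acc, qwen_se = 0.6832, 0.0081
gemma_acc, gemma_se = 0.7688, 0.0076

# z-score of the difference, treating the two runs as independent
z = (gemma_acc - qwen_acc) / math.sqrt(qwen_se**2 + gemma_se**2)
print(round(z, 1))  # 7.7 -> the 8.6-point gap is many standard errors wide
```

So whatever is wrong with the Gemma GSM8K setup, its MMLU advantage here is not sampling noise.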