EuroEval

The robust European language model benchmark.

EuroEval is a language model benchmarking framework that supports evaluating all types of language models: encoders, decoders, encoder-decoders, base models, and instruction-tuned models. EuroEval has been battle-tested for more than three years and is the standard evaluation benchmark for many companies, universities and organisations around Europe.

Check out the leaderboards to see how different language models perform on a wide range of tasks in various European languages. The leaderboards are updated regularly with new models and new results. All benchmark results have been computed using the associated EuroEval Python package, which you can use to replicate all the results. The package supports all models on the Hugging Face Hub, as well as models accessible through 100+ different APIs, including models you host yourself, for example via Ollama or LM Studio.

The idea of EuroEval grew out of the development of the Danish language model RøBÆRTa in 2021, when we realised that there was no standard way to evaluate Danish language models. It started as a hobby project covering Danish, Swedish and Norwegian, but has since grown to include 8+ European languages.

EuroEval is maintained by Dan Saattrup Nielsen from the Alexandra Institute and Kenneth Enevoldsen from Aarhus University, and is funded by the EU project TrustLLM.