A Robust Multilingual Evaluation Framework for Language Models
ScandEval is a language model benchmarking framework that supports evaluating all types of language models: encoders, decoders, encoder-decoders, base models, and instruction-tuned models. ScandEval has been battle-tested for more than three years and is the standard evaluation benchmark for many companies, universities, and organisations around Europe.
Check out the leaderboards to see how different language models perform on a wide range of tasks in various European languages. The leaderboards are updated regularly with new models and new results. All benchmark results have been computed using the associated ScandEval Python package, which you can use to replicate all of the results. It supports all models on the Hugging Face Hub, as well as models accessible through 100+ different APIs, including models you host yourself via, e.g., Ollama or LM Studio.
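As a rough illustration of how a benchmark run with the Python package might look, here is a minimal sketch. The exact argument names (such as `language` and `model`) are assumptions and may differ between package versions, so consult the package documentation for the precise interface.

```python
# Minimal sketch: benchmark a Hugging Face Hub model with ScandEval.
# Argument names below are assumptions; check the ScandEval docs for
# the exact interface of your installed version.
from scandeval import Benchmarker

# Restrict the run to a single language (Danish in this sketch).
benchmarker = Benchmarker(language="da")

# Benchmark a model from the Hugging Face Hub by its model ID.
# The hypothetical model ID below is only a placeholder.
benchmarker(model="your-org/your-model")
```

The same package can also be invoked from the command line, which is how the leaderboard results can be reproduced locally.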
The idea of ScandEval grew out of the development of the Danish language model RøBÆRTa in 2021, when we realised that there was no standard way to evaluate Danish language models. It started as a hobby project covering Danish, Swedish, and Norwegian, but has since grown to include 8+ European languages.
ScandEval is maintained by Dan Saattrup Nielsen from the Alexandra Institute and Kenneth Enevoldsen from Aarhus University, and is funded by the EU project TrustLLM.
The image used in the logo was created by the amazing Scandinavia and the World team. Go check them out!