scandeval

source package scandeval

ScandEval - A benchmarking framework for language models.

Classes

  • Benchmarker Benchmarks Scandinavian language models.

Functions

  • block_terminal_output Blocks libraries from writing output to the terminal.

source class Benchmarker(progress_bar: bool = True, save_results: bool = True, task: str | list[str] | None = None, dataset: list[str] | str | None = None, language: str | list[str] = 'all', model_language: str | list[str] | None = None, dataset_language: str | list[str] | None = None, device: Device | None = None, batch_size: int = 32, raise_errors: bool = False, cache_dir: str = '.scandeval_cache', api_key: str | None = None, force: bool = False, verbose: bool = False, trust_remote_code: bool = False, use_flash_attention: bool | None = None, clear_model_cache: bool = False, evaluate_test_split: bool = False, few_shot: bool = True, num_iterations: int = 10, api_base: str | None = None, api_version: str | None = None, debug: bool = False, run_with_cli: bool = False, only_allow_safetensors: bool = False)

Benchmarks Scandinavian language models.

Initialise the benchmarker.

Attributes

  • benchmark_config_default_params

    The default parameters for the benchmark configuration.

  • benchmark_config

    The benchmark configuration.

  • force

    Whether to force evaluations of models, even if they have been benchmarked already.

  • results_path

    The path to the results file.

  • benchmark_results : list[BenchmarkResult]

    The benchmark results.
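A minimal sketch of inspecting these attributes after constructing a Benchmarker with default settings (the printed values are illustrative, not guaranteed):

```python
from scandeval import Benchmarker

benchmarker = Benchmarker()

# Default parameters used to build the benchmark configuration
print(benchmarker.benchmark_config_default_params)

# Path that results are written to when `save_results` is enabled
print(benchmarker.results_path)

# List of BenchmarkResult objects; empty until `benchmark` has been called
print(benchmarker.benchmark_results)
```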

Parameters

  • progress_bar : bool

    Whether progress bars should be shown. Defaults to True.

  • save_results : bool

    Whether to save the benchmark results to 'scandeval_benchmark_results.jsonl'. Defaults to True.

  • task : str | list[str] | None

    The tasks to benchmark the model(s) on. Mutually exclusive with dataset. If both task and dataset are None then all datasets will be benchmarked.

  • dataset : list[str] | str | None

    The datasets to benchmark on. Mutually exclusive with task. If both task and dataset are None then all datasets will be benchmarked.

  • language : str | list[str]

    The language codes of the languages to include, both for models and datasets. Set this to 'all' if all languages should be considered. Defaults to 'all'.

  • model_language : str | list[str] | None

    The language codes of the languages to include for models. If specified then this overrides the language parameter for model languages. Defaults to None.

  • dataset_language : str | list[str] | None

    The language codes of the languages to include for datasets. If specified then this overrides the language parameter for dataset languages. Defaults to None.

  • device : Device | None

    The device to use for benchmarking. Defaults to None.

  • batch_size : int

    The batch size to use. Defaults to 32.

  • raise_errors : bool

    Whether to raise errors instead of skipping the model evaluation. Defaults to False.

  • cache_dir : str

    Directory to store cached models. Defaults to '.scandeval_cache'.

  • api_key : str | None

    The API key to use for a given inference API.

  • force : bool

    Whether to force evaluations of models, even if they have been benchmarked already. Defaults to False.

  • verbose : bool

    Whether to output additional logging output. This is automatically enabled if debug is True. Defaults to False.

  • trust_remote_code : bool

    Whether to trust remote code when loading models. Defaults to False.

  • use_flash_attention : bool | None

    Whether to use Flash Attention. If None then it will be used if it is installed and the model is a decoder model. Defaults to None.

  • clear_model_cache : bool

    Whether to clear the model cache after benchmarking each model. Defaults to False.

  • evaluate_test_split : bool

    Whether to evaluate the test split of the datasets. Defaults to False.

  • few_shot : bool

    Whether to only evaluate the model using few-shot evaluation. Only relevant if the model is generative. Defaults to True.

  • num_iterations : int

    The number of times each model should be evaluated. This is only meant to be used by power users, and scores will not be allowed on the leaderboards if this is changed. Defaults to 10.

  • api_base : str | None

    The base URL for a given inference API. Only relevant if model refers to a model on an inference API. Defaults to None.

  • api_version : str | None

    The version of the API to use. Defaults to None.

  • debug : bool

    Whether to output debug information. Defaults to False.

  • run_with_cli : bool

    Whether the benchmarker is being run from the command-line interface. Defaults to False.

  • only_allow_safetensors : bool

    Whether to only allow models that use the safetensors format. Defaults to False.

Raises

  • ValueError

    If both task and dataset are specified.
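As an illustration of the parameters above, a hedged sketch of a non-default initialisation (the values chosen are arbitrary examples, not recommendations):

```python
from scandeval import Benchmarker

benchmarker = Benchmarker(
    language="da",              # restrict both models and datasets to Danish
    batch_size=16,              # lower the batch size, e.g. to fit a smaller GPU
    evaluate_test_split=False,  # evaluate on the validation split (the default)
    raise_errors=True,          # fail loudly instead of skipping broken models
    cache_dir=".scandeval_cache",
)
```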

Methods

  • benchmark Benchmarks models on datasets.

source property Benchmarker.benchmark_results: list[BenchmarkResult]

The benchmark results.

source method Benchmarker.benchmark(model: list[str] | str, task: str | list[str] | None = None, dataset: list[str] | str | None = None, progress_bar: bool | None = None, save_results: bool | None = None, language: str | list[str] | None = None, model_language: str | list[str] | None = None, dataset_language: str | list[str] | None = None, device: Device | None = None, batch_size: int | None = None, raise_errors: bool | None = None, cache_dir: str | None = None, api_key: str | None = None, force: bool | None = None, verbose: bool | None = None, trust_remote_code: bool | None = None, use_flash_attention: bool | None = None, clear_model_cache: bool | None = None, evaluate_test_split: bool | None = None, few_shot: bool | None = None, num_iterations: int | None = None, only_allow_safetensors: bool | None = None) -> list[BenchmarkResult]

Benchmarks models on datasets.

Parameters

  • model : list[str] | str

    The full Hugging Face Hub path(s) to the pretrained transformer model. A specific model version can be selected by appending '@' and the version, as in "model@v1.0.0"; this can be a branch name, a tag name, or a commit id, and defaults to the latest version if not specified.

  • task : str | list[str] | None

    The tasks to benchmark the model(s) on. Mutually exclusive with dataset. If both task and dataset are None then all datasets will be benchmarked. Defaults to None.

  • dataset : list[str] | str | None

    The datasets to benchmark on. Mutually exclusive with task. If both task and dataset are None then all datasets will be benchmarked. Defaults to None.

  • progress_bar : bool | None

    Whether progress bars should be shown. Defaults to the value specified when initialising the benchmarker.

  • save_results : bool | None

    Whether to save the benchmark results to 'scandeval_benchmark_results.jsonl'. Defaults to the value specified when initialising the benchmarker.

  • language : str | list[str] | None

    The language codes of the languages to include, both for models and datasets. Here 'no' means both Bokmål (nb) and Nynorsk (nn). Set this to 'all' if all languages (also non-Scandinavian) should be considered. Defaults to the value specified when initialising the benchmarker.

  • model_language : str | list[str] | None

    The language codes of the languages to include for models. If specified then this overrides the language parameter for model languages. Defaults to the value specified when initialising the benchmarker.

  • dataset_language : str | list[str] | None

    The language codes of the languages to include for datasets. If specified then this overrides the language parameter for dataset languages. Defaults to the value specified when initialising the benchmarker.

  • device : Device | None

    The device to use for benchmarking. Defaults to the value specified when initialising the benchmarker.

  • batch_size : int | None

    The batch size to use. Defaults to the value specified when initialising the benchmarker.

  • raise_errors : bool | None

    Whether to raise errors instead of skipping the model evaluation. Defaults to the value specified when initialising the benchmarker.

  • cache_dir : str | None

    Directory to store cached models. Defaults to the value specified when initialising the benchmarker.

  • api_key : str | None

    The API key to use for a given inference server. Defaults to the value specified when initialising the benchmarker.

  • force : bool | None

    Whether to force evaluations of models, even if they have been benchmarked already. Defaults to the value specified when initialising the benchmarker.

  • verbose : bool | None

    Whether to output additional logging output. Defaults to the value specified when initialising the benchmarker.

  • trust_remote_code : bool | None

    Whether to trust remote code when loading models. Defaults to the value specified when initialising the benchmarker.

  • use_flash_attention : bool | None

    Whether to use Flash Attention. Defaults to the value specified when initialising the benchmarker.

  • clear_model_cache : bool | None

    Whether to clear the model cache after benchmarking each model. Defaults to the value specified when initialising the benchmarker.

  • evaluate_test_split : bool | None

    Whether to evaluate the test split of the datasets. Defaults to the value specified when initialising the benchmarker.

  • few_shot : bool | None

    Whether to only evaluate the model using few-shot evaluation. Only relevant if the model is generative. Defaults to the value specified when initialising the benchmarker.

  • num_iterations : int | None

    The number of times each model should be evaluated. This is only meant to be used by power users, and scores will not be allowed on the leaderboards if this is changed. Defaults to the value specified when initialising the benchmarker.

  • only_allow_safetensors : bool | None

    Whether to only allow models that use the safetensors format. Defaults to the value specified when initialising the benchmarker.

Returns

  • list[BenchmarkResult]

    The benchmark results.

Raises

  • ValueError

    If both task and dataset are specified.

  • Exception

    If raise_errors is True, any error raised during a model evaluation is re-raised instead of the model being skipped.
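Putting the above together, a usage sketch for the benchmark method; the model and dataset identifiers below are placeholders rather than verified names:

```python
from scandeval import Benchmarker

benchmarker = Benchmarker(language="da")

# A revision can be pinned by appending '@' and a branch, tag or commit id.
results = benchmarker.benchmark(
    model="example-org/example-model@main",  # placeholder Hugging Face Hub ID
    dataset="angry-tweets",                  # placeholder dataset name
    evaluate_test_split=False,               # per-call override of the init value
)

for result in results:
    print(result)
```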

source block_terminal_output()

Blocks libraries from writing output to the terminal.

This filters out warnings from some libraries, sets the logging level to ERROR for some libraries, disables tokeniser progress bars when using Hugging Face tokenisers, and disables most of the logging from the transformers library.
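A short sketch of calling this before a run, assuming the function is exposed at the package top level as listed above:

```python
import scandeval

# Silence third-party warnings, verbose logging and tokeniser progress bars
scandeval.block_terminal_output()

benchmarker = scandeval.Benchmarker()
```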