
scandeval.benchmark_config_factory

source module scandeval.benchmark_config_factory

Factory functions for creating benchmark configurations.

Functions

source build_benchmark_config(progress_bar: bool, save_results: bool, task: str | list[str] | None, dataset: str | list[str] | None, language: str | list[str], model_language: str | list[str] | None, dataset_language: str | list[str] | None, device: Device | None, batch_size: int, raise_errors: bool, cache_dir: str, api_key: str | None, force: bool, verbose: bool, trust_remote_code: bool, use_flash_attention: bool | None, clear_model_cache: bool, evaluate_test_split: bool, few_shot: bool, num_iterations: int, api_base: str | None, api_version: str | None, debug: bool, run_with_cli: bool, only_allow_safetensors: bool, first_time: bool = False) → BenchmarkConfig

Create a benchmark configuration.

Parameters

  • progress_bar : bool

    Whether to show a progress bar when running the benchmark.

  • save_results : bool

    Whether to save the benchmark results to a file.

  • task : str | list[str] | None

    The tasks to filter the datasets by. If None then the datasets will not be filtered based on their task.

  • dataset : str | list[str] | None

    The datasets to include. If None then all datasets will be included, subject to the task parameter.

  • language : str | list[str]

    The language codes of the languages to include, both for models and datasets. Here 'no' means both Bokmål (nb) and Nynorsk (nn). Set this to 'all' if all languages (also non-Scandinavian) should be considered.

  • model_language : str | list[str] | None

    The language codes of the languages to include for models. If None then the language parameter will be used.

  • dataset_language : str | list[str] | None

    The language codes of the languages to include for datasets. If None then the language parameter will be used.

  • device : Device | None

    The device to use for running the models. If None then the device will be set automatically.

  • batch_size : int

    The batch size to use for running the models.

  • raise_errors : bool

    Whether to raise errors when running the benchmark.

  • cache_dir : str

    The directory to use for caching the models.

  • api_key : str | None

    The API key to use for a given inference server.

  • force : bool

    Whether to force the benchmark to run even if the results are already cached.

  • verbose : bool

    Whether to print verbose output when running the benchmark. This is automatically set if debug is True.

  • trust_remote_code : bool

    Whether to trust remote code when running the benchmark.

  • use_flash_attention : bool | None

    Whether to use Flash Attention for the models. If None then it will be used if it is available.

  • clear_model_cache : bool

    Whether to clear the model cache before running the benchmark.

  • evaluate_test_split : bool

    Whether to use the test split for the datasets.

  • few_shot : bool

    Whether to use few-shot learning for the models.

  • num_iterations : int

    The number of iterations each model should be evaluated for.

  • api_base : str | None

    The base URL for a given inference API. Only relevant if model refers to a model on an inference API.

  • api_version : str | None

    The version of the API to use for a given inference API.

  • debug : bool

    Whether to run the benchmark in debug mode.

  • run_with_cli : bool

    Whether the benchmark is being run with the CLI.

  • only_allow_safetensors : bool

    Whether to only allow evaluations of models stored as safetensors.

  • first_time : bool

    Whether this is the first time the benchmark configuration is being created. Defaults to False.

Returns

  • BenchmarkConfig The benchmark configuration.
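The fallback logic described above (model and dataset languages default to the language parameter, and debug implies verbose) can be sketched as follows. This is a minimal illustration, not the actual scandeval implementation; the names `BenchmarkConfigSketch` and `build_benchmark_config_sketch` are hypothetical, and only a few of the documented parameters are modelled.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkConfigSketch:
    """Hypothetical stand-in for scandeval's BenchmarkConfig."""
    model_languages: list
    dataset_languages: list
    verbose: bool

def build_benchmark_config_sketch(
    language,
    model_language=None,
    dataset_language=None,
    verbose=False,
    debug=False,
):
    """Assemble a config, applying the documented fallbacks."""
    # Normalise a single code to a list of codes.
    langs = [language] if isinstance(language, str) else language
    # If no model/dataset override is given, fall back to `language`.
    model_langs = (
        [model_language] if isinstance(model_language, str) else model_language
    ) or langs
    dataset_langs = (
        [dataset_language] if isinstance(dataset_language, str) else dataset_language
    ) or langs
    return BenchmarkConfigSketch(
        model_languages=model_langs,
        dataset_languages=dataset_langs,
        # Verbose output is automatically enabled in debug mode.
        verbose=verbose or debug,
    )
```

For instance, `build_benchmark_config_sketch("da", debug=True)` yields a config whose model and dataset languages are both `["da"]` and whose `verbose` flag is on, even though `verbose` was not passed explicitly.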

source get_correct_language_codes(language_codes: str | list[str]) → list[str]

Get correct language code(s).

Parameters

  • language_codes : str | list[str]

    The language codes of the languages to include, both for models and datasets. Here 'no' means both Bokmål (nb) and Nynorsk (nn). Set this to 'all' if all languages (also non-Scandinavian) should be considered.

Returns

  • list[str] The correct language codes.
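The documented semantics suggest a normalisation step in which 'no' expands to both written forms of Norwegian. The sketch below is an assumption-based illustration of that behaviour, not scandeval's actual implementation; the function name is hypothetical.

```python
def get_correct_language_codes_sketch(language_codes):
    """Normalise a language-code argument to a flat list of codes."""
    # Accept a single string or a list of strings.
    codes = (
        [language_codes] if isinstance(language_codes, str) else list(language_codes)
    )
    result = []
    for code in codes:
        if code == "no":
            # 'no' covers both Bokmål and Nynorsk.
            result.extend(["nb", "nn"])
        else:
            result.append(code)
    return result
```

So `get_correct_language_codes_sketch(["da", "no"])` would produce `["da", "nb", "nn"]`, while other codes (including the 'all' sentinel) pass through unchanged.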

source prepare_languages(language_codes: str | list[str] | None, default_language_codes: list[str]) → list[Language]

Prepare language(s) for benchmarking.

Parameters

  • language_codes : str | list[str] | None

    The language codes of the languages to include for models or datasets. If specified then this overrides the language parameter for model or dataset languages.

  • default_language_codes : list[str]

    The default language codes of the languages to include.

Returns

  • list[Language] The prepared model or dataset languages.
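The override-or-default behaviour can be sketched as below. This is a simplified illustration under the assumption that a Language record can be represented as a plain dict; the real return type is scandeval's Language, and the function name here is hypothetical.

```python
def prepare_languages_sketch(language_codes, default_language_codes):
    """Resolve the codes to use, then map each to a language record."""
    if language_codes is None:
        # No override given: fall back to the default codes.
        codes = default_language_codes
    elif isinstance(language_codes, str):
        codes = [language_codes]
    else:
        codes = list(language_codes)
    # Stand-in for looking up the actual Language objects.
    return [{"code": code} for code in codes]
```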

source prepare_tasks_and_datasets(task: str | list[str] | None, dataset_languages: list[Language], dataset: str | list[str] | None) → tuple[list[Task], list[str]]

Prepare task(s) and dataset(s) for benchmarking.

Parameters

  • task : str | list[str] | None

    The tasks to filter the datasets by. If None then the datasets will not be filtered based on their task.

  • dataset_languages : list[Language]

    The languages of the datasets in the benchmark.

  • dataset : str | list[str] | None

    The datasets to include. If None then all datasets will be included, subject to the task and dataset_languages parameters.

Returns

  • tuple[list[Task], list[str]] The prepared tasks and datasets.

Raises

  • InvalidBenchmark

    If the task or dataset is not found in the benchmark tasks or datasets.
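The filtering described above can be sketched against a toy registry. The registry contents, the function name, and the use of ValueError (standing in for InvalidBenchmark) are all illustrative assumptions, not scandeval's actual data or behaviour.

```python
# Toy registry: dataset name -> (task, language code). Purely illustrative.
REGISTRY = {
    "angry-tweets": ("sentiment-classification", "da"),
    "scala-da": ("linguistic-acceptability", "da"),
    "norec": ("sentiment-classification", "nb"),
}

def prepare_tasks_and_datasets_sketch(task, dataset_languages, dataset):
    """Filter the registry by task, language, and explicit dataset names."""
    tasks = [task] if isinstance(task, str) else task
    datasets = [dataset] if isinstance(dataset, str) else dataset

    # Unknown dataset names are an error (InvalidBenchmark in scandeval).
    if datasets is not None:
        unknown = [name for name in datasets if name not in REGISTRY]
        if unknown:
            raise ValueError(f"Datasets not found: {unknown}")

    chosen = []
    for name, (ds_task, ds_lang) in REGISTRY.items():
        if tasks is not None and ds_task not in tasks:
            continue
        if ds_lang not in dataset_languages:
            continue
        if datasets is not None and name not in datasets:
            continue
        chosen.append(name)

    # The tasks actually covered by the chosen datasets.
    chosen_tasks = sorted({REGISTRY[name][0] for name in chosen})
    return chosen_tasks, chosen
```

With this registry, asking for sentiment classification in Danish selects only `angry-tweets`, while passing None for both task and dataset keeps every dataset in the requested languages.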

source prepare_device(device: Device | None) → torch.device

Prepare device for benchmarking.

Parameters

  • device : Device | None

    The device to use for running the models. If None then the device will be set automatically.

Returns

  • torch.device The prepared device.
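Automatic device selection typically means an explicit choice wins, with a fallback chain across available accelerators. The sketch below models that chain with plain strings and injected availability flags so it stays self-contained; the real function returns a torch.device and queries PyTorch directly, and this fallback order is an assumption.

```python
def prepare_device_sketch(device, cuda_available=False, mps_available=False):
    """Pick a device string: explicit choice, else best available backend."""
    if device is not None:
        # An explicitly requested device is used as-is.
        return device
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    # CPU is always available as the last resort.
    return "cpu"
```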