scandeval.data_models

source module scandeval.data_models

Data models used in ScandEval.

Classes

source dataclass MetricConfig(name: str, pretty_name: str, huggingface_id: str, results_key: str, compute_kwargs: dict[str, t.Any] = field(default_factory=dict), postprocessing_fn: c.Callable[[float], tuple[float, str]] = field(default_factory=lambda: lambda raw_score: (100 * raw_score, f'{raw_score:.2%}')))

Configuration for a metric.

Attributes

  • name : str

    The name of the metric.

  • pretty_name : str

    A longer, prettier name for the metric, which may contain capitalised characters and spaces. Used for logging.

  • huggingface_id : str

    The Hugging Face ID of the metric.

  • results_key : str

    The name of the key used to extract the metric scores from the results dictionary.

  • compute_kwargs : dict[str, t.Any]

    Keyword arguments to pass to the metric's compute function. Defaults to an empty dictionary.

  • postprocessing_fn : c.Callable[[float], tuple[float, str]]

    A function applied to the metric scores after they are computed, mapping the raw score to the postprocessed score along with its string representation. Defaults to x -> (100 * x, f"{x:.2%}").
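
Example

A rough sketch of configuring a metric by hand; the Hugging Face ID, results key and compute kwargs below are illustrative assumptions, not values shipped with ScandEval.

    from scandeval.data_models import MetricConfig

    # Hypothetical metric configuration; the Hugging Face ID, results key and
    # compute kwargs are illustrative values.
    f1_metric = MetricConfig(
        name="micro_f1",
        pretty_name="Micro-average F1-score",
        huggingface_id="seqeval",
        results_key="overall_f1",
        compute_kwargs={"zero_division": 0},
    )

    # The default postprocessing function scales the raw score to a percentage
    # and renders it as a string.
    score, score_str = f1_metric.postprocessing_fn(0.875)
    print(score, score_str)  # 87.5 87.50%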

source dataclass Task(name: str, task_group: TaskGroup, metrics: list[MetricConfig])

A dataset task.

Attributes

  • name : str

    The name of the task.

  • task_group : TaskGroup

    The task group of the task.

  • metrics : list[MetricConfig]

    The metrics used to evaluate the task.
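
Example

A sketch of assembling a task from a task group and a metric; the TaskGroup import path and member name are assumptions and may differ from the task definitions that ship with ScandEval.

    from scandeval.data_models import MetricConfig, Task
    from scandeval.enums import TaskGroup  # import path assumed

    # Hypothetical task; the TaskGroup member and metric values are illustrative.
    ner_task = Task(
        name="named-entity-recognition",
        task_group=TaskGroup.TOKEN_CLASSIFICATION,  # assumed enum member
        metrics=[
            MetricConfig(
                name="micro_f1",
                pretty_name="Micro-average F1-score",
                huggingface_id="seqeval",
                results_key="overall_f1",
            )
        ],
    )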

source dataclass Language(code: str, name: str)

A benchmarkable language.

Attributes

  • code : str

    The ISO 639-1 language code of the language.

  • name : str

    The name of the language.
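
Example

A language is a plain value object pairing an ISO 639-1 code with a display name:

    from scandeval.data_models import Language

    # Danish, identified by its ISO 639-1 code.
    danish = Language(code="da", name="Danish")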

source dataclass BenchmarkConfig(model_languages: list[Language], dataset_languages: list[Language], tasks: list[Task], datasets: list[str], batch_size: int, raise_errors: bool, cache_dir: str, api_key: str | None, force: bool, progress_bar: bool, save_results: bool, device: torch.device, verbose: bool, trust_remote_code: bool, use_flash_attention: bool | None, clear_model_cache: bool, evaluate_test_split: bool, few_shot: bool, num_iterations: int, api_base: str | None, api_version: str | None, debug: bool, run_with_cli: bool, only_allow_safetensors: bool)

General benchmarking configuration, across datasets and models.

Attributes

  • model_languages : list[Language]

    The languages of the models to benchmark.

  • dataset_languages : list[Language]

    The languages of the datasets in the benchmark.

  • tasks : list[Task]

    The tasks to benchmark the model(s) on.

  • datasets : list[str]

    The datasets to benchmark on.

  • batch_size : int

    The batch size to use.

  • raise_errors : bool

    Whether to raise errors instead of skipping them.

  • cache_dir : str

    Directory to store cached models and datasets.

  • api_key : str | None

    The API key to use for a given inference API.

  • force : bool

    Whether to force the benchmark to run even if the results are already cached.

  • progress_bar : bool

    Whether to show a progress bar.

  • save_results : bool

    Whether to save the benchmark results to 'scandeval_benchmark_results.json'.

  • device : torch.device

    The device to use for benchmarking.

  • verbose : bool

    Whether to print verbose output.

  • trust_remote_code : bool

    Whether to trust remote code when loading models from the Hugging Face Hub.

  • use_flash_attention : bool | None

    Whether to use Flash Attention. If None, then Flash Attention will be used for generative models.

  • clear_model_cache : bool

    Whether to clear the model cache after benchmarking each model.

  • evaluate_test_split : bool

    Whether to evaluate on the test split.

  • few_shot : bool

    Whether to only evaluate the model using few-shot evaluation. Only relevant if the model is generative.

  • num_iterations : int

    The number of iterations each model should be evaluated for.

  • api_base : str | None

    The base URL for a given inference API. Only relevant if the model is accessed through an inference API.

  • api_version : str | None

    The version of the API to use. Only relevant if the model is accessed through an inference API.

  • debug : bool

    Whether to run the benchmark in debug mode.

  • run_with_cli : bool

    Whether the benchmark is being run with the CLI.

  • only_allow_safetensors : bool

    Whether to only allow models that use the safetensors format.
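
Example

In normal use ScandEval builds this object for you from CLI flags or benchmarker arguments; the hand-rolled sketch below only shows how the fields fit together, and every value in it, including the dataset name, is an illustrative assumption.

    import torch

    from scandeval.data_models import BenchmarkConfig, Language

    danish = Language(code="da", name="Danish")

    # Illustrative configuration; the values below are assumptions, not defaults.
    config = BenchmarkConfig(
        model_languages=[danish],
        dataset_languages=[danish],
        tasks=[],  # optionally restrict to specific Task objects
        datasets=["scala-da"],  # assumed dataset name
        batch_size=32,
        raise_errors=False,
        cache_dir=".scandeval_cache",
        api_key=None,
        force=False,
        progress_bar=True,
        save_results=True,
        device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
        verbose=False,
        trust_remote_code=False,
        use_flash_attention=None,
        clear_model_cache=False,
        evaluate_test_split=False,
        few_shot=True,
        num_iterations=10,
        api_base=None,
        api_version=None,
        debug=False,
        run_with_cli=False,
        only_allow_safetensors=False,
    )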

source class BenchmarkConfigParams()

Bases : pydantic.BaseModel

The parameters for the benchmark configuration.

Attributes

  • model_extra : dict[str, Any] | None Get extra fields set during validation.

  • model_fields_set : set[str] Returns the set of fields that have been explicitly set on this model instance.

source class BenchmarkResult()

Bases : pydantic.BaseModel

A benchmark result.

Attributes

  • model_config : ClassVar[ConfigDict] Configuration for the model, should be a dictionary conforming to pydantic's ConfigDict.

  • model_extra : dict[str, Any] | None Get extra fields set during validation.

  • model_fields_set : set[str] Returns the set of fields that have been explicitly set on this model instance.

Methods

  • from_dict Create a benchmark result from a dictionary.

  • append_to_results Append the benchmark result to the results file.

source classmethod BenchmarkResult.from_dict(config: dict) → BenchmarkResult

Create a benchmark result from a dictionary.

Parameters

  • config : dict

    The configuration dictionary.

Returns

  • BenchmarkResult

    The benchmark result.

source method BenchmarkResult.append_to_results(results_path: pathlib.Path) → None

Append the benchmark result to the results file.

Parameters

  • results_path : pathlib.Path

    The path to the results file.
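
Example

A minimal sketch of the intended flow: parse a raw results dictionary with from_dict and persist it with append_to_results. The keys expected in the dictionary depend on the ScandEval version and are not spelled out here.

    from pathlib import Path

    from scandeval.data_models import BenchmarkResult

    def store_result(raw: dict, results_path: Path) -> BenchmarkResult:
        """Parse a raw results dictionary and append it to the results file."""
        result = BenchmarkResult.from_dict(raw)
        result.append_to_results(results_path=results_path)
        return result

    # Usage (with a suitable `raw_record` dictionary):
    # store_result(raw_record, Path("scandeval_benchmark_results.json"))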

source dataclass DatasetConfig(name: str, pretty_name: str, huggingface_id: str, task: Task, languages: list[Language], prompt_template: str, max_generated_tokens: int, prompt_prefix: str, num_few_shot_examples: int, instruction_prompt: str, labels: list[str] = field(default_factory=list), prompt_label_mapping: dict[str, str] = field(default_factory=dict), unofficial: bool = False)

Configuration for a dataset.

Attributes

  • name : str

    The name of the dataset. Must be lower case with no spaces.

  • pretty_name : str

    A longer, prettier name for the dataset, which may contain capitalised characters and spaces. Used for logging.

  • huggingface_id : str

    The Hugging Face ID of the dataset.

  • task : Task

    The task of the dataset.

  • languages : list[Language]

    The languages of the entries in the dataset.

  • id2label : dict[int, str]

    The mapping from ID to label.

  • label2id : dict[str, int]

    The mapping from label to ID.

  • num_labels : int

    The number of labels in the dataset.

  • prompt_template : str

    The template for the prompt to use when benchmarking the dataset using few-shot evaluation.

  • max_generated_tokens : int

    The maximum number of tokens to generate when benchmarking the dataset using few-shot evaluation.

  • prompt_prefix : str

    The prefix to use in the few-shot prompt.

  • num_few_shot_examples : int

    The number of examples to use when benchmarking the dataset using few-shot evaluation. For a classification task, these will be drawn evenly from each label.

  • instruction_prompt : str

    The prompt to use when benchmarking the dataset using instruction-based evaluation.

  • labels : list[str], optional

    The labels in the dataset. Defaults to an empty list.

  • prompt_label_mapping : dict[str, str], optional

    A mapping from the labels to another phrase which is used as a substitute for the label in few-shot evaluation. Defaults to an empty dictionary.

  • unofficial : bool, optional

    Whether the dataset is unofficial. Defaults to False.

source property DatasetConfig.id2label: dict[int, str]

The mapping from ID to label.

source property DatasetConfig.label2id: dict[str, int]

The mapping from label to ID.

source property DatasetConfig.num_labels: int

The number of labels in the dataset.
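
Example

A sketch of a hypothetical sentiment dataset configuration together with the properties derived from it. The Hugging Face ID, prompts, labels and task definition are illustrative (not an official ScandEval dataset), the TaskGroup import path and member are assumptions, and the property values in the comments assume that label IDs simply enumerate the labels in order.

    from scandeval.data_models import DatasetConfig, Language, MetricConfig, Task
    from scandeval.enums import TaskGroup  # import path assumed

    sentiment_task = Task(
        name="sentiment-classification",
        task_group=TaskGroup.SEQUENCE_CLASSIFICATION,  # assumed enum member
        metrics=[
            MetricConfig(
                name="mcc",
                pretty_name="Matthews Correlation Coefficient",
                huggingface_id="matthews_correlation",
                results_key="matthews_correlation",
            )
        ],
    )

    dataset = DatasetConfig(
        name="my-sent-da",
        pretty_name="My Danish Sentiment Dataset",
        huggingface_id="my-org/my-sent-da",
        task=sentiment_task,
        languages=[Language(code="da", name="Danish")],
        prompt_template="Tekst: {text}\nSentiment: {label}",
        max_generated_tokens=5,
        prompt_prefix="Følgende er tekster og deres sentiment.",
        num_few_shot_examples=12,
        instruction_prompt="Tekst: {text}\n\nKlassificér sentimentet i teksten.",
        labels=["negative", "neutral", "positive"],
        prompt_label_mapping={
            "negative": "negativ", "neutral": "neutral", "positive": "positiv"
        },
    )

    # Derived properties, assuming label IDs enumerate the labels in order:
    dataset.num_labels  # 3
    dataset.id2label    # {0: "negative", 1: "neutral", 2: "positive"}
    dataset.label2id    # {"negative": 0, "neutral": 1, "positive": 2}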

source dataclass ModelConfig(model_id: str, revision: str, task: str, languages: list[Language], inference_backend: InferenceBackend, merge: bool, model_type: ModelType, fresh: bool, model_cache_dir: str, adapter_base_model_id: str | None)

Configuration for a model.

Attributes

  • model_id : str

    The ID of the model.

  • revision : str

    The revision of the model.

  • task : str

    The task that the model was trained on.

  • languages : list[Language]

    The languages of the model.

  • inference_backend : InferenceBackend

    The backend used to perform inference with the model.

  • merge : bool

    Whether the model is a merged model.

  • model_type : ModelType

    The type of the model (e.g., encoder, base decoder, instruction tuned).

  • fresh : bool

    Whether the model is freshly initialised.

  • model_cache_dir : str

    The directory to cache the model in.

  • adapter_base_model_id : str | None

    The model ID of the base model if the model is an adapter model. Can be None if the model is not an adapter model.
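
Example

A sketch of a model configuration for a hypothetical Danish encoder model; the InferenceBackend and ModelType import path and members are assumptions and may differ from the values ScandEval itself produces.

    from scandeval.data_models import Language, ModelConfig
    from scandeval.enums import InferenceBackend, ModelType  # import path assumed

    # Illustrative configuration; the model ID and enum members are assumptions.
    model_config = ModelConfig(
        model_id="my-org/my-danish-encoder",
        revision="main",
        task="fill-mask",
        languages=[Language(code="da", name="Danish")],
        inference_backend=InferenceBackend.TRANSFORMERS,  # assumed enum member
        merge=False,
        model_type=ModelType.ENCODER,  # assumed enum member
        fresh=False,
        model_cache_dir=".scandeval_cache/model_cache",
        adapter_base_model_id=None,
    )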

source dataclass PreparedModelInputs(texts: list[str] | None = None, input_ids: torch.Tensor | None = None, attention_mask: torch.Tensor | None = None)

The inputs to a model.

Attributes

  • texts : list[str] | None

    The texts to input to the model. Can be None if the input IDs and attention mask are provided instead.

  • input_ids : torch.Tensor | None

    The input IDs of the texts. Can be None if the texts are provided instead.

  • attention_mask : torch.Tensor | None

    The attention mask of the texts. Can be None if the texts are provided instead.
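
Example

The dataclass accepts either raw texts or already-tokenised tensors; a small sketch of both forms (the token IDs are made up):

    import torch

    from scandeval.data_models import PreparedModelInputs

    # Raw texts, to be tokenised downstream.
    text_inputs = PreparedModelInputs(texts=["Hej verden!", "Hello world!"])

    # Pre-tokenised tensors; `texts` can stay None in this case.
    tensor_inputs = PreparedModelInputs(
        input_ids=torch.tensor([[101, 2023, 102]]),
        attention_mask=torch.tensor([[1, 1, 1]]),
    )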

source dataclass GenerativeModelOutput(sequences: list[str], scores: list[list[list[tuple[str, float]]]] | None = None)

The output of a generative model.

Attributes

  • sequences : list[str]

    The generated sequences.

  • scores : list[list[list[tuple[str, float]]]] | None

    The scores of the sequences. This is an array of shape (batch_size, num_tokens, num_logprobs, 2), where the last dimension contains the token and its logprob. Can be None if the scores are not available.

source dataclass SingleGenerativeModelOutput(sequence: str, scores: list[list[tuple[str, float]]] | None = None)

A single output of a generative model.

Attributes

  • sequence : str

    The generated sequence.

  • scores : list[list[tuple[str, float]]] | None

    The scores of the sequence. This is an array of shape (num_tokens, num_logprobs, 2), where the last dimension contains the token and its logprob. Can be None if the scores are not available.
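
Example

A sketch of the nesting described above, with made-up tokens and log-probabilities: each token position carries a list of (token, logprob) pairs, and the batched variant adds a leading batch dimension.

    from scandeval.data_models import GenerativeModelOutput, SingleGenerativeModelOutput

    # One generated sequence with top-2 candidates per token position.
    single = SingleGenerativeModelOutput(
        sequence="positiv",
        scores=[  # shape: (num_tokens, num_logprobs, 2)
            [("positiv", -0.11), ("negativ", -2.31)],
        ],
    )

    # The batched variant: shape (batch_size, num_tokens, num_logprobs, 2).
    batch = GenerativeModelOutput(
        sequences=[single.sequence],
        scores=[single.scores],
    )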

source dataclass HFModelInfo(pipeline_tag: str, tags: list[str], adapter_base_model_id: str | None)

Information about a Hugging Face model.

Attributes

  • pipeline_tag : str

    The pipeline tag of the model.

  • tags : list[str]

    The other tags of the model.

  • adapter_base_model_id : str | None

    The model ID of the base model if the model is an adapter model. Can be None if the model is not an adapter model.
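
Example

A small illustrative instance for an adapter model; the pipeline tag, tags and model IDs are made up.

    from scandeval.data_models import HFModelInfo

    # Metadata for a hypothetical adapter model and its base model.
    info = HFModelInfo(
        pipeline_tag="text-generation",
        tags=["pytorch", "da"],
        adapter_base_model_id="my-org/my-base-model",
    )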