scandeval.generation

Functions for evaluating models through text generation.

Functions

generate — Evaluate a model on a dataset through generation.
generate_single_iteration — Evaluate a model on a dataset in a single iteration through generation.
debug_log — Log inputs and outputs for debugging purposes.

generate(model: BenchmarkModule, datasets: list[DatasetDict], model_config: ModelConfig, dataset_config: DatasetConfig, benchmark_config: BenchmarkConfig) → list[dict[str, float]]

Evaluate a model on a dataset through generation.

Parameters

Returns

list[dict[str, float]] — A list of dictionaries containing the scores for each metric.

generate_single_iteration(dataset: Dataset, model: BenchmarkModule, dataset_config: DatasetConfig, benchmark_config: BenchmarkConfig, cache: ModelCache) → dict[str, float]

Evaluate a model on a dataset in a single iteration through generation.

Parameters

Returns

dict[str, float] — A dictionary containing the score for each metric.
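
The two return types suggest how the functions relate: generate_single_iteration yields one dictionary of metric scores, and generate yields a list of such dictionaries. The sketch below shows one way a caller might average such a list per metric. It assumes that each list entry holds the scores of one iteration, which the signatures imply but this page does not state. The helper mean_scores and the metric names in the sample data are hypothetical and not part of scandeval.

```python
from collections import defaultdict
from statistics import mean


def mean_scores(iteration_scores: list[dict[str, float]]) -> dict[str, float]:
    """Average each metric over a list of per-iteration score dictionaries.

    Hypothetical helper for illustration; not part of scandeval.
    """
    collected: dict[str, list[float]] = defaultdict(list)
    for scores in iteration_scores:
        for metric_name, value in scores.items():
            collected[metric_name].append(value)
    return {name: mean(values) for name, values in collected.items()}


# Made-up scores in the shape that generate is documented to return.
example_scores = [
    {"metric_a": 0.5, "metric_b": 0.25},
    {"metric_a": 1.0, "metric_b": 0.75},
]
averaged = mean_scores(example_scores)
```

Here averaged maps each metric name to its mean over the two example entries. A real run would use whatever metric names the dataset configuration defines.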

Raises

debug_log(batch: dict[str, t.Any], extracted_labels: list[dict | str | list[str]], dataset_config: DatasetConfig) → None

Log inputs and outputs for debugging purposes.

Parameters

batch : dict[str, t.Any] — The batch of examples to evaluate on.
extracted_labels : list[dict | str | list[str]] — The extracted labels from the model output.
dataset_config : DatasetConfig — The configuration of the dataset.
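
As a rough illustration of what logging a batch next to its extracted labels could look like, the sketch below pairs each input with its label and emits one debug line per example. It is not the scandeval implementation. The function names, the log format, and the "text" column name are all assumptions, and the dataset_config parameter is left out because its fields are not described on this page.

```python
import logging
from typing import Any

logger = logging.getLogger("generation_debug_sketch")


def format_debug_lines(
    batch: dict[str, Any],
    extracted_labels: list[Any],
    input_key: str = "text",  # assumed column name, not taken from scandeval
) -> list[str]:
    """Pair each input in the batch with its extracted label as a log line.

    Hypothetical helper for illustration; not part of scandeval.
    """
    inputs = batch.get(input_key, [])
    return [
        f"Input: {model_input!r} | Extracted label: {label!r}"
        for model_input, label in zip(inputs, extracted_labels)
    ]


def debug_log_sketch(
    batch: dict[str, Any],
    extracted_labels: list[Any],
    input_key: str = "text",
) -> None:
    """Emit one debug-level log record per example in the batch."""
    for line in format_debug_lines(batch, extracted_labels, input_key):
        logger.debug(line)


# Made-up batch and labels; labels may be strings, lists, or dicts.
sample_batch = {"text": ["first example", "second example"]}
sample_labels = ["positive", ["B-PER", "O"]]
lines = format_debug_lines(sample_batch, sample_labels)
debug_log_sketch(sample_batch, sample_labels)
```

Keeping the formatting in a separate function from the logging call makes the output easy to inspect without configuring a log handler.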

Raises