scandeval.scores

Aggregation of raw scores into the mean and a confidence interval.

Functions

log_scores — Log the scores.
aggregate_scores — Helper function to compute the mean with confidence intervals.

log_scores(dataset_name: str, metric_configs: list[MetricConfig], scores: list[dict[str, float]], model_id: str) → ScoreDict

Log the scores.

Parameters

dataset_name : str — Name of the dataset.
metric_configs : list[MetricConfig] — List of metrics to log.
scores : list[dict[str, float]] — The scores to be logged, as a list of dictionaries mapping score names to values.
model_id : str — The full Hugging Face Hub path to the pretrained transformer model.

Returns

ScoreDict — A dictionary with keys 'raw_scores' and 'total', with 'raw_scores' being identical to scores and 'total' being a dictionary with the aggregated scores (means and standard errors).
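The layout of the return value can be illustrated with a minimal sketch. It does not call ScandEval itself: the helper `build_score_dict`, the metric name `f1`, and the key names inside 'total' (`test_f1`, `test_f1_se`) are assumptions made for illustration, and the real ScoreDict may use different keys.

```python
import math
import statistics


def build_score_dict(scores: list[dict[str, float]]) -> dict:
    """Hypothetical helper mimicking the documented {'raw_scores', 'total'} layout."""
    total: dict[str, float] = {}
    # Assumption: every dictionary in `scores` has the same keys.
    for key in scores[0]:
        values = [score[key] for score in scores]
        total[key] = statistics.mean(values)
        # Standard error of the mean; undefined for a single sample.
        if len(values) > 1:
            total[f"{key}_se"] = statistics.stdev(values) / math.sqrt(len(values))
        else:
            total[f"{key}_se"] = math.nan
    return {"raw_scores": scores, "total": total}


raw = [{"test_f1": 0.5}, {"test_f1": 0.7}, {"test_f1": 0.6}]
result = build_score_dict(raw)
print(result["total"])
```

Here 'raw_scores' is the input list unchanged, while 'total' holds one aggregated entry per score name.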

aggregate_scores(scores: list[dict[str, float]], metric_config: MetricConfig) → tuple[float, float]

Helper function to compute the mean with confidence intervals.

Parameters

scores : list[dict[str, float]] — List of dictionaries, each with the names of the metrics as keys, of the form "<split>_<metric>", such as "val_f1", and the metric values as values.
metric_config : MetricConfig — The configuration of the metric, which is used to collect the correct metric from scores.

Returns

tuple[float, float] — A pair of floats, containing the score and the radius of its 95% confidence interval.
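One common way to obtain such a pair is the normal approximation, where the radius is 1.96 times the standard error of the mean. The sketch below assumes that approach; it is not taken from the ScandEval source, which may compute the interval differently. The function `mean_and_ci_radius` is hypothetical and takes a plain metric name and split instead of a MetricConfig.

```python
import math
import statistics


def mean_and_ci_radius(
    scores: list[dict[str, float]], metric_name: str, split: str = "test"
) -> tuple[float, float]:
    """Hypothetical sketch: mean and 95% CI radius under a normal approximation."""
    # Keys are assumed to follow the "<split>_<metric>" pattern, e.g. "val_f1".
    key = f"{split}_{metric_name}"
    values = [score[key] for score in scores]
    mean = statistics.mean(values)
    # With fewer than two samples the sample standard deviation is undefined.
    if len(values) < 2:
        return mean, math.nan
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return mean, 1.96 * standard_error


runs = [{"val_f1": 0.5}, {"val_f1": 0.7}, {"val_f1": 0.6}]
mean, radius = mean_and_ci_radius(runs, "f1", split="val")
print(f"{mean:.3f} ± {radius:.3f}")
```

With only a handful of runs, a Student's t quantile would give a wider and arguably more appropriate interval than the fixed 1.96 factor used here.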