scandeval.task_utils.question_answering
source module scandeval.task_utils.question_answering
Utility functions related to the question-answering task group.
Classes
- QuestionAnsweringTrainer — Trainer subclass for question answering tasks.
Functions
- compute_metrics — Compute the metrics needed for evaluation.
- extract_labels_from_generation — Extract the predicted labels from the generated output.
- prepare_train_examples — Prepare the features for training.
- prepare_test_examples — Prepare test examples.
- postprocess_predictions_and_labels — Postprocess the predictions and labels, to allow easier metric computation.
- find_best_answer — Find the best answer for a given example.
- find_valid_answers — Find the valid answers from the start and end indexes.
source class QuestionAnsweringTrainer(**kwargs)
Bases : Trainer
Trainer subclass for question answering tasks.
Initialize the trainer.
Methods
- evaluate — Evaluate the model on the given dataset.
source method QuestionAnsweringTrainer.evaluate(eval_dataset: Dataset | None = None, orig_eval_dataset: Dataset | None = None, ignore_keys: list[str] | None = None, metric_key_prefix: str = 'eval') → dict[str, float] | None
Evaluate the model on the given dataset.
Parameters
- eval_dataset : Dataset | None — The dataset to evaluate on. If None, then use the stored evaluation dataset.
- orig_eval_dataset : Dataset | None — The original evaluation dataset, before any postprocessing. If None, then use the stored original evaluation dataset.
- ignore_keys : list[str] | None — The keys to ignore when computing the metrics.
- metric_key_prefix : str — The prefix to use for the metric keys.
Returns
- dict[str, float] | None — The metrics computed on the evaluation dataset.
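Overrides of this kind typically follow the standard Hugging Face pattern for extractive question answering: run the bare evaluation loop, postprocess the raw start/end logits against the original examples, and only then compute metrics. The following is a minimal sketch of that pattern, not ScandEval's actual implementation; `postprocess_fn` is a hypothetical callback standing in for a step such as postprocess_predictions_and_labels below.

```python
# Minimal sketch of the evaluate-with-postprocessing pattern
# (not the ScandEval implementation).
from transformers import Trainer


class SketchQATrainer(Trainer):
    def __init__(self, *args, orig_eval_dataset=None, postprocess_fn=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.orig_eval_dataset = orig_eval_dataset
        self.postprocess_fn = postprocess_fn  # hypothetical callback

    def evaluate(self, eval_dataset=None, orig_eval_dataset=None,
                 ignore_keys=None, metric_key_prefix="eval"):
        eval_dataset = eval_dataset if eval_dataset is not None else self.eval_dataset
        orig_eval_dataset = (
            orig_eval_dataset if orig_eval_dataset is not None else self.orig_eval_dataset
        )
        dataloader = self.get_eval_dataloader(eval_dataset)

        # Run the bare evaluation loop first; metrics need postprocessed
        # predictions, so disable the metric callback temporarily.
        compute_metrics, self.compute_metrics = self.compute_metrics, None
        try:
            output = self.evaluation_loop(
                dataloader,
                description="Evaluation",
                ignore_keys=ignore_keys,
                metric_key_prefix=metric_key_prefix,
            )
        finally:
            self.compute_metrics = compute_metrics

        if compute_metrics is None or self.postprocess_fn is None or orig_eval_dataset is None:
            return None

        # Map the raw (start_logits, end_logits) back onto the original
        # examples, then compute metrics on the postprocessed pairs.
        predictions, labels = self.postprocess_fn(
            output.predictions, orig_eval_dataset, eval_dataset
        )
        metrics = compute_metrics((predictions, labels))
        metrics = {
            k if k.startswith(metric_key_prefix) else f"{metric_key_prefix}_{k}": v
            for k, v in metrics.items()
        }
        self.log(metrics)
        return metrics
```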
source compute_metrics(model_outputs_and_labels: tuple[Predictions, Labels], dataset_config: DatasetConfig, benchmark_config: BenchmarkConfig) → dict[str, float]
Compute the metrics needed for evaluation.
Parameters
- model_outputs_and_labels : tuple[Predictions, Labels] — The first sequence contains the model outputs and the second sequence contains the true labels.
- dataset_config : DatasetConfig — The configuration of the dataset.
- benchmark_config : BenchmarkConfig — The configuration of the benchmark.
Returns
- dict[str, float] — A dictionary with the names of the metrics as keys and the metric values as values.
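Because a Trainer-style metric callback only receives the predictions-and-labels pair, the two config arguments are usually bound in advance. A minimal sketch of one way to do this, assuming the DatasetConfig and BenchmarkConfig objects are already available from ScandEval's own setup code (their construction is not shown here):

```python
# Sketch: bind the config arguments so the callable matches the
# single-argument shape a Trainer-style metric callback expects.
from functools import partial

from scandeval.task_utils.question_answering import compute_metrics


def make_metric_fn(dataset_config, benchmark_config):
    """Return a callable taking only the (model outputs, labels) pair."""
    return partial(
        compute_metrics,
        dataset_config=dataset_config,
        benchmark_config=benchmark_config,
    )
```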
source extract_labels_from_generation(input_batch: dict[str, list], model_output: GenerativeModelOutput) → list[t.Any]
Extract the predicted labels from the generated output.
Parameters
- input_batch : dict[str, list] — The input batch, where the keys are the feature names and the values are lists with the feature values.
- model_output : GenerativeModelOutput — The raw generated output of the model.
Returns
- list[t.Any] — The predicted labels.
source prepare_train_examples(examples: BatchEncoding, tokenizer: PreTrainedTokenizer) → BatchEncoding
Prepare the features for training.
Parameters
- examples : BatchEncoding — The examples to prepare.
- tokenizer : PreTrainedTokenizer — The tokenizer to use to prepare the examples.
Returns
- BatchEncoding — The prepared examples.
source prepare_test_examples(examples: BatchEncoding, tokenizer: PreTrainedTokenizer) → BatchEncoding
Prepare test examples.
Parameters
- examples : BatchEncoding — Dictionary of test examples.
- tokenizer : PreTrainedTokenizer — The tokenizer used to preprocess the examples.
Returns
- BatchEncoding — The prepared test examples.
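Both preparation functions lend themselves to the usual batched Dataset.map pattern. The sketch below shows that wiring for prepare_train_examples and prepare_test_examples together; ScandEval's benchmark loop handles this internally, and the tiny dataset with SQuAD-style columns (question, context, answers) is only an assumption for illustration.

```python
# Sketch of the usual batched-map wiring for the two preparation functions.
# The dataset below and its SQuAD-style column names are assumptions.
from functools import partial

from datasets import Dataset
from transformers import AutoTokenizer

from scandeval.task_utils.question_answering import (
    prepare_test_examples,
    prepare_train_examples,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

raw = Dataset.from_dict(
    {
        "id": ["0"],
        "question": ["Hvad hedder Danmarks hovedstad?"],
        "context": ["Danmarks hovedstad er København."],
        "answers": [{"text": ["København"], "answer_start": [22]}],
    }
)

prepared_train = raw.map(
    partial(prepare_train_examples, tokenizer=tokenizer),
    batched=True,
    remove_columns=raw.column_names,
)
prepared_test = raw.map(
    partial(prepare_test_examples, tokenizer=tokenizer),
    batched=True,
    remove_columns=raw.column_names,
)
```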
source postprocess_predictions_and_labels(predictions: list, dataset: Dataset, prepared_dataset: Dataset, cls_token_index: int) → tuple[list[dict], list[dict]]
Postprocess the predictions and labels, to allow easier metric computation.
Parameters
- predictions : list — A pair of (start_logits, end_logits) predictions.
- dataset : Dataset — The dataset containing the examples.
- prepared_dataset : Dataset — The dataset containing the prepared examples.
- cls_token_index : int — The index of the CLS token.
Returns
- tuple[list[dict], list[dict]] — The postprocessed predictions and labels.
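A sketch of how raw trainer predictions might be fed through this step; the fixed cls_token_index of 0 is an assumption that holds for BERT-style tokenizers, and the helper name is hypothetical.

```python
# Sketch of the glue between a finished prediction run and metric computation.
from datasets import Dataset
from transformers import Trainer

from scandeval.task_utils.question_answering import postprocess_predictions_and_labels


def postprocessed_outputs(
    trainer: Trainer, dataset: Dataset, prepared_dataset: Dataset
) -> tuple[list[dict], list[dict]]:
    """Run prediction and map the raw logits back onto the original examples."""
    output = trainer.predict(prepared_dataset)
    # For extractive QA models, `output.predictions` holds the pair of
    # (start_logits, end_logits) arrays that the postprocessing expects.
    return postprocess_predictions_and_labels(
        predictions=output.predictions,
        dataset=dataset,
        prepared_dataset=prepared_dataset,
        cls_token_index=0,  # assumption: CLS sits at index 0, as for BERT-style tokenizers
    )
```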
source find_best_answer(all_start_logits: np.ndarray, all_end_logits: np.ndarray, prepared_dataset: Dataset, feature_indices: list[int], context: str, max_answer_length: int, num_best_logits: int, min_null_score: float, cls_token_index: int) → str
Find the best answer for a given example.
Parameters
- all_start_logits : np.ndarray — The start logits for all the features.
- all_end_logits : np.ndarray — The end logits for all the features.
- prepared_dataset : Dataset — The dataset containing the prepared examples.
- feature_indices : list[int] — The indices of the features associated with the current example.
- context : str — The context of the example.
- max_answer_length : int — The maximum length of the answer.
- num_best_logits : int — The number of best logits to consider.
- min_null_score : float — The minimum score an answer can have.
- cls_token_index : int — The index of the CLS token.
Returns
- str — The best answer for the example.
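The selection step reduces to a threshold-and-argmax over the candidates returned by find_valid_answers. A minimal sketch of that step, not ScandEval's exact code, with the empty-string fallback for unanswerable questions as an assumption:

```python
# Sketch of the selection step only: keep the highest-scoring candidate,
# or fall back to an empty string when nothing clears the null threshold.
def pick_best_answer(candidates: list[dict], min_null_score: float) -> str:
    """Each candidate is a dict with "text" and "score" keys."""
    valid = [c for c in candidates if c["score"] >= min_null_score]
    if not valid:
        return ""  # assumption: empty string marks an unanswerable question
    return max(valid, key=lambda c: c["score"])["text"]
```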
source find_valid_answers(start_logits: np.ndarray, end_logits: np.ndarray, offset_mapping: list[tuple[int, int]], context: str, max_answer_length: int, num_best_logits: int, min_null_score: float) → list[dict]
Find the valid answers from the start and end indexes.
Parameters
- start_logits : np.ndarray — The logits for the start of the answer.
- end_logits : np.ndarray — The logits for the end of the answer.
- offset_mapping : list[tuple[int, int]] — The offset mapping, being a list of pairs of integers for each token index, containing the start and end character index in the original context.
- context : str — The context of the example.
- max_answer_length : int — The maximum length of the answer.
- num_best_logits : int — The number of best logits to consider. Note that this function will run in O(num_best_logits^2) time.
- min_null_score : float — The minimum score an answer can have.
Returns
- list[dict] — A list of the valid answers, each being a dictionary with keys "text" and "score", the score being the sum of the start and end logits.
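The quadratic cost comes from pairing the top start logits with the top end logits. The sketch below shows that standard search under the assumption that tokens outside the context carry the offset (0, 0); it illustrates the technique rather than reproducing ScandEval's exact code.

```python
# Self-contained sketch of the standard start/end pairing search used for
# extractive QA postprocessing (not ScandEval's exact code).
import numpy as np


def sketch_find_valid_answers(
    start_logits: np.ndarray,
    end_logits: np.ndarray,
    offset_mapping: list[tuple[int, int]],
    context: str,
    max_answer_length: int,
    num_best_logits: int,
    min_null_score: float,
) -> list[dict]:
    # Indices of the `num_best_logits` highest start/end logits, best first.
    start_indices = np.argsort(start_logits)[-num_best_logits:][::-1]
    end_indices = np.argsort(end_logits)[-num_best_logits:][::-1]

    answers: list[dict] = []
    for start_index in start_indices:  # O(num_best_logits ** 2) pairs
        for end_index in end_indices:
            # Skip tokens outside the context (assumed to have offset (0, 0))
            # and spans that are reversed or longer than max_answer_length.
            if (
                start_index >= len(offset_mapping)
                or end_index >= len(offset_mapping)
                or offset_mapping[start_index] == (0, 0)
                or offset_mapping[end_index] == (0, 0)
                or end_index < start_index
                or end_index - start_index + 1 > max_answer_length
            ):
                continue
            # Score the span by the sum of its start and end logits.
            score = float(start_logits[start_index] + end_logits[end_index])
            if score < min_null_score:
                continue
            char_start = offset_mapping[start_index][0]
            char_end = offset_mapping[end_index][1]
            answers.append({"text": context[char_start:char_end], "score": score})
    return answers
```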