scandeval.task_utils.token_classification
source module scandeval.task_utils.token_classification
Utility functions related to the token-classification task group.
Functions
- compute_metrics — Compute the metrics needed for evaluation.
- extract_labels_from_generation — Extract the predicted labels from the generated output.
- tokenize_and_align_labels — Tokenise all texts and align the labels with them.
- handle_unk_tokens — Replace unknown tokens in the token list with the corresponding words.
source compute_metrics(model_outputs_and_labels: tuple[Predictions, Labels], has_misc_tags: bool, dataset_config: DatasetConfig, benchmark_config: BenchmarkConfig) → dict[str, float]
Compute the metrics needed for evaluation.
Parameters
- model_outputs_and_labels : tuple[Predictions, Labels] — The first array contains the probability predictions and the second array contains the true labels.
- has_misc_tags : bool — Whether the dataset has MISC tags.
- dataset_config : DatasetConfig — The configuration of the dataset.
- benchmark_config : BenchmarkConfig — The configuration of the benchmark.
Returns
- dict[str, float] — A dictionary with the names of the metrics as keys and the metric values as values.
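As a rough illustration of what such a metric computation involves (not ScandEval's own implementation, which also uses the dataset and benchmark configurations and the MISC-tag flag), the sketch below converts logits and true label IDs into label sequences, drops positions labelled -100, and scores the result with seqeval. The helper name `compute_micro_f1`, the `id2label` mapping and the choice of micro-averaged F1 are assumptions made for the example.

```python
# Illustrative sketch only, not the library's compute_metrics.
import numpy as np
from seqeval.metrics import f1_score


def compute_micro_f1(
    model_outputs_and_labels: tuple[np.ndarray, np.ndarray],
    id2label: dict[int, str],
) -> dict[str, float]:
    logits, label_ids = model_outputs_and_labels
    predictions = logits.argmax(axis=-1)

    true_labels, pred_labels = [], []
    for pred_seq, label_seq in zip(predictions, label_ids):
        # Positions labelled -100 are special tokens or subword continuations
        # and are excluded from scoring.
        true_labels.append([id2label[lbl] for lbl in label_seq if lbl != -100])
        pred_labels.append(
            [id2label[p] for p, lbl in zip(pred_seq, label_seq) if lbl != -100]
        )
    return {"micro_f1": f1_score(true_labels, pred_labels, average="micro")}
```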
source extract_labels_from_generation(input_batch: dict[str, list], model_output: GenerativeModelOutput, dataset_config: DatasetConfig) → list[t.Any]
Extract the predicted labels from the generated output.
Parameters
- input_batch : dict[str, list] — The input batch, where the keys are the feature names and the values are lists with the feature values.
- model_output : GenerativeModelOutput — The raw generated output of the model.
- dataset_config : DatasetConfig — The configuration of the dataset.
Returns
- list[t.Any] — The predicted labels.
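To illustrate the general idea (this is an assumption about the approach, not the library's actual parsing logic): a generative model is typically prompted to emit a structured answer, for instance a JSON object mapping entity types to entity strings, and that answer then has to be converted back into token-level BIO tags so it can be scored like an encoder model's output. The function name `bio_tags_from_generation` and the JSON output format below are hypothetical.

```python
# Assumed, simplified parsing logic for turning generated JSON into BIO tags.
import json


def bio_tags_from_generation(tokens: list[str], generated: str) -> list[str]:
    tags = ["O"] * len(tokens)
    try:
        entities = json.loads(generated)  # e.g. {"PER": ["Astrid Lindgren"]}
    except json.JSONDecodeError:
        return tags  # unparseable generation => no predicted entities

    for entity_type, names in entities.items():
        for name in names:
            name_tokens = name.split()
            # Find the entity's tokens in the input and tag them B-/I-.
            for i in range(len(tokens) - len(name_tokens) + 1):
                if tokens[i : i + len(name_tokens)] == name_tokens:
                    tags[i] = f"B-{entity_type}"
                    tags[i + 1 : i + len(name_tokens)] = [
                        f"I-{entity_type}"
                    ] * (len(name_tokens) - 1)
    return tags


print(bio_tags_from_generation(
    ["Astrid", "Lindgren", "boede", "i", "Stockholm"],
    '{"PER": ["Astrid Lindgren"], "LOC": ["Stockholm"]}',
))  # ['B-PER', 'I-PER', 'O', 'O', 'B-LOC']
```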
source tokenize_and_align_labels(examples: dict, tokenizer: PreTrainedTokenizer, label2id: dict[str, int]) → BatchEncoding
Tokenise all texts and align the labels with them.
Parameters
- examples : dict — The examples to be tokenised.
- tokenizer : PreTrainedTokenizer — A pretrained tokenizer.
- label2id : dict[str, int] — A dictionary that converts NER tags to IDs.
Returns
- BatchEncoding — A dictionary containing the tokenized data as well as labels.
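The sketch below shows the standard Hugging Face word-to-subword alignment recipe that this kind of function follows; the column names "tokens" and "labels" are assumptions and may differ from the ones ScandEval actually uses. Special tokens and subword continuations are given the label -100 so that the loss and the metrics ignore them.

```python
# Sketch of the usual alignment recipe, assuming "tokens"/"labels" columns and a
# fast tokenizer (needed for word_ids()).
from transformers import AutoTokenizer, BatchEncoding, PreTrainedTokenizer


def tokenize_and_align(
    examples: dict, tokenizer: PreTrainedTokenizer, label2id: dict[str, int]
) -> BatchEncoding:
    encoding = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )
    all_labels = []
    for i, ner_tags in enumerate(examples["labels"]):
        word_ids = encoding.word_ids(batch_index=i)
        previous_word_id = None
        label_ids = []
        for word_id in word_ids:
            if word_id is None or word_id == previous_word_id:
                label_ids.append(-100)  # special token or subword continuation
            else:
                label_ids.append(label2id[ner_tags[word_id]])
            previous_word_id = word_id
        all_labels.append(label_ids)
    encoding["labels"] = all_labels
    return encoding


tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
batch = tokenize_and_align(
    examples={"tokens": [["Astrid", "Lindgren"]], "labels": [["B-PER", "I-PER"]]},
    tokenizer=tokenizer,
    label2id={"O": 0, "B-PER": 1, "I-PER": 2},
)
print(batch["labels"])  # e.g. [[-100, 1, 2, -100]]; extra -100s appear for subword splits
```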
source handle_unk_tokens(tokenizer: PreTrainedTokenizer, tokens: list[str], words: list[str]) → list[str]
Replace unknown tokens in the token list with the corresponding words.
Parameters
- tokenizer : PreTrainedTokenizer — The tokenizer used to tokenize the words.
- tokens : list[str] — The list of tokens.
- words : list[str] — The list of words.
Returns
- list[str] — The list of tokens with unknown tokens replaced by the corresponding words.
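The sketch below illustrates one plausible way to do this and is only an assumption about the approach, not necessarily ScandEval's: each word is re-tokenised on its own, and wherever the tokenizer emitted its unknown token, the original word is substituted so the surface form remains available for alignment. It assumes the token list was produced by tokenising the words in order.

```python
# Assumed approach: re-tokenise word by word and swap every unknown token for
# the word that produced it.
from transformers import PreTrainedTokenizer


def replace_unk_tokens(
    tokenizer: PreTrainedTokenizer, tokens: list[str], words: list[str]
) -> list[str]:
    fixed = list(tokens)
    idx = 0
    for word in words:
        for _ in tokenizer.tokenize(word):
            if idx >= len(fixed):
                return fixed
            if fixed[idx] == tokenizer.unk_token:
                fixed[idx] = word  # recover the original surface form
            idx += 1
    return fixed
```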