scandeval.benchmark_modules.hf
source module scandeval.benchmark_modules.hf
Encoder models from the Hugging Face Hub.
Classes
- HuggingFaceEncoderModel — An encoder model from the Hugging Face Hub.
Functions
- load_model_and_tokenizer — Load the model and tokenizer.
- get_model_repo_info — Get the information about the model from the Hugging Face Hub or a local directory.
- load_tokenizer — Load the tokenizer.
- get_torch_dtype — Get the torch dtype used for loading the model.
- load_hf_model_config — Load the Hugging Face model configuration.
- setup_model_for_question_answering — Set up a model for question answering.
- get_children_of_module — Get the children of a module.
- align_model_and_tokenizer — Align the model and the tokenizer.
- task_group_to_class_name — Convert a task group to a class name.
source class HuggingFaceEncoderModel(model_config: ModelConfig, dataset_config: DatasetConfig, benchmark_config: BenchmarkConfig)
Bases : BenchmarkModule
An encoder model from the Hugging Face Hub.
Initialise the model.
Parameters
- model_config : ModelConfig — The model configuration.
- dataset_config : DatasetConfig — The dataset configuration.
- benchmark_config : BenchmarkConfig — The benchmark configuration.
Attributes
- generative_type : GenerativeType | None — The generative type of the model.
- data_collator : c.Callable[[list[t.Any]], dict[str, t.Any]] — The data collator used to prepare samples during finetuning.
- compute_metrics : ComputeMetricsFunction — The function used to compute the metrics.
- extract_labels_from_generation : ExtractLabelsFunction — The function used to extract the labels from the generated output.
- trainer_class : t.Type[Trainer] — The Trainer class to use for finetuning.
Methods
- num_params — The number of parameters in the model.
- vocab_size — The vocabulary size of the model.
- model_max_length — The maximum context length of the model.
- prepare_dataset — Prepare the dataset for the model.
- model_exists — Check if a model exists.
- get_model_config — Fetch the model configuration.
source method HuggingFaceEncoderModel.num_params() → int
The number of parameters in the model.
Returns
- int — The number of parameters in the model.
source method HuggingFaceEncoderModel.vocab_size() → int
The vocabulary size of the model.
Returns
- int — The vocabulary size of the model.
source method HuggingFaceEncoderModel.model_max_length() → int
The maximum context length of the model.
Returns
- int — The maximum context length of the model.
source property HuggingFaceEncoderModel.data_collator: c.Callable[[list[t.Any]], dict[str, t.Any]]
The data collator used to prepare samples during finetuning.
Returns
- c.Callable[[list[t.Any]], dict[str, t.Any]] — The data collator.
source property HuggingFaceEncoderModel.generative_type: GenerativeType | None
Get the generative type of the model.
Returns
- GenerativeType | None — The generative type of the model, or None if it has not been set yet.
source property HuggingFaceEncoderModel.extract_labels_from_generation: ExtractLabelsFunction
The function used to extract the labels from the generated output.
Returns
- ExtractLabelsFunction — The function used to extract the labels from the generated output.
source property HuggingFaceEncoderModel.trainer_class: t.Type[Trainer]
The Trainer class to use for finetuning.
Returns
- t.Type[Trainer] — The Trainer class.
source method HuggingFaceEncoderModel.prepare_dataset(dataset: DatasetDict, task: Task, itr_idx: int) → DatasetDict
Prepare the dataset for the model.
This includes things like tokenisation.
Parameters
- dataset : DatasetDict — The dataset to prepare.
- task : Task — The task to prepare the dataset for.
- itr_idx : int — The index of the dataset in the iterator.
Returns
- DatasetDict — The prepared dataset.
Raises
- NotImplementedError
source classmethod HuggingFaceEncoderModel.model_exists(model_id: str, benchmark_config: BenchmarkConfig) → bool | NeedsExtraInstalled | NeedsEnvironmentVariable
Check if a model exists.
Parameters
- model_id : str — The model ID.
- benchmark_config : BenchmarkConfig — The benchmark configuration.
Returns
- bool | NeedsExtraInstalled | NeedsEnvironmentVariable — Whether the model exists, or an error describing why we cannot check whether the model exists.
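Callers of such a check typically branch on the type of the result rather than treating it as a plain boolean. A minimal sketch of that pattern, using stand-in sentinel classes (ScandEval defines its own `NeedsExtraInstalled` and `NeedsEnvironmentVariable` types, whose exact attributes are not shown here):

```python
# Hypothetical stand-ins for ScandEval's sentinel types, for illustration only.
class NeedsExtraInstalled:
    def __init__(self, extra: str) -> None:
        self.extra = extra

class NeedsEnvironmentVariable:
    def __init__(self, env_var: str) -> None:
        self.env_var = env_var

def describe_existence_check(result) -> str:
    """Turn a model_exists-style result into a human-readable message."""
    if isinstance(result, NeedsExtraInstalled):
        return f"cannot check: install the '{result.extra}' extra first"
    if isinstance(result, NeedsEnvironmentVariable):
        return f"cannot check: set the {result.env_var} environment variable"
    return "model exists" if result else "model not found"
```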
source classmethod HuggingFaceEncoderModel.get_model_config(model_id: str, benchmark_config: BenchmarkConfig) → ModelConfig
Fetch the model configuration.
Parameters
- model_id : str — The model ID.
- benchmark_config : BenchmarkConfig — The benchmark configuration.
Returns
- ModelConfig — The model configuration.
source load_model_and_tokenizer(model_config: ModelConfig, dataset_config: DatasetConfig, benchmark_config: BenchmarkConfig) → tuple[PreTrainedModel, PreTrainedTokenizer]
Load the model and tokenizer.
Parameters
- model_config : ModelConfig — The model configuration.
- dataset_config : DatasetConfig — The dataset configuration.
- benchmark_config : BenchmarkConfig — The benchmark configuration.
Returns
- tuple[PreTrainedModel, PreTrainedTokenizer] — The loaded model and tokenizer.
source get_model_repo_info(model_id: str, revision: str, benchmark_config: BenchmarkConfig) → HFModelInfo | None
Get the information about the model from the Hugging Face Hub or a local directory.
Parameters
- model_id : str — The model ID.
- revision : str — The revision of the model.
- benchmark_config : BenchmarkConfig — The benchmark configuration.
Returns
- HFModelInfo | None — The information about the model, or None if the model could not be found.
source load_tokenizer(model: PreTrainedModel | None, model_id: str, trust_remote_code: bool) → PreTrainedTokenizer
Load the tokenizer.
Parameters
- model : PreTrainedModel | None — The model, which is used to determine whether to add a prefix space to the tokens. Can be None.
- model_id : str — The model identifier. Used for logging.
- trust_remote_code : bool — Whether to trust remote code.
Returns
- PreTrainedTokenizer — The loaded tokenizer.
source get_torch_dtype(device: torch.device, torch_dtype_is_set: bool, bf16_available: bool) → str | torch.dtype
Get the torch dtype used for loading the model.
Parameters
- device : torch.device — The device to use.
- torch_dtype_is_set : bool — Whether the torch data type is set in the model configuration.
- bf16_available : bool — Whether bfloat16 is available.
Returns
- str | torch.dtype — The torch dtype.
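The selection policy implied by these parameters can be sketched as follows. This mirrors a common pattern (full precision on CPU, defer to an explicitly configured dtype, otherwise prefer bfloat16 when the hardware supports it); it is an assumption about the logic, not the library's actual implementation, and dtypes are represented as strings to keep the sketch free of a torch dependency:

```python
def pick_torch_dtype(device_type: str, torch_dtype_is_set: bool, bf16_available: bool) -> str:
    """Sketch of a dtype-selection policy for model loading."""
    if device_type == "cpu":
        # Half precision is slow or unsupported on most CPUs.
        return "float32"
    if torch_dtype_is_set:
        # Defer to the dtype stored in the model configuration.
        return "auto"
    # On accelerators, prefer bfloat16 for its wider dynamic range.
    return "bfloat16" if bf16_available else "float16"
```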
source load_hf_model_config(model_id: str, num_labels: int, id2label: dict[int, str], label2id: dict[str, int], revision: str, model_cache_dir: str | None, api_key: str | None, trust_remote_code: bool, run_with_cli: bool) → PretrainedConfig
Load the Hugging Face model configuration.
Parameters
- model_id : str — The Hugging Face model ID.
- num_labels : int — The number of labels in the dataset.
- id2label : dict[int, str] — The mapping from label IDs to labels.
- label2id : dict[str, int] — The mapping from labels to label IDs.
- revision : str — The revision of the model.
- model_cache_dir : str | None — The directory to cache the model in.
- api_key : str | None — The Hugging Face API key.
- trust_remote_code : bool — Whether to trust remote code.
- run_with_cli : bool — Whether the script is being run with the CLI.
Returns
- PretrainedConfig — The Hugging Face model configuration.
source setup_model_for_question_answering(model: PreTrainedModel) → PreTrainedModel
Set up a model for question answering.
Parameters
- model : PreTrainedModel — The model to set up.
Returns
- PreTrainedModel — The prepared model.
source get_children_of_module(name: str, module: nn.Module) → nn.Module | dict[str, t.Any] | None
Get the children of a module.
Parameters
- name : str — The name of the module.
- module : nn.Module — The module to get the children of.
Returns
- nn.Module | dict[str, t.Any] | None — The children of the module, or None if the module has no children.
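The recursive shape of such a helper can be illustrated without torch, using a tiny hypothetical stand-in for nn.Module. Matching the return type above, the traversal below returns the module itself at a leaf and a name-to-subtree dict otherwise; it is a sketch of the idea, not the library's exact code:

```python
class FakeModule:
    """Minimal stand-in for torch.nn.Module, exposing named children."""
    def __init__(self, **children):
        self._children = children

    def named_children(self):
        return self._children.items()

def children_of(module):
    """Return the module itself at a leaf, else a dict mapping names to subtrees."""
    kids = dict(module.named_children())
    if not kids:
        return module
    return {name: children_of(child) for name, child in kids.items()}
```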
source align_model_and_tokenizer(model: PreTrainedModel, tokenizer: PreTrainedTokenizer, model_max_length: int, raise_errors: bool = False) → tuple[PreTrainedModel, PreTrainedTokenizer]
Align the model and the tokenizer.
Parameters
- model : PreTrainedModel — The model to fix.
- tokenizer : PreTrainedTokenizer — The tokenizer to fix.
- model_max_length : int — The maximum context length of the model.
- raise_errors : bool — Whether to raise errors instead of trying to fix them silently.
Returns
- tuple[PreTrainedModel, PreTrainedTokenizer] — The fixed model and tokenizer.
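The kind of alignment this performs can be sketched with plain dataclasses standing in for the model and tokenizer. The specific checks below (reconciling vocabulary sizes, capping the tokenizer's maximum length) are assumptions about typical alignment logic, not the library's exact behaviour:

```python
from dataclasses import dataclass

@dataclass
class FakeModel:
    vocab_size: int

@dataclass
class FakeTokenizer:
    vocab_size: int
    model_max_length: int

def align(model, tokenizer, model_max_length, raise_errors=False):
    """Sketch: make model and tokenizer agree on vocab size and max length."""
    if model.vocab_size != tokenizer.vocab_size:
        if raise_errors:
            raise ValueError("model and tokenizer vocabularies differ")
        # Grow the smaller vocabulary so embeddings and token IDs line up.
        size = max(model.vocab_size, tokenizer.vocab_size)
        model.vocab_size = tokenizer.vocab_size = size
    # Never let the tokenizer emit sequences longer than the model accepts.
    tokenizer.model_max_length = min(tokenizer.model_max_length, model_max_length)
    return model, tokenizer
```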
source task_group_to_class_name(task_group: TaskGroup) → str
Convert a task group to a class name.
Parameters
- task_group : TaskGroup — The task group.
Returns
- str — The class name.
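A plausible implementation turns a task-group value such as "sequence-classification" into the class-name fragment "SequenceClassification" by capitalising each separator-delimited part. This is a sketch of that naming convention, not necessarily the library's exact code:

```python
import re

def task_group_to_class_name(task_group: str) -> str:
    """Convert e.g. 'sequence-classification' to 'SequenceClassification'."""
    parts = re.split(r"[-_\s]+", task_group.strip())
    return "".join(part.capitalize() for part in parts)
```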