scandeval.benchmark_modules.hf
source module scandeval.benchmark_modules.hf
Encoder models from the Hugging Face Hub.
Classes
- HuggingFaceEncoderModel — An encoder model from the Hugging Face Hub.
Functions
- load_model_and_tokenizer — Load the model and tokenizer.
- get_model_repo_info — Get the information about the model from the Hugging Face Hub or a local directory.
- load_tokenizer — Load the tokenizer.
- get_torch_dtype — Get the torch dtype used for loading the model.
- load_hf_model_config — Load the Hugging Face model configuration.
- setup_model_for_question_answering — Set up a model for question answering.
- get_children_of_module — Get the children of a module.
- align_model_and_tokenizer — Align the model and the tokenizer.
- task_group_to_class_name — Convert a task group to a class name.
source class HuggingFaceEncoderModel(model_config: ModelConfig, dataset_config: DatasetConfig, benchmark_config: BenchmarkConfig)
Bases : BenchmarkModule
An encoder model from the Hugging Face Hub.
Initialise the model.
Parameters
- model_config : ModelConfig — The model configuration.
- dataset_config : DatasetConfig — The dataset configuration.
- benchmark_config : BenchmarkConfig — The benchmark configuration.
Attributes
- generative_type : GenerativeType | None — The generative type of the model.
- data_collator : c.Callable[[list[t.Any]], dict[str, t.Any]] — The data collator used to prepare samples during finetuning.
- compute_metrics : ComputeMetricsFunction — The function used to compute the metrics.
- extract_labels_from_generation : ExtractLabelsFunction — The function used to extract the labels from the generated output.
- trainer_class : t.Type[Trainer] — The Trainer class to use for finetuning.
Methods
- num_params — The number of parameters in the model.
- vocab_size — The vocabulary size of the model.
- model_max_length — The maximum context length of the model.
- prepare_dataset — Prepare the dataset for the model.
- model_exists — Check if a model exists.
- get_model_config — Fetch the model configuration.
source method HuggingFaceEncoderModel.num_params() → int
The number of parameters in the model.
Returns
- int — The number of parameters in the model.
source method HuggingFaceEncoderModel.vocab_size() → int
The vocabulary size of the model.
Returns
- int — The vocabulary size of the model.
source method HuggingFaceEncoderModel.model_max_length() → int
The maximum context length of the model.
Returns
- int — The maximum context length of the model.
source property HuggingFaceEncoderModel.data_collator: c.Callable[[list[t.Any]], dict[str, t.Any]]
The data collator used to prepare samples during finetuning.
Returns
- c.Callable[[list[t.Any]], dict[str, t.Any]] — The data collator.
source property HuggingFaceEncoderModel.generative_type: GenerativeType | None
Get the generative type of the model.
Returns
- GenerativeType | None — The generative type of the model, or None if it has not been set yet.
source property HuggingFaceEncoderModel.extract_labels_from_generation: ExtractLabelsFunction
The function used to extract the labels from the generated output.
Returns
- ExtractLabelsFunction — The function used to extract the labels from the generated output.
source property HuggingFaceEncoderModel.trainer_class: t.Type[Trainer]
The Trainer class to use for finetuning.
Returns
- t.Type[Trainer] — The Trainer class.
source method HuggingFaceEncoderModel.prepare_dataset(dataset: DatasetDict, task: Task, itr_idx: int) → DatasetDict
Prepare the dataset for the model.
This includes things like tokenisation.
Parameters
- dataset : DatasetDict — The dataset to prepare.
- task : Task — The task to prepare the dataset for.
- itr_idx : int — The index of the dataset in the iterator.
Returns
- DatasetDict — The prepared dataset.
Raises
- NotImplementedError
source classmethod HuggingFaceEncoderModel.model_exists(model_id: str, benchmark_config: BenchmarkConfig) → bool | NeedsExtraInstalled | NeedsEnvironmentVariable
Check if a model exists.
Parameters
- model_id : str — The model ID.
- benchmark_config : BenchmarkConfig — The benchmark configuration.
Returns
- bool | NeedsExtraInstalled | NeedsEnvironmentVariable — Whether the model exists, or an error describing why we cannot check whether the model exists.
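Callers of such a check typically branch on the type of the result rather than treating it as a plain boolean. A minimal sketch of that pattern, using stand-in sentinel classes (ScandEval defines its own `NeedsExtraInstalled` and `NeedsEnvironmentVariable` types, whose exact attributes are not shown here):

```python
# Hypothetical stand-ins for ScandEval's sentinel types, for illustration only.
class NeedsExtraInstalled:
    def __init__(self, extra: str) -> None:
        self.extra = extra

class NeedsEnvironmentVariable:
    def __init__(self, env_var: str) -> None:
        self.env_var = env_var

def describe_existence_check(result) -> str:
    """Turn a model_exists-style result into a human-readable message."""
    if isinstance(result, NeedsExtraInstalled):
        return f"cannot check: install the '{result.extra}' extra first"
    if isinstance(result, NeedsEnvironmentVariable):
        return f"cannot check: set the {result.env_var} environment variable"
    return "model exists" if result else "model not found"
```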
source classmethod HuggingFaceEncoderModel.get_model_config(model_id: str, benchmark_config: BenchmarkConfig) → ModelConfig
Fetch the model configuration.
Parameters
- model_id : str — The model ID.
- benchmark_config : BenchmarkConfig — The benchmark configuration.
Returns
- ModelConfig — The model configuration.
source load_model_and_tokenizer(model_config: ModelConfig, dataset_config: DatasetConfig, benchmark_config: BenchmarkConfig) → tuple[PreTrainedModel, PreTrainedTokenizer]
Load the model and tokenizer.
Parameters
- model_config : ModelConfig — The model configuration.
- dataset_config : DatasetConfig — The dataset configuration.
- benchmark_config : BenchmarkConfig — The benchmark configuration.
Returns
- tuple[PreTrainedModel, PreTrainedTokenizer] — The loaded model and tokenizer.
source get_model_repo_info(model_id: str, revision: str, benchmark_config: BenchmarkConfig) → HFModelInfo | None
Get the information about the model from the Hugging Face Hub or a local directory.
Parameters
- model_id : str — The model ID.
- revision : str — The revision of the model.
- benchmark_config : BenchmarkConfig — The benchmark configuration.
Returns
- HFModelInfo | None — The information about the model, or None if the model could not be found.
source load_tokenizer(model: PreTrainedModel | None, model_id: str, trust_remote_code: bool) → PreTrainedTokenizer
Load the tokenizer.
Parameters
- model : PreTrainedModel | None — The model, which is used to determine whether to add a prefix space to the tokens. Can be None.
- model_id : str — The model identifier. Used for logging.
- trust_remote_code : bool — Whether to trust remote code.
Returns
- PreTrainedTokenizer — The loaded tokenizer.
source get_torch_dtype(device: torch.device, torch_dtype_is_set: bool, bf16_available: bool) → str | torch.dtype
Get the torch dtype used for loading the model.
Parameters
- device : torch.device — The device to use.
- torch_dtype_is_set : bool — Whether the torch data type is set in the model configuration.
- bf16_available : bool — Whether bfloat16 is available.
Returns
- str | torch.dtype — The torch dtype.
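The selection policy implied by these parameters can be sketched as follows. This mirrors a common pattern (full precision on CPU, defer to an explicitly configured dtype, otherwise prefer bfloat16 when the hardware supports it); it is an assumption about the logic, not the library's actual implementation, and dtypes are represented as strings to keep the sketch free of a torch dependency:

```python
def pick_torch_dtype(device_type: str, torch_dtype_is_set: bool, bf16_available: bool) -> str:
    """Sketch of a dtype-selection policy for model loading."""
    if device_type == "cpu":
        # Half precision is slow or unsupported on most CPUs.
        return "float32"
    if torch_dtype_is_set:
        # Defer to the dtype stored in the model configuration.
        return "auto"
    # On accelerators, prefer bfloat16 for its wider dynamic range.
    return "bfloat16" if bf16_available else "float16"
```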
source load_hf_model_config(model_id: str, num_labels: int, id2label: dict[int, str], label2id: dict[str, int], revision: str, model_cache_dir: str | None, api_key: str | None, trust_remote_code: bool, run_with_cli: bool) → PretrainedConfig
Load the Hugging Face model configuration.
Parameters
- model_id : str — The Hugging Face model ID.
- num_labels : int — The number of labels in the dataset.
- id2label : dict[int, str] — The mapping from label IDs to labels.
- label2id : dict[str, int] — The mapping from labels to label IDs.
- revision : str — The revision of the model.
- model_cache_dir : str | None — The directory to cache the model in.
- api_key : str | None — The Hugging Face API key.
- trust_remote_code : bool — Whether to trust remote code.
- run_with_cli : bool — Whether the script is being run with the CLI.
Returns
- PretrainedConfig — The Hugging Face model configuration.
source setup_model_for_question_answering(model: PreTrainedModel) → PreTrainedModel
Set up a model for question answering.
Parameters
- model : PreTrainedModel — The model to set up.
Returns
- PreTrainedModel — The prepared model.
source get_children_of_module(name: str, module: nn.Module) → nn.Module | dict[str, t.Any] | None
Get the children of a module.
Parameters
- name : str — The name of the module.
- module : nn.Module — The module to get the children of.
Returns
- nn.Module | dict[str, t.Any] | None — The children of the module, or None if the module has no children.
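The recursive shape of such a helper can be illustrated without torch, using a tiny hypothetical stand-in for nn.Module. Matching the return type above, the traversal below returns the module itself at a leaf and a name-to-subtree dict otherwise; it is a sketch of the idea, not the library's exact code:

```python
class FakeModule:
    """Minimal stand-in for torch.nn.Module, exposing named children."""
    def __init__(self, **children):
        self._children = children

    def named_children(self):
        return self._children.items()

def children_of(module):
    """Return the module itself at a leaf, else a dict mapping names to subtrees."""
    kids = dict(module.named_children())
    if not kids:
        return module
    return {name: children_of(child) for name, child in kids.items()}
```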
source align_model_and_tokenizer(model: PreTrainedModel, tokenizer: PreTrainedTokenizer, model_max_length: int, raise_errors: bool = False) → tuple[PreTrainedModel, PreTrainedTokenizer]
Align the model and the tokenizer.
Parameters
- model : PreTrainedModel — The model to fix.
- tokenizer : PreTrainedTokenizer — The tokenizer to fix.
- model_max_length : int — The maximum context length of the model.
- raise_errors : bool — Whether to raise errors instead of trying to fix them silently.
Returns
- tuple[PreTrainedModel, PreTrainedTokenizer] — The fixed model and tokenizer.
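The kind of alignment this performs can be sketched with plain dataclasses standing in for the model and tokenizer. The specific checks below (reconciling vocabulary sizes, capping the tokenizer's maximum length) are assumptions about typical alignment logic, not the library's exact behaviour:

```python
from dataclasses import dataclass

@dataclass
class FakeModel:
    vocab_size: int

@dataclass
class FakeTokenizer:
    vocab_size: int
    model_max_length: int

def align(model, tokenizer, model_max_length, raise_errors=False):
    """Sketch: make model and tokenizer agree on vocab size and max length."""
    if model.vocab_size != tokenizer.vocab_size:
        if raise_errors:
            raise ValueError("model and tokenizer vocabularies differ")
        # Grow the smaller vocabulary so embeddings and token IDs line up.
        size = max(model.vocab_size, tokenizer.vocab_size)
        model.vocab_size = tokenizer.vocab_size = size
    # Never let the tokenizer emit sequences longer than the model accepts.
    tokenizer.model_max_length = min(tokenizer.model_max_length, model_max_length)
    return model, tokenizer
```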
source task_group_to_class_name(task_group: TaskGroup) → str
Convert a task group to a class name.
Parameters
- task_group : TaskGroup — The task group.
Returns
- str — The class name.
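A plausible implementation turns a task-group value such as "sequence-classification" into the class-name fragment "SequenceClassification" by capitalising each separator-delimited part. This is a sketch of that naming convention, not necessarily the library's exact code:

```python
import re

def task_group_to_class_name(task_group: str) -> str:
    """Convert e.g. 'sequence-classification' to 'SequenceClassification'."""
    parts = re.split(r"[-_\s]+", task_group.strip())
    return "".join(part.capitalize() for part in parts)
```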