scandeval.data_models

"""Data models used in ScandEval."""importcollections.abcascimportimportlib.metadataimportjsonimportpathlibimportreimporttypingastfromdataclassesimportdataclass,fieldimportpydanticimporttorchfrom.enumsimportDevice,InferenceBackend,ModelType,TaskGroupfrom.typesimportScoreDict@dataclassclassMetricConfig:docs
    """Configuration for a metric.    Attributes:        name:            The name of the metric.        pretty_name:            A longer prettier name for the metric, which allows cases and spaces. Used            for logging.        huggingface_id:            The Hugging Face ID of the metric.        results_key:            The name of the key used to extract the metric scores from the results            dictionary.        compute_kwargs:            Keyword arguments to pass to the metric's compute function. Defaults to            an empty dictionary.        postprocessing_fn:            A function to apply to the metric scores after they are computed, taking            the score to the postprocessed score along with its string representation.            Defaults to x -> (100 * x, f"{x:.2%}").    """name:strpretty_name:strhuggingface_id:strresults_key:strcompute_kwargs:dict[str,t.Any]=field(default_factory=dict)postprocessing_fn:c.Callable[[float],tuple[float,str]]=field(default_factory=lambda:lambdaraw_score:(100*raw_score,f"{raw_score:.2%}"))def__hash__(self)->int:        """Return a hash of the metric configuration."""returnhash(self.name)@dataclassclassTask:docs
    """A dataset task.    Attributes:        name:            The name of the task.        task_group:            The task group of the task.        metrics:            The metrics used to evaluate the task.    """name:strtask_group:TaskGroupmetrics:list[MetricConfig]def__hash__(self)->int:        """Return a hash of the task."""returnhash(self.name)@dataclassclassLanguage:docs
    """A benchmarkable language.    Attributes:        code:            The ISO 639-1 language code of the language.        name:            The name of the language.    """code:strname:strdef__hash__(self)->int:        """Return a hash of the language."""returnhash(self.code)@dataclassclassBenchmarkConfig:docs
    """General benchmarking configuration, across datasets and models.    Attributes:        model_languages:            The languages of the models to benchmark.        dataset_languages:            The languages of the datasets in the benchmark.        tasks:            The tasks benchmark the model(s) on.        datasets:            The datasets to benchmark on.        batch_size:            The batch size to use.        raise_errors:            Whether to raise errors instead of skipping them.        cache_dir:            Directory to store cached models and datasets.        api_key:            The API key to use for a given inference API.        force:            Whether to force the benchmark to run even if the results are already            cached.        progress_bar:            Whether to show a progress bar.        save_results:            Whether to save the benchmark results to 'scandeval_benchmark_results.json'.        device:            The device to use for benchmarking.        verbose:            Whether to print verbose output.        trust_remote_code:            Whether to trust remote code when loading models from the Hugging Face Hub.        use_flash_attention:            Whether to use Flash Attention. If None then this will be used for            generative models.        clear_model_cache:            Whether to clear the model cache after benchmarking each model.        evaluate_test_split:            Whether to evaluate on the test split.        few_shot:            Whether to only evaluate the model using few-shot evaluation. Only relevant            if the model is generative.        num_iterations:            The number of iterations each model should be evaluated for.        api_base:            The base URL for a given inference API. Only relevant if `model` refers to a            model on an inference API.        api_version:            The version of the API to use. 
Only relevant if `model` refers to a model on            an inference API.        debug:            Whether to run the benchmark in debug mode.        run_with_cli:            Whether the benchmark is being run with the CLI.        only_allow_safetensors:            Whether to only allow models that use the safetensors format.    """model_languages:list[Language]dataset_languages:list[Language]tasks:list[Task]datasets:list[str]batch_size:intraise_errors:boolcache_dir:strapi_key:str|Noneforce:boolprogress_bar:boolsave_results:booldevice:torch.deviceverbose:booltrust_remote_code:booluse_flash_attention:bool|Noneclear_model_cache:boolevaluate_test_split:boolfew_shot:boolnum_iterations:intapi_base:str|Noneapi_version:str|Nonedebug:boolrun_with_cli:boolonly_allow_safetensors:boolclassBenchmarkConfigParams(pydantic.BaseModel):docs
    """The parameters for the benchmark configuration."""model_config=pydantic.ConfigDict(protected_namespaces=())progress_bar:boolsave_results:booltask:str|list[str]|Nonedataset:str|list[str]|Nonelanguage:str|list[str]model_language:str|list[str]|Nonedataset_language:str|list[str]|Nonedevice:Device|Nonebatch_size:intraise_errors:boolcache_dir:strapi_key:str|Noneforce:boolverbose:booltrust_remote_code:booluse_flash_attention:bool|Noneclear_model_cache:boolevaluate_test_split:boolfew_shot:boolnum_iterations:intapi_base:str|Noneapi_version:str|Nonedebug:boolrun_with_cli:boolonly_allow_safetensors:boolclassBenchmarkResult(pydantic.BaseModel):docs
    """A benchmark result."""dataset:strtask:strdataset_languages:list[str]model:strresults:ScoreDictnum_model_parameters:intmax_sequence_length:intvocabulary_size:intmerge:boolgenerative:boolgenerative_type:str|Nonefew_shot:boolvalidation_split:boolscandeval_version:str=importlib.metadata.version("scandeval")@classmethoddeffrom_dict(cls,config:dict)->"BenchmarkResult":docs
        """Create a benchmark result from a dictionary.        Args:            config:                The configuration dictionary.        Returns:            The benchmark result.        """# To be backwards compatible, we accept old results which changed the model# name with parameters rather than adding them as explicit parametersval_matches=re.search(r"\(.*val.*\)$",config["model"])few_shot_matches=re.search(r"\(.*few-shot.*\)$",config["model"])zero_shot_matches=re.search(r"\(.*zero-shot.*\)$",config["model"])config["model"]=re.sub(r"\(.*(few-shot|val).*\)$","",config["model"]).strip()if"merge"notinconfig:config["merge"]=Falseif"generative"notinconfig:config["generative"]=(few_shot_matchesisnotNoneorzero_shot_matchesisnotNone)if"generative_type"notinconfig:config["generative_type"]=Noneif"few_shot"notinconfig:config["few_shot"]=zero_shot_matchesisNoneif"validation_split"notinconfig:config["validation_split"]=val_matchesisnotNonereturncls(**config)defappend_to_results(self,results_path:pathlib.Path)->None:docs
        """Append the benchmark result to the results file.        Args:            results_path:                The path to the results file.        """json_str=json.dumps(self.model_dump())withresults_path.open("a")asf:f.write("\n"+json_str)@dataclassclassDatasetConfig:docs
    """Configuration for a dataset.    Attributes:        name:            The name of the dataset. Must be lower case with no spaces.        pretty_name:            A longer prettier name for the dataset, which allows cases and spaces. Used            for logging.        huggingface_id:            The Hugging Face ID of the dataset.        task:            The task of the dataset.        languages:            The ISO 639-1 language codes of the entries in the dataset.        id2label:            The mapping from ID to label.        label2id:            The mapping from label to ID.        num_labels:            The number of labels in the dataset.        prompt_template:            The template for the prompt to use when benchmarking the dataset using            few-shot evaluation.        max_generated_tokens:            The maximum number of tokens to generate when benchmarking the dataset            using few-shot evaluation.        prompt_prefix:            The prefix to use in the few-shot prompt.        num_few_shot_examples:            The number of examples to use when benchmarking the dataset using few-shot            evaluation. For a classification task, these will be drawn evenly from            each label.        instruction_prompt:            The prompt to use when benchmarking the dataset using instruction-based            evaluation.        labels (optional):            The labels in the dataset. Defaults to an empty list.        prompt_label_mapping (optional):            A mapping from the labels to another phrase which is used as a substitute            for the label in few-shot evaluation. Defaults to an empty dictionary.        unofficial (optional):            Whether the dataset is unofficial. Defaults to False.    
"""name:strpretty_name:strhuggingface_id:strtask:Tasklanguages:list[Language]prompt_template:strmax_generated_tokens:intprompt_prefix:strnum_few_shot_examples:intinstruction_prompt:strlabels:list[str]=field(default_factory=list)prompt_label_mapping:dict[str,str]=field(default_factory=dict)unofficial:bool=False@propertydefid2label(self)->dict[int,str]:docs
        """The mapping from ID to label."""return{idx:labelforidx,labelinenumerate(self.labels)}@propertydeflabel2id(self)->dict[str,int]:docs
        """The mapping from label to ID."""return{label:ifori,labelinenumerate(self.labels)}@propertydefnum_labels(self)->int:docs
        """The number of labels in the dataset."""returnlen(self.labels)def__hash__(self)->int:        """Return a hash of the dataset configuration."""returnhash(self.name)@dataclassclassModelConfig:docs
    """Configuration for a model.    Attributes:        model_id:            The ID of the model.        revision:            The revision of the model.        task:            The task that the model was trained on.        languages:            The languages of the model.        inference_backend:            The backend used to perform inference with the model.        merge:            Whether the model is a merged model.        model_type:            The type of the model (e.g., encoder, base decoder, instruction tuned).        fresh:            Whether the model is freshly initialised.        model_cache_dir:            The directory to cache the model in.        adapter_base_model_id:            The model ID of the base model if the model is an adapter model. Can be None            if the model is not an adapter model.    """model_id:strrevision:strtask:strlanguages:list[Language]inference_backend:InferenceBackendmerge:boolmodel_type:ModelTypefresh:boolmodel_cache_dir:stradapter_base_model_id:str|Nonedef__hash__(self)->int:        """Return a hash of the model configuration."""returnhash(self.model_id)@dataclassclassPreparedModelInputs:docs
    """The inputs to a model.    Attributes:        texts:            The texts to input to the model. Can be None if the input IDs and attention            mask are provided instead.        input_ids:            The input IDs of the texts. Can be None if the texts are provided instead.        attention_mask:            The attention mask of the texts. Can be None if the texts are provided            instead.    """texts:list[str]|None=Noneinput_ids:torch.Tensor|None=Noneattention_mask:torch.Tensor|None=None@dataclassclassGenerativeModelOutput:docs
    """The output of a generative model.    Attributes:        sequences:            The generated sequences.        scores:            The scores of the sequences. This is an array of shape (batch_size,            num_tokens, num_logprobs, 2), where the last dimension contains the            token and its logprob. Can be None if the scores are not available.    """sequences:list[str]scores:list[list[list[tuple[str,float]]]]|None=None@dataclassclassSingleGenerativeModelOutput:docs
    """A single output of a generative model.    Attributes:        sequence:            The generated sequence.        scores:            The scores of the sequence. This is an array of shape (num_tokens,            num_logprobs, 2), where the last dimension contains the token and its            logprob. Can be None if the scores are not available.    """sequence:strscores:list[list[tuple[str,float]]]|None=None@dataclassclassHFModelInfo:docs
    """Information about a Hugging Face model.    Attributes:        pipeline_tag:            The pipeline tag of the model.        tags:            The other tags of the model.        adapter_base_model_id:            The model ID of the base model if the model is an adapter model. Can be None            if the model is not an adapter model.    """pipeline_tag:strtags:list[str]adapter_base_model_id:str|None