scandeval.human_evaluation
source module scandeval.human_evaluation
Gradio app for conducting human evaluation of the tasks.
Classes
- HumanEvaluator — An app for evaluating human performance on the ScandEval benchmark.
Functions
- main — Start the Gradio app for human evaluation.
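A minimal sketch of launching the app through main. Whether main accepts command-line options (for example, an annotator ID) is not documented here, so calling it without arguments is an assumption.

```python
# Illustrative sketch; assumes `main` can be called without arguments.
from scandeval.human_evaluation import main

if __name__ == "__main__":
    main()  # starts the Gradio app for human evaluation
```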
source class HumanEvaluator(title: str, description: str, dummy_model_id: str = 'mistralai/Mistral-7B-v0.1')
An app for evaluating human performance on the ScandEval benchmark.
Initialize the HumanEvaluator.
Parameters
- annotator_id : int — The annotator ID for the evaluation.
- title : str — The title of the app.
- description : str — The description of the app.
- dummy_model_id : str — The model ID to use for generating prompts.
Methods
- create_app — Create the Gradio app for human evaluation.
- update_dataset_choices — Update the dataset choices based on the selected language and task.
- update_dataset — Update the dataset based on a selected dataset name.
- add_entity_to_answer — Add an entity to the answer.
- reset_entities — Reset the entities in the answer.
- submit_answer — Submit an answer to the dataset.
- example_to_markdown — Convert an example to a Markdown string.
- compute_and_log_scores — Compute and log the scores for the dataset.
source method HumanEvaluator.create_app() → gr.Blocks
Create the Gradio app for human evaluation.
Returns
- gr.Blocks — The Gradio app for human evaluation.
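A minimal sketch of programmatic use, assuming the constructor arguments shown in the signature above (the annotator_id documented in the parameter list may also be required, depending on the installed version). Since create_app returns a gr.Blocks instance, Gradio's standard launch() serves it.

```python
# Illustrative sketch, not the library's own launch code.
from scandeval.human_evaluation import HumanEvaluator

evaluator = HumanEvaluator(
    title="ScandEval human evaluation",
    description="Evaluate your own performance on the benchmark datasets.",
    # dummy_model_id defaults to "mistralai/Mistral-7B-v0.1"
)

app = evaluator.create_app()  # returns a gr.Blocks instance
app.launch()                  # standard Gradio method for serving the app
```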
source method HumanEvaluator.update_dataset_choices(language: str | None, task: str | None) → gr.Dropdown
Update the dataset choices based on the selected language and task.
Parameters
- language : str | None — The language selected by the user.
- task : str | None — The task selected by the user.
Returns
- gr.Dropdown — A dropdown whose choices are the dataset names matching the selected language and task.
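A hedged sketch of how update_dataset_choices might be wired inside a Gradio Blocks context. The component names, languages, and task choices are illustrative assumptions, not taken from the library.

```python
import gradio as gr

from scandeval.human_evaluation import HumanEvaluator

evaluator = HumanEvaluator(
    title="ScandEval human evaluation",
    description="Evaluate your own performance on the benchmark datasets.",
)

with gr.Blocks() as demo:
    language_dropdown = gr.Dropdown(label="Language", choices=["da", "no", "sv"])
    task_dropdown = gr.Dropdown(label="Task", choices=["named-entity-recognition"])
    dataset_dropdown = gr.Dropdown(label="Dataset", choices=[])

    # Refresh the dataset choices whenever the language or task selection changes.
    language_dropdown.change(
        fn=evaluator.update_dataset_choices,
        inputs=[language_dropdown, task_dropdown],
        outputs=dataset_dropdown,
    )
    task_dropdown.change(
        fn=evaluator.update_dataset_choices,
        inputs=[language_dropdown, task_dropdown],
        outputs=dataset_dropdown,
    )
```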
source method HumanEvaluator.update_dataset(dataset_name: str, iteration: int) → tuple[gr.Markdown, gr.Markdown, gr.Dropdown, gr.Textbox, gr.Button, gr.Button, gr.Textbox, gr.Button]
Update the dataset based on a selected dataset name.
Parameters
- dataset_name : str — The dataset name selected by the user.
- iteration : int — The iteration index of the datasets to evaluate.
Returns
- tuple[gr.Markdown, gr.Markdown, gr.Dropdown, gr.Textbox, gr.Button, gr.Button, gr.Textbox, gr.Button] — A tuple (task_examples, question, entity_type, entity, entity_add_button, entity_reset_button, answer, submit_button) for the selected dataset.
Raises
- NotImplementedError
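Continuing the illustrative sketch above, the dataset dropdown could drive update_dataset. The eight output components mirror the documented return tuple; holding the iteration index in a gr.State is an assumption about how the real app tracks it.

```python
    # Still inside the gr.Blocks context from the sketch above.
    task_examples_md = gr.Markdown()
    question_md = gr.Markdown()
    entity_type_dropdown = gr.Dropdown(label="Entity type")
    entity_textbox = gr.Textbox(label="Entity")
    entity_add_button = gr.Button("Add entity")
    entity_reset_button = gr.Button("Reset entities")
    answer_textbox = gr.Textbox(label="Answer")
    submit_button = gr.Button("Submit")
    iteration_state = gr.State(0)  # iteration index of the datasets to evaluate

    dataset_dropdown.change(
        fn=evaluator.update_dataset,
        inputs=[dataset_dropdown, iteration_state],
        outputs=[
            task_examples_md,
            question_md,
            entity_type_dropdown,
            entity_textbox,
            entity_add_button,
            entity_reset_button,
            answer_textbox,
            submit_button,
        ],
    )
```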
source method HumanEvaluator.add_entity_to_answer(question: str, entity_type: str, entity: str, answer: str) → tuple[gr.Textbox, gr.Textbox]
Add an entity to the answer.
Parameters
- question : str — The current question.
- entity_type : str — The entity type selected by the user.
- entity : str — The entity provided by the user.
- answer : str — The current answer.
Returns
- tuple[gr.Textbox, gr.Textbox] — A tuple (entity, answer), with the entity textbox blanked and the answer updated.
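Continuing the sketch, the add-entity button could invoke this callback. Using a gr.Markdown component as an event input is assumed to be supported by the installed Gradio version.

```python
    # Still inside the illustrative gr.Blocks context from the sketches above.
    entity_add_button.click(
        fn=evaluator.add_entity_to_answer,
        inputs=[question_md, entity_type_dropdown, entity_textbox, answer_textbox],
        outputs=[entity_textbox, answer_textbox],
    )
```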
source method HumanEvaluator.reset_entities() → gr.Textbox
Reset the entities in the answer.
Returns
- gr.Textbox — A blank answer.
source method HumanEvaluator.submit_answer(dataset_name: str, question: str, answer: str, annotator_id: int) → tuple[str, str]
Submit an answer to the dataset.
Parameters
- dataset_name : str — The name of the dataset.
- question : str — The question for the dataset.
- answer : str — The answer to the question.
- annotator_id : int — The annotator ID for the evaluation.
Returns
- tuple[str, str] — A tuple (question, answer), with question being the next question and answer being an empty string.
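Continuing the sketch, the submit button could call submit_answer. Holding the annotator ID in a gr.State and reusing the Markdown question component as an input are assumptions for illustration only.

```python
    # Still inside the illustrative gr.Blocks context from the sketches above.
    annotator_id_state = gr.State(0)  # illustrative; the real app may supply this differently

    submit_button.click(
        fn=evaluator.submit_answer,
        inputs=[dataset_dropdown, question_md, answer_textbox, annotator_id_state],
        outputs=[question_md, answer_textbox],
    )
```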
source method HumanEvaluator.example_to_markdown(example: dict) → tuple[str, str]
Convert an example to a Markdown string.
Parameters
- example : dict — The example to convert.
Returns
- tuple[str, str] — A tuple (task_examples, question) for the example.
source method HumanEvaluator.compute_and_log_scores() → None
Compute and log the scores for the dataset.