scandeval.human_evaluation
source module scandeval.human_evaluation
Gradio app for conducting human evaluation of the tasks.
Classes
- HumanEvaluator — An app for evaluating human performance on the ScandEval benchmark.
Functions
- main — Start the Gradio app for human evaluation.
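A minimal sketch of launching the app through main. Whether main accepts command-line options (for example, an annotator ID) is not documented here, so calling it without arguments is an assumption.

```python
# Illustrative sketch; assumes `main` can be called without arguments.
from scandeval.human_evaluation import main

if __name__ == "__main__":
    main()  # starts the Gradio app for human evaluation
```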
source class HumanEvaluator(title: str, description: str, dummy_model_id: str = 'mistralai/Mistral-7B-v0.1')
An app for evaluating human performance on the ScandEval benchmark.
Initialize the HumanEvaluator.
Parameters
- annotator_id : int — The annotator ID for the evaluation.
- title : str — The title of the app.
- description : str — The description of the app.
- dummy_model_id : str — The model ID to use for generating prompts.
Methods
- create_app — Create the Gradio app for human evaluation.
- update_dataset_choices — Update the dataset choices based on the selected language and task.
- update_dataset — Update the dataset based on a selected dataset name.
- add_entity_to_answer — Add an entity to the answer.
- reset_entities — Reset the entities in the answer.
- submit_answer — Submit an answer to the dataset.
- example_to_markdown — Convert an example to a Markdown string.
- compute_and_log_scores — Compute and log the scores for the dataset.
source method HumanEvaluator.create_app() → gr.Blocks
Create the Gradio app for human evaluation.
Returns
- gr.Blocks — The Gradio app for human evaluation.
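A minimal sketch of programmatic use, assuming the constructor arguments shown in the signature above (the annotator_id documented in the parameter list may also be required, depending on the installed version). Since create_app returns a gr.Blocks instance, Gradio's standard launch() serves it.

```python
# Illustrative sketch, not the library's own launch code.
from scandeval.human_evaluation import HumanEvaluator

evaluator = HumanEvaluator(
    title="ScandEval human evaluation",
    description="Evaluate your own performance on the benchmark datasets.",
    # dummy_model_id defaults to "mistralai/Mistral-7B-v0.1"
)

app = evaluator.create_app()  # returns a gr.Blocks instance
app.launch()                  # standard Gradio method for serving the app
```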
source method HumanEvaluator.update_dataset_choices(language: str | None, task: str | None) → gr.Dropdown
Update the dataset choices based on the selected language and task.
Parameters
- language : str | None — The language selected by the user.
- task : str | None — The task selected by the user.
Returns
- gr.Dropdown — A dropdown whose choices are the dataset names matching the selected language and task.
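A hedged sketch of how update_dataset_choices might be wired inside a Gradio Blocks context. The component names, languages, and task choices are illustrative assumptions, not taken from the library.

```python
import gradio as gr

from scandeval.human_evaluation import HumanEvaluator

evaluator = HumanEvaluator(
    title="ScandEval human evaluation",
    description="Evaluate your own performance on the benchmark datasets.",
)

with gr.Blocks() as demo:
    language_dropdown = gr.Dropdown(label="Language", choices=["da", "no", "sv"])
    task_dropdown = gr.Dropdown(label="Task", choices=["named-entity-recognition"])
    dataset_dropdown = gr.Dropdown(label="Dataset", choices=[])

    # Refresh the dataset choices whenever the language or task selection changes.
    language_dropdown.change(
        fn=evaluator.update_dataset_choices,
        inputs=[language_dropdown, task_dropdown],
        outputs=dataset_dropdown,
    )
    task_dropdown.change(
        fn=evaluator.update_dataset_choices,
        inputs=[language_dropdown, task_dropdown],
        outputs=dataset_dropdown,
    )
```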
source method HumanEvaluator.update_dataset(dataset_name: str, iteration: int) → tuple[gr.Markdown, gr.Markdown, gr.Dropdown, gr.Textbox, gr.Button, gr.Button, gr.Textbox, gr.Button]
Update the dataset based on a selected dataset name.
Parameters
- dataset_name : str — The dataset name selected by the user.
- iteration : int — The iteration index of the datasets to evaluate.
Returns
- tuple[gr.Markdown, gr.Markdown, gr.Dropdown, gr.Textbox, gr.Button, gr.Button, gr.Textbox, gr.Button] — A tuple (task_examples, question, entity_type, entity, entity_add_button, entity_reset_button, answer, submit_button) for the selected dataset.
Raises
- NotImplementedError
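Continuing the illustrative sketch above, the dataset dropdown could drive update_dataset. The eight output components mirror the documented return tuple; holding the iteration index in a gr.State is an assumption about how the real app tracks it.

```python
    # Still inside the gr.Blocks context from the sketch above.
    task_examples_md = gr.Markdown()
    question_md = gr.Markdown()
    entity_type_dropdown = gr.Dropdown(label="Entity type")
    entity_textbox = gr.Textbox(label="Entity")
    entity_add_button = gr.Button("Add entity")
    entity_reset_button = gr.Button("Reset entities")
    answer_textbox = gr.Textbox(label="Answer")
    submit_button = gr.Button("Submit")
    iteration_state = gr.State(0)  # iteration index of the datasets to evaluate

    dataset_dropdown.change(
        fn=evaluator.update_dataset,
        inputs=[dataset_dropdown, iteration_state],
        outputs=[
            task_examples_md,
            question_md,
            entity_type_dropdown,
            entity_textbox,
            entity_add_button,
            entity_reset_button,
            answer_textbox,
            submit_button,
        ],
    )
```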
source method HumanEvaluator.add_entity_to_answer(question: str, entity_type: str, entity: str, answer: str) → tuple[gr.Textbox, gr.Textbox]
Add an entity to the answer.
Parameters
- question : str — The current question.
- entity_type : str — The entity type selected by the user.
- entity : str — The entity provided by the user.
- answer : str — The current answer.
Returns
- tuple[gr.Textbox, gr.Textbox] — A tuple (entity, answer), with the entity textbox blanked and the answer updated.
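Continuing the sketch, the add-entity button could invoke this callback. Using a gr.Markdown component as an event input is assumed to be supported by the installed Gradio version.

```python
    # Still inside the illustrative gr.Blocks context from the sketches above.
    entity_add_button.click(
        fn=evaluator.add_entity_to_answer,
        inputs=[question_md, entity_type_dropdown, entity_textbox, answer_textbox],
        outputs=[entity_textbox, answer_textbox],
    )
```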
source method HumanEvaluator.reset_entities() → gr.Textbox
Reset the entities in the answer.
Returns
- gr.Textbox — A blank answer.
source method HumanEvaluator.submit_answer(dataset_name: str, question: str, answer: str, annotator_id: int) → tuple[str, str]
Submit an answer to the dataset.
Parameters
- dataset_name : str — The name of the dataset.
- question : str — The question for the dataset.
- answer : str — The answer to the question.
- annotator_id : int — The annotator ID for the evaluation.
Returns
- tuple[str, str] — A tuple (question, answer), with question being the next question and answer being an empty string.
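Continuing the sketch, the submit button could call submit_answer. Holding the annotator ID in a gr.State and reusing the Markdown question component as an input are assumptions for illustration only.

```python
    # Still inside the illustrative gr.Blocks context from the sketches above.
    annotator_id_state = gr.State(0)  # illustrative; the real app may supply this differently

    submit_button.click(
        fn=evaluator.submit_answer,
        inputs=[dataset_dropdown, question_md, answer_textbox, annotator_id_state],
        outputs=[question_md, answer_textbox],
    )
```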
source method HumanEvaluator.example_to_markdown(example: dict) → tuple[str, str]
Convert an example to a Markdown string.
Parameters
- example : dict — The example to convert.
Returns
- tuple[str, str] — A tuple (task_examples, question) for the example.
source method HumanEvaluator.compute_and_log_scores() → None
Compute and log the scores for the dataset.