Summarization

📚 Overview

Summarization is the task of generating a shorter version of a given text while preserving its main points. The model receives a long text and has to generate a shorter version of it, typically a handful of sentences long. This is abstractive summarization, meaning that the summary typically does not appear verbatim in the original text; instead, the model has to generate new text based on the input.

When evaluating generative models, we allow the model to generate up to 256 tokens on this task.
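To make this generation limit concrete, here is a minimal sketch of capped summary generation using the HuggingFace transformers library. This is not ScandEval's internal evaluation code; the model id, prompt format, and document are placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model id; substitute the model you want to inspect.
model_id = "<model-id>"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

document = "..."  # the long input text to be summarized
prompt = f"Summarize the following text:\n\n{document}\n\nSummary:"

inputs = tokenizer(prompt, return_tensors="pt")
# The cap used on this task: at most 256 newly generated tokens.
output_ids = model.generate(**inputs, max_new_tokens=256)
summary = tokenizer.decode(
    output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(summary)
```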

📊 Metrics

The primary metric used to evaluate the performance of a model on the summarization task is BERTScore, which uses a pretrained encoder model to encode each token in both the reference summary and the generated summary, and then uses cosine similarity to measure how well the tokens match up. Using an encoder model allows the model to phrase a summary differently from the reference while still being rewarded for capturing the same meaning. We use the microsoft/mdeberta-v3-base encoder model for all languages, as it is consistently the best performing encoder model across all languages in the framework.
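As an illustrative sketch, not ScandEval's internal implementation, BERTScore can be computed with the bert_score package using the encoder named above. The example summaries are made up, and the num_layers value (which hidden layer the token embeddings are pooled from) is an assumption that may need tuning.

```python
from bert_score import score

generated = ["The parliament passed the new budget after a long debate."]
references = ["After lengthy debate, parliament approved the new budget."]

# model_type matches the encoder named above; num_layers is an
# assumption here, not the framework's actual setting.
precision, recall, f1 = score(
    generated,
    references,
    model_type="microsoft/mdeberta-v3-base",
    num_layers=12,
)
print(f"BERTScore F1: {f1.mean().item():.3f}")
```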

We also report the ROUGE-L score, which measures the length of the longest common subsequence of words between the generated summary and the reference summary. This is a more traditional summarization metric, which is why we report it as well, but it correlates less well with human judgments than BERTScore.
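To make the metric concrete, here is a minimal sentence-level sketch of a ROUGE-L F1 score computed from the longest common subsequence. A real ROUGE implementation, as used in the framework, adds tokenization and aggregation details that this sketch omits.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    # Classic dynamic-programming longest-common-subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(candidate: str, reference: str) -> float:
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1(
    "the parliament passed the new budget",
    "parliament approved the new budget",
))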

🛠️ How to run

In the command line interface of the ScandEval Python package, you can benchmark your favorite model on the summarization task like so:

$ scandeval --model <model-id> --task summarization
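The same benchmark can also be run from Python via the package's Benchmarker class. The following is a sketch; the exact argument names may differ between versions, so consult the ScandEval documentation for the precise signature.

```python
from scandeval import Benchmarker

# Assumption: the callable Benchmarker interface accepts a model id and
# a task name mirroring the CLI flags above.
benchmark = Benchmarker()
benchmark(model="<model-id>", task="summarization")
```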