# Speed
## 📚 Overview
Speed is a task that measures how quickly a model can process a given input. The model receives text passages of varying lengths and has to process each document as quickly as possible, including tokenisation of the input. We let the model process the input repeatedly for 3 seconds and then measure how many documents it processed in that time. We use the pyinfer package to perform the speed measurement.
Speed naturally depends on the available hardware, and for APIs it also fluctuates with the number of requests in the queue, so the speed benchmark should be taken as a rough estimate of the model's speed rather than an exact measurement.
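The timed loop described above can be sketched as follows. This is a simplified illustration of a timed benchmark loop, not the actual pyinfer implementation; `process` and `documents` are hypothetical placeholders for the model call (including tokenisation) and the input passages.

```python
import time

def measure_speed(process, documents, n_seconds=3.0):
    """Repeatedly run `process` on each document for `n_seconds` and
    return the average number of documents processed per second.

    Illustrative sketch only; the real benchmark uses the pyinfer package.
    """
    speeds = []
    for doc in documents:
        iterations = 0
        start = time.perf_counter()
        # Keep processing the same document until the time budget runs out.
        while time.perf_counter() - start < n_seconds:
            process(doc)
            iterations += 1
        elapsed = time.perf_counter() - start
        speeds.append(iterations / elapsed)
    return sum(speeds) / len(speeds)
```

For example, `measure_speed(lambda d: d.split(), ["some passage"], n_seconds=0.1)` returns a positive rate; in the real benchmark, `process` would be the model's forward pass.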
## 📊 Metrics
The primary metric used to evaluate a model on the speed task is the average number of GPT-2 tokens processed per second on GPUs, measured while the model processes documents with roughly 100, 200, ..., 1,000 tokens. If the model is only accessible through an API, then the speed is measured through the API. The GPUs used vary with the size of the model: we preferably use an NVIDIA RTX 3090 Ti GPU if the model has fewer than ~8B parameters, and one or more NVIDIA A100 GPUs if the model is larger.
The secondary metric is the same, but measured on shorter documents with roughly 12.5, 15, ..., 125 tokens.
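Combining the two ingredients above, the tokens-per-second metric can be computed from the per-document GPT-2 token counts and the per-document processing rates. The function below is a minimal sketch of that aggregation under the assumption that both lists are already available; the names are hypothetical, not part of the ScandEval API.

```python
def tokens_per_second(token_counts, docs_per_second):
    """Average GPT-2 tokens processed per second across documents.

    `token_counts[i]` is the number of GPT-2 tokens in document i, and
    `docs_per_second[i]` is how many times per second the model processed
    that document. Both are illustrative inputs, not real ScandEval names.
    """
    # Tokens/sec for one document = its token count times its processing rate.
    rates = [n * r for n, r in zip(token_counts, docs_per_second)]
    return sum(rates) / len(rates)
```

For instance, a model that processes a 100-token document 10 times per second and a 200-token document 5 times per second averages 1,000 tokens per second.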
## 🛠️ How to run
In the command-line interface of the ScandEval Python package, you can benchmark your favourite model on the speed task like so:

```bash
$ scandeval --model <model-id> --task speed
```