
OLMo-Eval

OLMo-Eval is a repository for evaluating open language models.

Overview

The olmo_eval framework is a way to run evaluation pipelines for language models on NLP tasks. The codebase is extensible and contains task_sets and example configurations, which run a series of Tango steps for computing model outputs and metrics.

Using this pipeline, you can evaluate m models on t task_sets, where each task_set consists of one or more individual tasks. Using task_sets allows you to compute aggregate metrics over multiple tasks. The optional google-sheet integration can be used for reporting.
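Conceptually, a run covers the cross product of models and task_sets. A minimal sketch of that structure, using names from this README purely as illustration (this is not the actual config schema):

# Illustration only: the pipeline evaluates every (model, task_set) pair
models = ["EleutherAI/pythia-1b", "mpt-7b"]        # m models
task_sets = ["gen_tasks", "standard_benchmarks"]   # t task_sets

runs = [(model, ts) for model in models for ts in task_sets]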

The pipeline is built using ai2-tango and ai2-catwalk.

Installation

After cloning the repository, run:

conda create -n eval-pipeline python=3.10
conda activate eval-pipeline
cd OLMo-Eval
pip install -e .
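As a quick sanity check of the install (assuming the package is importable under the olmo_eval module name used in this repository):

# Verifies the editable install; assumes the module name is olmo_eval
import olmo_eval
print(olmo_eval.__file__)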

Quickstart

The current task_sets can be found at configs/task_sets. In this example, we run gen_tasks on EleutherAI/pythia-1b. The example config is at configs/example_config.jsonnet.

The configuration can be run as follows:

tango --settings tango.yml run configs/example_config.jsonnet --workspace my-eval-workspace

This executes all the steps defined in the config and saves them in a local tango workspace called my-eval-workspace. If you add a new task_set or model to your config and run the same command again, it will reuse the previous outputs and only compute the new ones.

The output should look like this:

[Screenshot: example pipeline run output]

New models and datasets can be added by modifying the example configuration.

Load pipeline output

from tango import Workspace

# Open the local workspace created by the pipeline run above
workspace = Workspace.from_url("local://my-eval-workspace")

# "combine-all-outputs" is the final aggregation step in the example config
result = workspace.step_result("combine-all-outputs")
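If the combined result is tabular, e.g. a pandas DataFrame (an assumption, not something this README states), it can be previewed or exported directly:

# Assumption: `result` behaves like a pandas DataFrame
print(result.head())               # preview the aggregated metrics
result.to_csv("all_outputs.csv")   # export, e.g. for reporting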

Load individual task results with per-instance outputs:

result = workspace.step_result("outputs_pythia-1bstep140000_gen_tasks_drop")
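The step name appears to follow the pattern outputs_<model>_<task_set>_<task>; this is inferred from the example above rather than documented. A small helper built on that assumption:

# Hypothetical helper; the step-name pattern is inferred, not documented
def output_step_name(model: str, task_set: str, task: str) -> str:
    return f"outputs_{model}_{task_set}_{task}"

result = workspace.step_result(output_step_name("pythia-1bstep140000", "gen_tasks", "drop"))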

Evaluating common models on standard benchmarks

The eval_table config evaluates falcon-7b, mpt-7b, llama2-7b, and llama2-13b on standard_benchmarks and MMLU. Run it as follows:

tango --settings tango.yml run configs/eval_table.jsonnet --workspace my-eval-workspace

PALOMA

This repository was also used to run the evaluations for the PALOMA paper.

Details on running the evaluation on PALOMA can be found here.

Advanced