
OLMo-Eval

OLMo-Eval is a repository for evaluating open language models.

Overview

The olmo_eval framework is a way to run evaluation pipelines for language models on NLP tasks. The codebase is extensible and contains task_sets and example configurations, which run a series of Tango steps for computing model outputs and metrics.

Using this pipeline, you can evaluate m models on t task_sets, where each task_set consists of one or more individual tasks. Using task_sets allows you to compute aggregate metrics over multiple tasks. The optional google-sheet integration can be used for reporting.
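Conceptually, a run covers the cross product of models and task_sets. A minimal sketch of that structure, using names from this README purely as illustration (this is not the actual config schema):

# Illustration only: the pipeline evaluates every (model, task_set) pair
models = ["EleutherAI/pythia-1b", "mpt-7b"]        # m models
task_sets = ["gen_tasks", "standard_benchmarks"]   # t task_sets

runs = [(model, ts) for model in models for ts in task_sets]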

The pipeline is built using ai2-tango and ai2-catwalk.

Installation

After cloning the repository, run:

conda create -n eval-pipeline python=3.10
conda activate eval-pipeline
cd OLMo-Eval
pip install -e .
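As a quick sanity check of the install (assuming the package is importable under the olmo_eval module name used in this repository):

# Verifies the editable install; assumes the module name is olmo_eval
import olmo_eval
print(olmo_eval.__file__)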

Quickstart

The current task_sets can be found at configs/task_sets. In this example, we run gen_tasks on EleutherAI/pythia-1b. The example config is at configs/example_config.jsonnet.

The configuration can be run as follows:

tango --settings tango.yml run configs/example_config.jsonnet --workspace my-eval-workspace

This executes all the steps defined in the config and saves them in a local tango workspace called my-eval-workspace. If you add a new task_set or model to your config and run the same command again, it will reuse the previous outputs and only compute the new ones.

The output should look like this:

[Screenshot: example pipeline run output]

New models and datasets can be added by modifying the example configuration.

Load pipeline output

from tango import Workspace

# Open the local workspace created by the pipeline run above
workspace = Workspace.from_url("local://my-eval-workspace")

# "combine-all-outputs" is the final aggregation step in the example config
result = workspace.step_result("combine-all-outputs")
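If the combined result is tabular, e.g. a pandas DataFrame (an assumption, not something this README states), it can be previewed or exported directly:

# Assumption: `result` behaves like a pandas DataFrame
print(result.head())               # preview the aggregated metrics
result.to_csv("all_outputs.csv")   # export, e.g. for reporting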

Load individual task results with per-instance outputs:

result = workspace.step_result("outputs_pythia-1bstep140000_gen_tasks_drop")
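The step name appears to follow the pattern outputs_<model>_<task_set>_<task>; this is inferred from the example above rather than documented. A small helper built on that assumption:

# Hypothetical helper; the step-name pattern is inferred, not documented
def output_step_name(model: str, task_set: str, task: str) -> str:
    return f"outputs_{model}_{task_set}_{task}"

result = workspace.step_result(output_step_name("pythia-1bstep140000", "gen_tasks", "drop"))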

Evaluating common models on standard benchmarks

The eval_table config evaluates falcon-7b, mpt-7b, llama2-7b, and llama2-13b on standard_benchmarks and MMLU. Run it as follows:

tango --settings tango.yml run configs/eval_table.jsonnet --workspace my-eval-workspace

PALOMA

This repository was also used to run the evaluations for the PALOMA paper.

Details on running the evaluation on PALOMA can be found here.

Advanced