OLMo-Eval is a repository for evaluating open language models.

The `olmo_eval` framework is a way to run evaluation pipelines for language models on NLP tasks. The codebase is extensible and contains `task_sets` and example configurations, which run a series of tango steps to compute model outputs and metrics.

Using this pipeline, you can evaluate m models on t task_sets, where each task_set consists of one or more individual tasks. Using task_sets allows you to compute aggregate metrics over multiple tasks. The optional `google-sheet` integration can be used for reporting.

The pipeline is built using ai2-tango and ai2-catwalk.
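Conceptually, the pipeline fans out over every (model, task_set) pair and then aggregates metrics within each task_set. The sketch below only illustrates that shape; the `evaluate` and `aggregate` helpers are hypothetical stand-ins and are not part of the repository's API.

```python
from typing import Dict

def evaluate(model: str, task: str) -> Dict[str, float]:
    """Hypothetical stand-in for running one model on one task."""
    return {"accuracy": 0.0}

def aggregate(per_task: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Hypothetical stand-in for averaging metrics across a task_set."""
    return {"accuracy": sum(m["accuracy"] for m in per_task.values()) / len(per_task)}

models = ["EleutherAI/pythia-1b"]                    # m models
task_sets = {"gen_tasks": ["drop", "naturalqs"]}     # t task_sets, each with one or more tasks

results = {}
for model in models:
    for task_set, tasks in task_sets.items():
        per_task = {task: evaluate(model, task) for task in tasks}  # one evaluation per (model, task)
        results[(model, task_set)] = aggregate(per_task)            # aggregate metrics over the task_set
print(results)
```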
After cloning the repository, please run:
```bash
conda create -n eval-pipeline python=3.10
conda activate eval-pipeline
cd OLMo-Eval
pip install -e .
```
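To verify that the editable install succeeded, a quick check from Python is shown below; it assumes the package installs an importable `olmo_eval` module (as the framework name above suggests).

```python
# Optional post-install sanity check.
# Assumption: the repository installs an importable `olmo_eval` package.
import importlib.metadata

import olmo_eval  # noqa: F401  (fails loudly if the install did not work)

print("olmo_eval imported OK")
print("ai2-tango version:", importlib.metadata.version("ai2-tango"))
```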
## Quickstart

The current `task_sets` can be found at `configs/task_sets`. In this example, we run `gen_tasks` on `EleutherAI/pythia-1b`. The example config is at `configs/example_config.jsonnet`.
The configuration can be run as follows:
```bash
tango --settings tango.yml run configs/example_config.jsonnet --workspace my-eval-workspace
```
This executes all the steps defined in the config, and saves them in a local `tango` workspace called `my-eval-workspace`. If you add a new task_set or model to your config and run the same command again, it will reuse the previous outputs and only compute the new outputs.
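To see which step results are already cached in the workspace (and will therefore be reused on the next run), you can open it from Python. This is a sketch against the ai2-tango `Workspace` API; the exact attributes of a run may vary across tango versions.

```python
from tango import Workspace

# Open the local workspace that the pipeline writes to.
workspace = Workspace.from_url("local://my-eval-workspace")

# Each invocation of `tango run` is registered as a run; its steps are the
# cached units of work that later runs can reuse.
for run_name, run in workspace.registered_runs().items():
    print(run_name, "->", sorted(run.steps))
```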
The output should look like this:
New models and datasets can be added by modifying the example configuration.
## Load pipeline output

```python
from tango import Workspace

workspace = Workspace.from_url("local://my-eval-workspace")
result = workspace.step_result("combine-all-outputs")
```
Load individual task results with per-instance outputs:

```python
result = workspace.step_result("outputs_pythia-1bstep140000_gen_tasks_drop")
```
## Evaluating common models on standard benchmarks

The `eval_table` config evaluates `falcon-7b`, `mpt-7b`, `llama2-7b`, and `llama2-13b` on `standard_benchmarks` and `MMLU`. Run as follows:

```bash
tango --settings tango.yml run configs/eval_table.jsonnet --workspace my-eval-workspace
```
## PALOMA

This repository was also used to run evaluations for the PALOMA paper. Details on running the evaluation on PALOMA can be found here.

## Advanced