setfit - Efficient few-shot learning with Sentence Transformers

Created at: 2022-06-30 15:10:15

Language: Python

Repository: https://github.com/huggingface/setfit

License: Apache-2.0

SetFit - Efficient Few-shot Learning with Sentence Transformers

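We introduce SetFit, an efficient and prompt-free framework for few-shot fine-tuning of Sentence Transformers. SetFit achieves high accuracy with little labeled data - for example, with only 8 labeled examples per class on the Customer Reviews sentiment dataset, SetFit is competitive with fine-tuning RoBERTa Large on the full training set of 3k examples 🤯!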

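Compared to other few-shot learning methods, SetFit has several unique features:

🗣 No prompts or verbalisers: Current techniques for few-shot fine-tuning require handcrafted prompts or verbalisers to convert examples into a format suitable for the underlying language model. SetFit dispenses with prompts altogether by generating rich embeddings directly from text examples.
🏎 Fast to train: SetFit doesn't require large-scale models like T0 or GPT-3 to achieve high accuracy. As a result, it is typically an order of magnitude (or more) faster to train and run inference with.
🌎 Multilingual support: SetFit can be used with any Sentence Transformer on the Hub, which means you can classify text in multiple languages by simply fine-tuning a multilingual checkpoint.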

Getting started

Installation

Download and install `setfit` by running:

python -m pip install setfit
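
To confirm that the install succeeded, a quick check along these lines may help. It is only a sketch built on the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical and not part of `setfit`.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("setfit")
    print(f"setfit version: {found}" if found else "setfit is not installed")
```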

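Training a SetFit model

`setfit` is integrated with the Hugging Face Hub and provides two main classes:

`SetFitModel`: a wrapper that combines a pretrained body from `sentence_transformers` and a classification head from `scikit-learn`.
`SetFitTrainer`: a helper class that wraps the fine-tuning process of SetFit.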
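
The split between an embedding body and a classification head can be pictured with a dependency-free toy. This is a sketch of the idea only, not the `SetFitModel` implementation: `toy_encode` is a hypothetical stand-in for a Sentence Transformer body (it merely counts a few vocabulary words), `CentroidHead` is a hypothetical stand-in for the scikit-learn head, and `BodyWithHead` is an illustrative wrapper name.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence

Vector = List[float]

# Hypothetical vocabulary used by the toy encoder below.
VOCAB = ["good", "great", "love", "bad", "awful", "hate"]


def toy_encode(texts: Sequence[str]) -> List[Vector]:
    """Stand-in for the body: map each text to a word-count vector over VOCAB."""
    vectors = []
    for text in texts:
        words = text.lower().split()
        vectors.append([float(words.count(term)) for term in VOCAB])
    return vectors


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


class CentroidHead:
    """Stand-in for the head: predict the label of the nearest class centroid."""

    def __init__(self) -> None:
        self.centroids: Dict[int, Vector] = {}

    def fit(self, vectors: Sequence[Vector], labels: Sequence[int]) -> None:
        grouped = defaultdict(list)
        for vector, label in zip(vectors, labels):
            grouped[label].append(vector)
        for label, members in grouped.items():
            # Average each dimension over the class members.
            self.centroids[label] = [sum(col) / len(members) for col in zip(*members)]

    def predict(self, vectors: Sequence[Vector]) -> List[int]:
        predictions = []
        for vector in vectors:
            nearest = min(
                self.centroids,
                key=lambda label: squared_distance(vector, self.centroids[label]),
            )
            predictions.append(nearest)
        return predictions


class BodyWithHead:
    """Wrapper that embeds text with the body, then classifies with the head."""

    def __init__(self, encode: Callable[[Sequence[str]], List[Vector]], head: CentroidHead) -> None:
        self.encode = encode
        self.head = head

    def fit(self, texts: Sequence[str], labels: Sequence[int]) -> None:
        self.head.fit(self.encode(texts), labels)

    def __call__(self, texts: Sequence[str]) -> List[int]:
        return self.head.predict(self.encode(texts))


model = BodyWithHead(toy_encode, CentroidHead())
model.fit(["great movie", "i love it", "awful plot", "i hate it"], [1, 1, 0, 0])
preds = model(["love great acting", "bad and awful"])
print(preds)  # [1, 0]
```

The wrapper only needs something that turns text into vectors and something that fits and predicts on those vectors, which is why the two parts can come from different libraries.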

Here is an end-to-end example:

from datasets import load_dataset
from sentence_transformers.losses import CosineSimilarityLoss

from setfit import SetFitModel, SetFitTrainer


# Load a dataset from the Hugging Face Hub
dataset = load_dataset("emotion")

# Simulate the few-shot regime by sampling 8 * num_classes examples
# (a random draw, so class counts are not exactly 8 each)
num_classes = 6
train_ds = dataset["train"].shuffle(seed=42).select(range(8 * num_classes))
test_ds = dataset["test"]

# Load SetFit model from Hub
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

# Create trainer
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    loss_class=CosineSimilarityLoss,
    batch_size=16,
    num_iterations=20, # The number of text pairs to generate
)

# Train and evaluate
trainer.train()
metrics = trainer.evaluate()

# Push model to the Hub
trainer.push_to_hub("my-awesome-setfit-model")
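
The `num_iterations` argument controls how many text pairs are generated from the few labeled examples for contrastive fine-tuning. The sketch below illustrates the general idea under stated assumptions: on each iteration, every example is paired once with a same-label example (target 1.0) and once with a different-label example (target 0.0). The function name `generate_pairs` is hypothetical, and the library's own sampling may differ in detail.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[str, str, float]


def generate_pairs(
    texts: Sequence[str],
    labels: Sequence[int],
    num_iterations: int,
    seed: int = 42,
) -> List[Pair]:
    """Build (text_a, text_b, similarity_target) pairs for contrastive training."""
    rng = random.Random(seed)
    pairs: List[Pair] = []
    for _ in range(num_iterations):
        for i, (text, label) in enumerate(zip(texts, labels)):
            # Candidates with the same label, excluding the example itself.
            positives = [t for j, t in enumerate(texts) if labels[j] == label and j != i]
            # Candidates with any other label.
            negatives = [t for j, t in enumerate(texts) if labels[j] != label]
            if positives:
                pairs.append((text, rng.choice(positives), 1.0))
            if negatives:
                pairs.append((text, rng.choice(negatives), 0.0))
    return pairs


example_texts = ["great", "love it", "awful", "hate it"]
example_labels = [1, 1, 0, 0]
example_pairs = generate_pairs(example_texts, example_labels, num_iterations=20)
print(len(example_pairs))  # 160
```

Under this scheme, 48 training examples with `num_iterations=20` would yield 48 × 2 × 20 = 1,920 pairs, provided every class has at least two examples. This is how a handful of labels can turn into a much larger training signal for the embedding body.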

For more examples, check out the `notebooks/` folder.

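Reproducing the results from the paper

We provide scripts to reproduce the results for SetFit and the various baselines presented in Table 2 of our paper. See the `scripts/` directory for setup and training instructions.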

Developer installation

To run the code in this project, first create a Python virtual environment using e.g. Conda:

conda create -n setfit python=3.9 && conda activate setfit

Then install the base requirements with:

python -m pip install -e '.[dev]'

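This will install `datasets` and packages like `black` and `isort` that we use to ensure consistent code formatting. Next, go to one of the dedicated baseline directories and install the extra dependencies, e.g.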

cd scripts/setfit
python -m pip install -r requirements.txt

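Formatting your code

We use `black` and `isort` to ensure consistent code formatting. After following the installation steps, you can check your code locally by running: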

make style && make quality

Project structure

├── LICENSE
├── Makefile        <- Makefile with commands like `make style` or `make tests`
├── README.md       <- The top-level README for developers using this project.
├── notebooks       <- Jupyter notebooks.
├── final_results   <- Model predictions from the paper
├── scripts         <- Scripts for training and inference
├── setup.cfg       <- Configuration file to define package metadata
├── setup.py        <- Make this project pip installable with `pip install -e`
├── src             <- Source code for SetFit
└── tests           <- Unit tests

Citation

@misc{https://doi.org/10.48550/arxiv.2209.11055,
  doi = {10.48550/ARXIV.2209.11055},
  url = {https://arxiv.org/abs/2209.11055},
  author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
  keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {Efficient Few-Shot Learning Without Prompts},
  publisher = {arXiv}, 
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}}