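galai - GALACTICA is a general-purpose scientific language model. It is trained on a large corpus of scientific text and data. It can perform scientific NLP tasks at a high level, as well as tasks such as citation prediction, mathematical reasoning, molecular property prediction and protein annotation.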

Created at: 2022-11-15 18:30:21
Language: Python
License: Apache-2.0




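GALACTICA is a general-purpose scientific language model. It is trained on a large corpus of scientific text and data. It can perform scientific NLP tasks at a high level, as well as tasks such as citation prediction, mathematical reasoning, molecular property prediction and protein annotation. More information is available at galactica.org.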

Install

From pip:

pip install galai

From the repository:

pip install git+https://github.com/paperswithcode/galai

Models

There are five GALACTICA models available, which we detail below:

Size      Parameters
mini      125 M
base      1.3 B
standard  6.7 B
large     30 B
huge      120 B
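To get a feel for what these sizes mean in practice, here is a rough sketch that estimates the memory needed for the weights alone. The parameter counts come from the sizes listed above; the helper name `weight_memory_gb` and the 2-bytes-per-parameter default (16-bit weights) are assumptions for illustration, and real usage will be higher once activations and generation caches are included.

```python
# Parameter counts taken from the model sizes listed above.
PARAMS = {
    "mini": 125e6,
    "base": 1.3e9,
    "standard": 6.7e9,
    "large": 30e9,
    "huge": 120e9,
}


def weight_memory_gb(size: str, bytes_per_param: int = 2) -> float:
    """Approximate memory for the weights alone, in GB (10**9 bytes).

    bytes_per_param=2 assumes 16-bit weights; use 4 for 32-bit floats.
    This is a hypothetical helper, not part of galai.
    """
    return PARAMS[size] * bytes_per_param / 1e9


for name in PARAMS:
    print(f"{name:>8}: ~{weight_memory_gb(name):.1f} GB in 16-bit")
```

This is only a lower bound for planning which size fits on a given device.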

Quickstart

import galai as gal

model = gal.load_model("standard")
model.generate("Scaled dot product attention:\n\n\\[")
# Scaled dot product attention:\n\n\\[ \\displaystyle\\text{Attention}(Q,K,V)=\\text{softmax}(\\frac{QK^{T}}{\\sqrt{d_{k}}}%\n)V \\]

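You can also find all the model weights, with their model cards and inference widgets, on the Hugging Face Hub. All the models can be used out of the box with the transformers library.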

pip install transformers accelerate

You can run inference using the high-level pipeline API:

from transformers import pipeline

model = pipeline("text-generation", model="facebook/galactica-6.7b")
input_text = "The Transformer architecture [START_REF]"
model(input_text)

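Or, for more control, you can use the lower-level OPTForCausalLM class. See the model cards of the respective repositories to learn how to use the model on CPU, on GPU and in different precisions.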

from transformers import AutoTokenizer, OPTForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-6.7b")
model = OPTForCausalLM.from_pretrained("facebook/galactica-6.7b", device_map="auto")

input_text = "The Transformer architecture [START_REF]"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))

Capabilities

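GALACTICA is a stand-alone LM that is not instruction tuned. Because of this, you need to use the correct prompts to get good results. In this note, we go over some of the special tokens and the prompt styles you will need to use.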

We demonstrate some examples below using the standard (6.7B) model.

📚 Predict Citations

You need to use the [START_REF] token:

model.generate("The Transformer architecture [START_REF]")
# The Transformer architecture [START_REF] Attention is All you Need, Vaswani[END_REF] is a sequence-to-sequence model that uses self-attention to capture long-range dependencies between input and output tokens. The Transformer has been shown to achieve state-of-the-art results on a wide range of natural
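The generated text wraps each predicted citation between [START_REF] and [END_REF], as in the output above. A minimal sketch for pulling those citations out of a generated string follows; `extract_references` is a hypothetical helper written for this note, not a galai function.

```python
import re
from typing import List

# Matches the text between the citation tokens shown in the example output.
REF_PATTERN = re.compile(r"\[START_REF\](.*?)\[END_REF\]", re.DOTALL)


def extract_references(text: str) -> List[str]:
    """Return every citation found between [START_REF] and [END_REF]."""
    return [match.strip() for match in REF_PATTERN.findall(text)]


sample = (
    "The Transformer architecture [START_REF] Attention is All you Need, "
    "Vaswani[END_REF] is a sequence-to-sequence model"
)
print(extract_references(sample))
```

A citation that is cut off by the length limit (no closing [END_REF]) is simply not matched.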

🔢 Predict LaTeX

model.generate("The Schwarzschild radius is defined as: \\[")
# The Schwarzschild radius is defined as: \\[r_{s}=\\frac{2GM}{c^{2}}\\]\n\nwhere \\(G\\) is the gravitational constant, \\(M\\) is the mass of the black hole, and

🤔 Reasoning

Reasoning uses the special <work> token:

model.generate("A force of 0.6N is applied to an object, which accelerates at 3m/s. What is its mass? <work>")
# What force should be applied to accelerate an object of mass 3kg to 10m/s? <work>\nWe can use Newton's second law: F = ma. We can substitute variables to get:\n\n\\[ F = \\left(66kg

⚛️ Generate Molecules

model.generate("[START_I_SMILES]", max_length=200)
# [START_I_SMILES]CCC1=CC=C(C=C1)C(=O)NC2=CC=CC(=C2)C(=O)NC3=CC=C(C=C3)S(=O)(=O)N[END_I_SMILES]\n\n### Molecular Formula\n\nC22H21N3O4S\n\n## Chemical and Physical Properties\n\nThe following are chemical properties for 3-[[3-(4-ethylphenyl)-3-oxo-propanoyl]amino]-N-(4-sulfamoylphenyl)benzamide.\n\n### Computed Properties\n\n| Property Name | Property Value\n| --- | ----------- |\n| Molecular Weight | 423.5\n| XLogP3-AA Log P | 3.2\n| Hydrogen Bond Donor Count | 3\n| Hydrogen Bond Acceptor Count 

🧑‍🔬 Predict Protein Annotations

model.generate("[START_AMINO]GHMQSITAGQKVISKHKNGRFYQCEVVRLTTETFYEVNFDDGSFSDNLYPEDIVSQDCLQFGPPAEGEVVQVRWTDGQVYGAKFVASHPIQMYQVEFEDGSQLVVKRDDVYTLDEELP[END_AMINO] ## Keywords", max_length=200)
# '[START_AMINO]GHMQSITAGQKVISKHKNGRFYQCEVVRLTTETFYEVNFDDGSFSDNLYPEDIVSQDCLQFGPPAEGEVVQVRWTDGQVYGAKFVASHPIQMYQVEFEDGSQLVVKRDDVYTLDEELP[END_AMINO] ## Keywords\n\nCytoplasm, Methyltransferase, rRNA processing, S-adenosyl-L-methionine, Transferase\n\n## References\n\nQuestion: What are some articles for Ribosomal RNA small subunit methyltransferase H?\n\nAnswer: \n\n[START_REF] Comparative Genomics of 28 Salmonella enterica Isolates: Evidence for CRISPR-Mediated Adaptive Sublineage Evolution, Fricke[END_REF]\n\n</s>'
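The prompt above wraps the amino acid sequence in [START_AMINO] and [END_AMINO] and then opens a "## Keywords" section for the model to complete. The sketch below builds that prompt and reads the comma-separated keywords back out of a generated string. Both function names are hypothetical, and the parsing assumes the output follows the layout of the example shown.

```python
from typing import List


def protein_keywords_prompt(sequence: str) -> str:
    """Wrap a protein sequence in the tokens used in the example prompt."""
    return f"[START_AMINO]{sequence}[END_AMINO] ## Keywords"


def parse_keywords(generated: str) -> List[str]:
    """Read the comma-separated keywords that follow '## Keywords'.

    Stops at the next '##' heading; returns [] if the section is missing.
    """
    _, _, tail = generated.partition("## Keywords")
    section = tail.split("##", 1)[0]
    return [kw.strip() for kw in section.split(",") if kw.strip()]


example_output = (
    "[START_AMINO]GHMQ[END_AMINO] ## Keywords\n\n"
    "Cytoplasm, Methyltransferase, rRNA processing\n\n## References\n\n"
)
print(parse_keywords(example_output))
```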

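🖱️ Free-Form Generation

If you want autocomplete-style functionality, it is often worth experimenting with turning off new_doc=True. This makes the model more likely to assume it is in the middle of a document rather than at the beginning.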

model.generate("The reason why Transformers replaced RNNs was because", new_doc=False)
# The reason why Transformers replaced RNNs was because they were able to capture long-term dependencies in the input sequence.\n\n# 2.2.2. Attention Mechanism\n\nThe attention mechanism was introduced in [START_REF] Neural Machine Translation by Jointly Learning to Align and Translate, Bahdan

Question Answering

In the paper, we prefix questions with "Q:" or "Question:". A typical format is "Question: question.\n\nAnswer:", for example:

model.generate("Question: What is the notch signaling pathway?\n\nAnswer:")
# 'Question: What is the notch signaling pathway?\n\nAnswer: \n\nNotch signaling pathway is a cell-cell communication pathway that regulates cell fate decisions during development. It is involved in cell proliferation, differentiation, apoptosis, and cell migration. The Notch signaling pathway is activated by the binding of'
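The question format can be wrapped in a small helper so that every prompt uses the same layout. This is a sketch only: `qa_prompt` and `extract_answer` are hypothetical names, and the answer parsing assumes the output echoes the prompt as in the example above.

```python
def qa_prompt(question: str) -> str:
    """Format a question in the 'Question: ...\\n\\nAnswer:' style."""
    return f"Question: {question.strip()}\n\nAnswer:"


def extract_answer(generated: str) -> str:
    """Return the text after the first 'Answer:' marker, or '' if absent."""
    _, sep, answer = generated.partition("Answer:")
    if not sep:
        return ""
    # Drop the end-of-sequence token if the model emitted one.
    return answer.replace("</s>", "").strip()


prompt = qa_prompt("What is the notch signaling pathway?")
print(repr(prompt))
```

The prompt string can then be passed to model.generate in the same way as the example above.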

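📄 Documents

When starting a document, you must use the start-of-document token to get good results. To do this, set new_doc=True in generate.

For some article types, such as Wikipedia-style articles, lecture notes and GitHub repositories, begin with #, for example: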

model.generate("# Multi-Head Attention\n\n", new_doc=True)
# # Multi-Head Attention\n\nThe multi-head attention mechanism is a generalization of the single-head attention mechanism. The multi-head attention mechanism is a combination of multiple single-head attention mechanisms. The multi-head attention mechanism is shown in Figure 2.\n\nThe multi-

For paper documents, use Title, for example:

model.generate("Title: Self-Supervised Learning, A Survey\n\nAuthors: John Smith\n\n", new_doc=True)
# Title: Self-Supervised Learning, A Survey\n\nAuthors: John Smith\n\n# Abstract\n\nSelf-supervised learning is a class of machine learning methods that learn representations of data without the need for human-provided labels.\nIn this survey, we provide a comprehensive overview of the field
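The two document openings shown above (a "#" heading, or a "Title:" and "Authors:" header) are plain strings, so they are easy to build programmatically. A small sketch follows; both function names are hypothetical, and joining several authors with ", " is an assumption, since the example only shows a single author.

```python
from typing import List


def heading_prompt(heading: str) -> str:
    """Open a Wikipedia-style article, lecture note or repository page."""
    return f"# {heading}\n\n"


def paper_prompt(title: str, authors: List[str]) -> str:
    """Open a paper-style document with a title and author line.

    Assumption: multiple authors are joined with ', '.
    """
    return f"Title: {title}\n\nAuthors: {', '.join(authors)}\n\n"


print(repr(heading_prompt("Multi-Head Attention")))
print(repr(paper_prompt("Self-Supervised Learning, A Survey", ["John Smith"])))
```

Either string would be passed to model.generate together with new_doc=True, as in the examples above.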

You can also try alternative sampling techniques to reduce repetition, for example:

model.generate("Lecture 1: The Ising Model\n\n", new_doc=True, top_p=0.7, max_length=200)
# 'Lecture 1: The Ising Model\n\n# 13 Introduction\n\nWe will now look at a simple model for magnetism, the Ising model, which is\na lattice model in which we consider only two spin values, up or down, and\nwe want to understand how these spins interact with each other and how\nthey get arranged in a particular state.\n\nWe will first consider the one-dimensional case, and then move on to\nthe case of two-dimensional lattices, and then to higher dimensions.\n\n# 14 The One-Dimensional Ising Model\n\n# 14.1 The Model\n\nThe one-dimensional Ising model is the simplest case of the model, in\nwhich the lattice is a line of \\(N\\) spins, each with two possible spin\nvalues, up or down. In other words, we consider a line of \\(N\\) spins\nwhere each spin can point up or down'

📜 Summarization

You can append "TLDR:" to get a TLDR summary:

TEXT = """Information overload is a major obstacle to scientific progress. The explosive growth in scientific literature and data has made it ever harder to discover useful insights in a large mass of information. Today scientific knowledge is accessed through search engines, but they are unable to organize scientific knowledge alone. In this paper we introduce Galactica: a large language model that can store, combine and reason about scientific knowledge. We train on a large scientific corpus of papers, reference material, knowledge bases and many other sources. We outperform existing models on a range of scientific tasks. On technical knowledge probes such as LaTeX equations, Galactica outperforms the latest GPT-3 by 68.2% versus 49.0%. Galactica also performs well on reasoning, outperforming Chinchilla on mathematical MMLU by 41.3% to 35.7%, and PaLM 540B on MATH with a score of 20.4% versus 8.8%. It also sets a new state-of-the-art on downstream tasks such as PubMedQA and MedMCQA dev of 77.6% and 52.9%. And despite not being trained on a general corpus, Galactica outperforms BLOOM and OPT-175B on BIG-bench. We believe these results demonstrate the potential for language models as a new interface for science. We open source the model for the benefit of the scientific community."""

model.generate(TEXT + "\n\nTLDR:", max_length=400)
# ...TLDR: We introduce Galactica, a large language model that can store, combine and reason about scientific knowledge.</s>
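Because the output echoes the input text, the summary is whatever follows the final "TLDR:" marker. A minimal sketch for building the prompt and reading the summary back is given here; `tldr_prompt` and `extract_tldr` are hypothetical helpers, and stripping "</s>" assumes the end-of-sequence token appears as it does in the example output.

```python
def tldr_prompt(text: str) -> str:
    """Append the TLDR cue to a passage."""
    return f"{text}\n\nTLDR:"


def extract_tldr(generated: str) -> str:
    """Return the text after the last 'TLDR:' marker, or '' if absent."""
    _, sep, summary = generated.rpartition("TLDR:")
    if not sep:
        return ""
    # Remove the end-of-sequence token seen in the example output.
    return summary.replace("</s>", "").strip()


example_output = "Some abstract.\n\nTLDR: We introduce Galactica.</s>"
print(extract_tldr(example_output))
```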

💎 Entity Extraction

You can extract entities from documents. We use the abstract example (TEXT) from the previous section and add a question:

ENT_TEXT = TEXT + '\n\nWhat scientific entities are mentioned in the abstract above?\n\n'

model.generate(ENT_TEXT, max_length=400)
# ...What scientific entities are mentioned in the abstract above?\n\nA: LaTeX equations, mathematical MMLU, MATH, PubMedQA, MedMCQA, BIG-bench</s>
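In the example output the entities come back as a comma-separated list after "A:". The sketch below turns that into a Python list. `parse_entities` is a hypothetical helper, and it assumes the answer follows the "\n\nA:" layout shown above; other layouts would need different parsing.

```python
from typing import List


def parse_entities(generated: str) -> List[str]:
    """Split the comma-separated answer that follows the last 'A:' marker."""
    _, sep, answer = generated.rpartition("\n\nA:")
    if not sep:
        return []
    # Remove the end-of-sequence token before splitting.
    answer = answer.replace("</s>", "")
    return [entity.strip() for entity in answer.split(",") if entity.strip()]


example_output = (
    "What scientific entities are mentioned in the abstract above?\n\n"
    "A: LaTeX equations, mathematical MMLU, MATH, PubMedQA, MedMCQA, BIG-bench</s>"
)
print(parse_entities(example_output))
```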

👨‍🔬 IUPAC Name Prediction

For this task, we used a prompt based on the PubChem document format and prompted for the completion. We use the 6.7B model below:

context = "[START_I_SMILES]C(C(=O)O)N[END_I_SMILES]\n\n## Chemical and Physical Properties\n\nThe following are chemical properties for"
model.generate(context, max_length=400)
# [START_I_SMILES]C(C(=O)O)N[END_I_SMILES]\n\n## Chemical and Physical Properties\n\nThe following are chemical properties for 2-amino-2-oxo-acetic acid
# Note this is an incorrect prediction
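The context string above follows a fixed pattern around the SMILES string, so it can be generated for any molecule. This sketch builds that context and reads the predicted name from the completion; both helpers are hypothetical, and the name extraction assumes the output begins with the context, as in the example. As the note above shows, the predicted name can be wrong and should be checked independently.

```python
def iupac_context(smiles: str) -> str:
    """Build the PubChem-style context used in the example prompt."""
    return (
        f"[START_I_SMILES]{smiles}[END_I_SMILES]\n\n"
        "## Chemical and Physical Properties\n\n"
        "The following are chemical properties for"
    )


def extract_name(generated: str, context: str) -> str:
    """Return the first line of the completion that follows the context."""
    if not generated.startswith(context):
        return ""
    return generated[len(context):].split("\n", 1)[0].strip()


context = iupac_context("C(C(=O)O)N")
example_output = context + " 2-amino-2-oxo-acetic acid"
print(extract_name(example_output, context))
```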

Citation

@inproceedings{GALACTICA,
    title={GALACTICA: A Large Language Model for Science},
    author={Ross Taylor and Marcin Kardas and Guillem Cucurull and Thomas Scialom and Anthony Hartshorn and Elvis Saravia and Andrew Poulton and Viktor Kerkez and Robert Stojnic},
    year={2022}
}