Opening Remarks

Predibase offers an online service, built on LoRAX, for quickly fine-tuning and deploying open-source large models.

LoRAX (LoRA eXchange) is a framework that serves thousands of fine-tuned models on a single GPU, significantly reducing serving costs without compromising throughput or latency.

This tutorial walks through fine-tuning and inference in the cloud with Predibase; the accompanying Jupyter Notebook will be uploaded later.

The Predibase Service

Before we start, obtain an API key and your Tenant ID from the Predibase settings page; both are used in the SDK calls below.

api_token = "pb_xxxxxx"
tenant_id = "xxxx"
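Hardcoding secrets in a notebook is easy to leak. A small sketch that reads them from environment variables instead (the variable names PREDIBASE_API_TOKEN and PREDIBASE_TENANT_ID are my own choice, not an official convention):

```python
import os

# Fall back to the placeholder values when the variables are unset
api_token = os.environ.get("PREDIBASE_API_TOKEN", "pb_xxxxxx")
tenant_id = os.environ.get("PREDIBASE_TENANT_ID", "xxxx")
```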

Predibase gives new users $25 of free credit, which is plenty for small-scale experiments.


Installing dependencies

Install the Predibase SDK and the LoRAX client, plus ChatTool and the OpenAI SDK for chat-style inference:

pip install predibase
pip install lorax-client
pip install chattool
pip install openai

Fine-tuning in the cloud

Fine-tuning on the Predibase cloud service follows four basic steps: prepare the data, create the fine-tuning job, download the LoRA weights, and run inference.

Related documentation

Besides the officially recommended models, Predibase can also use HuggingFace models directly, provided that:

  • the model's tags include Text Generation and Transformer, and do not include custom_code;
  • the model uses a supported architecture, such as Llama, Mistral, or Qwen.
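The tag requirements can be expressed as a small predicate over a model's tag list (illustrative only; the sample tag lists below are assumptions, not live model-card lookups, and on the Hub the tags appear in lowercase as text-generation / transformers):

```python
def is_supported(tags):
    """Check Predibase's stated tag requirements for a HuggingFace model:
    tagged for text generation and transformers, and no custom_code."""
    return ("text-generation" in tags
            and "transformers" in tags
            and "custom_code" not in tags)

# Hypothetical tag lists for illustration
print(is_supported(["transformers", "safetensors", "qwen2", "text-generation"]))  # True
print(is_supported(["transformers", "text-generation", "custom_code"]))           # False
```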

For example, the tags of Qwen1.5-1.8B-Chat:

[Screenshot: model-card tags for Qwen1.5-1.8B-Chat]

Data preparation

We use the ProofNet dataset for the demonstration, with the natural-language statement as the prompt and the formal statement as the completion.

Download the dataset from its GitHub repository:

git clone https://github.com/zhangir-azerbayev/ProofNet.git

About the dataset

ProofNet is a benchmark for autoformalization and formal proving of undergraduate mathematics. The dataset contains 371 examples, each consisting of a formal theorem statement in Lean 3, a natural-language theorem statement, and a natural-language proof. The problems are drawn mainly from popular undergraduate pure-mathematics textbooks, covering real analysis, complex analysis, linear algebra, abstract algebra, and topology. ProofNet is intended as a challenging benchmark to drive progress in autoformalization and automated theorem proving.

More information about ProofNet is available in the GitHub repository.

Read the test and validation splits and inspect a sample:

import json
from pprint import pprint

# Read the test split
test_data = []
with open("ProofNet/benchmark/test.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        test_data.append(json.loads(line))

# Read the validation split
valid_data = []
with open("ProofNet/benchmark/valid.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        valid_data.append(json.loads(line))

# Print the dataset size and a sample record
print(len(test_data))
pprint(test_data[0])

Record count and a sample:

186
{'formal_statement': 'theorem exercise_1_1b\n'
'(x : ℝ)\n'
'(y : ℚ)\n'
'(h : y ≠ 0)\n'
': ( irrational x ) -> irrational ( x * y ) :=',
'id': 'Rudin|exercise_1_1b',
'nl_proof': '\\begin{proof}\n'
'\n'
' If $r x$ were rational, then $x=\\frac{r x}{r}$ would also '
'be rational.\n'
'\n'
'\\end{proof}',
'nl_statement': 'If $r$ is rational $(r \\neq 0)$ and $x$ is irrational, '
'prove that $rx$ is irrational.',
'src_header': 'import .common\n'
'\n'
'open real complex\n'
'open topological_space\n'
'open filter\n'
'open_locale real \n'
'open_locale topology\n'
'open_locale big_operators\n'
'open_locale complex_conjugate\n'
'open_locale filter\n'
'\n'
'\n'
'noncomputable theory\n'
'\n'}

Data processing

We first put the records into the OpenAI chat format, then convert them to the format LoRAX expects.

First, define the template functions:

def chat_pair_template(nl_statement, formal_statement):
    """Build a chat-pair dict with a system prompt, user question, and assistant answer."""
    return {
        "messages": [
            {
                "role": "system",
                "content": "You are a skilled mathematician specializing in LEAN, the powerful theorem prover."
            },
            {
                "role": "user",
                "content": "Please translate the natural language version to a LEAN version:\nNatural language version: " +
                           nl_statement
            },
            {
                "role": "assistant",
                "content": "LEAN version:\n" + formal_statement
            }
        ]
    }

def data_to_chat_pairs(data):
    """Convert a list of records into a list of chat pairs."""
    chats = []
    for item in data:
        # Build one chat pair per record with chat_pair_template
        chat_pair = chat_pair_template(item['nl_statement'], item['formal_statement'])
        chats.append(chat_pair)
    return chats

Then convert the data in bulk:

instruct_pairs = data_to_chat_pairs(test_data)
valid_pairs = data_to_chat_pairs(valid_data)

A sample looks like:

{'messages': [{'role': 'system', 'content': 'You are a skilled mathematician specializing in LEAN, the powerful theorem prover.'}, {'role': 'user', 'content': 'Please translate the natural language version to a LEAN version:\nNatural language version: If $r$ is rational $(r \\neq 0)$ and $x$ is irrational, prove that $rx$ is irrational.'}, {'role': 'assistant', 'content': 'LEAN version:\ntheorem exercise_1_1b\n(x : ℝ)\n(y : ℚ)\n(h : y ≠ 0)\n: ( irrational x ) -> irrational ( x * y ) :='}]}

Finally, convert to the LoRAX format:

lorax_entries = [item['messages'] for item in instruct_pairs]
lorax_entries = [
    {
        'prompt': item[0]['content'] + '\n' + item[1]['content'],
        'completion': item[2]['content']
    }
    for item in lorax_entries
]

valid_entries = [item['messages'] for item in valid_pairs]
valid_entries = [
    {
        'prompt': item[0]['content'] + '\n' + item[1]['content'],
        'completion': item[2]['content']
    }
    for item in valid_entries
]

pprint(lorax_entries[0])

A converted sample:

{'completion': 'LEAN version:\n'
'theorem exercise_1_1b\n'
'(x : ℝ)\n'
'(y : ℚ)\n'
'(h : y ≠ 0)\n'
': ( irrational x ) -> irrational ( x * y ) :=',
'prompt': 'You are a skilled mathematician specializing in LEAN, the powerful '
'theorem prover.\n'
'Please translate the natural language version to a LEAN version:\n'
'Natural language version: If $r$ is rational $(r \\neq 0)$ and $x$ '
'is irrational, prove that $rx$ is irrational.'}
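Before uploading, it is worth a quick sanity check that every record carries the two non-empty fields LoRAX expects (check_entries is a helper of my own, not part of any SDK):

```python
def check_entries(entries):
    """Assert every record has exactly a non-empty 'prompt' and 'completion'."""
    for i, entry in enumerate(entries):
        assert set(entry) == {"prompt", "completion"}, f"unexpected keys at index {i}"
        assert entry["prompt"].strip(), f"empty prompt at index {i}"
        assert entry["completion"].strip(), f"empty completion at index {i}"
    return len(entries)

# Illustrative record in the converted format
sample = [{
    "prompt": "You are a skilled mathematician...\nNatural language version: ...",
    "completion": "LEAN version:\ntheorem exercise_1_1b ...",
}]
print(check_entries(sample))  # 1
```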

Saving and uploading the data

Save the processed data as a JSONL file:

import json

def write_to_jsonl(data, file_path):
    with open(file_path, 'w', encoding='utf-8') as file:
        for entry in data:
            # Serialize each dict to a JSON string and write it out
            json_line = json.dumps(entry)
            file.write(json_line + '\n')  # one record per line, as JSONL requires

# Output file path
output_path = 'lorax_dataset.jsonl'

# Write the data
write_to_jsonl(lorax_entries, output_path)
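A matching reader makes it easy to verify the file round-trips before uploading (read_jsonl is a hypothetical helper, not part of the Predibase SDK):

```python
import json

def read_jsonl(file_path):
    """Read a JSONL file back into a list of dicts."""
    with open(file_path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f]

# Round-trip a tiny sample to confirm nothing is lost in serialization
demo = [{"prompt": "p1", "completion": "c1"}, {"prompt": "p2", "completion": "c2"}]
with open("demo_dataset.jsonl", "w", encoding="utf-8") as f:
    for entry in demo:
        f.write(json.dumps(entry) + "\n")

assert read_jsonl("demo_dataset.jsonl") == demo
print("round-trip ok")
```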

Upload the dataset to Predibase:

from predibase import Predibase, FinetuningConfig, DeploymentConfig

api_token = "pb_xxxxxx"
tenant_id = "xxxx"
pb = Predibase(api_token=api_token)

dataset = pb.datasets.from_file("./lorax_dataset.jsonl", name="proofnet")

Creating the fine-tuning job

First create an adapter repository; later we will reference adapters as proofnet-model-mistral/x:

# Create the adapter repository
repo = pb.repos.create(name="proofnet-model-mistral", description="ProofNet experiment", exists_ok=True)

Then launch the fine-tuning job with custom parameters, including the base model, number of epochs, LoRA rank, and learning rate:

# Launch a fine-tuning job with custom parameters; this call blocks until training finishes
# dataset = pb.datasets.get("proofnet")
adapter = pb.adapters.create(
    config=FinetuningConfig(
        base_model="mistral-7b",
        epochs=3,  # default: 3
        rank=16,  # default: 16
        learning_rate=0.0002  # default: 0.0002
    ),
    dataset=dataset,
    repo=repo,
    description="ProofNet experiment"
)

The output looks like:

Successfully requested finetuning of mistral-7b as `proofnet-model-mistral/2`. (Job UUID: e23f9e75-13e4-49d6-bca9-cd70ea9b9d6e).

Watching progress of finetuning job e23f9e75-13e4-49d6-bca9-cd70ea9b9d6e. This call will block until the job has finished. Canceling or terminating this call will NOT cancel or terminate the job itself.

Job is queued for execution. Time in queue: 0:00:01

Check the job's status:

# Fetch the adapter; blocks if training is still in progress
adapter = pb.adapters.get("proofnet-model-mistral/2")
adapter

When training finishes, the output looks like:

Adapter proofnet-model-mistral/2 is not yet ready.
Watching progress of finetuning job e23f9e75-13e4-49d6-bca9-cd70ea9b9d6e. This call will block until the job has finished. Canceling or terminating this call will NOT cancel or terminate the job itself.

Waiting to receive training metrics...

┌────────────┬────────────┬─────────────────┐
│ checkpoint │ train_loss │ validation_loss │
├────────────┼────────────┼─────────────────┤
│          1 │     0.8073 │              -- │
│          2 │     0.6523 │              -- │
│          3 │     0.5062 │              -- │
└────────────┴────────────┴─────────────────┘

Fine-tuning other models works the same way, for example Qwen:

# Create an adapter repository
repo = pb.repos.create(name="proofnet-model-qwen", description="ProofNet experiment", exists_ok=True)

# Start a fine-tuning job with custom parameters, blocks until training is finished
adapter = pb.adapters.create(
    config=FinetuningConfig(
        base_model="Qwen/Qwen1.5-7B",
        # hf_token="<YOUR HUGGINGFACE TOKEN>"  # Required for private Huggingface models
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    ),
    dataset=dataset,
    repo=repo,
    description="changing epochs, rank, and learning rate"
)

Checking results and downloading the weights

Now run inference with the fine-tuned adapter and compare it against the base model.

A validation-set example:

{'completion': 'LEAN version:\n'
'theorem exercise_1_1a\n'
' (x : ℝ) (y : ℚ) :\n'
' ( irrational x ) -> irrational ( x + y ) :=',
'prompt': 'You are a skilled mathematician specializing in LEAN, the powerful '
'theorem prover.\n'
'Please translate the natural language version to a LEAN version:\n'
'Natural language version: If $r$ is rational $(r \\neq 0)$ and $x$ '
'is irrational, prove that $r+x$ is irrational.'}

Test code:

input_prompt = f"<s>[INST] {valid_entries[0]['prompt']} [/INST] "

# Inference with the fine-tuned adapter
lorax_client = pb.deployments.client("mistral-7b")
print(lorax_client.generate(input_prompt, adapter_id="proofnet-model-mistral/2", max_new_tokens=1024).generated_text)

# Inference with the base model, no adapter
print(lorax_client.generate(input_prompt, max_new_tokens=1024).generated_text)

The model is only 7B; without fine-tuning it just emits the prompt again and again:

INST] You are a skilled mathematician specializing in LEAN, the powerful theorem prover.
Please translate the natural language version to a LEAN version:
Natural language version: If $r$ is rational $(r \neq 0)$ and $x$ is irrational, prove that $r+x$ is irrational. [/INST] 2.

[INST] You are a skilled mathematician specializing in LEAN, the powerful theorem prover.
Please translate the natural language version to a LEAN version:
Natural language version: If $r$ is rational $(r \neq 0)$ and $x$ is irrational, prove that $r+x$ is irrational. [/INST] 3.

The fine-tuned output is below. Unlike the base model it captures some of the right structure, but the model's knowledge is still insufficient:

theorem exercise_1_1_11 {r x : ℝ} (hr : r ≠ 0) (hx : x ≠ 0) :
irrational (r + x) :=

Download the LoRA weights. Here /1 is the adapter version number; fine-tuning the same repository multiple times produces multiple versions:

pb.adapters.download("proofnet-model-mistral/1")

Serving and inference

Docs: Serverless Endpoints

Predibase supports two ways to run inference: its official SDK and the OpenAI SDK.

The official SDK

Run inference through the official Predibase SDK, using mistral-7b-instruct-v0-2 as an example:

from predibase import Predibase, FinetuningConfig, DeploymentConfig

api_token = "pb_xxx"
tenant_id = "xxxxx"

pb = Predibase(api_token=api_token)
# Connected to Predibase as User(id=xxx, username=xxx)

lorax_client = pb.deployments.client("mistral-7b-instruct-v0-2")  # insert the deployment name
resp = lorax_client.generate("[INST] What are some popular tourist spots in San Francisco? [/INST]")
print(resp.generated_text)

Output:

San Francisco, California is known for its unique blend of culture, history, and natural beauty. Here are some popular tourist spots that you may want to consider visiting:

Use a streaming response to receive the model's output incrementally:

for resp in lorax_client.generate_stream("[INST] What are some popular tourist spots in San Francisco? [/INST]"):
    if not resp.token.special:
        print(resp.token.text, end="", flush=True)

Load a fine-tuned adapter for inference, for example:

print(lorax_client.generate(input_prompt, adapter_id="news-summarizer-model/1", max_new_tokens=100).generated_text)

OpenAI-style inference

Predibase also exposes an OpenAI-compatible API; pass the adapter name as the model parameter.

Chat-style inference with chattool:

from chattool import Chat
import chattool

api_token = "pb_xxx"
tenant_id = "xxxxx"
chattool.api_key = api_token
chattool.api_base = f"https://serving.app.predibase.com/{tenant_id}/deployments/v2/llms/mistral-7b-instruct/v1"

chat = Chat()
chat.model = ''  # adapter name; an empty string loads no adapter
chat.user("How many helicopters can a human eat in one sitting?")
chat.getresponse()
chat.print_log()

The conversation log:

---------------
user
---------------
How many helicopters can a human eat in one sitting?

---------------
assistant
---------------
It is not possible for a human to eat a helicopter in one sitting, as helicopters are not designed to be consumed by humans. They are aircraft with rotating blades and a main body, not food items.

Setting chat.model selects a fine-tuned adapter.
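Since the setup also installed the openai package, the same endpoint can be reached through the official OpenAI client. A sketch, not verified against a live deployment: the base_url mirrors the chattool configuration above, and the request only succeeds with valid credentials.

```python
from openai import OpenAI

api_token = "pb_xxx"
tenant_id = "xxxxx"

# Point the standard OpenAI client at the Predibase serving endpoint
client = OpenAI(
    api_key=api_token,
    base_url=f"https://serving.app.predibase.com/{tenant_id}/deployments/v2/llms/mistral-7b-instruct/v1",
)

# The adapter name goes in `model`; an empty string targets the base model
resp = client.chat.completions.create(
    model="",
    messages=[{"role": "user", "content": "How many helicopters can a human eat in one sitting?"}],
)
print(resp.choices[0].message.content)
```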

Inference with LoRAX

Project: https://github.com/predibase/lorax
LoRAX docs: https://loraexchange.ai/

LoRAX loads LoRA adapters dynamically at request time; see also the dedicated-deployments guide:

https://docs.predibase.com/user-guide/inference/dedicated_deployments

Related documentation:

Predibase console: https://app.predibase.com/
Predibase docs: https://docs.predibase.com/


LoRAX

Motivation:

  1. Serve fine-grained, domain-specific applications: one general LLM should cover the jobs that many small task-specific models do across scenarios
  2. Make full use of the GPU

https://zhuanlan.zhihu.com/p/684941271

A LoRAX fine-tuned model consists of two parts:

  • Base model: the pretrained large model shared by all adapters.
  • Adapters: task-specific adapter weights, loaded dynamically per request.
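The split is visible in the LoRA arithmetic itself: every adapter is a low-rank update B·A added on top of the shared base weight, so adapters of different ranks can share one copy of W. A toy sketch in plain Python (not LoRAX code; the matrices are made up):

```python
def matmul(A, B):
    """Naive matrix multiply for small illustrative matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matadd(A, B):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

# Shared base weight (4x4 identity), playing the role of the "base model"
W = [[1 if i == j else 0 for j in range(4)] for i in range(4)]

# Two adapters with different LoRA ranks: W_eff = W + B @ A
adapter_r1 = {"A": [[1, 0, 0, 0]],                # rank 1: A is 1x4, B is 4x1
              "B": [[2], [0], [0], [0]]}
adapter_r2 = {"A": [[0, 1, 0, 0], [0, 0, 1, 0]],  # rank 2: A is 2x4, B is 4x2
              "B": [[0, 0], [3, 0], [0, 3], [0, 0]]}

def effective_weight(W, adapter):
    """The weight a request sees once its adapter's low-rank update is applied."""
    return matadd(W, matmul(adapter["B"], adapter["A"]))

W1 = effective_weight(W, adapter_r1)
W2 = effective_weight(W, adapter_r2)
print(W1[0][0], W2[1][1])  # → 3 4
```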

Its inference optimizations include PagedAttention and the SGMV kernel from Punica.

PagedAttention borrows paged memory management from operating systems to store the KV values of the attention computation efficiently. SGMV is similar to the MBGMM operation proposed in S-LoRA; the goal is to let adapters of different ranks be computed efficiently together in the same batch.
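To make the PagedAttention idea concrete, here is a toy bookkeeping sketch of a paged KV cache (my own class and numbers, nothing resembling the real CUDA kernels): logical token positions map onto fixed-size physical blocks that are allocated on demand rather than reserved up front.

```python
BLOCK_SIZE = 4  # tokens per block, analogous to an OS memory page

class PagedKVCache:
    """Toy illustration of paged KV-cache bookkeeping: each sequence keeps a
    block table mapping logical token slots to physical blocks."""
    def __init__(self):
        self.free_blocks = list(range(100))  # pool of physical block ids
        self.block_tables = {}               # seq_id -> list of block ids

    def append_token(self, seq_id, position):
        table = self.block_tables.setdefault(seq_id, [])
        if position % BLOCK_SIZE == 0:       # current block full: grab a new one
            table.append(self.free_blocks.pop(0))
        block = table[position // BLOCK_SIZE]
        return (block, position % BLOCK_SIZE)  # physical slot for this token's K/V

cache = PagedKVCache()
slots = [cache.append_token("seq0", p) for p in range(6)]
print(slots)                       # first 4 tokens land in block 0, the next 2 in block 1
print(cache.block_tables["seq0"])  # [0, 1]
```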

Further reading on inference techniques, Continuous Batching and Paged Attention:
https://insujang.github.io/2024-01-07/llm-inference-continuous-batching-and-pagedattention/