Frontend: Structured Generation Language (SGLang)

The frontend language can be used with local models or API models. It is an alternative to the OpenAI API. You may find it easier to use for complex prompting workflows.

Quick Start

The example below shows how to use sglang to answer a multi-turn question.

Using Local Models

First, launch a server with the following command:

python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000

Then, connect to the server and answer a multi-turn question.

from sglang import function, system, user, assistant, gen, set_default_backend, RuntimeEndpoint

@function
def multi_turn_question(s, question_1, question_2):
    s += system("You are a helpful assistant.")
    s += user(question_1)
    s += assistant(gen("answer_1", max_tokens=256))
    s += user(question_2)
    s += assistant(gen("answer_2", max_tokens=256))

set_default_backend(RuntimeEndpoint("http://localhost:30000"))

state = multi_turn_question.run(
    question_1="What is the capital of the United States?",
    question_2="List two local attractions.",
)

for m in state.messages():
    print(m["role"], ":", m["content"])

print(state["answer_1"])

Using OpenAI Models

Set the OpenAI API key:

export OPENAI_API_KEY=sk-******

Then, answer a multi-turn question.

from sglang import function, system, user, assistant, gen, set_default_backend, OpenAI

@function
def multi_turn_question(s, question_1, question_2):
    s += system("You are a helpful assistant.")
    s += user(question_1)
    s += assistant(gen("answer_1", max_tokens=256))
    s += user(question_2)
    s += assistant(gen("answer_2", max_tokens=256))

set_default_backend(OpenAI("gpt-3.5-turbo"))

state = multi_turn_question.run(
    question_1="What is the capital of the United States?",
    question_2="List two local attractions.",
)

for m in state.messages():
    print(m["role"], ":", m["content"])

print(state["answer_1"])

More Examples

Anthropic and VertexAI (Gemini) models are also supported. You can find more examples at examples/quick_start.
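
As a rough sketch of switching backends (the model names below are placeholders and constructor details may differ between releases; treat examples/quick_start as the reference):

from sglang import Anthropic, VertexAI, set_default_backend

set_default_backend(Anthropic("claude-3-haiku-20240307"))  # placeholder model name
# or
set_default_backend(VertexAI("gemini-pro"))                # placeholder model name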

Language Features

To begin with, import sglang.

import sglang as sgl

sglang provides some simple primitives, such as gen, select, fork, and image. You can implement your prompt flow in a function decorated with sgl.function. You can then invoke the function with run or run_batch. The system will manage the state, chat template, parallelism, and batching for you.

The complete code for the examples below can be found at readme_examples.py.
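
For instance, a minimal function built around the select primitive could look like the sketch below (the function name and prompt are made up for illustration, and a default backend must already be set as in the Quick Start):

import sglang as sgl

@sgl.function
def classify_sentiment(s, review):
    s += "Review: " + review + "\n"
    # select constrains the generation to one of the given choices
    s += "Sentiment: " + sgl.select("label", choices=["positive", "negative"])

state = classify_sentiment.run(review="The battery life is amazing.")
print(state["label"])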

Control Flow

You can use any Python code within the function body, including control flow, nested function calls, and external libraries.

@sgl.function
def tool_use(s, question):
    s += "To answer this question: " + question + ". "
    s += "I need to use a " + sgl.gen("tool", choices=["calculator", "search engine"]) + ". "

    if s["tool"] == "calculator":
        s += "The math expression is" + sgl.gen("expression")
    elif s["tool"] == "search engine":
        s += "The key word to search is" + sgl.gen("word")

Parallelism

Use fork to launch parallel prompts. Because sgl.gen is non-blocking, the for loop below issues the two generation calls in parallel.

@sgl.function
def tip_suggestion(s):
    s += (
        "Here are two tips for staying healthy: "
        "1. Balanced Diet. 2. Regular Exercise.\n\n"
    )

    forks = s.fork(2)
    for i, f in enumerate(forks):
        f += f"Now, expand tip {i+1} into a paragraph:\n"
        f += sgl.gen("detailed_tip", max_tokens=256, stop="\n\n")

    s += "Tip 1:" + forks[0]["detailed_tip"] + "\n"
    s += "Tip 2:" + forks[1]["detailed_tip"] + "\n"
    s += "In summary" + sgl.gen("summary")

Multi-Modality

Use sgl.image to pass an image as input.

@sgl.function
def image_qa(s, image_file, question):
    s += sgl.user(sgl.image(image_file) + question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=256))
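
A usage sketch (the image path below is a placeholder, and the server needs to run a vision-language model such as LLaVA):

state = image_qa.run(
    image_file="./images/cat.jpeg",  # placeholder path; point this at a real image
    question="What is in this image?",
)
print(state["answer"])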

See also local_example_llava_next.py.

Constrained Decoding

Use regex to specify a regular expression as a decoding constraint. It is only supported for local models.

@sgl.function
def regular_expression_gen(s):
    s += "Q: What is the IP address of the Google DNS servers?\n"
    s += "A: " + sgl.gen(
        "answer",
        temperature=0,
        regex=r"((25[0-5]|2[0-4]\d|[01]?\d\d?).){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)",
    )
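
You can then run the function and read the constrained output back from the state:

state = regular_expression_gen.run()
print(state["answer"])  # matches the dotted IPv4 pattern above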

JSON Decoding

Use regex to specify a JSON schema with a regular expression.

character_regex = (
    r"""\{\n"""
    + r"""    "name": "[\w\d\s]{1,16}",\n"""
    + r"""    "house": "(Gryffindor|Slytherin|Ravenclaw|Hufflepuff)",\n"""
    + r"""    "blood status": "(Pure-blood|Half-blood|Muggle-born)",\n"""
    + r"""    "occupation": "(student|teacher|auror|ministry of magic|death eater|order of the phoenix)",\n"""
    + r"""    "wand": \{\n"""
    + r"""        "wood": "[\w\d\s]{1,16}",\n"""
    + r"""        "core": "[\w\d\s]{1,16}",\n"""
    + r"""        "length": [0-9]{1,2}\.[0-9]{0,2}\n"""
    + r"""    \},\n"""
    + r"""    "alive": "(Alive|Deceased)",\n"""
    + r"""    "patronus": "[\w\d\s]{1,16}",\n"""
    + r"""    "bogart": "[\w\d\s]{1,16}"\n"""
    + r"""\}"""
)

@sgl.function
def character_gen(s, name):
    s += name + " is a character in Harry Potter. Please fill in the following information about this character.\n"
    s += sgl.gen("json_output", max_tokens=256, regex=character_regex)

See json_decode.py for another example of specifying formats with Pydantic models.

Batching

Use run_batch to run a batch of requests with continuous batching.

@sgl.function
def text_qa(s, question):
    s += "Q: " + question + "\n"
    s += "A:" + sgl.gen("answer", stop="\n")

states = text_qa.run_batch(
    [
        {"question": "What is the capital of the United Kingdom?"},
        {"question": "What is the capital of France?"},
        {"question": "What is the capital of Japan?"},
    ],
    progress_bar=True
)
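
run_batch returns one state per input dict, in the same order, so the answers can be read back directly:

for s in states:
    print(s["answer"])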

Streaming

Add stream=True to enable streaming.

@sgl.function
def text_qa(s, question):
    s += "Q: " + question + "\n"
    s += "A:" + sgl.gen("answer", stop="\n")

state = text_qa.run(
    question="What is the capital of France?",
    temperature=0.1,
    stream=True
)

for out in state.text_iter():
    print(out, end="", flush=True)

Roles

Use sgl.system, sgl.user, and sgl.assistant to set roles when using chat models. You can also define more complex role prompts using begin and end tokens.

@sgl.function
def chat_example(s):
    s += sgl.system("You are a helpful assistant.")
    # Same as: s += s.system("You are a helpful assistant.")

    with s.user():
        s += "Question: What is the capital of France?"

    s += sgl.assistant_begin()
    s += "Answer: " + sgl.gen(max_tokens=100, stop="\n")
    s += sgl.assistant_end()

Tips and Implementation Details

  • The choices argument in sgl.gen is implemented by computing the token-length normalized log probabilities of all choices and selecting the one with the highest probability.

  • The regex argument in sgl.gen is implemented through autoregressive decoding with a logit bias mask, according to the constraints set by the regex. It is compatible with both temperature=0 and temperature != 0.
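
As an illustration of the first point (a simplified sketch, not SGLang's actual implementation), token-length normalization averages the per-token log probabilities of each choice before comparing them:

# Simplified illustration of token-length normalized log probabilities.
# Each choice is scored by the *average* log probability of its tokens,
# so longer choices are not penalized just for containing more tokens.
def pick_choice(token_logprobs_per_choice):
    def score(logprobs):
        return sum(logprobs) / len(logprobs)
    return max(token_logprobs_per_choice, key=lambda c: score(token_logprobs_per_choice[c]))

print(pick_choice({
    "calculator": [-0.2, -0.1],             # made-up per-token log probabilities
    "search engine": [-0.3, -0.4, -0.2],
}))  # prints "calculator" (average -0.15 beats -0.3)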