Frontend: Structured Generation Language (SGLang)

The frontend language can be used with local models or API models. It is an alternative to the OpenAI API. You may find it easier to use for complex prompting workflows.

Quick Start

The example below shows how to use sglang to answer a multi-turn question.

Using Local Models

First, launch a server with the following command:

python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000

Then, connect to the server and answer a multi-turn question.

from sglang import function, system, user, assistant, gen, set_default_backend, RuntimeEndpoint

@function
def multi_turn_question(s, question_1, question_2):
    s += system("You are a helpful assistant.")
    s += user(question_1)
    s += assistant(gen("answer_1", max_tokens=256))
    s += user(question_2)
    s += assistant(gen("answer_2", max_tokens=256))

set_default_backend(RuntimeEndpoint("http://localhost:30000"))

state = multi_turn_question.run(
    question_1="What is the capital of the United States?",
    question_2="List two local attractions.",
)

for m in state.messages():
    print(m["role"], ":", m["content"])

print(state["answer_1"])

Using OpenAI Models

Set the OpenAI API key:

export OPENAI_API_KEY=sk-******

Then, answer a multi-turn question.

from sglang import function, system, user, assistant, gen, set_default_backend, OpenAI

@function
def multi_turn_question(s, question_1, question_2):
    s += system("You are a helpful assistant.")
    s += user(question_1)
    s += assistant(gen("answer_1", max_tokens=256))
    s += user(question_2)
    s += assistant(gen("answer_2", max_tokens=256))

set_default_backend(OpenAI("gpt-3.5-turbo"))

state = multi_turn_question.run(
    question_1="What is the capital of the United States?",
    question_2="List two local attractions.",
)

for m in state.messages():
    print(m["role"], ":", m["content"])

print(state["answer_1"])

More Examples

Anthropic and VertexAI (Gemini) models are also supported. You can find more examples at examples/quick_start.
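
As a rough sketch of switching backends (the model names below are placeholders and constructor details may differ between releases; treat examples/quick_start as the reference):

from sglang import Anthropic, VertexAI, set_default_backend

set_default_backend(Anthropic("claude-3-haiku-20240307"))  # placeholder model name
# or
set_default_backend(VertexAI("gemini-pro"))                # placeholder model name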

Language Features

To begin with, import sglang.

import sglang as sgl

sglang provides some simple primitives, such as gen, select, fork, and image. You can implement your prompt flow in a function decorated with sgl.function. You can then invoke the function with run or run_batch. The system will manage the state, chat template, parallelism, and batching for you.

The complete code for the examples below can be found at readme_examples.py.
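
For instance, a minimal function built around the select primitive could look like the sketch below (the function name and prompt are made up for illustration, and a default backend must already be set as in the Quick Start):

import sglang as sgl

@sgl.function
def classify_sentiment(s, review):
    s += "Review: " + review + "\n"
    # select constrains the generation to one of the given choices
    s += "Sentiment: " + sgl.select("label", choices=["positive", "negative"])

state = classify_sentiment.run(review="The battery life is amazing.")
print(state["label"])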

Control Flow

You can use any Python code within the function body, including control flow, nested function calls, and external libraries.

@sgl.function
def tool_use(s, question):
    s += "To answer this question: " + question + ". "
    s += "I need to use a " + sgl.gen("tool", choices=["calculator", "search engine"]) + ". "

    if s["tool"] == "calculator":
        s += "The math expression is" + sgl.gen("expression")
    elif s["tool"] == "search engine":
        s += "The key word to search is" + sgl.gen("word")

Parallelism

Use fork to launch parallel prompts. Because sgl.gen is non-blocking, the for loop below issues the two generation calls in parallel.

@sgl.function
def tip_suggestion(s):
    s += (
        "Here are two tips for staying healthy: "
        "1. Balanced Diet. 2. Regular Exercise.\n\n"
    )

    forks = s.fork(2)
    for i, f in enumerate(forks):
        f += f"Now, expand tip {i+1} into a paragraph:\n"
        f += sgl.gen("detailed_tip", max_tokens=256, stop="\n\n")

    s += "Tip 1:" + forks[0]["detailed_tip"] + "\n"
    s += "Tip 2:" + forks[1]["detailed_tip"] + "\n"
    s += "In summary" + sgl.gen("summary")

Multi-Modality

Use sgl.image to pass an image as input.

@sgl.function
def image_qa(s, image_file, question):
    s += sgl.user(sgl.image(image_file) + question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=256))
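
A usage sketch (the image path below is a placeholder, and the server needs to run a vision-language model such as LLaVA):

state = image_qa.run(
    image_file="./images/cat.jpeg",  # placeholder path; point this at a real image
    question="What is in this image?",
)
print(state["answer"])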

See also local_example_llava_next.py.

Constrained Decoding

Use regex to specify a regular expression as a decoding constraint. It is only supported for local models.

@sgl.function
def regular_expression_gen(s):
    s += "Q: What is the IP address of the Google DNS servers?\n"
    s += "A: " + sgl.gen(
        "answer",
        temperature=0,
        regex=r"((25[0-5]|2[0-4]\d|[01]?\d\d?).){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)",
    )
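
You can then run the function and read the constrained output back from the state:

state = regular_expression_gen.run()
print(state["answer"])  # matches the dotted IPv4 pattern above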

JSON Decoding

Use regex to specify a JSON schema with a regular expression.

character_regex = (
    r"""\{\n"""
    + r"""    "name": "[\w\d\s]{1,16}",\n"""
    + r"""    "house": "(Gryffindor|Slytherin|Ravenclaw|Hufflepuff)",\n"""
    + r"""    "blood status": "(Pure-blood|Half-blood|Muggle-born)",\n"""
    + r"""    "occupation": "(student|teacher|auror|ministry of magic|death eater|order of the phoenix)",\n"""
    + r"""    "wand": \{\n"""
    + r"""        "wood": "[\w\d\s]{1,16}",\n"""
    + r"""        "core": "[\w\d\s]{1,16}",\n"""
    + r"""        "length": [0-9]{1,2}\.[0-9]{0,2}\n"""
    + r"""    \},\n"""
    + r"""    "alive": "(Alive|Deceased)",\n"""
    + r"""    "patronus": "[\w\d\s]{1,16}",\n"""
    + r"""    "bogart": "[\w\d\s]{1,16}"\n"""
    + r"""\}"""
)

@sgl.function
def character_gen(s, name):
    s += name + " is a character in Harry Potter. Please fill in the following information about this character.\n"
    s += sgl.gen("json_output", max_tokens=256, regex=character_regex)

See json_decode.py for another example of specifying formats with Pydantic models.

Batching

Use run_batch to run a batch of requests with continuous batching.

@sgl.function
def text_qa(s, question):
    s += "Q: " + question + "\n"
    s += "A:" + sgl.gen("answer", stop="\n")

states = text_qa.run_batch(
    [
        {"question": "What is the capital of the United Kingdom?"},
        {"question": "What is the capital of France?"},
        {"question": "What is the capital of Japan?"},
    ],
    progress_bar=True
)
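
run_batch returns one state per input dict, in the same order, so the answers can be read back directly:

for s in states:
    print(s["answer"])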

Streaming

Add stream=True to enable streaming.

@sgl.function
def text_qa(s, question):
    s += "Q: " + question + "\n"
    s += "A:" + sgl.gen("answer", stop="\n")

state = text_qa.run(
    question="What is the capital of France?",
    temperature=0.1,
    stream=True
)

for out in state.text_iter():
    print(out, end="", flush=True)

Roles

Use sgl.system, sgl.user, and sgl.assistant to set roles when using chat models. You can also define more complex role prompts using begin and end tokens.

@sgl.function
def chat_example(s):
    s += sgl.system("You are a helpful assistant.")
    # Same as: s += s.system("You are a helpful assistant.")

    with s.user():
        s += "Question: What is the capital of France?"

    s += sgl.assistant_begin()
    s += "Answer: " + sgl.gen(max_tokens=100, stop="\n")
    s += sgl.assistant_end()

Tips and Implementation Details

  • The choices argument in sgl.gen is implemented by computing the token-length normalized log probabilities of all choices and selecting the one with the highest probability.

  • The regex argument in sgl.gen is implemented through autoregressive decoding with a logit bias mask, according to the constraints set by the regex. It is compatible with both temperature=0 and temperature != 0.
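
As an illustration of the first point (a simplified sketch, not SGLang's actual implementation), token-length normalization averages the per-token log probabilities of each choice before comparing them:

# Simplified illustration of token-length normalized log probabilities.
# Each choice is scored by the *average* log probability of its tokens,
# so longer choices are not penalized just for containing more tokens.
def pick_choice(token_logprobs_per_choice):
    def score(logprobs):
        return sum(logprobs) / len(logprobs)
    return max(token_logprobs_per_choice, key=lambda c: score(token_logprobs_per_choice[c]))

print(pick_choice({
    "calculator": [-0.2, -0.1],             # made-up per-token log probabilities
    "search engine": [-0.3, -0.4, -0.2],
}))  # prints "calculator" (average -0.15 beats -0.3)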