Frontend: Structured Generation Language (SGLang)#
The frontend language can be used with both local models and API models. It serves as an alternative to the OpenAI API, and you may find it easier to use for complex prompting workflows.
Quick Start#
The example below shows how to use sglang to answer a multi-turn question.
Using Local Models#
First, launch a server with:
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000
Then, connect to the server and answer a multi-turn question.
from sglang import function, system, user, assistant, gen, set_default_backend, RuntimeEndpoint

@function
def multi_turn_question(s, question_1, question_2):
    s += system("You are a helpful assistant.")
    s += user(question_1)
    s += assistant(gen("answer_1", max_tokens=256))
    s += user(question_2)
    s += assistant(gen("answer_2", max_tokens=256))

set_default_backend(RuntimeEndpoint("http://localhost:30000"))

state = multi_turn_question.run(
    question_1="What is the capital of the United States?",
    question_2="List two local attractions.",
)

for m in state.messages():
    print(m["role"], ":", m["content"])

print(state["answer_1"])
Using OpenAI Models#
Set the OpenAI API key:
export OPENAI_API_KEY=sk-******
Then, answer a multi-turn question.
from sglang import function, system, user, assistant, gen, set_default_backend, OpenAI

@function
def multi_turn_question(s, question_1, question_2):
    s += system("You are a helpful assistant.")
    s += user(question_1)
    s += assistant(gen("answer_1", max_tokens=256))
    s += user(question_2)
    s += assistant(gen("answer_2", max_tokens=256))

set_default_backend(OpenAI("gpt-3.5-turbo"))

state = multi_turn_question.run(
    question_1="What is the capital of the United States?",
    question_2="List two local attractions.",
)

for m in state.messages():
    print(m["role"], ":", m["content"])

print(state["answer_1"])
More Examples#
Anthropic and VertexAI (Gemini) models are also supported. You can find more examples at examples/quick_start.
Language Features#
To begin with, import sglang.
import sglang as sgl
sglang provides some simple primitives, such as gen, select, fork, and image. You can implement your prompt flow in a function decorated by sgl.function. You can then invoke the function with run or run_batch. The system will manage the state, chat template, parallelism, and batching for you.
The complete code for the examples below can be found in readme_examples.py.
Control Flow#
You can use any Python code within the function body, including control flow, nested function calls, and external libraries.
@sgl.function
def tool_use(s, question):
    s += "To answer this question: " + question + ". "
    s += "I need to use a " + sgl.gen("tool", choices=["calculator", "search engine"]) + ". "

    if s["tool"] == "calculator":
        s += "The math expression is" + sgl.gen("expression")
    elif s["tool"] == "search engine":
        s += "The key word to search is" + sgl.gen("word")
Parallelism#
Use fork to launch parallel prompts. Because sgl.gen is non-blocking, the for loop below issues the two generation calls in parallel.
@sgl.function
def tip_suggestion(s):
    s += (
        "Here are two tips for staying healthy: "
        "1. Balanced Diet. 2. Regular Exercise.\n\n"
    )

    forks = s.fork(2)
    for i, f in enumerate(forks):
        f += f"Now, expand tip {i+1} into a paragraph:\n"
        f += sgl.gen(f"detailed_tip", max_tokens=256, stop="\n\n")

    s += "Tip 1:" + forks[0]["detailed_tip"] + "\n"
    s += "Tip 2:" + forks[1]["detailed_tip"] + "\n"
    s += "In summary" + sgl.gen("summary")
Multi-Modality#
Use sgl.image to pass an image as input.
@sgl.function
def image_qa(s, image_file, question):
    s += sgl.user(sgl.image(image_file) + question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=256))
Constrained Decoding#
Use regex to specify a regular expression as a decoding constraint. This is only supported for local models.
@sgl.function
def regular_expression_gen(s):
    s += "Q: What is the IP address of the Google DNS servers?\n"
    s += "A: " + sgl.gen(
        "answer",
        temperature=0,
        regex=r"((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)",
    )
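As a standalone sanity check outside of sglang, the constraint pattern can be exercised with Python's re module (this snippet is illustrative and not part of the sglang API):

```python
import re

# The same pattern passed to sgl.gen above. Any answer the constrained
# decoder commits to must match this pattern in full.
ip_pattern = r"((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)"

print(re.fullmatch(ip_pattern, "8.8.8.8") is not None)    # True
print(re.fullmatch(ip_pattern, "256.1.1.1") is not None)  # False: 256 exceeds an octet
```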
JSON Decoding#
Use regex to specify a JSON schema with a regular expression.
character_regex = (
    r"""\{\n"""
    + r""" "name": "[\w\d\s]{1,16}",\n"""
    + r""" "house": "(Gryffindor|Slytherin|Ravenclaw|Hufflepuff)",\n"""
    + r""" "blood status": "(Pure-blood|Half-blood|Muggle-born)",\n"""
    + r""" "occupation": "(student|teacher|auror|ministry of magic|death eater|order of the phoenix)",\n"""
    + r""" "wand": \{\n"""
    + r""" "wood": "[\w\d\s]{1,16}",\n"""
    + r""" "core": "[\w\d\s]{1,16}",\n"""
    + r""" "length": [0-9]{1,2}\.[0-9]{0,2}\n"""
    + r""" \},\n"""
    + r""" "alive": "(Alive|Deceased)",\n"""
    + r""" "patronus": "[\w\d\s]{1,16}",\n"""
    + r""" "bogart": "[\w\d\s]{1,16}"\n"""
    + r"""\}"""
)
@sgl.function
def character_gen(s, name):
    s += name + " is a character in Harry Potter. Please fill in the following information about this character.\n"
    s += sgl.gen("json_output", max_tokens=256, regex=character_regex)
See json_decode.py for an additional example of specifying formats using Pydantic models.
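To see how the pattern pins down the shape of the output, here is a standalone check (using only Python's re and json modules; the sample character values are made up for illustration) that a string matching character_regex is also valid JSON:

```python
import json
import re

# Same pattern as character_regex above.
character_regex = (
    r"""\{\n"""
    + r""" "name": "[\w\d\s]{1,16}",\n"""
    + r""" "house": "(Gryffindor|Slytherin|Ravenclaw|Hufflepuff)",\n"""
    + r""" "blood status": "(Pure-blood|Half-blood|Muggle-born)",\n"""
    + r""" "occupation": "(student|teacher|auror|ministry of magic|death eater|order of the phoenix)",\n"""
    + r""" "wand": \{\n"""
    + r""" "wood": "[\w\d\s]{1,16}",\n"""
    + r""" "core": "[\w\d\s]{1,16}",\n"""
    + r""" "length": [0-9]{1,2}\.[0-9]{0,2}\n"""
    + r""" \},\n"""
    + r""" "alive": "(Alive|Deceased)",\n"""
    + r""" "patronus": "[\w\d\s]{1,16}",\n"""
    + r""" "bogart": "[\w\d\s]{1,16}"\n"""
    + r"""\}"""
)

# A made-up completion of the kind the constrained decoder could emit.
sample = """{
 "name": "Harry Potter",
 "house": "Gryffindor",
 "blood status": "Half-blood",
 "occupation": "student",
 "wand": {
 "wood": "holly",
 "core": "phoenix feather",
 "length": 11.0
 },
 "alive": "Alive",
 "patronus": "stag",
 "bogart": "dementor"
}"""

# The sample matches the constraint and parses as JSON.
print(re.fullmatch(character_regex, sample) is not None)  # True
print(json.loads(sample)["wand"]["core"])                 # phoenix feather
```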
Batching#
Use run_batch to run a batch of requests with continuous batching.
@sgl.function
def text_qa(s, question):
    s += "Q: " + question + "\n"
    s += "A:" + sgl.gen("answer", stop="\n")

states = text_qa.run_batch(
    [
        {"question": "What is the capital of the United Kingdom?"},
        {"question": "What is the capital of France?"},
        {"question": "What is the capital of Japan?"},
    ],
    progress_bar=True
)
Streaming#
Add stream=True to enable streaming.
@sgl.function
def text_qa(s, question):
    s += "Q: " + question + "\n"
    s += "A:" + sgl.gen("answer", stop="\n")

state = text_qa.run(
    question="What is the capital of France?",
    temperature=0.1,
    stream=True
)

for out in state.text_iter():
    print(out, end="", flush=True)
Roles#
Use sgl.system, sgl.user and sgl.assistant to set roles when using a chat model. You can also define more complex role prompts using begin and end tokens.
@sgl.function
def chat_example(s):
    s += sgl.system("You are a helpful assistant.")
    # Same as: s += s.system("You are a helpful assistant.")

    with s.user():
        s += "Question: What is the capital of France?"

    s += sgl.assistant_begin()
    s += "Answer: " + sgl.gen(max_tokens=100, stop="\n")
    s += sgl.assistant_end()
Tips and Implementation Details#
The choices argument in sgl.gen is implemented by computing the token-length normalized log probabilities of all options and selecting the one with the highest probability.
The regex argument in sgl.gen is implemented through autoregressive decoding with a logit bias mask, according to the constraints set by the regex. It is compatible with both temperature=0 and temperature != 0.
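The length normalization behind choices can be sketched in plain Python (the token splits and log-probability values below are invented for illustration, not real model outputs):

```python
# Hypothetical per-token log probabilities for each option, as if each
# choice had been scored token by token under the same prompt.
choice_logprobs = {
    "calculator": [-0.2, -0.1, -0.3],                   # 3 tokens, sum = -0.6
    "search engine": [-0.25, -0.2, -0.15, -0.1, -0.2],  # 5 tokens, sum = -0.9
}

# Raw sums would favor the shorter option ("calculator": -0.6 > -0.9),
# so each sum is divided by its token count before comparing.
def normalized_logprob(logprobs):
    return sum(logprobs) / len(logprobs)

best = max(choice_logprobs, key=lambda c: normalized_logprob(choice_logprobs[c]))
print(best)  # search engine (-0.18 beats -0.2 after normalization)
```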