Sampling Parameters in SGLang Runtime#

This doc describes the sampling parameters of the SGLang Runtime. It is the low-level endpoint of the runtime. If you want a high-level endpoint that can automatically handle chat templates, consider using the OpenAI Compatible API.

The /generate endpoint accepts the following arguments in JSON format.

@dataclass
class GenerateReqInput:
    # The input prompt. It can be a single prompt or a batch of prompts.
    text: Optional[Union[List[str], str]] = None
    # The token ids for text; one can either specify text or input_ids.
    input_ids: Optional[Union[List[List[int]], List[int]]] = None
    # The image input. It can be a file name, a url, or base64 encoded string.
    # See also python/sglang/srt/utils.py:load_image.
    image_data: Optional[Union[List[str], str]] = None
    # The sampling_params. See descriptions below.
    sampling_params: Union[List[Dict], Dict] = None
    # The request id.
    rid: Optional[Union[List[str], str]] = None
    # Whether to return logprobs.
    return_logprob: Optional[Union[List[bool], bool]] = None
    # The start location of the prompt for return_logprob.
    # By default, this value is "-1", which means it will only return logprobs for output tokens.
    logprob_start_len: Optional[Union[List[int], int]] = None
    # The number of top logprobs to return.
    top_logprobs_num: Optional[Union[List[int], int]] = None
    # Whether to detokenize tokens in text in the returned logprobs.
    return_text_in_logprobs: bool = False
    # Whether to stream output.
    stream: bool = False
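
For example, return_logprob can be combined with top_logprobs_num to inspect token-level probabilities. Below is a minimal sketch against a locally running server; note that the logprobs come back inside the response's meta_info field, whose exact layout may vary across versions.

import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"temperature": 0, "max_new_tokens": 8},
        # Return logprobs for the output tokens (logprob_start_len defaults to -1).
        "return_logprob": True,
        # Also return the 3 most likely alternatives at each position.
        "top_logprobs_num": 3,
        # Detokenize token ids in the returned logprobs for readability.
        "return_text_in_logprobs": True,
    },
)
print(response.json()["meta_info"])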

The sampling_params follows this format:

# The maximum number of output tokens
max_new_tokens: int = 128,
# Stop when hitting any of the strings in this list.
stop: Optional[Union[str, List[str]]] = None,
# Stop when hitting any of the token_ids in this list. Could be useful when mixed with
# `min_new_tokens`.
stop_token_ids: Optional[List[int]] = [],
# Sampling temperature
temperature: float = 1.0,
# Top-p sampling
top_p: float = 1.0,
# Top-k sampling
top_k: int = -1,
# Min-p sampling
min_p: float = 0.0,
# Whether to ignore EOS token.
ignore_eos: bool = False,
# Whether to skip the special tokens during detokenization.
skip_special_tokens: bool = True,
# Whether to add spaces between special tokens during detokenization.
spaces_between_special_tokens: bool = True,
# Constrains the output to follow a given regular expression.
regex: Optional[str] = None,
# Do parallel sampling and return `n` outputs.
n: int = 1,
# Constrains the output to follow a given JSON schema.
# `regex` and `json_schema` cannot be set at the same time.
json_schema: Optional[str] = None,

## Penalties. See [Performance Implications on Penalties] section below for more information.

# Float that penalizes new tokens based on their frequency in the generated text so far.
# Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to
# repeat tokens. Must be -2 <= value <= 2. Setting to 0 (default) will disable this penalty.
frequency_penalty: float = 0.0,
# Float that penalizes new tokens based on whether they appear in the generated text so far.
# Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat
# tokens. Must be -2 <= value <= 2. Setting to 0 (default) will disable this penalty.
presence_penalty: float = 0.0,
# Float that penalizes new tokens based on whether they appear in the prompt and the generated text
# so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to
# repeat tokens. Must be 0 <= value <= 2. Setting to 1 (default) will disable this penalty.
repetition_penalty: float = 1.0,
# Guides inference to generate at least this number of tokens by penalizing logits of tokenizer's
# EOS token and `stop_token_ids` to -inf, until the output token reaches given length.
# Note that any of the `stop` string can be generated before reaching `min_new_tokens`, as it is
# difficult to infer the correct token ID by given `stop` strings.
# Must be 0 <= value < max_new_tokens. Setting to 0 (default) will disable this penalty.
min_new_tokens: int = 0,
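
As an illustration, constrained decoding with json_schema might look like the sketch below. The schema itself is just an example; pass it as a string, and remember that regex and json_schema cannot be set at the same time.

import json
import requests

# An illustrative schema; any valid JSON schema string works here.
schema = json.dumps({
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["name", "population"],
})

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Give the name and population of the capital of France as JSON. ",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 64,
            # Pass the schema as a string; do not also set `regex`.
            "json_schema": schema,
        },
    },
)
print(response.json())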

Examples#

Normal#

Launch a server

python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000

Send a request

import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
    },
)
print(response.json())
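
Since text also accepts a batch of prompts, a batched request is a small variation of the above. A sketch: here one sampling_params dict is shared by the whole batch, while a list of dicts of the same length would set per-prompt parameters instead.

import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        # A batch of prompts; the response is a list with one entry per prompt.
        "text": [
            "The capital of France is",
            "The capital of Japan is",
        ],
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
    },
)
for item in response.json():
    print(item["text"])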

Streaming#

Send a request and stream the output

import requests, json

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
        "stream": True,
    },
    stream=True,
)

prev = 0
for chunk in response.iter_lines(decode_unicode=False):
    chunk = chunk.decode("utf-8")
    if chunk and chunk.startswith("data:"):
        if chunk == "data: [DONE]":
            break
        data = json.loads(chunk[5:].strip("\n"))
        output = data["text"].strip()
        print(output[prev:], end="", flush=True)
        prev = len(output)
print("")

Multimodal#

Launch a server

python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-7b-ov --chat-template chatml-llava

Download an image

curl -o example_image.png -L https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true

Send a request

import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
                "<|im_start|>user\n<image>\nDescribe this image in a very short sentence.<|im_end|>\n"
                "<|im_start|>assistant\n",
        "image_data": "example_image.png",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
    },
)
print(response.json())

The image_data can be a file name, a URL, or a base64 encoded string. See also python/sglang/srt/utils.py:load_image. Streaming is supported in a similar manner as above.
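
For example, to send the image as a base64 encoded string instead of a file name, a sketch reusing the request above:

import base64
import requests

# Encode the local image as a base64 string.
with open("example_image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
                "<|im_start|>user\n<image>\nDescribe this image in a very short sentence.<|im_end|>\n"
                "<|im_start|>assistant\n",
        # A base64 encoded string is accepted in place of a file name or URL.
        "image_data": image_b64,
        "sampling_params": {"temperature": 0, "max_new_tokens": 32},
    },
)
print(response.json())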

Performance Implications on Penalties#

While you can apply penalties by supplying the relevant sampling_params, this comes with some drawbacks.

The drawbacks affect every request in the same batch, because the penalizers are applied batch-wise.

Latency#

Although we try to compute the penalty algorithms via CUDA, it is still extra computation on top of the base sampling logic. For the detailed overhead we recommend running your own benchmark, but the examples below give a general picture.

Memory#

Since we compute the penalty algorithms via CUDA, the logic keeps the relevant parameters on the GPU. This is usually on the scale of vocab_size multiplied by running_requests.
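
As a back-of-envelope sketch (assuming one float32 tensor per penalizer, a 128256-token vocabulary as in Llama 3, and 2048 running requests; all three numbers are assumptions to adjust for your setup):

# Rough estimate of GPU memory held by one vocab-sized penalty tensor.
vocab_size = 128256          # e.g. Llama 3 vocabulary (assumption)
max_running_requests = 2048  # server setting (assumption)
bytes_per_element = 4        # float32 (assumption)

estimate_gb = vocab_size * max_running_requests * bytes_per_element / 1e9
print(f"~{estimate_gb:.2f} GB per penalty tensor")  # ~1.05 GB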

You can run your own benchmark with the desired parameters on your own hardware to make sure it does not OOM before using it.

Tuning --mem-fraction-static and/or --max-running-requests will help. See here for more information.
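
For example, a launch command with both knobs set (illustrative values; tune them for your model and hardware):

python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000 --mem-fraction-static 0.8 --max-running-requests 512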

Benchmark#

All of the benchmarks below were run on an NVIDIA H100 SXM5.

Baseline#

Measured at commit dc9d06d886151707f97d0b78095df9de262fd3c9.

$ python3 -m sglang.bench_serving --backend sglang --port 8413 --dataset-name random --num-prompts 3000 --random-input 256 --random-output 512

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     3000
Benchmark duration (s):                  66.11
Total input tokens:                      378633
Total generated tokens:                  775651
Total generated tokens (retokenized):    775118
Request throughput (req/s):              45.38
Input token throughput (tok/s):          5727.04
Output token throughput (tok/s):         11732.16
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   40881.94
Median E2E Latency (ms):                 43967.10
---------------Time to First Token----------------
Mean TTFT (ms):                          19884.75
Median TTFT (ms):                        14226.56
P99 TTFT (ms):                           47738.97
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          91.96
Median TPOT (ms):                        90.11
P99 TPOT (ms):                           308.54
---------------Inter-token Latency----------------
Mean ITL (ms):                           174.54
Median ITL (ms):                         58.56
P99 ITL (ms):                            440.18
==================================================

All Together#

$ python3 -m sglang.bench_serving --backend sglang --port 8413 --dataset-name random --num-prompts 3000 --random-input 256 --random-output 512 --extra-request-body '{
  "frequency_penalty": 1.1,
  "presence_penalty": 1.1,
  "repetition_penalty": 0.1,
  "min_new_tokens": 5
}'

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     3000
Benchmark duration (s):                  78.35
Total input tokens:                      378633
Total generated tokens:                  775651
Total generated tokens (retokenized):    774756
Request throughput (req/s):              38.29
Input token throughput (tok/s):          4832.86
Output token throughput (tok/s):         9900.39
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   49017.68
Median E2E Latency (ms):                 52825.70
---------------Time to First Token----------------
Mean TTFT (ms):                          23892.60
Median TTFT (ms):                        18895.47
P99 TTFT (ms):                           57426.01
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          114.54
Median TPOT (ms):                        107.27
P99 TPOT (ms):                           293.31
---------------Inter-token Latency----------------
Mean ITL (ms):                           205.68
Median ITL (ms):                         73.97
P99 ITL (ms):                            453.86
==================================================

Frequency Penalty#

$ python3 -m sglang.bench_serving --backend sglang --port 8413 --dataset-name random --num-prompts 3000 --random-input 256 --random-output 512 --extra-request-body '{
    "frequency_penalty": 1.1
}'

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     3000
Benchmark duration (s):                  72.72
Total input tokens:                      378633
Total generated tokens:                  775651
Total generated tokens (retokenized):    774955
Request throughput (req/s):              41.26
Input token throughput (tok/s):          5206.84
Output token throughput (tok/s):         10666.51
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   45445.56
Median E2E Latency (ms):                 48960.39
---------------Time to First Token----------------
Mean TTFT (ms):                          22363.16
Median TTFT (ms):                        17125.02
P99 TTFT (ms):                           52920.95
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          104.71
Median TPOT (ms):                        98.30
P99 TPOT (ms):                           268.06
---------------Inter-token Latency----------------
Mean ITL (ms):                           191.60
Median ITL (ms):                         67.83
P99 ITL (ms):                            455.46
==================================================

Presence Penalty#

$ python3 -m sglang.bench_serving --backend sglang --port 8413 --dataset-name random --num-prompts 3000 --random-input 256 --random-output 512 --extra-request-body '{
    "presence_penalty": 1.1
}'

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     3000
Benchmark duration (s):                  72.04
Total input tokens:                      378633
Total generated tokens:                  775651
Total generated tokens (retokenized):    775210
Request throughput (req/s):              41.64
Input token throughput (tok/s):          5255.98
Output token throughput (tok/s):         10767.18
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   44926.61
Median E2E Latency (ms):                 48302.88
---------------Time to First Token----------------
Mean TTFT (ms):                          22095.39
Median TTFT (ms):                        16740.93
P99 TTFT (ms):                           52554.03
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          103.54
Median TPOT (ms):                        97.37
P99 TPOT (ms):                           271.86
---------------Inter-token Latency----------------
Mean ITL (ms):                           189.86
Median ITL (ms):                         68.45
P99 ITL (ms):                            447.11
==================================================

Repetition Penalty#

$ python3 -m sglang.bench_serving --backend sglang --port 8413 --dataset-name random --num-prompts 3000 --random-input 256 --random-output 512 --extra-request-body '{
    "repetition_penalty": 0.1
}'

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     3000
Benchmark duration (s):                  74.54
Total input tokens:                      378633
Total generated tokens:                  775651
Total generated tokens (retokenized):    766008
Request throughput (req/s):              40.24
Input token throughput (tok/s):          5079.36
Output token throughput (tok/s):         10405.35
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   46530.38
Median E2E Latency (ms):                 50302.65
---------------Time to First Token----------------
Mean TTFT (ms):                          22603.47
Median TTFT (ms):                        17167.08
P99 TTFT (ms):                           54497.85
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          117.59
Median TPOT (ms):                        101.79
P99 TPOT (ms):                           320.04
---------------Inter-token Latency----------------
Mean ITL (ms):                           195.26
Median ITL (ms):                         69.51
P99 ITL (ms):                            433.86
==================================================

Min New Tokens#

The min new tokens penalizer keeps running until the generation process reaches the given min_new_tokens.

Unlike the other penalizers, setting this to a higher value has a larger latency impact.
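
For example, a request that forces at least a few tokens to be generated before EOS can terminate the output (a sketch; pair it with stop_token_ids if you also use custom stop tokens):

import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
            # Penalize EOS and any stop_token_ids until 5 tokens are generated.
            "min_new_tokens": 5,
        },
    },
)
print(response.json())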

$ python3 -m sglang.bench_serving --backend sglang --port 8413 --dataset-name random --num-prompts 3000 --random-input 256 --random-output 512 --extra-request-body '{
    "min_new_tokens": 5
}'

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     3000
Benchmark duration (s):                  66.94
Total input tokens:                      378633
Total generated tokens:                  775651
Total generated tokens (retokenized):    775220
Request throughput (req/s):              44.81
Input token throughput (tok/s):          5656.13
Output token throughput (tok/s):         11586.90
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   41888.55
Median E2E Latency (ms):                 45354.16
---------------Time to First Token----------------
Mean TTFT (ms):                          20866.91
Median TTFT (ms):                        16219.79
P99 TTFT (ms):                           49263.91
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          97.05
Median TPOT (ms):                        89.76
P99 TPOT (ms):                           233.50
---------------Inter-token Latency----------------
Mean ITL (ms):                           179.17
Median ITL (ms):                         55.08
P99 ITL (ms):                            409.12
==================================================