[Bug] TP Worker CUDA Graph Capture NCCL Error
Issue Description
We are hitting a segmentation fault while capturing CUDA graphs in our TP worker. The crash occurs during the ncclAllGather operation, which is used for tensor model parallelism, and the stack trace indicates that the error originates in the ncclGroupCommJoin function.
Reproduction Steps
To reproduce the issue, follow these steps:
- Run the launch_server script with the following command to start the prefill server:
python3.10 -m sglang.launch_server --host 0.0.0.0 --nnodes 1 --node-rank 0 --dist-init-addr ytn0:7010 --model-path /home/qspace/upload/luban_cache/model/luban-llm_deepseek_r1_distill_qwen_1_5b-model_path/DeepSeek-R1-Distill-Qwen-1.5B --trust-remote-code --disable-radix-cache --schedule-policy fcfs --mem-fraction-static 0.70 --disable-overlap-schedule --chunked-prefill-size 32768 --allow-auto-truncate --tp 4 --log-level debug --enable-metrics --page-size 64 --disaggregation-mode prefill --disaggregation-transfer-backend nixl --disaggregation-bootstrap-port 7100 --max-running-requests 32 --port 8080
- Run the launch_server script again with the following command to start the decode server:
python3.10 -m sglang.launch_server --host 0.0.0.0 --nnodes 1 --node-rank 0 --dist-init-addr ytn0:7020 --model-path /home/qspace/upload/luban_cache/model/luban-llm_deepseek_r1_distill_qwen_1_5b-model_path/DeepSeek-R1-Distill-Qwen-1.5B --trust-remote-code --disable-radix-cache --schedule-policy fcfs --mem-fraction-static 0.70 --disable-overlap-schedule --chunked-prefill-size 32768 --allow-auto-truncate --tp 4 --log-level debug --enable-metrics --page-size 64 --disaggregation-mode decode --disaggregation-transfer-backend nixl --disaggregation-bootstrap-port 7100 --max-running-requests 32 --port 9080
- Observe the error message and stack trace. A minimal standalone sketch that isolates the captured all-gather path is shown after this list.
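If it helps triage, here is a minimal standalone sketch (not the SGLang code path) that captures the same kind of ncclAllGather inside a CUDA graph via torch.distributed. It assumes a single node with 4 GPUs, a recent PyTorch built with the NCCL backend, and a hypothetical file name repro_allgather_graph.py; if this also segfaults, the problem is likely in the NCCL/PyTorch capture path rather than in SGLang's TP worker.

```python
# Hypothetical minimal repro (repro_allgather_graph.py), NOT the SGLang code path.
# It only checks whether this NCCL/PyTorch combination can capture an
# all-gather inside a CUDA graph without crashing.
# Run with: torchrun --nproc_per_node=4 repro_allgather_graph.py
import os

import torch
import torch.distributed as dist


def main() -> None:
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    world_size = dist.get_world_size()

    x = torch.randn(1024, device="cuda")
    out = torch.empty(world_size * 1024, device="cuda")

    # Warm up the communicator outside the graph so NCCL finishes its lazy
    # initialization before capture begins.
    dist.all_gather_into_tensor(out, x)
    torch.cuda.synchronize()

    # Capture the collective inside a CUDA graph; this mirrors the spot where
    # the TP worker reportedly segfaults in ncclGroupCommJoin.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        dist.all_gather_into_tensor(out, x)

    graph.replay()
    torch.cuda.synchronize()
    if dist.get_rank() == 0:
        print("captured ncclAllGather replayed successfully")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```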
Environment
NCCL version: 2.25.1
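For completeness, a hedged way to double-check which NCCL version PyTorch itself was built against (this is plain PyTorch, not SGLang-specific):

```python
# Print the torch / CUDA / NCCL versions the Python process will use.
# torch.cuda.nccl.version() returns a (major, minor, patch) tuple on recent
# PyTorch releases (older releases return a single int).
import torch

print("torch:", torch.__version__)
print("cuda :", torch.version.cuda)
print("nccl :", torch.cuda.nccl.version())
```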
Stacktrace
The stacktrace is as follows:
[2025-04-26 20:55:30] Rank 0 scheduler is dead. Please check if there are relevant logs.
[2025-04-26 20:55:32] Child process unexpectedly failed with an exit code 11. pid=293198
[2025-04-26 20:55:32] Child process unexpectedly failed with an exit code 11. pid=293197
[2025-04-26 20:55:32] Child process unexpectedly failed with an exit code 11. pid=293196
[2025-04-26 20:55:32] Exit code: -11
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, inrun_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/sgl-workspace/sglang/python/sglang/launch_server.py", line 14, in <module>
launch_server(server_args)
File "/sgl-workspace/sglang/python/sglang/srt/entrypoints/http_server.py", line 700, in launch_server
tokenizer_manager, scheduler_info = _launch_subprocesses(server_args=server_args)
File "/sgl-workspace/sglang/python/sglang/srt/entrypoints/engine.py", line 586, in _launch_subprocesses
data = scheduler_pipe_readers[i].recv()
File "/usr/lib/python3.10/multiprocessing/connection.py", line 250, in recv
buf = self._recv_bytes()
File "/usr/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
buf = self._recv(4)
File "/usr/lib/python3.10/multiprocessing/connection.py", line 383, in _recv
raise EOFError
EOFError
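Note that the traceback above only shows the parent process noticing that a scheduler child died; the segmentation fault itself happens inside the child during graph capture. One hedged way to get more signal, assuming you can put a sitecustomize.py on the workers' PYTHONPATH (or add the snippet early in the child's entrypoint), is to enable faulthandler and NCCL debug logging before the process group is created:

```python
# Hedged debugging aid (e.g. in a sitecustomize.py visible to the child
# processes). faulthandler dumps every thread's Python stack if the child hits
# SIGSEGV, and NCCL_DEBUG=INFO makes NCCL log communicator/collective activity.
# Both must take effect before NCCL is initialized in that process.
import faulthandler
import os
import sys

os.environ.setdefault("NCCL_DEBUG", "INFO")
faulthandler.enable(file=sys.stderr, all_threads=True)
```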
Additional Information
- The error occurs during the ncclAllGather operation, which is used for tensor model parallelism.
- The stack trace indicates that the error is related to the ncclGroupCommJoin function.
- The NCCL version is 2.25.1.
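The NCCL version reported in the environment and the libnccl that PyTorch actually loads can differ (for example, a pip-installed nvidia-nccl wheel next to a system libnccl). Below is a hedged, Linux-only sketch to see which library is mapped into the running process; nothing here is SGLang-specific:

```python
# List the libnccl shared objects mapped into this process (Linux-only; reads
# /proc/self/maps). Importing torch usually pulls in the bundled libnccl, but
# if nothing is mapped yet the fallback message is printed instead.
import torch  # noqa: F401  (imported for the side effect of loading CUDA libs)

with open("/proc/self/maps") as maps:
    libs = sorted({line.split()[-1] for line in maps if "libnccl" in line})
print("\n".join(libs) or "no libnccl mapped yet (NCCL not initialized)")
```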
Expected Behavior
The ncclAllGather operation should complete successfully without any errors.
Actual Behavior
The TP worker crashes with a segmentation fault (child exit code 11) during CUDA graph capture: the crash occurs inside the captured ncclAllGather call, and the stack trace points to ncclGroupCommJoin. The parent process only observes that the scheduler child died and exits with the EOFError shown above.
The root cause is not yet clear and a fix is not yet available. It is believed to be related to the ncclGroupCommJoin function, which NCCL calls internally when the tensor-parallel all-gather is enqueued. In the meantime, we recommend confirming that the NCCL version (2.25.1 here) is up to date, and, where possible, avoiding the code path that triggers ncclGroupCommJoin during capture, for example by using a different tensor model parallelism configuration.