[Bug] TP Worker CUDA Graph Capture NCCL Error


Issue Description

We are hitting a segmentation fault while capturing CUDA graphs in our TP worker. The crash occurs during the ncclAllGather operation, which is used for tensor model parallelism, and the stacktrace indicates the failure is related to the ncclGroupCommJoin function.
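
For context, the failure happens when an NCCL allgather is issued while a CUDA graph is being captured. The snippet below is a minimal stand-alone sketch of that pattern in plain PyTorch, not code from sglang, and it has not been verified to reproduce this crash; the 4-GPU world size, tensor sizes, and rendezvous address are illustrative assumptions.

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    # single-node rendezvous; address and port are placeholders for illustration
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    x = torch.ones(1024, device="cuda")
    out = torch.empty(world_size * x.numel(), device="cuda")

    # warm up the collective outside capture so the NCCL communicators exist
    dist.all_gather_into_tensor(out, x)
    torch.cuda.synchronize()

    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        # this is the ncclAllGather call that segfaults during capture in our setup
        dist.all_gather_into_tensor(out, x)

    g.replay()
    torch.cuda.synchronize()
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(4,), nprocs=4)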

Reproduction Steps

To reproduce the issue, follow these steps:

  1. Launch the prefill server with the following command:
python3.10 -m sglang.launch_server --host 0.0.0.0 --nnodes 1 --node-rank 0 --dist-init-addr ytn0:7010 --model-path /home/qspace/upload/luban_cache/model/luban-llm_deepseek_r1_distill_qwen_1_5b-model_path/DeepSeek-R1-Distill-Qwen-1.5B --trust-remote-code --disable-radix-cache --schedule-policy fcfs --mem-fraction-static 0.70 --disable-overlap-schedule --chunked-prefill-size 32768 --allow-auto-truncate --tp 4 --log-level debug --enable-metrics --page-size 64 --disaggregation-mode prefill --disaggregation-transfer-backend nixl --disaggregation-bootstrap-port 7100 --max-running-requests 32 --port 8080
  2. Launch the decode server with the following command:
python3.10 -m sglang.launch_server --host 0.0.0.0 --nnodes 1 --node-rank 0 --dist-init-addr ytn0:7020 --model-path /home/qspace/upload/luban_cache/model/luban-llm_deepseek_r1_distill_qwen_1_5b-model_path/DeepSeek-R1-Distill-Qwen-1.5B --trust-remote-code --disable-radix-cache --schedule-policy fcfs --mem-fraction-static 0.70 --disable-overlap-schedule --chunked-prefill-size 32768 --allow-auto-truncate --tp 4 --log-level debug --enable-metrics --page-size 64 --disaggregation-mode decode --disaggregation-transfer-backend nixl --disaggregation-bootstrap-port 7100 --max-running-requests 32 --port 9080
  3. Observe the error message and stacktrace; a launcher sketch that adds NCCL debug logging follows these steps.
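
For a more informative failure log, the servers can be relaunched with NCCL debug output and the CPython fault handler enabled. The wrapper below is a minimal sketch around the step 1 command; NCCL_DEBUG, NCCL_DEBUG_SUBSYS, CUDA_LAUNCH_BLOCKING, and PYTHONFAULTHANDLER are standard NCCL/CUDA/CPython environment variables, but whether they expose the ncclGroupCommJoin frame in this particular crash has not been verified.

import os
import subprocess
import sys

env = dict(os.environ)
env["NCCL_DEBUG"] = "INFO"                    # verbose NCCL init/collective logging
env["NCCL_DEBUG_SUBSYS"] = "INIT,COLL,GRAPH"  # focus on init, collectives, and graph setup
env["CUDA_LAUNCH_BLOCKING"] = "1"             # make the failing CUDA call report synchronously
env["PYTHONFAULTHANDLER"] = "1"               # dump the scheduler's Python stack on SIGSEGV

cmd = [
    sys.executable, "-m", "sglang.launch_server",
    # append the full flag list exactly as in step 1 (prefill) or step 2 (decode)
]
subprocess.run(cmd, env=env, check=False)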

Environment

  • NCCL version: 2.25.1 (a runtime consistency check is sketched below)
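
A mismatch between the NCCL that PyTorch was built against and the libnccl picked up at runtime is a common source of crashes in collectives, so a quick consistency check may help. This is a minimal sketch; the libnccl.so.2 soname is an assumption and may need to be adjusted for your install.

import ctypes

import torch

# NCCL version PyTorch was built/bundled with
print("torch NCCL:", torch.cuda.nccl.version())

# NCCL library actually resolved at runtime (soname is an assumption)
lib = ctypes.CDLL("libnccl.so.2")
v = ctypes.c_int()
lib.ncclGetVersion(ctypes.byref(v))
# NCCL >= 2.9 encodes the version as major * 10000 + minor * 100 + patch
code = v.value
print("libnccl:", f"{code // 10000}.{(code % 10000) // 100}.{code % 100}")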

Stacktrace

The stacktrace is as follows (exit code 11 corresponds to SIGSEGV in the scheduler child processes; the EOFError in the parent is only a consequence of the dead scheduler closing its pipe):

[2025-04-26 20:55:30] Rank 0 scheduler is dead. Please check if there are relevant logs.
[2025-04-26 20:55:32] Child process unexpectedly failed with an exit code 11. pid=293198
[2025-04-26 20:55:32] Child process unexpectedly failed with an exit code 11. pid=293197
[2025-04-26 20:55:32] Child process unexpectedly failed with an exit code 11. pid=293196
[2025-04-26 20:55:32] Exit code: -11
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, inrun_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/sgl-workspace/sglang/python/sglang/launch_server.py", line 14, in <module>
    launch_server(server_args)
  File "/sgl-workspace/sglang/python/sglang/srt/entrypoints/http_server.py", line 700, in launch_server
    tokenizer_manager, scheduler_info = _launch_subprocesses(server_args=server_args)
  File "/sgl-workspace/sglang/python/sglang/srt/entrypoints/engine.py", line 586, in _launch_subprocesses
    data = scheduler_pipe_readers[i].recv()
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError

Additional Information

  • The error occurs during the ncclAllGather operation, which is used for tensor model parallelism.
  • The stacktrace indicates that the error is related to the ncclGroupCommJoin function.
  • NCCL version: 2.25.1.

Expected Behavior

The ncclAllGather operation should complete successfully without any errors.

Actual Behavior

The TP worker schedulers crash with a segmentation fault (exit code 11) inside the ncclAllGather call while capturing CUDA graphs; rank 0's scheduler is reported dead, and the parent process then raises EOFError when reading from the scheduler pipe, as shown in the stacktrace above. The root cause is not yet clear, but the failure appears to involve ncclGroupCommJoin, which is reached from the tensor-parallel allgather.

Workarounds

No fix is available yet; the issue is still under investigation. In the meantime, verify that the NCCL version loaded at runtime matches the one PyTorch was built against (2.25.1 here), and try relaunching with CUDA graph capture disabled (the --disable-cuda-graph server flag) to confirm that the capture path is the trigger; a rough check is sketched below.
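
A rough way to run that check: relaunch the prefill server with capture disabled and poll it until it answers. The /health endpoint and the 120-second timeout are assumptions about the deployment; adjust as needed.

import subprocess
import sys
import time
import urllib.request

cmd = [
    sys.executable, "-m", "sglang.launch_server",
    "--disable-cuda-graph",
    # append the remaining flags exactly as in step 1 of the reproduction section
]
proc = subprocess.Popen(cmd)

deadline = time.time() + 120
while time.time() < deadline:
    try:
        with urllib.request.urlopen("http://127.0.0.1:8080/health", timeout=2):
            print("server came up with CUDA graphs disabled")
            break
    except OSError:
        time.sleep(5)
else:
    print("server did not come up; the crash is probably not capture-specific")
proc.terminate()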