[BUG]: Workers Using PCIe Instead Of NVLink On The Same Node


Describe the Bug

Understanding the Issue

When Nsight Systems was used to monitor the transfer bandwidth for a single request, the NVLink bandwidth utilization between the two workers was 0 while the PCIe bandwidth utilization was not, suggesting that the transfer went over PCIe instead of NVLink. This behavior is unexpected and may have significant implications for the performance and efficiency of the system.

The Importance of NVLink

NVLink is a high-speed interconnect that provides low-latency, high-bandwidth data transfer between GPUs on the same node and is an essential component of many high-performance computing workloads. In this disaggregated prefill/decode setup, using NVLink for the KV-cache transfer between co-located workers is critical for achieving good performance and efficiency.
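
As a first sanity check, the interconnect between the two GPUs can be inspected with standard NVIDIA tooling. A minimal sketch (which GPU indices to look at depends on which devices the prefill and decode workers are pinned to):

# Show the interconnect matrix for all GPUs on the node. GPU pairs linked
# by NVLink are reported as NV1/NV2/..., while PIX/PXB/PHB/SYS indicate a
# PCIe or socket-level path.
nvidia-smi topo -m

# Report the per-link NVLink state (link speed and whether each link is up).
nvidia-smi nvlink --status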

Steps to Reproduce

Reproducing the Issue

To reproduce the issue, the following configuration file was used:

Common:
  model: /models/Llama-3.1-8b-Instruct
  block-size: 64
  max-model-len: 3500
  router: kv
  kv-transfer-config: '{"kv_connector":"DynamoNixlConnector"}'

Frontend:
  served_model_name: Llama-3.1-8b-Instruct
  endpoint: dynamo.Processor.chat/completions
  port: 8000

Processor:
  common-configs: [model, block-size, max-model-len, router]

Router:
  min-workers: 1
  common-configs: [model]

VllmWorker:
  # max-num-batched-tokens: 3500
  remote-prefill: true
  conditional-disagg: false
  tensor-parallel-size: 1
  enable-prefix-caching: true
  ServiceArgs:
    workers: 1
    resources:
      gpu: 1
  common-configs: [model, block-size, max-model-len, router, kv-transfer-config]

PrefillWorker:
  max-num-batched-tokens: 3500
  gpu-memory-utilization: 0.98
  tensor-parallel-size: 1
  ServiceArgs:
    workers: 1
    resources:
      gpu: 1
  common-configs: [model, block-size, max-model-len, kv-transfer-config]

Running the Profile Command

The profile command used to capture the issue was:

nsys profile --gpu-metrics-devices all --trace cuda,nvtx -o nsys_nixl_sys_new_70b -f true dynamo serve graphs.disagg_router:Frontend -f configs/disagg_router.yaml
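
The DynamoNixlConnector moves the KV cache through NIXL. Assuming NIXL is running on its UCX backend and that the workers inherit these environment variables (both are assumptions, not confirmed by the report), UCX's own logging can show which transport is actually chosen for the intra-node transfer, for example:

# Assumption: the NIXL backend is UCX and honours these variables.
# UCX_LOG_LEVEL=info and UCX_PROTO_INFO=y make UCX print the transports it
# selects per endpoint pair (e.g. cuda_ipc for NVLink/PCIe peer-to-peer
# copies versus cuda_copy/tcp for staged copies through host memory).
UCX_LOG_LEVEL=info UCX_PROTO_INFO=y nsys profile --gpu-metrics-devices all --trace cuda,nvtx -o nsys_nixl_sys_new_70b -f true dynamo serve graphs.disagg_router:Frontend -f configs/disagg_router.yaml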

Expected Behavior

The expected behavior is that the NVLink bandwidth utilization between the two workers is non-zero, indicating that NVLink is used for the KV-cache transfer, as intended.

Actual Behavior

In addition to the initial observation of 0 NVLink bandwidth utilization for a single request, NVLink traffic was unexpectedly observed during benchmark testing with Llama-3.1-8b-Instruct (see the screenshots below). For the single-request case this suggests that PCIe is being used for the transfer, which is unexpected and may have significant implications for the performance and efficiency of the system.
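
Independently of Nsight Systems, NVLink versus PCIe traffic can be cross-checked with DCGM's profiling counters while a request is in flight. A sketch, assuming dcgmi and the DCGM host engine are available inside the container:

# Sample NVLink and PCIe byte counters once per second on all GPUs.
# Field IDs: 1011/1012 = NVLink TX/RX bytes, 1009/1010 = PCIe TX/RX bytes.
dcgmi dmon -e 1011,1012,1009,1010 -d 1000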

Environment


The environment in which the issue was observed is as follows:

  • Installation method: container/build.sh
  • ai-dynamo version: 0.2.0
  • Operating System: Ubuntu 24.04
  • Docker Version: 28.0.4

Additional Context

No additional context is available at this time.

Screenshots

The following screenshots illustrate the issue:

Image

The lower timeline is the prefill worker's GPU and the upper one is the decode worker's GPU; the NVLink bandwidth utilization shown between them is 0.

Image

However, in the Llama-3.1-8b-Instruct benchmark test, the NVLink bandwidth utilization was unexpectedly non-zero.

Conclusion

In conclusion, workers on the same node apparently using PCIe instead of NVLink for KV-cache transfers is a problem that needs to be addressed. NVLink is expected to carry these transfers, yet the single-request profile shows no NVLink traffic, which has significant implications for the performance and efficiency of the system. Further investigation is needed to determine the root cause and to develop a fix.

Recommendations

Based on the analysis of the issue, the following recommendations are made:

  1. Investigate the root cause: determine why the KV-cache transfer avoids NVLink. This may involve inspecting the GPU interconnect topology, the worker configuration, and the transport selected by the KV-transfer path (two quick checks are sketched after this list).
  2. Develop a solution: once the root cause is identified, fix it in the configuration, the code, or the hardware setup.
  3. Test the solution: confirm that the fix resolves the issue without introducing new problems.
  4. Verify the solution: confirm with Nsight Systems that NVLink bandwidth utilization is non-zero during transfers and that performance behaves as expected.
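
As a starting point for the root-cause investigation, the checks below probe whether peer-to-peer access between the two GPUs is enabled at all and whether any environment setting disables it. A sketch; the GPU indices 0 and 1 are assumptions, and the first command must run in a process that can see both devices:

# Check whether GPU 0 can access GPU 1 peer-to-peer (the prerequisite for
# NVLink transfers); torch is already a dependency of the vLLM workers.
python3 -c "import torch; print(torch.cuda.can_device_access_peer(0, 1))"

# Look for CUDA/NCCL/UCX settings in the worker environment that could
# force transfers through host memory instead of peer-to-peer copies.
env | grep -E 'CUDA|NCCL|UCX'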

By following these recommendations, it is possible to fix the issue of workers using PCIe instead of NVLink on the same node and to ensure that the system operates as expected.
Q&A: [BUG]: workers using PCIe instead of NVLink on the same node

Q: What is the issue with workers using PCIe instead of NVLink on the same node?

A: NVLink is not being used for the KV-cache transfer between workers on the same node, even though that is the expected behavior. PCIe appears to be used instead, which may significantly affect the performance and efficiency of the system.

Q: What is NVLink and why is it important?

A: NVLink is a high-speed interconnect technology that enables fast data transfer between GPUs on the same node. It is designed to provide low-latency and high-bandwidth communication between GPUs, making it an essential component of many high-performance computing applications.

Q: What is PCIe and how does it relate to this issue?

A: PCIe (Peripheral Component Interconnect Express) is a high-speed interface standard for connecting hardware components to a computer's motherboard. In this case, PCIe is being used for transmission between workers on the same node, instead of NVLink.

Q: What are the implications of this issue?

A: The implications of this issue are significant, as it may affect the performance and efficiency of the system. NVLink is designed to provide low-latency and high-bandwidth communication between GPUs, which is critical for many high-performance computing applications.

Q: How can this issue be reproduced?

A: The issue can be reproduced with the configuration file shown in the Steps to Reproduce section above, together with the Nsight Systems profile command listed there.

Q: What is the expected behavior?

A: The expected behavior is that NVLink is used for transmission between workers on the same node.

Q: What is the actual behavior?

A: The actual behavior is that PCIe is used for transmission between workers on the same node, instead of NVLink.

Q: In what environment was the issue observed?

A: The issue was observed in the following environment:

  • Installation method: container/build.sh
  • ai-dynamo version: 0.2.0
  • Operating System: Ubuntu 24.04
  • Docker Version: 28.0.4

Q: What are the next steps to resolve this issue?

A: The next steps to resolve this issue are:

  1. Investigate the root cause: Further investigation is needed to determine the root cause of this issue.
  2. Develop a solution: Once the root cause is identified, a solution needs to be developed to fix the issue.
  3. Test the solution: The solution needs to be tested to ensure that it fixes the issue and does not introduce any new problems.
  4. Verify the solution: The solution needs to be verified to ensure that it meets the expected behavior and does not have any unintended consequences.

By following these steps, it is possible to resolve the issue of workers using PCIe instead of NVLink on the same node and to ensure that the system operates as expected.