[Bug]: DataParallel On Multinode Unable To Start GPU

Your current environment

System Information

We are running on an Ubuntu 24.04.2 LTS system with a 64-bit architecture. The system has 208 CPU cores and 2 NUMA nodes. The system is equipped with 8 NVIDIA H800 GPUs.

PyTorch and CUDA Information

We are using PyTorch version 2.6.0+cu124, which is built with CUDA 12.4. The CUDA runtime version is not available.

Python and GCC Information

We are using Python 3.12.9, which is packaged by Anaconda, Inc. The GCC version is 11.2.0.

Relevant Libraries

We are using the following relevant libraries:

  • numpy==2.2.5
  • nvidia-cublas-cu12==12.4.5.8
  • nvidia-cuda-cupti-cu12==12.4.127
  • nvidia-cuda-nvrtc-cu12==12.4.127
  • nvidia-cuda-runtime-cu12==12.4.127
  • nvidia-cudnn-cu12==9.1.0.70
  • nvidia-cufft-cu12==11.2.1.3
  • nvidia-curand-cu12==10.3.5.147
  • nvidia-cusolver-cu12==11.6.1.9
  • nvidia-cusparse-cu12==12.3.1.170
  • nvidia-cusparselt-cu12==0.6.2
  • nvidia-nccl-cu12==2.21.5
  • nvidia-nvjitlink-cu12==12.4.127
  • nvidia-nvtx-cu12==12.4.127
  • pyzmq==26.4.0
  • torch==2.6.0
  • torchaudio==2.6.0
  • torchvision==0.21.0
  • transformers==4.51.3
  • triton==3.2.0

Describe the bug

We are using vllm==0.8.4 with a two-node Ray cluster to serve the model. With --data-parallel-size 8 everything works fine. However, when we switch to --data-parallel-size 16, we get the following error:

RuntimeError: No CUDA GPUs are available

We also tried --data-parallel-size 8 --pipeline-parallel-size 2, which failed in the same way. However, --tensor-parallel-size 8 --pipeline-parallel-size 2 does work.
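
For reference, a sketch of how such a two-node deployment is typically launched; the model name and head-node address are placeholders, not taken from the report:

```shell
# On the head node: start the Ray head process.
ray start --head --port=6379

# On the second node: join the cluster.
ray start --address=<head-node-ip>:6379

# On the head node: serve across both nodes.
# --data-parallel-size 8 works; --data-parallel-size 16 raises
# "RuntimeError: No CUDA GPUs are available" on the second node.
vllm serve <model> --data-parallel-size 16
```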

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Troubleshooting Steps

  1. Check the CUDA environment: Make sure that the CUDA environment is properly set up on the system. You can check it by running nvcc --version in the terminal.
  2. Check the PyTorch version: Make sure that the PyTorch version is compatible with the CUDA version. You can check it by running python -c "import torch; print(torch.__version__)" in the terminal.
  3. Check the GPU availability: Make sure that the GPUs are visible and properly configured. You can check them by running nvidia-smi in the terminal.
  4. Check the Ray cluster configuration: Make sure that the Ray cluster is properly configured and that all 16 GPUs are registered. You can check the cluster state by running ray status in the terminal.
  5. Check the vLLM configuration: Make sure the launch command is correct and that --data-parallel-size matches the total number of GPUs the cluster actually exposes.

Possible Solutions

  1. Update the PyTorch version: Install a PyTorch build that matches the installed CUDA version.
  2. Update the CUDA environment: Update the CUDA toolkit and driver to versions compatible with the PyTorch build.
  3. Fix the GPU visibility: Make sure no environment variable (such as CUDA_VISIBLE_DEVICES) is masking the GPUs on either node.
  4. Fix the Ray cluster configuration: Make sure both nodes registered their GPUs when joining the cluster.
  5. Fix the vLLM configuration: Make sure --data-parallel-size matches the number of GPUs the cluster actually exposes.
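
GPU allocation problems in the Ray cluster often show up when a node joins without registering its GPUs. Ray normally autodetects them, but the count can be pinned explicitly; a sketch, with the head-node address as a placeholder:

```shell
# Head node: start Ray and explicitly register this node's 8 GPUs.
ray start --head --num-gpus=8

# Worker node: join the cluster, again pinning the GPU count.
ray start --address=<head-node-ip>:6379 --num-gpus=8

# Verify that all 16 GPUs are visible to the scheduler before launching vLLM.
ray status
```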

Conclusion

Q: What is the issue with DataParallel on multinode?

A: When --data-parallel-size spans both nodes (16 on this cluster), the worker processes on the second node fail with RuntimeError: No CUDA GPUs are available. This usually means the GPUs are not visible to those workers; likely culprits are a PyTorch/CUDA version mismatch, GPU visibility not being propagated through the Ray cluster, or a misconfigured launch command.

Q: What are the possible causes of this issue?

A: The possible causes of this issue are:

  1. Incompatible PyTorch version: The PyTorch version is not compatible with the CUDA version.
  2. Incorrect CUDA environment: The CUDA environment is not properly set up.
  3. GPU availability: The GPUs are not available or are not properly configured.
  4. Ray cluster configuration: The ray cluster is not properly configured or the GPUs are not properly allocated.
  5. vLLM configuration: The vLLM launch arguments are not properly set up or the data-parallel-size is not correctly configured.
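
The GPU-availability cause has a subtle failure mode: an empty CUDA_VISIBLE_DEVICES in a worker's environment hides every GPU and produces exactly the RuntimeError: No CUDA GPUs are available message above. A minimal sketch of the semantics; visible_gpu_ids is a hypothetical helper, and the per-node count of 8 is taken from the system description:

```python
def visible_gpu_ids(env: dict, gpus_per_node: int = 8) -> list:
    """Model how CUDA interprets CUDA_VISIBLE_DEVICES:
    unset means all GPUs are visible, an empty string means none."""
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return list(range(gpus_per_node))   # unset: all GPUs visible
    if raw.strip() == "":
        return []                           # empty: "No CUDA GPUs are available"
    return [int(i) for i in raw.split(",")] # explicit list of device ids

print(visible_gpu_ids({}))                            # all 8 devices
print(visible_gpu_ids({"CUDA_VISIBLE_DEVICES": ""}))  # none
print(visible_gpu_ids({"CUDA_VISIBLE_DEVICES": "0,1"}))
```

Checking this variable in the environment of the Ray worker processes on the second node is a quick way to confirm or rule out this cause.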

Q: How can I troubleshoot this issue?

A: Work through the Troubleshooting Steps section above: verify the CUDA environment (nvcc --version), the PyTorch version, GPU availability (nvidia-smi), the Ray cluster state (ray status), and the vLLM launch arguments.

Q: What are the possible solutions to this issue?

A: Apply the Possible Solutions listed above: align the PyTorch and CUDA versions, fix GPU visibility on both nodes, and correct the Ray cluster and vLLM configuration.

Q: How can I prevent this issue in the future?

A: Keep the PyTorch and CUDA versions in sync whenever you upgrade either one, and re-verify GPU visibility, the Ray cluster state, and the vLLM launch arguments after any environment or cluster change.

Q: Where can I find more information about this issue?

A: You can find more information about this issue in the following resources:

  1. PyTorch documentation: Covers PyTorch releases and the CUDA versions each build supports.
  2. CUDA documentation: Covers the CUDA environment and GPU device visibility.
  3. Ray documentation: Covers cluster setup and GPU resource allocation.
  4. vLLM documentation: Covers serving options, including data-parallel-size.