[Bug] [MINOR] Llama3.1_(8B)-Alpaca.ipynb batch size miscalculation with multiple GPUs

Describe the Bug

In the Llama3.1_(8B)-Alpaca.ipynb notebook, and likely others as well, there is a minor issue with the reported batch size when training on a machine with multiple GPUs. The 'per_device_train_batch_size' argument is multiplied by the number of discovered GPUs, even though training runs on only one GPU. This results in a larger than expected 'Batch size per device' value in the training log; a rough sketch of the suspected calculation follows.
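
This sketch uses torch.cuda.device_count() (the GPUs PyTorch can see) as a stand-in for "discovered GPUs"; the exact code path inside Unsloth has not been confirmed here, so treat it as an illustration of the suspected behaviour rather than the library's actual implementation.

import torch

per_device_train_batch_size = 4   # value configured in TrainingArguments in the snippet below
gpus_used_for_training = 1        # training actually runs on a single GPU

# Suspected behaviour: the reported "Batch size per device" is scaled by every
# GPU the machine exposes, not by the GPUs the run actually uses.
reported_per_device = per_device_train_batch_size * torch.cuda.device_count()

# Expected behaviour: the per-device value is simply what was configured.
expected_per_device = per_device_train_batch_size * gpus_used_for_training

print(reported_per_device, expected_per_device)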

Code Snippet

The relevant code snippet is as follows:

from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = True, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 4,  # <-- batch size configured per device
        gradient_accumulation_steps = 8,
        warmup_steps = 5,
        num_train_epochs = 4,
        learning_rate = 2e-5,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "cosine",
        seed = 3407,
        output_dir = "~/unsloth/outputs-mist-spat",
        report_to = "wandb", # Use this for WandB etc
    ),
)

GPU Statistics

The GPU statistics are printed as follows:

import torch

gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)  # GB currently reserved
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)                  # total memory of GPU 0 in GB
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

Expected vs Actual Batch Size

The expected batch size per device is 4, but the value actually reported is 12, as shown in the training output:

Unsloth - 2x faster free finetuning | Num GPUs used = 1
Num examples = 1,400 | Num Epochs = 4 | Total steps = 56 
Batch size per device = 12          <-- expected 4
Gradient accumulation steps = 8 
Data Parallel GPUs = 1 | Total batch size (12 x 8 x 1) = 96  <-- expected (4 x 8 x 1) = 32
Trainable parameters = 92,405,760/24,000,000,000 (0.39% trained)
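
The reported figures are consistent with the configured value being scaled by the GPUs visible on the machine, while the data-parallel count printed in the same log stays at 1. Below is a small sanity check of the arithmetic, assuming three visible GPUs (inferred from 12 / 4 = 3; the actual GPU count of the machine is not stated in the notebook):

# Numbers taken from the log above and the TrainingArguments in the notebook;
# visible_gpus = 3 is an assumption inferred from 12 / 4.
configured_per_device = 4
grad_accum_steps = 8
data_parallel_gpus = 1            # as reported in the log
visible_gpus = 3                  # assumed number of GPUs discovered on the machine

reported_per_device = configured_per_device * visible_gpus                        # 12, matches the log
reported_total = reported_per_device * grad_accum_steps * data_parallel_gpus      # 96, matches the log
expected_total = configured_per_device * grad_accum_steps * data_parallel_gpus    # 32, what was configured

print(reported_per_device, reported_total, expected_total)   # 12 96 32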

Additional Notes

  • This is an extremely minor issue.

Conclusion

In conclusion, the Llama3.1_(8B)-Alpaca.ipynb notebook has a minor issue with the batch size calculation when more than one GPU is present: the 'per_device_train_batch_size' argument is multiplied by the number of discovered GPUs rather than the number of GPUs used for training, so the reported 'Batch size per device' is larger than expected. The issue is extremely minor and does not appear to affect the overall performance of the model.

Recommendations

To fix this issue, the reported batch size should be derived from the number of GPUs actually used for training, not from the number of discovered GPUs. Note that SFTTrainer does not expose a num_gpus argument, and dataset_num_proc only controls the number of dataset preprocessing workers, so neither is the right place to intervene; the fix belongs in the reporting itself, which could use the data-parallel world size (already printed as 1 in the same log) instead of the discovered GPU count. A sketch of the intended calculation follows, and a user-side workaround is shown in the next section.
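
This is only a sketch under the assumption that torch.cuda.device_count() is where the discovered count comes from; the actual reporting code inside Unsloth is not shown in the notebook. WORLD_SIZE is the standard environment variable set by launchers such as torchrun or accelerate for multi-process runs, and a plain single-GPU notebook run falls back to 1.

import os
import torch

def data_parallel_gpus() -> int:
    # Number of data-parallel processes actually launched; torchrun/accelerate set
    # WORLD_SIZE, and a single-GPU notebook run falls back to 1.
    return int(os.environ.get("WORLD_SIZE", 1))

per_device_train_batch_size = 4
gradient_accumulation_steps = 8

# Suspected current behaviour: scale the per-device value by every discovered GPU.
reported_per_device = per_device_train_batch_size * torch.cuda.device_count()

# Intended behaviour: report the configured per-device value unchanged, and scale
# only the *total* batch size by the GPUs actually used for training.
corrected_per_device = per_device_train_batch_size
total_batch_size = per_device_train_batch_size * gradient_accumulation_steps * data_parallel_gpus()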

Code Modification

A user-side workaround is to restrict which devices are visible before torch (and therefore unsloth and the model-loading code) first touches the GPU, so that only the GPU used for training is discovered. CUDA_VISIBLE_DEVICES is read when CUDA is initialised, so the assignment below should run at the very top of the notebook; the trainer configuration itself is unchanged:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # expose only the GPU used for training; must run before CUDA is initialised

from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2, # number of dataset preprocessing workers, unrelated to GPU count
    packing = True, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 8,
        warmup_steps = 5,
        num_train_epochs = 4,
        learning_rate = 2e-5,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "cosine",
        seed = 3407,
        output_dir = "~/unsloth/outputs-mist-spat",
        report_to = "wandb", # Use this for WandB etc
    ),
)

Q: What is the issue with the batch calculation in the Llama3.1_(8B)-Alpaca.ipynb notebook?

A: The issue is that the 'per_device_train_batch_size' argument is multiplied by the number of discovered GPUs, even though training runs on only one GPU, so the reported 'Batch size per device' is larger than expected.

Q: What is the expected batch size per device?

A: The expected batch size per device is 4.

Q: What batch size per device is actually reported?

A: The training log reports a batch size per device of 12.

Q: Why is the batch size per device larger than expected?

A: The batch size per device is larger than expected because the 'per_device_train_batch_size' argument is multiplied by the number of discovered GPUs, not the number of GPUs used for training.

Q: How can I fix this issue?

A: On the user side, you can restrict which devices are visible (for example, by setting CUDA_VISIBLE_DEVICES=0 before torch is imported) so that only the GPU used for training is discovered. The underlying fix is for the reported batch size to be computed from the number of GPUs actually used for training rather than the number of discovered GPUs.

Q: What is the modified code snippet?

A: See the Code Modification section above. The key change is setting CUDA_VISIBLE_DEVICES before torch and unsloth are imported, so that only the GPU used for training is visible; the SFTTrainer and TrainingArguments configuration itself stays exactly as in the original snippet.

Q: What are the benefits of restricting the visible GPUs?

A: The reported batch size per device then matches the configured per_device_train_batch_size, which makes the effective batch size easier to reason about. It does not change how the model trains on a single GPU.

Q: Are there any other issues with the batch calculation in the Llama3.1_(8B)-Alpaca.ipynb notebook?

A: No, there are no other issues with the batch calculation in the Llama3.1_(8B)-Alpaca.ipynb notebook. This issue is extremely minor and does not affect the overall performance of the model.

Q: Does the dataset_num_proc argument have anything to do with the GPU count?

A: No. The dataset_num_proc argument only specifies the number of processes used for dataset preprocessing; it has no effect on how many GPUs are used for training or on the reported batch size.

Q: How can I verify that the batch size per device is correct?

A: You can verify that the batch size per device is correct by checking the output of the training script. The batch size per device should be displayed as 4, not 12.
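
For a quick programmatic check, the following can be run after the trainer from the Code Snippet section has been constructed; it only uses torch.cuda.device_count() and standard TrainingArguments attributes. With the CUDA_VISIBLE_DEVICES workaround in place, the discovered count should drop to 1 while the configured per-device value stays at 4:

import torch

print(f"Discovered GPUs                  = {torch.cuda.device_count()}")                 # 1 after the workaround
print(f"Configured batch size per device = {trainer.args.per_device_train_batch_size}")  # 4
print(f"Gradient accumulation steps      = {trainer.args.gradient_accumulation_steps}")  # 8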