`IterableDataset` Drops Samples When Resuming From A Checkpoint

Introduction

When working with distributed datasets, checkpointing and resuming iteration must be handled correctly to avoid losing data or introducing inconsistencies. However, the IterableDataset class in the Hugging Face datasets library has a known issue: under specific conditions it drops samples when resuming from a checkpoint. In this article, we look at the root cause of the problem and explore a potential solution.

The Issue

When resuming from a checkpoint, IterableDataset drops samples if num_shards % world_size == 0 and the underlying examples iterable supports iter_arrow and needs to be formatted. The child iterable increments the shard_example_idx counter by the full chunk size before the parent has yielded every example in that chunk, so if iteration stops mid-chunk, the samples that were never yielded from the current chunk are skipped when resuming.
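To build intuition for this failure mode, here is a deliberately simplified sketch. It is not the real datasets code: the ChunkedSource class and its fields are invented for illustration, but it mimics a child iterable that advances its shard_example_idx by a whole chunk as soon as the chunk is handed to the parent.

# Hypothetical sketch of the failure mode (not the actual `datasets` internals).
# The "child" yields fixed-size chunks and advances its counter by the full
# chunk size as soon as a chunk is handed to the parent.

class ChunkedSource:
    def __init__(self, data, chunk_size):
        self.data = data
        self.chunk_size = chunk_size
        self.shard_example_idx = 0  # advanced per chunk, not per example

    def __iter__(self):
        while self.shard_example_idx < len(self.data):
            chunk = self.data[self.shard_example_idx:self.shard_example_idx + self.chunk_size]
            self.shard_example_idx += len(chunk)  # counted before the parent consumes the chunk
            yield chunk

    def state_dict(self):
        return {"shard_example_idx": self.shard_example_idx}

    def load_state_dict(self, state):
        self.shard_example_idx = state["shard_example_idx"]


source = ChunkedSource(list(range(6)), chunk_size=3)
it = iter(source)
first_chunk = next(it)          # source already thinks 3 examples were consumed
consumed = first_chunk[:1]      # the parent only yielded 1 of them before checkpointing
state = source.state_dict()     # {"shard_example_idx": 3}

resumed = ChunkedSource(list(range(6)), chunk_size=3)
resumed.load_state_dict(state)
rest = [x for chunk in resumed for x in chunk]
print(consumed + rest)          # [0, 3, 4, 5] -- examples 1 and 2 were dropped

Because shard_example_idx already points past the whole chunk, the resumed iterator starts at example 3 and never sees examples 1 and 2, even though the parent never yielded them.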

Minimal Reproducer

To demonstrate this issue, we can use the following minimal reproducer:

from datasets import Dataset
from datasets.distributed import split_dataset_by_node

# Build a 24-example dataset and expose it as an IterableDataset with 4 shards.
ds = Dataset.from_dict({"n": list(range(24))})
ds = ds.to_iterable_dataset(num_shards=4)

# Split across 4 workers; num_shards % world_size == 0, which is one of the
# conditions that triggers the bug.
world_size = 4
rank = 0
ds_rank = split_dataset_by_node(ds, rank, world_size)

# Consume three examples, then checkpoint mid-stream.
it = iter(ds_rank)
examples = []
for idx, example in enumerate(it):
    examples.append(example)
    if idx == 2:
        state_dict = ds_rank.state_dict()
        break

# Restore the checkpoint and resume with a fresh iterator.
ds_rank.load_state_dict(state_dict)
it_resumed = iter(ds_rank)
examples_resumed = examples[:]

# Finish the original iterator...
for example in it:
    examples.append(example)

# ...and the resumed one, then compare the two lists.
for example in it_resumed:
    examples_resumed.append(example)

print("ORIGINAL ITER EXAMPLES:", examples)
print("RESUMED ITER EXAMPLES:", examples_resumed)

This code creates a dataset of 24 examples, exposes it as an IterableDataset with 4 shards, and keeps only the shards assigned to rank 0 of a 4-worker setup. It saves the state dictionary after consuming three examples, reloads it, and resumes with a fresh iterator. Comparing the two printed lists shows that the resumed iteration is missing some of the samples the original iteration produced.
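To make the data loss explicit, the reproducer can be extended with a quick comparison of the two runs. This snippet reuses the examples and examples_resumed lists from above; the assert is simply a convenient way to fail loudly when samples go missing.

# Any value present in the full iteration but absent from the resumed one was dropped.
original_values = {ex["n"] for ex in examples}
resumed_values = {ex["n"] for ex in examples_resumed}
dropped = sorted(original_values - resumed_values)
print("DROPPED EXAMPLES:", dropped)
assert not dropped, f"{len(dropped)} samples were lost on resume: {dropped}"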

Potential Solution

One way to fix this is for the parent to signal the child iterable which samples within the current chunk have already been processed, so the child can adjust its shard_example_idx counter accordingly. This also requires slicing the partially consumed chunk when resuming, so that only the unprocessed samples are yielded.
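Here is a simplified sketch of what such a fix could look like, building on the toy example from earlier rather than the real datasets internals. The ResumableChunkedSource class is invented for illustration: its checkpoint records both the start of the current chunk and how many of its examples the parent actually yielded, and the chunk is sliced by that offset on resume.

# Hypothetical sketch of the proposed fix (not the real `datasets` code):
# the checkpoint stores an in-chunk offset and the chunk is sliced on resume.

class ResumableChunkedSource:
    def __init__(self, data, chunk_size):
        self.data = data
        self.chunk_size = chunk_size
        self.chunk_start = 0      # first example of the chunk currently being served
        self.offset_in_chunk = 0  # how many examples of that chunk were already yielded

    def __iter__(self):
        while self.chunk_start < len(self.data):
            chunk = self.data[self.chunk_start:self.chunk_start + self.chunk_size]
            # Skip the part of the chunk that was consumed before the checkpoint.
            for i, example in enumerate(chunk[self.offset_in_chunk:], start=self.offset_in_chunk):
                self.offset_in_chunk = i + 1
                yield example
            self.chunk_start += len(chunk)
            self.offset_in_chunk = 0

    def state_dict(self):
        return {"chunk_start": self.chunk_start, "offset_in_chunk": self.offset_in_chunk}

    def load_state_dict(self, state):
        self.chunk_start = state["chunk_start"]
        self.offset_in_chunk = state["offset_in_chunk"]


source = ResumableChunkedSource(list(range(6)), chunk_size=3)
it = iter(source)
consumed = [next(it)]                 # yields 0; offset_in_chunk == 1
state = source.state_dict()

resumed = ResumableChunkedSource(list(range(6)), chunk_size=3)
resumed.load_state_dict(state)
print(consumed + list(resumed))       # [0, 1, 2, 3, 4, 5] -- nothing is dropped

Because progress is recorded per example rather than per chunk, the resumed iterator yields exactly the samples the parent never processed.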

Conclusion

The IterableDataset class in the Hugging Face datasets library can drop samples when resuming from a checkpoint under the conditions described above. Understanding the root cause, a chunk-level counter that runs ahead of what the parent has actually yielded, helps make distributed, resumable data loading more reliable. This article provided a minimal reproducer and discussed a potential fix.

Recommendations

To avoid this issue, we recommend the following:

  1. Use a different dataset class: if possible, use a dataset class or loading path that does not rely on IterableDataset's chunk-level checkpoint logic (a user-level workaround is sketched after this list).
  2. Implement chunk slicing (library-side fix): when resuming from a checkpoint, slice the partially consumed chunk so that only the unprocessed samples are yielded.
  3. Signal the child (library-side fix): have the parent tell the child iterable which samples within the current chunk it has already processed, so it can adjust the shard_example_idx counter.
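As a user-level workaround, here is a sketch (an illustrative example, not an official recipe) that tracks how many examples the current rank has consumed and skips that many on resume instead of relying on state_dict. The build_rank_dataset helper is made up for this example, and the snippet assumes the IterableDataset.skip method available in recent datasets versions; the skipped examples are still read and discarded, so resuming is slower than a true checkpoint.

from datasets import Dataset
from datasets.distributed import split_dataset_by_node

def build_rank_dataset(rank, world_size):
    # Rebuild the same sharded stream for this rank (deterministic, no shuffling).
    ds = Dataset.from_dict({"n": list(range(24))}).to_iterable_dataset(num_shards=4)
    return split_dataset_by_node(ds, rank, world_size)

rank, world_size = 0, 4
ds_rank = build_rank_dataset(rank, world_size)

# First run: consume a few examples and record how many were actually used.
consumed = 0
examples = []
for example in ds_rank:
    examples.append(example)
    consumed += 1
    if consumed == 3:
        break  # pretend training was interrupted here

# Resume: rebuild the stream and skip the examples this rank already saw.
ds_resumed = build_rank_dataset(rank, world_size).skip(consumed)
for example in ds_resumed:
    examples.append(example)

print([ex["n"] for ex in examples])  # all six examples assigned to this rank, none dropped

This relies on the stream being deterministic between runs; with shuffling enabled you would also have to restore the same shuffle state before skipping.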

Q: What is the issue with IterableDataset when resuming from a checkpoint?

A: When resuming from a checkpoint, IterableDataset drops samples if num_shards % world_size == 0 and the underlying examples iterable supports iter_arrow and needs to be formatted. The child iterable increments the shard_example_idx counter by the full chunk size before the parent has yielded every example in that chunk, so samples are skipped if the iteration is stopped mid-chunk.

Q: What is the root cause of this issue?

A: The root cause is that the child iterable increments the shard_example_idx counter by the full chunk size as soon as the chunk is produced, rather than per example actually yielded. If iteration stops mid-chunk, the remaining samples of that chunk are skipped on resume.

Q: How can I reproduce this issue?

A: Use the minimal reproducer from the article above: build a 24-example dataset, expose it as an IterableDataset with num_shards=4, split it with split_dataset_by_node (world_size=4, rank=0), save the state dictionary after three examples, then resume from it and compare the original and resumed outputs. The resumed iteration will be missing some of the samples.

Q: How can I avoid this issue?

A: One way to fix this is for the parent to signal the child iterable which samples within the current chunk it has already processed, so the child can adjust its shard_example_idx counter accordingly. This also requires slicing the partially consumed chunk when resuming (see the Potential Solution section above).

Q: What are the recommendations for avoiding this issue?

A: To avoid this issue, we recommend the following:

  1. Use a different dataset class: if possible, use a dataset class or loading path that does not rely on IterableDataset's chunk-level checkpoint logic, or manually track and skip already-consumed examples on resume.
  2. Implement chunk slicing (library-side fix): when resuming from a checkpoint, slice the partially consumed chunk so that only the unprocessed samples are yielded.
  3. Signal the child (library-side fix): have the parent tell the child iterable which samples within the current chunk it has already processed, so it can adjust the shard_example_idx counter.

Q: Is this issue specific to IterableDataset?

A: The bug described here lives in the checkpoint-resume logic of IterableDataset in the Hugging Face datasets library. That said, the same class of off-by-a-chunk problem can affect any sharded, resumable iterator that tracks progress at chunk granularity rather than per example.

Q: Where can I report this issue?

A: Since IterableDataset here comes from the Hugging Face datasets library, the right place to report it is the project's GitHub issue tracker at github.com/huggingface/datasets. The maintainers can confirm the behavior and fix it in a future release.