Anomaly in Exporting Pre-trained Weights


Introduction

Exporting pre-trained weights is a crucial step in deploying deep learning models. However, when a model that relies on activation checkpointing is exported through PyTorch's JIT tracer, the resulting ONNX model can come out wrong. In this article, we explore a case where each of PyTorch's linear layers is split into separate MatMul/Add operations, leaving thousands of loose weight files next to an exported ONNX model of only 2MB.

Understanding the Issue

When exporting pre-trained weights with PyTorch's JIT tracer, each linear layer appears in the graph as a separate MatMul/Add pair rather than a single fused operation. Alongside the exported model, the output directory fills with files such as onnx__Add_15356, model.backbone.blocks.0.mlp.w2.bias, and model.decode_head.transformer_decoder.layers.2.norms.1.weight. There are thousands of these files, while the .onnx file itself is only 2MB: the weights are being written out as separate external data files instead of being embedded in the model.
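You can confirm what is happening by inspecting the exported file with the onnx package. The following sketch (assuming the output path used later in this article) counts how many initializers were pushed out to external data files:

import onnx
from onnx.external_data_helper import uses_external_data

# Load only the graph structure, without pulling the weight files into memory.
model = onnx.load("dinov2_ade20k_m2f.onnx", load_external_data=False)

external = [t.name for t in model.graph.initializer if uses_external_data(t)]
print(f"{len(external)} of {len(model.graph.initializer)} initializers are stored externally")
print("examples:", external[:3])

If most initializers report an external location, the weights live in the sidecar files rather than in the .onnx file itself, which explains the 2MB size.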

The Root Cause

The root cause of this issue is that PyTorch's JIT tracer cannot trace through torch.utils.checkpoint, which the model uses to save memory during training. The checkpointed blocks run inside an autograd Function whose body is opaque to the tracer, so the parameters used inside them are not captured as graph constants, leading to the anomaly in the exported ONNX model.
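A minimal reproduction of the problem looks like this: a toy block wraps its linear layer in a checkpoint call, mimicking the pattern used by the DINOv2 backbone. Depending on the PyTorch version, exporting it either fails outright on the opaque checkpoint operator or produces a graph in which the layer's weights are not embedded:

import torch
import torch.nn as nn
import torch.utils.checkpoint

class CheckpointedBlock(nn.Module):
    # Toy stand-in for a transformer block that checkpoints its forward pass.
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(16, 16)

    def forward(self, x):
        # The tracer cannot look inside the checkpoint call, so the linear
        # layer's weight and bias are not recorded as graph constants.
        return torch.utils.checkpoint.checkpoint(self.linear, x)

torch.onnx.export(CheckpointedBlock().eval(), torch.randn(1, 16), "repro.onnx")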

Solving the Problem

To solve this problem, we need the tracer to see straight through the checkpointed blocks. One solution is to bypass activation checkpointing during the export by replacing torch.utils.checkpoint.checkpoint with a plain pass-through call. Checkpointing only trades compute for memory during training; skipping it does not change the model's outputs, so the exported weights remain faithful.

Modifying the Code

To bypass checkpointing, we add the following lines before the model is built:

import torch.utils.checkpoint as checkpoint

checkpoint.checkpoint = lambda func, *args, **kwargs: func(*args, **kwargs)

This replaces the checkpoint function with a lambda that simply calls the wrapped function directly, so the tracer records the real operations inside each block.
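If you prefer not to leave the patch active for the lifetime of the process, a small context manager (a hypothetical helper, not part of PyTorch) can scope it to the export and restore the original function afterwards:

import contextlib
import torch.utils.checkpoint as checkpoint

@contextlib.contextmanager
def checkpoint_passthrough():
    # Temporarily replace checkpoint with a plain call; restore it on exit.
    original = checkpoint.checkpoint
    checkpoint.checkpoint = lambda func, *args, **kwargs: func(*args, **kwargs)
    try:
        yield
    finally:
        checkpoint.checkpoint = original

The export call shown below could then be wrapped in with checkpoint_passthrough(): instead of patching at import time.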

Exporting the Pre-trained Weights

With the patch in place, exporting the pre-trained weights is a standard torch.onnx.export call:

torch.onnx.export(
    wrapper,
    dummy_input,
    onnx_output_path,
    export_params=True,
    opset_version=16,
    do_constant_folding=True,
    input_names=['input_bgr'],
    output_names=['seg_logits'],
)

Note that torch.onnx.export accepts no checkpoint argument; the fix lives entirely in the pass-through patch applied earlier. With the patch active, this call exports the pre-trained weights correctly.
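Once the export finishes, it is worth sanity-checking the result. A quick sketch using onnx and onnxruntime (both assumed to be installed; the input shape must match whatever dummy_input was used for the export):

import numpy as np
import onnx
import onnxruntime as ort

# check_model accepts a file path, which also works for models with external data.
onnx.checker.check_model("dinov2_ade20k_m2f.onnx")

session = ort.InferenceSession("dinov2_ade20k_m2f.onnx", providers=["CPUExecutionProvider"])
dummy = np.zeros((1, 3, 512, 512), dtype=np.float32)  # replace with your export size
(seg_logits,) = session.run(["seg_logits"], {"input_bgr": dummy})
print("output shape:", seg_logits.shape)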

Conclusion

In conclusion, the anomaly in exporting pre-trained weights through PyTorch's JIT tracer arises because the tracer cannot see inside torch.utils.checkpoint: linear layers are split into bare MatMul/Add operations and the weights are not embedded in the model. The fix is to bypass activation checkpointing during the export by patching torch.utils.checkpoint.checkpoint with a pass-through. With that patch applied, the pre-trained weights export correctly.

Code

Here is the modified code:

import os
import sys
import urllib.request
import mmcv
from mmcv.runner import load_checkpoint
from mmseg.apis import init_segmentor
import torch
import warnings
from torch.jit import TracerWarning
import torch.utils.checkpoint as checkpoint


warnings.filterwarnings("ignore", category=TracerWarning)
warnings.filterwarnings("ignore", message=".*Iterating over a tensor might cause the trace to be incorrect.*")
warnings.filterwarnings("ignore", message=".*Converting a tensor to a Python boolean might cause the trace to be incorrect.*")


# Bypass activation checkpointing so the JIT tracer can see inside each block.
checkpoint.checkpoint = lambda func, *args, **kwargs: func(*args, **kwargs)


# Make the repository root importable so the local dinov2 package resolves.
PROJECT_ROOT = os.path.dirname(__file__)
sys.path.insert(0, PROJECT_ROOT)


# Importing this module registers the custom Mask2Former segmentor with mmseg.
from dinov2.eval.segmentation_m2f.models.segmentors import encoder_decoder_mask2former  # noqa


DINOV2_BASE_URL = "https://dl.fbaipublicfiles.com/dinov2"
CONFIG_URL = f"{DINOV2_BASE_URL}/dinov2_vitg14/dinov2_vitg14_ade20k_m2f_config.py"
CHECKPOINT_URL = f"{DINOV2_BASE_URL}/dinov2_vitg14/dinov2_vitg14_ade20k_m2f.pth"


def load_config_from_url(url: str) -> str:
    with urllib.request.urlopen(url) as f:
        return f.read().decode()

cfg_str = load_config_from_url(CONFIG_URL)
cfg = mmcv.Config.fromstring(cfg_str, file_format=".py")


# Build the segmentor on CPU (init_segmentor defaults to CUDA) and load weights.
model = init_segmentor(cfg, device='cpu')
load_checkpoint(model, CHECKPOINT_URL, map_location="cpu")
model.cpu()
model.eval()


class ONNXWrapper(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, img: torch.Tensor):
        # encode_decode expects a list of img_metas; fabricate one from the
        # input shape so the wrapper can be traced with a plain tensor input.
        _, _, h, w = img.shape
        fake_meta = {
            'img_shape': (h, w, 3),
            'ori_shape': (h, w, 3),
            'pad_shape': (h, w, 3),
            'scale_factor': 1.0,
        }
        seg_logits = self.model.encode_decode(img, [fake_meta])
        return seg_logits


# Resolve a fixed input size from the test pipeline; fall back to the
# backbone's native image size if the pipeline does not specify one.
dummy_device = torch.device('cpu')
fixed_height = fixed_width = None
for transform in cfg.data.test.pipeline:
    if transform.get('type') == 'MultiScaleFlipAug':
        fixed_width, fixed_height = transform['img_scale']
        break
if fixed_height is None:
    fixed_height = fixed_width = cfg.model.backbone.img_size

dummy_input = torch.zeros(1, 3, fixed_height, fixed_width, device=dummy_device, dtype=torch.float32)


wrapper = ONNXWrapper(model)
wrapper.cpu()
wrapper.eval()


onnx_output_path = "dinov2_ade20k_m2f.onnx"
torch.onnx.export(
    wrapper,
    dummy_input,
    onnx_output_path,
    export_params=True,
    opset_version=16,
    do_constant_folding=True,
    input_names=['input_bgr'],
    output_names=['seg_logits'],
)
print(f"Raw ONNX model saved to {onnx_output_path}")

FAQ

Q: What is the anomaly in exporting pre-trained weights?

A: It is a problem that occurs when PyTorch's JIT tracer is used to export a model that relies on activation checkpointing. Each linear layer is split into separate MatMul/Add operations, the weights end up in thousands of separate files, and the exported .onnx file is only a few megabytes.

Q: What is the root cause of this anomaly?

A: The root cause is that PyTorch's JIT tracer cannot trace through torch.utils.checkpoint. The parameters used inside checkpointed blocks are not captured as graph constants, leading to the anomaly in the exported ONNX model.

Q: How can I solve this problem?

A: Bypass activation checkpointing during the export by replacing torch.utils.checkpoint.checkpoint with a pass-through lambda. The tracer can then capture the full forward logic of the model, and the pre-trained weights export correctly.

Q: What is the torch.utils.checkpoint module?

A: torch.utils.checkpoint implements activation (gradient) checkpointing. During training it trades compute for memory: intermediate activations are discarded in the forward pass and recomputed during backward. Large backbones such as DINOv2's ViT-g use it to fit training in memory; it serves no purpose during inference or export.
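For context, here is the module's intended use during training, as a minimal sketch (use_reentrant must be passed explicitly on recent PyTorch versions):

import torch
import torch.utils.checkpoint as checkpoint

layer = torch.nn.Linear(8, 8)
x = torch.randn(2, 8, requires_grad=True)

# The forward activations of `layer` are discarded and recomputed during
# backward, trading extra compute for lower peak memory.
y = checkpoint.checkpoint(layer, x, use_reentrant=False)
y.sum().backward()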

Q: How do I use the torch.utils.checkpoint module?

A: For this export problem, you do not use the module so much as bypass it. Add the following lines before building the model:

import torch.utils.checkpoint as checkpoint

checkpoint.checkpoint = lambda func, *args, **kwargs: func(*args, **kwargs)

This makes every checkpoint call a plain function call, so the tracer records the operations inside each block.

Q: What are the benefits of using the torch.utils.checkpoint module?

A: The benefits of bypassing activation checkpointing during export include:

  • The tracer can capture the full forward logic of the model
  • The pre-trained weights are exported correctly, embedded where the graph expects them
  • The model's outputs are unchanged, since checkpointing only affects memory use during training

Q: Are there any limitations to this approach?

A: Yes, there are a few:

  • The patch is global: it affects every caller that looks up checkpoint.checkpoint through the module, but code that imported the function directly (from torch.utils.checkpoint import checkpoint) is not affected and would need its own patch
  • With checkpointing bypassed, all intermediate activations are kept in memory, so peak memory use during tracing increases
  • The patch must be applied before the model's forward pass is traced

Q: How do I troubleshoot issues with this approach?

A: If the export still misbehaves, you can try the following:

  • Check the torch.utils.checkpoint and torch.onnx.export documentation
  • Search the PyTorch forums for similar export issues
  • File an issue on the PyTorch GitHub repository

Q: What are some best practices when patching torch.utils.checkpoint?

A: Some best practices include:

  • Check the documentation before monkey-patching the module
  • Test the patch on a small model before applying it to a large one
  • Restore the original checkpoint function after the export (for example, with the context manager shown earlier)
  • Verify the exported model's outputs against the original model

Conclusion

In conclusion, the anomaly in exporting pre-trained weights can be solved by bypassing torch.utils.checkpoint during the export, letting the JIT tracer capture the model's full forward logic and embed the weights correctly. By following the best practices above, you can ensure that your model is exported correctly and behaves the same as the original.