[BUG] V2 Instance Managers Keep Crashing On Master-head
Describe the Bug
v2 instance managers keep crashing on master-head, resulting in unstable Longhorn operations.
Symptoms
- v2 instance manager pods crash frequently, causing Longhorn operations (for example, backups) to fail.
- Each crash triggers a pod restart, so the pods cycle through a crash/restart loop.
- The repeated crashes destabilize the whole Longhorn cluster.
Reproduction Steps
- Deploy Longhorn master-head
- Enable the v2 data engine setting
- Observe the v2 instance managers crashing repeatedly (a command sketch follows this list)
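A hedged sketch of the reproduction, assuming the standard master-head manifest and the `v2-data-engine` setting name; adjust to your environment:

```bash
# Deploy Longhorn master-head from the standard manifest in the longhorn repo
kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/master/deploy/longhorn.yaml

# Enable the v2 data engine (assumes the setting is named v2-data-engine)
kubectl -n longhorn-system patch settings.longhorn.io v2-data-engine \
  --type merge -p '{"value": "true"}'

# Watch the v2 instance manager pods; climbing restart counts indicate the crash loop
kubectl -n longhorn-system get pods -l longhorn.io/component=instance-manager -w
```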
Expected Behavior
v2 instance managers run stably without crashing.
Support Bundle for Troubleshooting
supportbundle_17d2fdc3-bc50-4438-9eb3-7a139469efb0_2025-04-22T03-14-17Z.zip
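If you are triaging from the bundle, a minimal sketch for surfacing crash evidence in the packaged logs; the directory layout inside the archive is an assumption, so the search stays layout-agnostic:

```bash
# Unpack the support bundle and search the instance manager logs for crash signatures
unzip supportbundle_17d2fdc3-bc50-4438-9eb3-7a139469efb0_2025-04-22T03-14-17Z.zip -d bundle
grep -riE 'panic|sigsegv|fatal error' bundle | grep -i instance-manager
```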
Environment
- Longhorn version: master-head
- Impacted volume (PV):
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.32.2+k3s1
- Number of control plane nodes in the cluster:
- Number of worker nodes in the cluster:
- Node config
- OS type and version: SLES 15-SP6, Ubuntu 24.04
- Kernel version:
- CPU per node:
- Memory per node:
- Disk type (e.g. SSD/NVMe/HDD):
- Network bandwidth between the nodes (Gbps):
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
- Number of Longhorn volumes in the cluster:
Additional context
The failures of backup-related test cases on master-head arm64 might be related to this. For example:
```
test_backing_image.py::test_backup_with_backing_image[s3-335544320] FAILED
```

The backup target on the test cluster reports that it cannot reach a running instance manager:

```
/src/longhorn-tests # kubectl describe backuptarget -n longhorn-system default
Name:         default
Namespace:    longhorn-system
Labels:       <none>
Annotations:  <none>
API Version:  longhorn.io/v1beta2
Kind:         BackupTarget
Metadata:
  Creation Timestamp:  2025-04-22T02:36:05Z
  Finalizers:
    longhorn.io
  Generation:        68
  Resource Version:  10107
  UID:               148219ed-fd5d-4a91-b0d6-62a0dd48f39f
Spec:
  Backup Target URL:  s3://backupbucket@us-east-1/backupstore
  Credential Secret:  minio-secret
  Poll Interval:      30s
  Sync Requested At:  2025-04-22T03:10:16Z
Status:
  Available:  false
  Conditions:
    Last Probe Time:
    Last Transition Time:  2025-04-22T02:37:25Z
    Message:               failed to init backup target clients: failed to get running instance manager for proxy client: failed to find a running instance manager for node ip-10-0-2-137
    Reason:                Unavailable
    Status:                True
    Type:                  Unavailable
  Last Synced At:  <nil>
  Owner ID:        ip-10-0-2-137
Events:            <none>
```
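The Message field above says no running instance manager could be found on node ip-10-0-2-137. A quick way to inspect instance manager state on that node, sketched with the standard Longhorn CRD and pod labels:

```bash
# List instance manager CRs and their current state
kubectl -n longhorn-system get instancemanagers.longhorn.io -o wide

# Inspect the instance manager pods scheduled on the affected node
kubectl -n longhorn-system get pods -l longhorn.io/component=instance-manager \
  -o wide --field-selector spec.nodeName=ip-10-0-2-137
```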
Workaround and Mitigation
To mitigate the issue, you can try the following (a command sketch follows this list):
- Disable the v2 data engine setting: with the v2 data engine off, no v2 instance managers run, so you can confirm whether v1 operations remain stable.
- Use a released Longhorn version: if master-head is not required, switch to the latest stable release and check whether the issue is fixed there.
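A minimal sketch of the first mitigation, assuming the setting is named `v2-data-engine` and that no v2 volumes are still attached (the change may be rejected otherwise):

```bash
# Turn off the v2 data engine; v2 instance managers are then no longer scheduled
kubectl -n longhorn-system patch settings.longhorn.io v2-data-engine \
  --type merge -p '{"value": "false"}'
```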
Debugging Steps
- Check instance manager logs: look for panics, OOM kills, or other errors that explain the crashes (see the sketch after this list).
- Check node configuration: confirm each node meets Longhorn's minimum requirements, including the v2 data engine prerequisites such as huge page configuration.
- Check the underlying infrastructure: rule out node-level instability (disk, network, or memory pressure) as a trigger.
- Run the Longhorn tests: verify whether the crashes reproduce consistently, as they do for the backup cases above.
- Pin down the exact build: master-head moves daily, so record the precise longhorn-manager and instance-manager image tags in use.
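A sketch for the first step; `--previous` pulls logs from the crashed container instance, which is usually where the panic is:

```bash
# Logs from the current and the previously crashed instance manager containers
kubectl -n longhorn-system logs -l longhorn.io/component=instance-manager --tail=200
kubectl -n longhorn-system logs -l longhorn.io/component=instance-manager --previous --tail=200

# Restart counts and last termination reason (OOMKilled, Error, exit code)
kubectl -n longhorn-system describe pods -l longhorn.io/component=instance-manager | \
  grep -A5 'Last State'

# Cluster events around the crashes
kubectl -n longhorn-system get events --sort-by=.lastTimestamp | tail -n 30
```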
Conclusion
v2 instance managers keep crashing on master-head, destabilizing Longhorn operations: the backup target becomes unavailable because no running instance manager can be found for the proxy client. The root cause has not been confirmed yet; the support bundle and the instance manager logs are the starting point for analysis. Until a fix lands, disabling the v2 data engine or switching to a released Longhorn version are the practical mitigations.