[BUG] V2 Instance Managers Keep Crashing On Master-head
Describe the Bug
v2 instance managers keep crashing on master-head, resulting in unstable Longhorn operations.
Symptoms
- v2 instance manager pods crash frequently, causing Longhorn operations (for example, backups) to fail.
- Each crash triggers a pod restart, so the pods cycle through a crash/restart loop.
- The repeated crashes destabilize the whole Longhorn cluster.
Reproduction Steps
- Deploy Longhorn master-head
- Enable the v2 data engine setting
- Observe the v2 instance managers crashing repeatedly (a command sketch follows this list)
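A hedged sketch of the reproduction, assuming the standard master-head manifest and the `v2-data-engine` setting name; adjust to your environment:

```bash
# Deploy Longhorn master-head from the standard manifest in the longhorn repo
kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/master/deploy/longhorn.yaml

# Enable the v2 data engine (assumes the setting is named v2-data-engine)
kubectl -n longhorn-system patch settings.longhorn.io v2-data-engine \
  --type merge -p '{"value": "true"}'

# Watch the v2 instance manager pods; climbing restart counts indicate the crash loop
kubectl -n longhorn-system get pods -l longhorn.io/component=instance-manager -w
```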
Expected Behavior
v2 instance managers run stably without crashing.
Support Bundle for Troubleshooting
supportbundle_17d2fdc3-bc50-4438-9eb3-7a139469efb0_2025-04-22T03-14-17Z.zip
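If you are triaging from the bundle, a minimal sketch for surfacing crash evidence in the packaged logs; the directory layout inside the archive is an assumption, so the search stays layout-agnostic:

```bash
# Unpack the support bundle and search the instance manager logs for crash signatures
unzip supportbundle_17d2fdc3-bc50-4438-9eb3-7a139469efb0_2025-04-22T03-14-17Z.zip -d bundle
grep -riE 'panic|sigsegv|fatal error' bundle | grep -i instance-manager
```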
Environment
- Longhorn version: master-head
- Impacted volume (PV):
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.32.2+k3s1
- Number of control plane nodes in the cluster:
- Number of worker nodes in the cluster:
- Node config
- OS type and version: SLES 15-SP6, Ubuntu 24.04
- Kernel version:
- CPU per node:
- Memory per node:
- Disk type (e.g. SSD/NVMe/HDD):
- Network bandwidth between the nodes (Gbps):
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
- Number of Longhorn volumes in the cluster:
Additional context
The failures of backup-related test cases on master-head arm64 might be related to this. For example:
```
test_backing_image.py::test_backup_with_backing_image[s3-335544320] FAILED
```

The backup target on the test cluster reports that it cannot reach a running instance manager:

```
/src/longhorn-tests # kubectl describe backuptarget -n longhorn-system default
Name:         default
Namespace:    longhorn-system
Labels:       <none>
Annotations:  <none>
API Version:  longhorn.io/v1beta2
Kind:         BackupTarget
Metadata:
  Creation Timestamp:  2025-04-22T02:36:05Z
  Finalizers:
    longhorn.io
  Generation:        68
  Resource Version:  10107
  UID:               148219ed-fd5d-4a91-b0d6-62a0dd48f39f
Spec:
  Backup Target URL:  s3://backupbucket@us-east-1/backupstore
  Credential Secret:  minio-secret
  Poll Interval:      30s
  Sync Requested At:  2025-04-22T03:10:16Z
Status:
  Available:  false
  Conditions:
    Last Probe Time:
    Last Transition Time:  2025-04-22T02:37:25Z
    Message:               failed to init backup target clients: failed to get running instance manager for proxy client: failed to find a running instance manager for node ip-10-0-2-137
    Reason:                Unavailable
    Status:                True
    Type:                  Unavailable
  Last Synced At:  <nil>
  Owner ID:        ip-10-0-2-137
Events:            <none>
```
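The Message field above says no running instance manager could be found on node ip-10-0-2-137. A quick way to inspect instance manager state on that node, sketched with the standard Longhorn CRD and pod labels:

```bash
# List instance manager CRs and their current state
kubectl -n longhorn-system get instancemanagers.longhorn.io -o wide

# Inspect the instance manager pods scheduled on the affected node
kubectl -n longhorn-system get pods -l longhorn.io/component=instance-manager \
  -o wide --field-selector spec.nodeName=ip-10-0-2-137
```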
Workaround and Mitigation
To mitigate the issue, you can try the following (a command sketch follows this list):
- Disable the v2 data engine setting: with the v2 data engine off, no v2 instance managers run, so you can confirm whether v1 operations remain stable.
- Use a released Longhorn version: if master-head is not required, switch to the latest stable release and check whether the issue is fixed there.
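A minimal sketch of the first mitigation, assuming the setting is named `v2-data-engine` and that no v2 volumes are still attached (the change may be rejected otherwise):

```bash
# Turn off the v2 data engine; v2 instance managers are then no longer scheduled
kubectl -n longhorn-system patch settings.longhorn.io v2-data-engine \
  --type merge -p '{"value": "false"}'
```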
Debugging Steps
- Check instance manager logs: look for panics, OOM kills, or other errors that explain the crashes (see the sketch after this list).
- Check node configuration: confirm each node meets Longhorn's minimum requirements, including the v2 data engine prerequisites such as huge page configuration.
- Check the underlying infrastructure: rule out node-level instability (disk, network, or memory pressure) as a trigger.
- Run the Longhorn tests: verify whether the crashes reproduce consistently, as they do for the backup cases above.
- Pin down the exact build: master-head moves daily, so record the precise longhorn-manager and instance-manager image tags in use.
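A sketch for the first step; `--previous` pulls logs from the crashed container instance, which is usually where the panic is:

```bash
# Logs from the current and the previously crashed instance manager containers
kubectl -n longhorn-system logs -l longhorn.io/component=instance-manager --tail=200
kubectl -n longhorn-system logs -l longhorn.io/component=instance-manager --previous --tail=200

# Restart counts and last termination reason (OOMKilled, Error, exit code)
kubectl -n longhorn-system describe pods -l longhorn.io/component=instance-manager | \
  grep -A5 'Last State'

# Cluster events around the crashes
kubectl -n longhorn-system get events --sort-by=.lastTimestamp | tail -n 30
```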
Conclusion
v2 instance managers keep crashing on master-head, destabilizing Longhorn operations: the backup target becomes unavailable because no running instance manager can be found for the proxy client. The root cause has not been confirmed yet; the support bundle and the instance manager logs are the starting point for analysis. Until a fix lands, disabling the v2 data engine or switching to a released Longhorn version are the practical mitigations.