Excessive Connection Attempts to Failed Nodes in clusterCron() Cause CPU Overhead

May 22, 2025 by ADMIN

The Problem: Excessive Connection Attempts to Failed Nodes

In the clusterCron() function, the clusterNodeCronHandleReconnect() function attempts to reconnect to nodes whose link == NULL on every iteration (10 times per second by default). This leads to repeated connection attempts to nodes that are unreachable or in a failed (PFAIL or FAIL) state. When there are multiple failing nodes, this retry logic results in high CPU usage due to frequent connConnect() calls and excessive memory allocation/free churn from repeatedly creating and destroying clusterLink objects.
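To make the retry pattern concrete, here is a toy model of it in C. It is not the actual clusterNodeCronHandleReconnect() code: toy_node, toy_link, toy_connect(), toy_cron_tick() and toy_run_ticks() are hypothetical names, and the connection attempt is modelled as a synchronous call that simply fails for unreachable peers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for a cluster link and a cluster node. */
typedef struct toy_link { int connected; } toy_link;
typedef struct toy_node {
    toy_link *link;   /* NULL means "no link, try to reconnect" */
    int reachable;    /* 0 models an unreachable or failed peer */
} toy_node;

/* Stand-in for a connection attempt: succeeds only for reachable peers. */
static int toy_connect(const toy_node *node) {
    return node->reachable;
}

/* One cron iteration: every node without a link gets a fresh link object
 * and a connection attempt. Returns the number of attempts made. */
size_t toy_cron_tick(toy_node *nodes, size_t count) {
    assert(nodes != NULL || count == 0);
    size_t attempts = 0;
    for (size_t i = 0; i < count; i++) {
        if (nodes[i].link != NULL) continue;      /* already linked */
        toy_link *link = malloc(sizeof(*link));   /* allocation on every try */
        if (link == NULL) continue;
        attempts++;
        if (toy_connect(&nodes[i])) {
            link->connected = 1;
            nodes[i].link = link;                 /* keep the link */
        } else {
            free(link);                           /* immediate teardown: churn */
        }
    }
    return attempts;
}

/* Runs several cron iterations and returns the total number of attempts. */
size_t toy_run_ticks(toy_node *nodes, size_t count, size_t ticks) {
    size_t total = 0;
    for (size_t t = 0; t < ticks; t++) total += toy_cron_tick(nodes, count);
    return total;
}
```

In this model, 5 unreachable nodes and 10 iterations (one second at the default rate) produce 50 attempts, each paying for one allocation and one free, while a reachable node costs a single attempt.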

The Impact: High CPU Usage and Memory Allocation/Free Churn

Repeated connection attempts to failed nodes drive up CPU usage through frequent connConnect() calls and cause memory allocation/free churn through the repeated creation and destruction of clusterLink objects. This can lead to performance issues and even crashes in the system.

Profiles of the Connected Nodes

[Image: profiles of the connected nodes]

Engine CPU/Used Memory Metrics

[Image: Engine CPU/Used Memory metrics]

As can be seen from the image, the initial spike coincides with the period when a large number of primary nodes (n/2 - 1) were killed. The subsequent increase in memory appears to be due to frequent reconnect attempts, consistent with the profile above.

The Solution: Backoff Mechanism to Avoid Reconnects with Failed Nodes

To avoid the excessive connection attempts to failed nodes, a backoff mechanism can be implemented. This mechanism will prevent the system from repeatedly attempting to reconnect to nodes that are unreachable or in a failed state.

Description of the Feature

The backoff mechanism will work as follows:

When a node is detected to be in a failed state, the system will wait for a certain period of time before attempting to reconnect to the node.
The waiting period will be increased exponentially with each failed attempt, up to a maximum limit.
If the node is still in a failed state after the maximum waiting period, the system will give up attempting to reconnect to the node.
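The three rules above can be sketched as a pair of small C functions. This is only an illustration: the names backoff_delay_ms() and backoff_should_give_up(), the choice of doubling as the exponential step, and the idea of giving up on the first failure after a full maximum-length wait are assumptions, since the text does not fix the base delay, the cap, or the exact give-up condition.

```c
#include <assert.h>

/* Delay to wait after `failed_attempts` consecutive failures.
 * Starts at base_ms and doubles per failure, never exceeding max_ms.
 * (Hypothetical helper; base_ms and max_ms are caller-chosen values.) */
long backoff_delay_ms(unsigned failed_attempts, long base_ms, long max_ms) {
    assert(base_ms > 0 && max_ms >= base_ms);
    long delay = base_ms;
    for (unsigned i = 0; i < failed_attempts && delay < max_ms; i++) {
        /* Compare against max_ms / 2 so that delay * 2 cannot overflow. */
        delay = (delay > max_ms / 2) ? max_ms : delay * 2;
    }
    return delay;
}

/* Returns 1 once the previous wait had already reached max_ms, i.e. the
 * node stayed failed through a full maximum-length waiting period. */
int backoff_should_give_up(unsigned failed_attempts, long base_ms, long max_ms) {
    if (failed_attempts == 0) return 0;
    return backoff_delay_ms(failed_attempts - 1, base_ms, max_ms) >= max_ms;
}
```

With an assumed base of 100 ms and cap of 10 s, the waits run 100, 200, 400, 800, 1600, 3200, 6400 and then 10000 ms; under this sketch's rule the node is abandoned when the attempt following the 10000 ms wait also fails.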

Spacing out retries in this way avoids the high CPU usage and excessive memory allocation/free churn caused by reconnecting on every clusterCron() iteration.

Benefits of the Feature

The backoff mechanism will provide several benefits, including:

Reduced CPU usage: By preventing repeated connection attempts to failed nodes, the system will experience reduced CPU usage.
Reduced memory allocation/free churn: By preventing repeated creation and destruction of clusterLink objects, the system will experience reduced memory allocation/free churn.
Improved system stability: With less CPU and allocator pressure from reconnect attempts, the system will be less prone to the performance issues and crashes described above.
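A rough way to size the CPU and churn reduction is to count connection attempts per failed node over a fixed window. The sketch below does that for a fixed retry interval and for a doubling backoff; attempts_fixed() and attempts_with_backoff() are hypothetical helpers, and the 100 ms base and 10 s cap used in the note after the code are assumed values, not figures from the original report.

```c
#include <assert.h>

/* Attempts made in [0, window_ms) when retrying every tick_ms,
 * with the first attempt at t = 0. */
long attempts_fixed(long window_ms, long tick_ms) {
    assert(window_ms >= 0 && tick_ms > 0);
    return (window_ms + tick_ms - 1) / tick_ms;
}

/* Attempts made in [0, window_ms) when the wait after each failed
 * attempt starts at base_ms and doubles up to max_ms. Every attempt
 * is assumed to fail, which is the worst case for a dead node. */
long attempts_with_backoff(long window_ms, long base_ms, long max_ms) {
    assert(window_ms >= 0 && base_ms > 0 && max_ms >= base_ms);
    long attempts = 0;
    long delay = base_ms;
    for (long t = 0; t < window_ms; t += delay) {
        attempts++;
        if (attempts > 1) {
            /* Grow the wait for the next round, capped at max_ms. */
            delay = (delay > max_ms / 2) ? max_ms : delay * 2;
        }
    }
    return attempts;
}
```

Over a 60 second window, a 100 ms fixed interval yields 600 attempts per failed node, while a backoff from 100 ms up to a 10 s cap yields 12 in this model. Each avoided attempt also avoids one clusterLink allocation and free.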

Implementation Details

The backoff mechanism will be implemented as follows:

A new function, clusterNodeCronHandleBackoff(), will be added and called from clusterCron().
This function will take into account the failed state of the node and the number of failed attempts before deciding whether to wait or reconnect.
The waiting period will be increased exponentially with each failed attempt, up to a maximum limit.
If the node is still in a failed state after the maximum waiting period, the system will give up attempting to reconnect to the node.
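One possible shape for such a handler is sketched below. The text names clusterNodeCronHandleBackoff() but gives no signature or data layout, so everything here is an assumption made for illustration: the backoff_state struct and its fields, the backoff_decide(), backoff_on_failure() and backoff_on_success() helpers, and the 100 ms / 10 s constants.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BACKOFF_BASE_MS 100    /* assumed first wait */
#define BACKOFF_MAX_MS  10000  /* assumed cap on the wait */

typedef enum { BACKOFF_WAIT, BACKOFF_RECONNECT, BACKOFF_GIVE_UP } backoff_action;

/* Hypothetical per-node bookkeeping. */
typedef struct {
    int failed;              /* nonzero while the node is flagged PFAIL or FAIL */
    unsigned failed_attempts;/* consecutive failed reconnects */
    int64_t retry_delay_ms;  /* current wait; 0 until the first failure */
    int64_t next_retry_ms;   /* earliest time the next attempt is allowed */
    int gave_up;             /* set once the maximum wait was exhausted */
} backoff_state;

/* Decides what the cron should do for this node at time now_ms. */
backoff_action backoff_decide(const backoff_state *s, int64_t now_ms) {
    assert(s != NULL);
    if (s->gave_up) return BACKOFF_GIVE_UP;
    if (!s->failed) return BACKOFF_RECONNECT;       /* healthy: no throttling */
    if (now_ms < s->next_retry_ms) return BACKOFF_WAIT;
    return BACKOFF_RECONNECT;
}

/* Records a failed reconnect and schedules the next one. */
void backoff_on_failure(backoff_state *s, int64_t now_ms) {
    assert(s != NULL);
    s->failed_attempts++;
    if (s->retry_delay_ms >= BACKOFF_MAX_MS) {      /* max wait already used */
        s->gave_up = 1;
        return;
    }
    if (s->retry_delay_ms == 0) {
        s->retry_delay_ms = BACKOFF_BASE_MS;
    } else if (s->retry_delay_ms > BACKOFF_MAX_MS / 2) {
        s->retry_delay_ms = BACKOFF_MAX_MS;         /* clamp to the cap */
    } else {
        s->retry_delay_ms *= 2;                     /* exponential growth */
    }
    s->next_retry_ms = now_ms + s->retry_delay_ms;
}

/* Clears the backoff once a link is established again. */
void backoff_on_success(backoff_state *s) {
    assert(s != NULL);
    s->failed_attempts = 0;
    s->retry_delay_ms = 0;
    s->next_retry_ms = 0;
    s->gave_up = 0;
}
```

In this sketch the cron would call backoff_decide() for each node whose link is NULL and only create a link on BACKOFF_RECONNECT. Keeping the decision separate from the bookkeeping makes the handler easy to test without a real connection.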

Frequently Asked Questions

Q: What is the problem with excessive connection attempts to failed nodes in clusterCron()?

A: The problem is that the clusterNodeCronHandleReconnect() function attempts to reconnect to nodes whose link == NULL on every iteration (10 times per second by default). This leads to repeated connection attempts to nodes that are unreachable or in a failed (PFAIL or FAIL) state. When there are multiple failing nodes, this retry logic results in high CPU usage due to frequent connConnect() calls and excessive memory allocation/free churn from repeatedly creating and destroying clusterLink objects.

Q: What are the consequences of excessive connection attempts to failed nodes?

A: The repeated connection attempts to failed nodes result in high CPU usage and excessive memory allocation/free churn. This can lead to performance issues and even crashes in the system.

Q: What is the impact on Engine CPU/Used Memory metrics?

A: The high CPU usage is due to the frequent connConnect() calls, while the excessive memory allocation/free churn is due to the repeated creation and destruction of clusterLink objects. The Engine CPU/Used Memory metrics show an initial spike during the period when a large number of primary nodes (n/2 - 1) were killed. The subsequent increase in memory appears to be due to frequent reconnect attempts, consistent with the profile of the connected nodes.

Q: How does the backoff mechanism solve the problem?

A: The backoff mechanism will prevent the system from repeatedly attempting to reconnect to nodes that are unreachable or in a failed state. When a node is detected to be in a failed state, the system will wait for a certain period of time before attempting to reconnect to the node. The waiting period will be increased exponentially with each failed attempt, up to a maximum limit. If the node is still in a failed state after the maximum waiting period, the system will give up attempting to reconnect to the node.

Q: What are the benefits of the backoff mechanism?

A: The backoff mechanism will provide several benefits, including:

Reduced CPU usage: By preventing repeated connection attempts to failed nodes, the system will experience reduced CPU usage.
Reduced memory allocation/free churn: By preventing repeated creation and destruction of clusterLink objects, the system will experience reduced memory allocation/free churn.
Improved system stability: By preventing repeated connection attempts to failed nodes, the system will experience improved stability.

Q: How is the backoff mechanism implemented?

A: The backoff mechanism will be implemented as follows:

A new function, clusterNodeCronHandleBackoff(), will be added and called from clusterCron().
This function will take into account the failed state of the node and the number of failed attempts before deciding whether to wait or reconnect.
The waiting period will be increased exponentially with each failed attempt, up to a maximum limit.
If the node is still in a failed state after the maximum waiting period, the system will give up attempting to reconnect to the node.

Q: What are the key takeaways from this article?

A: The key takeaways from this article are:

Excessive connection attempts to failed nodes in clusterCron() can cause CPU overhead and performance issues.
The backoff mechanism can solve this problem by preventing repeated connection attempts to failed nodes.
The backoff mechanism will provide several benefits, including reduced CPU usage, reduced memory allocation/free churn, and improved system stability.

Conclusion

In conclusion, a backoff mechanism addresses the problem of excessive connection attempts to failed nodes in clusterCron(). By spacing out reconnect attempts, the system avoids frequent connConnect() calls and clusterLink churn, which reduces CPU usage and memory allocation/free churn and improves stability. The mechanism, its benefits, and its implementation details are described in the sections above.