Coredump Right After Seed Node Decommission
Issue Description
During the disrupt_nodetool_seed_decommission nemesis, right after nodetool decommission had finished, a coredump occurred on another node. The decommissioned node's log shows that it completed the decommissioning process, while the coredumped node's log shows a series of errors and crashes.
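For context, a minimal sketch of the steps this nemesis automates (node names are placeholders, not the exact nodes from this run):
$ ssh <seed-node> nodetool decommission                                  # decommission the seed node
$ ssh <other-node> 'sudo coredumpctl list scylla'                        # check the remaining nodes for new coredumps
$ ssh <other-node> 'journalctl -u scylla-server --since "10 min ago"'    # inspect the crashed node's log around the event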
Impact
A node coredumped, which can lead to data loss and cluster instability.
How Frequently Does it Reproduce?
Argus reports 10 similar coredumps, but it is unclear whether they stem from the same issue. The referenced issue https://github.com/scylladb/scylladb/issues/23577 may be related.
Installation Details
The cluster size is 6 nodes (i4i.4xlarge). The Scylla nodes used in this run are:
- longevity-100gb-4h-1876-cql-db-node-0cd19b7e-8 (52.209.197.170 | 10.4.10.106) (shards: 14)
- longevity-100gb-4h-1876-cql-db-node-0cd19b7e-7 (52.50.208.233 | 10.4.8.126) (shards: 14)
- longevity-100gb-4h-1876-cql-db-node-0cd19b7e-6 (18.201.95.7 | 10.4.11.4) (shards: 14)
- longevity-100gb-4h-1876-cql-db-node-0cd19b7e-5 (54.216.181.73 | 10.4.8.197) (shards: 14)
- longevity-100gb-4h-1876-cql-db-node-0cd19b7e-4 (54.74.80.41 | 10.4.11.238) (shards: 14)
- longevity-100gb-4h-1876-cql-db-node-0cd19b7e-3 (3.255.179.242 | 10.4.9.81) (shards: 14)
- longevity-100gb-4h-1876-cql-db-node-0cd19b7e-2 (34.244.66.138 | 10.4.9.26) (shards: 14)
- longevity-100gb-4h-1876-cql-db-node-0cd19b7e-1 (54.154.33.115 | 10.4.10.126) (shards: 14)
The OS/image used is ami-0c09cf0a4d621b403 (aws: undefined_region).
Logs and Commands
The commands for restoring the monitoring stack and retrieving the stored logs are:
- Restore Monitor Stack command:
$ hydra investigate show-monitor 0cd19b7e-fb77-4661-89b6-b17d6c812db9
- Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs 0cd19b7e-fb77-4661-89b6-b17d6c812db9
The stored logs for this run are:
longevity-100gb-4h-1876-cql-db-node-0cd19b7e-3
- https://cloudius-jenkins-test.s3.amazonaws.com/0cd19b7e-fb77-4661-89b6-b17d6c812db9/20250424_090241/longevity-100gb-4h-1876-cql-db-node-0cd19b7e-3-0cd19b7e.tar.zst
longevity-100gb-4h-1876-cql-db-node-0cd19b7e-1
- https://cloudius-jenkins-test.s3.amazonaws.com/0cd19b7e-fb77-4661-89b6-b17d6c812db9/20250424_090241/longevity-100gb-4h-1876-cql-db-node-0cd19b7e-1-0cd19b7e.tar.zst
core.scylla-longevity-100gb-4h-1876-cql-db-node-0cd19b7e-6-2025-04-24_11-29-59.gz
- https://storage.cloud.google.com/upload.scylladb.com/core.scylla.106.9bd21c9a2b9f478e9eac139e53f03455.8727.1745490756000000./core.scylla.106.9bd21c9a2b9f478e9eac139e53f03455.8727.1745490756000000.zst
db-cluster-0cd19b7e.tar.zst
- https://cloudius-jenkins-test.s3.amazonaws.com/0cd19b7e-fb77-4661-89b6-b17d6c812db9/20250424_132921/db-cluster-0cd19b7e.tar.zst
sct-runner-events-0cd19b7e.tar.zst
- https://cloudius-jenkins-test.s3.amazonaws.com/0cd19b7e-fb77-4661-89b6-b17d6c812db9/20250424_132921/sct-runner-events-0cd19b7e.tar.zst
sct-0cd19b7e.log.tar.zst
- https://cloudius-jenkins-test.s3.amazonaws.com/0cd19b7e-fb77-4661-89b6-b17d6c812db9/20250424_132921/sct-0cd19b7e.log.tar.zst
loader-set-0cd19b7e.tar.zst
- https://cloudius-jenkins-test.s3.amazonaws.com/0cd19b7e-fb77-4661-89b6-b17d6c812db9/20250424_132921/loader-set-0cd19b7e.tar.zst
monitor-set-0cd19b7e.tar.zst
- https://cloudius-jenkins-test.s3.amazonaws.com/0cd19b7e-fb77-4661-89b6-b17d6c812db9/20250424_132921/monitor-set-0cd19b7e.tar.zst
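The archives above are zstd-compressed tarballs; one possible way to fetch and unpack one locally (the destination directory is arbitrary):
$ curl -O https://cloudius-jenkins-test.s3.amazonaws.com/0cd19b7e-fb77-4661-89b6-b17d6c812db9/20250424_132921/db-cluster-0cd19b7e.tar.zst
$ mkdir -p db-cluster-0cd19b7e && tar --zstd -xf db-cluster-0cd19b7e.tar.zst -C db-cluster-0cd19b7e   # requires a tar build with zstd support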
Q&A: Coredump Right After Seed Node Decommission
Q: What is a coredump?
A: A coredump is a dump of a program's memory taken at the moment of a crash or abnormal termination. It captures the program's state, including memory, registers, and the stack, and can be used to diagnose and debug the crash.
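As an illustration (the binary path and core file name are placeholders, not the exact artifacts from this run), a coredump is typically inspected by loading it into gdb together with the matching scylla binary and its debug symbols:
$ zstd -d core.scylla.<...>.zst                               # the uploaded core is zstd-compressed
$ gdb <path-to-matching-scylla-binary> core.scylla.<...>      # binary and debuginfo must match the crashed build
(gdb) thread apply all bt                                     # full backtrace of all threads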
Q: What is a seed node decommission?
A: A seed node decommission is the process of removing a node that acts as a seed from the cluster. A seed node is a contact point that new nodes use to bootstrap into the cluster. When a seed node is decommissioned, it is removed from the cluster and its responsibilities are taken over by other nodes, as sketched below.
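For example, assuming a default Scylla installation, the configured seeds and the post-decommission ring state can be checked with:
$ grep -A4 seed_provider /etc/scylla/scylla.yaml   # seed addresses are listed under the "seeds" parameter
$ nodetool status                                  # after decommission, the removed node should no longer appear in the ring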
Q: What is the impact of a coredump right after seed node decommission?
A: A coredump right after a seed node decommission can be significant: it can lead to data loss, cluster instability, or even complete cluster failure. In this case, the coredump occurred on a node other than the one being decommissioned, which can cause further issues and instability in the cluster.