ETCD node failover

March 15, 2024. Source: AndyZhang

Unreachable member

Assume a cluster of etcd containers has been created successfully.
Check the cluster status with the following command.

# etcdctl --endpoint <endpoint> cluster-health

If the cluster is running normally, the output looks like:

member xxx is healthy: got healthy result from https://10.23.2.109:3379
member xxx is healthy: got healthy result from https://10.23.2.108:3379
member xxx is healthy: got healthy result from https://10.23.2.110:3379
cluster is healthy

If one member has failed, the output may look like:

failed to check the health of member xxx on https://10.23.2.109:3379: Get https://10.23.2.109:3379/health: dial tcp 10.23.2.109:3379: connect: connection refused
member xxx is unreachable: [https://10.23.2.109:3379] are all unreachable
member xxx is healthy: got healthy result from https://10.23.2.108:3379
member xxx is healthy: got healthy result from https://10.23.2.110:3379
cluster is healthy
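
To pick out the failed member from that output, the lines can be filtered with a small helper. This is a minimal sketch that assumes the output format shown above; the function name unreachable_members is made up for illustration.

```shell
# Print the IDs of members that cluster-health reports as unreachable.
# Reads the cluster-health output on stdin.
# Assumes lines of the form: "member <id> is unreachable: [...]"
unreachable_members() {
    awk '$1 == "member" && $3 == "is" && $4 == "unreachable:" { print $2 }'
}

# Usage (the endpoint value is a placeholder):
#   etcdctl --endpoint <endpoint> cluster-health | unreachable_members
```

Each printed ID is the value that the member remove command in Case 1 expects.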

The cause is usually one of the following four cases.

Case 1: The whole environment of an etcd container was destroyed.

Solution

Remove the destroyed member with etcdctl.

# etcdctl member remove xxx

Here xxx is the member ID of the unreachable member.

Create a new etcd container, adding the following environment variables to env in its config file.

"ETCD_INITIAL_CLUSTER_STATE": "existing"
"ETCD_INITIAL_CLUSTER": <The cluster peer urls with the new etcd container>

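For example, "hostname2=https://10.23.2.108:3380,hostname3=https://10.23.2.110:3380" lists the peer URLs that remain in the cluster after the destroyed member is removed. The new container's own <name>=<peerURL> entry must be appended to this list, because ETCD_INITIAL_CLUSTER has to include the member that is joining.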
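
The ETCD_INITIAL_CLUSTER value is a comma-separated list of name=peerURL pairs, so it can be assembled mechanically. The following is a minimal sketch; the helper name initial_cluster, the member name hostname4 and the address 10.23.2.111 are hypothetical and only stand in for the new container.

```shell
# Join "name=peerURL" entries into one comma-separated string.
# The subshell keeps the IFS change from leaking into the caller.
initial_cluster() {
    (IFS=,; printf '%s\n' "$*")
}

# Usage: the two surviving members plus the (hypothetical) new member.
#   initial_cluster \
#       "hostname2=https://10.23.2.108:3380" \
#       "hostname3=https://10.23.2.110:3380" \
#       "hostname4=https://10.23.2.111:3380"
```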

Add the new container to the existing cluster.

# etcdctl --endpoint <endpoint> member add <name> <peerURL>

<name> is the hostname set in the new container's config file.

<peerURL> is one of the URLs listed in ETCD_INITIAL_ADVERTISE_PEER_URLS in that config file.
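
The etcd runtime reconfiguration guide shows member add printing the ETCD_NAME, ETCD_INITIAL_CLUSTER and ETCD_INITIAL_CLUSTER_STATE values for the new member after a confirmation line. Assuming your etcdctl version prints the same format, a small filter can keep just those lines so they can be compared with the container's config file. The helper name member_add_env is made up for illustration.

```shell
# Keep only the ETCD_* settings from the output of "etcdctl member add".
# Reads that output on stdin; assumes each setting starts its own line.
member_add_env() {
    grep '^ETCD_'
}

# Usage (endpoint, name and peer URL are placeholders):
#   etcdctl --endpoint <endpoint> member add <name> <peerURL> | member_add_env
```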

Case 2: The etcd container doesn’t exist.

Solution

Add "ETCD_INITIAL_CLUSTER_STATE": "existing" to the container creation config file.
Create the container with the new config file, keeping all other settings the same as before.

Case 3: The etcd container was stopped.

Solution

Start the container.

# docker start <container>

Case 4: The etcd service was stopped in its container.

Solution

Restart the container so that the etcd service inside it starts again.

# docker restart <container>
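
Cases 2 to 4 differ only in the state of the container, so the choice between them can be sketched as a lookup on the status string that docker inspect reports. This is an illustrative helper, not part of the original procedure: the name recovery_action and the action labels are assumptions, and it is only meant for a member that cluster-health already reports as unreachable.

```shell
# Map a container status to the recovery step from Cases 2 to 4.
# $1 is the status string, or empty if the container does not exist.
recovery_action() {
    case "$1" in
        "")             echo "recreate" ;;  # Case 2: container does not exist
        exited|created) echo "start" ;;     # Case 3: container is stopped
        running)        echo "restart" ;;   # Case 4: container up, etcd down
        *)              echo "inspect" ;;   # paused, dead, ...: check by hand
    esac
}

# Usage (<container> is a placeholder):
#   status=$(docker inspect -f '{{.State.Status}}' <container> 2>/dev/null)
#   recovery_action "$status"
```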

Unhealthy member

If a member is unhealthy, remove its container together with its metadata, then create a new one as described in Case 2 above.

    Original author: AndyZhang
    Original article: https://segmentfault.com/a/1190000018070081